MoE PnP Architecture Design

1. Introduction

My current research on black-hole inverse problems centers on the idea of combining many priors (pretrained models). I have been designing a way to integrate these prior models with an MoE (Mixture of Experts), though I am not sure it will give exactly the result we want. The V3 design fine-tunes the MoE router layer, and its effectiveness can only be judged once many prior models are available.

The baseline here is the open-source GitHub repo of the Inverse Bench paper, which I build on and modify.

2. The Original DPS Implementation

2.1 Code Implementation

Pseudocode of the DPS algorithm:

[Figure: dps_pnp_algo.png (DPS algorithm pseudocode)]

dps.py

import torch
from tqdm import tqdm
from .base import Algo
from utils.scheduler import Scheduler
import numpy as np


class DPS(Algo):
    '''
    DPS algorithm implemented in the EDM framework.
    '''
    def __init__(self,
                 net,
                 forward_op,
                 diffusion_scheduler_config,
                 guidance_scale,
                 sde=True):
        super(DPS, self).__init__(net, forward_op)
        self.scale = guidance_scale
        self.diffusion_scheduler_config = diffusion_scheduler_config
        self.scheduler = Scheduler(**diffusion_scheduler_config)
        self.sde = sde

    def inference(self, observation, num_samples=1, **kwargs):
        device = self.forward_op.device
        if num_samples > 1:
            observation = observation.repeat(num_samples, 1, 1, 1)
        # Initialize x_N
        x_initial = torch.randn(num_samples, self.net.img_channels, self.net.img_resolution, self.net.img_resolution, device=device) * self.scheduler.sigma_max
        x_next = x_initial
        x_next.requires_grad = True
        pbar = tqdm(range(self.scheduler.num_steps))
        for i in pbar:
            x_cur = x_next.detach().requires_grad_(True)
            sigma, factor, scaling_factor = self.scheduler.sigma_steps[i], self.scheduler.factor_steps[i], self.scheduler.scaling_factor[i]
            # Network prediction s_theta
            denoised = self.net(x_cur / self.scheduler.scaling_steps[i], torch.as_tensor(sigma).to(x_cur.device))
            gradient, loss_scale = self.forward_op.gradient(denoised, observation, return_loss=True)
            ll_grad = torch.autograd.grad(denoised, x_cur, gradient)[0]
            ll_grad = ll_grad * 0.5 / torch.sqrt(loss_scale)
            # Score from the denoised estimate x̂0
            score = (denoised - x_cur / self.scheduler.scaling_steps[i]) / sigma ** 2 / self.scheduler.scaling_steps[i]
            pbar.set_description(f'Iteration {i + 1}/{self.scheduler.num_steps}. Data fitting loss: {torch.sqrt(loss_scale)}')
            if self.sde:
                # Sample noise z
                epsilon = torch.randn_like(x_cur)
                # Sample x'_{i-1}
                x_next = x_cur * scaling_factor + factor * score + np.sqrt(factor) * epsilon
            else:
                # Sample x'_{i-1}
                x_next = x_cur * scaling_factor + factor * score * 0.5
            # Gradient step on the data-fitting term
            x_next -= ll_grad * self.scale
        # Return x̂0
        return x_next
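
To cross-check the loop against the pseudocode, this is my reading of the per-step update it implements (with $s_i$ = scaling_steps[i], $c_i$ = scaling_factor[i], $f_i$ = factor_steps[i], $\eta$ = guidance_scale, and $\mathcal{L}$ the data-fitting loss returned by forward_op.gradient):

$$\hat{x}_0 = D_\theta\!\left(\frac{x_i}{s_i}, \sigma_i\right), \qquad \nabla_{x_i} \log p_{\sigma_i}(x_i) \approx \frac{\hat{x}_0 - x_i / s_i}{\sigma_i^2 \, s_i},$$

$$x_{i-1} = c_i\, x_i + f_i\, \nabla_{x_i} \log p_{\sigma_i}(x_i) + \sqrt{f_i}\,\epsilon \;-\; \eta\, \nabla_{x_i} \sqrt{\mathcal{L}\big(\hat{x}_0(x_i),\, y\big)}, \qquad \epsilon \sim \mathcal{N}(0, I).$$

The guidance term matches the code because $\nabla \sqrt{\mathcal{L}} = \nabla \mathcal{L} / (2\sqrt{\mathcal{L}})$, which is exactly ll_grad * 0.5 / torch.sqrt(loss_scale) before scaling by $\eta$; the deterministic branch (sde=False) drops the noise and takes half the score step instead.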

2.2 Experimental Results

Validation with priors trained separately on the CIFAR-10 and TCIR (typhoon) datasets:

cifar10: 'psnr': 9.191744312139141

Terminal window
Final metric results: {'cp_chi2': 73.6760071182251, 'cp_chi2_std': 135.51811159539577, 'camp_chi2': 799.1351531076431, 'camp_chi2_std': 1740.545352802215, 'psnr': 9.191744312139141, 'psnr_std': 1.4698939952499275, 'blur_psnr (f=10)': 9.191744508743286, 'blur_psnr (f=10)_std': 1.4698940834610004, 'blur_psnr (f=15)': 10.603785195350646, 'blur_psnr (f=15)_std': 1.623055096434544, 'blur_psnr (f=20)': 11.29933590888977, 'blur_psnr (f=20)_std': 1.8209445867400313}...

tcir: 'psnr': 8.946716022116307

Terminal window
Final metric results: {'cp_chi2': 52.93387850999832, 'cp_chi2_std': 120.19381272641012, 'camp_chi2': 302.9758258509636, 'camp_chi2_std': 1354.8764523472619, 'psnr': 8.946716022116307, 'psnr_std': 1.4229784549637379, 'blur_psnr (f=10)': 8.94671626329422, 'blur_psnr (f=10)_std': 1.4229785366944958, 'blur_psnr (f=15)': 10.188623633384704, 'blur_psnr (f=15)_std': 1.5537114323159416, 'blur_psnr (f=20)': 10.882783169746398, 'blur_psnr (f=20)_std': 1.6727785936316104}...

3. V1: Adding the MoE Architecture

3.1 Architecture Flow Diagram

[Figure: V1.png (V1 architecture flow diagram)]

3.2 Code Implementation

MoE added, with a router (no extra Attention or LayerNorm layers):

# algo/dps.py (final version with logging built into MoEPrior)
import torch
import torch.nn as nn
import torch.nn.functional as F
from tqdm import tqdm
from .base import Algo
from utils.scheduler import Scheduler
import numpy as np

# ==============================================================================
# 1. MoE core components (change: logging logic moved inside)
# ==============================================================================
class ExpertPrior(nn.Module):
    def __init__(self, pretrained_net):
        super().__init__()
        self.net = pretrained_net
        # Freeze the pretrained prior: experts are never updated.
        for param in self.net.parameters():
            param.requires_grad = False
        self.net.eval()

    def forward(self, x_t_scaled, sigma):
        return self.net(x_t_scaled, sigma)


class Router(nn.Module):
    def __init__(self, input_channels, num_experts, hidden_dim=256):
        super().__init__()
        self.feature_extractor = nn.Sequential(
            nn.Conv2d(input_channels, 32, kernel_size=5, stride=2, padding=2),
            nn.GELU(),
            nn.Conv2d(32, 64, kernel_size=3, stride=2, padding=1),
            nn.GELU(),
            nn.AdaptiveAvgPool2d((1, 1)),
            nn.Flatten(),
        )
        self.gate_mlp = nn.Sequential(
            nn.Linear(64 + 1, hidden_dim),
            nn.GELU(),
            nn.Linear(hidden_dim, num_experts)
        )

    def forward(self, x_t, sigma):
        img_features = self.feature_extractor(x_t)
        if sigma.dim() == 0:
            sigma_tensor = sigma.repeat(x_t.size(0))
        else:
            sigma_tensor = sigma
        sigma_features = (sigma_tensor.float() / 80.0).view(-1, 1)
        combined_features = torch.cat([img_features, sigma_features], dim=-1)
        logits = self.gate_mlp(combined_features)
        return logits


class MoEPrior(nn.Module):
    def __init__(self, expert_nets: list, top_k: int = 2, log_every_n_steps: int = 100):
        super().__init__()
        assert len(expert_nets) > 0, "the expert list must not be empty"
        self.num_experts = len(expert_nets)
        self.img_channels = expert_nets[0].img_channels
        self.img_resolution = expert_nets[0].img_resolution
        self.experts = nn.ModuleList([ExpertPrior(net) for net in expert_nets])
        self.router = Router(input_channels=self.img_channels, num_experts=self.num_experts)
        self.top_k = min(top_k, self.num_experts)
        # --- Change 1: internal step counter and logging frequency ---
        self.internal_step = 0
        self.log_every_n_steps = log_every_n_steps

    def forward(self, x_t_scaled, sigma):
        # --- Change 2: logging implemented inside forward ---
        # Increment the counter on every call.
        self.internal_step += 1
        router_logits = self.router(x_t_scaled, sigma)
        top_k_logits, top_k_indices = torch.topk(router_logits, self.top_k, dim=-1)
        gating_weights = F.softmax(top_k_logits, dim=-1)
        # Print the log at the configured frequency.
        if self.internal_step % self.log_every_n_steps == 0:
            # The leading newline keeps the output tidy alongside tqdm.
            print(f"\n[MoE Log @ Internal Step {self.internal_step}]: Activated experts -> {top_k_indices.tolist()}")
        final_denoised = torch.zeros_like(x_t_scaled)
        for i in range(x_t_scaled.size(0)):
            sample_final_output = torch.zeros_like(x_t_scaled[i])
            for k_idx in range(self.top_k):
                expert_index = top_k_indices[i, k_idx]
                weight = gating_weights[i, k_idx]
                chosen_expert = self.experts[expert_index]
                current_sigma = sigma[i] if sigma.dim() > 0 else sigma
                expert_output = chosen_expert(x_t_scaled[i].unsqueeze(0), current_sigma.unsqueeze(0) if current_sigma.dim() == 0 else current_sigma)
                sample_final_output += weight * expert_output.squeeze(0)
            final_denoised[i] = sample_final_output
        # --- Change 3: keep the original return signature ---
        return final_denoised

# ==============================================================================
# 2. The original DPS class (unchanged)
# ==============================================================================
class DPS(Algo):
    # ... code unchanged ...
    def __init__(self, net, forward_op, diffusion_scheduler_config, guidance_scale, sde=True):
        super(DPS, self).__init__(net, forward_op)
        self.scale = guidance_scale
        self.diffusion_scheduler_config = diffusion_scheduler_config
        self.scheduler = Scheduler(**diffusion_scheduler_config)
        self.sde = sde

    def inference(self, observation, num_samples=1, **kwargs):
        device = self.forward_op.device
        if num_samples > 1:
            observation = observation.repeat(num_samples, 1, 1, 1)
        x_initial = torch.randn(num_samples, self.net.img_channels, self.net.img_resolution, self.net.img_resolution, device=device) * self.scheduler.sigma_max
        x_next = x_initial
        x_next.requires_grad = True
        pbar = tqdm(range(self.scheduler.num_steps))
        for i in pbar:
            x_cur = x_next.detach().requires_grad_(True)
            sigma, factor, scaling_factor = self.scheduler.sigma_steps[i], self.scheduler.factor_steps[i], self.scheduler.scaling_factor[i]
            denoised = self.net(x_cur / self.scheduler.scaling_steps[i], torch.as_tensor(sigma).to(x_cur.device))
            gradient, loss_scale = self.forward_op.gradient(denoised, observation, return_loss=True)
            ll_grad = torch.autograd.grad(denoised, x_cur, gradient)[0]
            ll_grad = ll_grad * 0.5 / torch.sqrt(loss_scale)
            score = (denoised - x_cur / self.scheduler.scaling_steps[i]) / sigma ** 2 / self.scheduler.scaling_steps[i]
            pbar.set_description(f'Iteration {i + 1}/{self.scheduler.num_steps}. Data fitting loss: {torch.sqrt(loss_scale)}')
            if self.sde:
                epsilon = torch.randn_like(x_cur)
                x_next = x_cur * scaling_factor + factor * score + np.sqrt(factor) * epsilon
            else:
                x_next = x_cur * scaling_factor + factor * score * 0.5
            x_next -= ll_grad * self.scale
        return x_next

# ==============================================================================
# 3. The new DPS class adapted for MoE (kept lean)
# ==============================================================================
class DPS_MoE(Algo):
    def __init__(self,
                 expert_nets,
                 forward_op,
                 diffusion_scheduler_config,
                 guidance_scale,
                 sde=True,
                 moe_top_k=2,
                 log_every_n_steps=100):  # the logging frequency can still come from the config
        super(DPS_MoE, self).__init__(expert_nets[0], forward_op)
        device = self.forward_op.device
        # Pass log_every_n_steps through to MoEPrior.
        self.moe_prior_net = MoEPrior(
            expert_nets=expert_nets,
            top_k=moe_top_k,
            log_every_n_steps=log_every_n_steps
        ).to(device)
        self.scale = guidance_scale
        self.diffusion_scheduler_config = diffusion_scheduler_config
        self.scheduler = Scheduler(**diffusion_scheduler_config)
        self.sde = sde

    def inference(self, observation, num_samples=1, **kwargs):
        device = self.forward_op.device
        if num_samples > 1:
            observation = observation.repeat(num_samples, 1, 1, 1)
        x_initial = torch.randn(num_samples, self.net.img_channels, self.net.img_resolution, self.net.img_resolution, device=device) * self.scheduler.sigma_max
        x_next = x_initial
        x_next.requires_grad = True
        pbar = tqdm(range(self.scheduler.num_steps))
        for i in pbar:
            x_cur = x_next.detach().requires_grad_(True)
            sigma, factor, scaling_factor = self.scheduler.sigma_steps[i], self.scheduler.factor_steps[i], self.scheduler.scaling_factor[i]
            # --- Change 4: the call site stays as before; logging is handled internally ---
            denoised = self.moe_prior_net(x_cur / self.scheduler.scaling_steps[i], torch.as_tensor(sigma).to(x_cur.device))
            gradient, loss_scale = self.forward_op.gradient(denoised, observation, return_loss=True)
            ll_grad = torch.autograd.grad(denoised, x_cur, gradient)[0]
            ll_grad = ll_grad * 0.5 / torch.sqrt(loss_scale)
            score = (denoised - x_cur / self.scheduler.scaling_steps[i]) / sigma ** 2 / self.scheduler.scaling_steps[i]
            pbar.set_description(f'Iteration {i + 1}/{self.scheduler.num_steps}. Data fitting loss: {torch.sqrt(loss_scale)}')
            if self.sde:
                epsilon = torch.randn_like(x_cur)
                x_next = x_cur * scaling_factor + factor * score + np.sqrt(factor) * epsilon
            else:
                x_next = x_cur * scaling_factor + factor * score * 0.5
            x_next -= ll_grad * self.scale
        return x_next

3.3 Experimental Results

The MoE currently uses two priors (the CIFAR-10 and TCIR checkpoints) with Top-K = 2, so both prior models are always active.

Result: 'psnr': 9.097164859027906

Terminal window
[2025-07-08 17:53:04,992][utils.helper][INFO] - Final metric results: {'cp_chi2': 65.43580444931985, 'cp_chi2_std': 139.26828480535823, 'camp_chi2': 472.58429634332657, 'camp_chi2_std': 1316.0285415933506, 'psnr': 9.097164859027906, 'psnr_std': 1.3572267099574475, 'blur_psnr (f=10)': 9.097165064811707, 'blur_psnr (f=10)_std': 1.3572268146419655, 'blur_psnr (f=15)': 10.42996124267578, 'blur_psnr (f=15)_std': 1.5274257912230074, 'blur_psnr (f=20)': 11.110666754245758, 'blur_psnr (f=20)_std': 1.7034609014416104}...

This result clearly falls short of expectations, presumably because only a bare router was designed, without any extra layers on top of it.

4. V2: Adding a Transformer Encoder to the MoE Router

4.1 Architecture Flow Diagram

[Figure: V2.png (V2 architecture flow diagram)]

4.2 Code Implementation

The router gains attention, LayerNorm, and related components:

# algo/dps.py (final version with Attention and LayerNorm integrated into the Router)
import torch
import torch.nn as nn
import torch.nn.functional as F
from tqdm import tqdm
from .base import Algo
from utils.scheduler import Scheduler
import numpy as np

# ==============================================================================
# 1. MoE core components (ExpertPrior unchanged, Router modified)
# ==============================================================================
class ExpertPrior(nn.Module):
    """
    A thin wrapper that treats a pretrained prior model (e.g. a UNet) as a single expert.
    """
    def __init__(self, pretrained_net):
        super().__init__()
        self.net = pretrained_net
        for param in self.net.parameters():
            param.requires_grad = False
        self.net.eval()

    def forward(self, x_t_scaled, sigma):
        return self.net(x_t_scaled, sigma)

# --- Change begins: enhanced Router ---
class Router(nn.Module):
    """
    Enhanced router: a small Transformer Encoder strengthens the gating decision.
    """
    def __init__(self, input_channels, num_experts,
                 feature_dim=128, num_attn_heads=4, num_attn_layers=1):
        super().__init__()
        # Convolutions extract local features and downsample.
        # With 64x64 inputs, two stride-2 convolutions give a 16x16 feature map.
        self.feature_extractor = nn.Sequential(
            nn.Conv2d(input_channels, feature_dim // 2, kernel_size=3, stride=2, padding=1),  # 64x64 -> 32x32
            nn.GELU(),
            nn.Conv2d(feature_dim // 2, feature_dim, kernel_size=3, stride=2, padding=1),  # 32x32 -> 16x16
        )
        # 16x16 = 256 spatial positions, i.e. the sequence length.
        num_patches = 16 * 16
        # Learnable positional encoding for the spatial feature sequence.
        self.pos_embedding = nn.Parameter(torch.randn(1, num_patches, feature_dim))
        # A standard Transformer Encoder layer; it already contains
        # multi-head attention and LayerNorm internally.
        encoder_layer = nn.TransformerEncoderLayer(
            d_model=feature_dim,
            nhead=num_attn_heads,
            dim_feedforward=feature_dim * 4,  # hidden width of the FFN
            dropout=0.1,
            activation='gelu',
            batch_first=True,  # inputs are (Batch, Sequence, Feature)
            norm_first=True    # Pre-LN structure, which is more stable
        )
        self.transformer_encoder = nn.TransformerEncoder(encoder_layer, num_layers=num_attn_layers)
        # Final gating MLP.
        # Input dimension = Transformer feature dim + 1 (for the noise level sigma).
        self.gate_mlp = nn.Sequential(
            nn.LayerNorm(feature_dim + 1),  # an extra LayerNorm before the MLP
            nn.Linear(feature_dim + 1, feature_dim),
            nn.GELU(),
            nn.Linear(feature_dim, num_experts)
        )

    def forward(self, x_t, sigma):
        # 1. Extract spatial features: (B, C, 64, 64) -> (B, D, 16, 16)
        features = self.feature_extractor(x_t)
        # 2. Build the sequence for the Transformer: (B, D, 16, 16) -> (B, 256, D)
        b, d, h, w = features.shape
        features_seq = features.flatten(2).permute(0, 2, 1)
        # 3. Add positional encoding
        features_seq += self.pos_embedding
        # 4. Run the Transformer Encoder (attention + LayerNorm inside)
        attended_features = self.transformer_encoder(features_seq)
        # 5. Mean-pool to a global representation
        global_feature = attended_features.mean(dim=1)  # (B, D)
        # 6. Concatenate the noise level and make the final decision
        if sigma.dim() == 0:
            sigma_tensor = sigma.repeat(b)
        else:
            sigma_tensor = sigma
        sigma_features = (sigma_tensor.float() / 80.0).view(-1, 1)  # simple normalization
        combined_features = torch.cat([global_feature, sigma_features], dim=-1)
        logits = self.gate_mlp(combined_features)
        return logits
# --- Change ends ---

class MoEPrior(nn.Module):
    def __init__(self, expert_nets: list, top_k: int = 2, log_every_n_steps: int = 100):
        super().__init__()
        assert len(expert_nets) > 0, "the expert list must not be empty"
        self.num_experts = len(expert_nets)
        self.img_channels = expert_nets[0].img_channels
        self.img_resolution = expert_nets[0].img_resolution
        self.experts = nn.ModuleList([ExpertPrior(net) for net in expert_nets])
        # --- Change: instantiate the new Router ---
        self.router = Router(
            input_channels=self.img_channels,
            num_experts=self.num_experts,
            # The router hyperparameters can be tuned here or via the config file.
            feature_dim=128,
            num_attn_heads=4,
            num_attn_layers=2
        )
        self.top_k = min(top_k, self.num_experts)
        self.internal_step = 0
        self.log_every_n_steps = log_every_n_steps

    def forward(self, x_t_scaled, sigma):
        self.internal_step += 1
        router_logits = self.router(x_t_scaled, sigma)
        top_k_logits, top_k_indices = torch.topk(router_logits, self.top_k, dim=-1)
        gating_weights = F.softmax(top_k_logits, dim=-1)
        if self.internal_step % self.log_every_n_steps == 0:
            print(f"\n[MoE Log @ Internal Step {self.internal_step}]: Activated experts -> {top_k_indices.tolist()}")
        final_denoised = torch.zeros_like(x_t_scaled)
        for i in range(x_t_scaled.size(0)):
            sample_final_output = torch.zeros_like(x_t_scaled[i])
            for k_idx in range(self.top_k):
                expert_index = top_k_indices[i, k_idx]
                weight = gating_weights[i, k_idx]
                chosen_expert = self.experts[expert_index]
                current_sigma = sigma[i] if sigma.dim() > 0 else sigma
                expert_output = chosen_expert(x_t_scaled[i].unsqueeze(0), current_sigma.unsqueeze(0) if current_sigma.dim() == 0 else current_sigma)
                sample_final_output += weight * expert_output.squeeze(0)
            final_denoised[i] = sample_final_output
        return final_denoised

# ==============================================================================
# The remaining classes (DPS, DPS_MoE) are unchanged from V1.
# They are reproduced below verbatim; no modification is needed.
# ==============================================================================
class DPS(Algo):
    def __init__(self, net, forward_op, diffusion_scheduler_config, guidance_scale, sde=True):
        super(DPS, self).__init__(net, forward_op)
        self.scale = guidance_scale
        self.diffusion_scheduler_config = diffusion_scheduler_config
        self.scheduler = Scheduler(**diffusion_scheduler_config)
        self.sde = sde

    def inference(self, observation, num_samples=1, **kwargs):
        device = self.forward_op.device
        if num_samples > 1:
            observation = observation.repeat(num_samples, 1, 1, 1)
        x_initial = torch.randn(num_samples, self.net.img_channels, self.net.img_resolution, self.net.img_resolution, device=device) * self.scheduler.sigma_max
        x_next = x_initial
        x_next.requires_grad = True
        pbar = tqdm(range(self.scheduler.num_steps))
        for i in pbar:
            x_cur = x_next.detach().requires_grad_(True)
            sigma, factor, scaling_factor = self.scheduler.sigma_steps[i], self.scheduler.factor_steps[i], self.scheduler.scaling_factor[i]
            denoised = self.net(x_cur / self.scheduler.scaling_steps[i], torch.as_tensor(sigma).to(x_cur.device))
            gradient, loss_scale = self.forward_op.gradient(denoised, observation, return_loss=True)
            ll_grad = torch.autograd.grad(denoised, x_cur, gradient)[0]
            ll_grad = ll_grad * 0.5 / torch.sqrt(loss_scale)
            score = (denoised - x_cur / self.scheduler.scaling_steps[i]) / sigma ** 2 / self.scheduler.scaling_steps[i]
            pbar.set_description(f'Iteration {i + 1}/{self.scheduler.num_steps}. Data fitting loss: {torch.sqrt(loss_scale)}')
            if self.sde:
                epsilon = torch.randn_like(x_cur)
                x_next = x_cur * scaling_factor + factor * score + np.sqrt(factor) * epsilon
            else:
                x_next = x_cur * scaling_factor + factor * score * 0.5
            x_next -= ll_grad * self.scale
        return x_next


class DPS_MoE(Algo):
    def __init__(self,
                 expert_nets,
                 forward_op,
                 diffusion_scheduler_config,
                 guidance_scale,
                 sde=True,
                 moe_top_k=2,
                 log_every_n_steps=100):
        super(DPS_MoE, self).__init__(expert_nets[0], forward_op)
        device = self.forward_op.device
        self.moe_prior_net = MoEPrior(
            expert_nets=expert_nets,
            top_k=moe_top_k,
            log_every_n_steps=log_every_n_steps
        ).to(device)
        self.scale = guidance_scale
        self.diffusion_scheduler_config = diffusion_scheduler_config
        self.scheduler = Scheduler(**diffusion_scheduler_config)
        self.sde = sde

    def inference(self, observation, num_samples=1, **kwargs):
        device = self.forward_op.device
        if num_samples > 1:
            observation = observation.repeat(num_samples, 1, 1, 1)
        x_initial = torch.randn(num_samples, self.net.img_channels, self.net.img_resolution, self.net.img_resolution, device=device) * self.scheduler.sigma_max
        x_next = x_initial
        x_next.requires_grad = True
        pbar = tqdm(range(self.scheduler.num_steps))
        for i in pbar:
            x_cur = x_next.detach().requires_grad_(True)
            sigma, factor, scaling_factor = self.scheduler.sigma_steps[i], self.scheduler.factor_steps[i], self.scheduler.scaling_factor[i]
            denoised = self.moe_prior_net(x_cur / self.scheduler.scaling_steps[i], torch.as_tensor(sigma).to(x_cur.device))
            gradient, loss_scale = self.forward_op.gradient(denoised, observation, return_loss=True)
            ll_grad = torch.autograd.grad(denoised, x_cur, gradient)[0]
            ll_grad = ll_grad * 0.5 / torch.sqrt(loss_scale)
            score = (denoised - x_cur / self.scheduler.scaling_steps[i]) / sigma ** 2 / self.scheduler.scaling_steps[i]
            pbar.set_description(f'Iteration {i + 1}/{self.scheduler.num_steps}. Data fitting loss: {torch.sqrt(loss_scale)}')
            if self.sde:
                epsilon = torch.randn_like(x_cur)
                x_next = x_cur * scaling_factor + factor * score + np.sqrt(factor) * epsilon
            else:
                x_next = x_cur * scaling_factor + factor * score * 0.5
            x_next -= ll_grad * self.scale
        return x_next

4.3 Experimental Results

The MoE again uses two priors (the CIFAR-10 and TCIR checkpoints) with Top-K = 2, so both prior models are always active.

Run result: 'psnr': 9.227320871705553

Terminal window
[2025-07-09 01:47:47,485][utils.helper][INFO] - Final metric results: {'cp_chi2': 70.84749164938927, 'cp_chi2_std': 182.86828415721504, 'camp_chi2': 266.2178825187683, 'camp_chi2_std': 986.2303840604698, 'psnr': 9.227320871705553, 'psnr_std': 1.3129285786605203, 'blur_psnr (f=10)': 9.227321200370788, 'blur_psnr (f=10)_std': 1.312928712148772, 'blur_psnr (f=15)': 10.530946354866028, 'blur_psnr (f=15)_std': 1.4752522557387036, 'blur_psnr (f=20)': 11.241626224517823, 'blur_psnr (f=20)_std': 1.6327269666531752}...

My guess is that the Transformer encoder layer may be useful, since the metric improved; but the router weights are still random, so the credibility of this result is debatable.

4.4 Code Issues

Suggestions and caveats for improving the MoE+DPS code, organized around three angles: functional correctness, training/inference efficiency, and maintainability.

4.4.1 Functional Correctness and Numerical Stability

  1. Router logits and softmax
    • You apply torch.topk(router_logits) first and then softmax over top_k_logits, i.e. you normalize within the truncated subspace, which is fine in theory. Just make sure that in extreme cases the selected logits are not all hugely negative, which could make the softmax overflow or underflow.
    • Tip: before F.softmax(top_k_logits, dim=-1) you can subtract top_k_logits.max(dim=-1, keepdim=True)[0] to further stabilize the values (see the sketch after this list).
  2. Sigma normalization
    • You normalize sigma by a fixed constant of 80. Is that constant valid for your whole noise schedule? If the schedule's maximum sigma is not 80, it is better to use scheduler.sigma_max dynamically.
    • Also note that in the scalar branch (sigma.dim() == 0), the current_sigma.unsqueeze(0) special-casing is fragile; it is cleaner to coerce sigma into a (B,) tensor once before it enters this code.
  3. Controlling gradient flow
    • Since the expert networks are put in eval() mode with requires_grad=False, they are never updated during the posterior stage. This matches your "frozen priors" design, but note that the router then only receives gradient signal through MoEPrior.forward.
    • If you want the router to adapt better to the data-consistency feedback inside the posterior (DPS) stage, consider letting the router participate in gradient updates during a DPS_MoE training or fine-tuning phase.
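
A minimal sketch of the two fixes from items 1 and 2, i.e. the max-shifted softmax and coercing sigma to a (B,) tensor once up front (the helper names are my own, not from the original code):

import torch
import torch.nn.functional as F

def as_batched_sigma(sigma: torch.Tensor, batch_size: int) -> torch.Tensor:
    """Coerce sigma to shape (B,) so later code never needs the 0-dim branch."""
    return sigma.repeat(batch_size) if sigma.dim() == 0 else sigma

def stable_topk_gating(router_logits: torch.Tensor, top_k: int):
    """Top-k routing with a max-shifted softmax for numerical stability."""
    top_k_logits, top_k_indices = torch.topk(router_logits, top_k, dim=-1)
    # Subtracting the row-wise max leaves the softmax output unchanged,
    # but prevents overflow/underflow for large-magnitude logits.
    stable = top_k_logits - top_k_logits.max(dim=-1, keepdim=True).values
    return F.softmax(stable, dim=-1), top_k_indices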

4.4.2 Inference Efficiency and Vectorization

Inside the current MoEPrior.forward there is a Python loop over every sample and, within it, over every Top-K expert; this drags down inference speed for larger batches. Consider:

# Assume x_t_scaled: (B, C, H, W)
# top_k_indices: (B, K)
# gating_weights: (B, K)
# every expert output has shape (B, C, H, W)
B, K = gating_weights.shape
# 1. Run all K selected experts over the full batch first,
#    collecting a list of (B, C, H, W) outputs
expert_outputs = []
for expert in selected_experts:  # selected_experts has length K
    expert_outputs.append(expert(x_t_scaled, sigma))
# expert_outputs -> list of (B, C, H, W)
stacked = torch.stack(expert_outputs, dim=1)  # (B, K, C, H, W)
# 2. gating_weights: (B, K) -> (B, K, 1, 1, 1)
weights = gating_weights.view(B, K, 1, 1, 1)
# 3. Weighted sum over the K experts
final = (stacked * weights).sum(dim=1)  # (B, C, H, W)
  • This removes the double Python loop and greatly improves GPU parallelism.
  • If the selected experts are identical across the batch (i.e. every row of top_k_indices is the same), you can optimize further by gathering once with .index_select(dim=0); see the sketch after this list.
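
Building on the second bullet, a hedged sketch of that fast path for the special case where every row of top_k_indices selects the same experts (the function and its assert are my own illustration):

import torch

def moe_forward_uniform(x, sigma, experts, top_k_indices, gating_weights):
    # Fast path: only valid when all batch rows route to the same expert set.
    assert (top_k_indices == top_k_indices[0:1]).all(), "rows must share experts"
    # Run each of the K shared experts once over the whole batch.
    outputs = [experts[idx](x, sigma) for idx in top_k_indices[0].tolist()]
    stacked = torch.stack(outputs, dim=1)             # (B, K, C, H, W)
    weights = gating_weights[:, :, None, None, None]  # (B, K, 1, 1, 1)
    return (stacked * weights).sum(dim=1)             # (B, C, H, W)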

4.4.3 Refinements to the Router Module

  1. Positional encoding
    • You use a learnable positional encoding self.pos_embedding of shape (1, 256, feature_dim). If the image resolution or patch count changes (say 128×128 inputs → 32×32 feature map → 1024 patches), the code has to change in lockstep, or the embedding has to be interpolated.
    • A more general approach is a sine-cosine positional encoding, or interpolating pos_embedding on the fly to match the current feature-map size (see the sketch after this list).
  2. Feature extractor
    • Right now you downsample only twice; a 16×16 map is fine for 64×64 inputs, but at higher resolutions adding more layers, or using AMP (mixed precision), may be more robust.
    • You could also replace the Conv→GELU→Conv inside feature_extractor with a small ResBlock, gaining expressiveness without significantly more parameters.
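
As a concrete version of the sine-cosine suggestion in item 1, here is a minimal sketch of a fixed 2D encoding computed on the fly for any feature-map size (splitting feature_dim evenly across the four sin/cos bands is my assumption):

import math
import torch

def sincos_pos_embedding(h: int, w: int, dim: int, device=None) -> torch.Tensor:
    """Fixed 2D sine-cosine positional encoding of shape (1, h*w, dim)."""
    assert dim % 4 == 0, "dim must be divisible by 4 (sin/cos bands for y and x)"
    d = dim // 4
    freqs = torch.exp(-torch.arange(d, dtype=torch.float32, device=device)
                      * (math.log(10000.0) / d))
    ys = torch.arange(h, dtype=torch.float32, device=device)[:, None] * freqs[None, :]  # (h, d)
    xs = torch.arange(w, dtype=torch.float32, device=device)[:, None] * freqs[None, :]  # (w, d)
    pe_y = torch.cat([ys.sin(), ys.cos()], dim=-1)  # (h, 2d)
    pe_x = torch.cat([xs.sin(), xs.cos()], dim=-1)  # (w, 2d)
    pe = torch.cat([
        pe_y[:, None, :].expand(h, w, 2 * d),  # row encoding broadcast over columns
        pe_x[None, :, :].expand(h, w, 2 * d),  # column encoding broadcast over rows
    ], dim=-1)                                 # (h, w, dim)
    return pe.reshape(1, h * w, dim)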

4.4.4 Training and Logging

  1. Load-Balancing Loss

    • As suggested earlier, you can add an auxiliary loss so that the router spreads its assignments as evenly as possible across experts over the whole batch (note the keepdim=True below, needed so the comparison broadcasts to (B, num_experts)):
    importance = router_logits.softmax(dim=-1).mean(dim=0)  # (num_experts,)
    load = (router_logits.argmax(dim=-1, keepdim=True) == torch.arange(num_experts).view(1, -1)).float().mean(dim=0)
    balance_loss = torch.sum(importance * load) * coeff
    total_loss = main_loss + balance_loss
  2. Logging Frequency and Visualization

    • Beyond print, you can draw an "expert activation frequency" bar chart in TensorBoard or Weights & Biases to grasp the routing behavior at a glance (a small sketch follows this list).
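
A hedged sketch of such an activation-frequency logger, assuming the wandb run from dps_moe_main.py is active (the ExpertUsageLogger class and its method names are my own illustration):

import torch
import wandb

class ExpertUsageLogger:
    """Accumulates top-k routing decisions and logs per-expert activation frequencies."""
    def __init__(self, num_experts: int):
        self.counts = torch.zeros(num_experts, dtype=torch.long)

    def update(self, top_k_indices: torch.Tensor):
        # top_k_indices: (B, K) tensor of chosen expert ids, as produced by MoEPrior.forward.
        self.counts += torch.bincount(top_k_indices.flatten().cpu(),
                                      minlength=len(self.counts))

    def log(self, step: int):
        freqs = self.counts.float() / self.counts.sum().clamp(min=1)
        wandb.log({f"router/expert_{i}_freq": f.item() for i, f in enumerate(freqs)},
                  step=step)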

4.4.5 Summary

  • Functionally, the architecture is already clear: Router → Top-K → ExpertPrior → weighted sum → DPS data consistency.
  • For performance, the key points are vectorizing the expert calls and avoiding tensor copies inside loops.
  • For maintainability, the router's positional encoding and feature extractor could become more generic modules, reusable across resolutions and tasks.

5. V3: Adding Router Fine-tuning

5.1 Architecture Flow Diagram

[Figure: V3.png (V3 architecture flow diagram)]

In this flow diagram the process is split into two stages: router fine-tuning and inference. Both stages run the full DPS_MoE algorithm; the difference is that during fine-tuning the router is corrected by backpropagating the loss produced by the MoE module, which is how the router model gets trained, while the inference stage performs no backward pass.

5.2 Code Implementation

Further improving dps_moe.py:

import torch
import torch.nn as nn
import torch.nn.functional as F
from tqdm import tqdm
from .base import Algo
from utils.scheduler import Scheduler
import numpy as np

# ==============================================================================
# 1. MoE core components (fully upgraded)
# ==============================================================================
class ExpertPrior(nn.Module):
    def __init__(self, pretrained_net):
        super().__init__()
        self.net = pretrained_net
        for param in self.net.parameters():
            param.requires_grad = False
        self.net.eval()

    def forward(self, x_t_scaled, sigma):
        return self.net(x_t_scaled, sigma)

# New: a simple residual block to strengthen the feature extractor
class ResBlock(nn.Module):
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.norm1 = nn.GroupNorm(8, channels)  # GroupNorm is insensitive to batch size
        self.conv2 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.norm2 = nn.GroupNorm(8, channels)

    def forward(self, x):
        h = F.gelu(self.norm1(self.conv1(x)))
        h = self.norm2(self.conv2(h))
        return F.gelu(x + h)

class Router(nn.Module):
    def __init__(self, input_channels, num_experts,
                 feature_dim=128, num_attn_heads=4, num_attn_layers=2,
                 sigma_max=80.0):  # new: sigma_max used for normalization
        super().__init__()
        self.sigma_max = sigma_max
        self.feature_dim = feature_dim
        # Enhanced feature extractor using ResBlocks
        self.feature_extractor = nn.Sequential(
            nn.Conv2d(input_channels, feature_dim // 2, kernel_size=3, stride=2, padding=1),
            ResBlock(feature_dim // 2),
            nn.Conv2d(feature_dim // 2, feature_dim, kernel_size=3, stride=2, padding=1),
            ResBlock(feature_dim),
        )
        self.pos_embedding = nn.Parameter(torch.randn(1, 16 * 16, feature_dim))  # assumes a default 16x16 grid
        encoder_layer = nn.TransformerEncoderLayer(
            d_model=feature_dim, nhead=num_attn_heads, dim_feedforward=feature_dim * 4,
            dropout=0.1, activation='gelu', batch_first=True, norm_first=True
        )
        self.transformer_encoder = nn.TransformerEncoder(encoder_layer, num_layers=num_attn_layers)
        self.gate_mlp = nn.Sequential(
            nn.LayerNorm(feature_dim + 1),
            nn.Linear(feature_dim + 1, feature_dim),
            nn.GELU(),
            nn.Linear(feature_dim, num_experts)
        )

    def forward(self, x_t, sigma):
        features = self.feature_extractor(x_t)
        b, d, h, w = features.shape
        features_seq = features.flatten(2).permute(0, 2, 1)
        # Interpolate the positional encoding dynamically for other resolutions
        if features_seq.shape[1] != self.pos_embedding.shape[1]:
            pos_embedding_resized = F.interpolate(
                self.pos_embedding.permute(0, 2, 1).view(1, d, 16, 16),  # original grid assumed to be 16x16
                size=(h, w), mode='bilinear', align_corners=False
            )
            pos_embedding_resized = pos_embedding_resized.flatten(2).permute(0, 2, 1)
            features_seq += pos_embedding_resized
        else:
            features_seq += self.pos_embedding
        attended_features = self.transformer_encoder(features_seq)
        global_feature = attended_features.mean(dim=1)
        # Unified handling of sigma's dimensionality
        if sigma.dim() == 0:
            sigma_tensor = sigma.repeat(b)
        else:
            sigma_tensor = sigma
        # Normalize with the schedule's actual sigma_max
        sigma_features = (sigma_tensor.float() / self.sigma_max).view(-1, 1)
        combined_features = torch.cat([global_feature, sigma_features], dim=-1)
        logits = self.gate_mlp(combined_features)
        return logits

class MoEPrior(nn.Module):
    def __init__(self, expert_nets: list, scheduler, top_k: int = 2,
                 log_every_n_steps: int = 100, aux_loss_weight: float = 1e-2):
        super().__init__()
        # ... initialization ...
        self.num_experts = len(expert_nets)
        self.img_channels = expert_nets[0].img_channels
        self.experts = nn.ModuleList([ExpertPrior(net) for net in expert_nets])
        self.router = Router(input_channels=self.img_channels, num_experts=self.num_experts, sigma_max=scheduler.sigma_max)
        self.top_k = min(top_k, self.num_experts)
        self.aux_loss_weight = aux_loss_weight
        self.internal_step = 0
        self.log_every_n_steps = log_every_n_steps

    def _calculate_aux_loss(self, router_logits):
        # Load-balancing loss
        router_probs = F.softmax(router_logits, dim=-1)
        # f_i: expected fraction of tokens each expert handles (approximated by probabilities here)
        f_i = router_probs.mean(dim=0)
        # P_i: total probability mass the router assigns to each expert
        P_i = router_probs.sum(dim=0) / len(router_logits)
        # The paper's loss: alpha * N * sum(f_i * P_i)
        loss = self.num_experts * torch.sum(f_i * P_i)
        return loss * self.aux_loss_weight

    def forward(self, x_t_scaled, sigma):
        self.internal_step += 1
        router_logits = self.router(x_t_scaled, sigma)
        aux_loss = self._calculate_aux_loss(router_logits)  # compute the auxiliary loss
        # Numerically stable softmax
        top_k_logits, top_k_indices = torch.topk(router_logits, self.top_k, dim=-1)
        stable_logits = top_k_logits - top_k_logits.max(dim=-1, keepdim=True).values
        gating_weights = F.softmax(stable_logits, dim=-1)
        if self.internal_step % self.log_every_n_steps == 0:
            print(f"\n[MoE Log @ Internal Step {self.internal_step}]: Activated experts -> {top_k_indices.tolist()}")
        # Vectorized forward pass: group the batch entries by activated expert
        final_denoised = torch.zeros_like(x_t_scaled)
        flat_expert_indices = top_k_indices.flatten()
        for i in range(self.num_experts):
            mask = (flat_expert_indices == i)
            if not mask.any():
                continue
            masked_indices = mask.nonzero(as_tuple=True)[0]
            original_batch_indices = masked_indices // self.top_k
            expert_inputs = x_t_scaled[original_batch_indices]
            if sigma.dim() > 0:
                expert_sigma = sigma[original_batch_indices]
            else:
                expert_sigma = sigma
            expert_output = self.experts[i](expert_inputs, expert_sigma)
            expert_weights = gating_weights.flatten()[masked_indices]
            weighted_output = expert_output * expert_weights.view(-1, 1, 1, 1)
            final_denoised.index_add_(0, original_batch_indices, weighted_output)
        return final_denoised, aux_loss  # return both the prediction and the auxiliary loss

# ==============================================================================
# DPS_MoE, adapted to handle the auxiliary loss
# ==============================================================================
class DPS_MoE(Algo):
    def __init__(self, expert_nets, forward_op, diffusion_scheduler_config, guidance_scale, sde=True, moe_top_k=2, log_every_n_steps=100, aux_loss_weight=1e-2):
        super(DPS_MoE, self).__init__(expert_nets[0], forward_op)
        device = self.forward_op.device
        self.scheduler = Scheduler(**diffusion_scheduler_config)  # instantiate the scheduler first
        self.moe_prior_net = MoEPrior(
            expert_nets=expert_nets,
            scheduler=self.scheduler,          # pass the scheduler in
            top_k=moe_top_k,
            log_every_n_steps=log_every_n_steps,
            aux_loss_weight=aux_loss_weight    # pass the auxiliary-loss weight in
        ).to(device)
        self.scale = guidance_scale
        self.sde = sde

    def inference(self, observation, num_samples=1, **kwargs):
        # At inference time we normally do not care about the auxiliary loss,
        # so the MoE forward call below unpacks `denoised, _` and discards it;
        # the rest of the loop is unchanged.
        device = self.forward_op.device
        if num_samples > 1:
            observation = observation.repeat(num_samples, 1, 1, 1)
        x_initial = torch.randn(num_samples, self.net.img_channels, self.net.img_resolution, self.net.img_resolution, device=device) * self.scheduler.sigma_max
        x_next = x_initial
        x_next.requires_grad = True
        pbar = tqdm(range(self.scheduler.num_steps))
        for i in pbar:
            x_cur = x_next.detach().requires_grad_(True)
            sigma, factor, scaling_factor = self.scheduler.sigma_steps[i], self.scheduler.factor_steps[i], self.scheduler.scaling_factor[i]
            denoised, _ = self.moe_prior_net(x_cur / self.scheduler.scaling_steps[i], torch.as_tensor(sigma).to(x_cur.device))
            gradient, loss_scale = self.forward_op.gradient(denoised, observation, return_loss=True)
            ll_grad = torch.autograd.grad(denoised, x_cur, gradient)[0]
            ll_grad = ll_grad * 0.5 / torch.sqrt(loss_scale)
            score = (denoised - x_cur / self.scheduler.scaling_steps[i]) / sigma ** 2 / self.scheduler.scaling_steps[i]
            pbar.set_description(f'Iteration {i + 1}/{self.scheduler.num_steps}. Data fitting loss: {torch.sqrt(loss_scale)}')
            if self.sde:
                epsilon = torch.randn_like(x_cur)
                x_next = x_cur * scaling_factor + factor * score + np.sqrt(factor) * epsilon
            else:
                x_next = x_cur * scaling_factor + factor * score * 0.5
            x_next -= ll_grad * self.scale
        return x_next

dps_moe_main.py

# dps_moe_main.py (final integrated version)
import os
from omegaconf import OmegaConf, ListConfig
import pickle
import hydra
from hydra.utils import instantiate
import torch
import torch.nn.functional as F
from torch.utils.data import DataLoader
from tqdm import tqdm
import wandb
from utils.helper import open_url, create_logger
# Make sure your MoE algorithm class lives at this path
from algo.dps import DPS_MoE


@hydra.main(version_base="1.3", config_path="configs", config_name="config")
def main(config):
    # =================================================================================
    # 1. Environment and configuration
    # =================================================================================
    device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
    if config.tf32:
        torch.set_float32_matmul_precision("high")
    torch.manual_seed(config.seed)
    # Choose the experiment directory based on the mode (training or inference)
    mode = 'train_router' if config.get('train_router', False) else 'inference'
    exp_dir = os.path.join(config.problem.exp_dir, config.algorithm.name, f"{config.exp_name}_{mode}")
    os.makedirs(exp_dir, exist_ok=True)
    logger = create_logger(exp_dir)
    OmegaConf.save(config, os.path.join(exp_dir, 'config.yaml'))
    if config.wandb:
        wandb.init(project=config.problem.name, group=config.algorithm.name,
                   config=OmegaConf.to_container(config), name=f"{config.exp_name}_{mode}",
                   reinit=True, settings=wandb.Settings(start_method="fork"))
        config = OmegaConf.create(dict(wandb.config))

    # =================================================================================
    # 2. Data loading and the forward model
    # =================================================================================
    forward_op = instantiate(config.problem.model, device=device)

    # =================================================================================
    # 3. Load the multiple expert priors
    # =================================================================================
    logger.info("--- Loading Expert Priors ---")
    if 'expert_priors' not in config.problem or not isinstance(config.problem.expert_priors, ListConfig):
        raise ValueError("Configuration error: `config.problem.expert_priors` must be a list in your problem config.")
    expert_nets = []
    logger.info(f"Loading {len(config.problem.expert_priors)} expert models...")
    for ckpt_path in config.problem.expert_priors:
        logger.info(f" Loading expert from {ckpt_path}...")
        # Instantiate a base network architecture, then load the weights.
        # Make sure config.pretrain.model instantiates your expert architecture correctly.
        net = instantiate(config.pretrain.model)
        ckpt = torch.load(ckpt_path, map_location=device)
        # Load weights according to your checkpoint format
        if 'ema' in ckpt:
            net.load_state_dict(ckpt['ema'])
        elif 'net' in ckpt:
            net.load_state_dict(ckpt['net'])
        else:
            net.load_state_dict(ckpt)
        net = net.to(device)
        net.eval()
        expert_nets.append(net)
        logger.info(f" Successfully loaded expert.")
        del ckpt
    logger.info("All expert models loaded.")

    # =================================================================================
    # 4. Instantiate the MoE algorithm
    # =================================================================================
    logger.info("--- Instantiating MoE Algorithm ---")
    algo = instantiate(config.algorithm.method,
                       forward_op=forward_op,
                       expert_nets=expert_nets)
    logger.info(f"Algorithm '{config.algorithm.name}' instantiated successfully.")

    # =================================================================================
    # 5. Dispatch on the mode (training or inference)
    # =================================================================================
    # ------------------ Router-training mode ------------------
    if config.get('train_router', False):
        logger.info("--- Starting Router Training ---")
        # Prepare the training set
        trainset = instantiate(config.problem.data, train=True)  # assumes your Dataset class supports a train=True mode
        trainloader = DataLoader(trainset, batch_size=config.train.batch_size, shuffle=True)
        # An optimizer over the router parameters only
        optimizer = torch.optim.Adam(algo.moe_prior_net.router.parameters(), lr=config.train.lr)
        # A bare-bones training loop
        for epoch in range(config.train.epochs):
            pbar = tqdm(trainloader)
            for i, data in enumerate(pbar):
                # Prepare the data
                if isinstance(data, dict):
                    x_gt = data['target'].to(device)
                else:
                    x_gt = data.to(device)
                observation = forward_op(x_gt)
                # Run a full reverse process to obtain a reconstruction.
                # Note: num_samples should be 1 during training.
                recon, aux_loss = algo.inference_for_training(observation)  # assumes a modified inference that also returns aux_loss
                # Compute the loss
                reconstruction_loss = F.mse_loss(recon, x_gt)
                total_loss = reconstruction_loss + aux_loss
                # Update the router
                optimizer.zero_grad()
                total_loss.backward()
                optimizer.step()
                pbar.set_description(f"Epoch {epoch+1}, Loss: {total_loss.item():.4f}, Recon Loss: {reconstruction_loss.item():.4f}, Aux Loss: {aux_loss.item():.4f}")
            # Save the router weights after every epoch
            router_save_path = os.path.join(exp_dir, f'router_epoch_{epoch+1}.pt')
            torch.save(algo.moe_prior_net.router.state_dict(), router_save_path)
            logger.info(f"Saved trained router to {router_save_path}")

    # ------------------ Inference mode ------------------
    elif config.get('inference', True):
        logger.info("--- Starting Inference ---")
        # Load trained router weights, if provided
        if config.get('router_ckpt_path', None):
            logger.info(f"Loading trained router from {config.router_ckpt_path}...")
            algo.moe_prior_net.router.load_state_dict(torch.load(config.router_ckpt_path, map_location=device))
        testset = instantiate(config.problem.data)
        testloader = DataLoader(testset, batch_size=1, shuffle=False)
        evaluator = instantiate(config.problem.evaluator, forward_op=forward_op)
        for i, data in enumerate(testloader):
            # ... (inference and evaluation logic identical to the original code) ...
            if isinstance(data, torch.Tensor):
                data = data.to(device)
            elif isinstance(data, dict):
                assert 'target' in data.keys(), "'target' must be in the data dict"
                for key, val in data.items():
                    if isinstance(val, torch.Tensor):
                        data[key] = val.to(device)
            data_id = testset.id_list[i]
            save_path = os.path.join(exp_dir, f'result_{data_id}.pt')
            observation = forward_op(data)
            target = data['target']
            logger.info(f'Running inference on test sample {data_id}...')
            recon = algo.inference(observation, num_samples=config.num_samples)
            result_dict = {
                'observation': observation,
                'recon': forward_op.unnormalize(recon).cpu(),
                'target': forward_op.unnormalize(target).cpu(),
            }
            torch.save(result_dict, save_path)
            metric_dict = evaluator(pred=result_dict['recon'], target=result_dict['target'], observation=result_dict['observation'])
            logger.info(f"Metric results for sample {data_id}: {metric_dict}...")
        logger.info("Evaluation completed...")
        metric_state = evaluator.compute()
        logger.info(f"Final aggregated metric results: {metric_state}...")
        if config.wandb:
            wandb.log(metric_state)
            wandb.finish()


if __name__ == "__main__":
    main()
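
The training branch above calls algo.inference_for_training, which is never defined in this post; the inline comment already flags it as an assumed, modified inference. A minimal sketch of what such a method could look like on DPS_MoE (my own illustration: the same sampling loop, but gradients are kept so total_loss.backward() can reach the router, per-step auxiliary losses are averaged, and the data-consistency guidance of inference() is omitted for brevity):

def inference_for_training(self, observation, num_samples=1):
    device = self.forward_op.device
    x_next = torch.randn(num_samples, self.net.img_channels,
                         self.net.img_resolution, self.net.img_resolution,
                         device=device) * self.scheduler.sigma_max
    total_aux_loss = 0.0
    for i in range(self.scheduler.num_steps):
        sigma = self.scheduler.sigma_steps[i]
        factor, scaling_factor = self.scheduler.factor_steps[i], self.scheduler.scaling_factor[i]
        # No detach here: the graph must reach the router for backward().
        denoised, aux_loss = self.moe_prior_net(
            x_next / self.scheduler.scaling_steps[i],
            torch.as_tensor(sigma).to(device))
        total_aux_loss = total_aux_loss + aux_loss
        score = (denoised - x_next / self.scheduler.scaling_steps[i]) / sigma ** 2 / self.scheduler.scaling_steps[i]
        x_next = x_next * scaling_factor + factor * score * 0.5  # deterministic branch
    # Caveat: retaining the graph across all steps is exactly the kind of
    # GPU memory blow-up observed when trying to train V3 (section 5.3).
    return x_next, total_aux_loss / self.scheduler.num_steps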

config.yaml

configs/config.yaml
# The `defaults` list defines the config groups loaded by default.
# Command-line arguments can easily override these values.
# For example, running with `problem=dps_moe_blackhole` replaces the `problem` entry below.
defaults:
  - _self_
  - algorithm: dps_moe          # default algorithm config
  - problem: dps_moe_blackhole  # default problem config
  - pretrain: blackhole         # pretraining-related config

# --- Mode control ---
# Set to True to enter router-training mode; False (or unset) means inference mode.
# (Note: the key must be `train_router`, since that is what dps_moe_main.py checks.)
train_router: True
inference: False

# --- Hyperparameters for training the router ---
train:
  batch_size: 1
  lr: 1e-4
  epochs: 50

# --- Path to a trained router checkpoint, loaded at inference time ---
# Example: "exps/inference/dps_moe_blackhole/your_exp_name_train_router/router_epoch_50.pt"
router_ckpt_path: exps/inference/dps_moe_blackhole/your_exp_name_train_router/router_epoch_50.pt

# --- Other global settings ---
tf32: True
num_samples: 1
compile: False
seed: 0
wandb: False
exp_name: default

problem/dps_moe_blackhole.yaml

name: blackhole

# --- Change begins ---
# The single 'prior' key is commented out / removed:
# prior: checkpoints/blackhole-50k.pt
# A new 'expert_priors' list is added instead.
# Replace these paths with the checkpoints of your actual pretrained expert models.
expert_priors:
  - checkpoints/cifar10_100k.pt  # example: a prior for a specific physical structure
  - checkpoints/tcir_100k.pt     # example: a prior for a different kind of observation data
  # - checkpoints/prior_model_C.pt  # example: a generic, robust prior
  # ... add more experts as needed
# --- Change ends ---

model:
  _target_: inverse_problems.blackhole.BlackHoleImaging
  # ... other settings unchanged ...
  root: /home/chy/hbx/blackhole/measure
  imsize: 64
  observation_time_ratio: 1.0
  noise_type: 'eht'
  w1: 0
  w2: 1
  w3: 1
  w4: 0.5
  sigma_noise: 0.0
  unnorm_scale: 0.5
  unnorm_shift: 1.0

data:
  _target_: training.dataset.BlackHole
  # ... other settings unchanged ...
  root: /home/chy/hbx/blackhole/test
  resolution: 64
  original_resolution: 64
  random_flip: False
  zoom_in_out: False
  id_list: 0-99

evaluator:
  _target_: eval.BlackHoleEvaluator

exp_dir: exps/inference/dps_moe_blackhole

algorithm/dps_moe.yaml

configs/algorithm/dps_moe.yaml
name: DPS_MoE
method:
  # _target_ points to the final DPS_MoE class with all features integrated
  _target_: algo.dps_moe.DPS_MoE
  # Diffusion process settings (unchanged)
  diffusion_scheduler_config:
    num_steps: 1000
    schedule: 'vp'
    timestep: 'vp'
    scaling: 'vp'
  # Guidance strength of the data-consistency term (unchanged)
  guidance_scale: 10.0
  # Whether to use the stochastic (SDE) sampler (unchanged)
  sde: True
  # --- MoE hyperparameters ---
  # Number of experts selected each step
  moe_top_k: 2
  # Print the expert-selection log every N steps of the inference/training loop
  log_every_n_steps: 100
  # Weight coefficient (alpha) of the load-balancing auxiliary loss.
  # This is an important hyperparameter that trades off the reconstruction
  # objective against expert load balancing; 1e-2 is a common, reasonable start.
  aux_loss_weight: 0.01

5.3 Experimental Results

Because the GPUs were tied up training priors, I never had the resources to debug and validate the V3 router of my MoE_DPS design. Today I finally tried to train the V3 router and found it runs out of GPU memory (most likely because backpropagating through the entire sampling trajectory keeps every step's computation graph alive). In any case, my senior labmate has already said that the architecture I designed is not actually usable for us: what we want is not the MoE idea but something like ensemble learning, and yet not quite ensemble learning either. So I will not keep debugging V3 for now; the schedule is tight. I did, however, also test V2 along the way:

For the V2 test (router with Transformer encoder) I used the following priors, all trained on data manually resized to 64×64 beforehand:

expert_priors:
- checkpoints/cifar10_100k.pt
- checkpoints/tcir_100k.pt
- checkpoints/Flower_Classification_V2_100k.pt
- checkpoints/flowers_100k.pt
- checkpoints/Flowers_Dataset_100k.pt
- checkpoints/Fruit_Classification_100k.pt
- checkpoints/Galaxy_zoo_split_100k.pt
- checkpoints/galaxy_zoo2_100k.pt
- checkpoints/Human_Faces_100k.pt
- checkpoints/Pretty_Face_100k.pt
- checkpoints/Star_Galaxy_Classification_Data_100k.pt
- checkpoints/Stress_Detection_Through_Iris_v1_100k.pt
Top-K = 2, psnr = 9.604681921308353:
[2025-07-13 18:35:32,206][utils.helper][INFO] - Final metric results: {'cp_chi2': 57.72336516141891, 'cp_chi2_std': 128.85988815672584, 'camp_chi2': 448.9530269742012, 'camp_chi2_std': 1465.880910127826, 'psnr': 9.604681921308353, 'psnr_std': 1.5628210516575156, 'blur_psnr (f=10)': 9.604682188034058, 'blur_psnr (f=10)_std': 1.562821134822804, 'blur_psnr (f=15)': 11.0180721616745, 'blur_psnr (f=15)_std': 1.735264194542514, 'blur_psnr (f=20)': 11.803821592330932, 'blur_psnr (f=20)_std': 1.9313266876374349}...

Top-K = 3, psnr = 9.384628724100963:
[2025-07-13 18:52:20,662][utils.helper][INFO] - Final metric results: {'cp_chi2': 66.08063945055008, 'cp_chi2_std': 130.43754026674176, 'camp_chi2': 588.2249845147132, 'camp_chi2_std': 1582.8932770596857, 'psnr': 9.384628724100963, 'psnr_std': 1.8851630062741853, 'blur_psnr (f=10)': 9.384628887176513, 'blur_psnr (f=10)_std': 1.8851631329054186, 'blur_psnr (f=15)': 10.81467140197754, 'blur_psnr (f=15)_std': 1.999706829276188, 'blur_psnr (f=20)': 11.592675695419311, 'blur_psnr (f=20)_std': 2.1566037509003237}...

Top-K = 4, psnr = 9.653339646736235:
[2025-07-13 19:35:18,437][utils.helper][INFO] - Final metric results: {'cp_chi2': 67.73039571523667, 'cp_chi2_std': 153.0329317845099, 'camp_chi2': 208.92955838203432, 'camp_chi2_std': 778.3560173957392, 'psnr': 9.653339646736235, 'psnr_std': 1.7299705422418872, 'blur_psnr (f=10)': 9.65333996772766, 'blur_psnr (f=10)_std': 1.7299707265484745, 'blur_psnr (f=15)': 10.972465562820435, 'blur_psnr (f=15)_std': 1.882743000045943, 'blur_psnr (f=20)': 11.730897061824798, 'blur_psnr (f=20)_std': 2.043529332985071}...

Top-K = 5, psnr = 9.420167746069437:
[2025-07-14 00:08:35,209][utils.helper][INFO] - Final metric results: {'cp_chi2': 60.64260640859604, 'cp_chi2_std': 121.21446586857735, 'camp_chi2': 459.2475729942322, 'camp_chi2_std': 1326.5611478303344, 'psnr': 9.420167746069437, 'psnr_std': 1.7715246437122023, 'blur_psnr (f=10)': 9.420168051719665, 'blur_psnr (f=10)_std': 1.77152472766505, 'blur_psnr (f=15)': 10.722562670707703, 'blur_psnr (f=15)_std': 1.9167290163454769, 'blur_psnr (f=20)': 11.444581513404847, 'blur_psnr (f=20)_std': 2.077268711432607}...

Conclusion: without fine-tuning, the router can at best guarantee that the experts it selects are among the currently better ones; activating more experts does not make results consistently better or worse. I also found the Top-K strategy itself questionable: in most steps it favors the stronger experts, leaving the other experts idle and useless. This also invites a rethink of whether plain scalar-weighted mixing is reasonable at all; I leave a question mark here.
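
One way to act on that question mark, and on the "all priors are useful" requirement discussed below, is to drop Top-K sparsity and use dense gating: the router only produces mixing weights, and every expert always contributes. A minimal sketch (names follow the V2/V3 code; this is an untested idea, not a validated implementation):

import torch
import torch.nn.functional as F

def dense_gating_forward(x_t_scaled, sigma, experts, router):
    # Every expert contributes; the router only decides the mixing weights.
    logits = router(x_t_scaled, sigma)                  # (B, N)
    weights = F.softmax(logits, dim=-1)                 # (B, N), rows sum to 1
    outputs = torch.stack([e(x_t_scaled, sigma) for e in experts], dim=1)  # (B, N, C, H, W)
    return (outputs * weights[:, :, None, None, None]).sum(dim=1)          # (B, C, H, W)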

6. Closing Thoughts

We decided to use MoE to integrate the prior models because our research progress is stuck on how to fuse many priors together, rather than simply multiplying each model's output by a weight and summing to obtain the denoising estimate. But MoE brings problems of its own. According to my senior, our work does not care about computational efficiency under large parameter counts, nor about sparsity, only about accuracy; the Top-K sparsity in my MoE design is therefore not what we want. He believes all priors are useful, but how to extract each prior's particular strengths is unclear, since that cannot be pinned down to specific subsets of each model's huge parameter set. So even switching to ensemble learning might not achieve the architecture we want.

So this part still needs my senior to think through how to design the MLP layers without MoE. But since I have already designed these versions, the code actually runs, and each architecture's flow is well reasoned, it would be a pity not to test them.

MoE PnP Architecture Design · Author: HuangNO1 · Published: 2025-07-08 · License: CC BY-NC-SA 4.0
https://huangno1.github.io/posts/moe_dps_pnp_implement/