Overview
For some reason, I cannot reproduce the results shown in the paper. I suspect something specific in my setup is throwing off the results, since I cannot match the reported accuracy for UNet, FNO, or CNO. I am using my own setup rather than a clone of the repo, but with the same models and settings, so I should get the same results. Unfortunately, for the NS case every model gives me at least 10% error, not the roughly 3% shown in the paper. I will outline my exact setup below; if there is a glaring mistake, I would appreciate any feedback.
Data
Dataset: Navier-Stokes 64x64 dataset. (Assume this is the case for everything below.)
Data splits: I split the data into training, validation, and test sets the same way you do:
import h5py
import numpy as np
import torch
from torchvision import transforms

# Import data
f = h5py.File('../NavierStokes_64x64_IN.h5', 'r')
x = []
y = []
for key in f.keys():
    x.append(f[key]['input'])
    y.append(f[key]['output'])
X = np.array(x)
y = np.array(y)
# Add a channel axis: (N, 64, 64) -> (N, 1, 64, 64)
X = X[:, np.newaxis, ...]
y = y[:, np.newaxis, ...]
# Split data up into train, val, test
X_train, y_train = X[:768], y[:768]
X_valid, y_valid = X[768:768 + 128], y[768:768 + 128]
X_test, y_test = X[768 + 128:768 + 128*2], y[768 + 128:768 + 128*2]
# Transform data (it seems you use this normalization, based on your code)
min_data, max_data = np.min(X_train), np.max(X_train)
#min_model, max_model = np.min(y_train), np.max(y_train) (this is not used - I only apply the transformation on X)
class NormalizeMinMax(torch.nn.Module):
    def __init__(self, img_min, img_max):
        super().__init__()
        self.img_min = img_min
        self.img_max = img_max

    def __call__(self, img):
        new_img = (img - self.img_min) / (self.img_max - self.img_min)
        return new_img

transform = transforms.Compose(
    [
        NormalizeMinMax(min_data, max_data),
    ]
)
# Convert to tensors
X_train = torch.tensor(X_train, dtype=torch.float32)
y_train = torch.tensor(y_train, dtype=torch.float32)
X_valid = torch.tensor(X_valid, dtype=torch.float32)
y_valid = torch.tensor(y_valid, dtype=torch.float32)
X_train = transform(X_train)
X_valid = transform(X_valid)
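For comparison, here is the same normalization step as a minimal NumPy sketch on synthetic data, with the min/max taken from the training inputs only. The helper names `min_max_normalize` and `min_max_denormalize` and the random arrays are hypothetical stand-ins, not code from the repo. The inverse transform is included because it would be needed on predictions if the reference pipeline also normalizes the targets, which I have not confirmed.

```python
import numpy as np

def min_max_normalize(a, lo, hi):
    """Map values from [lo, hi] to [0, 1] using fixed statistics."""
    return (a - lo) / (hi - lo)

def min_max_denormalize(a, lo, hi):
    """Inverse of min_max_normalize."""
    return a * (hi - lo) + lo

# Synthetic stand-in for the (N, 1, 64, 64) arrays used above.
rng = np.random.default_rng(0)
X = rng.normal(size=(1024, 1, 64, 64)).astype(np.float32)

# Same split sizes as above: 768 train, 128 validation, 128 test.
X_train, X_valid, X_test = X[:768], X[768:896], X[896:1024]

# Statistics come from the training split only and are reused everywhere.
x_min, x_max = X_train.min(), X_train.max()
X_train_n = min_max_normalize(X_train, x_min, x_max)
X_valid_n = min_max_normalize(X_valid, x_min, x_max)
X_test_n = min_max_normalize(X_test, x_min, x_max)

# Round trip: denormalizing recovers the original values.
X_back = min_max_denormalize(X_train_n, x_min, x_max)
```

Note that validation and test values can fall slightly outside [0, 1], since their extremes are not part of the training statistics.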
Hyperparameters
Here, I use the models supplied in the repository with a PyTorch Lightning setup. Below, I provide some pseudocode to explain the hyperparameter and optimization setup. I step the LR scheduler once per epoch, so after each epoch the LR becomes LR * 0.98.
loss_function = nn.L1Loss() # CNO and UNet
loss_function = nn.SmoothL1Loss() # FNO
batch_size = 32 # CNO and FNO
batch_size = 10 # UNet
optimizer = torch.optim.AdamW(self.parameters(), lr=1e-3, weight_decay=1e-6) # FNO
optimizer = torch.optim.AdamW(self.parameters(), lr=1e-3, weight_decay=1e-10) # CNO
optimizer = torch.optim.AdamW(self.parameters(), lr=5e-4, weight_decay=1e-6) # UNet
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=1, gamma=0.98) #unet, cno, and fno
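To make the effect of this schedule concrete, the sketch below computes the learning rate that `StepLR(step_size=1, gamma=0.98)` produces after a given number of epochs. It is plain Python arithmetic and does not touch the optimizer; the helper name `step_lr` is hypothetical.

```python
def step_lr(lr0, gamma, epoch):
    """Learning rate after `epoch` scheduler steps of StepLR(step_size=1)."""
    return lr0 * gamma ** epoch

# Starting LR of 1e-3 (the FNO and CNO setting above).
for epoch in (0, 100, 500, 1000):
    print(f"epoch {epoch:4d}: lr = {step_lr(1e-3, 0.98, epoch):.3e}")
```

With these settings the LR has dropped by roughly a factor of 7.5 after 100 epochs and is on the order of 1e-12 by epoch 1000, so late epochs barely update the weights.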
Model Setup
Here are some code snippets that show how I initialize each of the models. Each of these uses the best-performing hyperparameters you mention in the paper.
# CNO
class CNO2d(nn.Module):
    def __init__(self,
                 in_dim = 1,                 # Number of input channels.
                 out_dim = 1,                # Number of output channels.
                 size = 64,                  # Input and output spatial size (required)
                 N_layers = 3,               # Number of (D) or (U) blocks in the network
                 N_res = 1,                  # Number of (R) blocks per level (except the neck)
                 N_res_neck = 8,             # Number of (R) blocks in the neck
                 channel_multiplier = 32,    # How the number of channels evolves
                 use_bn = False,             # Add BN? We do not add BN in lifting/projection layer
                 ):
# FNO
class FNO2d(nn.Module):
    def __init__(self, in_channels = 1, out_channels = 1, device=None):
        super(FNO2d, self).__init__()
        """
        The overall network. It contains 4 layers of the Fourier layer.
        1. Lift the input to the desire channel dimension by self.fc0 .
        2. 4 layers of the integral operators u' = (W + K)(u).
            W defined by self.w; K defined by self.conv .
        3. Project from the channel space to the output space by self.fc1 and self.fc2 .
        input: the solution of the coefficient function and locations (a(x, y), x, y)
        input shape: (batchsize, x=s, y=s, c=3)
        output: the solution
        output shape: (batchsize, x=s, y=s, c=1)
        """
        self.modes1 = 16 #16
        self.modes2 = 16 #16
        self.width = 128 #64
        self.n_layers = 5 #5
        self.retrain_fno = 4 #4
        self.padding = 0 #0
        self.include_grid = 1 #1
        self.input_dim = in_channels
        self.act = nn.LeakyReLU()
        self.device = device
# UNet
class UNet(nn.Module):
    def __init__(self, n_channels = 1, n_classes = 1, bilinear=False):
        super(UNet, self).__init__()
        self.n_channels = n_channels
        self.n_classes = n_classes
        self.bilinear = bilinear
        self.inc = (DoubleConv(n_channels, 64))
        self.down1 = (Down(64, 128))
        self.down2 = (Down(128, 256))
        self.down3 = (Down(256, 512))
        factor = 2 if bilinear else 1
        self.down4 = (Down(512, 1024 // factor))
        self.up1 = (Up(1024, 512 // factor, bilinear))
        self.up2 = (Up(512, 256 // factor, bilinear))
        self.up3 = (Up(256, 128 // factor, bilinear))
        self.up4 = (Up(128, 64, bilinear))
        self.outc = (OutConv(64, n_classes))
Error Calculations
Finally, I compute the relative L1 error as mentioned in your paper. I apply the function below to each sample in my test set via a simple enumerated for loop over y_predicted.
X_test = torch.tensor(X_test, dtype=torch.float32)
X_test = transform(X_test)
test_loader = DataLoader(list(X_test), shuffle=False, batch_size=1)
y_predicted = trainer.predict(network, test_loader)
def relative_l1_error(y_pred, y_true):
    return torch.mean(torch.abs(y_pred - y_true)) / np.mean(np.abs(y_true))
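As a cross-check on the metric itself, here is a NumPy-only sketch of a per-sample relative L1 error with an explicit shape check, evaluated on synthetic data. The function name `relative_l1_errors` and the random arrays are hypothetical. Whether the paper aggregates the per-sample errors by mean or median is an assumption to verify, so the sketch prints both.

```python
import numpy as np

def relative_l1_errors(y_pred, y_true):
    """Per-sample relative L1 error: sum|pred - true| / sum|true|."""
    y_pred = np.asarray(y_pred, dtype=np.float64)
    y_true = np.asarray(y_true, dtype=np.float64)
    if y_pred.shape != y_true.shape:
        # Mismatched shapes would otherwise broadcast silently.
        raise ValueError(f"shape mismatch: {y_pred.shape} vs {y_true.shape}")
    n = y_true.shape[0]
    num = np.abs(y_pred - y_true).reshape(n, -1).sum(axis=1)
    den = np.abs(y_true).reshape(n, -1).sum(axis=1)
    return num / den

# Synthetic check: predictions that are off by 3% everywhere.
rng = np.random.default_rng(0)
y_true = rng.normal(size=(128, 1, 64, 64))
y_pred = 1.03 * y_true

errors = relative_l1_errors(y_pred, y_true)
print(f"mean: {errors.mean():.4f}, median: {np.median(errors):.4f}")
```

For a single sample, the ratio of sums is the same quantity as the ratio of means used in the function above. The shape check matters because a (1, 1, 64, 64) prediction compared against a (1, 64, 64) target would broadcast without raising an error.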
Results
Now come the questions. As explained above, I cannot get below 10% error with any of the models. I am trying to reach the 2.5-3.5% relative L1 errors reported in the paper for CNO, FNO, and UNet.
With SmoothL1Loss, my training loss goes down to 1e-5 and the validation loss goes down to 0.005; however, if I use L1Loss, I cannot seem to break 10% and the loss just stagnates, even with the hyperparameters given.
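One caveat when comparing these numbers: SmoothL1Loss and L1Loss are on different scales. With PyTorch's default `beta=1.0`, SmoothL1Loss is quadratic for absolute errors below 1, so a small loss value does not imply an equally small absolute error. The NumPy sketch below re-implements both losses (the function names are hypothetical, and the uniform 0.1 error is an illustrative assumption, not my data).

```python
import numpy as np

def l1_loss(pred, target):
    """Mean absolute error."""
    return np.mean(np.abs(pred - target))

def smooth_l1_loss(pred, target, beta=1.0):
    """Quadratic for |diff| < beta, linear beyond (mean reduction)."""
    diff = np.abs(pred - target)
    loss = np.where(diff < beta, 0.5 * diff ** 2 / beta, diff - 0.5 * beta)
    return np.mean(loss)

target = np.zeros(1000)
pred = np.full(1000, 0.1)  # every prediction is off by 0.1

print(f"L1:        {l1_loss(pred, target):.4f}")
print(f"Smooth L1: {smooth_l1_loss(pred, target):.4f}")
```

In this toy case an absolute error of 0.1 everywhere gives an L1 loss of about 0.1 but a smooth L1 loss of about 0.005, so the two loss curves should not be read on the same axis.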
All models were trained without early stopping for up to 1000 epochs, but for most of them the loss stopped improving, so I stopped training much sooner.
Conclusion
If anything in this setup is clearly wrong, I would appreciate the feedback. If my understanding of any of the results is incorrect, that would be good to know as well. My main questions concern the model setups: the code in the repo and the paper differ quite a bit, so I was not sure which setup was intended. I assumed the paper to be correct and followed each table regardless of the code in the repo. Thank you very much for taking the time to read this far. I hope the fix is a simple mistake on my part, but either way I am interested to hear what you all say. Have a great rest of your day :)