Using Conv2d layers with kernel size 1 instead of Linear layers in the MLP implementation may result in a ~2x speed-up
In my experiments I observed that one of my networks, with around 400k parameters including convolutional layers, was training approximately 3x faster than another network designed for the same task with only linear layers and 60k parameters (the latter uses the MLP implementation of mtan_s1s2_classif).
I therefore wrote the following benchmarking script:
import time

import torch
from einops import rearrange

nb_features = 128
batch_size = 500
width = 64

conv_layer = torch.nn.Conv2d(nb_features, nb_features, 1, device="cuda")
linear_layer = torch.nn.Linear(nb_features, nb_features, device="cuda")

# Count the total number of parameters of each layer
nb_conv_params = sum(params.numel() for params in conv_layer.parameters())
nb_linear_params = sum(params.numel() for params in linear_layer.parameters())
print(f"{nb_conv_params=}")
print(f"{nb_linear_params=}")

conv_times = []
linear_times = []
for t in range(100):
    in_data = torch.rand((batch_size, nb_features, width, width), device="cuda")
    # Conv layer
    torch.cuda.synchronize()
    conv_start = time.time()
    _ = conv_layer(in_data)
    torch.cuda.synchronize()
    conv_stop = time.time()
    conv_times.append(conv_stop - conv_start)
    # Linear layer (requires rearranging the tensor so that channels are last)
    torch.cuda.synchronize()
    linear_start = time.time()
    _ = linear_layer(rearrange(in_data, "b c w h -> (b w h) c"))
    torch.cuda.synchronize()
    linear_stop = time.time()
    linear_times.append(linear_stop - linear_start)

avg_conv_time = sum(conv_times) / len(conv_times)
avg_linear_time = sum(linear_times) / len(linear_times)
percent = [100 * l / c for (l, c) in zip(linear_times, conv_times)]
mean_percent = sum(percent) / len(percent)
print(f"{avg_linear_time=:.4f} s")
print(f"{avg_conv_time=:.4f} s")
print(
    f"Speed-up: min={min(percent):.2f}%, max={max(percent):.2f}%, mean={mean_percent:.2f}%"
)
Note the following line:
_ = linear_layer(rearrange(in_data, "b c w h -> (b w h) c"))
This is a pattern we use a lot when manipulating SITS (satellite image time series); sometimes we also fold the time dimension.
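As a sanity check, here is a minimal sketch (shapes and names are made up for illustration) showing that a Conv2d with kernel size 1 computes the same thing as a Linear layer applied to the rearranged tensor, so the swap is a drop-in replacement:
import torch
from einops import rearrange

# Hypothetical shapes: 4 samples, 128 channels, 16x16 spatial grid
x = torch.rand(4, 128, 16, 16)

linear = torch.nn.Linear(128, 64)
conv = torch.nn.Conv2d(128, 64, kernel_size=1)

# Copy the Linear weights into the 1x1 convolution:
# the conv weight has shape (out, in, 1, 1), the linear weight (out, in)
with torch.no_grad():
    conv.weight.copy_(linear.weight[:, :, None, None])
    conv.bias.copy_(linear.bias)

out_linear = linear(rearrange(x, "b c h w -> (b h w) c"))
out_conv = rearrange(conv(x), "b c h w -> (b h w) c")
print(torch.allclose(out_linear, out_conv, atol=1e-5))  # True, up to float rounding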
Output of the benchmarking script (on a trex interactive GPU node):
$ python time_gpu.py
nb_conv_params=16512
nb_linear_params=16512
avg_linear_time=0.0179 s
avg_conv_time=0.0086 s
Speed-up: min=191.91%, max=288.15%, mean=197.67%
So using Conv2d with kernel size 1 instead of Linear layers in the MLP implementation would result in a 2x to 3x speed-up in training.
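For illustration, a drop-in MLP block built on 1x1 convolutions could look like the following sketch (class name and layer sizes are hypothetical, this is not the existing mtan_s1s2_classif implementation):
import torch

class ConvMLP(torch.nn.Module):
    """Two-layer MLP acting on the channel dimension of a (B, C, H, W) tensor,
    implemented with 1x1 convolutions so that no rearrange is needed."""

    def __init__(self, in_features, hidden_features, out_features):
        super().__init__()
        self.net = torch.nn.Sequential(
            torch.nn.Conv2d(in_features, hidden_features, kernel_size=1),
            torch.nn.ReLU(),
            torch.nn.Conv2d(hidden_features, out_features, kernel_size=1),
        )

    def forward(self, x):  # x: (B, C, H, W)
        return self.net(x)

# Example usage: 128 input channels, 64 hidden, 10 output channels
mlp = ConvMLP(128, 64, 10)
y = mlp(torch.rand(2, 128, 32, 32))  # -> (2, 10, 32, 32)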