While building out a miniGPT, I was curious about the forms of combination used in multi-head attention. The query, key, and value operations use dot products, a weighted stack-and-sum style of mixing, but the attention heads themselves are combined by simply concatenating their shorter latent vectors along the feature dimension. Each head's output is greatly reduced in size, so the feed-forward network following MHA has to work with head_size / num_heads segmentations. This felt a bit constricted, so I tested whether keeping each head at the full size and just stacking and summing in the original dimension would improve training loss. As seen above, it did result in minor improvements, which was pretty cool.

A few caveats. This is less an actual experiment and more just fun PyTorch experimentation as I familiarized myself with the architecture. The graph is from one training run (although the effect is consistent across many training runs), and LLMs tested at such small training epochs, batch sizes, and dimensions are hardly representative of actual results. Lastly, note that this variant keeps the learnable parameters a factor of (head_size / num_heads) as large in the attention-head dimensions, so it's definitely possible the improvements are due to that alone. When I can, I want to scale the default miniGPT parameters in a non-attention-head part of the model to observe the difference, and also see how stacking smaller-dimensional (head_size / num_heads) heads affects performance.

It is interesting, though, that a simple sum over a very small dimension (32 in this project) doesn't trash performance. Vectors are cool.
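The two combination schemes can be sketched as follows. This is a minimal shape-level illustration, not the code from the repo; the names (n_embd, num_heads, T) and the use of plain nn.Linear projections are my assumptions for the sketch.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Illustrative sizes, not taken from the repo.
n_embd, num_heads, T = 32, 4, 8
x = torch.randn(T, n_embd)

# Standard MHA combination: each head projects down to
# n_embd // num_heads features, and the head outputs are
# concatenated back up to n_embd along the feature dimension.
small_heads = [nn.Linear(n_embd, n_embd // num_heads, bias=False)
               for _ in range(num_heads)]
concat_out = torch.cat([h(x) for h in small_heads], dim=-1)

# The variant tested here: each head keeps the full n_embd width,
# and the head outputs are stacked and summed elementwise
# instead of concatenated.
full_heads = [nn.Linear(n_embd, n_embd, bias=False)
              for _ in range(num_heads)]
sum_out = torch.stack([h(x) for h in full_heads], dim=0).sum(dim=0)

# Both produce the same output shape for the residual stream,
# but the summed variant carries more learnable parameters in
# its head projections, since each head is full-width.
print(concat_out.shape, sum_out.shape)  # both (T, n_embd)
```

Either way the block after MHA sees a (T, n_embd) tensor; the difference is whether each output channel comes from exactly one head (concat) or from all heads at once (sum).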
Note: this is based on the existing implementation of MiniGPT by Andrej Karpathy. The file miniGPTMyMethod.py adds my modification to the attention head; the code otherwise mostly follows the pattern written by Andrej.