Training an LLM in Rust (trying to)
After spending the first part of my parental leave half-awake and reading about some of the foundations of ML, I thought (arrogantly), ‘how hard could it be to implement a simple LLM from scratch?’ Mind you, I really had no idea what that meant at the time, having only just discovered tokenizers and basic neural nets. This project was sort of the result of that, and as such it is a messy slop show, but hey, I learned a ton along the way!
Karpathy’s video series on backpropagation was basically my first start. I followed along in Python, did the chain rule math, and built out my own little net. So there it was: my sights were set on training a full LLM, with zero knowledge of transformers (lmao), and I thought, well, let’s do it in Rust, that would be fun (lmao).
Yeah so that was a stupid ass idea. I then spent about 2 weeks trying to learn about transformers from videos on YouTube and having Claude sort of hand-walk me through an implementation. The problem with this was that Claude ‘knew’ way more about transformers than I did, and as I now know, there are a variety of transformer architectures. I spent some time learning about BERT and GPT as well as flash, paged, and sparse attention. When I say learned, I mean I spent about a day reading about each. My goal for the project was to quickly get some breadth, so I could dig deeper where I wanted to afterwards.
I wanted to go simple and standard, so the project is just an attempt at a simple GPT-style transformer using multi-head attention. I didn’t know about attention masking until near the end of the project, when I was sitting around reading while waiting for the model to train. lol, yes I know.
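For anyone else who missed attention masking the first time around, here is a minimal sketch of the idea in plain Rust. It is not the code from this project: the function names `apply_causal_mask` and `softmax_rows` and the flat row-major `[seq_len, seq_len]` layout are assumptions made just for illustration. The point is that in a GPT-style model, position i should only attend to positions 0..=i, so the scores for "future" positions get set to negative infinity before the softmax and come out as zero weight.

```rust
/// Apply a causal mask, in place, to a square matrix of attention scores.
/// `scores` is assumed to be row-major [seq_len, seq_len]: row i is the
/// query position and column j is the key position.
fn apply_causal_mask(scores: &mut [f32], seq_len: usize) {
    assert_eq!(scores.len(), seq_len * seq_len);
    for i in 0..seq_len {
        // Every key position j > i is "in the future" for query i.
        for j in (i + 1)..seq_len {
            scores[i * seq_len + j] = f32::NEG_INFINITY;
        }
    }
}

/// Row-wise softmax, stabilized by subtracting each row's maximum.
/// Masked entries (-inf) become exactly 0 after exponentiation.
fn softmax_rows(scores: &mut [f32], seq_len: usize) {
    for row in scores.chunks_mut(seq_len) {
        let max = row.iter().cloned().fold(f32::NEG_INFINITY, f32::max);
        let mut sum = 0.0;
        for x in row.iter_mut() {
            *x = (*x - max).exp();
            sum += *x;
        }
        for x in row.iter_mut() {
            *x /= sum;
        }
    }
}
```

Calling `apply_causal_mask` and then `softmax_rows` on a buffer of raw scores leaves each row as a probability distribution over the current and earlier positions only. The diagonal is never masked, so every row keeps at least one finite score and the softmax stays well defined.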
The first go-around used zero CUDA code, but I realized that if I was going to train on a real corpus, I was going to need to offload some of the compute or wait forever for the model to train. Since my intent was always to train it on cloud hardware, I figured this was a good step. I have a 3090 to play around and test on, so I gave it a shot. Well, gave it a shot after fighting with Arch to install the NVIDIA drivers… lol. I have written some C in my life, so that part of the learning curve was quick, but trying to think of things from the POV of a GPU and not a CPU was a bit of a mindset shift. I can’t say I had totally shifted by the end of the project, and playing around with CUDA more is something I made a note to explore, not just for deep learning and inference, but as an alternative method of computation in general.
The goal was to train a simple model that could respond to chats, so I went looking for a solid corpus to train on. This was yet another area where I had limited knowledge, but I had seen folks on X talking about the value of AI-generated datasets. This led me to the open-phi/textbooks corpus, which is a dataset generated by LLMs in an attempt to create a “library”. It’s about 135MB, so fairly small, but I figured it would do.
My first training run was on my local machine, just off of my 3090. I wasn’t about to waste cloud compute if it didn’t produce any outputs, of course! I wasn’t entirely surprised that training was slow, but just 3 epochs took quite a while. My loss was a little high, but generally usable from what I had read, so I figured it would be good enough, especially for such a small number of passes.
$ cargo run --release -- ./textbooks.txt 3
Compiling llm-rs v0.1.0 (/home/zek/workspace/llm-rs)
Finished `release` profile [optimized] target(s) in 8.33s
Running `target/release/llm-rs ./textbooks.txt 3`
batch 50 | loss 3.824255
batch 100 | loss 3.310094
batch 150 | loss 3.125167
batch 200 | loss 3.083753
batch 250 | loss 3.069662
batch 300 | loss 2.924012
batch 350 | loss 2.872879
..blah ..blah..
Now, what I should’ve done at this point is try to run inference on the resulting model weights. But like a chimp, I didn’t. Instead, I went on vast.ai, got an H100, and ripped 50 epochs. At the time, this was pretty cheap, about $1.20 per GPU per hour, so like $1.40 total for the rented machine. The first go-around, the machine got powered off, or switched, not really sure honestly, mid-run since it was not a dedicated node, so I wasn’t able to finish the run. The second go-around ran for about 8 hours, so including the flat rental fee, it cost me about $20 in compute. Not bad. In hindsight, I should’ve trained until convergence rather than for a fixed number of epochs, but again, that was something I wasn’t aware of at the time. More fun learnings.
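One common way to "train until convergence" is early stopping: keep an eye on the loss for a held-out validation split and stop once it hasn’t improved for a few epochs in a row. The sketch below is a hypothetical helper, not something from this project. The `EarlyStopping` struct, its `patience` and `min_delta` parameters, and the existence of a validation split at all are assumptions for illustration.

```rust
/// Tracks the best validation loss seen so far and signals when to stop.
/// All names here are illustrative, not taken from the project.
struct EarlyStopping {
    best: f32,         // lowest validation loss seen so far
    patience: usize,   // how many non-improving epochs to tolerate
    min_delta: f32,    // minimum decrease that counts as an improvement
    bad_epochs: usize, // consecutive epochs without improvement
}

impl EarlyStopping {
    fn new(patience: usize, min_delta: f32) -> Self {
        Self { best: f32::INFINITY, patience, min_delta, bad_epochs: 0 }
    }

    /// Record one epoch's validation loss.
    /// Returns true once `patience` epochs in a row failed to improve.
    fn should_stop(&mut self, val_loss: f32) -> bool {
        if val_loss < self.best - self.min_delta {
            self.best = val_loss;
            self.bad_epochs = 0;
        } else {
            self.bad_epochs += 1;
        }
        self.bad_epochs >= self.patience
    }
}
```

In a training loop this would be called once per epoch, right after evaluating on the validation split, breaking out of the loop when `should_stop` returns true. Saving a checkpoint whenever `best` improves means the run can still hand back its best weights if the rented machine disappears halfway through.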
Ok so, I had my model weights, and I was ready to go try out my model for the first time. And it didn’t work. Ah damn, I messed something up. Well, a couple of things actually. Firstly, my linear weights are stored as [in, out], which is not the layout PyTorch expects (it stores them as [out, in]). Secondly, you’ll note my param keys are non-standardly named, so loaders were failing to map them. And finally, my bias embeddings are saved as [1, embed_dim] rather than just [embed_dim], though this wasn’t as big of a deal ofc. Rather than retraining, I wrote a small script to transpose the weights and fix this stuff to be usable. As I learned, there are some considerations with respect to inference.
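The two shape fixes are simple enough to sketch. This is not the actual conversion script (which isn’t shown here). It is a minimal illustration in plain Rust, assuming the tensors are flat row-major `f32` buffers, and the helper names `transpose` and `squeeze_leading` are made up for the example. Renaming the param keys is left out, since it depends entirely on what the loader expects.

```rust
/// Transpose a row-major [rows, cols] matrix into a row-major [cols, rows]
/// matrix, e.g. linear weights stored as [in, out] into [out, in].
fn transpose(data: &[f32], rows: usize, cols: usize) -> Vec<f32> {
    assert_eq!(data.len(), rows * cols);
    let mut out = vec![0.0; rows * cols];
    for r in 0..rows {
        for c in 0..cols {
            // Element (r, c) of the input becomes element (c, r) of the output.
            out[c * rows + r] = data[r * cols + c];
        }
    }
    out
}

/// Drop a leading dimension of size 1 from a shape,
/// e.g. [1, embed_dim] -> [embed_dim]. The data buffer itself is unchanged,
/// because removing a size-1 dimension does not move any elements.
fn squeeze_leading(shape: &[usize]) -> Vec<usize> {
    match shape {
        [1, rest @ ..] if !rest.is_empty() => rest.to_vec(),
        _ => shape.to_vec(),
    }
}
```

For a weight saved as [in, out], `transpose(&w, in_dim, out_dim)` gives the buffer for the [out, in] layout, and `squeeze_leading(&[1, embed_dim])` gives the shape to record for a bias. Doing this once at export time is cheaper than retraining, which is the whole appeal.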
Overall, the project helped me get my hands very dirty very quickly. In just a few weeks I had built something (albeit shitty) that provided a good launch into some other directions, including going back and relearning some of the fundamentals, and actually spending the time to read about LLM implementations in practice.