A hands-on lab for understanding, implementing, and optimizing transformer training from first principles.
TEL is where I build decoder language models, training loops, and systems-level experiments to study how transformers behave in practice — not just how they look on paper.
This repo is part of my effort to:
- implement transformer components from scratch
- understand training dynamics deeply
- benchmark optimization and memory techniques
- build intuition through code, ablations, and measurement
The goal is not just to “train a model” but to answer questions like:
- What actually happens inside attention, dropout, and backprop?
- How do batch size, gradient accumulation, and LR scaling interact?
- What do activation checkpointing and compilation really buy us?
- How should we measure throughput, memory, and training quality together?
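As a concrete instance of the batch-size question, the arithmetic behind "effective batch size" and linear LR scaling can be sketched in a few lines. This is a generic heuristic (the linear scaling rule), not a claim about TEL's defaults; all numbers and function names here are illustrative:

```python
# Sketch: how micro-batch size, gradient accumulation, and data parallelism
# combine into an effective batch size, and how a linear LR scaling
# heuristic would adjust the learning rate. Hypothetical numbers throughout.

def effective_batch_size(micro_batch: int, accum_steps: int, world_size: int = 1) -> int:
    """Samples contributing to a single optimizer step."""
    return micro_batch * accum_steps * world_size

def linear_scaled_lr(base_lr: float, base_batch: int, new_batch: int) -> float:
    """Linear scaling heuristic: scale LR proportionally with batch size."""
    return base_lr * new_batch / base_batch

ebs = effective_batch_size(micro_batch=8, accum_steps=4, world_size=2)  # 64
lr = linear_scaled_lr(base_lr=3e-4, base_batch=32, new_batch=ebs)       # 6e-4
```

Whether linear scaling actually holds at a given scale is exactly the kind of thing the sweeps below are meant to measure.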
TEL currently focuses on:
- decoder-style language model training
- modular experiment configuration
- dataset / tokenizer / model / logger adapters
- micro-batching and gradient accumulation
- token-aware gradient scaling
- bf16 autocast and torch.compile
- checkpointing and artifact tracking
- step, epoch, and validation metrics
- reproducible experimentation
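Several of the items above compose into one inner loop. Here is a minimal sketch (not TEL's actual trainer) of micro-batching with gradient accumulation under bf16 autocast; the model, shapes, and hyperparameters are placeholders:

```python
# Sketch: gradient accumulation over micro-batches with bf16 autocast.
# A tiny linear "model" stands in for a real decoder.
import torch
from torch import nn

model = nn.Linear(16, 4)
opt = torch.optim.AdamW(model.parameters(), lr=1e-3)
accum_steps = 4  # one optimizer step per 4 micro-batches

micro_batches = [(torch.randn(8, 16), torch.randint(0, 4, (8,))) for _ in range(8)]

opt.zero_grad(set_to_none=True)
for i, (x, y) in enumerate(micro_batches):
    # bf16 autocast; on GPU you would pass device_type="cuda"
    with torch.autocast(device_type="cpu", dtype=torch.bfloat16):
        loss = nn.functional.cross_entropy(model(x), y)
    (loss / accum_steps).backward()  # scale so accumulated grads average
    if (i + 1) % accum_steps == 0:
        opt.step()
        opt.zero_grad(set_to_none=True)
```

Dividing the loss by `accum_steps` before `backward()` makes the accumulated gradient equal the average over the effective batch rather than the sum.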
The broader goal of TEL is to create a strong experimental spine for studying transformer efficiency across:
- optimization
- memory usage
- throughput
- architectural tradeoffs
- training stability
- scaling behavior
This repo is designed to be a research-and-engineering sandbox rather than a polished framework.
A few principles guide this project:
- Build from scratch to understand the mechanics.
- Measure everything — loss alone is not enough.
- Prefer clear abstractions over magic.
- Keep experiments reproducible and easy to compare.
- Use the repo as a lab notebook for real learning.
At a high level:
- load corpus
- build tokenizer + vocab
- create train/val loaders
- build model
- resolve batching + learning rate
- train with metrics, checkpointing, and validation
- save final artifacts
Experiments in this repo include:
- LR vs Batch Size Empirical Sweep
- Experiments with Weight Decay
- Activation Checkpointing
- Micro-batching + Gradient Accumulation
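For the activation checkpointing experiments, the core mechanic is trading compute for memory: drop a block's activations in the forward pass and recompute them during backward. A small sketch using PyTorch's `torch.utils.checkpoint` (the two-layer "block" is a placeholder, not TEL's model):

```python
# Sketch: activation checkpointing recomputes the block's forward during
# backward instead of storing intermediate activations.
import torch
from torch import nn
from torch.utils.checkpoint import checkpoint

block = nn.Sequential(nn.Linear(32, 128), nn.GELU(), nn.Linear(128, 32))
x = torch.randn(4, 32, requires_grad=True)

# use_reentrant=False selects the recommended non-reentrant implementation
y = checkpoint(block, x, use_reentrant=False)
y.sum().backward()  # block's forward re-runs here to rebuild activations
```

What this "really buys us" depends on where activation memory dominates, which is exactly what the benchmarks aim to quantify.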
Most learning resources explain transformers at a high level. Fewer force you to confront the practical details:
- loss reduction choice
- padding-aware token accounting
- when gradients should be scaled
- how optimizer state behaves
- what “effective batch size” really means
- which optimizations help, and which only sound good
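Padding-aware token accounting is a good example of a detail that is easy to get wrong. Averaging per-token losses over all positions, pads included, dilutes the loss; the values below are made up purely for illustration:

```python
# Sketch: masked mean over real tokens vs a naive mean that counts padding.
PAD = 0  # hypothetical pad id

def masked_mean_loss(token_losses, token_ids, pad_id=PAD):
    """Average loss over non-pad tokens only."""
    real = [l for l, t in zip(token_losses, token_ids) if t != pad_id]
    return sum(real) / len(real)

losses = [2.0, 1.0, 3.0, 0.0, 0.0]   # per-token cross-entropy
ids    = [5,   9,   7,   PAD, PAD]   # last two positions are padding

naive = sum(losses) / len(losses)        # 1.2 — misleadingly low
correct = masked_mean_loss(losses, ids)  # 2.0 — averaged over real tokens
```

The same accounting decides when accumulated gradients should be scaled by token counts rather than micro-batch counts, which is what "token-aware gradient scaling" above refers to.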
TEL exists to close that gap.
Active and evolving.
This is an experimental repo, so expect:
- frequent iteration
- changing APIs
- ablation-heavy code
- implementation notes tied to ongoing experiments