A hands-on lab for understanding, implementing, and optimizing transformer training from first principles.
TEL is where I build decoder language models, training loops, and systems-level experiments to study how transformers behave in practice — not just how they look on paper.
This repo is part of my effort to:
- implement transformer components from scratch
- understand training dynamics deeply
- benchmark optimization and memory techniques
- build intuition through code, ablations, and measurement
The goal is not just to “train a model” but to answer questions like:
- What actually happens inside attention, dropout, and backprop?
- How do batch size, gradient accumulation, and LR scaling interact?
- What do activation checkpointing and compilation really buy us?
- How should we measure throughput, memory, and training quality together?
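As a concrete instance of the batch-size question, the arithmetic behind "effective batch size" and linear LR scaling can be sketched in a few lines. This is a generic heuristic (the linear scaling rule), not a claim about TEL's defaults; all numbers and function names here are illustrative:

```python
# Sketch: how micro-batch size, gradient accumulation, and data parallelism
# combine into an effective batch size, and how a linear LR scaling
# heuristic would adjust the learning rate. Hypothetical numbers throughout.

def effective_batch_size(micro_batch: int, accum_steps: int, world_size: int = 1) -> int:
    """Samples contributing to a single optimizer step."""
    return micro_batch * accum_steps * world_size

def linear_scaled_lr(base_lr: float, base_batch: int, new_batch: int) -> float:
    """Linear scaling heuristic: scale LR proportionally with batch size."""
    return base_lr * new_batch / base_batch

ebs = effective_batch_size(micro_batch=8, accum_steps=4, world_size=2)  # 64
lr = linear_scaled_lr(base_lr=3e-4, base_batch=32, new_batch=ebs)       # 6e-4
```

Whether linear scaling actually holds at a given scale is exactly the kind of thing the sweeps below are meant to measure.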
TEL currently focuses on:
- decoder-style language model training
- modular experiment configuration
- dataset / tokenizer / model / logger adapters
- micro-batching and gradient accumulation
- token-aware gradient scaling
- bf16 autocast and torch.compile
- checkpointing and artifact tracking
- step, epoch, and validation metrics
- reproducible experimentation
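Several of the items above compose into one inner loop. Here is a minimal sketch (not TEL's actual trainer) of micro-batching with gradient accumulation under bf16 autocast; the model, shapes, and hyperparameters are placeholders:

```python
# Sketch: gradient accumulation over micro-batches with bf16 autocast.
# A tiny linear "model" stands in for a real decoder.
import torch
from torch import nn

model = nn.Linear(16, 4)
opt = torch.optim.AdamW(model.parameters(), lr=1e-3)
accum_steps = 4  # one optimizer step per 4 micro-batches

micro_batches = [(torch.randn(8, 16), torch.randint(0, 4, (8,))) for _ in range(8)]

opt.zero_grad(set_to_none=True)
for i, (x, y) in enumerate(micro_batches):
    # bf16 autocast; on GPU you would pass device_type="cuda"
    with torch.autocast(device_type="cpu", dtype=torch.bfloat16):
        loss = nn.functional.cross_entropy(model(x), y)
    (loss / accum_steps).backward()  # scale so accumulated grads average
    if (i + 1) % accum_steps == 0:
        opt.step()
        opt.zero_grad(set_to_none=True)
```

Dividing the loss by `accum_steps` before `backward()` makes the accumulated gradient equal the average over the effective batch rather than the sum.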
The broader goal of TEL is to create a strong experimental spine for studying transformer efficiency across:
- optimization
- memory usage
- throughput
- architectural tradeoffs
- training stability
- scaling behavior
This repo is designed to be a research-and-engineering sandbox rather than a polished framework.
A few principles guide this project:
- Build from scratch to understand the mechanics.
- Measure everything — loss alone is not enough.
- Prefer clear abstractions over magic.
- Keep experiments reproducible and easy to compare.
- Use the repo as a lab notebook for real learning.
At a high level:
- load corpus
- build tokenizer + vocab
- create train/val loaders
- build model
- resolve batching + learning rate
- train with metrics, checkpointing, and validation
- save final artifacts
Experiments in this repo include:
- LR vs Batch Size Empirical Sweep
- Experiments with Weight Decay
- Activation Checkpointing
- Micro-batching + Gradient Accumulation
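For the activation checkpointing experiments, the core mechanic is trading compute for memory: drop a block's activations in the forward pass and recompute them during backward. A small sketch using PyTorch's `torch.utils.checkpoint` (the two-layer "block" is a placeholder, not TEL's model):

```python
# Sketch: activation checkpointing recomputes the block's forward during
# backward instead of storing intermediate activations.
import torch
from torch import nn
from torch.utils.checkpoint import checkpoint

block = nn.Sequential(nn.Linear(32, 128), nn.GELU(), nn.Linear(128, 32))
x = torch.randn(4, 32, requires_grad=True)

# use_reentrant=False selects the recommended non-reentrant implementation
y = checkpoint(block, x, use_reentrant=False)
y.sum().backward()  # block's forward re-runs here to rebuild activations
```

What this "really buys us" depends on where activation memory dominates, which is exactly what the benchmarks aim to quantify.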
Most learning resources explain transformers at a high level. Fewer force you to confront the practical details:
- loss reduction choice
- padding-aware token accounting
- when gradients should be scaled
- how optimizer state behaves
- what “effective batch size” really means
- which optimizations help, and which only sound good
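Padding-aware token accounting is a good example of a detail that is easy to get wrong. Averaging per-token losses over all positions, pads included, dilutes the loss; the values below are made up purely for illustration:

```python
# Sketch: masked mean over real tokens vs a naive mean that counts padding.
PAD = 0  # hypothetical pad id

def masked_mean_loss(token_losses, token_ids, pad_id=PAD):
    """Average loss over non-pad tokens only."""
    real = [l for l, t in zip(token_losses, token_ids) if t != pad_id]
    return sum(real) / len(real)

losses = [2.0, 1.0, 3.0, 0.0, 0.0]   # per-token cross-entropy
ids    = [5,   9,   7,   PAD, PAD]   # last two positions are padding

naive = sum(losses) / len(losses)        # 1.2 — misleadingly low
correct = masked_mean_loss(losses, ids)  # 2.0 — averaged over real tokens
```

The same accounting decides when accumulated gradients should be scaled by token counts rather than micro-batch counts, which is what "token-aware gradient scaling" above refers to.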
TEL exists to close that gap.
Active and evolving.
This is an experimental repo, so expect:
- frequent iteration
- changing APIs
- ablation-heavy code
- implementation notes tied to ongoing experiments