🌸 Run LLMs at home, BitTorrent-style. Fine-tuning and inference up to 10x faster than offloading
Updated Sep 7, 2024 - Python
InternEvo is an open-source, lightweight training framework that aims to support model pre-training without the need for extensive dependencies.
Slicing a PyTorch Tensor Into Parallel Shards
LLM inference engine from scratch — paged KV cache, continuous batching, chunked prefill, prefix caching, speculative decoding, CUDA graph, tensor parallelism, OpenAI-compatible serving
Decentralized LLM fine-tuning and inference with offloading
Large-scale 4D parallelism pre-training for 🤗 transformers in Mixture of Experts *(still a work in progress)*
gLLM: Global Balanced Pipeline Parallelism System for Distributed LLM Serving with Token Throttling
JORA: JAX Tensor-Parallel LoRA Library (ACL 2024)
A distributed training framework for large language models powered by Lightning.
Fast and easy distributed model training examples.
Tensor Parallelism with JAX + Shard Map
GPU Memory Calculator for LLM Training - Calculate GPU memory requirements for training Large Language Models with support for multiple training engines including PyTorch DDP, DeepSpeed ZeRO, Megatron-LM, and FSDP.
Multi-GPU tensor/context parallel diffusion on AMD ROCm — with the patch that makes it actually work.
This repository focuses on distributed and parallel computing with PyTorch, covering model parallelism, data parallelism, and advanced optimization techniques. It provides resources for scaling AI training and inference efficiently across multiple devices.
A production-grade, native Rust speculative inference engine for Apple Silicon with Metal GPU acceleration and paged attention.
Production-grade LLM inference API built from scratch. NestJS gateway + Python GPU workers. Scheduling, batching, KV cache, tensor parallelism, multi-modal — all against real GPUs.
A reference implementation of Matrix Multiplication algorithms for ML on UPMEM PIM - a processing-in-memory platform
vLLM - High-throughput, memory-efficient LLM inference engine with PagedAttention, continuous batching, CUDA/HIP optimization, quantization (GPTQ/AWQ/INT4/INT8/FP8), tensor/pipeline parallelism, OpenAI-compatible API, multi-GPU/TPU/Neuron support, prefix caching, and multi-LoRA capabilities
Trains a 7B-parameter GPT model using NVIDIA Megatron-LM with full 3D parallelism across a 64-GPU InfiniBand cluster. Communication is profiled at multiple levels: PyTorch Profiler traces, Nsight Systems captures, a dedicated NCCL C++ benchmark, and a Rust GPU memory monitor.