tensor-parallelism

Here are 27 public repositories matching this topic...

bigscience-workshop / petals

🌸 Run LLMs at home, BitTorrent-style. Fine-tuning and inference up to 10x faster than offloading

Updated Sep 7, 2024
Python

InternLM / InternEvo

InternEvo is an open-source, lightweight training framework that aims to support model pre-training without the need for extensive dependencies.

pytorch multi-modal gemma pipeline-parallelism transformers-models tensor-parallelism llava llm-training internlm flash-attention zero3 llm-framework sequence-parallelism internlm2 ring-attention deepspeed-ulysses llama3 910b

Updated Aug 21, 2025
Python

kaiyuyue / torchshard

Slicing a PyTorch Tensor Into Parallel Shards

pytorch model-parallelism tensor-parallelism

Updated Jun 7, 2025
Python

psmarter / mini-infer

LLM inference engine from scratch — paged KV cache, continuous batching, chunked prefill, prefix caching, speculative decoding, CUDA graph, tensor parallelism, OpenAI-compatible serving

machine-learning cuda inference pytorch transformer triton moe quantization language-model inference-engine kv-cache tensor-parallelism llm speculative-decoding pagedattention continuous-batching

Updated Apr 24, 2026
Python

ai-decentralized / BloomBee

Decentralized LLM fine-tuning and inference with offloading

distributed-systems machine-learning deep-learning pytorch llama pipeline-parallelism tensor-parallelism

Updated May 23, 2026
Python

xrsrke / pipegoose

Large scale 4D parallelism pre-training for 🤗 transformers in Mixture of Experts *(still work in progress)*

transformers moe data-parallelism distributed-optimizers model-parallelism megatron mixture-of-experts pipeline-parallelism huggingface-transformers megatron-lm tensor-parallelism large-scale-language-modeling 3d-parallelism zero-1 sequence-parallelism

Updated Dec 14, 2023
Python

gty111 / gLLM

gLLM: Global Balanced Pipeline Parallelism System for Distributed LLM Serving with Token Throttling

pipeline-parallelism tensor-parallelism llm-serving llm-inference pagedattention continuous-batching qwen3 token-throttling chunked-prefill

Updated May 24, 2026
Python

aniquetahir / JORA

JORA: JAX Tensor-Parallel LoRA Library (ACL 2024)

machine-learning lora jax tensor-parallelism large-language-models

Updated Apr 25, 2024
Python

ShinoharaHare / LLM-Training

A distributed training framework for large language models powered by Lightning.

transformer llama distributed-training fine-tuning pre-training tensor-parallelism llm instruction-tuning llm-training llm-finetuning phi-3

Updated Jul 31, 2025
Python

AlibabaPAI / FlashModels

Fast and easy distributed model training examples.

deep-learning pytorch zero data-parallelism model-parallelism distributed-training xla tensor-parallelism llm fsdp sequence-parallelism

Updated Nov 26, 2024
Python

fattorib / transformer_shmap

Tensor Parallelism with JAX + Shard Map

transformers gpt tpu jax tensor-parallelism pjit shmap

Updated Sep 29, 2023
Python

George614 / gpu-mem-calculator

GPU Memory Calculator for LLM Training - Calculate GPU memory requirements for training Large Language Models with support for multiple training engines including PyTorch DDP, DeepSpeed ZeRO, Megatron-LM, and FSDP.

Updated Jan 26, 2026
Python

Dev-next-gen / diffusers-rocm-parallel

Multi-GPU tensor/context parallel diffusion on AMD ROCm — with the patch that makes it actually work.

flux amd pytorch text-to-image multi-gpu rocm tensor-parallelism diffusers rdna3 ring-attention

Updated Apr 19, 2026
Python

NiuHuangxiaozi / Deep-Learning-Parallelism

This repository provides a comprehensive guide to distributed training of deep learning models.

pytorch ps ddp allreduce pipline deepspeed tensor-parallelism

Updated Jul 2, 2024
Python

SuZeAI / DP

This repository focuses on distributed and parallel computing with PyTorch, covering model parallelism, data parallelism, and advanced optimization techniques. It provides resources for scaling AI training and inference efficiently across multiple devices.

parallel parallel-computing distributed ddp parallel-data tensor-parallelism fsdp

Updated Jun 29, 2025
Jupyter Notebook

developertogo / velo-core

A production-grade, native Rust speculative inference engine for Apple Silicon with Metal GPU acceleration and paged attention.

metal gpu-acceleration systems-programming apple-silicon openai-api tensor-parallelism llm-inference speculative-decoding paged-attention continuous-batching prefix-caching disaggregated-serving

Updated May 9, 2026
Rust

rajatady / Inference-Stack

Production-grade LLM inference API built from scratch. NestJS gateway + Python GPU workers. Scheduling, batching, KV cache, tensor parallelism, multi-modal — all against real GPUs.

gpu grpc transformers inference multi-modal nestjs kv-cache dynamic-batching tensor-parallelism llm

Updated Mar 12, 2026
TypeScript

polrolnik2 / upmem-tensor-multiplication-runtime

A reference implementation of Matrix Multiplication algorithms for ML on UPMEM PIM - a processing-in-memory platform

c distributed-systems machine-learning cmake hpc ml pim upmem processing-in-memory tensor-parallelism

Updated Mar 9, 2026
C

nshkrdotcom / vllm

vLLM - High-throughput, memory-efficient LLM inference engine with PagedAttention, continuous batching, CUDA/HIP optimization, quantization (GPTQ/AWQ/INT4/INT8/FP8), tensor/pipeline parallelism, OpenAI-compatible API, multi-GPU/TPU/Neuron support, prefix caching, and multi-LoRA capabilities

Updated Apr 23, 2026
Elixir

Awrsha / Megatron-LM

Trains a 7B-parameter GPT model using NVIDIA Megatron-LM with full 3D parallelism across a 64-GPU InfiniBand cluster. Communication is profiled at multiple levels: PyTorch Profiler traces, Nsight Systems captures, a dedicated NCCL C++ benchmark, and a Rust GPU memory monitor.

cuda nccl megatron-lm tensor-parallelism 3d-parallelism

Updated May 22, 2026
Python
