# continuous-batching

Here are 32 public repositories matching this topic...

cubist38 / mlx-openai-server

A high-performance API server that provides OpenAI-compatible endpoints for MLX models. Developed using Python and powered by the FastAPI framework, it provides an efficient, scalable, and user-friendly solution for running MLX-based vision and language models locally with an OpenAI-compatible interface.

flux queue speech-recognition image-generation whisper vision-api mlx fastapi multi-models apple-silicon continuous-batching tool-calling structured-outputs mlx-lm mlx-vlm openai-compatible

Updated Jun 29, 2026
Python

psmarter / mini-infer

LLM inference engine from scratch — paged KV cache, continuous batching, chunked prefill, prefix caching, speculative decoding, CUDA graph, tensor parallelism, OpenAI-compatible serving

machine-learning cuda inference pytorch transformer triton moe quantization language-model inference-engine kv-cache tensor-parallelism llm speculative-decoding pagedattention continuous-batching

Updated Apr 24, 2026
Python

lumia431 / photon_infer

A High-Performance LLM Inference Engine with vLLM-Style Continuous Batching

modern-cpp inference-engine ai-infra vllm llm-inference paged-attention continuous-batching

Updated Jan 2, 2026
C++

gty111 / gLLM

An Efficient and Versatile Inference Engine for Distributed LLM Serving

pipeline-parallelism tensor-parallelism llm-serving llm-inference pagedattention continuous-batching qwen3 token-throttling chunked-prefill

Updated Jul 3, 2026
Python

CarolBaggins2023 / baby-vllm

A lightweight, educational LLM inference engine for studying continuous batching, paged KV cache, chunked prefill, and online serving.

llm-serving llm-inference continuous-batching chunked-prefill paged-kv-cache

Updated Jun 20, 2026
Python

caimari / vtts

Continuous batching for TTS — like vLLM, but for voice. Serve 10+ simultaneous text-to-speech requests on a single GPU.

text-to-speech pytorch tts speech-synthesis voice-synthesis voice-cloning voice-agent gpu-inference vllm continuous-batching real-time-tts qwen3-tts

Updated Mar 15, 2026
Python

pjdurden / nanoserve

An AI inference engine from scratch. Like nanoGPT, but for serving.

machine-learning cuda inference pytorch triton llama from-scratch kv-cache llm vllm llm-inference paged-attention continuous-batching

Updated Jul 2, 2026
Python

developertogo / velo-core

A production-grade, native Rust speculative inference engine for Apple Silicon with Metal GPU acceleration and paged attention.

metal gpu-acceleration systems-programming apple-silicon openai-api tensor-parallelism llm-inference speculative-decoding paged-attention continuous-batching prefix-caching disaggregated-serving

Updated Jun 13, 2026
Rust

Mihawii / mini-vllm

An educational LLM inference engine built from scratch: custom decoding loop, KV cache, continuous batching, paged memory with preemption, prefix caching, speculative decoding, OpenAI-compatible API, real benchmarks.

inference pytorch fastapi kv-cache llm continuous-batching

Updated Jun 12, 2026
Python

AICL-Lab / hetero-paged-infer

High-Performance LLM Inference Engine with PagedAttention & Continuous Batching in Rust

rust machine-learning high-performance inference transformer gpu-computing production-ready systems-programming inference-engine serving kv-cache llm vllm llm-inference paged-attention continuous-batching

Updated Jun 29, 2026
Rust

laywens / vllm-mlx

Fork of an OpenAI- and Anthropic-compatible server for Apple Silicon. Native MLX backend, 500+ tok/s. Run LLMs and vision-language models with continuous batching, MCP tool calling, and multimodal support.

inference-server mlx multimodal apple-silicon llm vllm local-ai continuous-batching tool-calling openai-compatible

Updated May 20, 2026
Python

iFurySt / nanoLLMServe

🌱 A tiny, readable LLM serving engine with vLLM/SGLang-style features.

Updated May 28, 2026
Python

sarmakska / forge-infer

A minimal LLM inference server with a real paged KV-cache, continuous batching and speculative decoding, by sarmalinux

rust inference transformer kv-cache llm vllm speculative-decoding continuous-batching sarmalinux

Updated Jun 7, 2026
Rust

achi9629 / llm-inference-engine

A from-scratch LLM inference engine built in PyTorch with custom GPT-2 transformers, KV cache, paged KV cache, continuous batching, and A100 benchmarks

nlp deep-learning transformers autoregressive inference-engine model-serving fastapi gpt2 kv-cache llm llm-serving llm-inference paged-attention mistral-7b continuous-batching paged-kv-cache

Updated May 8, 2026
Python

maxime-dlabai / mlx-continuous-batching

OpenAI-compatible server with continuous batching for MLX on Apple Silicon

macos inference text-generation mlx apple-silicon openai-api llm continuous-batching

Updated Dec 4, 2025
Python

sushildalavi / nanoserve

OpenAI-compatible LLM serving engine built from scratch on Apple Silicon. Continuous batching, paged KV-cache, prefix caching, INT8/INT4 quantization, Prometheus/Grafana observability. Benchmarked against llama.cpp and HuggingFace baselines.

grafana prometheus pytorch mps quantization observability fastapi kv-cache apple-silicon llm continuous-batching openai-compatible

Updated Jun 17, 2026
Python

nagababumo / Efficiently-Serving-LLMs

batching lora quantization lorax low-rank-adaptation continuous-batching multi-lora

Updated Jun 19, 2024
Jupyter Notebook

oladri-renuka / inference-server

Three LLM serving backends for GPT-2-124M built from scratch: naive serial, static batching, and continuous batching with paged KV-cache. Continuous+paged achieves 2.91 req/s with 0 failures vs 2.51 req/s and 6 failures for serial under mixed-length traffic at concurrency=20.

model-serving fastapi kv-cache llm-inference paged-attention continuous-batching

Updated Jun 27, 2026
Python

Keerthisriallavarapu / inferencelab

LLM inference optimization experiments: speculative decoding, continuous batching, KV-cache, quantization

python benchmarking transformers inference pytorch quantization llm vllm speculative-decoding continuous-batching

Updated May 25, 2026
Python

Dev228-afk / LLM-Sim-Bench

High-performance discrete-event simulator (C++20/Python) for modeling agentic LLM traffic, KV cache dynamics, Prefill-Decode Disaggregation (PDD), and scheduling policies. Features roofline model analysis, K-Means request clustering, and a real-time web dashboard.

machine-learning data-visualization pybind11 scheduling-algorithms queuing-theory discrete-event-simulation cpp20 performance-modeling roofline-model kv-cache llm-inference ai-infrastructure continuous-batching datacenter-traffic

Updated Jun 29, 2026
C++
