feat(llama-cpp): video input support (mtmd #24269) #10216

Merged
mudler merged 5 commits into master from feat/llama-cpp-video-input on Jun 8, 2026

localai-bot (Collaborator) commented Jun 8, 2026

What

Adds video input support to the llama.cpp backend, so video-capable multimodal models (e.g. SmolVLM2-Video) can be sent a video in a chat request - mirroring how images and audio already work end to end. This is video understanding input, distinct from the existing text-to-video generation endpoint.

Tracks the upstream landing of video in mtmd: ggml-org/llama.cpp#24269 (merged 8f83d6c).

Why

The Go core was already fully wired for video input (proto Videos = 45, video_url request parsing, opts.Videos forwarding, <__media__> marker counting), but the C++ backend never read the field and the pinned llama.cpp predated mtmd video. This closes that gap and surfaces it in the chat UI.

Changes

  • Bump llama.cpp 9e3b928 → 8f83d6c (9 commits) to pick up mtmd video support. MTMD_VIDEO defaults ON upstream; it only needs ffmpeg/ffprobe on PATH, which the runtime image already ships (Dockerfile).
  • grpc-server.cpp: forward request->videos() into the mtmd files vector on both request paths (template + non-template), in both the PredictStream and Predict mirror blocks:
    • non-template: a video_data build + base64-decode into files;
    • template: emit {"type":"input_video","input_video":{"data": ...}} chat chunks and include videos in the multimodal guard.
    • allow_video is auto-set at model load by the vendored upstream chat_params (mtmd_helper_support_video(mctx)), so no manual gating is added - video is accepted only when the loaded mmproj supports it.
  • React chat UI: accept video/*, keep video files as base64, show a film-icon badge, render attached video inline with a <video controls> player, and emit video_url content parts.
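The non-template path above (base64-decode each video into the files vector) can be sketched roughly as follows. This is a standalone illustration, not the code in grpc-server.cpp: `demo_bytes`, `demo_base64_decode` and `demo_collect_video_files` are hypothetical names, and the real backend uses its own base64 helper and buffer type.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Hypothetical raw-buffer type; the real backend has its own.
using demo_bytes = std::vector<unsigned char>;

// Decodes standard base64 (RFC 4648 alphabet). Stops at '=' padding and
// returns false on any character outside the alphabet.
inline bool demo_base64_decode(const std::string & in, demo_bytes & out) {
    static const std::string alphabet =
        "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";
    out.clear();
    unsigned int acc = 0; // bit accumulator; stale high bits are masked off below
    int bits = 0;         // number of not-yet-emitted bits held in acc
    for (char c : in) {
        if (c == '=') break;
        const std::size_t idx = alphabet.find(c);
        if (idx == std::string::npos) return false;
        acc = (acc << 6) | static_cast<unsigned int>(idx);
        bits += 6;
        if (bits >= 8) {
            bits -= 8;
            out.push_back(static_cast<unsigned char>((acc >> bits) & 0xFFu));
        }
        assert(bits >= 0 && bits < 8); // invariant: less than one byte pending
    }
    return true;
}

// Assumed shape of the non-template path: each base64 payload from the
// request becomes one raw buffer appended to the files vector.
inline bool demo_collect_video_files(const std::vector<std::string> & videos_b64,
                                     std::vector<demo_bytes> & files) {
    for (const auto & v : videos_b64) {
        demo_bytes decoded;
        if (!demo_base64_decode(v, decoded)) return false;
        files.push_back(std::move(decoded));
    }
    return true;
}
```

Calling `demo_collect_video_files({"TWFu"}, files)` appends one buffer holding the bytes of "Man". Rejecting the whole request on the first malformed payload is a choice made for this sketch only.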

Data flow

UI video_url → request.go StringVideos → inference.go videos[] → llm.go opts.Videos → gRPC Videos → grpc-server.cpp input_video → mtmd (ffmpeg frame extraction) → model.

Notes / scope

  • No Docker change (ffmpeg already present). No proto change (field pre-existed). No Go core change (already wired).
  • Verification: React UI builds clean (vite build); C++ wiring is a structurally-verified mirror of the working audio path. The native backend build is left to CI.
  • Out of scope (follow-ons): adding a video-capable GGUF to the gallery + a full e2e run.

Assisted-by: Claude:claude-opus-4-8 [Claude Code]

🤖 Generated with Claude Code


Status: draft — e2e-verified, with one upstream caveat

Built the backend locally and ran a real video chat against gemma-4-e2b-it-qat-q4_0 (mmproj, tokenizer-template path).

Found + fixed an upstream crash. llama.cpp's new video code (#24269) double-fcloses the ffmpeg/ffprobe stdin: mtmd_helper_video::feed_stdin() closes the FILE returned by subprocess_stdin() (which is sp->stdin_file), then subprocess_destroy() closes the same pointer again → heap corruption that aborts the backend on any base64 input_video (the CLI --video <file> path is unaffected, which is likely why it shipped). This hits upstream's own llama-server too. Worked around here with a vendored one-line patch (backend/cpp/llama-cpp/patches/0001-*.patch, applied by prepare.sh); an upstream PR will follow.
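For readers unfamiliar with the failure mode, here is a minimal sketch of the guard pattern described above: once the feeding code has closed the stream, it clears the pointer so that cleanup cannot close it a second time. The struct and function names (`demo_subprocess`, `demo_feed_stdin`, `demo_destroy`) are hypothetical stand-ins, not the upstream subprocess API or the vendored patch itself.

```cpp
#include <cassert>
#include <cstdio>

// Hypothetical stand-in for a subprocess handle that owns the child's stdin.
struct demo_subprocess {
    std::FILE * stdin_file = nullptr;
};

// Writes the payload to the child's stdin, then closes the stream to signal EOF.
// The pointer is cleared afterwards so later cleanup will not close it again.
inline bool demo_feed_stdin(demo_subprocess & sp, const char * data, std::size_t len) {
    assert(data != nullptr || len == 0);
    if (sp.stdin_file == nullptr) {
        return false; // already closed, nothing to feed
    }
    const std::size_t written = std::fwrite(data, 1, len, sp.stdin_file);
    std::fclose(sp.stdin_file);
    sp.stdin_file = nullptr; // the guard: without this, destroy would fclose twice
    return written == len;
}

// Cleanup closes the stream only if it is still open.
inline void demo_destroy(demo_subprocess & sp) {
    if (sp.stdin_file != nullptr) {
        std::fclose(sp.stdin_file);
        sp.stdin_file = nullptr;
    }
}
```

With the guard in place, calling `demo_destroy` after `demo_feed_stdin` is a no-op for the stream. Without it, the second `fclose` on the same `FILE *` is undefined behavior, which is consistent with the heap corruption reported here.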

After the patch: video works end to end — ffmpeg extracts frames, the model sees them, and answers correctly (solid-red clip → "Red", solid-blue → "Blue").

Known limitation (separate upstream issue, not blocking): within a single server process, repeated video requests that share an identical text prompt can reuse the first video's frames (prompt/KV-cache collision — image bitmaps carry an fnv-hash id so the cache distinguishes them, but the lazy video's expanded frames don't). Workaround: distinct prompts or disabling the prompt cache. ggml-org/llama.cpp#24303
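To illustrate why a content-derived id avoids this kind of collision, the sketch below hashes each media payload and folds the result into a cache key alongside the prompt text. It is an assumption-laden toy: the FNV-1a 64-bit variant, the `demo_fnv1a64` and `demo_cache_key` names, and the key layout are all chosen for illustration and are not taken from llama.cpp.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <vector>

// 64-bit FNV-1a over a byte buffer (variant chosen for this sketch).
inline std::uint64_t demo_fnv1a64(const std::vector<unsigned char> & bytes) {
    std::uint64_t h = 14695981039346656037ULL; // FNV offset basis
    for (unsigned char b : bytes) {
        h ^= b;
        h *= 1099511628211ULL; // FNV prime; unsigned overflow wraps by design
    }
    return h;
}

// Sanity check: hashing no bytes returns the offset basis unchanged.
inline void demo_fnv_selfcheck() {
    assert(demo_fnv1a64({}) == 14695981039346656037ULL);
}

// Hypothetical cache key: the prompt text plus one content id per media item.
// Two requests with the same prompt but different media get different keys.
inline std::string demo_cache_key(const std::string & prompt,
                                  const std::vector<std::vector<unsigned char>> & media) {
    std::string key = prompt;
    for (const auto & m : media) {
        key += '|';
        key += std::to_string(demo_fnv1a64(m));
    }
    return key;
}
```

If the media component is missing from the key, as described above for the expanded video frames, two requests with identical text map to the same entry regardless of which video was attached.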

Marking draft until the upstream crash fix is merged (or we're comfortable shipping with the vendored patch).


Update: dropped the vendored patch, re-pinned to upstream #24316

Upstream replaced the ad-hoc video stdin handling with a proper RAII refactor — ggml-org/llama.cpp#24316 ("mtmd: refactor video subproc handling") — which contains the same sp->stdin_file = nullptr guard the vendored patch added, plus join-before-destroy ordering. So LLAMA_VERSION is now re-pinned to that change and patches/0001-* is removed.

Re-verified e2e (no patch): no crash, red clip → "Red", blue → "Blue".

Caveat: #24316 isn't merged yet, so this currently pins to its branch-head commit (28ca1e60). Re-pin to the squash-merge commit on master once it lands, or git fetch may lose the commit after the branch is deleted. The secondary prompt-cache frame-reuse note above is unaffected by #24316 (still open as a separate item).

mudler added 4 commits June 8, 2026 15:02
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
… paths)

Wire request->videos() into grpc-server.cpp mirroring the existing image
and audio handling: a video_data build + non-template files extraction, and
input_video chat chunks on the tokenizer-template path. allow_video is
auto-set at model load by the vendored upstream chat_params.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Mirror the image/audio attachment path for video: emit video_url content
parts, accept video/* in the picker, keep video files as base64, show a
film icon badge, and render attached video inline with a <video> player.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Upstream mtmd video input (ggml-org/llama.cpp#24269) double-fcloses the
ffmpeg/ffprobe stdin FILE: feed_stdin() fclose()s the FILE returned by
subprocess_stdin() (which is sp->stdin_file), then subprocess_destroy()
fclose()s the same pointer again -> heap corruption that aborts the
backend on any base64 input_video request (the CLI --video file path is
unaffected). Vendor a one-line fix (null sp->stdin_file after fclose)
via prepare.sh's patches/ until upstream merges it.

Verified e2e with gemma-4-e2b-it-qat-q4_0: video frames decode via
ffmpeg and the model answers correctly (red clip -> 'Red', blue -> 'Blue').

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
@localai-bot localai-bot marked this pull request as draft June 8, 2026 16:30
mudler (Owner) commented Jun 8, 2026

upstream patch: ggml-org/llama.cpp#24313

Upstream replaced the ad-hoc video stdin handling with a proper RAII
refactor (ggml-org/llama.cpp#24316, "mtmd: refactor video subproc
handling"), which includes the same `sp->stdin_file = nullptr` guard our
patch added (plus join-before-destroy ordering). Re-pin LLAMA_VERSION to
that branch head and drop patches/0001 - it's now redundant.

Verified e2e with gemma-4-e2b-it-qat-q4_0: no crash, video frames decode
and the model answers correctly (red clip -> "Red", blue -> "Blue").

NOTE: #24316 is not yet merged, so this pins to its branch-head commit
(28ca1e60). Re-pin to the squash-merge commit on master once it lands,
otherwise `git fetch` may lose the commit after the branch is deleted.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
@mudler mudler merged commit 9323f4b into master Jun 8, 2026
67 of 68 checks passed
@mudler mudler deleted the feat/llama-cpp-video-input branch June 8, 2026 21:18