Diarization benchmark reports 80.8% DER on AMI ES2004a (target <30%)

## Summary

The automated diarization benchmark on PR #744 reports a **DER of 80.8%** and **JER of 80.9%** on AMI ES2004a. That is roughly 2.7× the <30% DER target and far outside the 18–30% research baseline. A DER near 80% means most of the reference speech time is being scored as error, so the diarizer is effectively failing on this file, not just degraded.

Benchmark comment: https://github.com/FluidInference/FluidAudio/pull/744#issuecomment-4827420876

## Reported results (AMI Corpus ES2004a, 1049.0s audio)

| Metric | Value | Target | Status |
|--------|-------|--------|--------|
| DER    | **80.8%** | <30% | ⚠️ |
| JER    | **80.9%** | <25% | ⚠️ |
| RTFx   | 29.91x    | >1.0x | ✅ |

Pipeline timing looked nominal (Segmentation 10.5s, Embedding 17.5s, Clustering 7.0s), so this is an accuracy failure rather than a crash or timeout.
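As a quick consistency check on the timing claim, the reported RTFx can be compared against the sum of the three stage timings. This is a minimal sketch that assumes RTFx is defined as audio duration divided by total processing time (the usual convention, but not confirmed from the benchmark source); the variable names are illustrative only.

```python
# Figures copied from the benchmark comment.
audio_seconds = 1049.0
reported_rtfx = 29.91
stage_seconds = {"segmentation": 10.5, "embedding": 17.5, "clustering": 7.0}

# Assumption: RTFx = audio duration / total processing time.
implied_total = audio_seconds / reported_rtfx
stage_sum = sum(stage_seconds.values())

print(f"implied total processing time: {implied_total:.2f}s")  # 35.07s
print(f"sum of reported stages:        {stage_sum:.1f}s")      # 35.0s
print(f"RTFx from stage sum:           {audio_seconds / stage_sum:.2f}x")  # 29.97x
```

Under that assumption the three stages account for essentially all of the processing time, which is consistent with the pipeline having run end to end.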

## Notes

- PR #744 itself only touches the Nemotron ASR front-end (native-Swift mel), so the poor diarization score is very likely pre-existing and unrelated to that PR's changes. It is being surfaced by the CI benchmark that runs on every PR.
- Two things still need confirming: whether this reproduces on `main`, and whether it is a real accuracy problem in the diarization pipeline or a benchmark/CI harness issue (e.g. wrong reference RTTM, model download/version mismatch, or eval collar/mapping bug).
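One cheap way to rule out a wrong or truncated reference RTTM is to summarize it and check that the speaker count and total speech time look plausible for a 1049.0s meeting. The sketch below assumes the standard RTTM `SPEAKER` line layout (onset in field 4, duration in field 5, speaker name in field 8); `summarize_rttm` is a hypothetical helper, and the sample lines are made-up illustration data, not taken from the AMI annotation.

```python
from collections import defaultdict

def summarize_rttm(text):
    """Return {speaker: total_speech_seconds} from RTTM SPEAKER lines."""
    totals = defaultdict(float)
    for line in text.splitlines():
        fields = line.split()
        # Skip blank lines and non-SPEAKER record types.
        if len(fields) < 8 or fields[0] != "SPEAKER":
            continue
        duration, speaker = float(fields[4]), fields[7]
        totals[speaker] += duration
    return dict(totals)

# Illustrative data only (hypothetical speaker names and times).
sample = """\
SPEAKER ES2004a 1 0.00 5.00 <NA> <NA> spk_A <NA> <NA>
SPEAKER ES2004a 1 5.00 2.50 <NA> <NA> spk_B <NA> <NA>
SPEAKER ES2004a 1 7.50 1.50 <NA> <NA> spk_A <NA> <NA>
"""

print(summarize_rttm(sample))
```

Running the same summary on both the reference and the CI run's hypothesis output would show at a glance whether the diarizer collapsed everything into one speaker or produced far too many.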

## Next steps

- [ ] Reproduce locally: `swift run fluidaudiocli diarization-benchmark` on AMI ES2004a
- [ ] Confirm whether `main` shows the same ~80% DER or if this is PR-branch specific
- [ ] Check the diarization models actually downloaded/compiled (vs. a fallback) in the CI run
- [ ] Verify the reference annotation and DER scoring (collar, speaker mapping) are correct
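To help with the scoring check, an independent frame-level DER can be computed on the same reference/hypothesis pair and compared with the reported 80.8%. The following is a minimal sketch, not the FluidAudio scorer: it applies no collar, counts overlapped speech, finds the speaker mapping by brute force (fine for a handful of speakers), and uses hypothetical names (`rasterize`, `frame_der`) and a 10 ms step chosen for illustration.

```python
from itertools import permutations

def rasterize(segments, duration, step=0.01):
    """Turn (start, end, speaker) segments into a per-frame set of active speakers."""
    n = int(round(duration / step))
    frames = [set() for _ in range(n)]
    for start, end, speaker in segments:
        first = max(0, int(round(start / step)))
        last = min(n, int(round(end / step)))
        for i in range(first, last):
            frames[i].add(speaker)
    return frames

def frame_der(reference, hypothesis, duration, step=0.01):
    """DER = (missed + false alarm + confusion) / total reference speech, no collar."""
    ref = rasterize(reference, duration, step)
    hyp = rasterize(hypothesis, duration, step)
    total_ref = sum(len(r) for r in ref)
    if total_ref == 0:
        raise ValueError("reference contains no speech")

    # Co-occurrence counts between reference and hypothesis speakers.
    overlap = {}
    for r, h in zip(ref, hyp):
        for a in r:
            for b in h:
                overlap[(a, b)] = overlap.get((a, b), 0) + 1

    ref_speakers = sorted({s for r in ref for s in r})
    hyp_speakers = sorted({s for h in hyp for s in h})
    # Pad with None so every reference speaker can be left unmapped.
    padded = hyp_speakers + [None] * max(0, len(ref_speakers) - len(hyp_speakers))

    # Best one-to-one mapping = the one maximizing correctly attributed frames.
    best_correct = max(
        sum(overlap.get((r, h), 0) for r, h in zip(ref_speakers, assignment))
        for assignment in permutations(padded, len(ref_speakers))
    )

    # Per frame: missed + false alarm + confusion = max(n_ref, n_hyp) - n_correct.
    slots = sum(max(len(r), len(h)) for r, h in zip(ref, hyp))
    return (slots - best_correct) / total_ref

# Toy example: two reference speakers merged into one hypothesis speaker.
reference = [(0.0, 5.0, "A"), (5.0, 10.0, "B")]
merged = [(0.0, 10.0, "s1")]
print(frame_der(reference, merged, duration=10.0))  # 0.5
```

If this kind of independent score lands far from the reported figure on the same inputs, the harness (collar, mapping, or reference file) becomes the prime suspect; if it agrees, the problem is more likely in the diarization output itself.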

