The automated diarization benchmark on PR #744 reports a DER of 80.8% and JER of 80.9% on AMI ES2004a — roughly 2.7× worse than the <30% DER target and far outside the 18–30% research baseline. A ~80% DER indicates the diarizer is essentially failing (near chance-level speaker assignment), not just degraded.
Need to confirm whether this reproduces on main, i.e. whether it is a real regression in the diarization pipeline or a benchmark/CI harness issue (e.g. wrong reference RTTM, model download/version mismatch, or an eval collar/mapping bug).
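One quick way to rule out a wrong reference RTTM is to summarize the file and compare it with what is known about the recording. The sketch below assumes the standard RTTM layout (SPEAKER lines with onset in field 4, duration in field 5, speaker name in field 8). The helper names and the sample lines are made up for illustration; they are not taken from the FluidAudio code base or from the real ES2004a annotation.

```python
from collections import defaultdict

def parse_rttm(text):
    """Parse RTTM text into (start, end, speaker) tuples, skipping non-SPEAKER lines."""
    segments = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) < 8 or fields[0] != "SPEAKER":
            continue
        onset, duration = float(fields[3]), float(fields[4])
        segments.append((onset, onset + duration, fields[7]))
    return segments

def summarize(segments):
    """Speaker list, total speech time and last segment end, for eyeballing."""
    per_speaker = defaultdict(float)
    for start, end, speaker in segments:
        per_speaker[speaker] += end - start
    return {
        "speakers": sorted(per_speaker),
        "speech_seconds": sum(per_speaker.values()),
        "last_end": max((end for _, end, _ in segments), default=0.0),
    }

# Toy input only: these two lines are invented, not real AMI annotations.
sample = """\
SPEAKER ES2004a 1 0.0 2.5 <NA> <NA> A <NA> <NA>
SPEAKER ES2004a 1 2.5 5.0 <NA> <NA> B <NA> <NA>
"""
summary = summarize(parse_rttm(sample))
print(summary)
```

For the real reference, the last segment end should sit close to the 1049.0s audio length, and the speaker count should match the meeting (AMI scenario meetings normally have four participants). A large mismatch on either would point at the harness rather than the diarizer.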
Next steps
Reproduce locally: swift run fluidaudiocli diarization-benchmark on AMI ES2004a
Confirm whether main shows the same ~80% DER or if this is PR-branch specific
Check the diarization models actually downloaded/compiled (vs. a fallback) in the CI run
Verify the reference annotation and DER scoring (collar, speaker mapping) are correct
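For the scoring check, an independent reimplementation of DER can help cross-check the number the benchmark prints. The following is a minimal sketch under stated assumptions: frame-based scoring on a 10 ms grid, an optional collar around reference boundaries, and a brute-force one-to-one speaker mapping (real scorers typically use the Hungarian algorithm, which scales to more speakers). It is not the scorer used by fluidaudiocli, and all function names are hypothetical.

```python
from itertools import permutations

def frame_labels(segments, n_frames, step):
    """Set of active speakers for each frame; segments are (start, end, speaker)."""
    frames = [set() for _ in range(n_frames)]
    for start, end, speaker in segments:
        first = max(0, round(start / step))
        last = min(n_frames, round(end / step))
        for i in range(first, last):
            frames[i].add(speaker)
    return frames

def collar_mask(reference, n_frames, step, collar):
    """False for frames within +/- collar seconds of any reference boundary."""
    keep = [True] * n_frames
    if collar <= 0:
        return keep
    for start, end, _ in reference:
        for boundary in (start, end):
            first = max(0, round((boundary - collar) / step))
            last = min(n_frames, round((boundary + collar) / step))
            for i in range(first, last):
                keep[i] = False
    return keep

def der(reference, hypothesis, duration, step=0.01, collar=0.0):
    """DER = (missed speech + false alarm + speaker confusion) / reference speech."""
    n_frames = round(duration / step)
    ref = frame_labels(reference, n_frames, step)
    hyp = frame_labels(hypothesis, n_frames, step)
    keep = collar_mask(reference, n_frames, step, collar)

    overlap = {}  # frames shared by each (hypothesis speaker, reference speaker) pair
    total = miss = false_alarm = matched = 0
    for r, h, scored in zip(ref, hyp, keep):
        if not scored:
            continue
        total += len(r)
        miss += max(0, len(r) - len(h))
        false_alarm += max(0, len(h) - len(r))
        matched += min(len(r), len(h))
        for hs in h:
            for rs in r:
                overlap[(hs, rs)] = overlap.get((hs, rs), 0) + 1
    if total == 0:
        raise ValueError("no scored reference speech")

    ref_speakers = sorted({s for f in ref for s in f})
    hyp_speakers = sorted({s for f in hyp for s in f})
    # Pad with None so surplus hypothesis speakers can stay unmapped.
    targets = ref_speakers + [None] * max(0, len(hyp_speakers) - len(ref_speakers))
    # Brute-force search for the one-to-one mapping with the most agreeing frames.
    best_correct = max(
        sum(overlap.get((h, r), 0) for h, r in zip(hyp_speakers, perm))
        for perm in permutations(targets, len(hyp_speakers))
    )
    confusion = matched - best_correct
    return (miss + false_alarm + confusion) / total

# Toy segments, invented for illustration.
reference = [(0.0, 5.0, "A"), (5.0, 10.0, "B")]
late_switch = [(0.0, 6.0, "x"), (6.0, 10.0, "y")]
one_speaker = [(0.0, 10.0, "x")]
print(der(reference, late_switch, 10.0))  # 0.1
print(der(reference, one_speaker, 10.0))  # 0.5
```

The toy case shows why the mapping step matters: the hypothesis labels (x, y) never literally equal the reference labels (A, B), so a scorer that compared labels directly, or applied a stale mapping, would count almost all speech as confusion even when the segmentation is good. If this cross-check and the benchmark disagree by a wide margin on the same RTTM pair, the harness is the more likely culprit.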
Summary
Benchmark comment: #744 (comment)
Reported results (AMI Corpus ES2004a, 1049.0s audio): DER 80.8%, JER 80.9%
Pipeline timing looked nominal (Segmentation 10.5s, Embedding 17.5s, Clustering 7.0s), so this is an accuracy failure rather than a crash or timeout.
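As a rough check on "nominal", the stage timings can be turned into a speed-up over real time. The figures are copied from the benchmark comment; treating the three listed stages as the whole pipeline is an assumption, since any other stages would not be counted here.

```python
# Figures from the benchmark comment. Assumption: these three stages
# account for all processing time.
audio_seconds = 1049.0
stage_seconds = {"segmentation": 10.5, "embedding": 17.5, "clustering": 7.0}

total_seconds = sum(stage_seconds.values())
speedup = audio_seconds / total_seconds  # seconds of audio per second of compute

print(f"total {total_seconds:.1f}s, {speedup:.1f}x real time")
# prints: total 35.0s, 30.0x real time
```

About 35 seconds of processing for a 17-minute recording is consistent with the pipeline running end to end, which supports reading this as an accuracy problem.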