Migrate obsoleted editdistance to rapidfuzz#3105
Conversation
There was a problem hiding this comment.
Code Review
This pull request replaces the editdistance dependency with rapidfuzz to calculate Character Error Rate (CER) and Word Error Rate (WER) in funasr/metrics/common.py, updating setup.py accordingly. The review feedback suggests addressing potential ZeroDivisionError bugs in calculate_cer and calculate_wer when reference sequences are empty. Additionally, it recommends reordering imports in common.py to adhere to PEP 8 standards and specifying a minimum version constraint of rapidfuzz>=3.0.0 in setup.py to prevent import errors.
Important
The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.
| char_eds, char_ref_lens = [], [] | ||
| for i, seq_hat_text in enumerate(seqs_hat): | ||
| seq_true_text = seqs_true[i] | ||
| hyp_chars = seq_hat_text.replace(" ", "") | ||
| ref_chars = seq_true_text.replace(" ", "") | ||
| char_eds.append(editdistance.eval(hyp_chars, ref_chars)) | ||
| char_eds.append(Levenshtein.distance(hyp_chars, ref_chars)) | ||
| char_ref_lens.append(len(ref_chars)) | ||
| return float(sum(char_eds)) / sum(char_ref_lens) |
There was a problem hiding this comment.
If seqs_true contains only empty strings or strings with only spaces, char_ref_lens will sum to 0, leading to a ZeroDivisionError. Adding a guard check to ensure the reference length is greater than 0 before division prevents potential runtime crashes.
| char_eds, char_ref_lens = [], [] | |
| for i, seq_hat_text in enumerate(seqs_hat): | |
| seq_true_text = seqs_true[i] | |
| hyp_chars = seq_hat_text.replace(" ", "") | |
| ref_chars = seq_true_text.replace(" ", "") | |
| char_eds.append(editdistance.eval(hyp_chars, ref_chars)) | |
| char_eds.append(Levenshtein.distance(hyp_chars, ref_chars)) | |
| char_ref_lens.append(len(ref_chars)) | |
| return float(sum(char_eds)) / sum(char_ref_lens) | |
| char_eds, char_ref_lens = [], [] | |
| for i, seq_hat_text in enumerate(seqs_hat): | |
| seq_true_text = seqs_true[i] | |
| hyp_chars = seq_hat_text.replace(" ", "") | |
| ref_chars = seq_true_text.replace(" ", "") | |
| char_eds.append(Levenshtein.distance(hyp_chars, ref_chars)) | |
| char_ref_lens.append(len(ref_chars)) | |
| ref_len = sum(char_ref_lens) | |
| return float(sum(char_eds)) / ref_len if ref_len > 0 else 0.0 |
| word_eds, word_ref_lens = [], [] | ||
| for i, seq_hat_text in enumerate(seqs_hat): | ||
| seq_true_text = seqs_true[i] | ||
| hyp_words = seq_hat_text.split() | ||
| ref_words = seq_true_text.split() | ||
| word_eds.append(editdistance.eval(hyp_words, ref_words)) | ||
| word_eds.append(Levenshtein.distance(hyp_words, ref_words)) | ||
| word_ref_lens.append(len(ref_words)) | ||
| return float(sum(word_eds)) / sum(word_ref_lens) |
There was a problem hiding this comment.
If seqs_true contains only empty strings or strings with only spaces, word_ref_lens will sum to 0, leading to a ZeroDivisionError. Adding a guard check to ensure the reference length is greater than 0 before division prevents potential runtime crashes.
| word_eds, word_ref_lens = [], [] | |
| for i, seq_hat_text in enumerate(seqs_hat): | |
| seq_true_text = seqs_true[i] | |
| hyp_words = seq_hat_text.split() | |
| ref_words = seq_true_text.split() | |
| word_eds.append(editdistance.eval(hyp_words, ref_words)) | |
| word_eds.append(Levenshtein.distance(hyp_words, ref_words)) | |
| word_ref_lens.append(len(ref_words)) | |
| return float(sum(word_eds)) / sum(word_ref_lens) | |
| word_eds, word_ref_lens = [], [] | |
| for i, seq_hat_text in enumerate(seqs_hat): | |
| seq_true_text = seqs_true[i] | |
| hyp_words = seq_hat_text.split() | |
| ref_words = seq_true_text.split() | |
| word_eds.append(Levenshtein.distance(hyp_words, ref_words)) | |
| word_ref_lens.append(len(ref_words)) | |
| ref_len = sum(word_ref_lens) | |
| return float(sum(word_eds)) / ref_len if ref_len > 0 else 0.0 |
| from rapidfuzz.distance import Levenshtein | ||
| from itertools import groupby |
There was a problem hiding this comment.
According to PEP 8, standard library imports (like itertools) should be grouped together and placed before third-party imports (like rapidfuzz). Please reorder these imports to adhere to PEP 8 guidelines.
| from rapidfuzz.distance import Levenshtein | |
| from itertools import groupby | |
| from itertools import groupby | |
| from rapidfuzz.distance import Levenshtein |
References
- PEP 8 states that imports should be grouped in the following order: 1. Standard library imports, 2. Related third party imports, 3. Local application/library specific imports. (link)
| # Speaker & evaluation | ||
| "umap_learn", | ||
| "editdistance>=0.5.2", | ||
| "rapidfuzz", |
There was a problem hiding this comment.
The rapidfuzz.distance module (specifically Levenshtein) was introduced in rapidfuzz version 2.0.0. To prevent potential ImportError issues on environments with older versions of rapidfuzz installed, please specify a minimum version constraint such as rapidfuzz>=3.0.0.
| "rapidfuzz", | |
| "rapidfuzz>=3.0.0", |
| # train: The modules invoked when training only. | ||
| "train": [ | ||
| "editdistance", | ||
| "rapidfuzz", |
Summary
Since
editdistancehas been archived and cannot be newly installed on Windows, replace it withrapidfuzz.Type of change
Validation
python -m compileall funasr examples tests