Skip to content

Migrate obsoleted editdistance to rapidfuzz#3105

Open
excosy wants to merge 2 commits into
modelscope:mainfrom
excosy:main
Open

Migrate obsoleted editdistance to rapidfuzz#3105
excosy wants to merge 2 commits into
modelscope:mainfrom
excosy:main

Conversation

@excosy

@excosy excosy commented Jul 2, 2026

Copy link
Copy Markdown

Summary

Since editdistance has been archived and cannot be newly installed on Windows, replace it with rapidfuzz.

Type of change

  • Bug fix
  • Documentation
  • Example or demo
  • Runtime or deployment
  • Benchmark or evaluation
  • Model/training change

Validation

  • python -m compileall funasr examples tests
  • Docs or links checked
  • Runtime/deployment command tested

@gemini-code-assist gemini-code-assist Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request replaces the editdistance dependency with rapidfuzz to calculate Character Error Rate (CER) and Word Error Rate (WER) in funasr/metrics/common.py, updating setup.py accordingly. The review feedback suggests addressing potential ZeroDivisionError bugs in calculate_cer and calculate_wer when reference sequences are empty. Additionally, it recommends reordering imports in common.py to adhere to PEP 8 standards and specifying a minimum version constraint of rapidfuzz>=3.0.0 in setup.py to prevent import errors.

Important

The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.

Comment thread funasr/metrics/common.py Outdated
Comment on lines 218 to 225
char_eds, char_ref_lens = [], []
for i, seq_hat_text in enumerate(seqs_hat):
seq_true_text = seqs_true[i]
hyp_chars = seq_hat_text.replace(" ", "")
ref_chars = seq_true_text.replace(" ", "")
char_eds.append(editdistance.eval(hyp_chars, ref_chars))
char_eds.append(Levenshtein.distance(hyp_chars, ref_chars))
char_ref_lens.append(len(ref_chars))
return float(sum(char_eds)) / sum(char_ref_lens)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

If seqs_true contains only empty strings or strings with only spaces, char_ref_lens will sum to 0, leading to a ZeroDivisionError. Adding a guard check to ensure the reference length is greater than 0 before division prevents potential runtime crashes.

Suggested change
char_eds, char_ref_lens = [], []
for i, seq_hat_text in enumerate(seqs_hat):
seq_true_text = seqs_true[i]
hyp_chars = seq_hat_text.replace(" ", "")
ref_chars = seq_true_text.replace(" ", "")
char_eds.append(editdistance.eval(hyp_chars, ref_chars))
char_eds.append(Levenshtein.distance(hyp_chars, ref_chars))
char_ref_lens.append(len(ref_chars))
return float(sum(char_eds)) / sum(char_ref_lens)
char_eds, char_ref_lens = [], []
for i, seq_hat_text in enumerate(seqs_hat):
seq_true_text = seqs_true[i]
hyp_chars = seq_hat_text.replace(" ", "")
ref_chars = seq_true_text.replace(" ", "")
char_eds.append(Levenshtein.distance(hyp_chars, ref_chars))
char_ref_lens.append(len(ref_chars))
ref_len = sum(char_ref_lens)
return float(sum(char_eds)) / ref_len if ref_len > 0 else 0.0

Comment thread funasr/metrics/common.py Outdated
Comment on lines 236 to 243
word_eds, word_ref_lens = [], []
for i, seq_hat_text in enumerate(seqs_hat):
seq_true_text = seqs_true[i]
hyp_words = seq_hat_text.split()
ref_words = seq_true_text.split()
word_eds.append(editdistance.eval(hyp_words, ref_words))
word_eds.append(Levenshtein.distance(hyp_words, ref_words))
word_ref_lens.append(len(ref_words))
return float(sum(word_eds)) / sum(word_ref_lens)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

If seqs_true contains only empty strings or strings with only spaces, word_ref_lens will sum to 0, leading to a ZeroDivisionError. Adding a guard check to ensure the reference length is greater than 0 before division prevents potential runtime crashes.

Suggested change
word_eds, word_ref_lens = [], []
for i, seq_hat_text in enumerate(seqs_hat):
seq_true_text = seqs_true[i]
hyp_words = seq_hat_text.split()
ref_words = seq_true_text.split()
word_eds.append(editdistance.eval(hyp_words, ref_words))
word_eds.append(Levenshtein.distance(hyp_words, ref_words))
word_ref_lens.append(len(ref_words))
return float(sum(word_eds)) / sum(word_ref_lens)
word_eds, word_ref_lens = [], []
for i, seq_hat_text in enumerate(seqs_hat):
seq_true_text = seqs_true[i]
hyp_words = seq_hat_text.split()
ref_words = seq_true_text.split()
word_eds.append(Levenshtein.distance(hyp_words, ref_words))
word_ref_lens.append(len(ref_words))
ref_len = sum(word_ref_lens)
return float(sum(word_eds)) / ref_len if ref_len > 0 else 0.0

Comment thread funasr/metrics/common.py Outdated
Comment on lines 13 to 14
from rapidfuzz.distance import Levenshtein
from itertools import groupby

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

According to PEP 8, standard library imports (like itertools) should be grouped together and placed before third-party imports (like rapidfuzz). Please reorder these imports to adhere to PEP 8 guidelines.

Suggested change
from rapidfuzz.distance import Levenshtein
from itertools import groupby
from itertools import groupby
from rapidfuzz.distance import Levenshtein
References
  1. PEP 8 states that imports should be grouped in the following order: 1. Standard library imports, 2. Related third party imports, 3. Local application/library specific imports. (link)

Comment thread setup.py Outdated
# Speaker & evaluation
"umap_learn",
"editdistance>=0.5.2",
"rapidfuzz",

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

The rapidfuzz.distance module (specifically Levenshtein) was introduced in rapidfuzz version 2.0.0. To prevent potential ImportError issues on environments with older versions of rapidfuzz installed, please specify a minimum version constraint such as rapidfuzz>=3.0.0.

Suggested change
"rapidfuzz",
"rapidfuzz>=3.0.0",

Comment thread setup.py Outdated
# train: The modules invoked when training only.
"train": [
"editdistance",
"rapidfuzz",

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

To remain consistent with the main requirements, please specify the same minimum version constraint for rapidfuzz in the train extra requirements.

Suggested change
"rapidfuzz",
"rapidfuzz>=3.0.0",

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant