Hi TN date class accuracy improvement #418

Open
shrpawar-alt wants to merge 3 commits into NVIDIA:staging/hi_tn_v3 from shrpawar-alt:hi-tn-date-v2
@shrpawar-alt
Contributor

What does this PR do?

Improves Date class accuracy from ~87% to ~99% by adding graph coverage for the cases that previously failed.

Before your PR is "Ready for review"

Pre checks:

  • Have you signed your commits? Use git commit -s to sign.
  • Do all unittests finish successfully before sending PR?
    1. pytest or (if your machine does not have GPU) pytest --cpu from the root folder (given you marked your test cases accordingly @pytest.mark.run_only_on('CPU')).
    2. Sparrowhawk tests bash tools/text_processing_deployment/export_grammars.sh --MODE=test ...
  • If you are adding a new feature: Have you added test cases for both pytest and Sparrowhawk here.
  • Have you added __init__.py for every folder and subfolder, including data folder which has .TSV files?
  • Have you followed the CodeQL results and removed unused variables and imports (the report is at the bottom of the PR in the GitHub review box)?
  • Have you added the correct license header Copyright (c) 2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved. to all newly added Python files?
  • If you copied nemo_text_processing/text_normalization/en/graph_utils.py your header's second line should be Copyright 2015 and onwards Google, Inc.. See an example here.
  • Remove import guards (try import: ... except: ...) if not already done.
  • If you added a new language or a new feature please update the NeMo documentation (lives in different repo).
  • Have you added your language support to tools/text_processing_deployment/pynini_export.py?

PR Type:

  • New Feature
  • Bugfix
  • Documentation
  • Test

If you haven't finished some of the above items you can still open "Draft" PR.

@shrpawar-alt marked this pull request as ready for review on April 21, 2026 11:16
Signed-off-by: Shreyas Pawar <shrpawar@nvidia.com>
Signed-off-by: Shreyas Pawar <shrpawar@nvidia.com>
@mgrafu changed the base branch from main to staging/hi_tn_v3 on April 28, 2026 19:44
30 तीस
31 इकतीस No newline at end of file
31 इकतीस
१ एक
Collaborator

Instead of adding these as additional lines, let's create a rule that accepts the number with or without a leading 0. If possible, let's also reuse the cardinal graph here.
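
One way to read this suggestion, sketched in plain Python rather than as the pynini grammar itself: strip one optional leading zero (ASCII or Devanagari) before the lookup, so the data file needs a single row per day. `DAY_WORDS` is a small illustrative subset taken from the rows above, and `normalize_day` is a hypothetical helper, not a function in the repository.

```python
from typing import Optional

# Illustrative subset only; the real grammar reads its rows from TSV files.
DAY_WORDS = {"1": "एक", "30": "तीस", "31": "इकतीस"}

# Map Devanagari digits to ASCII so one table serves both scripts.
DEVANAGARI_TO_ASCII = str.maketrans("०१२३४५६७८९", "0123456789")


def normalize_day(token: str) -> Optional[str]:
    """Verbalize a day written with or without a single leading zero."""
    digits = token.translate(DEVANAGARI_TO_ASCII)
    if not digits.isdecimal() or len(digits) > 2:
        return None
    # Drop one optional leading zero: "01" and "1" hit the same entry.
    if len(digits) == 2 and digits[0] == "0":
        digits = digits[1:]
    return DAY_WORDS.get(digits)
```

With this shape, "01", "1", "०१" and "१" all resolve through the same table entry, which is the effect the rule is meant to have in the FST.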

अक्तूबर अक्तूबर
नवंबर नवंबर
दिसंबर दिसंबर
१ जनवरी
Collaborator

Same here: instead of adding these as additional lines, let's create a rule that accepts the number with or without a leading 0.

11 नवंबर
12 दिसंबर No newline at end of file
12 दिसंबर
जनवरी जनवरी
Collaborator

If you just want to accept the month name without a transformation, you can build an acceptor with pynini.project from the normalized forms in this TSV.
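
A plain-Python analogue of that idea, assuming the normalized forms are the right-hand column of the TSV: the set of output values plays the role of the projected acceptor, so identity rows such as "जनवरी जनवरी" are not needed in the data. `MONTH_ROWS` holds a few rows copied from the excerpt above, and `verbalize_month` is a hypothetical name used only for illustration.

```python
# Rows mirror the TSV shape: written form -> normalized month name.
MONTH_ROWS = [("11", "नवंबर"), ("12", "दिसंबर"), ("१", "जनवरी")]

# Transducer analogue: rewrite a written form to its month name.
month_map = dict(MONTH_ROWS)

# Output-projection analogue: keep only the normalized names.
month_names = frozenset(name for _, name in MONTH_ROWS)


def verbalize_month(token):
    """Pass an already-normalized month name through, else map it."""
    if token in month_names:
        return token
    return month_map.get(token)
```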

३० तीस
३१ इकतीस
13 तेरह
14 चौदह
Collaborator

Are the lines here different from the ones in days? If not, let's restrict a rule to these numbers instead of creating a separate data file.
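
As a rough sketch of "restrict a rule to these numbers", assuming the separate file only repeats the day rows from 13 upward: the subset can be derived from the existing day table instead of being stored twice. The table contents and the `restrict_days` helper are illustrative, not repository code.

```python
# Illustrative subset of the existing day table (number -> Hindi word).
DAY_WORDS = {"1": "एक", "13": "तेरह", "14": "चौदह", "30": "तीस", "31": "इकतीस"}


def restrict_days(day_words, minimum=13):
    """Keep only the days at or above `minimum`, reusing the same rows."""
    return {num: word for num, word in day_words.items() if int(num) >= minimum}


UNAMBIGUOUS_DAYS = restrict_days(DAY_WORDS)
```

In the grammar the same restriction would be a composition of the day graph with an acceptor for the allowed numbers, so the words stay defined in one place.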

prefix_union = pynini.union(*prefixes_list)

verbalized_hundreds = teens_ties_hi.project("output")
verbalized_unit = pynini.union(teens_ties_hi.project("output"), digit.project("output"))
Collaborator

You already have teens_ties_hi.project("output") defined above as verbalized_hundreds, so let's reuse that name instead of computing the projection again.

"०१-०४-२०२४" -> date { day: "एक" month: "अप्रैल" year: "दो हज़ार चौबीस" }
"०४-०१-२०२४" -> date { month: "अप्रैल" day: "एक" year: "दो हज़ार चौबीस" }

"०१-०४-२०२४" -> date { day: "एक" month: "अप्रैल" year: "दो हज़ार चौबीस" }
Collaborator

Let's have all arrows either aligned or separated by just one space.

teens_and_ties = pynutil.add_weight(teens_ties, -0.1)

# Read suffixes from file into a list
digit_as_day = pynini.string_file(get_abs_path("data/numbers/digit.tsv"))
Collaborator

How are these different from the ones defined in days?

unambiguous_days_graph = pynutil.insert("day: \"") + unambiguous_day_num + pynutil.insert("\"") + insert_space

graph_mm_dd = months_graph + delete_dash + days_graph
# ── Month graph ──────────────────────────────────────────────────────
Collaborator

We don't need these comments, since the variable names are already descriptive.

graph_dd_mm = days_graph + delete_numeric_sep + months_graph

graph_mm_dd_yyyy = months_graph + delete_separator + days_graph + delete_separator + years_graph
# MM-DD: only fires when day is unambiguously > 12
Collaborator

Is this format commonly used?
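
For reference, the behaviour described by the "only fires when day is unambiguously > 12" comment can be sketched in plain Python as follows. This is an assumption about the intended logic, not the grammar's implementation: DD-MM is tried first, and MM-DD is used only when the second field cannot be a month. `split_day_month` is a hypothetical helper.

```python
from typing import Optional, Tuple


def split_day_month(text: str) -> Optional[Tuple[int, int]]:
    """Return (day, month) for a two-field numeric date, or None."""
    parts = text.split("-")
    if len(parts) != 2 or not all(p.isdecimal() for p in parts):
        return None
    # int() accepts Unicode decimal digits, so Devanagari input works too.
    first, second = int(parts[0]), int(parts[1])
    # Default reading: DD-MM.
    if 1 <= first <= 31 and 1 <= second <= 12:
        return first, second
    # MM-DD only when the second field is unambiguously a day (> 12).
    if 1 <= first <= 12 and 13 <= second <= 31:
        return second, first
    return None
```

Under this reading "06-05" stays day 6, month 5, while an input such as "05-25" can only be month 5, day 25. No calendar validation is attempted, which matches the "३१-०६" test row in this PR.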

final_graph = (
pynutil.add_weight(graph_dd_mm, -0.001)
| graph_mm_dd
# Full date with era — most specific first
Collaborator

We don't need these comments either.
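
For context on the weighted union in the `final_graph` excerpt: with pynini's default tropical semiring the lowest-weight path wins, so the -0.001 on `graph_dd_mm` makes DD-MM the preferred reading whenever both branches match. The sketch below mimics that selection in plain Python; the candidate lists and the `pick_reading` helper are illustrative only.

```python
def pick_reading(candidates):
    """candidates: (weight, reading) pairs; the lowest weight wins."""
    return min(candidates, key=lambda pair: pair[0])[1]


# Both branches match an input such as "06-05": DD-MM carries -0.001.
both_match = [(-0.001, "DD-MM"), (0.0, "MM-DD")]

# Only the MM-DD branch matches when the second field is above 12.
only_mm_dd = [(0.0, "MM-DD")]
```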

@@ -1,4 +1,4 @@
06-05~छः मई
06-05~छह मई
३१-०६~इकतीस जून
Collaborator

Let's add test cases for all of the scenarios this PR introduces.
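
A small sketch of how rows in the `input~expected` format shown above could be loaded and checked, so each new scenario becomes one line in the test file. The two rows are copied from this diff; `parse_cases` and `failing_cases` are hypothetical helpers, and `normalize` stands for whatever callable wraps the grammar under test.

```python
def parse_cases(lines):
    """Split 'written~spoken' rows into (written, spoken) pairs."""
    cases = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        written, spoken = line.split("~", 1)
        cases.append((written, spoken))
    return cases


def failing_cases(normalize, cases):
    """Return (written, expected, actual) for every mismatch."""
    failures = []
    for written, expected in cases:
        actual = normalize(written)
        if actual != expected:
            failures.append((written, expected, actual))
    return failures


CASES = parse_cases(["06-05~छह मई", "३१-०६~इकतीस जून"])
```

An empty result from `failing_cases` means every listed scenario normalizes as expected.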
