hi_itn_electronic #437

Open
mayuris-00 wants to merge 7 commits into NVIDIA:staging/hi_itn_v3 from mayuris-00:hi-itn-electronic-clean

Conversation

mayuris-00 commented Jun 8, 2026

What does this PR do?

Add a one line overview of what this PR aims to accomplish.

Before your PR is "Ready for review"

Pre checks:

  • Have you signed your commits? Use git commit -s to sign.
  • Do all unittests finish successfully before sending PR?
    1. pytest or (if your machine does not have GPU) pytest --cpu from the root folder (given you marked your test cases accordingly @pytest.mark.run_only_on('CPU')).
    2. Sparrowhawk tests bash tools/text_processing_deployment/export_grammars.sh --MODE=test ...
  • If you are adding a new feature: Have you added test cases for both pytest and Sparrowhawk here.
  • Have you added __init__.py for every folder and subfolder, including data folder which has .TSV files?
  • Have you followed the CodeQL results and removed unused variables and imports (the report is at the bottom of the PR in the GitHub review box)?
  • Have you added the correct license header Copyright (c) 2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved. to all newly added Python files?
  • If you copied nemo_text_processing/text_normalization/en/graph_utils.py your header's second line should be Copyright 2015 and onwards Google, Inc.. See an example here.
  • Remove import guards (try import: ... except: ...) if not already done.
  • If you added a new language or a new feature, please update the NeMo documentation (it lives in a different repo).
  • Have you added your language support to tools/text_processing_deployment/pynini_export.py?

PR Type:

  • New Feature
  • Bugfix
  • Documentation
  • Test

If you haven't finished some of the above items, you can still open a "Draft" PR.

Signed-off-by: Mayuri S <mayuris@nvidia.com>
Signed-off-by: Mayuri S <mayuris@nvidia.com>
Signed-off-by: Mayuri S <mayuris@nvidia.com>
Signed-off-by: Mayuri S <mayuris@nvidia.com>
Signed-off-by: Mayuri S <mayuris@nvidia.com>
Signed-off-by: mayuris-00 <mayuris@nvidia.com>
Signed-off-by: Mayuri S <mayuris@nvidia.com>
mayuris-00 force-pushed the hi-itn-electronic-clean branch from 3e004de to 01b2fc4 on June 10, 2026, 04:35
@@ -0,0 +1,21 @@
ग्लूकोज C6H12O6

Collaborator

please refer to the changes we made to chemical formulas for TN. we do not want hardcoded formulas: only the base elements should be hardcoded, with rules that build formulas from them. if this is not possible, it's better to state it as a limitation than to support only certain formulas.
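One way to read this request, sketched in plain Python rather than pynini: keep only a table of base elements and a table of number words, and let a rule that alternates them build any formula. The Hindi spellings in both tables are illustrative assumptions, not the PR's data files, and the function name `spoken_to_formula` is hypothetical.

```python
from typing import Optional

# Hypothetical base tables. Real data would live in TSV files.
LETTERS = {"सी": "C", "एच": "H", "ओ": "O", "एन": "N"}  # spoken letter -> Latin letter
NUMBERS = {"दो": "2", "छह": "6", "बारह": "12"}          # spoken number -> digits
ELEMENTS = {"C", "H", "O", "N"}                         # base elements only, no full formulas


def spoken_to_formula(spoken: str) -> Optional[str]:
    """Build a formula from element and count tokens, or return None if the rule fails."""
    out = []
    prev_was_element = False
    for tok in spoken.split():
        if tok in LETTERS and LETTERS[tok] in ELEMENTS:
            out.append(LETTERS[tok])
            prev_was_element = True
        elif tok in NUMBERS and prev_was_element:
            # A count is only valid directly after an element symbol.
            out.append(NUMBERS[tok])
            prev_was_element = False
        else:
            return None
    return "".join(out) if out else None
```

This sketch handles single-letter element symbols only. Two-letter symbols such as Na would need an extra rule or would have to be documented as a limitation, as the comment suggests.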

venv वेन्व
SAMPLE एस ए एम पी एल ई
hotmail हॉटमेल
ExpressScribeTranscriptionSoftware ई एक्स पी आर ई एस एस एस सी आर आई बी ई टी आर ए एन एस सी आर आई पी टी आई ओ एन एस ओ एफ टी डब्ल्यू ए आर ई

Collaborator

how is this a common word?

hotmail हॉटमेल
ExpressScribeTranscriptionSoftware ई एक्स पी आर ई एस एस एस सी आर आई बी ई टी आर ए एन एस सी आर आई पी टी आई ओ एन एस ओ एफ टी डब्ल्यू ए आर ई
Phones पी एच ओ एन ई एस
TXR20820d90fb1d3327447009e701166f29 टी एक्स आर दो शून्य आठ दो शून्य डी नौ शून्य एफ बी एक डी तीन तीन दो सात चार चार सात शून्य शून्य नौ ई सात शून्य एक एक छह छह एफ दो नौ

Collaborator

how is this a common word?

@@ -0,0 +1,10 @@
1 १

Collaborator

are these digits any different from the ones available for cardinals?

@@ -0,0 +1,10 @@
एक 1

Collaborator

are these digits any different from the ones available for cardinals?
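If the two new digit files carry the same pairs as the cardinal data, a single table could serve both directions. A minimal sketch in plain Python, assuming one word-to-digit table (its contents here are illustrative); in pynini the analogous step would be `pynini.invert` applied to the loaded FST.

```python
# Hypothetical shared table: Hindi number word -> ASCII digit.
WORD_TO_DIGIT = {"शून्य": "0", "एक": "1", "दो": "2", "तीन": "3"}

# The reverse direction is derived, not stored in a second file.
DIGIT_TO_WORD = {digit: word for word, digit in WORD_TO_DIGIT.items()}


def to_devanagari_digit(ascii_digit: str) -> str:
    """Map '0'..'9' to the Devanagari digits, which occupy U+0966..U+096F in order."""
    return chr(0x0966 + int(ascii_digit))
```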

@@ -0,0 +1,54 @@
a ए

Collaborator

instead of hardcoding upper and lower, let's use capitalize in script
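One reading of this comment is that the data file lists lowercase letters only and the script derives the uppercase entries. A plain-Python sketch under that assumption; the table contents and the names `LOWER`, `UPPER`, and `LETTER_NAMES` are hypothetical.

```python
# Hypothetical lowercase-only table: Latin letter -> spoken Hindi name.
LOWER = {"a": "ए", "b": "बी", "c": "सी"}

# Uppercase entries are generated in the script instead of being listed in the data file.
UPPER = {letter.upper(): name for letter, name in LOWER.items()}

# Combined table covering both cases.
LETTER_NAMES = {**LOWER, **UPPER}
```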

nic एन आई सी
sims सिम्स
pope पोप
Zoom ज़ेड ओ ओ एम

Collaborator

is it necessary to have overlap between domain and server name?

sharda शारदा
universities यूनिवर्सिटीज़
mcdonald मैक्डॉनल्ड
southmountaincc साउथ माउन्टेन सी सी

Collaborator

let's trim this list to only be the most common cases

@@ -0,0 +1,24 @@
ज़ेड एक्स आठ शून्य एक नौ आठ शून्य ZX80 1980

Collaborator

let's have a serial class instead of hardcoding certain alphanumeric cases (you can look at the TN implementation for this)
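The idea of a rule-based serial class can be sketched in plain Python: map each spoken token through small letter and digit tables, and accept the result only when it mixes letters and digits. This is not the TN implementation the comment refers to; the tables and the name `spoken_to_serial` are assumptions made for illustration.

```python
import unicodedata
from typing import Optional


def _nfc(text: str) -> str:
    # Normalize so that nukta forms compare equal however they were typed.
    return unicodedata.normalize("NFC", text)


# Hypothetical base tables. Real data would come from TSV files.
LETTERS = {_nfc(k): v for k, v in {"ज़ेड": "Z", "एक्स": "X", "टी": "T"}.items()}
DIGITS = {_nfc(k): v for k, v in {"शून्य": "0", "एक": "1", "आठ": "8", "नौ": "9"}.items()}


def spoken_to_serial(spoken: str) -> Optional[str]:
    """Convert letter and digit words to a serial string, or None if not a serial."""
    chars = []
    for tok in _nfc(spoken).split():
        if tok in LETTERS:
            chars.append(LETTERS[tok])
        elif tok in DIGITS:
            chars.append(DIGITS[tok])
        else:
            return None
    text = "".join(chars)
    # Only treat the result as a serial if it mixes letters and digits.
    has_alpha = any(c.isalpha() for c in text)
    has_digit = any(c.isdigit() for c in text)
    return text if has_alpha and has_digit else None
```

The sketch joins every token into one string, so it does not decide where a serial ends and a following number begins; that boundary rule would still need to be defined.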

-> tokens { electronic { path: "/home/user/documents" } }
IP address:
e.g. एक नौ दो डॉट एक छह आठ डॉट एक डॉट एक
-> tokens { electronic { ip: "192.168.1.1" } }

Collaborator

please only use the tags defined in the semiotic classes proto

'ip' is not one of them
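A small guard can make this constraint explicit when building token strings. In the sketch below, the allowed set is an assumption to be checked against the proto: `path` appears in the docstring above, while `username`, `domain`, and `protocol` are assumed here. Which field should carry an IP address is for the authors to settle; `domain` is used only as an example.

```python
# Assumed field names for `electronic` tokens; verify against the semiotic classes proto.
ALLOWED_FIELDS = {"username", "domain", "protocol", "path"}


def electronic_token(field: str, value: str) -> str:
    """Format a tagged electronic token, rejecting fields outside the allowed set."""
    if field not in ALLOWED_FIELDS:
        raise ValueError(f"'{field}' is not an allowed electronic field")
    return f'tokens {{ electronic {{ {field}: "{value}" }} }}'
```

For example, `electronic_token("domain", "192.168.1.1")` produces a token in the same shape as the docstring entries, while `electronic_token("ip", ...)` raises `ValueError`.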


special_codes_map = pynini.string_file(get_abs_path("data/electronic/special_codes.tsv")).optimize()

to_lower = pynini.cdrewrite(

Collaborator

please use

instead

)
latin_run_lower = make_lower(latin_run)

_drive_chars = pynini.union("C", "D", "E", "F", "G", "H", "I", "J")

Collaborator

this should be a tsv instead of hardcoded here

drive_letter = pynini.compose(letter_map_upper, _drive_chars)

def _backslash():
    return pynutil.delete("बैकवर्ड") + delete_space + pynutil.delete("स्लैश") + pynutil.insert("\\\\")

Collaborator

in general let's add all transformations in tsv files
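A data-driven version of such transformations might look like the following plain-Python sketch, where the spoken-to-symbol pairs sit in a TSV and the code only loads them. The TSV contents and the helper name `load_tsv` are hypothetical; in the grammar itself, `pynini.string_file` (already used above for `special_codes.tsv`) would play the loader's role.

```python
import csv
import io

# Stand-in for a TSV data file; a real grammar would read it from disk.
SYMBOLS_TSV = "बैकवर्ड स्लैश\t\\\nडॉट\t.\n"


def load_tsv(handle):
    """Read two-column, tab-separated rows into a spoken -> written mapping."""
    # QUOTE_NONE keeps quote and backslash characters literal.
    reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    return {row[0]: row[1] for row in reader if len(row) == 2}


SYMBOLS = load_tsv(io.StringIO(SYMBOLS_TSV))
```

The same pattern would cover the drive-letter set flagged earlier: the letters go into a data file and the script stays free of literals.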

2 participants