Hi TN date class accuracy improvement #418
shrpawar-alt wants to merge 3 commits into NVIDIA:staging/hi_tn_v3 from
Signed-off-by: Shreyas Pawar <shrpawar@nvidia.com>
79f8c53 to 017a615
for more information, see https://pre-commit.ci
30 तीस
31 इकतीस (no newline at end of file)
31 इकतीस
१ एक
instead of adding these as additional lines, let's create a rule that allows accepting the number with or without a preceding 0. if possible, let's also leverage cardinals instead
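One way to read this suggestion, sketched in plain Python rather than pynini: normalize the script and drop the optional leading zero before a single cardinal lookup, so the data needs only one row per number. The `CARDINALS` table and `verbalize_day` helper are hypothetical stand-ins, not names from this PR; the real grammar would presumably reuse its cardinal graph.

```python
# Hypothetical stand-in for the cardinal data; only rows visible in this PR are listed.
CARDINALS = {1: "एक", 13: "तेरह", 14: "चौदह", 30: "तीस", 31: "इकतीस"}

# Map Devanagari digits onto ASCII digits so one table serves both scripts.
DEVANAGARI_TO_ASCII = str.maketrans("०१२३४५६७८९", "0123456789")


def verbalize_day(token: str) -> str:
    """Verbalize a 1- or 2-digit day written with or without a leading zero."""
    digits = token.translate(DEVANAGARI_TO_ASCII)
    if not (digits.isascii() and digits.isdigit() and 1 <= len(digits) <= 2):
        raise ValueError(f"not a day token: {token!r}")
    value = int(digits)  # int("01") == 1, so the leading zero is dropped here
    if value not in CARDINALS:
        raise ValueError(f"no cardinal entry for {value}")
    return CARDINALS[value]
```

With this shape, "1", "01", "१" and "०१" all resolve through the same entry, which is the effect the extra TSV rows were added to achieve.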
अक्तूबर अक्तूबर
नवंबर नवंबर
दिसंबर दिसंबर
१ जनवरी
instead of adding these as additional lines, let's create a rule that allows accepting the number with or without a preceding 0.
11 नवंबर
12 दिसंबर (no newline at end of file)
12 दिसंबर
जनवरी जनवरी
if you just want to accept the month name without a transformation, you don't need these identity rows: you can build an acceptor with pynini.project over the normalized forms of this tsv
३० तीस
३१ इकतीस
13 तेरह
14 चौदह
are the lines here different from the ones in days? if not, let's restrict a rule to these numbers instead of creating a separate data file
prefix_union = pynini.union(*prefixes_list)

verbalized_hundreds = teens_ties_hi.project("output")
verbalized_unit = pynini.union(teens_ties_hi.project("output"), digit.project("output"))
you have teens_ties_hi.project("output") defined above as verbalized_hundreds, so let's use that name instead of defining again
"०१-०४-२०२४" -> date { day: "एक" month: "अप्रैल" year: "दो हज़ार चौबीस" }
"०४-०१-२०२४" -> date { month: "अप्रैल" day: "एक" year: "दो हज़ार चौबीस" }

"०१-०४-२०२४" -> date { day: "एक" month: "अप्रैल" year: "दो हज़ार चौबीस" }
let's have all arrows either aligned or separated by just one space
teens_and_ties = pynutil.add_weight(teens_ties, -0.1)

# Read suffixes from file into a list
digit_as_day = pynini.string_file(get_abs_path("data/numbers/digit.tsv"))
how are these different from the ones defined in days?
unambiguous_days_graph = pynutil.insert("day: \"") + unambiguous_day_num + pynutil.insert("\"") + insert_space

graph_mm_dd = months_graph + delete_dash + days_graph
# ── Month graph ──────────────────────────────────────────────────────
we don't need these comments since the variables are named accordingly
graph_dd_mm = days_graph + delete_numeric_sep + months_graph

graph_mm_dd_yyyy = months_graph + delete_separator + days_graph + delete_separator + years_graph
# MM-DD: only fires when day is unambiguously > 12
is this format used commonly?
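For context on what the "MM-DD only when day > 12" comment implies, here is a minimal plain-Python sketch of the ordering rule, assuming DD-MM is always preferred when both readings are valid (which matches the negative weight on graph_dd_mm in the diff). The function name `split_day_month` is hypothetical, and the sketch deliberately does not check days against month lengths.

```python
from typing import Tuple


def split_day_month(first: int, second: int) -> Tuple[int, int]:
    """Return (day, month) for a two-field numeric date such as 06-05."""
    # Preferred reading: DD-MM, whenever the second field can be a month.
    if 1 <= first <= 31 and 1 <= second <= 12:
        return first, second
    # Fallback reading: MM-DD, only when the second field cannot be a month
    # (day strictly greater than 12), so the two readings never compete.
    if 1 <= first <= 12 and 13 <= second <= 31:
        return second, first
    raise ValueError(f"not a valid day/month pair: {first}-{second}")
```

Under this rule an input like 04-31 can only be read as the 31st of the fourth month, while 06-05 stays day 6, month 5.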
final_graph = (
    pynutil.add_weight(graph_dd_mm, -0.001)
    | graph_mm_dd
    # Full date with era — most specific first
we don't need these comments either
@@ -1,4 +1,4 @@
06-05~छः मई
06-05~छह मई
३१-०६~इकतीस जून
let's add test cases for all the different scenarios that we just added
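As a rough illustration of how new scenarios could be added, the test-case file in this diff uses one `written~spoken` pair per line. The sketch below parses that format in plain Python; `parse_test_cases` and `SAMPLE` are hypothetical names for illustration, not the repository's own test helper.

```python
from typing import List, Tuple


def parse_test_cases(text: str) -> List[Tuple[str, str]]:
    """Split 'written~spoken' lines into (written, spoken) pairs."""
    cases = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        # Split on the first tilde only, in case the spoken form contains one.
        written, spoken = line.split("~", 1)
        cases.append((written, spoken))
    return cases


# Two rows taken from the diff above.
SAMPLE = "06-05~छह मई\n३१-०६~इकतीस जून\n"
```

Each new scenario (leading zeros, Devanagari digits, MM-DD with day > 12, dates with an era) would then be one more line in the file, which keeps the coverage visible in the diff.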
What does this PR do?
Improves Date class accuracy from ~87% to ~99% by adding graph coverage for the cases that were failing earlier.
Before your PR is "Ready for review"
Pre checks:
- git commit -s to sign.
- pytest or (if your machine does not have GPU) pytest --cpu from the root folder (given you marked your test cases accordingly @pytest.mark.run_only_on('CPU')).
- bash tools/text_processing_deployment/export_grammars.sh --MODE=test ...
- pytest and Sparrowhawk here.
- __init__.py for every folder and subfolder, including data folder which has .TSV files?
- Copyright (c) 2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved. to all newly added Python files?
- Copyright 2015 and onwards Google, Inc.. See an example here.
- (try import: ... except: ...) if not already done.
PR Type:
If you haven't finished some of the above items, you can still open a "Draft" PR.