Fixed unit test failures for test_terminal_output_response_charset_detection and test_terminal_output_request_charset_detection #1855

Open
jautung wants to merge 1 commit into httpie:master from jautung:update-big5-detection-test

@jautung jautung commented May 24, 2026

These two unit tests were failing because the encoding of the test string 卷首卷首卷首卷首卷卷首卷首卷首卷首卷首卷首卷首卷首卷首卷首卷首卷首卷首 was not detected as big5. The string is ambiguous when Big5-encoded: charset_normalizer could not reliably distinguish it from Johab.

[Before and after screenshots omitted]

To investigate, I added temporary (uncommitted) debugging logs to detect_encoding in encoding.py:

def detect_encoding(content: ContentBytes) -> str:
    ...
    if len(content) > TOO_SMALL_SEQUENCE:
        # Run detection once and reuse the result for the logs and the match.
        results = from_bytes(bytes(content))
        match = results.best()
        print()
        print('content', content)
        print()
        print('bytes(content)', bytes(content))
        print()
        print('from_bytes(bytes(content))._results', results._results)
        print()
        print('match', match)
        print()
        # best() returns None when nothing matches, so guard the attribute access.
        print('match.encoding', match.encoding if match else None)
        print()
        if match:
            encoding = match.encoding
    return encoding

With the original test string, 卷首卷首卷首卷首卷卷首卷首卷首卷首卷首卷首卷首卷首卷首卷首卷首卷首卷首, the output was:

content bytearray(b'\xa8\xf7\xad\xba\xa8\xf7\xad\xba\xa8\xf7\xad\xba\xa8\xf7\xad\xba\xa8\xf7\xa8\xf7\xad\xba\xa8\xf7\xad\xba\xa8\xf7\xad\xba\xa8\xf7\xad\xba\xa8\xf7\xad\xba\xa8\xf7\xad\xba\xa8\xf7\xad\xba\xa8\xf7\xad\xba\xa8\xf7\xad\xba\xa8\xf7\xad\xba\xa8\xf7\xad\xba\xa8\xf7\xad\xba\xa8\xf7\xad\xba')

bytes(content) b'\xa8\xf7\xad\xba\xa8\xf7\xad\xba\xa8\xf7\xad\xba\xa8\xf7\xad\xba\xa8\xf7\xa8\xf7\xad\xba\xa8\xf7\xad\xba\xa8\xf7\xad\xba\xa8\xf7\xad\xba\xa8\xf7\xad\xba\xa8\xf7\xad\xba\xa8\xf7\xad\xba\xa8\xf7\xad\xba\xa8\xf7\xad\xba\xa8\xf7\xad\xba\xa8\xf7\xad\xba\xa8\xf7\xad\xba\xa8\xf7\xad\xba'

from_bytes(bytes(content))._results [<CharsetMatch 'johab' fp(-1782770569132810705)>, <CharsetMatch 'big5' fp(9095422849593591809)>, <CharsetMatch 'shift_jis_2004' fp(3898380262017389457)>]

match 뻥솤뻥솤뻥솤뻥솤뻥뻥솤뻥솤뻥솤뻥솤뻥솤뻥솤뻥솤뻥솤뻥솤뻥솤뻥솤뻥솤뻥솤

match.encoding johab

The best match was johab, with big5 second in the list. This makes sense: Big5, Johab, and Shift-JIS share overlapping byte ranges, so some byte sequences are valid in more than one of them and are genuinely ambiguous.
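The overlap can be reproduced with the standard library alone. The sketch below is not part of this PR: decode_candidates is a hypothetical helper that strictly decodes the same Big5 bytes with each codec the detector reported and records which attempts succeed.

```python
# Illustrative only: decode_candidates is a hypothetical helper, not HTTPie code.
from typing import Dict, Optional, Sequence

# The three codecs that appeared in the detector's result list.
CANDIDATES = ("big5", "johab", "shift_jis_2004")


def decode_candidates(
    data: bytes, encodings: Sequence[str] = CANDIDATES
) -> Dict[str, Optional[str]]:
    """Map each encoding to the decoded text, or None if strict decoding fails."""
    decoded: Dict[str, Optional[str]] = {}
    for name in encodings:
        try:
            decoded[name] = data.decode(name)
        except UnicodeDecodeError:
            decoded[name] = None
    return decoded


original = "卷首卷首卷首卷首卷卷首卷首卷首卷首卷首卷首卷首卷首卷首卷首卷首卷首卷首"
payload = original.encode("big5")

for name, text in decode_candidates(payload).items():
    if text is None:
        status = "does not decode"
    elif text == original:
        status = "decodes to the original text"
    else:
        status = "decodes to different text"
    print(f"{name}: {status}")
```

Any codec that decodes the payload to different text is a plausible candidate from the detector's point of view, which is why the choice between them comes down to heuristics rather than validity.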

Fix: updated the test string to 你好世界。你好世界。你好世界。你好世界。你好世界。你好世界。你好世界。, which is detected unambiguously as big5:

content bytearray(b' \xa7A\xa6n\xa5@\xac\xc9\xa1C\xa7A\xa6n\xa5@\xac\xc9\xa1C\xa7A\xa6n\xa5@\xac\xc9\xa1C\xa7A\xa6n\xa5@\xac\xc9\xa1C\xa7A\xa6n\xa5@\xac\xc9\xa1C\xa7A\xa6n\xa5@\xac\xc9\xa1C\xa7A\xa6n\xa5@\xac\xc9\xa1C')

bytes(content) b' \xa7A\xa6n\xa5@\xac\xc9\xa1C\xa7A\xa6n\xa5@\xac\xc9\xa1C\xa7A\xa6n\xa5@\xac\xc9\xa1C\xa7A\xa6n\xa5@\xac\xc9\xa1C\xa7A\xa6n\xa5@\xac\xc9\xa1C\xa7A\xa6n\xa5@\xac\xc9\xa1C\xa7A\xa6n\xa5@\xac\xc9\xa1C'

from_bytes(bytes(content))._results [<CharsetMatch 'big5' fp(320475358053722554)>]

match  你好世界。你好世界。你好世界。你好世界。你好世界。你好世界。你好世界。

match.encoding big5
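A rough way to screen a replacement test string before committing it is to list which of the competing codecs can strictly decode its Big5 bytes. This is only a sketch: strictly_decodable is a hypothetical helper, and the codec list is an assumption taken from the earlier debug output. charset_normalizer applies its own heuristics on top of plain decodability, so the detector's result remains the real test.

```python
# Illustrative only: strictly_decodable is a hypothetical helper, not HTTPie code.
from typing import List, Sequence


def strictly_decodable(
    data: bytes,
    encodings: Sequence[str] = ("big5", "johab", "shift_jis_2004"),
) -> List[str]:
    """Return the encodings that decode `data` without raising an error."""
    accepted = []
    for name in encodings:
        try:
            data.decode(name)
        except UnicodeDecodeError:
            continue
        accepted.append(name)
    return accepted


new_text = "你好世界。" * 7
print(strictly_decodable(new_text.encode("big5")))
```

A shorter list suggests fewer competitors for the detector to rule out, though a string that several codecs accept can still be detected correctly.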

All tests are passing now.

@codecov-commenter

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 94.11%. Comparing base (4d7d6b6) to head (fac60b8).
⚠️ Report is 383 commits behind head on master.

Additional details and impacted files
@@            Coverage Diff             @@
##           master    #1855      +/-   ##
==========================================
- Coverage   97.28%   94.11%   -3.18%     
==========================================
  Files          67      113      +46     
  Lines        4235     7694    +3459     
==========================================
+ Hits         4120     7241    +3121     
- Misses        115      453     +338     
