Skip to content

gh-53144: Improve charset support in the email package#149942

Merged
serhiy-storchaka merged 7 commits into
python:mainfrom
serhiy-storchaka:email-charset-aliases
Jun 5, 2026
Merged

gh-53144: Improve charset support in the email package#149942
serhiy-storchaka merged 7 commits into
python:mainfrom
serhiy-storchaka:email-charset-aliases

Conversation

@serhiy-storchaka

@serhiy-storchaka serhiy-storchaka commented May 17, 2026

Copy link
Copy Markdown
Member

Defer to the codecs module for all aliases.
Use MIME/IANA names for all IANA registered charsets.

Defer to the codecs module for all aliases.
Use MIME/IANA names for all IANA registered charsets.
@serhiy-storchaka serhiy-storchaka requested a review from a team as a code owner May 17, 2026 09:42
@malemburg malemburg changed the title gh-53144: Imporove charset support in the email package gh-53144: Improve charset support in the email package May 17, 2026
Comment thread Lib/email/charset.py Outdated

@bitdancer bitdancer left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall this looks good.

I'm getting two test failures against current main, one of which (test_chinese_codecs) looks like charset name changes and is probably correct, but the other(test_korean_codecs) I don't understand.

Comment thread Lib/email/charset.py Outdated
@@ -61,40 +62,73 @@
'utf-8': (SHORTEST, BASE64, 'utf-8'),

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
'utf-8': (SHORTEST, BASE64, 'utf-8'),
'utf-8': (SHORTEST, BASE64, None),

I don't know why this is set to a non-None value when all the others are None, except for the one that wants to use a different charset for header encoding. So I think we should "fix" it to avoid confusion later (assuming I'm not missing something here).

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This item can be removed. (SHORTEST, BASE64, None) is the default value.

Comment thread Lib/email/charset.py Outdated
self.header_encoding = henc
self.body_encoding = benc
self.output_charset = ALIASES.get(conv, conv)
self.output_charset = conv

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The comment says "but let the user override it", which is what the call to ALIASES does. While it is unlikely anyone is using that facility, it is part of the API to update the aliases table, so we should maybe leave it. The place it might get used is exactly the place it might be wanted: overriding iso-2022-jp 7bit codec. So, since it is cheap, and a no-op in all other cases, maybe we shouldn't risk breaking anyone's legacy code?

msg.set_param('charset',
email.charset.ALIASES.get(charset, charset),
replace=True)
msg.set_param('charset', charset, replace=True)

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Your reordering of the operations here looks correct, which presumably means there is a missing test that would show the bug. Do you want to add one? If not I'll make a note and try to remember to do it some day ;)

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added test_set_text_charset_cp949. Note that charset="euc-kr", even if ALIASES maps 'cp949' to 'ks_c_5601-1987'.

But when I tried to add similar test with shif_jis or euc-jp, it failed (trying to encode surrogates). It fails also with the current code.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Opened #150771.

Comment thread Lib/test/test_email/test_email.py
@bedevere-app

bedevere-app Bot commented Jun 1, 2026

Copy link
Copy Markdown

When you're done making the requested changes, leave the comment: I have made the requested changes; please review again.

@serhiy-storchaka serhiy-storchaka left a comment

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm getting two test failures against current main, one of which (test_chinese_codecs) looks like charset name changes and is probably correct, but the other(test_korean_codecs) I don't understand.

This is interesting. There are three chunks with encodings euc-kr, ks_c_5601-1987 and cp949. In main, cp949 is translated to ks_c_5601-1987, so the 2nd and the 3rd chunks are merged. With this PR, the second chunk, ks_c_5601-1987, translated to euc-kr, so the 1st and the 2nd chunks are merged.

This is because in Python ks_c_5601-1987 is an alias of euc-kr, but in IANA they are separate codecs.

Comment thread Lib/email/charset.py Outdated
Comment thread Lib/email/charset.py Outdated
Comment thread Lib/email/charset.py Outdated
@@ -61,40 +62,73 @@
'utf-8': (SHORTEST, BASE64, 'utf-8'),

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This item can be removed. (SHORTEST, BASE64, None) is the default value.

msg.set_param('charset',
email.charset.ALIASES.get(charset, charset),
replace=True)
msg.set_param('charset', charset, replace=True)

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added test_set_text_charset_cp949. Note that charset="euc-kr", even if ALIASES maps 'cp949' to 'ks_c_5601-1987'.

But when I tried to add similar test with shif_jis or euc-jp, it failed (trying to encode surrogates). It fails also with the current code.

@serhiy-storchaka

Copy link
Copy Markdown
Member Author

I have made the requested changes; please review again.

@bedevere-app

bedevere-app Bot commented Jun 1, 2026

Copy link
Copy Markdown

Thanks for making the requested changes!

@bitdancer: please review the changes made to this pull request.

@bedevere-app bedevere-app Bot requested a review from bitdancer June 1, 2026 22:24
@serhiy-storchaka

Copy link
Copy Markdown
Member Author

This is because in Python ks_c_5601-1987 is an alias of euc-kr, but in IANA they are separate codecs.

See #62825.

@bitdancer bitdancer left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@bitdancer

Copy link
Copy Markdown
Member

This is because in Python ks_c_5601-1987 is an alias of euc-kr, but in IANA they are separate codecs.

See #62825.

It seems to me the change in behavior introduced by this PR at least partially addresses 62825 in that it makes the output encoding for the case in question the encoding that handles more characters. Do you agree or am I misunderstanding?

@serhiy-storchaka

Copy link
Copy Markdown
Member Author

I do not think it makes test_korean_codecs better or worse. At least we cannot see this until other issue with Korean charsets are fixed. But test_set_text_charset_cp949 fails on main, so there is at least some progress. And test_chinese_codecs is now more correct.

@malemburg

Copy link
Copy Markdown
Member

I don't have comments on this change, except that it should likely make use of the change suggested in #62825

@serhiy-storchaka serhiy-storchaka added the needs backport to 3.15 pre-release feature fixes, bugs and security fixes label Jun 5, 2026
@serhiy-storchaka serhiy-storchaka merged commit c195a04 into python:main Jun 5, 2026
57 checks passed
@miss-islington-app

Copy link
Copy Markdown

Thanks @serhiy-storchaka for the PR 🌮🎉.. I'm working now to backport this PR to: 3.15.
🐍🍒⛏🤖

@serhiy-storchaka serhiy-storchaka deleted the email-charset-aliases branch June 5, 2026 12:54
@bedevere-app

bedevere-app Bot commented Jun 5, 2026

Copy link
Copy Markdown

GH-150967 is a backport of this pull request to the 3.15 branch.

@bedevere-app bedevere-app Bot removed the needs backport to 3.15 pre-release feature fixes, bugs and security fixes label Jun 5, 2026
serhiy-storchaka added a commit that referenced this pull request Jun 5, 2026
) (GH-150967)

Defer to the codecs module for all aliases.
Use MIME/IANA names for all IANA registered charsets.
Fix email.contentmanager.set_text_content().
(cherry picked from commit c195a04)

Co-authored-by: Serhiy Storchaka <storchaka@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants