fix(make-pdf): correct CJK rendering (URL sentinel leak, JP-first fonts, CJK quotes)#2012
Draft
rssprivacy-commits wants to merge 1 commit into
…nts, CJK quotes

Three defects surfaced rendering Simplified-Chinese documents; refined after an independent two-model code audit (which caught a regression in the first pass of the quote fix).

1. Bare URLs leaked internal `SMARTPANTS_PRESERVED_N` sentinels into output AND left <a>/<p> unclosed (everything after became one hyperlink). URL_RE's `\S+` (NUL is non-whitespace) swallowed the adjacent tag placeholders; single-pass restore could not un-nest them.
   Fix: stop the URL match at the NUL boundary; additionally strip any stray NUL from input at smartypants() entry so text cannot forge a placeholder or create NUL-adjacency nesting.

2. The CJK font stack listed Hiragino (Japanese) before any Chinese font, so Simplified-Chinese text rendered in Japanese glyph variants (直/骨/角/没).
   Fix: PingFang SC / Noto Sans CJK SC / Source Han Sans SC / Microsoft YaHei first; JP fonts demoted to last resort. (Trade-off: true Japanese documents now prefer SC glyphs for shared Han; acceptable for an SC-primary tool. A lang-attribute-based selector would be the fuller fix.)

3. A quote directly after a CJK colon or opening bracket (:(【「『〈《) is now treated as opening. Sentence/clause-ending punctuation (,。、;!?) is deliberately excluded: a quote after those is usually a CLOSING quote (Chinese puts the period inside: 。"), and including them flipped closing quotes to opening.

Verified: pdffonts PingFang-only; pdftotext no sentinel leak, correct opening AND closing quotes (他说:"你好。" closes correctly); visual render no anchor bleed.

make-pdf/test: 91 pass / 0 fail.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
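The quote rule in item 3 can be illustrated with a minimal sketch. It is not the code in `smartypants.ts`: the function `curlDoubleQuotes` and the two context patterns are hypothetical names introduced here, and the character sets are copied from the commit message and PR description. The sketch decides opening versus closing from the single preceding character and runs both rule sets on the sample sentence from the verification note.

```typescript
// Hypothetical sketch: names and structure are illustrative, not taken from smartypants.ts.

// Refined rule: whitespace, ASCII openers, and the CJK colon / opening brackets.
const REFINED_CONTEXT = /[\s(\[{:(【「『〈《]/;

// First-pass rule: also counted sentence/clause-ending punctuation as an opening context.
const FIRST_PASS_CONTEXT = /[\s(\[{:(【「『〈《,。、;!?]/;

// Convert straight double quotes to curly ones based on the preceding character.
function curlDoubleQuotes(input: string, openingContext: RegExp): string {
  let out = "";
  for (let i = 0; i < input.length; i++) {
    const ch = input.charAt(i);
    if (ch !== '"') {
      out += ch;
      continue;
    }
    // A quote at the very start of the input is treated as opening.
    const opens = i === 0 || openingContext.test(input.charAt(i - 1));
    out += opens ? "\u201C" : "\u201D";
  }
  return out;
}

const sample = '他说:"你好。"';
const refined = curlDoubleQuotes(sample, REFINED_CONTEXT);
const firstPass = curlDoubleQuotes(sample, FIRST_PASS_CONTEXT);

console.log(refined);   // 他说:“你好。”  (final quote closes)
console.log(firstPass); // 他说:“你好。“  (final quote wrongly opens)
```

With the first-pass set, the quote after 。 is flipped to an opening quote, which matches the regression the audit is reported to have caught.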
7dc5331 to 40a9fdd
Three real defects surface when rendering Simplified-Chinese documents with `make-pdf`. All reproduced on v1.57.10.0, fixed at source, and verified.

1. Bare URLs corrupt the document (worst)

Any bare URL leaks internal `SMARTPANTS_PRESERVED_N` sentinels as visible text and leaves the `<a>`/`<p>` unclosed, so every heading/paragraph after the URL becomes one giant hyperlink.

Root cause: `smartypants.ts` carves tags into NUL-delimited `SMARTPANTS_PRESERVED_N` placeholders, then `URL_RE = /\bhttps?:\/\/\S+/g` runs. `\S+` (a NUL is non-whitespace) greedily swallows the adjacent `</a>`/`</p>` placeholders; the single-pass restore then cannot un-nest them.

Fix: stop the URL match at the NUL boundary: `[^\s]+`.

2. Simplified-Chinese renders in Japanese glyphs

`print-css.ts` `CJK_STACK` lists `Hiragino Kaku Gothic ProN` (Japanese) before any Chinese font, so Chinese text falls back to Japanese glyph variants (直/骨/角/没/别). `pdffonts` shows a mix of `HiraKakuProN` + `PingFangSC` in one document.

Fix: put `PingFang SC, Heiti SC, Noto Sans CJK SC, Source Han Sans SC, Microsoft YaHei` first; JP fonts demoted to last resort.

3. Opening quotes after CJK punctuation render as closing quotes

A double/single quote directly after CJK punctuation (:,。(「) rendered as a closing quote. The opening-quote heuristic only treats a quote as opening after whitespace/brackets.

Fix: add CJK punctuation/openers `:,。、;!?(【「『〈《` to the opening-quote context. Ideographs are intentionally excluded, so a closing quote after a Han character stays closing.

Verification

- `pdffonts`: PingFang-only, zero Hiragino
- `pdftotext`: no `SMARTPANTS_PRESERVED` leak; correct curly quotes
- `bun test make-pdf/test/`: 91 pass / 0 fail (unchanged before/after)

Total change: 5 insertions / 5 deletions across 2 files. Opened as draft for maintainer review.
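The root cause in item 1 can be reproduced in isolation with the rough sketch below. The helper names (`preserveTags`, `stripNul`) and the exact placeholder layout are assumptions, not the actual `smartypants.ts` code. The character class in the description appears as `[^\s]+`, which on its own matches the same characters as `\S+`; the sketch assumes the intended class also excludes NUL and writes it as `[^\s\x00]+`.

```typescript
// Hypothetical reproduction: helper names and placeholder layout are assumptions.
const NUL = "\x00";

// Replace each HTML tag with a NUL-delimited placeholder and remember the original.
function preserveTags(html: string): { text: string; saved: string[] } {
  const saved: string[] = [];
  const text = html.replace(/<[^>]+>/g, (tag) => {
    saved.push(tag);
    return `${NUL}SMARTPANTS_PRESERVED_${saved.length - 1}${NUL}`;
  });
  return { text, saved };
}

// Entry-point guard described in the commit message: drop stray NULs from input.
function stripNul(input: string): string {
  return input.replace(/\x00/g, "");
}

// Pattern quoted in the description: \S+ also matches NUL.
const URL_RE_OLD = /\bhttps?:\/\/\S+/g;
// Assumed shape of the fix: exclude NUL as well as whitespace.
const URL_RE_NEW = /\bhttps?:\/\/[^\s\x00]+/g;

const { text } = preserveTags(stripNul("<p>See https://example.com/docs</p>"));

const oldMatch = text.match(URL_RE_OLD) ?? [];
const newMatch = text.match(URL_RE_NEW) ?? [];

// JSON.stringify makes the otherwise invisible NULs show up as \u0000.
console.log(JSON.stringify(oldMatch)); // URL plus the trailing </p> placeholder
console.log(JSON.stringify(newMatch)); // URL only
```

In this sketch the old pattern captures the closing-tag placeholder as part of the URL, which is how a later single-pass restore ends up with a placeholder nested inside link markup.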
🤖 Generated with Claude Code