50 changes: 44 additions & 6 deletions lib/Service/Note.php
@@ -56,26 +56,64 @@ public function getContent() : string {
);
$content = mb_convert_encoding($content, 'UTF-8');
}
$content = str_replace([ pack('H*', 'FEFF'), pack('H*', 'FFEF'), pack('H*', 'EFBBBF') ], '', $content);

// Strip Byte Order Marks (BOM) for UTF-8, UTF-16 BE, and UTF-16 LE
$content = str_replace(["\xEF\xBB\xBF", "\xFE\xFF", "\xFF\xFE"], '', $content);

return $content;
}

public function getExcerpt(int $maxlen = 100) : string {
$excerpt = trim($this->noteUtil->stripMarkdown($this->getContent()));
$excerpt = $this->noteUtil->stripMarkdown($this->getExcerptContent($maxlen));

Comment on lines +67 to +68

Contributor

Since this no longer goes through getContent(), it loses the non-UTF-8 handling done there. Won't a UTF-16-encoded note that previously produced a readable excerpt now be read as raw bytes and treated as UTF-8 below, yielding garbage?

$title = $this->getTitle();
if (!empty($title)) {
$length = mb_strlen($title, 'utf-8');
if (strncasecmp($excerpt, $title, $length) === 0) {
$excerpt = mb_substr($excerpt, $length, null, 'utf-8');
if ($title !== '') {
$titleLength = mb_strlen($title, 'utf-8');
if (strncasecmp($excerpt, $title, $titleLength) === 0) {
$excerpt = mb_substr($excerpt, $titleLength, null, 'utf-8');
}
}

$excerpt = trim($excerpt);

if (mb_strlen($excerpt, 'utf-8') > $maxlen) {
$excerpt = mb_substr($excerpt, 0, $maxlen, 'utf-8') . '…';
}
return str_replace("\n", "\u{2003}", $excerpt);
}

/**
* Lightweight best-effort content reader for excerpts only.
*/
private function getExcerptContent(int $maxlen) : string {
$handle = $this->file->fopen('r');
if (!is_resource($handle)) {
return '';
}

// Over-read bytes assuming worst-case UTF-8 size (up to 4 bytes per
// character). This is only a heuristic for preview generation; markdown
// stripping may reduce the visible character count further.
$bytesToRead = max(512, $maxlen * 4);

Contributor

Maybe * 6 would be better. With the default maxlen=100 this reads 512 bytes. After stripMarkdown() and the leading-title strip (lines 71–76), a long first line, URL, or title can push the visible excerpt below maxlen, whereas the old full read produced a complete one.
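
To make the numbers in this comment concrete, here is a small sketch of the read-budget arithmetic. The function name `excerptReadBudget` and its parameters are hypothetical and not part of the PR; only the `max(512, $maxlen * 4)` expression comes from the diff.

```php
<?php
// Hypothetical helper mirroring the diff's max(512, $maxlen * 4) expression,
// with the multiplier and the floor exposed as parameters for comparison.
function excerptReadBudget(int $maxlen, int $bytesPerChar = 4, int $floor = 512): int {
    return max($floor, $maxlen * $bytesPerChar);
}

// Default excerpt length of 100 characters:
$withFour = excerptReadBudget(100);    // floor wins: 400 < 512
$withSix  = excerptReadBudget(100, 6); // multiplier wins: 600 > 512
```

At the default `maxlen`, moving from 4 to 6 raises the read from 512 to 600 bytes, so the 512-byte floor matters about as much as the multiplier when judging whether a long title or URL still leaves enough visible text.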


try {
$content = fread($handle, $bytesToRead);
if ($content === false) {
return '';
}
} finally {
fclose($handle);
}

// Remove any partial trailing multibyte character from the truncated read.
$content = mb_strcut($content, 0, strlen($content), 'UTF-8');
Comment on lines +108 to +109

Contributor

I don't understand. The comment says this removes a partial trailing multibyte character, but passing strlen($content) (the full byte length) as the cut length makes it effectively a no-op for that purpose, doesn't it?
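
One possible way to do what the code comment describes, as a minimal sketch: look at the last few bytes and drop a UTF-8 sequence that was cut off by the truncated read. The helper name `trimPartialUtf8Tail` is hypothetical and not part of the PR, and the sketch assumes the input is valid UTF-8 apart from a possibly incomplete final character.

```php
<?php
// Hypothetical helper: remove an incomplete UTF-8 sequence at the end of a
// byte string produced by a fixed-size fread().
function trimPartialUtf8Tail(string $bytes): string {
    $len = strlen($bytes);
    $i = $len - 1;
    $seen = 0;

    // Step back over up to three continuation bytes (bit pattern 10xxxxxx).
    while ($i >= 0 && $seen < 3 && (ord($bytes[$i]) & 0xC0) === 0x80) {
        $i--;
        $seen++;
    }
    if ($i < 0) {
        return $bytes; // empty, or continuation bytes only: leave untouched
    }

    // Work out how many bytes the lead byte announces.
    $lead = ord($bytes[$i]);
    if ($lead < 0x80) {
        $expected = 1;
    } elseif (($lead & 0xE0) === 0xC0) {
        $expected = 2;
    } elseif (($lead & 0xF0) === 0xE0) {
        $expected = 3;
    } elseif (($lead & 0xF8) === 0xF0) {
        $expected = 4;
    } else {
        return $bytes; // not a valid lead byte: out of scope for this sketch
    }

    // Drop the final sequence only if it is shorter than announced.
    $have = $len - $i;
    return $have < $expected ? substr($bytes, 0, $i) : $bytes;
}
```

A complete final character is left alone, so the helper is safe to call on every read regardless of where the byte limit happened to fall.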


// Strip Byte Order Marks (BOM) for UTF-8, UTF-16 BE, and UTF-16 LE
$content = str_replace(["\xEF\xBB\xBF", "\xFE\xFF", "\xFF\xFE"], '', $content);
Comment on lines +111 to +112

Contributor

The UTF-16 BOMs (\xFE\xFF, \xFF\xFE) are stripped here, but since the body isn't transcoded from UTF-16 (as I said at https://github.com/nextcloud/notes/pull/1886/changes#r3450190129), isn't the rest of a UTF-16 note still mis-decoded?
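
A rough sketch of one way to address this: use the BOM to pick the source encoding and transcode the prefix before any further processing, instead of only deleting the BOM bytes. The function name `decodeExcerptPrefix` is hypothetical and not part of the PR. The sketch assumes the mbstring extension is available and only handles input that starts with a BOM; UTF-16 without a BOM would need the detection that getContent() performs, which is not visible in this hunk.

```php
<?php
// Hypothetical helper: transcode a BOM-prefixed byte prefix to UTF-8.
function decodeExcerptPrefix(string $bytes): string {
    if (strncmp($bytes, "\xEF\xBB\xBF", 3) === 0) {
        return substr($bytes, 3); // UTF-8 BOM: strip it, body is already UTF-8
    }

    if (strncmp($bytes, "\xFE\xFF", 2) === 0) {
        $encoding = 'UTF-16BE';
    } elseif (strncmp($bytes, "\xFF\xFE", 2) === 0) {
        $encoding = 'UTF-16LE';
    } else {
        return $bytes; // no BOM: assume UTF-8 and leave as-is
    }

    $body = substr($bytes, 2);
    // A truncated read can end in the middle of a 2-byte code unit;
    // drop the dangling byte so the converter sees whole units only.
    if (strlen($body) % 2 === 1) {
        $body = substr($body, 0, -1);
    }

    $utf8 = mb_convert_encoding($body, 'UTF-8', $encoding);
    return $utf8 === false ? '' : $utf8;
}
```

A read that ends between the two halves of a surrogate pair is not handled here; mbstring will substitute a replacement character for the orphaned half, which is likely acceptable for a preview but worth knowing about.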


return $content;
}

public function getModified() : int {
return $this->file->getMTime();
}