php-toolkit/components/Encoding/README.md at trunk · WordPress/php-toolkit

slug: encoding
title: Encoding
install: wp-php-toolkit/encoding
see_also:
  - html | HTML | Normalize incoming text before HTML tokenization.
  - xml | XML | Keep invalid bytes out of XML streams.
  - dataliberation | DataLiberation | Clean content before importing it into WordPress.

UTF-8 validation and scrubbing with a pure-PHP fallback when mbstring is unavailable. Detects malformed bytes and replaces them per the Unicode maximal-subpart algorithm.

Why this exists

Every parser in this toolkit eventually has to decide what to do with text bytes. XML rejects malformed UTF-8. JSON and databases can fail late. CSS, HTML, WXR, and Blueprint validation all need consistent answers about whether a string is well-formed Unicode.

The Encoding component provides the small UTF-8 primitives the rest of the toolkit can share: validate bytes, scrub invalid sequences, scan code points, and detect Unicode noncharacters. When mbstring is available it can delegate to it; when it is not, the component uses its own byte scanner so behavior stays available in restricted PHP environments.
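
The delegation described above can be pictured with a minimal sketch. This is not the toolkit's implementation: `sketch_is_valid_utf8()` is a hypothetical name, and the fallback swaps in PCRE's `/u` validation for brevity where the component uses its own byte scanner.

```php
<?php
/**
 * Hypothetical stand-in showing the "delegate when possible" shape.
 * Assumption: the real component scans bytes itself in the fallback path.
 */
function sketch_is_valid_utf8( string $bytes ): bool {
	// Prefer mbstring when the extension is loaded.
	if ( function_exists( 'mb_check_encoding' ) ) {
		return mb_check_encoding( $bytes, 'UTF-8' );
	}

	// Fallback for this sketch only: with the /u modifier, preg_match()
	// returns false instead of 1 when the subject is ill-formed UTF-8.
	return 1 === preg_match( '//u', $bytes );
}

foreach ( array( 'just a test', "B\xFCch" ) as $sample ) {
	echo bin2hex( $sample ) . ': ' . ( sketch_is_valid_utf8( $sample ) ? 'valid' : 'invalid' ) . "\n";
}
```

Either branch gives callers the same boolean answer, which is the property the component is after.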

Historically, this became the common foundation for Blueprint validation and CSS/XML processing, replacing ad hoc Unicode helpers with the WordPress core UTF-8 routines used here.

Validating UTF-8 before storing it

wp_is_valid_utf8() rejects overlong sequences, surrogate halves, and stray ISO-8859-1 bytes. Use it as a guard in front of any code path that assumes UTF-8 (database, JSON, XML).

<?php
require '/wordpress/wp-content/php-toolkit/vendor/autoload.php';

use function WordPress\Encoding\wp_is_valid_utf8;

$samples = array(
	'ASCII'          => 'just a test',
	'UTF-8 pencil'   => "\xE2\x9C\x8F",
	'latin-1 byte'   => "B\xFCch",
	'overlong slash' => "\xC0\xAF",
	'surrogate half' => "\xED\xB0\x80",
);

foreach ( $samples as $label => $bytes ) {
	echo sprintf( "%-14s %s\n", $label . ':', wp_is_valid_utf8( $bytes ) ? 'valid' : 'invalid' );
}

ASCII:         valid
UTF-8 pencil:  valid
latin-1 byte:  invalid
overlong slash: invalid
surrogate half: invalid
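
To see why a guard is worth having, the sketch below uses only PHP's built-in `json_encode()`, which refuses ill-formed UTF-8 at serialization time, often far from where the bad bytes entered. `sketch_json_or_reason()` is a hypothetical helper name used for illustration.

```php
<?php
/**
 * Hypothetical helper: return the JSON encoding of a string, or a
 * description of why encoding failed.
 */
function sketch_json_or_reason( string $text ): string {
	$json = json_encode( $text );
	if ( false === $json ) {
		// json_encode() returns false for ill-formed UTF-8 unless a
		// substitution or ignore flag is passed.
		return 'failed: ' . json_last_error_msg();
	}
	return $json;
}

echo sketch_json_or_reason( 'just a test' ) . "\n";
echo sketch_json_or_reason( "B\xFCch" ) . "\n";
```

Running `wp_is_valid_utf8()` where the data enters moves this failure to a place where it can still be handled deliberately.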

Scrubbing invalid bytes with U+FFFD

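Replace each ill-formed sequence with the Unicode replacement character, U+FFFD. A truncated multi-byte sequence is replaced as a unit, so each \xE2\x8C pair below yields one U+FFFD rather than two. Useful right before serializing to XML or JSON, or before sending text to an LLM that will choke on broken bytes.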

<?php
require '/wordpress/wp-content/php-toolkit/vendor/autoload.php';

use function WordPress\Encoding\wp_scrub_utf8;

$broken = "the byte \xC0 should not be here.";
echo wp_scrub_utf8( $broken ) . "\n";

echo wp_scrub_utf8( ".\xE2\x8C\xE2\x8C." ) . "\n";

the byte � should not be here.
.��.
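
For readers curious how maximal-subpart replacement behaves, here is a compact sketch based on the well-formed byte ranges in the Unicode Standard. It is an illustration under that assumption, not the toolkit's code, and `sketch_scrub_utf8()` is a hypothetical name.

```php
<?php
/**
 * Illustrative only: replace each maximal ill-formed subpart with U+FFFD.
 * Hypothetical helper, not the implementation behind wp_scrub_utf8().
 */
function sketch_scrub_utf8( string $bytes ): string {
	$out = '';
	$len = strlen( $bytes );
	$i   = 0;

	while ( $i < $len ) {
		$lead = ord( $bytes[ $i ] );

		// ASCII passes through untouched.
		if ( $lead < 0x80 ) {
			$out .= $bytes[ $i ];
			++$i;
			continue;
		}

		// Number of continuation bytes, plus the allowed range of the first
		// one. The narrowed ranges rule out overlong forms, surrogates, and
		// code points above U+10FFFF.
		if ( $lead >= 0xC2 && $lead <= 0xDF ) {
			$need = 1;
			$lo   = 0x80;
			$hi   = 0xBF;
		} elseif ( $lead >= 0xE0 && $lead <= 0xEF ) {
			$need = 2;
			$lo   = 0xE0 === $lead ? 0xA0 : 0x80;
			$hi   = 0xED === $lead ? 0x9F : 0xBF;
		} elseif ( $lead >= 0xF0 && $lead <= 0xF4 ) {
			$need = 3;
			$lo   = 0xF0 === $lead ? 0x90 : 0x80;
			$hi   = 0xF4 === $lead ? 0x8F : 0xBF;
		} else {
			// 0x80..0xC1 and 0xF5..0xFF can never start a sequence.
			$out .= "\u{FFFD}";
			++$i;
			continue;
		}

		// Consume continuation bytes for as long as they fit.
		$got = 0;
		while ( $got < $need && $i + 1 + $got < $len ) {
			$next = ord( $bytes[ $i + 1 + $got ] );
			if ( $next < $lo || $next > $hi ) {
				break;
			}
			++$got;
			// Only the first continuation byte has a narrowed range.
			$lo = 0x80;
			$hi = 0xBF;
		}

		// A complete sequence is copied; an incomplete one becomes one U+FFFD.
		$out .= $got === $need ? substr( $bytes, $i, $need + 1 ) : "\u{FFFD}";
		$i   += $got + 1;
	}

	return $out;
}

echo sketch_scrub_utf8( ".\xE2\x8C\xE2\x8C." ) . "\n";
```

The bytes consumed before a failure are dropped together with the lead byte, which is what keeps a truncated sequence from turning into several replacement characters.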

Detecting noncharacters MySQL/utf8mb4 will reject

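Code points like U+FFFE, U+FFFF, and the range U+FDD0–U+FDEF are valid Unicode and well-formed UTF-8, so wp_is_valid_utf8() accepts them, but they are not meant for interchange. XML 1.0 excludes U+FFFE and U+FFFF outright and discourages the others, and some databases reject them. Check before inserting user-submitted content into a strict utf8mb4 column.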

<?php
require '/wordpress/wp-content/php-toolkit/vendor/autoload.php';

use function WordPress\Encoding\wp_has_noncharacters;

$samples = array(
	'normal text' => 'normal text',
	'U+FFFE'      => "oops \u{FFFE}",
	'U+FDD0'      => "hi \u{FDD0} bye",
);

foreach ( $samples as $label => $text ) {
	echo sprintf( "%-12s %s\n", $label . ':', wp_has_noncharacters( $text ) ? 'reject' : 'ok' );
}

normal text: ok
U+FFFE:      reject
U+FDD0:      reject
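
Unicode defines 66 noncharacters: the 32 code points U+FDD0 through U+FDEF, plus the last two code points of each of the 17 planes. The sketch below builds a regular expression for that set to make the definition concrete. `sketch_has_noncharacters()` is a hypothetical name, and the regex approach is an assumption made for illustration, not how `wp_has_noncharacters()` is implemented.

```php
<?php
/**
 * Illustrative only: detect Unicode noncharacters with a PCRE character class.
 * Expects well-formed UTF-8; ill-formed input makes preg_match() return false,
 * which this sketch reports as "no noncharacters found".
 */
function sketch_has_noncharacters( string $text ): bool {
	static $pattern = null;

	if ( null === $pattern ) {
		// The contiguous range U+FDD0..U+FDEF.
		$class = '\x{FDD0}-\x{FDEF}';

		// U+nFFFE and U+nFFFF for planes 0 through 16.
		for ( $plane = 0; $plane <= 0x10; $plane++ ) {
			$class .= sprintf( '\x{%X}\x{%X}', ( $plane << 16 ) | 0xFFFE, ( $plane << 16 ) | 0xFFFF );
		}

		$pattern = '/[' . $class . ']/u';
	}

	return 1 === preg_match( $pattern, $text );
}

var_dump( sketch_has_noncharacters( "oops \u{FFFE}" ) );
var_dump( sketch_has_noncharacters( 'normal text' ) );
```

Note that U+FFFD, the replacement character, is not a noncharacter, so scrubbed text does not trip this check.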

Three-way pipeline: validate, scrub, then check noncharacters

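Real-world inputs are messy: an old WXR export, a CSV with mixed encodings, a paste from Word. Combining validation, scrubbing, and a noncharacter check covers the three classes of breakage that bite later. Scrubbing leaves noncharacters in place because they are well-formed UTF-8, which is why the noncharacter check runs on the cleaned string.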

<?php
require '/wordpress/wp-content/php-toolkit/vendor/autoload.php';

use function WordPress\Encoding\wp_is_valid_utf8;
use function WordPress\Encoding\wp_scrub_utf8;
use function WordPress\Encoding\wp_has_noncharacters;

$inputs = array(
	'good'      => 'Café',
	'latin1'    => "caf\xE9",
	'overlong'  => "x\xC1\xBFy",
	'noncharac' => "hi \u{FFFE} there",
);

foreach ( $inputs as $label => $bytes ) {
	$valid    = wp_is_valid_utf8( $bytes );
	$cleaned  = wp_scrub_utf8( $bytes );
	$weird    = wp_has_noncharacters( $cleaned );
	echo sprintf( "%-10s valid=%s noncharacter=%s -> %s\n", $label, $valid ? 'Y' : 'N', $weird ? 'Y' : 'N', $cleaned );
}

good       valid=Y noncharacter=N -> Café
latin1     valid=N noncharacter=N -> caf�
overlong   valid=N noncharacter=N -> x��y
noncharac  valid=Y noncharacter=Y -> hi � there

Salvaging a legacy ISO-8859-1 column inside a UTF-8 corpus

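Old WordPress databases sometimes mix encodings: most rows are UTF-8 but a few were stored as latin-1. Detect the bad rows with wp_is_valid_utf8() and only re-encode those. Because every byte value is a defined ISO-8859-1 character, the conversion succeeds for any input, so row 4 below is reported as recovered even though \xC0 may never have been meant as À. The scrubbing branch is a defensive fallback for when iconv itself fails.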

<?php
require '/wordpress/wp-content/php-toolkit/vendor/autoload.php';

use function WordPress\Encoding\wp_is_valid_utf8;
use function WordPress\Encoding\wp_scrub_utf8;

$rows = array(
	1 => 'Plain ASCII',
	2 => 'Café',
	3 => "caf\xE9",
	4 => "weird \xC0 byte",
);

foreach ( $rows as $id => $value ) {
	if ( wp_is_valid_utf8( $value ) ) {
		echo "#$id ok: $value\n";
		continue;
	}
	$converted = @iconv( 'ISO-8859-1', 'UTF-8', $value );
	if ( false !== $converted && wp_is_valid_utf8( $converted ) ) {
		echo "#$id recovered as latin1: $converted\n";
	} else {
		echo "#$id unrecoverable, scrubbing: " . wp_scrub_utf8( $value ) . "\n";
	}
}

#1 ok: Plain ASCII
#2 ok: Café
#3 recovered as latin1: café
#4 recovered as latin1: weird À byte
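
The example above depends on the iconv extension. Where it is missing, ISO-8859-1 can be converted by hand, because each byte maps to the code point with the same value. The following is a minimal sketch of that idea; `sketch_latin1_to_utf8()` is a hypothetical helper, not part of the toolkit.

```php
<?php
/**
 * Illustrative only: convert ISO-8859-1 bytes to UTF-8 without iconv.
 * Assumes the input really is ISO-8859-1; it cannot detect other encodings.
 */
function sketch_latin1_to_utf8( string $bytes ): string {
	$out = '';
	$len = strlen( $bytes );

	for ( $i = 0; $i < $len; $i++ ) {
		$byte = ord( $bytes[ $i ] );

		if ( $byte < 0x80 ) {
			// ASCII is identical in both encodings.
			$out .= $bytes[ $i ];
		} else {
			// U+0080..U+00FF encode as two bytes: 1100001x 10xxxxxx.
			$out .= chr( 0xC0 | ( $byte >> 6 ) ) . chr( 0x80 | ( $byte & 0x3F ) );
		}
	}

	return $out;
}

echo sketch_latin1_to_utf8( "caf\xE9" ) . "\n";
```

As with iconv, the conversion never fails, so it should only be applied to rows that already failed UTF-8 validation.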
