| slug | encoding |
|---|---|
| title | Encoding |
| install | wp-php-toolkit/encoding |
| see_also | |

UTF-8 validation and scrubbing with a pure-PHP fallback when mbstring is unavailable. Detects malformed bytes and replaces them per the Unicode maximal-subpart algorithm.
Every parser in this toolkit eventually has to decide what to do with text bytes. XML rejects malformed UTF-8 outright. JSON encoding and database writes can fail much later, far from where the bad bytes entered. CSS, HTML, WXR, and Blueprint validation all need consistent answers about whether a string is well-formed Unicode.
The Encoding component provides the small UTF-8 primitives the rest of the toolkit can share: validate bytes, scrub invalid sequences, scan code points, and detect Unicode noncharacters. When mbstring is available it can delegate to it; when it is not, the component uses its own byte scanner so behavior stays available in restricted PHP environments.
The component became the common foundation for Blueprint validation and CSS/XML processing, replacing ad hoc Unicode helpers with the UTF-8 routines from WordPress core.
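The mbstring-or-fallback arrangement described above can be sketched roughly as follows. This is a minimal illustration, not the component's code: the sketch_* function names are hypothetical, and the fallback leans on PCRE's /u flag as a stand-in for the component's own byte scanner.

```php
<?php
/**
 * Fallback for when mbstring is missing (hypothetical helper).
 * With the /u flag, PCRE refuses to run on an ill-formed UTF-8 subject,
 * so preg_match() returns false instead of 1.
 */
function sketch_is_valid_utf8_fallback( string $bytes ): bool {
	return 1 === preg_match( '//u', $bytes );
}

/**
 * Prefer mbstring when the extension is loaded; otherwise use the fallback.
 * Callers get one function and never need to know which path ran.
 */
function sketch_is_valid_utf8( string $bytes ): bool {
	if ( function_exists( 'mb_check_encoding' ) ) {
		return mb_check_encoding( $bytes, 'UTF-8' );
	}
	return sketch_is_valid_utf8_fallback( $bytes );
}

var_dump( sketch_is_valid_utf8( "\xE2\x9C\x8F" ) ); // bool(true)
var_dump( sketch_is_valid_utf8( "B\xFCch" ) );       // bool(false)
```

Keeping the dispatch inside a single function means the rest of the code calls one name regardless of which extensions the host provides.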
wp_is_valid_utf8() rejects overlong sequences, surrogate halves, and stray ISO-8859-1 bytes. Use it as a guard in front of any code path that assumes UTF-8 (database, JSON, XML).
<?php
require '/wordpress/wp-content/php-toolkit/vendor/autoload.php';
use function WordPress\Encoding\wp_is_valid_utf8;
$samples = array(
'ASCII' => 'just a test',
'UTF-8 pencil' => "\xE2\x9C\x8F",
'latin-1 byte' => "B\xFCch",
'overlong slash' => "\xC0\xAF",
'surrogate half' => "\xED\xB0\x80",
);
foreach ( $samples as $label => $bytes ) {
echo sprintf( "%-14s %s\n", $label . ':', wp_is_valid_utf8( $bytes ) ? 'valid' : 'invalid' );
}

ASCII:         valid
UTF-8 pencil:  valid
latin-1 byte:  invalid
overlong slash: invalid
surrogate half: invalid
wp_scrub_utf8() replaces each ill-formed sequence with the Unicode replacement character, U+FFFD. Useful right before serializing to XML, JSON, or sending to an LLM that will choke on broken bytes.
<?php
require '/wordpress/wp-content/php-toolkit/vendor/autoload.php';
use function WordPress\Encoding\wp_scrub_utf8;
$broken = "the byte \xC0 should not be here.";
echo wp_scrub_utf8( $broken ) . "\n";
echo wp_scrub_utf8( ".\xE2\x8C\xE2\x8C." ) . "\n";

the byte � should not be here.
.��.
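The second line of output shows the maximal-subpart rule at work: each truncated \xE2\x8C prefix becomes a single U+FFFD, not one per byte. A rough pure-PHP sketch of that rule follows. It assumes the well-formed byte ranges from the Unicode Standard; sketch_scrub_utf8() is a hypothetical name and not the toolkit's implementation.

```php
<?php
/**
 * Sketch of maximal-subpart scrubbing (hypothetical helper).
 * Valid sequences are copied through. An ill-formed sequence is replaced by
 * one U+FFFD covering the longest prefix that could still have started a
 * valid sequence, or a single byte if there is no such prefix.
 */
function sketch_scrub_utf8( string $bytes ): string {
	$out = '';
	$len = strlen( $bytes );
	$i   = 0;

	while ( $i < $len ) {
		$lead = ord( $bytes[ $i ] );

		if ( $lead < 0x80 ) { // ASCII passes through unchanged.
			$out .= $bytes[ $i++ ];
			continue;
		}

		// Continuation bytes needed, plus the allowed range of the first one.
		if ( $lead >= 0xC2 && $lead <= 0xDF ) {
			list( $need, $lo, $hi ) = array( 1, 0x80, 0xBF );
		} elseif ( 0xE0 === $lead ) {
			list( $need, $lo, $hi ) = array( 2, 0xA0, 0xBF ); // Lower would be overlong.
		} elseif ( 0xED === $lead ) {
			list( $need, $lo, $hi ) = array( 2, 0x80, 0x9F ); // Higher would be a surrogate.
		} elseif ( $lead >= 0xE1 && $lead <= 0xEF ) {
			list( $need, $lo, $hi ) = array( 2, 0x80, 0xBF );
		} elseif ( 0xF0 === $lead ) {
			list( $need, $lo, $hi ) = array( 3, 0x90, 0xBF ); // Lower would be overlong.
		} elseif ( 0xF4 === $lead ) {
			list( $need, $lo, $hi ) = array( 3, 0x80, 0x8F ); // Higher exceeds U+10FFFF.
		} elseif ( $lead >= 0xF1 && $lead <= 0xF3 ) {
			list( $need, $lo, $hi ) = array( 3, 0x80, 0xBF );
		} else {
			// 0x80..0xC1 and 0xF5..0xFF can never start a sequence.
			$out .= "\u{FFFD}";
			$i++;
			continue;
		}

		// Consume continuation bytes while they stay inside the allowed range.
		$j   = $i + 1;
		$got = 0;
		while ( $got < $need && $j < $len ) {
			$next = ord( $bytes[ $j ] );
			if ( $next < $lo || $next > $hi ) {
				break;
			}
			// Only the first continuation byte has a narrowed range.
			$lo = 0x80;
			$hi = 0xBF;
			$j++;
			$got++;
		}

		$out .= ( $got === $need ) ? substr( $bytes, $i, $j - $i ) : "\u{FFFD}";
		$i    = $j; // Resume right after the bytes just consumed.
	}

	return $out;
}

// Prints two replacement characters between the dots.
echo sketch_scrub_utf8( ".\xE2\x8C\xE2\x8C." ), "\n";
```

With a scrubber like this, validation falls out for free: a string is well-formed exactly when scrubbing leaves it unchanged, except that a literal U+FFFD already present in the input is preserved either way.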
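Noncharacters such as U+FFFE, U+FFFF, and the U+FDD0–U+FDEF block are well-formed UTF-8, so wp_is_valid_utf8() accepts them, but Unicode reserves them for internal use. XML 1.0 forbids U+FFFE and U+FFFF outright and discourages the rest, and some databases reject them. Check with wp_has_noncharacters() before inserting user-submitted content into a strict utf8mb4 column.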
<?php
require '/wordpress/wp-content/php-toolkit/vendor/autoload.php';
use function WordPress\Encoding\wp_has_noncharacters;
$samples = array(
'normal text' => 'normal text',
'U+FFFE' => "oops \u{FFFE}",
'U+FDD0' => "hi \u{FDD0} bye",
);
foreach ( $samples as $label => $text ) {
echo sprintf( "%-12s %s\n", $label . ':', wp_has_noncharacters( $text ) ? 'reject' : 'ok' );
}

normal text: ok
U+FFFE:      reject
U+FDD0:      reject
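The set of noncharacters is small and regular, so the test on a single code point comes down to one range check and one bit mask. The sketch below works on integers, not strings, and sketch_is_noncharacter() is a hypothetical helper, not part of the toolkit.

```php
<?php
/**
 * Sketch: is this code point one of Unicode's 66 noncharacters?
 * (Hypothetical helper; takes a code point, not a UTF-8 string.)
 */
function sketch_is_noncharacter( int $code_point ): bool {
	if ( $code_point < 0 || $code_point > 0x10FFFF ) {
		return false; // Outside the Unicode code space entirely.
	}

	// The contiguous block U+FDD0..U+FDEF (32 code points).
	if ( $code_point >= 0xFDD0 && $code_point <= 0xFDEF ) {
		return true;
	}

	// The last two code points of each of the 17 planes: U+nFFFE and U+nFFFF.
	return ( $code_point & 0xFFFE ) === 0xFFFE;
}

foreach ( array( 0x41, 0xFDD0, 0xFFFD, 0xFFFE, 0x10FFFF ) as $code_point ) {
	printf( "U+%04X %s\n", $code_point, sketch_is_noncharacter( $code_point ) ? 'noncharacter' : 'ok' );
}
```

The mask works because every plane ends in ...FFFE and ...FFFF, so clearing the lowest bit of the low 16 bits leaves 0xFFFE in both cases.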
Real-world inputs are messy: an old WXR export, a CSV with mixed encodings, a paste from Word. Combining validation, scrubbing, and a noncharacter check covers the three classes of breakage that bite later.
<?php
require '/wordpress/wp-content/php-toolkit/vendor/autoload.php';
use function WordPress\Encoding\wp_is_valid_utf8;
use function WordPress\Encoding\wp_scrub_utf8;
use function WordPress\Encoding\wp_has_noncharacters;
$inputs = array(
'good' => 'Café',
'latin1' => "caf\xE9",
'overlong' => "x\xC1\xBFy",
'noncharac' => "hi \u{FFFE} there",
);
foreach ( $inputs as $label => $bytes ) {
$valid = wp_is_valid_utf8( $bytes );
$cleaned = wp_scrub_utf8( $bytes );
$weird = wp_has_noncharacters( $cleaned );
echo sprintf( "%-10s valid=%s noncharacter=%s -> %s\n", $label, $valid ? 'Y' : 'N', $weird ? 'Y' : 'N', $cleaned );
}

good       valid=Y noncharacter=N -> Café
latin1     valid=N noncharacter=N -> caf�
overlong   valid=N noncharacter=N -> x��y
noncharac  valid=Y noncharacter=Y -> hi � there
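Old WordPress databases sometimes mix encodings: most rows are UTF-8 but a few were stored as latin-1. Detect the bad rows with wp_is_valid_utf8() and only re-encode those. Every byte is a valid ISO-8859-1 character, so the conversion accepts any input; treat a "recovered" row as a best guess, not proof of the original encoding.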
<?php
require '/wordpress/wp-content/php-toolkit/vendor/autoload.php';
use function WordPress\Encoding\wp_is_valid_utf8;
use function WordPress\Encoding\wp_scrub_utf8;
$rows = array(
1 => 'Plain ASCII',
2 => 'Café',
3 => "caf\xE9",
4 => "weird \xC0 byte",
);
foreach ( $rows as $id => $value ) {
if ( wp_is_valid_utf8( $value ) ) {
echo "#$id ok: $value\n";
continue;
}
$converted = @iconv( 'ISO-8859-1', 'UTF-8', $value );
if ( false !== $converted && wp_is_valid_utf8( $converted ) ) {
echo "#$id recovered as latin1: $converted\n";
} else {
echo "#$id unrecoverable, scrubbing: " . wp_scrub_utf8( $value ) . "\n";
}
}

#1 ok: Plain ASCII
#2 ok: Café
#3 recovered as latin1: café
#4 recovered as latin1: weird À byte
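Row 4 comes out as "recovered", not scrubbed, and writing the latin-1 conversion out by hand shows why. The sketch below is a hypothetical pure-PHP equivalent of the iconv() call for this one encoding pair, not code from the toolkit. It relies on ISO-8859-1 bytes 0x80 through 0xFF mapping directly to U+0080 through U+00FF.

```php
<?php
/**
 * Sketch: convert ISO-8859-1 bytes to UTF-8 by hand (hypothetical helper).
 * Each latin-1 byte is its own code point, so there is no input this
 * function can reject. That is why the conversion never signals failure.
 */
function sketch_latin1_to_utf8( string $bytes ): string {
	$out = '';
	$len = strlen( $bytes );

	for ( $i = 0; $i < $len; $i++ ) {
		$byte = ord( $bytes[ $i ] );

		if ( $byte < 0x80 ) {
			$out .= $bytes[ $i ]; // ASCII is identical in both encodings.
		} else {
			// U+0080..U+00FF encode as two bytes: 110000xx 10xxxxxx.
			$out .= chr( 0xC0 | ( $byte >> 6 ) ) . chr( 0x80 | ( $byte & 0x3F ) );
		}
	}

	return $out;
}

echo bin2hex( sketch_latin1_to_utf8( "caf\xE9" ) ), "\n"; // 636166c3a9
echo bin2hex( sketch_latin1_to_utf8( "\xC0" ) ), "\n";    // c380
```

Since the result is always well-formed UTF-8, the scrubbing branch in the example above only runs if iconv() itself is unavailable or fails for some other reason.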