| slug | html |
|---|---|
| title | HTML |
| install | wp-php-toolkit/html |
| credit_title | Ported from WordPress core |
| credit_body | The HTML component is a port of WordPress core's <code>WP_HTML_Tag_Processor</code> and <code>WP_HTML_Processor</code>. Source: <a href="https://github.com/WordPress/wordpress-develop/tree/trunk/src/wp-includes/html-api">WordPress/wordpress-develop</a>. Bug fixes flow in both directions. |
| see_also | |

A pure-PHP HTML parser and tag rewriter mirroring WordPress core's HTML API. Handle browser-style HTML fragments for supported markup — without libxml2, DOMDocument, or regex hacks — and rewrite attributes in a single linear pass.
WordPress runs HTML fragments through filters every time a request renders: post content, block markup, comments, excerpts, widgets, feeds, imported documents. Those fragments can omit <html> and <body>, close tags implicitly, or mix browser-correct markup with author mistakes that DOMDocument and regular expressions do not model well.
The HTML component gives WordPress-style code the same parsing model WordPress core uses: a browser-compatible tokenizer and tree-aware processor that run in pure PHP. Choose it for exact-byte rewrites, imperfect fragments, and post-content filters where a full DOM would do too much work.
The component gives you two processors. WP_HTML_Tag_Processor is a forward-only cursor over tags and tokens — useful for attribute rewriting at scale. WP_HTML_Processor layers HTML5 tree construction on top so you can query by ancestry (breadcrumbs), serialize the parsed document, and trust that <p>one<p>two parses as two paragraphs the way a browser sees it.
Scope: WP_HTML_Processor intentionally supports WordPress core's current subset of HTML5. It aborts on markup it cannot safely model, including table-internal content, foreign content such as SVG/MathML, and content outside the supported body parsing modes. Use get_unsupported_exception() when you need to explain why processing stopped.
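A minimal sketch of that check, assuming the port keeps core's behavior here: create_fragment() may return null, and get_unsupported_exception() returns either null or an exception object. The input string is only an illustration; whether it actually triggers the bail-out depends on the version of the component you have installed.

```php
<?php
require '/php-toolkit/vendor/autoload.php';

// Illustrative input only. SVG is one of the constructs listed above as
// unsupported, but this exact string may or may not trip the bail-out.
$html = '<p>Chart: <svg><circle r="4"></circle></svg></p>';

$p           = WP_HTML_Processor::create_fragment( $html );
$seen        = 0;
$unsupported = null;

if ( null !== $p ) {
    while ( $p->next_tag() ) {
        $seen++;
    }
    // next_tag() returns false at the end of the input and also when the
    // processor gives up, so ask which of the two happened.
    $unsupported = $p->get_unsupported_exception();
}

if ( null === $p ) {
    echo "could not create a processor\n";
} elseif ( null !== $unsupported ) {
    echo "stopped after {$seen} tags: " . $unsupported->getMessage() . "\n";
} else {
    echo "walked all {$seen} tags\n";
}
```

Checking after the loop matters because a bail-out looks exactly like "no more tags" from inside the while condition.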
Footgun: Mutations are buffered. Nothing changes in the source string until you call get_updated_html(). If you read get_attribute() after a set_attribute() on the same tag, you see the new value — but downstream tooling reading the original string sees stale HTML until you serialize.
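The buffering described above can be observed directly. This is a small sketch with made-up markup and variable names; it only uses set_attribute(), get_attribute(), and get_updated_html().

```php
<?php
require '/php-toolkit/vendor/autoload.php';

$html = '<img src="logo.png">';
$tags = new WP_HTML_Tag_Processor( $html );

if ( $tags->next_tag( 'img' ) ) {
    $tags->set_attribute( 'alt', 'Logo' );
}

// The processor already reports the buffered value...
$alt_seen_by_processor = $tags->get_attribute( 'alt' );

// ...but the string that was passed in is untouched.
$source_has_alt = false !== strpos( $html, 'alt=' );

// Only the serialized result carries the change.
$updated         = $tags->get_updated_html();
$updated_has_alt = false !== strpos( $updated, 'alt="Logo"' );

var_dump( $alt_seen_by_processor, $source_has_alt, $updated_has_alt );
```

Anything that needs the rewritten markup should be handed the return value of get_updated_html(), not the original variable.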
The "hello world" of tag rewriting: add loading="lazy" and decoding="async" to every image. One linear pass, no DOM, no reserialization cost beyond the bytes you actually changed.
Try this: set loading to 'eager' on the first image only by branching on $tags->get_attribute( 'src' ) === 'hero.jpg'. Run the example again and check that only the first <img> carries loading="eager"; the second image still gets loading="lazy", and everything outside the two tags stays byte-identical.
<?php
require '/php-toolkit/vendor/autoload.php';
$html = <<<'HTML'
<article>
<img src="hero.jpg" alt="Hero">
<p>Intro copy.</p>
<img src="inline.jpg" alt="Inline">
</article>
HTML;
$tags = new WP_HTML_Tag_Processor( $html );
while ( $tags->next_tag( 'img' ) ) {
    // Don't clobber an explicit eager hint the author already set.
    if ( null === $tags->get_attribute( 'loading' ) ) {
        $tags->set_attribute( 'loading', 'lazy' );
    }
    $tags->set_attribute( 'decoding', 'async' );
}
echo $tags->get_updated_html();

<article>
<img decoding="async" loading="lazy" src="hero.jpg" alt="Hero">
<p>Intro copy.</p>
<img decoding="async" loading="lazy" src="inline.jpg" alt="Inline">
</article>
Rewrite relative links to absolute URLs before sending post content to an RSS feed, an email template, or a CDN-backed copy of a site. The processor rewrites only the changed bytes, so untouched markup stays byte-identical.
<?php
require '/php-toolkit/vendor/autoload.php';
$html = <<<'HTML'
<p>See <a href="/about">about</a>, <a href="https://example.com/x">x</a>,
and <a href="contact.html">contact</a>.</p>
HTML;
$base = 'https://my-site.test/';
$tags = new WP_HTML_Tag_Processor( $html );
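while ( $tags->next_tag( 'a' ) ) {
    $href = $tags->get_attribute( 'href' );
    // get_attribute() returns null when the attribute is missing and true
    // for a value-less attribute (<a href>); skip both, and empty strings.
    if ( ! is_string( $href ) || '' === $href ) {
        continue;
    }
    // Leave absolute URLs (scheme:), protocol-relative URLs (//host), and
    // same-page fragments (#id) alone.
    if ( preg_match( '#^[a-z][a-z0-9+.-]*:#i', $href ) || 0 === strpos( $href, '//' ) || 0 === strpos( $href, '#' ) ) {
        continue;
    }
    $tags->set_attribute( 'href', rtrim( $base, '/' ) . '/' . ltrim( $href, '/' ) );
}
echo $tags->get_updated_html();

<p>See <a href="https://my-site.test/about">about</a>, <a href="https://example.com/x">x</a>,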
and <a href="https://my-site.test/contact.html">contact</a>.</p>
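The decision of which links to leave alone is easier to test when it is pulled out of the tag loop. The sketch below does that with a hypothetical helper, absolutize_href(), which is not part of the component; it mirrors the rules used in the example above and returns null when the link should stay as it is.

```php
<?php
// Hypothetical helper: returns the absolute URL, or null when $href should
// be left untouched (missing, empty, already absolute, protocol-relative,
// or a same-page fragment).
function absolutize_href( $href, $base ) {
    if ( ! is_string( $href ) || '' === $href ) {
        return null;
    }
    $has_scheme        = 1 === preg_match( '#^[a-z][a-z0-9+.-]*:#i', $href );
    $protocol_relative = 0 === strpos( $href, '//' );
    $fragment_only     = 0 === strpos( $href, '#' );
    if ( $has_scheme || $protocol_relative || $fragment_only ) {
        return null;
    }
    return rtrim( $base, '/' ) . '/' . ltrim( $href, '/' );
}

$base  = 'https://my-site.test/';
$cases = array( '/about', 'contact.html', 'https://example.com/x', '//cdn.test/a.js', '#top', 'mailto:me@example.com' );
foreach ( $cases as $href ) {
    $result = absolutize_href( $href, $base );
    echo $href . ' => ' . ( null === $result ? '(unchanged)' : $result ) . "\n";
}
```

This is plain string joining, not full URL resolution: paths such as ../up.html or query-only links like ?page=2 are simply appended to the base, so a real feed exporter would likely need a proper resolver.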
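A common sanitization step for untrusted HTML before display: blank a script's body with set_modifiable_text() and strip every on* event-handler attribute via get_attribute_names_with_prefix(). Treat it as one step in a pipeline rather than a complete sanitizer; it does not touch other vectors such as javascript: URLs.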
<?php
require '/php-toolkit/vendor/autoload.php';
$untrusted = <<<'HTML'
<p onclick="x()">hi</p>
<script>evil()</script>
<img src="x" onerror="boom()">
HTML;
$tags = new WP_HTML_Tag_Processor( $untrusted );
while ( $tags->next_tag() ) {
    // next_tag() never lands on closing tags, so no is_tag_closer() guard
    // is needed here.
    if ( 'SCRIPT' === $tags->get_tag() ) {
        $tags->set_modifiable_text( '' );
    }
    foreach ( $tags->get_attribute_names_with_prefix( 'on' ) as $attr ) {
        $tags->remove_attribute( $attr );
    }
}
echo $tags->get_updated_html();

<p >hi</p>
<script></script>
<img src="x" >
Content Security Policy in nonce- mode requires every inline <script> and <style> to carry a matching nonce attribute. Tag-by-tag is exactly the right granularity.
<?php
require '/php-toolkit/vendor/autoload.php';
// 16 random bytes = 128 bits, the minimum nonce length the CSP spec recommends.
$nonce = bin2hex( random_bytes( 16 ) );
$html = <<<'HTML'
<head><style>body{font:16px sans-serif}</style></head>
<body><script>console.log("hi")</script><script src="vendor.js"></script></body>
HTML;
$tags = new WP_HTML_Tag_Processor( $html );
while ( $tags->next_tag() ) {
    $tag = $tags->get_tag();
    if ( 'SCRIPT' === $tag || 'STYLE' === $tag ) {
        $tags->set_attribute( 'nonce', $nonce );
    }
}
echo "nonce: {$nonce}\n\n";
echo $tags->get_updated_html();

nonce: <random>
<head><style nonce="<random>">body{font:16px sans-serif}</style></head>
<body><script nonce="<random>">console.log("hi")</script><script nonce="<random>" src="vendor.js"></script></body>
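The nonce attributes only take effect when the response also carries a policy naming the same value. A minimal sketch of building that header line follows; build_csp_header() is a hypothetical helper, and the two directives shown are just the ones relevant to inline scripts and styles, not a recommended complete policy.

```php
<?php
// Hypothetical helper, not part of the component: formats a policy that
// allows only scripts and styles carrying the given nonce.
function build_csp_header( $nonce ) {
    return "Content-Security-Policy: script-src 'nonce-{$nonce}'; style-src 'nonce-{$nonce}'";
}

// Generate one nonce per response and reuse it for both the markup and
// the header.
$nonce  = bin2hex( random_bytes( 16 ) );
$header = build_csp_header( $nonce );

// In a real response this would be sent before any output:
// header( $header );
echo $header . "\n";
```

The same $nonce must be passed to set_attribute( 'nonce', ... ) in the tag loop; a fresh value per request is what makes the nonce hard to guess.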
Generate responsive image markup at render time without touching the editor data model. Read the existing src, derive a srcset with width descriptors, add a sizes hint.
<?php
require '/php-toolkit/vendor/autoload.php';
$html = '<figure><img src="https://cdn.test/uploads/photo.jpg" alt="Sunset"></figure>';
$widths = array( 480, 768, 1200 );
$tags = new WP_HTML_Tag_Processor( $html );
while ( $tags->next_tag( 'img' ) ) {
    $src = $tags->get_attribute( 'src' );
    if ( null === $src || $tags->get_attribute( 'srcset' ) !== null ) {
        continue;
    }
    $variants = array();
    foreach ( $widths as $w ) {
        $variants[] = $src . '?w=' . $w . ' ' . $w . 'w';
    }
    $tags->set_attribute( 'srcset', implode( ', ', $variants ) );
    $tags->set_attribute( 'sizes', '(max-width: 768px) 100vw, 768px' );
}
echo $tags->get_updated_html();

<figure><img sizes="(max-width: 768px) 100vw, 768px" srcset="https://cdn.test/uploads/photo.jpg?w=480 480w, https://cdn.test/uploads/photo.jpg?w=768 768w, https://cdn.test/uploads/photo.jpg?w=1200 1200w" src="https://cdn.test/uploads/photo.jpg" alt="Sunset"></figure>
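The srcset string itself can be built by a small pure function, which also makes it easy to handle a src that already has a query string. build_srcset() below is a hypothetical helper, and the w= resize parameter is the same assumption about the CDN that the example above makes.

```php
<?php
// Hypothetical helper: builds a srcset value with width descriptors.
// Assumes the image host resizes on a "w" query parameter.
function build_srcset( $src, array $widths ) {
    // Append with "&" when the URL already carries a query string.
    $separator = false === strpos( $src, '?' ) ? '?' : '&';
    $variants  = array();
    foreach ( $widths as $w ) {
        $variants[] = $src . $separator . 'w=' . $w . ' ' . $w . 'w';
    }
    return implode( ', ', $variants );
}

echo build_srcset( 'https://cdn.test/photo.jpg', array( 480, 768 ) ) . "\n";
echo build_srcset( 'https://cdn.test/photo.jpg?v=3', array( 480 ) ) . "\n";
```

Inside the tag loop the result would go straight into set_attribute( 'srcset', ... ). URLs containing a #fragment are not handled by this sketch.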
The HTML5 entity table has roughly 2,200 named references and a long list of edge cases. WP_HTML_Decoder implements the algorithm — don't roll your own.
<?php
require '/php-toolkit/vendor/autoload.php';
echo "attribute: " . WP_HTML_Decoder::decode_attribute( 'path?a=1&b=2&copy' ) . "\n";
echo "text: " . WP_HTML_Decoder::decode_text_node( 'AT&amp;T &mdash; 100% &#x1F600;' ) . "\n";
// Safe URL prefix check that decodes character references while comparing.
// `&#x6A;` is the letter `j`, so this string really does start with javascript:.
// A plain strpos() on the raw attribute text would miss it.
$is_javascript = WP_HTML_Decoder::attribute_starts_with(
    '&#x6A;avascript:alert(1)',
    'javascript:',
    'ascii-case-insensitive'
);
var_dump( $is_javascript );

attribute: path?a=1&b=2©
text: AT&T — 100% 😀
bool(true)
The full WP_HTML_Processor understands HTML5 tree construction, so you can ask "find every <img> directly inside a <figure>" without writing your own DOM walker.
<?php
require '/php-toolkit/vendor/autoload.php';
$html = <<<'HTML'
<article>
<figure><img src="hero.jpg" alt="Hero"><figcaption>Hero shot</figcaption></figure>
<p>Body copy <img src="emoji.png" alt=""> mid-paragraph.</p>
<figure><img src="diagram.png" alt="Diagram"></figure>
</article>
HTML;
$p = WP_HTML_Processor::create_fragment( $html );
$figure_images = 0;
while ( $p->next_tag( array( 'breadcrumbs' => array( 'FIGURE', 'IMG' ) ) ) ) {
    $p->add_class( 'figure-image' );
    $figure_images++;
}
echo "found {$figure_images} figure images\n";
echo $p->get_updated_html();

found 2 figure images
<article>
<figure><img class="figure-image" src="hero.jpg" alt="Hero"><figcaption>Hero shot</figcaption></figure>
<p>Body copy <img src="emoji.png" alt=""> mid-paragraph.</p>
<figure><img class="figure-image" src="diagram.png" alt="Diagram"></figure>
</article>
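To see what a breadcrumb query is matched against, it can help to print the full trail for a tag. A short sketch with made-up markup: in a fragment parsed with the default body context, the trail starts at HTML and BODY even though neither appears in the input.

```php
<?php
require '/php-toolkit/vendor/autoload.php';

$p = WP_HTML_Processor::create_fragment(
    '<figure><img src="hero.jpg" alt="Hero"><figcaption>Hero shot</figcaption></figure>'
);

$trail = array();
if ( null !== $p && $p->next_tag( 'IMG' ) ) {
    // Every open element from the root down to the current tag.
    $trail = $p->get_breadcrumbs();
}

echo implode( ' > ', $trail ) . "\n"; // HTML > BODY > FIGURE > IMG
```

A query such as array( 'FIGURE', 'IMG' ) matches the tail of this trail, which is why it selects images that are direct children of a figure.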
The full processor exposes get_current_depth() and get_breadcrumbs(). Combine with next_token() to print a structural outline.
<?php
require '/php-toolkit/vendor/autoload.php';
$html = <<<'HTML'
<section><h1>Title</h1>
<section><h2>Chapter 1</h2><p>Body</p></section>
<section><h2>Chapter 2</h2><p>More body</p></section>
</section>
HTML;
$p = WP_HTML_Processor::create_fragment( $html );
while ( $p->next_token() ) {
    if ( '#tag' !== $p->get_token_type() || $p->is_tag_closer() ) {
        continue;
    }
    $tag = $p->get_tag();
    if ( ! preg_match( '/^H[1-6]$/', $tag ) ) {
        continue;
    }
    $indent = str_repeat( ' ', max( 0, $p->get_current_depth() - 2 ) );
    $text   = '';
    while ( $p->next_token() ) {
        if ( '#text' === $p->get_token_type() ) {
            $text .= $p->get_modifiable_text();
            continue;
        }
        if ( '#tag' === $p->get_token_type() && $tag === $p->get_tag() && $p->is_tag_closer() ) {
            break;
        }
    }
    echo "{$indent}{$tag} {$text}\n";
}

  H1 Title
   H2 Chapter 1
   H2 Chapter 2
Bookmarks are the one escape from forward-only scanning. Save a position, scan ahead, decide what to do, then seek() back and rewrite the earlier tag.
<?php
require '/php-toolkit/vendor/autoload.php';
$html = <<<'HTML'
<ul>
<li><input type="checkbox" checked> Buy milk</li>
<li><input type="checkbox"> Walk the dog</li>
<li><input type="checkbox" checked> Read book</li>
</ul>
HTML;
$tags = new WP_HTML_Tag_Processor( $html );
$tags->next_tag( 'ul' );
$tags->set_bookmark( 'list' );
$total = 0;
$done = 0;
while ( $tags->next_tag( 'input' ) ) {
    $total++;
    if ( null !== $tags->get_attribute( 'checked' ) ) {
        $done++;
    }
}
$tags->seek( 'list' );
$tags->set_attribute( 'data-progress', $done . '/' . $total );
$tags->release_bookmark( 'list' );
echo $tags->get_updated_html();

<ul data-progress="2/3">
<li><input type="checkbox" checked> Buy milk</li>
<li><input type="checkbox"> Walk the dog</li>
<li><input type="checkbox" checked> Read book</li>
</ul>
| Use | For |
|---|---|
| WP_HTML_Tag_Processor | Attribute rewriting, sanitization, finding tags by name. Forward-only walks. Anything where speed and byte-honesty matter more than context. |
| WP_HTML_Processor::create_fragment() | Queries by ancestry (breadcrumbs), heading outline extraction, anything that needs to know "is this tag inside that one." |
| WP_HTML_Decoder::decode_text_node() | Turning entity-encoded text (AT&amp;T) back into raw text correctly. Implements the HTML5 entity algorithm, so you don't have to roll your own. |
| WP_HTML_Decoder::attribute_starts_with() | Safe URL-prefix checks that decode HTML character references while comparing, so j&#x61;vascript: (where &#x61; is the letter a) is correctly recognized as starting with javascript:. The classic strpos() approach misses these. |
Footgun: By default, next_tag() only stops on opening tags. Closers and text are skipped, so a guard like ! $tags->is_tag_closer() inside a next_tag() loop is harmless but redundant: it is always true. If you need to visit closing tags or text nodes, use next_token() instead and check get_token_type().
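A small sketch of the next_token() alternative, using made-up markup. It records every token the cursor stops on, marking closers with a leading slash and quoting text nodes.

```php
<?php
require '/php-toolkit/vendor/autoload.php';

$tags   = new WP_HTML_Tag_Processor( '<p>Hi <b>there</b></p>' );
$tokens = array();

while ( $tags->next_token() ) {
    $type = $tags->get_token_type();
    if ( '#tag' === $type ) {
        // Unlike next_tag(), next_token() also stops on closing tags.
        $tokens[] = ( $tags->is_tag_closer() ? '/' : '' ) . $tags->get_tag();
    } elseif ( '#text' === $type ) {
        $tokens[] = '"' . $tags->get_modifiable_text() . '"';
    } else {
        // Comments, doctype, and other token types.
        $tokens[] = $type;
    }
}

echo implode( ' ', $tokens ) . "\n"; // P "Hi " B "there" /B /P
```

The same loop written with next_tag() would stop only on the two opening tags, P and B.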
Footgun: Tag-name matches are uppercase. get_tag() always returns the tag name in uppercase ('IMG', not 'img'). Compare accordingly. The filter argument to next_tag() is case-insensitive in either direction.
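A quick illustration of the case rules, with a deliberately mixed-case tag in made-up markup:

```php
<?php
require '/php-toolkit/vendor/autoload.php';

$tags = new WP_HTML_Tag_Processor( '<Img src="a.png">' );

// The query is matched case-insensitively, so lower-case works here.
$found = $tags->next_tag( 'img' );

// The reported name is upper-case regardless of how the source wrote it.
$name = $tags->get_tag();

var_dump( $found, $name, 'img' === $name );
```

The last comparison is the one that goes wrong in practice: it is false, so compare against 'IMG' or normalize with strtoupper() before comparing.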
Footgun: Don't confuse WP_HTML_Tag_Processor with the full processor. The cursor is forward-only and ancestry-blind, and it doesn't expose get_breadcrumbs() at all — calling that on a WP_HTML_Tag_Processor raises a Call to undefined method error. Breadcrumbs and HTML5 tree construction (implicit <tbody> insertion, automatic <p> closing, and the rest) live only on WP_HTML_Processor.