php-toolkit/components/HTML/README.md at trunk · WordPress/php-toolkit

slug

html

title

HTML

install

wp-php-toolkit/html

credit_title

Ported from WordPress core

credit_body

The HTML component is a port of WordPress core's <code>WP_HTML_Tag_Processor</code> and <code>WP_HTML_Processor</code>. Source: <a href="https://github.com/WordPress/wordpress-develop/tree/trunk/src/wp-includes/html-api">WordPress/wordpress-develop</a>. Bug fixes flow in both directions.

see_also

../learn/01-rewriting-html.html | Tutorial — Rewriting HTML safely | The chapter that introduces the cursor model and the <code>clean_post_html()</code> function reused later in the importer.

blockparser | BlockParser | Parse block comments first, then rewrite the HTML inside each block.

markdown | Markdown | Convert Markdown to blocks before polishing generated HTML.

dataliberation | DataLiberation | Rewrite URLs and media references during import/export pipelines.

A pure-PHP HTML parser and tag rewriter mirroring WordPress core's HTML API. Handle browser-style HTML fragments for supported markup — without libxml2, DOMDocument, or regex hacks — and rewrite attributes in a single linear pass.

Why this exists

WordPress runs HTML fragments through filters every time a request renders: post content, block markup, comments, excerpts, widgets, feeds, imported documents. Those fragments can omit <html> and <body>, close tags implicitly, or mix browser-correct markup with author mistakes that DOMDocument and regular expressions do not model well.

The HTML component gives WordPress-style code the same parsing model WordPress core uses: a browser-compatible tokenizer and tree-aware processor that run in pure PHP. Choose it for exact-byte rewrites, imperfect fragments, and post-content filters where a full DOM would do too much work.

The component gives you two processors. WP_HTML_Tag_Processor is a forward-only cursor over tags and tokens — useful for attribute rewriting at scale. WP_HTML_Processor layers HTML5 tree construction on top so you can query by ancestry (breadcrumbs), serialize the parsed document, and trust that <p>one<p>two parses as two paragraphs the way a browser sees it.
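As a minimal sketch of that last claim, the snippet below feeds `<p>one<p>two` to the full processor and records the breadcrumbs at each P element. It assumes the same autoload path the other examples in this README use; the variable names are illustrative.

```php
<?php
require '/php-toolkit/vendor/autoload.php';

// The second <p> has no explicit closer for the first one.
$p = WP_HTML_Processor::create_fragment( '<p>one<p>two' );

$paths = array();
while ( null !== $p && $p->next_tag( 'P' ) ) {
	// Breadcrumbs list the open elements from the root down to the current tag.
	$paths[] = implode( ' > ', $p->get_breadcrumbs() );
}

print_r( $paths );
```

If the second paragraph were nested inside the first, its breadcrumbs would contain two P entries. Both paths should instead end in a single P, which is how a browser would build the tree.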

Scope: WP_HTML_Processor intentionally supports WordPress core's current subset of HTML5. It aborts on markup it cannot safely model, including table-internal content, foreign content such as SVG/MathML, and content outside the supported body parsing modes. Use get_unsupported_exception() when you need to explain why processing stopped.
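A hedged sketch of how a caller might report an early stop. The input is hypothetical, and whether it actually triggers the unsupported path depends on the subset of HTML5 the installed version models. The sketch also assumes core's get_last_error() is ported alongside get_unsupported_exception().

```php
<?php
require '/php-toolkit/vendor/autoload.php';

// Hypothetical input containing foreign content (SVG).
$html = '<p>Chart: <svg><circle r="4"></circle></svg></p>';

$p       = WP_HTML_Processor::create_fragment( $html );
$visited = 0;
while ( null !== $p && $p->next_tag() ) {
	$visited++;
}

// After the loop ends, check whether it ended because the parser bailed.
$reason = null;
if ( null !== $p && null !== $p->get_last_error() ) {
	$exception = $p->get_unsupported_exception();
	$reason    = null !== $exception ? $exception->getMessage() : $p->get_last_error();
}

echo "visited {$visited} tags\n";
echo null === $reason ? "parsed without bailing\n" : "stopped early: {$reason}\n";
```

Checking after the loop matters because next_tag() returning false looks the same whether the document ended or the processor gave up.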

Footgun: Mutations are buffered. The processor never modifies the string you passed in; get_updated_html() returns a new string with the edits applied. If you read get_attribute() after a set_attribute() on the same tag, you see the new value, but any code still holding the original string sees stale HTML until you serialize and hand the result along.
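The buffering can be seen in a few lines. This is a small sketch using only calls that appear elsewhere in this README; the sample markup and variable names are made up for illustration.

```php
<?php
require '/php-toolkit/vendor/autoload.php';

$html = '<img src="a.jpg">';

$tags = new WP_HTML_Tag_Processor( $html );
$tags->next_tag( 'img' );
$tags->set_attribute( 'alt', 'A' );

// Reading through the processor reflects the pending change.
$seen_by_processor = $tags->get_attribute( 'alt' );

// Serializing produces a new string; $html itself is left alone.
$updated = $tags->get_updated_html();

var_dump( $seen_by_processor );
echo $html, "\n";
echo $updated, "\n";
```

Anything that should see the change has to receive `$updated`, not `$html`.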

Add loading="lazy" to every image

The "hello world" of tag rewriting. One linear pass, no DOM, no reserialization cost beyond the bytes you actually changed.

Try this: click Run, then change 'lazy' to 'eager' on the first image only by guarding it with $tags->get_attribute( 'src' ) === 'hero.jpg'. Run again and notice that get_updated_html() only rewrites the bytes for that one tag.

<?php
require '/php-toolkit/vendor/autoload.php';

$html = <<<'HTML'
<article>
	<img src="hero.jpg" alt="Hero">
	<p>Intro copy.</p>
	<img src="inline.jpg" alt="Inline">
</article>
HTML;

$tags = new WP_HTML_Tag_Processor( $html );
while ( $tags->next_tag( 'img' ) ) {
	// Don't clobber an explicit eager hint the author already set.
	if ( null === $tags->get_attribute( 'loading' ) ) {
		$tags->set_attribute( 'loading', 'lazy' );
	}
	$tags->set_attribute( 'decoding', 'async' );
}

echo $tags->get_updated_html();

<article>
	<img decoding="async" loading="lazy" src="hero.jpg" alt="Hero">
	<p>Intro copy.</p>
	<img decoding="async" loading="lazy" src="inline.jpg" alt="Inline">
</article>

Rewrite relative links to absolute URLs

Use this before sending post content to an RSS feed, an email template, or a CDN-backed copy of a site. The processor rewrites only the changed bytes, so untouched markup stays byte-identical.

<?php
require '/php-toolkit/vendor/autoload.php';

$html = <<<'HTML'
<p>See <a href="/about">about</a>, <a href="https://example.com/x">x</a>, 
and <a href="contact.html">contact</a>.</p>
HTML;

$base = 'https://my-site.test/';

$tags = new WP_HTML_Tag_Processor( $html );
while ( $tags->next_tag( 'a' ) ) {
	$href = $tags->get_attribute( 'href' );
	if ( null === $href || '' === $href ) {
		continue;
	}
	if ( preg_match( '#^[a-z][a-z0-9+.-]*:#i', $href ) || 0 === strpos( $href, '//' ) || 0 === strpos( $href, '#' ) ) {
		continue;
	}
	$tags->set_attribute( 'href', rtrim( $base, '/' ) . '/' . ltrim( $href, '/' ) );
}

echo $tags->get_updated_html();

<p>See <a href="https://my-site.test/about">about</a>, <a href="https://example.com/x">x</a>, 
and <a href="https://my-site.test/contact.html">contact</a>.</p>

Strip every script and inline event handler

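A common sanitization step: neutralize untrusted HTML before display. Blank a script's body with set_modifiable_text() and strip every on* attribute via get_attribute_names_with_prefix(). The <script> element itself stays in place, emptied; a script that loads its code through a src attribute would also need that attribute removed.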

<?php
require '/php-toolkit/vendor/autoload.php';

$untrusted = <<<'HTML'
<p onclick="x()">hi</p>
<script>evil()</script>
<img src="x" onerror="boom()">
HTML;

$tags = new WP_HTML_Tag_Processor( $untrusted );
while ( $tags->next_tag() ) {
	// next_tag() never lands on closing tags, so no is_tag_closer() guard
	// is needed here.
	if ( 'SCRIPT' === $tags->get_tag() ) {
		$tags->set_modifiable_text( '' );
	}
	foreach ( $tags->get_attribute_names_with_prefix( 'on' ) as $attr ) {
		$tags->remove_attribute( $attr );
	}
}

echo $tags->get_updated_html();

<p >hi</p>
<script></script>
<img src="x" >

Stamp a CSP nonce on inline scripts and styles

A nonce-based Content Security Policy requires every inline <script> and <style> to carry a nonce attribute that matches the value in the policy header. Tag-by-tag is exactly the right granularity.

<?php
require '/php-toolkit/vendor/autoload.php';

$nonce = bin2hex( random_bytes( 8 ) );

$html = <<<'HTML'
<head><style>body{font:16px sans-serif}</style></head>
<body><script>console.log("hi")</script><script src="vendor.js"></script></body>
HTML;

$tags = new WP_HTML_Tag_Processor( $html );
while ( $tags->next_tag() ) {
	$tag = $tags->get_tag();
	if ( 'SCRIPT' === $tag || 'STYLE' === $tag ) {
		$tags->set_attribute( 'nonce', $nonce );
	}
}

echo "nonce: {$nonce}\n\n";
echo $tags->get_updated_html();

nonce: <random>

<head><style nonce="<random>">body{font:16px sans-serif}</style></head>
<body><script nonce="<random>">console.log("hi")</script><script nonce="<random>" src="vendor.js"></script></body>
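The stamped attribute only helps if the response header carries the same value. Below is a minimal sketch of that pairing in plain PHP, with no toolkit classes involved. The two-directive policy and the 16-byte nonce length are illustrative assumptions, not a recommended production policy.

```php
<?php
// One nonce per response; the same value must be stamped on the tags.
$nonce = bin2hex( random_bytes( 16 ) );

// Both directives name the nonce so inline scripts and styles are allowed.
$policy = "script-src 'nonce-{$nonce}'; style-src 'nonce-{$nonce}'";

// Headers have to be sent before any body output.
if ( ! headers_sent() ) {
	header( 'Content-Security-Policy: ' . $policy );
}

echo $policy, "\n";
```

Generate the nonce once, send the header, then pass the same `$nonce` to the set_attribute() loop so the markup and the policy agree.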

Build a srcset from a single src

Generate responsive image markup at render time without touching the editor data model. Read the existing src, derive a srcset with width descriptors, add a sizes hint.

<?php
require '/php-toolkit/vendor/autoload.php';

$html = '<figure><img src="https://cdn.test/uploads/photo.jpg" alt="Sunset"></figure>';
$widths = array( 480, 768, 1200 );

$tags = new WP_HTML_Tag_Processor( $html );
while ( $tags->next_tag( 'img' ) ) {
	$src = $tags->get_attribute( 'src' );
	if ( null === $src || $tags->get_attribute( 'srcset' ) !== null ) {
		continue;
	}
	$variants = array();
	foreach ( $widths as $w ) {
		$variants[] = $src . '?w=' . $w . ' ' . $w . 'w';
	}
	$tags->set_attribute( 'srcset', implode( ', ', $variants ) );
	$tags->set_attribute( 'sizes', '(max-width: 768px) 100vw, 768px' );
}

echo $tags->get_updated_html();

<figure><img sizes="(max-width: 768px) 100vw, 768px" srcset="https://cdn.test/uploads/photo.jpg?w=480 480w, https://cdn.test/uploads/photo.jpg?w=768 768w, https://cdn.test/uploads/photo.jpg?w=1200 1200w" src="https://cdn.test/uploads/photo.jpg" alt="Sunset"></figure>

Decode HTML entities the way the spec demands

The HTML5 entity table has roughly 2,200 named references and a long list of edge cases. WP_HTML_Decoder implements the algorithm — don't roll your own.

<?php
require '/php-toolkit/vendor/autoload.php';

echo "attribute: " . WP_HTML_Decoder::decode_attribute( 'path?a=1&amp;b=2&amp;copy' ) . "\n";
echo "text:      " . WP_HTML_Decoder::decode_text_node( 'AT&amp;T &mdash; 100&percnt; &#x1F600;' ) . "\n";

// Safe URL prefix check that decodes character references while comparing.
// `&#x6A;` is the letter `j`, so this string really does start with javascript:.
// strpos() would miss it.
$is_javascript = WP_HTML_Decoder::attribute_starts_with(
	'&#x6A;avascript:alert(1)',
	'javascript:',
	'ascii-case-insensitive'
);
var_dump( $is_javascript );

attribute: path?a=1&b=2&copy
text:      AT&T — 100% 😀
bool(true)

Find images by ancestry with breadcrumbs

The full WP_HTML_Processor understands HTML5 tree construction, so you can ask "find every <img> directly inside a <figure>" without writing your own DOM walker.

<?php
require '/php-toolkit/vendor/autoload.php';

$html = <<<'HTML'
<article>
<figure><img src="hero.jpg" alt="Hero"><figcaption>Hero shot</figcaption></figure>
<p>Body copy <img src="emoji.png" alt=""> mid-paragraph.</p>
<figure><img src="diagram.png" alt="Diagram"></figure>
</article>
HTML;

$p = WP_HTML_Processor::create_fragment( $html );
$figure_images = 0;
while ( $p->next_tag( array( 'breadcrumbs' => array( 'FIGURE', 'IMG' ) ) ) ) {
	$p->add_class( 'figure-image' );
	$figure_images++;
}

echo "found {$figure_images} figure images\n";
echo $p->get_updated_html();

found 2 figure images
<article>
<figure><img class="figure-image" src="hero.jpg" alt="Hero"><figcaption>Hero shot</figcaption></figure>
<p>Body copy <img src="emoji.png" alt=""> mid-paragraph.</p>
<figure><img class="figure-image" src="diagram.png" alt="Diagram"></figure>
</article>

Outline a document by walking tokens with depth

The full processor exposes get_current_depth() and get_breadcrumbs(). Combine with next_token() to print a structural outline.

<?php
require '/php-toolkit/vendor/autoload.php';

$html = <<<'HTML'
<section><h1>Title</h1>
<section><h2>Chapter 1</h2><p>Body</p></section>
<section><h2>Chapter 2</h2><p>More body</p></section>
</section>
HTML;

$p = WP_HTML_Processor::create_fragment( $html );
while ( $p->next_token() ) {
	if ( '#tag' !== $p->get_token_type() || $p->is_tag_closer() ) {
		continue;
	}
	$tag = $p->get_tag();
	if ( ! preg_match( '/^H[1-6]$/', $tag ) ) {
		continue;
	}
	$indent = str_repeat( '  ', max( 0, $p->get_current_depth() - 2 ) );
	$text = '';
	while ( $p->next_token() ) {
		if ( '#text' === $p->get_token_type() ) {
			$text .= $p->get_modifiable_text();
			continue;
		}
		if ( '#tag' === $p->get_token_type() && $tag === $p->get_tag() && $p->is_tag_closer() ) {
			break;
		}
	}
	echo "{$indent}{$tag}  {$text}\n";
}

    H1  Title
      H2  Chapter 1
      H2  Chapter 2

Bookmarks: annotate a parent based on its children

Bookmarks are the one escape from forward-only scanning. Save a position, scan ahead, decide what to do, then seek() back and rewrite the earlier tag.

<?php
require '/php-toolkit/vendor/autoload.php';

$html = <<<'HTML'
<ul>
<li><input type="checkbox" checked> Buy milk</li>
<li><input type="checkbox"> Walk the dog</li>
<li><input type="checkbox" checked> Read book</li>
</ul>
HTML;

$tags = new WP_HTML_Tag_Processor( $html );
$tags->next_tag( 'ul' );
$tags->set_bookmark( 'list' );

$total = 0;
$done = 0;
while ( $tags->next_tag( 'input' ) ) {
	$total++;
	if ( null !== $tags->get_attribute( 'checked' ) ) {
		$done++;
	}
}

$tags->seek( 'list' );
$tags->set_attribute( 'data-progress', $done . '/' . $total );
$tags->release_bookmark( 'list' );

echo $tags->get_updated_html();

<ul data-progress="2/3">
<li><input type="checkbox" checked> Buy milk</li>
<li><input type="checkbox"> Walk the dog</li>
<li><input type="checkbox" checked> Read book</li>
</ul>

When to use which

| Use | For |
| --- | --- |
| `WP_HTML_Tag_Processor` | Attribute rewriting, sanitization, finding tags by name. Forward-only walks. Anything where speed and byte-honesty matter more than context. |
| `WP_HTML_Processor::create_fragment()` | Queries by ancestry (`breadcrumbs`), heading outline extraction, anything that needs to know "is this tag inside that one." |
| `WP_HTML_Decoder::decode_text_node()` | Turning entity-encoded text (`AT&amp;T`) back into raw text correctly. Implements the HTML5 entity algorithm, so there is no need to roll your own. |
| `WP_HTML_Decoder::attribute_starts_with()` | Safe URL-prefix checks that decode HTML character references while comparing, so `&#x6A;avascript:` (where `&#x6A;` is the letter `j`) is correctly recognized as starting with `javascript:`. The classic `strpos` approach misses these. |

Footgun: By default, next_tag() stops only on opening tags. Closers and text are skipped, so a guard like ! $tags->is_tag_closer() inside a plain next_tag() loop is harmless but never fires. If you need to visit closing tags or text nodes, use next_token() instead and check get_token_type().
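To make the difference concrete, here is a short sketch that walks the same fragment twice, once with next_tag() and once with next_token(). The sample markup and variable names are illustrative.

```php
<?php
require '/php-toolkit/vendor/autoload.php';

$html = '<p>Hello <b>world</b></p>';

// Pass 1: next_tag() reports openers only.
$tags    = new WP_HTML_Tag_Processor( $html );
$openers = array();
while ( $tags->next_tag() ) {
	$openers[] = $tags->get_tag();
}

// Pass 2: next_token() also reports text and closing tags.
$tokens = new WP_HTML_Tag_Processor( $html );
$seen   = array();
while ( $tokens->next_token() ) {
	$type = $tokens->get_token_type();
	if ( '#tag' === $type ) {
		// Prefix closers with a slash so they stand out in the listing.
		$seen[] = ( $tokens->is_tag_closer() ? '/' : '' ) . $tokens->get_tag();
	} else {
		$seen[] = $type;
	}
}

echo implode( ' ', $openers ), "\n";
echo implode( ' ', $seen ), "\n";
```

The first line lists two tags; the second interleaves `#text` entries and the `/B` and `/P` closers that next_tag() passed over.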

Footgun: Tag-name matches are uppercase. get_tag() always returns the tag name in uppercase ('IMG', not 'img'). Compare accordingly. The filter argument to next_tag() is case-insensitive in either direction.
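A quick sketch of the asymmetry: the filter accepts any casing, while the value read back is always uppercase. The mixed-case markup is contrived for illustration.

```php
<?php
require '/php-toolkit/vendor/autoload.php';

$tags = new WP_HTML_Tag_Processor( '<Img src="a.jpg"><IMG src="b.jpg">' );

$names           = array();
$lowercase_match = 0;
while ( $tags->next_tag( 'img' ) ) {
	$names[] = $tags->get_tag();
	// This comparison is the mistake the footgun warns about.
	if ( 'img' === $tags->get_tag() ) {
		$lowercase_match++;
	}
}

print_r( $names );
echo "lowercase comparisons that matched: {$lowercase_match}\n";
```

Both tags are found by the lowercase filter, yet neither passes the lowercase comparison against get_tag().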

Footgun: Don't confuse WP_HTML_Tag_Processor with the full processor. The cursor is forward-only and ancestry-blind, and it doesn't expose get_breadcrumbs() at all — calling that on a WP_HTML_Tag_Processor raises a Call to undefined method error. Breadcrumbs and HTML5 tree construction (implicit <tbody> insertion, automatic <p> closing, and the rest) live only on WP_HTML_Processor.
