DOM parser that allows HTML5-style </ in <script> tag

php dom html

2022-08-30 10:46:15

Update: html5lib (see the bottom of the question) seems to get much closer; I just need to improve my understanding of how it is used.

I am trying to find an HTML5-compatible DOM parser for PHP 5.3. In particular, I need to access the following HTML-like CDATA within a script tag:

<script type="text/x-jquery-tmpl" id="foo">
    <table><tr><td>${name}</td></tr></table>
</script>

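Most parsers end parsing prematurely, because HTML 4.01 ends script tag parsing as soon as it finds ETAGO (</) inside a <script> tag. HTML5, however, allows </ as long as it is not followed by script>. All of the parsers I have tried so far have either failed, or are so poorly documented that I cannot tell whether they work.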
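The difference between the two rules can be sketched in a few lines. This is only an illustration of where each specification considers the script content to end, not a parser, and it ignores the extra HTML5 states for <!-- inside scripts. The function names html4ScriptContent and html5ScriptContent are hypothetical.

```php
<?php

// Illustration only: a simplified model of where a script element's
// content ends under each rule. Both function names are hypothetical.

// HTML 4.01: content ends at the first "</" followed by a letter (ETAGO).
function html4ScriptContent($afterOpenTag)
{
    if (preg_match('~</[A-Za-z]~', $afterOpenTag, $m, PREG_OFFSET_CAPTURE)) {
        return substr($afterOpenTag, 0, $m[0][1]);
    }
    return $afterOpenTag;
}

// HTML5 (simplified): content ends only at "</script" followed by
// whitespace, "/" or ">".
function html5ScriptContent($afterOpenTag)
{
    if (preg_match('~</script[\s/>]~i', $afterOpenTag, $m, PREG_OFFSET_CAPTURE)) {
        return substr($afterOpenTag, 0, $m[0][1]);
    }
    return $afterOpenTag;
}

// Everything that follows the opening <script id="foo"> tag
$input = '<td>bar</td></script>';

echo html4ScriptContent($input), "\n"; // <td>bar
echo html5ScriptContent($input), "\n"; // <td>bar</td>
```

The HTML 4.01 result matches the truncated output that the failing parsers below produce.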

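My requirements:

It must be a real parser, not a regex hack.
It must be able to load full pages or HTML fragments.
It must be able to pull the script contents back out, selecting by the tag's id attribute.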

Input:

<script id="foo"><td>bar</td></script>

Example of failing output (no closing </td>):

<script id="foo"><td>bar</script>

Some parsers and their results:

DOMDocument (failed)

Source:

<?php

header('Content-type: text/plain');
$d = new DOMDocument;
$d->loadHTML('<script id="foo"><td>bar</td></script>');
echo $d->saveHTML();

Output:

Warning: DOMDocument::loadHTML(): Unexpected end tag : td in Entity, line: 1 in /home/adam/public_html/2010/10/26/dom.php on line 5
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
<html><head><script id="foo"><td>bar</script></head></html>

FluentDOM (failed)

Source:

<?php

header('Content-type: text/plain');
require_once 'FluentDOM/src/FluentDOM.php';
$html = "<html><head></head><body><script id='foo'><td></td></script></body></html>";
echo FluentDOM($html, 'text/html');

Output:

<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
<html><head></head><body><script id="foo"><td></script></body></html>

phpQuery (failed)

Source:

<?php

header('Content-type: text/plain');

require_once 'phpQuery.php';

phpQuery::newDocumentHTML(<<<EOF
<script type="text/x-jquery-tmpl" id="foo">
<td>test</td>
</script>
EOF
);

echo (string)pq('#foo');

Output:

<script type="text/x-jquery-tmpl" id="foo">
<td>test
</script>

html5lib (passes)

Possibly promising. Can I get at the contents of the script#foo tag?

Source:

<?php

header('Content-type: text/plain');

include 'HTML5/Parser.php';

$html = "<!DOCTYPE html><html><head></head><body><script id='foo'><td></td></script></body></html>";
$d = HTML5_Parser::parse($html);

echo $d->saveHTML();

Output:

<html><head></head><body><script id="foo"><td></td></script></body></html>
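Since the object returned by HTML5_Parser::parse() responds to saveHTML(), it appears to be a DOMDocument, in which case the contents of script#foo could probably be read with DOMXPath. The sketch below is a guess along those lines: scriptContentById is a hypothetical helper, and the document is built by hand as a stand-in for the parser's result so that the example runs without html5lib.

```php
<?php

// Hypothetical helper: returns the text content of the script element
// with the given id, or null if no such element exists.
function scriptContentById(DOMDocument $doc, $id)
{
    $xpath = new DOMXPath($doc);
    // local-name() matches whether or not the elements carry a namespace.
    // Assumption: $id contains no double quotes.
    $nodes = $xpath->query('//*[local-name()="script" and @id="' . $id . '"]');
    return $nodes->length > 0 ? $nodes->item(0)->textContent : null;
}

// Stand-in for a parsed document, built with the DOM API directly
$doc = new DOMDocument;
$script = $doc->createElement('script');
$script->setAttribute('id', 'foo');
$script->appendChild($doc->createTextNode('<td>bar</td>'));
$doc->appendChild($script);

echo scriptContentById($doc, 'foo'), "\n"; // <td>bar</td>
```

With html5lib, the call would presumably look like scriptContentById(HTML5_Parser::parse($html), 'foo'), though I have not confirmed how that library registers ids or namespaces.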

Answer 1

I had the same problem, and apparently you can hack your way around it by loading the document as XML and saving it as HTML :)

$d = new DOMDocument;
$d->loadXML('<script id="foo"><td>bar</td></script>');
echo $d->saveHTML();

But of course the markup must be error-free for loadXML to work.
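As a rough sketch of how this trick could also pull the script contents back out: once the markup is loaded as XML, the <td> becomes a child element of <script>, so the children have to be serialized again. The helper name innerMarkupViaXml is hypothetical, and it returns null when the markup is not well-formed, which is the limitation mentioned above.

```php
<?php

// Hypothetical helper: parses well-formed markup as XML and returns the
// serialized children of the root element, or null if parsing fails.
function innerMarkupViaXml($markup)
{
    // Collect libxml errors internally instead of emitting PHP warnings
    $previous = libxml_use_internal_errors(true);
    $doc = new DOMDocument;
    $ok = $doc->loadXML($markup);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    if (!$ok) {
        return null;
    }

    $out = '';
    foreach ($doc->documentElement->childNodes as $child) {
        $out .= $doc->saveXML($child);
    }
    return $out;
}

echo innerMarkupViaXml('<script id="foo"><td>bar</td></script>'), "\n"; // <td>bar</td>
var_dump(innerMarkupViaXml('<script id="foo"><td>bar</script>'));       // NULL
```

Note that saveXML() re-serializes the nodes, so the result is equivalent markup rather than a byte-for-byte copy of the original script body.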

Answer 2

I just found this out (in my case, at least).

Try changing the options parameter of DOMDocument's loadHTML to LIBXML_SCHEMA_CREATE:

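// Example input (an assumption; the answer does not show what $buffer holds)
$buffer = '<script id="foo"><td>bar</td></script>';

$dom = new DOMDocument;

libxml_use_internal_errors(true); // suppress libxml parse warnings
//$dom->loadHTML($buffer, LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);
$dom->loadHTML($buffer, LIBXML_SCHEMA_CREATE);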