How to save the HTML of a DOMDocument without the HTML wrapper?

2022-08-30 07:06:48

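I have the function below, and I am struggling to output the DOMDocument without it prepending the XML, HTML, body and p tag wrappers to the content. The suggested fix: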

$postarray['post_content'] = $d->saveXML($d->getElementsByTagName('p')->item(0));

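only works when the content has no block-level elements inside it. When it does, as with the h1 element in the example below, the output of saveXML is truncated to...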

<p>If you like</p>

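I have been pointed to this post as a possible workaround, but I cannot work out how to apply it to this solution (see the commented-out attempts below).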

Any suggestions?

function rseo_decorate_keyword($postarray) {
    global $post;
    $keyword = "Jasmine Tea";
    $content = "If you like <h1>jasmine tea</h1> you will really like it with Jasmine Tea flavors. This is the last ocurrence of the phrase jasmine tea within the content. If there are other instances of the keyword jasmine tea within the text what happens to jasmine tea.";
    $d = new DOMDocument();
    @$d->loadHTML($content);
    $x = new DOMXpath($d);
    // translate() lowercases the text, so compare against a lowercased keyword
    $needle = strtolower($keyword);
    $count = $x->evaluate("count(//text()[contains(translate(., 'ABCDEFGHJIKLMNOPQRSTUVWXYZ', 'abcdefghjiklmnopqrstuvwxyz'), '$needle') and (ancestor::b or ancestor::strong)])");
    if ($count > 0) return $postarray;
    $nodes = $x->query("//text()[contains(translate(., 'ABCDEFGHJIKLMNOPQRSTUVWXYZ', 'abcdefghjiklmnopqrstuvwxyz'), '$needle') and not(ancestor::h1) and not(ancestor::h2) and not(ancestor::h3) and not(ancestor::h4) and not(ancestor::h5) and not(ancestor::h6) and not(ancestor::b) and not(ancestor::strong)]");
    if ($nodes && $nodes->length) {
        $node = $nodes->item(0);
        // Split just before the keyword (case-insensitive match)
        $keynode = $node->splitText(stripos($node->textContent, $keyword));
        // Split after the keyword
        $node->nextSibling->splitText(strlen($keyword));
        // Replace keyword with <strong>keyword</strong>
        $replacement = $d->createElement('strong', $keynode->textContent);
        $keynode->parentNode->replaceChild($replacement, $keynode);
    }
    $postarray['post_content'] = $d->saveXML($d->getElementsByTagName('p')->item(0));
    // $postarray['post_content'] = $d->saveXML($d->getElementsByTagName('body')->item(1));
    // $postarray['post_content'] = $d->saveXML($d->getElementsByTagName('body')->childNodes);
    return $postarray;
}

Answer 1

All of these answers are now wrong, because as of PHP 5.4 and Libxml 2.6, loadHTML has an $options parameter that tells Libxml how it should parse the content.

Therefore, if we load the HTML with these options

$html->loadHTML($content, LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);

then saveHTML() will output no doctype, no <html> and no <body>.

LIBXML_HTML_NOIMPLIED turns off the automatic adding of implied html/body elements. LIBXML_HTML_NODEFDTD prevents a default doctype from being added when one is not found.

The full list of Libxml parameters is documented in the PHP manual under the predefined libxml constants.

(Note that the loadHTML docs say Libxml 2.6 is needed, but LIBXML_HTML_NODEFDTD is only available in Libxml 2.7.8 and LIBXML_HTML_NOIMPLIED is available in Libxml 2.7.7.)
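
To make the above concrete, here is a minimal sketch of a full load-and-save round trip with both flags. The sample $content and the single wrapping <div> are assumptions of this example, not part of the question. The wrapper is there because, with LIBXML_HTML_NOIMPLIED, Libxml appears to treat the first element as the document root, so a fragment with several top-level nodes may come back rearranged.

```php
<?php
// Hypothetical sample fragment; the single <div> root is an assumption of this sketch.
$content = '<div><p>If you like</p><h1>jasmine tea</h1></div>';

$doc = new DOMDocument();
// Both flags together need Libxml 2.7.8 or later.
$doc->loadHTML($content, LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);

// saveHTML() now serializes only what was loaded: no doctype, <html> or <body>.
$output = trim($doc->saveHTML());
echo $output, PHP_EOL;
```

If the wrapper itself is unwanted, it can be stripped from $output afterwards, or each child of the wrapper can be passed to saveHTML($node) in turn.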


Answer 2

Just remove the nodes directly after loading the document with loadHTML():

# remove <!DOCTYPE 
$doc->removeChild($doc->doctype);           

# remove <html><body></body></html> 
$doc->replaceChild($doc->firstChild->firstChild->firstChild, $doc->firstChild);
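
For reference, a self-contained sketch of the same removal, run end to end. The sample markup and the $wrapper variable are assumptions made for illustration. Note that the replaceChild() call keeps only the first child of <body>, so in this sketch the content is wrapped in a single <div> to avoid losing later siblings such as the h1 from the question.

```php
<?php
$doc = new DOMDocument();
// Hypothetical sample fragment with a single root element.
$doc->loadHTML('<div><p>If you like</p><h1>jasmine tea</h1></div>');

// Remove the <!DOCTYPE ...> node that loadHTML() adds by default.
if ($doc->doctype !== null) {
    $doc->removeChild($doc->doctype);
}

// After the removal, firstChild is <html>, its firstChild is <body>,
// and the first child of <body> is the <div> wrapper.
$wrapper = $doc->firstChild->firstChild->firstChild;
$doc->replaceChild($wrapper, $doc->firstChild);

$output = trim($doc->saveHTML());
echo $output, PHP_EOL;
```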
