Closing open HTML tags in a string

php string regex

2022-08-30 20:18:30

The situation: I have a string that produces the following result:

<p>This is some text and here is a <strong>bold text then the post stop here....</p>

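Because the function returns a teaser (summary) of the text, it stops after a certain number of words. In this case the strong tag is left unclosed, even though the whole string is wrapped in a paragraph.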

Is it possible to convert the above result/output into the following?

<p>This is some text and here is a <strong>bold text then the post stop here....</strong></p>

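I don't know where to begin. The problem is that the function I found on the web, which does this with a regex, puts the closing tag after the end of the string. That won't validate, because I want all opening and closing tags to sit inside the paragraph tag. The function I found produces this, which is wrong: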

<p>This is some text and here is a <strong>bold text then the post stop here....</p></strong>

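Note that the tag could be strong, italic, anything. That is why I can't simply append to the function and close the tag manually there. Is there a pattern that can do this for me?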

Answer 1

Here is a function I've used before, and it works pretty well:

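function closetags($html) {
    // Collect the names of all opening tags, skipping void elements
    // (meta, img, br, hr, input) and self-closed tags such as <foo/>.
    // [a-z][a-z0-9]* also matches names with digits, e.g. h1 to h6.
    preg_match_all('#<(?!(?:meta|img|br|hr|input)\b)\b([a-z][a-z0-9]*)(?: .*)?(?<![/ ])>#iU', $html, $result);
    $openedtags = $result[1];
    // Collect the names of all closing tags.
    preg_match_all('#</([a-z][a-z0-9]*)>#iU', $html, $result);
    $closedtags = $result[1];
    $len_opened = count($openedtags);
    // Same number of openers and closers: assume nothing is missing.
    if (count($closedtags) == $len_opened) {
        return $html;
    }
    // Walk the openers innermost first and pair each with a closer.
    $openedtags = array_reverse($openedtags);
    for ($i = 0; $i < $len_opened; $i++) {
        if (!in_array($openedtags[$i], $closedtags)) {
            // No closer left for this tag: append one to the end of the string.
            $html .= '</'.$openedtags[$i].'>';
        } else {
            // Consume one matching closer so it is not counted twice.
            unset($closedtags[array_search($openedtags[$i], $closedtags)]);
        }
    }
    return $html;
}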
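Note that a function like the one above appends any missing closers to the very end of the string, so for the input in the question the closing strong tag lands after the closing p tag. Below is a minimal sketch of a stack-based variant that emits the missing closers just before the enclosing closing tag instead. The helper name `closeTagsInPlace`, the list of void elements and the tokenizing regex are my own assumptions, not part of the original answer, and it ignores edge cases such as a `>` inside an attribute value.

```php
<?php
// Hypothetical helper (not from the original answer): closes unclosed tags
// at the point where their parent is closed, instead of at the end.
function closeTagsInPlace(string $html): string
{
    // Void elements never take a closing tag.
    $void = ['area', 'base', 'br', 'col', 'embed', 'hr', 'img', 'input',
             'link', 'meta', 'source', 'track', 'wbr'];

    // Split into tags and text, keeping the tags in the result.
    $tokens = preg_split('#(</?[a-z][a-z0-9]*\b[^>]*>)#i', $html, -1,
                         PREG_SPLIT_DELIM_CAPTURE | PREG_SPLIT_NO_EMPTY);

    $stack = [];   // names of the elements that are currently open
    $out = '';

    foreach ($tokens as $token) {
        if (!preg_match('#^<(/?)([a-z][a-z0-9]*)\b[^>]*?(/?)>$#i', $token, $m)) {
            $out .= $token;                  // plain text, copy through
            continue;
        }
        $name = strtolower($m[2]);
        if ($m[1] === '') {
            // Opening tag: remember it unless it is void or self-closed.
            if ($m[3] === '' && !in_array($name, $void, true)) {
                $stack[] = $name;
            }
            $out .= $token;
        } elseif (in_array($name, $stack, true)) {
            // Closing tag: first close everything opened inside it.
            while (($top = array_pop($stack)) !== $name) {
                $out .= '</' . $top . '>';
            }
            $out .= $token;
        }
        // A closing tag with no matching opener is dropped.
    }

    // Close whatever is still open at the end of the string.
    while ($stack) {
        $out .= '</' . array_pop($stack) . '>';
    }
    return $out;
}

$str = '<p>This is some text and here is a <strong>bold text then the post stop here....</p>';
echo closeTagsInPlace($str), "\n";
// <p>This is some text and here is a <strong>bold text then the post stop here....</strong></p>
```

Dropping stray closing tags is a design choice made here for validity; keeping them verbatim would be an equally reasonable option.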

Personally, though, I wouldn't do this with regex; I'd use a library such as Tidy. That would look something like this:

$str = '<p>This is some text and here is a <strong>bold text then the post stop here....</p>';
$tidy = new Tidy();
$clean = $tidy->repairString($str, array(
    'output-xml' => true,
    'input-xml' => true
));
echo $clean;
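If the Tidy extension is not installed, PHP's DOM extension can usually do a similar repair, since libxml's HTML parser closes dangling inline tags when it meets the parent's closing tag. This is only a sketch under assumptions: `repairFragment` is a hypothetical helper name, the DOM extension must be enabled, the input is assumed to be non-empty ASCII or Latin-1 (UTF-8 input needs extra encoding handling), and the fragment should have a single root element, as the paragraph in the question does.

```php
<?php
// Sketch only: assumes ext-dom is enabled. repairFragment() is a
// hypothetical helper name, not a PHP built-in.
function repairFragment(string $html): string
{
    $doc = new DOMDocument();

    // Collect parser warnings about the broken markup instead of printing them.
    $previous = libxml_use_internal_errors(true);

    // NOIMPLIED: do not wrap the fragment in <html><body>.
    // NODEFDTD:  do not add a default doctype.
    $doc->loadHTML($html, LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);

    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    // saveHTML() appends a trailing newline, so trim it.
    return trim((string) $doc->saveHTML());
}
```

Calling `repairFragment($str)` on the string from the question should give back the paragraph with the strong tag closed before the closing p tag.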

Answer 2

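A small modification to the original answer. While the original answer handled the tags correctly, I found that my truncation could leave me with chopped-up tags. For example: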

This text has some <b>in it</b>

Truncating at character 21 results in:

This text has some <

The following code builds on the next-best answer and fixes this:

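function truncateHTML($html, $length)
{
    // Cut at $length bytes, but never in the middle of a tag.
    $head = substr($html, 0, $length);
    $lastOpen = strrpos($head, '<');
    $lastClose = strrpos($head, '>');
    if ($lastOpen !== false && ($lastClose === false || $lastOpen > $lastClose))
    {
        // The cut landed inside a tag: extend it to that tag's ">",
        // or drop the partial tag if no ">" follows.
        $pos = strpos($html, '>', $lastOpen);
        $head = ($pos !== false) ? substr($html, 0, $pos + 1) : substr($html, 0, $lastOpen);
    }
    $html = $head;

    // Collect opening tags, skipping void elements and self-closed tags.
    preg_match_all('#<(?!(?:meta|img|br|hr|input)\b)\b([a-z][a-z0-9]*)(?: .*)?(?<![/ ])>#iU', $html, $result);
    $openedtags = $result[1];

    // Collect closing tags.
    preg_match_all('#</([a-z][a-z0-9]*)>#iU', $html, $result);
    $closedtags = $result[1];

    $len_opened = count($openedtags);

    if (count($closedtags) == $len_opened)
    {
        return $html;
    }

    // Append a closer for every opener that has none, innermost first.
    $openedtags = array_reverse($openedtags);
    for ($i = 0; $i < $len_opened; $i++)
    {
        if (!in_array($openedtags[$i], $closedtags))
        {
            $html .= '</'.$openedtags[$i].'>';
        }
        else
        {
            unset($closedtags[array_search($openedtags[$i], $closedtags)]);
        }
    }

    return $html;
}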


$str = "This text has <b>bold</b> in it";
print "Test 1 - Truncate with no tag: " . truncateHTML($str, 5) . "<br>\n";
print "Test 2 - Truncate at start of tag: " . truncateHTML($str, 15) . "<br>\n";
print "Test 3 - Truncate in the middle of a tag: " . truncateHTML($str, 16) . "<br>\n";
print "Test 4 - Length longer than the text: " . truncateHTML($str, 300) . "<br>\n";
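A different way to avoid chopped-up tags altogether is to count only the visible text and copy tags through whole. The sketch below does that. It is not the code above: `truncateVisibleText`, the void-element list and the tokenizing regex are assumptions of mine, it counts bytes just like `substr` does (so multibyte characters and entities such as `&amp;` can still be cut), and it assumes the HTML is well formed before truncation.

```php
<?php
// Hypothetical helper: keeps at most $limit bytes of visible text, copies
// tags through untouched, and closes whatever is still open at the cut.
function truncateVisibleText(string $html, int $limit): string
{
    $void = ['area', 'base', 'br', 'col', 'embed', 'hr', 'img', 'input',
             'link', 'meta', 'source', 'track', 'wbr'];

    // Split into tags and text, keeping the tags in the result.
    $tokens = preg_split('#(</?[a-z][a-z0-9]*\b[^>]*>)#i', $html, -1,
                         PREG_SPLIT_DELIM_CAPTURE | PREG_SPLIT_NO_EMPTY);

    $stack = [];            // names of the elements that are currently open
    $out = '';
    $remaining = $limit;    // text budget left, in bytes

    foreach ($tokens as $token) {
        if ($remaining <= 0) {
            break;          // budget used up, stop before the next token
        }
        if (preg_match('#^<(/?)([a-z][a-z0-9]*)\b[^>]*?(/?)>$#i', $token, $m)) {
            $name = strtolower($m[2]);
            if ($m[1] === '/') {
                array_pop($stack);      // assumes well-formed input
            } elseif ($m[3] === '' && !in_array($name, $void, true)) {
                $stack[] = $name;
            }
            $out .= $token;             // tags do not count against the budget
            continue;
        }
        // Text: take as much as the budget allows.
        $piece = substr($token, 0, $remaining);
        $out .= $piece;
        $remaining -= strlen($piece);
    }

    // Close the elements that were open at the cut, innermost first.
    while ($stack) {
        $out .= '</' . array_pop($stack) . '>';
    }
    return $out;
}

echo truncateVisibleText('This text has <b>bold</b> in it', 16), "\n";
// This text has <b>bo</b>
```

Because the limit applies to text only, the result can be longer in bytes than `$limit` once the tags are included.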

Hope it helps someone out there.