编辑:现在我知道得更好
使用正则表达式来解决这类问题是一个坏主意,可能会导致不可维护和不可靠的代码。最好使用 HTML 解析器。
正则表达式解决方案
在这种情况下,最好将过程分成两部分:
我假设你的文档不是xHTML严格的,所以你不能使用XML解析器。例如,使用此网页源代码:
/* preg_match_all match the regexp in all the $html string and output everything as
an array in $result. "i" option is used to make it case insensitive */
preg_match_all('/<img[^>]+>/i',$html, $result);
print_r($result);
Array
(
[0] => Array
(
[0] => <img src="/Content/Img/stackoverflow-logo-250.png" width="250" height="70" alt="logo link to homepage" />
[1] => <img class="vote-up" src="/content/img/vote-arrow-up.png" alt="vote up" title="This was helpful (click again to undo)" />
[2] => <img class="vote-down" src="/content/img/vote-arrow-down.png" alt="vote down" title="This was not helpful (click again to undo)" />
[3] => <img src="http://www.gravatar.com/avatar/df299babc56f0a79678e567e87a09c31?s=32&d=identicon&r=PG" height=32 width=32 alt="gravatar image" />
[4] => <img class="vote-up" src="/content/img/vote-arrow-up.png" alt="vote up" title="This was helpful (click again to undo)" />
[...]
)
)
然后,我们得到所有带有循环的img标签属性:
$img = array();
foreach( $result as $img_tag)
{
preg_match_all('/(alt|title|src)=("[^"]*")/i',$img_tag, $img[$img_tag]);
}
print_r($img);
Array
(
[<img src="/Content/Img/stackoverflow-logo-250.png" width="250" height="70" alt="logo link to homepage" />] => Array
(
[0] => Array
(
[0] => src="/Content/Img/stackoverflow-logo-250.png"
[1] => alt="logo link to homepage"
)
[1] => Array
(
[0] => src
[1] => alt
)
[2] => Array
(
[0] => "/Content/Img/stackoverflow-logo-250.png"
[1] => "logo link to homepage"
)
)
[<img class="vote-up" src="/content/img/vote-arrow-up.png" alt="vote up" title="This was helpful (click again to undo)" />] => Array
(
[0] => Array
(
[0] => src="/content/img/vote-arrow-up.png"
[1] => alt="vote up"
[2] => title="This was helpful (click again to undo)"
)
[1] => Array
(
[0] => src
[1] => alt
[2] => title
)
[2] => Array
(
[0] => "/content/img/vote-arrow-up.png"
[1] => "vote up"
[2] => "This was helpful (click again to undo)"
)
)
[<img class="vote-down" src="/content/img/vote-arrow-down.png" alt="vote down" title="This was not helpful (click again to undo)" />] => Array
(
[0] => Array
(
[0] => src="/content/img/vote-arrow-down.png"
[1] => alt="vote down"
[2] => title="This was not helpful (click again to undo)"
)
[1] => Array
(
[0] => src
[1] => alt
[2] => title
)
[2] => Array
(
[0] => "/content/img/vote-arrow-down.png"
[1] => "vote down"
[2] => "This was not helpful (click again to undo)"
)
)
[<img src="http://www.gravatar.com/avatar/df299babc56f0a79678e567e87a09c31?s=32&d=identicon&r=PG" height=32 width=32 alt="gravatar image" />] => Array
(
[0] => Array
(
[0] => src="http://www.gravatar.com/avatar/df299babc56f0a79678e567e87a09c31?s=32&d=identicon&r=PG"
[1] => alt="gravatar image"
)
[1] => Array
(
[0] => src
[1] => alt
)
[2] => Array
(
[0] => "http://www.gravatar.com/avatar/df299babc56f0a79678e567e87a09c31?s=32&d=identicon&r=PG"
[1] => "gravatar image"
)
)
[..]
)
)
正则表达式占用大量 CPU 资源,因此您可能需要缓存此页面。如果您没有缓存系统,则可以通过使用ob_start并从文本文件中加载/保存来调整自己的缓存系统。
这些东西是如何工作的?
首先,我们使用preg_ match_ all,一个函数,它获取与模式匹配的每个字符串,并将其输出到它的第三个参数中。
正则表达式 :
<img[^>]+>
我们将其应用于所有html网页。它可以被读取为以“<img
”开头的每个字符串,包含非“>”char并以>结尾。
(alt|title|src)=("[^"]*")
我们将其连续应用于每个 img 标记。它可以被解读为每个以“alt”,“title”或“src”开头的字符串,然后是“=”,然后是'“',一堆不是'''的东西,并以'''结尾。隔离 () 之间的子字符串。
最后,每次你想处理正则表达式时,拥有好的工具来快速测试它们非常方便。检查此在线正则表达式测试器。
编辑:回答第一条评论。
的确,我没有想到(希望是少数)使用单引号的人。
好吧,如果您只使用',只需将所有“替换为'”。
如果两者混合。首先,你应该打自己:-),然后尝试使用(“|”)而不是 “和 [^ø] 来替换 [^”]。