PHP sentence boundary detection

php regex text-segmentation nlp

2022-08-31 00:52:00

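I want to split text into sentences in PHP. I'm currently using a regex, which gives about 95% accuracy, and I would like to improve on that with a better approach. I've seen NLP tools that do this in Perl, Java, and C, but nothing that fits PHP. Do you know of such a tool?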

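Answer 1

An enhanced regex solution

Assuming you do care about handling abbreviations such as Mr. and Mrs., the following single-regex solution works pretty well: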

<?php // test.php Rev:20160820_1800
$split_sentences = '%(?#!php/i split_sentences Rev:20160820_1800)
    # Split sentences on whitespace between them.
    # See: http://stackoverflow.com/a/5844564/433790
    (?<=          # Sentence split location preceded by
      [.!?]       # either an end of sentence punct,
    | [.!?][\'"]  # or end of sentence punct and quote.
    )             # End positive lookbehind.
    (?<!          # But don\'t split after these:
      Mr\.        # Either "Mr."
    | Mrs\.       # Or "Mrs."
    | Ms\.        # Or "Ms."
    | Jr\.        # Or "Jr."
    | Dr\.        # Or "Dr."
    | Prof\.      # Or "Prof."
    | Sr\.        # Or "Sr."
    | T\.V\.A\.   # Or "T.V.A."
                 # Or... (you get the idea).
    )             # End negative lookbehind.
    \s+           # Split on whitespace between sentences,
    (?=\S)        # (but not at end of string).
    %xi';  // End $split_sentences.

$text = 'This is sentence one. Sentence two! Sentence thr'.
        'ee? Sentence "four". Sentence "five"! Sentence "'.
        'six"? Sentence "seven." Sentence \'eight!\' Dr. '.
        'Jones said: "Mrs. Smith you have a lovely daught'.
        'er!" The T.V.A. is a big project! '; // Note ws at end.

$sentences = preg_split($split_sentences, $text, -1, PREG_SPLIT_NO_EMPTY);
for ($i = 0; $i < count($sentences); ++$i) {
    printf("Sentence[%d] = [%s]\n", $i + 1, $sentences[$i]);
}
?>

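Note that you can easily add or remove abbreviations from the expression. Given the following test paragraph:

This is sentence one. Sentence two! Sentence three? Sentence "four". Sentence "five"! Sentence "six"? Sentence "seven." Sentence 'eight!' Dr. Jones said: "Mrs. Smith you have a lovely daughter!" The T.V.A. is a big project!

Here is the output from the script: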

Sentence[1] = [This is sentence one.]
Sentence[2] = [Sentence two!]
Sentence[3] = [Sentence three?]
Sentence[4] = [Sentence "four".]
Sentence[5] = [Sentence "five"!]
Sentence[6] = [Sentence "six"?]
Sentence[7] = [Sentence "seven."]
Sentence[8] = [Sentence 'eight!']
Sentence[9] = [Dr. Jones said: "Mrs. Smith you have a lovely daughter!"]
Sentence[10] = [The T.V.A. is a big project! ]
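
Since the abbreviation list is the part most likely to change, it can also be generated from an array rather than edited by hand. The following is only a sketch under that assumption: `build_sentence_splitter` is a hypothetical helper name, not part of the answer above, and it uses the compact (non-extended) form of the same lookbehinds.

```php
<?php
// Hypothetical helper (not from the original answer): builds the
// sentence-splitting regex from a list of abbreviations.
function build_sentence_splitter(array $abbreviations): string
{
    // Escape each abbreviation for use inside a '%'-delimited pattern.
    $alternatives = array_map(
        function (string $abbr): string {
            return preg_quote($abbr, '%');
        },
        $abbreviations
    );

    // An empty negative lookbehind "(?<!)" can never succeed,
    // so leave it out entirely when the list is empty.
    $negative = $alternatives
        ? '(?<!' . implode('|', $alternatives) . ')'
        : '';

    return '%(?<=[.!?]|[.!?][\'"])'  // after end punctuation, optionally quoted
         . $negative                 // but not directly after an abbreviation
         . '\s+(?=\S)%i';            // split on whitespace, not at end of string
}

$splitter = build_sentence_splitter(['Mr.', 'Mrs.', 'Ms.', 'Dr.', 'Prof.', 'T.V.A.']);
$parts = preg_split($splitter, 'Dr. Jones met Prof. Lee. They talked.', -1, PREG_SPLIT_NO_EMPTY);
print_r($parts); // ['Dr. Jones met Prof. Lee.', 'They talked.']
```

Each alternative in the lookbehind has a fixed length, which is what PCRE requires, so plain abbreviations work but entries containing quantifiers would not.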

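The basic regex solution

The author of the question commented that the above solution "overlooks many options" and is not generic enough. I'm not sure what that means, but the essence of the above expression is about as clean and simple as it gets. Here it is: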

$re = '/(?<=[.!?]|[.!?][\'"])\s+(?=\S)/';
$sentences = preg_split($re, $text, -1, PREG_SPLIT_NO_EMPTY);

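Note that both solutions correctly identify sentences that end with a quotation mark after the closing punctuation. If you don't care about matching sentences that end in a quotation mark, the regex can be simplified to just: /(?<=[.!?])\s+(?=\S)/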
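
To make the difference concrete, here is a small comparison of the two basic expressions. The sample string and variable names are made up for illustration.

```php
<?php
// Illustrative sample text (not from the original answer).
$sample = 'He said "Stop!" Then he left. She asked "Why?" Nobody knew.';

$with_quotes    = '/(?<=[.!?]|[.!?][\'"])\s+(?=\S)/'; // punctuation, optionally followed by a quote
$without_quotes = '/(?<=[.!?])\s+(?=\S)/';            // punctuation only

$quoted = preg_split($with_quotes, $sample, -1, PREG_SPLIT_NO_EMPTY);
$plain  = preg_split($without_quotes, $sample, -1, PREG_SPLIT_NO_EMPTY);

print_r($quoted); // ['He said "Stop!"', 'Then he left.', 'She asked "Why?"', 'Nobody knew.']
print_r($plain);  // ['He said "Stop!" Then he left.', 'She asked "Why?" Nobody knew.']
```

The simplified pattern only sees whitespace that directly follows `.`, `!`, or `?`, so a closing quote between the punctuation and the whitespace hides the boundary.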

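Edit: 20130820_1000 Added T.V.A. (another punctuated word to be ignored) to the regex and the test string. (This answers PapyRef's comment question.)

Edit: 20130820_1800 Tidied and renamed the regex and added a shebang. Also fixed the regexes to prevent splitting text on trailing whitespace.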
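
The effect of the trailing-whitespace fix is easiest to see without PREG_SPLIT_NO_EMPTY. A minimal sketch, with an illustrative input string:

```php
<?php
$text = 'One. Two. '; // note the whitespace at the end

// Without the lookahead, the trailing space is a split point too,
// which leaves an empty string as the last element.
$naive = preg_split('/(?<=[.!?])\s+/', $text);

// With (?=\S), whitespace at the end of the string is never a split point,
// so it stays attached to the last sentence.
$guarded = preg_split('/(?<=[.!?])\s+(?=\S)/', $text);

print_r($naive);   // ['One.', 'Two.', '']
print_r($guarded); // ['One.', 'Two. ']
```

If the trailing whitespace itself is unwanted, running `array_map('trim', $guarded)` over the result removes it.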

Answer 2

A slight improvement on someone else's work:

$re = '/# Split sentences on whitespace between them.
(?<=                # Begin positive lookbehind.
  [.!?]             # Either an end of sentence punct,
| [.!?][\'"]        # or end of sentence punct and quote.
)                   # End positive lookbehind.
(?<!                # Begin negative lookbehind.
  Mr\.              # Skip either "Mr."
| Mrs\.             # or "Mrs.",
| Ms\.              # or "Ms.",
| Jr\.              # or "Jr.",
| Dr\.              # or "Dr.",
| Prof\.            # or "Prof.",
| Sr\.              # or "Sr.",
| \s[A-Z]\.         # or initials, e.g. "George W. Bush"
                    # (the i flag lets [A-Z] match lowercase too),
                    # or... (you get the idea).
)                   # End negative lookbehind.
\s+                 # Split on whitespace between sentences.
/ix';

$sentences = preg_split($re, $story, -1, PREG_SPLIT_NO_EMPTY);
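
As a quick check of the initials rule, here is a sketch that uses a compact, single-line form of the same pattern. The sample sentences are invented for illustration.

```php
<?php
// Compact equivalent of the extended-mode pattern above.
$re = '/(?<=[.!?]|[.!?][\'"])(?<!Mr\.|Mrs\.|Ms\.|Jr\.|Dr\.|Prof\.|Sr\.|\s[A-Z]\.)\s+/i';

// The initial "W." is no longer treated as a sentence end.
$a = preg_split($re, 'George W. Bush spoke. Dr. Smith listened. ', -1, PREG_SPLIT_NO_EMPTY);
print_r($a); // ['George W. Bush spoke.', 'Dr. Smith listened.']

// Trade-off: a sentence that really ends in a single letter is not split either.
$b = preg_split($re, 'Pick option A. Then continue.', -1, PREG_SPLIT_NO_EMPTY);
print_r($b); // ['Pick option A. Then continue.']
```

This pattern has no `(?=\S)` lookahead, so it does split on trailing whitespace. PREG_SPLIT_NO_EMPTY then discards the resulting empty element.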