如何使用斯坦福解析器将文本拆分为句子?

如何使用斯坦福解析器将文本或段落拆分为句子?

有没有可以提取句子的方法,比如为Ruby提供的?getSentencesFromString()


答案 1

您可以检查文档预处理器类。下面是一个简短的片段。我认为可能还有其他方法可以做你想做的事情。

String paragraph = "My 1st sentence. “Does it work for questions?” My third sentence.";
Reader reader = new StringReader(paragraph);
DocumentPreprocessor dp = new DocumentPreprocessor(reader);
List<String> sentenceList = new ArrayList<String>();

for (List<HasWord> sentence : dp) {
   // SentenceUtils not Sentence
   String sentenceString = SentenceUtils.listToString(sentence);
   sentenceList.add(sentenceString);
}

for (String sentence : sentenceList) {
   System.out.println(sentence);
}

答案 2

我知道已经有一个被接受的答案...但通常你只会从带注释的文档中获取句子注释。

// creates a StanfordCoreNLP object, with POS tagging, lemmatization, NER, parsing, and coreference resolution 
Properties props = new Properties();
props.put("annotators", "tokenize, ssplit, pos, lemma, ner, parse, dcoref");
StanfordCoreNLP pipeline = new StanfordCoreNLP(props);

// read some text in the text variable
String text = ... // Add your text here!

// create an empty Annotation just with the given text
Annotation document = new Annotation(text);

// run all Annotators on this text
pipeline.annotate(document);

// these are all the sentences in this document
// a CoreMap is essentially a Map that uses class objects as keys and has values with custom types
List<CoreMap> sentences = document.get(SentencesAnnotation.class);

for(CoreMap sentence: sentences) {
  // traversing the words in the current sentence
  // a CoreLabel is a CoreMap with additional token-specific methods
  for (CoreLabel token: sentence.get(TokensAnnotation.class)) {
    // this is the text of the token
    String word = token.get(TextAnnotation.class);
    // this is the POS tag of the token
    String pos = token.get(PartOfSpeechAnnotation.class);
    // this is the NER label of the token
    String ne = token.get(NamedEntityTagAnnotation.class);       
  }

}

源 - http://nlp.stanford.edu/software/corenlp.shtml(半途而废)

如果您只查找句子,则可以从管道初始化中删除后面的步骤,例如“parse”和“dcoref”,这将为您节省一些负载和处理时间。摇滚乐。~K


推荐