How do I get a Token from a Lucene TokenStream?

2022-08-31 13:07:49

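I'm trying to use Apache Lucene for tokenizing, and I'm baffled by the process of obtaining Tokens from a TokenStream.

The worst part is that I'm looking at the comments in the JavaDocs that address my question.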

http://lucene.apache.org/java/3_0_1/api/core/org/apache/lucene/analysis/TokenStream.html#incrementToken%28%29

Somehow, an AttributeSource is supposed to be used, rather than Tokens. I'm totally at a loss.

Can anyone explain how to get token-like information from a TokenStream?


Answer 1

Yeah, it's a little convoluted (compared to the good old way), but this should do it:

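// Original approach (Lucene 3.0-era API; TermAttribute was later deprecated).
TokenStream tokenStream = analyzer.tokenStream(fieldName, reader);
// Fetch the attributes once; incrementToken() updates them in place.
OffsetAttribute offsetAttribute = tokenStream.getAttribute(OffsetAttribute.class);
TermAttribute termAttribute = tokenStream.getAttribute(TermAttribute.class);

while (tokenStream.incrementToken()) {
    int startOffset = offsetAttribute.startOffset(); // start character offset of the token
    int endOffset = offsetAttribute.endOffset();     // end offset (one past the last character)
    String term = termAttribute.term();              // token text
}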

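Edit: The new way

According to Donotello, TermAttribute has been deprecated in favor of CharTermAttribute. According to jpountz (and Lucene's documentation), addAttribute is preferable to getAttribute, because it adds the attribute to the stream if it is not already present.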

TokenStream tokenStream = analyzer.tokenStream(fieldName, reader);
// addAttribute registers the attribute if it is missing and returns the shared instance.
OffsetAttribute offsetAttribute = tokenStream.addAttribute(OffsetAttribute.class);
CharTermAttribute charTermAttribute = tokenStream.addAttribute(CharTermAttribute.class);

tokenStream.reset(); // must be called before the first incrementToken()
while (tokenStream.incrementToken()) {
    int startOffset = offsetAttribute.startOffset();
    int endOffset = offsetAttribute.endOffset();
    String term = charTermAttribute.toString(); // copy the text; the attribute is reused for the next token
}
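
If it is unclear why the attributes are fetched once, before the loop, a toy model of the pattern may help. The sketch below is not Lucene code: ToyTermAttribute, ToyAttributeSource, ToyWhitespaceStream and ToyDemo are hypothetical names made up for illustration. They only mimic the idea that addAttribute hands back one shared instance, which incrementToken then overwrites for each token.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for an attribute such as CharTermAttribute (not a Lucene class).
class ToyTermAttribute {
    private String term = "";
    void set(String value) { term = value; }
    @Override public String toString() { return term; }
}

// Hypothetical stand-in for AttributeSource: one shared instance per attribute class.
class ToyAttributeSource {
    private final Map<Class<?>, Object> attributes = new HashMap<>();

    // Creates the attribute on first request, then always returns the same instance.
    <T> T addAttribute(Class<T> type) {
        Object existing = attributes.get(type);
        if (existing == null) {
            try {
                existing = type.getDeclaredConstructor().newInstance();
            } catch (ReflectiveOperationException e) {
                throw new IllegalArgumentException("Cannot create " + type, e);
            }
            attributes.put(type, existing);
        }
        return type.cast(existing);
    }
}

// Hypothetical whitespace tokenizer following the reset/incrementToken workflow.
class ToyWhitespaceStream extends ToyAttributeSource {
    private final String[] words;
    private int position = -1; // -1 means reset() has not been called yet
    private final ToyTermAttribute termAttribute = addAttribute(ToyTermAttribute.class);

    ToyWhitespaceStream(String text) {
        String trimmed = text.trim();
        words = trimmed.isEmpty() ? new String[0] : trimmed.split("\\s+");
    }

    void reset() { position = 0; }

    // Advances to the next token by overwriting the shared attribute; false when exhausted.
    boolean incrementToken() {
        if (position < 0) throw new IllegalStateException("call reset() first");
        if (position >= words.length) return false;
        termAttribute.set(words[position++]);
        return true;
    }
}

class ToyDemo {
    static List<String> tokenize(String text) {
        ToyWhitespaceStream stream = new ToyWhitespaceStream(text);
        // Same instance the stream writes into, fetched once before the loop.
        ToyTermAttribute term = stream.addAttribute(ToyTermAttribute.class);
        List<String> tokens = new ArrayList<>();
        stream.reset();
        while (stream.incrementToken()) {
            tokens.add(term.toString()); // copy the value; the attribute is reused
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("how to get tokens"));
    }
}
```

The point of the design is that no per-token object is allocated: the consumer reads the current token through the attribute and copies out whatever it needs before the next incrementToken() call.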

Answer 2

This is how it should be (a clean version of Adam's answer):

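TokenStream stream = analyzer.tokenStream(null, new StringReader(text));
CharTermAttribute cattr = stream.addAttribute(CharTermAttribute.class);
stream.reset();                          // prepare the stream before consuming it
while (stream.incrementToken()) {
  System.out.println(cattr.toString());  // print each token's text
}
stream.end();                            // perform end-of-stream operations (e.g. set the final offset)
stream.close();                          // release resources held by the stream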