Java Lucene NGramTokenizer

2022-09-04 19:39:54

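I am trying to tokenize strings into n-grams. Strangely, in the documentation for NGramTokenizer I do not see a method that returns the individual n-grams that were tokenized. In fact, I only see two methods in the NGramTokenizer class that return String objects.

Here is the code that I have: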

Reader reader = new StringReader("This is a test string");
NGramTokenizer gramTokenizer = new NGramTokenizer(reader, 1, 3);
  1. Where are the n-grams that were tokenized?
  2. How can I get the output as Strings/words?

I want my output to look like this: This, is, a, test, string, This is, is a, a test, test string, This is a, is a, is a test, a test string.


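Answer 1

I don't think you will find what you are looking for by searching for methods that return String. You need to work with Attributes instead.

It should work something like this: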

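import java.io.Reader;
import java.io.StringReader;
import org.apache.lucene.analysis.ngram.NGramTokenizer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

Reader reader = new StringReader("This is a test string");
NGramTokenizer gramTokenizer = new NGramTokenizer(reader, 1, 3);
// The attribute is a view onto the current token; its value changes on every incrementToken() call.
CharTermAttribute charTermAttribute = gramTokenizer.addAttribute(CharTermAttribute.class);
gramTokenizer.reset();

while (gramTokenizer.incrementToken()) {
    String token = charTermAttribute.toString();
    //Do something
}
gramTokenizer.end();
gramTokenizer.close();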

Be sure to reset() the Tokenizer if you need to reuse it afterwards, though.
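Note that NGramTokenizer produces n-grams of characters, not of words. To illustrate what that means, here is a minimal plain-Java sketch that does not use Lucene at all; the class and method names (CharNGramSketch, charNGrams) are made up for this example, and the order in which Lucene itself emits the grams may differ depending on the version.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of Lucene: lists character n-grams of a string.
final class CharNGramSketch {

    /** Returns every substring whose length is between minGram and maxGram, shortest first. */
    static List<String> charNGrams(String text, int minGram, int maxGram) {
        List<String> grams = new ArrayList<>();
        for (int n = minGram; n <= maxGram; n++) {
            // Slide a window of n characters across the text.
            for (int start = 0; start + n <= text.length(); start++) {
                grams.add(text.substring(start, start + n));
            }
        }
        return grams;
    }

    public static void main(String[] args) {
        // prints [t, e, s, t, te, es, st, tes, est]
        System.out.println(charNGrams("test", 1, 3));
    }
}
```

With sizes 1 to 3, the word "test" alone yields nine character grams, which is why this tokenizer never produces output such as "This is a".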


To tokenize groupings of words rather than characters, as discussed in the comments:

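Reader reader = new StringReader("This is a test string");
TokenStream tokenizer = new StandardTokenizer(Version.LUCENE_36, reader);
// ShingleFilter requires a minimum shingle size of at least 2; single words
// (unigrams) are still emitted because outputUnigrams defaults to true.
tokenizer = new ShingleFilter(tokenizer, 2, 3);
CharTermAttribute charTermAttribute = tokenizer.addAttribute(CharTermAttribute.class);
tokenizer.reset();

while (tokenizer.incrementToken()) {
    String token = charTermAttribute.toString();
    //Do something
}
tokenizer.end();
tokenizer.close();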
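To make the word-level grouping concrete, here is a plain-Java sketch without Lucene that builds word n-grams (shingles) grouped by size, the way the question lists them. The names WordShingleSketch and wordShingles are invented for this example, and splitting on whitespace is a simplifying assumption; StandardTokenizer applies more elaborate rules, and ShingleFilter may emit its tokens in a different order.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not part of Lucene: builds word n-grams from a sentence.
final class WordShingleSketch {

    /** Returns all runs of minSize..maxSize consecutive words, shortest runs first. */
    static List<String> wordShingles(String text, int minSize, int maxSize) {
        List<String> shingles = new ArrayList<>();
        String trimmed = text.trim();
        if (trimmed.isEmpty()) {
            return shingles; // no words, so no shingles
        }
        // Assumption: words are separated by whitespace only.
        String[] words = trimmed.split("\\s+");
        for (int n = minSize; n <= maxSize; n++) {
            for (int start = 0; start + n <= words.length; start++) {
                shingles.add(String.join(" ", Arrays.copyOfRange(words, start, start + n)));
            }
        }
        return shingles;
    }

    public static void main(String[] args) {
        // Prints the 12 shingles of sizes 1 to 3, one per line.
        for (String shingle : wordShingles("This is a test string", 1, 3)) {
            System.out.println(shingle);
        }
    }
}
```

For the five-word sample sentence this gives 5 single words, 4 two-word shingles, and 3 three-word shingles.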

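Answer 2

For a more recent version of Lucene (4.2.1), here is clean code that works. Before running it, you have to add 2 jar files to your classpath:

  • lucene-core-4.2.1.jar
  • lucene-analyzers-common-4.2.1.jar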

http://www.apache.org/dyn/closer.cgi/lucene/java/4.2.1

//LUCENE 4.2.1
Reader reader = new StringReader("This is a test string");
NGramTokenizer gramTokenizer = new NGramTokenizer(reader, 1, 3);

CharTermAttribute charTermAttribute = gramTokenizer.addAttribute(CharTermAttribute.class);
// The TokenStream contract expects reset() before the first incrementToken() call.
gramTokenizer.reset();

while (gramTokenizer.incrementToken()) {
    String token = charTermAttribute.toString();
    System.out.println(token);
}
gramTokenizer.end();
gramTokenizer.close();
