Generating N-grams from a sentence

2022-09-01 09:21:30

How can I generate the n-grams of a string like the following?

String Input="This is my car."

I want to generate n-grams from this input:

Input Ngram size = 3

The output should be:

This
is
my
car

This is
is my
my car

This is my
is my car

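Please suggest how to implement this in Java, or whether there is a library available for it.

I am trying to use this NGramTokenizer, but it produces n-grams of character sequences, and I want n-grams of word sequences.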


Answer 1

I believe this does what you want:

import java.util.*;

public class Test {

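    // Returns every run of n consecutive words in str, in order of appearance.
    public static List<String> ngrams(int n, String str) {
        List<String> ngrams = new ArrayList<>();
        // Words are separated by single spaces; punctuation stays attached to its word.
        String[] words = str.split(" ");
        // No n-grams are produced when the string has fewer than n words.
        for (int i = 0; i < words.length - n + 1; i++)
            ngrams.add(concat(words, i, i + n));
        return ngrams;
    }

    // Joins words[start] .. words[end - 1] with single spaces.
    public static String concat(String[] words, int start, int end) {
        StringBuilder sb = new StringBuilder();
        for (int i = start; i < end; i++)
            sb.append(i > start ? " " : "").append(words[i]);
        return sb.toString();
    }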

    public static void main(String[] args) {
        for (int n = 1; n <= 3; n++) {
            for (String ngram : ngrams(n, "This is my car."))
                System.out.println(ngram);
            System.out.println();
        }
    }
}

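Output (the trailing period stays attached to the last word, because the string is split on spaces only):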

This
is
my
car.

This is
is my
my car.

This is my
is my car.
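The question asks for "car" rather than "car.", and for all sizes up to 3 in one pass. A minimal sketch of one way to get that, assuming it is acceptable to discard every character that is not a letter, digit, or whitespace (which would also remove apostrophes and hyphens). The class name WordNgrams and the method allNgrams are made up for this illustration:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class WordNgrams {

    // Returns all word n-grams of sizes 1..maxN, shortest size first.
    static List<String> allNgrams(int maxN, String text) {
        List<String> result = new ArrayList<>();
        // Assumption: anything that is not a letter, digit, or whitespace is noise.
        String cleaned = text.replaceAll("[^\\p{L}\\p{N}\\s]", "").trim();
        if (cleaned.isEmpty()) {
            return result;
        }
        // Split on runs of whitespace so repeated spaces do not yield empty words.
        String[] words = cleaned.split("\\s+");
        for (int n = 1; n <= maxN; n++) {
            for (int i = 0; i + n <= words.length; i++) {
                result.add(String.join(" ", Arrays.copyOfRange(words, i, i + n)));
            }
        }
        return result;
    }

    public static void main(String[] args) {
        for (String ngram : allNgrams(3, "This is my car.")) {
            System.out.println(ngram);
        }
    }
}
```

For "This is my car." this prints the nine n-grams listed in the question, without the trailing period. If punctuation inside words matters for your data, the regular expression would need to be adjusted.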

An "on-demand" solution implemented as an iterator:

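import java.util.Iterator;
import java.util.NoSuchElementException;

class NgramIterator implements Iterator<String> {

    String[] words;
    int pos = 0, n;

    public NgramIterator(int n, String str) {
        this.n = n;
        words = str.split(" ");
    }

    // True while at least n words remain starting at pos.
    public boolean hasNext() {
        return pos < words.length - n + 1;
    }

    // Builds the n-gram starting at pos, then advances by one word.
    public String next() {
        if (!hasNext())
            throw new NoSuchElementException();
        StringBuilder sb = new StringBuilder();
        for (int i = pos; i < pos + n; i++)
            sb.append(i > pos ? " " : "").append(words[i]);
        pos++;
        return sb.toString();
    }

    public void remove() {
        throw new UnsupportedOperationException();
    }
}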
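On Java 8 or later, the same on-demand behavior can also be sketched with a stream instead of a hand-written iterator. This is an alternative for illustration, not part of the answer above; the class name LazyNgrams is made up, and the input is assumed to be non-blank:

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.IntStream;
import java.util.stream.Stream;

class LazyNgrams {

    // Lazily yields word n-grams; each one is built only when the stream is consumed.
    static Stream<String> ngrams(int n, String text) {
        String[] words = text.trim().split("\\s+");
        // IntStream.range is empty when there are fewer than n words.
        return IntStream.range(0, words.length - n + 1)
                .mapToObj(i -> String.join(" ", Arrays.copyOfRange(words, i, i + n)));
    }

    public static void main(String[] args) {
        // limit(2) stops the pipeline after the first two bigrams.
        List<String> firstTwo = ngrams(2, "This is my car.")
                .limit(2)
                .collect(Collectors.toList());
        System.out.println(firstTwo); // [This is, is my]
    }
}
```

As in the iterator version, punctuation stays attached to its word, so the last bigram of the full stream would be "my car.".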

Answer 2

You are looking for ShingleFilter.

Update: The link points to version 3.0.2. In newer versions of Lucene, this class may be in a different package.

