如何将OpenNLP与Java一起使用?

2022-09-02 19:26:06

我想POStag一个英语句子并进行一些处理。我想使用openNLP。我已经安装了它

当我执行命令时

I:\Workshop\Programming\nlp\opennlp-tools-1.5.0-bin\opennlp-tools-1.5.0>java -jar opennlp-tools-1.5.0.jar POSTagger models\en-pos-maxent.bin < Text.txt

它为输出提供在文本中标记输入的输出.txt

    Loading POS Tagger model ... done (4.009s)
My_PRP$ name_NN is_VBZ Shabab_NNP i_FW am_VBP 22_CD years_NNS old._.


Average: 66.7 sent/s
Total: 1 sent
Runtime: 0.015s

我希望它正确安装?

现在,我如何从Java应用程序内部进行此POStagging?我已经将openNLPtools,jwnl,maxent jar添加到项目中,但是我如何调用POStagging?


答案 1

以下是我编写的一些(旧)示例代码,以及要遵循的现代化代码:

package opennlp;

import opennlp.tools.cmdline.PerformanceMonitor;
import opennlp.tools.cmdline.postag.POSModelLoader;
import opennlp.tools.postag.POSModel;
import opennlp.tools.postag.POSSample;
import opennlp.tools.postag.POSTaggerME;
import opennlp.tools.tokenize.WhitespaceTokenizer;
import opennlp.tools.util.ObjectStream;
import opennlp.tools.util.PlainTextByLineStream;

import java.io.File;
import java.io.IOException;
import java.io.StringReader;

public class OpenNlpTest {
public static void main(String[] args) throws IOException {
    POSModel model = new POSModelLoader().load(new File("en-pos-maxent.bin"));
    PerformanceMonitor perfMon = new PerformanceMonitor(System.err, "sent");
    POSTaggerME tagger = new POSTaggerME(model);

    String input = "Can anyone help me dig through OpenNLP's horrible documentation?";
    ObjectStream<String> lineStream =
            new PlainTextByLineStream(new StringReader(input));

    perfMon.start();
    String line;
    while ((line = lineStream.read()) != null) {

        String whitespaceTokenizerLine[] = WhitespaceTokenizer.INSTANCE.tokenize(line);
        String[] tags = tagger.tag(whitespaceTokenizerLine);

        POSSample sample = new POSSample(whitespaceTokenizerLine, tags);
        System.out.println(sample.toString());

        perfMon.incrementCounter();
    }
    perfMon.stopAndPrintFinalResult();
}
}

输出为:

Loading POS Tagger model ... done (2.045s)
Can_MD anyone_NN help_VB me_PRP dig_VB through_IN OpenNLP's_NNP horrible_JJ documentation?_NN

Average: 76.9 sent/s 
Total: 1 sent
Runtime: 0.013s

这基本上是从作为OpenNLP的一部分包含的POSTaggerTool类开始工作的。是一个具有标记类型本身的数组。sample.getTags()String

这需要直接文件访问训练数据,这真的非常非常蹩脚。

为此更新的代码库略有不同(可能更有用)。

首先,一个Maven POM:

<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
    <modelVersion>4.0.0</modelVersion>

    <groupId>org.javachannel</groupId>
    <artifactId>opennlp-example</artifactId>
    <version>1.0-SNAPSHOT</version>
    <dependencies>
        <dependency>
            <groupId>org.apache.opennlp</groupId>
            <artifactId>opennlp-tools</artifactId>
            <version>1.6.0</version>
        </dependency>
        <dependency>
            <groupId>org.testng</groupId>
            <artifactId>testng</artifactId>
            <version>[6.8.21,)</version>
            <scope>test</scope>
        </dependency>
    </dependencies>
    <build>
        <plugins>
            <plugin>
                <groupId>org.apache.maven.plugins</groupId>
                <artifactId>maven-compiler-plugin</artifactId>
                <version>3.1</version>
                <configuration>
                    <source>1.8</source>
                    <target>1.8</target>
                </configuration>
            </plugin>
        </plugins>
    </build>
</project>

这是作为测试编写的代码,因此位于:./src/test/java/org/javachannel/opennlp/example

package org.javachannel.opennlp.example;

import opennlp.tools.cmdline.PerformanceMonitor;
import opennlp.tools.postag.POSModel;
import opennlp.tools.postag.POSSample;
import opennlp.tools.postag.POSTaggerME;
import opennlp.tools.tokenize.WhitespaceTokenizer;
import org.testng.annotations.DataProvider;
import org.testng.annotations.Test;

import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;
import java.net.URL;
import java.nio.channels.Channels;
import java.nio.channels.ReadableByteChannel;
import java.util.stream.Stream;

public class POSTest {
    private void download(String url, File destination) throws IOException {
        URL website = new URL(url);
        ReadableByteChannel rbc = Channels.newChannel(website.openStream());
        FileOutputStream fos = new FileOutputStream(destination);
        fos.getChannel().transferFrom(rbc, 0, Long.MAX_VALUE);
    }

    @DataProvider
    Object[][] getCorpusData() {
        return new Object[][][]{{{
                "Can anyone help me dig through OpenNLP's horrible documentation?"
        }}};
    }

    @Test(dataProvider = "getCorpusData")
    public void showPOS(Object[] input) throws IOException {
        File modelFile = new File("en-pos-maxent.bin");
        if (!modelFile.exists()) {
            System.out.println("Downloading model.");
            download("http://opennlp.sourceforge.net/models-1.5/en-pos-maxent.bin", modelFile);
        }
        POSModel model = new POSModel(modelFile);
        PerformanceMonitor perfMon = new PerformanceMonitor(System.err, "sent");
        POSTaggerME tagger = new POSTaggerME(model);

        perfMon.start();
        Stream.of(input).map(line -> {
            String whitespaceTokenizerLine[] = WhitespaceTokenizer.INSTANCE.tokenize(line.toString());
            String[] tags = tagger.tag(whitespaceTokenizerLine);

            POSSample sample = new POSSample(whitespaceTokenizerLine, tags);

            perfMon.incrementCounter();
            return sample.toString();
        }).forEach(System.out::println);
        perfMon.stopAndPrintFinalResult();
    }
}

这段代码实际上并没有测试任何东西 - 如果有的话,它是一个烟雾测试 - 但它应该作为一个起点。另一个(可能)好的事情是,如果您还没有下载模型,它会为您下载模型。


答案 2

URL http://bulba.sdsu.edu/jeanette/thesis/PennTags.html 不再起作用。我在 http://www.slideshare.net/gagan1667/opennlp-demo 的第14张幻灯片上找到了以下内容

enter image description here


推荐