Reading a large text file with 70 million lines in Java

java io

2022-09-01 00:28:59

I have a large test file with 70 million lines of text. I have to read the file line by line.

I used two different approaches:

InputStreamReader isr = new InputStreamReader(new FileInputStream(FilePath),"unicode");
BufferedReader br = new BufferedReader(isr);
while((cur=br.readLine()) != null);

and

LineIterator it = FileUtils.lineIterator(new File(FilePath), "unicode");
while(it.hasNext()) cur=it.nextLine();

Is there another approach that would make this task faster?

Answer 1

1) I am sure there is no difference in speed; both use FileInputStream and buffering internally.

2) You can take measurements and see for yourself.
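For point 2, a rough timing harness could look like the sketch below. The class name ReadTiming and the UTF-8 charset are assumptions made for illustration (the question uses "unicode"), and this is a simple wall-clock measurement rather than a rigorous benchmark. The loop body can be swapped for any of the other reading approaches to compare them on the same file.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical helper class, not part of the original question or answers.
final class ReadTiming {

    /** Reads every line with a BufferedReader and returns how many lines were seen. */
    static long countLines(Path file) throws IOException {
        long lines = 0;
        // Assumption: the file is UTF-8; change the charset to match the real file.
        try (BufferedReader br = Files.newBufferedReader(file, StandardCharsets.UTF_8)) {
            while (br.readLine() != null) {
                lines++;
            }
        }
        return lines;
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("usage: java ReadTiming <file>");
            return;
        }
        Path file = Paths.get(args[0]);
        // Repeat the run so that JIT warm-up and the OS file cache show up in the numbers.
        for (int run = 1; run <= 3; run++) {
            long start = System.nanoTime();
            long lines = countLines(file);
            long millis = (System.nanoTime() - start) / 1_000_000;
            System.out.println("run " + run + ": " + lines + " lines in " + millis + " ms");
        }
    }
}
```

The first run usually includes disk I/O and JIT compilation, so later runs tend to say more about the reading code itself than about the storage.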

3) Though there is no performance benefit, I like the Java 1.7 approach:

try (BufferedReader br = Files.newBufferedReader(Paths.get("test.txt"), StandardCharsets.UTF_8)) {
    for (String line = null; (line = br.readLine()) != null;) {
        // process the line here
    }
}

4) A Scanner-based version:

    try (Scanner sc = new Scanner(new File("test.txt"), "UTF-8")) {
        while (sc.hasNextLine()) {
            String line = sc.nextLine();
        }
        // note that Scanner suppresses exceptions
        if (sc.ioException() != null) {
            throw sc.ioException();
        }
    }

5) This may be faster than the rest:

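try (SeekableByteChannel ch = Files.newByteChannel(Paths.get("test.txt"))) {
    ByteBuffer bb = ByteBuffer.allocateDirect(1000);
    // a line may span several reads, so the builder lives outside the loop
    StringBuilder line = new StringBuilder();
    while (ch.read(bb) != -1) {   // read() returns -1 at end of file
        bb.flip();                // switch the buffer from filling to draining
        // add chars to line
        // ...
        bb.clear();               // make the buffer ready for the next read
    }
}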

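It requires a bit of coding, but it can really be faster because of ByteBuffer.allocateDirect. It allows the OS to read bytes from the file directly into the ByteBuffer, without copying.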
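The elided "add chars to line" step in point 5 could be filled in roughly as follows. This is a sketch under assumptions, not the answerer's code: the class name ChannelLineReader, the 64 KiB buffer sizes, and the UTF-8 charset are all chosen for illustration. A CharsetDecoder is used so that a multi-byte character split across two reads is decoded correctly.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.channels.SeekableByteChannel;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CoderResult;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.function.Consumer;

// Hypothetical helper class; assumes the file is UTF-8.
final class ChannelLineReader {

    /** Passes each line of the file (without its terminator) to the handler. */
    static void forEachLine(Path file, Consumer<String> handler) throws IOException {
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPLACE)
                .onUnmappableCharacter(CodingErrorAction.REPLACE);
        ByteBuffer bytes = ByteBuffer.allocateDirect(64 * 1024);
        CharBuffer chars = CharBuffer.allocate(64 * 1024);
        StringBuilder line = new StringBuilder();

        try (SeekableByteChannel ch = Files.newByteChannel(file)) {
            boolean eof = false;
            while (!eof) {
                eof = ch.read(bytes) == -1;
                bytes.flip();
                CoderResult result;
                do {
                    result = decoder.decode(bytes, chars, eof);
                    chars.flip();
                    drain(chars, line, handler);
                    chars.clear();
                } while (result.isOverflow());
                // Keep the bytes of an incomplete multi-byte character for the next read.
                bytes.compact();
            }
            decoder.flush(chars);
            chars.flip();
            drain(chars, line, handler);
        }
        // A last line without a trailing newline is still reported.
        if (line.length() > 0) {
            handler.accept(line.toString());
        }
    }

    /** Lines end at '\n'; a '\r' directly before it is dropped. */
    private static void drain(CharBuffer chars, StringBuilder line, Consumer<String> handler) {
        while (chars.hasRemaining()) {
            char c = chars.get();
            if (c == '\n') {
                int len = line.length();
                if (len > 0 && line.charAt(len - 1) == '\r') {
                    line.setLength(len - 1);
                }
                handler.accept(line.toString());
                line.setLength(0);
            } else {
                line.append(c);
            }
        }
    }
}
```

A call such as `ChannelLineReader.forEachLine(Paths.get("test.txt"), line -> { /* process */ })` would then stand in for the readLine loop. Whether this beats BufferedReader on a given machine is something to measure rather than assume, since each line is still copied into a String.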

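6) Parallel processing would definitely increase speed. Make a big byte buffer and run several tasks that read bytes from the file into that buffer in parallel. When ready, find the first end of line, make a String, find the next...

Answer 2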
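One way to try the parallel idea from point 6 is sketched below. It is a simplified variant, not the scheme described word for word: each task reads its own byte range with positional FileChannel reads into a private buffer and only counts newline bytes, instead of filling one shared buffer and building Strings. The class name ParallelNewlineCounter and the 64 KiB buffer size are assumptions. Counting the byte 0x0A is only valid for encodings such as ASCII or UTF-8 where that byte always means a line feed, so it would not work as written for the UTF-16 ("unicode") file in the question.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical helper class; assumes an ASCII-compatible encoding such as UTF-8.
final class ParallelNewlineCounter {

    /** Splits the file into `tasks` byte ranges and counts '\n' bytes in parallel. */
    static long countNewlines(Path file, int tasks)
            throws IOException, InterruptedException, ExecutionException {
        ExecutorService pool = Executors.newFixedThreadPool(tasks);
        try (FileChannel ch = FileChannel.open(file, StandardOpenOption.READ)) {
            long size = ch.size();
            long chunk = (size + tasks - 1) / tasks; // ceiling division
            List<Future<Long>> parts = new ArrayList<>();
            for (int i = 0; i < tasks; i++) {
                long start = Math.min(size, i * chunk);
                long end = Math.min(size, start + chunk);
                Callable<Long> task = () -> countRange(ch, start, end);
                parts.add(pool.submit(task));
            }
            long total = 0;
            for (Future<Long> part : parts) {
                total += part.get(); // wait for every range before closing the channel
            }
            return total;
        } finally {
            pool.shutdown();
        }
    }

    /** Counts '\n' bytes in [start, end) using reads at explicit positions. */
    private static long countRange(FileChannel ch, long start, long end) throws IOException {
        ByteBuffer buf = ByteBuffer.allocateDirect(1 << 16);
        long count = 0;
        long pos = start;
        while (pos < end) {
            buf.clear();
            buf.limit((int) Math.min(buf.capacity(), end - pos));
            // Positional read: does not touch the channel's shared position.
            int n = ch.read(buf, pos);
            if (n <= 0) {
                break;
            }
            buf.flip();
            while (buf.hasRemaining()) {
                if (buf.get() == '\n') {
                    count++;
                }
            }
            pos += n;
        }
        return count;
    }
}
```

Turning the ranges into Strings takes extra care, because a line can straddle two ranges; each task would have to skip to the first line start in its range and read past its end to finish the last line. Any speedup also depends on the storage: a single spinning disk may gain little from concurrent reads.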

If you are looking at performance, you could have a look at the java.nio.* packages; those are supposedly faster than java.io.*.
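As one example of what java.nio offers beyond java.io, a file can be memory-mapped and scanned as a buffer. The sketch below is only an illustration of that API; the class name MappedNewlineCounter is hypothetical, it counts newline bytes rather than building lines, and it assumes an ASCII-compatible encoding. Whether mapping is actually faster than buffered streams for a given file is something to measure.

```java
import java.io.IOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Hypothetical helper class; assumes an ASCII-compatible encoding such as UTF-8.
final class MappedNewlineCounter {

    /** Counts '\n' bytes by scanning read-only memory mappings of the file. */
    static long countNewlines(Path file) throws IOException {
        long count = 0;
        try (FileChannel ch = FileChannel.open(file, StandardOpenOption.READ)) {
            long size = ch.size();
            long pos = 0;
            while (pos < size) {
                // A single mapping is limited to Integer.MAX_VALUE bytes, so map in windows.
                long len = Math.min(Integer.MAX_VALUE, size - pos);
                MappedByteBuffer window = ch.map(FileChannel.MapMode.READ_ONLY, pos, len);
                while (window.hasRemaining()) {
                    if (window.get() == '\n') {
                        count++;
                    }
                }
                pos += len;
            }
        }
        return count;
    }
}
```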