Good and effective CSV/TSV reader for Java

2022-09-03 13:04:44

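I am trying to read big CSV and TSV (tab-separated) files with about 1000000 rows or more. I tried to read a TSV file containing ~2500000 lines with opencsv, but it throws a java.lang.NullPointerException. It works with smaller TSV files of ~250000 lines. So I was wondering whether there are any other libraries that support reading huge CSV and TSV files. Do you have any ideas?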

For everybody who is interested in my code (I shortened it, so the try-catch is obviously invalid):

InputStreamReader in = null;
CSVReader reader = null;
try {
    in = this.replaceBackSlashes();
    reader = new CSVReader(in, this.seperator, '\"', this.offset);
    ret = reader.readAll();
} finally {
    try {
        reader.close();
    } 
}

EDIT: This is the method where I construct the InputStreamReader:

private InputStreamReader replaceBackSlashes() throws Exception {
        FileInputStream fis = null;
        Scanner in = null;
        try {
            fis = new FileInputStream(this.csvFile);
            in = new Scanner(fis, this.encoding);
            ByteArrayOutputStream out = new ByteArrayOutputStream();

            while (in.hasNext()) {
                String nextLine = in.nextLine().replace("\\", "/");
                // nextLine = nextLine.replaceAll(" ", "");
                nextLine = nextLine.replaceAll("'", "");
                out.write(nextLine.getBytes());
                out.write("\n".getBytes());
            }

            return new InputStreamReader(new ByteArrayInputStream(out.toByteArray()));
        } catch (Exception e) {
            in.close();
            fis.close();
            this.logger.error("Problem at replaceBackSlashes", e);
        }
        throw new Exception();
    }

Answer 1

Do not use a CSV parser to parse TSV input. It will break, for example, if the TSV has fields that contain a quote character.
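To illustrate the point, here is a minimal sketch that uses only the JDK and no parsing library. The class name `TsvSplitDemo`, the helper `splitTsvLine`, and the sample row are made up for this example. It treats the tab as the only delimiter, so a stray double quote inside a field stays ordinary data, whereas a CSV parser would typically read it as the start of a quoted field. The sketch deliberately ignores TSV escape sequences (such as an escaped tab inside a value), which a real TSV parser would handle.

```java
import java.util.Arrays;

public class TsvSplitDemo {
    // Splits one TSV line on tab characters only; quote characters are plain data.
    // The -1 limit keeps trailing empty fields instead of silently dropping them.
    public static String[] splitTsvLine(String line) {
        return line.split("\t", -1);
    }

    public static void main(String[] args) {
        // Hypothetical row: the second field contains an unbalanced double quote.
        String line = "42\tscrew 5\" long\tsteel";
        String[] fields = splitTsvLine(line);
        System.out.println(fields.length);
        System.out.println(Arrays.toString(fields));
    }
}
```

The quote character never needs special treatment here because nothing in the format gives it meaning, which is the main practical difference from CSV.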

uniVocity-parsers comes with a TSV parser. You can parse a billion rows without problems.

Example of parsing a TSV input:

TsvParserSettings settings = new TsvParserSettings();
TsvParser parser = new TsvParser(settings);

// parses all rows in one go.
List<String[]> allRows = parser.parseAll(new FileReader(yourFile));

If your input is too big to be kept in memory, do this:

TsvParserSettings settings = new TsvParserSettings();

// all rows parsed from your input will be sent to this processor
ObjectRowProcessor rowProcessor = new ObjectRowProcessor() {
    @Override
    public void rowProcessed(Object[] row, ParsingContext context) {
        //here is the row. Let's just print it.
        System.out.println(Arrays.toString(row));
    }
};
// the ObjectRowProcessor supports conversions from String to whatever you need:
// converts values in columns 2 and 5 to BigDecimal
rowProcessor.convertIndexes(Conversions.toBigDecimal()).set(2, 5);

// converts the values in columns "Description" and "Model". Applies trim and to lowercase to the values in these columns.
rowProcessor.convertFields(Conversions.trim(), Conversions.toLowerCase()).set("Description", "Model");

//configures to use the RowProcessor
settings.setRowProcessor(rowProcessor);

TsvParser parser = new TsvParser(settings);
//parses everything. All rows will be pumped into your RowProcessor.
parser.parse(new FileReader(yourFile));
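The same row-at-a-time idea can be sketched with the JDK alone, which may help to show why it keeps memory usage flat. This is not the uniVocity API: `StreamingTsvDemo` and `forEachRow` are hypothetical names, and the naive tab split assumes fields contain no escaped tabs or line breaks. Each line is read, split, handed to a callback, and then discarded, so only the current row is held in memory at any time.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.function.Consumer;

public class StreamingTsvDemo {
    // Reads the input one line at a time and passes each row to the callback.
    // Returns the number of rows read. The reader is closed when done.
    public static long forEachRow(Reader input, Consumer<String[]> rowHandler) {
        long count = 0;
        try (BufferedReader reader = new BufferedReader(input)) {
            String line;
            while ((line = reader.readLine()) != null) {
                rowHandler.accept(line.split("\t", -1));
                count++;
            }
        } catch (IOException e) {
            // Rethrown unchecked so callers can use the method inside lambdas.
            throw new UncheckedIOException(e);
        }
        return count;
    }

    public static void main(String[] args) {
        // A StringReader stands in for a large file in this example.
        Reader input = new StringReader("id\tname\n1\tfoo\n2\tbar\n");
        long rows = forEachRow(input, row -> System.out.println(String.join("|", row)));
        System.out.println(rows + " rows");
    }
}
```

For a real file, a reader obtained from `Files.newBufferedReader(path, charset)` could be passed in place of the `StringReader`.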

Disclosure: I am the author of this library. It is open source and free (Apache V2.0 license).


Answer 2

I have not tried it, but I looked into superCSV earlier.

http://sourceforge.net/projects/supercsv/

http://supercsv.sourceforge.net/

Check whether it works for you with 2.5 million lines.