Performance of the StringTokenizer class vs. the String.split method in Java

2022-09-01 02:54:31

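In my software I need to split strings into words. I currently have more than 19,000,000 documents, each with more than 30 words.

Which of the following two ways is the best way to do this (in terms of performance)?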

StringTokenizer sTokenize = new StringTokenizer(s, " ");
while (sTokenize.hasMoreTokens()) {
    // ... process sTokenize.nextToken()
}

String[] splitS = s.split(" ");
for (int i = 0; i < splitS.length; i++) {
    // ... process splitS[i]
}

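Answer 1

If your data is already in a database and you need to parse the strings of words, I would suggest using indexOf repeatedly. It is many times faster than either solution.

However, getting the data from the database is still likely to be much more expensive.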

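import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.StringTokenizer;
import java.util.regex.Pattern;

public class SplitBenchmark {
    public static void main(String[] args) {
        // Build a sample of 60 six-digit "words", each followed by a space.
        StringBuilder sb = new StringBuilder();
        for (int i = 100000; i < 100000 + 60; i++)
            sb.append(i).append(' ');
        String sample = sb.toString();

        int runs = 100000;
        // Repeat the whole comparison 5 times so later rounds run on warmed-up code.
        for (int i = 0; i < 5; i++) {
            {
                // 1. StringTokenizer with its default delimiters.
                long start = System.nanoTime();
                for (int r = 0; r < runs; r++) {
                    StringTokenizer st = new StringTokenizer(sample);
                    List<String> list = new ArrayList<String>();
                    while (st.hasMoreTokens())
                        list.add(st.nextToken());
                }
                long time = System.nanoTime() - start;
                System.out.printf("StringTokenizer took an average of %.1f us%n", time / runs / 1000.0);
            }
            {
                // 2. A precompiled Pattern, so the regex is compiled only once.
                long start = System.nanoTime();
                Pattern spacePattern = Pattern.compile(" ");
                for (int r = 0; r < runs; r++) {
                    List<String> list = Arrays.asList(spacePattern.split(sample, 0));
                }
                long time = System.nanoTime() - start;
                System.out.printf("Pattern.split took an average of %.1f us%n", time / runs / 1000.0);
            }
            {
                // 3. Repeated indexOf. This relies on the sample ending with a space;
                //    text after the last space would not be added to the list.
                long start = System.nanoTime();
                for (int r = 0; r < runs; r++) {
                    List<String> list = new ArrayList<String>();
                    int pos = 0, end;
                    while ((end = sample.indexOf(' ', pos)) >= 0) {
                        list.add(sample.substring(pos, end));
                        pos = end + 1;
                    }
                }
                long time = System.nanoTime() - start;
                System.out.printf("indexOf loop took an average of %.1f us%n", time / runs / 1000.0);
            }
        }
    }
}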

prints

StringTokenizer took an average of 5.8 us
Pattern.split took an average of 4.8 us
indexOf loop took an average of 1.8 us
StringTokenizer took an average of 4.9 us
Pattern.split took an average of 3.7 us
indexOf loop took an average of 1.7 us
StringTokenizer took an average of 5.2 us
Pattern.split took an average of 3.9 us
indexOf loop took an average of 1.8 us
StringTokenizer took an average of 5.1 us
Pattern.split took an average of 4.1 us
indexOf loop took an average of 1.6 us
StringTokenizer took an average of 5.0 us
Pattern.split took an average of 3.8 us
indexOf loop took an average of 1.6 us
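
The indexOf loop in the benchmark only works as-is because the sample ends with a space. For real documents you would probably want to keep the last word too. Here is a minimal sketch of a reusable version; the class name IndexOfTokenizer and the method tokenize are made up for illustration, and skipping empty tokens (the way StringTokenizer does) is my assumption about what is wanted, not something the question states.

```java
import java.util.ArrayList;
import java.util.List;

class IndexOfTokenizer {
    // Hypothetical helper: splits s on a single delimiter character using repeated indexOf.
    // Empty tokens (from consecutive delimiters) are skipped, similar to StringTokenizer.
    static List<String> tokenize(String s, char delim) {
        List<String> tokens = new ArrayList<>();
        int pos = 0, end;
        while ((end = s.indexOf(delim, pos)) >= 0) {
            if (end > pos)
                tokens.add(s.substring(pos, end));
            pos = end + 1;
        }
        // Keep the text after the last delimiter, which the benchmark loop would drop.
        if (pos < s.length())
            tokens.add(s.substring(pos));
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("the quick  brown fox", ' '));
        // prints [the, quick, brown, fox]
    }
}
```

If you need empty tokens preserved (as String.split does for interior matches), drop the end > pos check.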

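The cost of opening a file will be about 8 ms. As the files are so small, your cache may improve performance by a factor of 2-5x. Even so, it is going to spend roughly 10 hours just opening files. The cost of using split vs. StringTokenizer is far less than 0.01 ms per document. Parsing 19 million documents x 30 words x 8 letters per word should take about 10 seconds (at about 1 GB per 2 seconds).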
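
To make the arithmetic behind those estimates explicit, here is a rough back-of-the-envelope sketch. The 8 ms open cost, the 2-5x cache factor and the 1 GB per 2 seconds throughput are the figures quoted above; treating one letter as one byte and ignoring separators are my simplifying assumptions, and the class name CostEstimate is just for illustration.

```java
class CostEstimate {
    static final long DOCUMENTS = 19_000_000L;
    static final int WORDS_PER_DOC = 30;
    static final int LETTERS_PER_WORD = 8;

    // Hours spent opening one file per document, given a per-file cost in ms
    // and a speedup factor from caching (1 means no speedup).
    static double openHours(double openMs, double cacheSpeedup) {
        return DOCUMENTS * openMs / cacheSpeedup / 1000.0 / 3600.0;
    }

    // Total letters to parse, ignoring separators.
    static long totalChars() {
        return DOCUMENTS * WORDS_PER_DOC * LETTERS_PER_WORD;
    }

    // Seconds to parse everything at the given throughput, assuming one byte per letter.
    static double parseSeconds(double bytesPerSecond) {
        return totalChars() / bytesPerSecond;
    }

    public static void main(String[] args) {
        System.out.printf("opening files, no cache: %.1f h%n", openHours(8, 1)); // about 42.2 h
        System.out.printf("opening files, 2x cache: %.1f h%n", openHours(8, 2)); // about 21.1 h
        System.out.printf("opening files, 5x cache: %.1f h%n", openHours(8, 5)); // about 8.4 h
        System.out.printf("data to parse: %d bytes%n", totalChars());            // 4560000000
        System.out.printf("parse time: %.1f s%n", parseSeconds(0.5e9));          // about 9.1 s
    }
}
```

So file opening lands somewhere between roughly 8 and 21 hours depending on the cache, while parsing takes on the order of 10 seconds. Whichever way you split the strings, it will be a tiny part of the total.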

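If you want to improve performance, I suggest you have far fewer files, e.g. use a database. If you don't want to use an SQL database, I suggest using one of these: http://nosql-database.org/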


Answer 2

In Java 7, split just calls indexOf for this input; see the source. Split should be very fast, close to repeated calls of indexOf.
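
As far as I can tell from the source, that fast path is taken when the delimiter is a single character that is not a regex metacharacter, which is the case for " ". The sketch below does no timing; it only checks that s.split(" ") returns the same tokens as a hand-written indexOf loop, including split's rule of dropping trailing empty strings. The class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class SplitFastPathDemo {
    // Hand-written equivalent of s.split(" "), built on repeated indexOf.
    static List<String> manualSplit(String s) {
        List<String> out = new ArrayList<>();
        int pos = 0, end;
        while ((end = s.indexOf(' ', pos)) >= 0) {
            out.add(s.substring(pos, end));
            pos = end + 1;
        }
        if (out.isEmpty()) {
            // No delimiter found: split returns the whole input as the only element.
            out.add(s);
            return out;
        }
        out.add(s.substring(pos));
        // split with the default limit of 0 removes trailing empty strings.
        while (!out.isEmpty() && out.get(out.size() - 1).isEmpty())
            out.remove(out.size() - 1);
        return out;
    }

    public static void main(String[] args) {
        String[] inputs = { "100000 100001 100002 ", "a  b", " leading", "nospace" };
        for (String s : inputs) {
            List<String> viaSplit = Arrays.asList(s.split(" "));
            System.out.println(viaSplit.equals(manualSplit(s)) + " " + viaSplit);
        }
    }
}
```

Note that with a multi-character or metacharacter delimiter, split goes through the regex engine instead, so the timing would be closer to the Pattern.split numbers in the other answer.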

