Get the size of a String in a given encoding, in bytes, without converting to byte[]

2022-09-02 13:49:52

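I have a situation where I need to know the size of a String/encoding pair, in bytes, but I cannot use the getBytes() method because 1) the String is very large, and duplicating it in a byte[] array would use a large amount of memory, but more to the point 2) getBytes() allocates a byte[] array based on the length of the String * the maximum possible bytes per character. So if I have a String with 1.5B characters and UTF-16 encoding, getBytes() will try to allocate a 3GB array and fail, since arrays are limited to 2^31 - X elements (X is Java version specific).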
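To make the arithmetic in point 2 concrete, here is a rough sketch. It assumes the worst-case sizing is length * CharsetEncoder.maxBytesPerChar(); the exact allocation logic inside the JDK differs between versions, and the class and method names (WorstCaseEstimate, worstCaseBytes) are made up for illustration.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class WorstCaseEstimate {

    // Worst-case number of bytes an encoder might need for charCount chars.
    // Assumption: length * maxBytesPerChar, as described in the question;
    // the real JDK sizing logic is version specific.
    static long worstCaseBytes(long charCount, Charset charset) {
        float maxBytesPerChar = charset.newEncoder().maxBytesPerChar();
        return (long) Math.ceil(charCount * (double) maxBytesPerChar);
    }

    public static void main(String[] args) {
        long chars = 1_500_000_000L; // the 1.5B-character String from the question
        long estimate = worstCaseBytes(chars, StandardCharsets.UTF_16LE);

        System.out.println("Worst-case bytes: " + estimate);
        // A byte[] cannot hold more than Integer.MAX_VALUE elements.
        System.out.println("Fits in a byte[]: " + (estimate <= Integer.MAX_VALUE));
    }
}
```

The estimate is computed in a long on purpose, since the whole point is that the result does not fit in an int.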

So, is there some way to calculate the byte size of a String/encoding pair directly from the String object?

UPDATE:

Here is a working implementation of jtahlborn's answer:

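// requires: import java.io.OutputStream;
private class CountingOutputStream extends OutputStream {
    // long rather than int, so totals above 2GB do not overflow
    long total;

    @Override
    public void write(int i) {
        // the Writer is only expected to call the array-based overloads
        throw new RuntimeException("don't use");
    }

    @Override
    public void write(byte[] b) {
        total += b.length;
    }

    @Override
    public void write(byte[] b, int offset, int len) {
        // only count the bytes; nothing is stored
        total += len;
    }
}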

Answer 1

Simple, just write it to a dummy output stream:

// requires: import java.io.OutputStream;
class CountingOutputStream extends OutputStream {
  // long rather than int, so totals above 2GB do not overflow
  private long _total;

  @Override public void write(int b) {
    ++_total;
  }

  @Override public void write(byte[] b) {
    _total += b.length;
  }

  @Override public void write(byte[] b, int offset, int len) {
    _total += len;
  }

  public long getTotalSize() {
    return _total;
  }
}

CountingOutputStream cos = new CountingOutputStream();
Writer writer = new OutputStreamWriter(cos, "my_encoding");
//writer.write(myString);

// UPDATE: OutputStreamWriter does a simple copy of the _entire_ input string, to avoid that use:
for(int i = 0; i < myString.length(); i+=8096) {
  int end = Math.min(myString.length(), i+8096);
  writer.write(myString, i, end - i);
}

writer.flush();

System.out.println("Total bytes: " + cos.getTotalSize());

Not only is it simple, it is probably just as fast as the other "complex" answers.
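For anyone who wants to try this out, the pieces above can be put together into one self-contained class, roughly like this. It is only a sketch: the names EncodedSize, ByteCounter and encodedSize are made up for illustration, the counter is a long so results above 2GB do not overflow, and the 8192-char chunk size is an arbitrary choice.

```java
import java.io.IOException;
import java.io.OutputStream;
import java.io.OutputStreamWriter;
import java.io.Writer;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class EncodedSize {

    // Discards everything written to it and only counts the bytes.
    static final class ByteCounter extends OutputStream {
        private long total;

        @Override public void write(int b) {
            total++;
        }

        @Override public void write(byte[] b, int off, int len) {
            total += len;
        }

        long total() {
            return total;
        }
    }

    // Encodes s chunk by chunk into the counting stream, so no byte[]
    // of the full encoded size is ever allocated.
    static long encodedSize(String s, Charset charset) throws IOException {
        ByteCounter counter = new ByteCounter();
        try (Writer writer = new OutputStreamWriter(counter, charset)) {
            final int chunk = 8192; // arbitrary chunk size (assumption)
            final int n = s.length();
            int i = 0;
            while (i < n) {
                int len = Math.min(chunk, n - i); // never runs past the end
                writer.write(s, i, len);
                i += len;
            }
        } // close() flushes any bytes still buffered in the writer
        return counter.total();
    }

    public static void main(String[] args) throws IOException {
        String sample = "h\u00e9llo w\u00f6rld \u20ac";
        System.out.println("UTF-8 bytes:    " + encodedSize(sample, StandardCharsets.UTF_8));
        System.out.println("UTF-16LE bytes: " + encodedSize(sample, StandardCharsets.UTF_16LE));
    }
}
```

For small inputs the result can be sanity-checked against sample.getBytes(charset).length, which is exactly the call this approach is meant to avoid for huge strings.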


Answer 2

The same using the apache-commons libraries:

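// requires:
// import java.io.IOException;
// import java.nio.charset.Charset;
// import org.apache.commons.io.IOUtils;
// import org.apache.commons.io.output.CountingOutputStream;
// import org.apache.commons.io.output.NullOutputStream;
public static long stringLength(String string, Charset charset) {

    try (NullOutputStream nul = new NullOutputStream();
         CountingOutputStream count = new CountingOutputStream(nul)) {

        // Note: depending on the commons-io version, IOUtils.write may
        // encode the whole string in memory before writing it out.
        IOUtils.write(string, count, charset.name());
        count.flush();
        // getByteCount() returns a long; getCount() returns an int
        return count.getByteCount();
    } catch (IOException e) {
        throw new IllegalStateException("Unexpected I/O.", e);
    }
}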