一旦UTF-8编码,我如何截断Java字符串以适应给定数量的字节?
我如何截断Java,以便我知道一旦它被UTF-8编码,它将适合给定数量的字节存储?String
我如何截断Java,以便我知道一旦它被UTF-8编码,它将适合给定数量的字节存储?String
下面是一个简单的循环,它计算 UTF-8 表示形式的大小,并在超过 UTF-8 表示形式时进行截断:
public static String truncateWhenUTF8(String s, int maxBytes) {
int b = 0;
for (int i = 0; i < s.length(); i++) {
char c = s.charAt(i);
// ranges from http://en.wikipedia.org/wiki/UTF-8
int skip = 0;
int more;
if (c <= 0x007f) {
more = 1;
}
else if (c <= 0x07FF) {
more = 2;
} else if (c <= 0xd7ff) {
more = 3;
} else if (c <= 0xDFFF) {
// surrogate area, consume next char as well
more = 4;
skip = 1;
} else {
more = 3;
}
if (b + more > maxBytes) {
return s.substring(0, i);
}
b += more;
i += skip;
}
return s;
}
这将处理出现在输入字符串中的代理项对。Java 的 UTF-8 编码器(正确)将代理项对输出为单个 4 字节序列,而不是两个 3 字节序列,因此将返回最长的截断字符串。如果忽略实现中的代理项对,则截断的字符串可能会短接,而不是需要的。truncateWhenUTF8()
我还没有对该代码进行过大量测试,但这里有一些初步测试:
private static void test(String s, int maxBytes, int expectedBytes) {
String result = truncateWhenUTF8(s, maxBytes);
byte[] utf8 = result.getBytes(Charset.forName("UTF-8"));
if (utf8.length > maxBytes) {
System.out.println("BAD: our truncation of " + s + " was too big");
}
if (utf8.length != expectedBytes) {
System.out.println("BAD: expected " + expectedBytes + " got " + utf8.length);
}
System.out.println(s + " truncated to " + result);
}
public static void main(String[] args) {
test("abcd", 0, 0);
test("abcd", 1, 1);
test("abcd", 2, 2);
test("abcd", 3, 3);
test("abcd", 4, 4);
test("abcd", 5, 4);
test("a\u0080b", 0, 0);
test("a\u0080b", 1, 1);
test("a\u0080b", 2, 1);
test("a\u0080b", 3, 3);
test("a\u0080b", 4, 4);
test("a\u0080b", 5, 4);
test("a\u0800b", 0, 0);
test("a\u0800b", 1, 1);
test("a\u0800b", 2, 1);
test("a\u0800b", 3, 1);
test("a\u0800b", 4, 4);
test("a\u0800b", 5, 5);
test("a\u0800b", 6, 5);
// surrogate pairs
test("\uD834\uDD1E", 0, 0);
test("\uD834\uDD1E", 1, 0);
test("\uD834\uDD1E", 2, 0);
test("\uD834\uDD1E", 3, 0);
test("\uD834\uDD1E", 4, 4);
test("\uD834\uDD1E", 5, 4);
}
更新修改后的代码示例,它现在处理代理项对。
你应该使用CharsetEncoder,简单+复制尽可能多的你可以将UTF-8字符切成两半。getBytes()
像这样:
public static int truncateUtf8(String input, byte[] output) {
ByteBuffer outBuf = ByteBuffer.wrap(output);
CharBuffer inBuf = CharBuffer.wrap(input.toCharArray());
CharsetEncoder utf8Enc = StandardCharsets.UTF_8.newEncoder();
utf8Enc.encode(inBuf, outBuf, true);
System.out.println("encoded " + inBuf.position() + " chars of " + input.length() + ", result: " + outBuf.position() + " bytes");
return outBuf.position();
}