Understanding the logic in CaseInsensitiveComparator

string java jdk1.6

2022-09-01 12:22:52

Can anyone explain the following code from String.java, specifically why there are three if statements (which I have marked 1, 2, and 3 in comments)?

private static class CaseInsensitiveComparator
                     implements Comparator<String>, java.io.Serializable {
    // use serialVersionUID from JDK 1.2.2 for interoperability
    private static final long serialVersionUID = 8575799808933029326L;

    public int compare(String s1, String s2) {
        int n1=s1.length(), n2=s2.length();
        for (int i1=0, i2=0; i1<n1 && i2<n2; i1++, i2++) {
            char c1 = s1.charAt(i1);
            char c2 = s2.charAt(i2);
            if (c1 != c2) { // 1
                c1 = Character.toUpperCase(c1);
                c2 = Character.toUpperCase(c2);
                if (c1 != c2) { // 2
                    c1 = Character.toLowerCase(c1);
                    c2 = Character.toLowerCase(c2);
                    if (c1 != c2) { // 3
                        return c1 - c2;
                    }
                }
            }
        }
        return n1 - n2;
    }
}

Answer 1

From the Unicode Technical Standard:

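In addition, because of the vagaries of natural language, there are situations where two different Unicode characters have the same uppercase or lowercase.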

So comparing only the uppercase forms of two characters is not enough: they may have different uppercase forms but the same lowercase form.

A simple brute-force check turns up several such cases. Take code points 73 and 304, for example:

char ch1 = (char) 73; //LATIN CAPITAL LETTER I
char ch2 = (char) 304; //LATIN CAPITAL LETTER I WITH DOT ABOVE
System.out.println(ch1==ch2);
System.out.println(Character.toUpperCase(ch1)==Character.toUpperCase(ch2));
System.out.println(Character.toLowerCase(ch1)==Character.toLowerCase(ch2));

Output:

false
false
true

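So "İ" and "I" are not equal, and since both are already uppercase, converting to uppercase does not make them equal either. But they share the same lowercase letter, "i", which is the reason to treat them as equal in a case-insensitive comparison.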
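The brute-force check mentioned above could be sketched roughly as follows. This is only an illustration, not the code the answer actually ran: the class name CaseFoldScan and the method scan are hypothetical, and the sketch assumes Java 8 or later (for lambdas and computeIfAbsent). It groups the uppercase form of every char by the lowercase of that uppercase form, and keeps only the groups with more than one distinct uppercase form, which are exactly the cases where check 2 fails and check 3 succeeds. The set of groups found depends on the Unicode version shipped with the JDK.

```java
import java.util.Map;
import java.util.TreeMap;
import java.util.TreeSet;

class CaseFoldScan {
    // Hypothetical helper: maps a lowercase char to the distinct uppercase
    // forms that fold to it, keeping only entries with two or more forms.
    static Map<Character, TreeSet<Character>> scan() {
        Map<Character, TreeSet<Character>> byLower = new TreeMap<>();
        for (int i = Character.MIN_VALUE; i <= Character.MAX_VALUE; i++) {
            char upper = Character.toUpperCase((char) i);
            char lower = Character.toLowerCase(upper);
            byLower.computeIfAbsent(lower, k -> new TreeSet<>()).add(upper);
        }
        // Drop the ordinary cases where uppercase alone already decides equality.
        byLower.values().removeIf(uppers -> uppers.size() < 2);
        return byLower;
    }

    public static void main(String[] args) {
        for (Map.Entry<Character, TreeSet<Character>> e : scan().entrySet()) {
            StringBuilder line = new StringBuilder();
            line.append(String.format("lowercase U+%04X <- uppercase", (int) e.getKey()));
            for (char upper : e.getValue()) {
                line.append(String.format(" U+%04X", (int) upper));
            }
            System.out.println(line);
        }
    }
}
```

The entry for "i" (U+0069) should contain both U+0049 and U+0130, matching the example above.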

Answer 2

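Normally we would expect to convert case once, compare, and be done. However, the code converts case twice, and the reason is stated in a comment in a different method, public boolean regionMatches(boolean ignoreCase, int toffset, String other, int ooffset, int len):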

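Unfortunately, conversion to uppercase does not work properly for the Georgian alphabet, which has strange rules about case conversion. So we need to make one last check before exiting.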

Appendix

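The code of regionMatches differs somewhat from the code in CaseInsensitiveComparator, but it essentially does the same thing. For cross-checking, the full code of the method is quoted below: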

public boolean regionMatches(boolean ignoreCase, int toffset,
                       String other, int ooffset, int len) {
    char ta[] = value;
    int to = offset + toffset;
    char pa[] = other.value;
    int po = other.offset + ooffset;
    // Note: toffset, ooffset, or len might be near -1>>>1.
    if ((ooffset < 0) || (toffset < 0) || (toffset > (long)count - len) ||
            (ooffset > (long)other.count - len)) {
        return false;
    }
    while (len-- > 0) {
        char c1 = ta[to++];
        char c2 = pa[po++];
        if (c1 == c2) {
            continue;
        }
        if (ignoreCase) {
            // If characters don't match but case may be ignored,
            // try converting both characters to uppercase.
            // If the results match, then the comparison scan should
            // continue.
            char u1 = Character.toUpperCase(c1);
            char u2 = Character.toUpperCase(c2);
            if (u1 == u2) {
                continue;
            }
            // Unfortunately, conversion to uppercase does not work properly
            // for the Georgian alphabet, which has strange rules about case
            // conversion.  So we need to make one last check before
            // exiting.
            if (Character.toLowerCase(u1) == Character.toLowerCase(u2)) {
                continue;
            }
        }
        return false;
    }
    return true;
}
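The per-character logic shared by both methods can be condensed into a small helper. This is a minimal sketch for illustration only: the class IgnoreCaseChars and its method equalsIgnoreCase(char, char) are hypothetical names, not part of the JDK. The main method also calls the real String.equalsIgnoreCase and String.CASE_INSENSITIVE_ORDER so the results can be compared on whatever JDK is at hand.

```java
class IgnoreCaseChars {
    // Hypothetical helper mirroring the three checks in the JDK code:
    // 1. raw comparison, 2. uppercase comparison, 3. lowercase of the uppercase.
    static boolean equalsIgnoreCase(char c1, char c2) {
        if (c1 == c2) {
            return true;
        }
        char u1 = Character.toUpperCase(c1);
        char u2 = Character.toUpperCase(c2);
        if (u1 == u2) {
            return true;
        }
        // Last check: catches pairs such as 'I' and U+0130 whose uppercase
        // forms differ but whose lowercase forms coincide.
        return Character.toLowerCase(u1) == Character.toLowerCase(u2);
    }

    public static void main(String[] args) {
        System.out.println(equalsIgnoreCase('a', 'A'));      // decided by check 2
        System.out.println(equalsIgnoreCase('I', '\u0130')); // decided by check 3
        System.out.println(equalsIgnoreCase('a', 'b'));      // all three checks fail

        // The public API applies the same idea to whole strings.
        System.out.println("I".equalsIgnoreCase("\u0130"));
        System.out.println(String.CASE_INSENSITIVE_ORDER.compare("I", "\u0130"));
    }
}
```

The helper returns a boolean, as regionMatches does. A comparator version would return the difference of the lowercased characters when the last check fails, as in the code from the question.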