A cleaner way to check whether a string is an ISO country or an ISO language in Java

2022-09-02 12:44:28

Suppose you have a two-character String that should represent an ISO language code (ISO 639) or an ISO country code (ISO 3166).

As you know, the Locale class has two functions, getISOLanguages and getISOCountries, which return a String array of all ISO languages and all ISO countries, respectively.

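To check whether a specific String object is a valid ISO language or ISO country, I have to look for a match in those arrays. Fine, I can do that with a binary search (e.g. Arrays.binarySearch) or with a linear scan such as Apache Commons ArrayUtils.contains.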
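The array-search approach described above might look like the following minimal sketch. The class name IsoArrayLookup and its method names are hypothetical, not part of any library. Arrays.binarySearch requires a sorted array, and the Locale Javadoc does not promise an order, so the sketch assumes nothing and sorts its own copies once.

```java
import java.util.Arrays;
import java.util.Locale;

// Hypothetical helper for illustration only.
public final class IsoArrayLookup {
    // Fetch the arrays once and keep sorted private copies for binary search.
    private static final String[] LANGUAGES = sorted(Locale.getISOLanguages());
    private static final String[] COUNTRIES = sorted(Locale.getISOCountries());

    private IsoArrayLookup() {}

    private static String[] sorted(String[] codes) {
        String[] copy = codes.clone();
        Arrays.sort(copy);
        return copy;
    }

    public static boolean isLanguage(String s) {
        // binarySearch returns a negative value when the key is absent
        return s != null && Arrays.binarySearch(LANGUAGES, s) >= 0;
    }

    public static boolean isCountry(String s) {
        return s != null && Arrays.binarySearch(COUNTRIES, s) >= 0;
    }
}
```

The lookup is case-sensitive: Locale returns lower-case language codes and upper-case country codes, so "us" is not accepted as a country.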

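The question is: is there any utility (e.g. from the Guava or Apache Commons libraries) that offers a cleaner way, such as a function returning a boolean that tells whether a String is a valid ISO language or ISO country?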

For example:

public static boolean isValidISOLanguage(String s)
public static boolean isValidISOCountry(String s)

Answer 1

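I wouldn't bother using either a binary search or any third-party libraries - a hash-based set (a HashSet, or here an immutable Set built with Set.of) is fine for this: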

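import java.util.Locale;
import java.util.Set;

public final class IsoUtil {
    // Immutable sets built once; Set.of requires Java 9 or later.
    private static final Set<String> ISO_LANGUAGES = Set.of(Locale.getISOLanguages());
    private static final Set<String> ISO_COUNTRIES = Set.of(Locale.getISOCountries());

    private IsoUtil() {}

    public static boolean isValidISOLanguage(String s) {
        // Sets created by Set.of throw NullPointerException on contains(null)
        return s != null && ISO_LANGUAGES.contains(s);
    }

    public static boolean isValidISOCountry(String s) {
        return s != null && ISO_COUNTRIES.contains(s);
    }
}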

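You could check the string length first, but I'm not sure I'd bother - at least not unless you want to protect yourself against performance attacks where you're handed enormous strings that take a long time to hash.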
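If you did want the length check, it could be sketched as below. GuardedIsoUtil is a hypothetical name, and the sketch assumes every code returned by Locale.getISOCountries() is exactly two characters long, as the Javadoc describes.

```java
import java.util.Locale;
import java.util.Set;

// Hypothetical variant of the utility above with a cheap length guard.
public final class GuardedIsoUtil {
    private static final Set<String> ISO_COUNTRIES = Set.of(Locale.getISOCountries());

    private GuardedIsoUtil() {}

    public static boolean isValidISOCountry(String s) {
        // Reject null and wrong-length input before the string is ever hashed
        return s != null && s.length() == 2 && ISO_COUNTRIES.contains(s);
    }
}
```

The guard costs one comparison and means an oversized input is rejected without computing its hash.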

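EDIT: If you do want to use a third-party library, ICU4J is the most likely contender - but it may well have a more up-to-date list than the one supported by Locale, so you would probably want to use ICU4J everywhere.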


Answer 2

As far as I know there is no such method in any library, but you can at least declare it yourself, like this:

import static java.util.Arrays.binarySearch;
import java.util.Locale;

/**
 * Validator of country codes.
 * Uses binary search over a sorted array of country codes.
 * A country code has two ASCII letters, so two bytes are enough to represent it.
 * Two bytes fit in Java's short type, which lets us use Arrays.binarySearch(short[] a, short key).
 * Each country code is converted to a short via the countryCodeNeedle() function.
 *
 * Average speed of the method is 246.058 ops/ms, about half the speed of a HashSet lookup (523.678 ops/ms).
 * Complexity is O(log(N)) instead of O(1) for HashSet.
 * But it needs only 520 bytes of RAM for the list of country codes instead of 22064 bytes (> 21 KB) for a HashSet of country codes.
 */
public class CountryValidator {
  /** Sorted array of country codes converted to short */
  private static final short[] COUNTRIES_SHORT = initShortArray(Locale.getISOCountries());

  public static boolean isValidCountryCode(String countryCode) {
    if (countryCode == null || countryCode.length() != 2 || countryCodeIsNotAlphaUppercase(countryCode)) {
      return false;
    }
    short needle = countryCodeNeedle(countryCode);
    return binarySearch(COUNTRIES_SHORT, needle) >= 0;
  }

  private static boolean countryCodeIsNotAlphaUppercase(String countryCode) {
    char c1 = countryCode.charAt(0);
    if (c1 < 'A' || c1 > 'Z') {
      return true;
    }
    char c2 = countryCode.charAt(1);
    return c2 < 'A' || c2 > 'Z';
  }

  /**
   * A country code has two ASCII letters, so two bytes are enough to represent it.
   * Two bytes fit in Java's short type, so we convert the two letters of the country code to a short.
   * We could use something like:
   * short val = (short)((hi << 8) | lo);
   * But very similar logic already exists inside String.hashCode(): for two chars it is 31 * hi + lo,
   * which is unique for two uppercase ASCII letters, keeps alphabetical order, and fits in a short.
   * More importantly, a String caches its hash code after the first computation,
   * so the conversion is nearly free on repeated calls.
   * We can rely on String's hash code because its formula is specified in the String Javadoc.
   */
  private static short countryCodeNeedle(String countryCode) {
    return (short) countryCode.hashCode();
  }

  private static short[] initShortArray(String[] isoCountries) {
    short[] countriesShortArray = new short[isoCountries.length];
    for (int i = 0; i < isoCountries.length; i++) {
      String isoCountry = isoCountries[i];
      countriesShortArray[i] = countryCodeNeedle(isoCountry);
    }
    // binarySearch needs sorted input; do not depend on the order Locale returns
    java.util.Arrays.sort(countriesShortArray);
    return countriesShortArray;
  }
}

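Locale.getISOCountries() always creates a new array, so we should store the result in a static field to avoid unnecessary allocations. At the same time, a HashSet or TreeSet consumes a lot of memory, so this validator uses binary search over an array instead. It is a trade-off between speed and memory.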
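The binary search above only works if the short keys are unique and ordered the same way as the codes themselves. The following sketch checks that property for every pair of uppercase ASCII letters. ShortKeyCheck and its method names are hypothetical, and key() simply repeats the conversion used by countryCodeNeedle().

```java
// Hypothetical check, for illustration only.
public final class ShortKeyCheck {
    private ShortKeyCheck() {}

    /** Same conversion as countryCodeNeedle() in the validator above. */
    static short key(String code) {
        return (short) code.hashCode();
    }

    /** Returns true if the keys for "AA".."ZZ" strictly increase in alphabetical order. */
    public static boolean keysAreStrictlyIncreasing() {
        int previous = Integer.MIN_VALUE;
        for (char c1 = 'A'; c1 <= 'Z'; c1++) {
            for (char c2 = 'A'; c2 <= 'Z'; c2++) {
                String code = "" + c1 + c2;
                int k = key(code);
                // For a two-char string, String.hashCode() is 31 * first + second
                if (k != 31 * c1 + c2 || k <= previous) {
                    return false;
                }
                previous = k;
            }
        }
        return true;
    }
}
```

Strictly increasing keys imply there are no collisions, and since the second letter spans only 26 values, less than the multiplier 31, moving to the next first letter always yields a larger key.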

