Optimizing the Jaro-Winkler algorithm

2022-09-04 03:47:10

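I have this code for the Jaro-Winkler algorithm, taken from this website. I need to run it 150,000 times to get the distance between differences. That takes a long time, because I am running it on an Android mobile device.

Can it be optimized any further?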

public class Jaro {
    /**
     * gets the similarity of the two strings using Jaro distance.
     *
     * @param string1 the first input string
     * @param string2 the second input string
     * @return a value between 0-1 of the similarity
     */
    public float getSimilarity(final String string1, final String string2) {

        //get half the length of the string rounded up - (this is the distance used for acceptable transpositions)
        final int halflen = ((Math.min(string1.length(), string2.length())) / 2) + ((Math.min(string1.length(), string2.length())) % 2);

        //get common characters
        final StringBuffer common1 = getCommonCharacters(string1, string2, halflen);
        final StringBuffer common2 = getCommonCharacters(string2, string1, halflen);

        //check for zero in common
        if (common1.length() == 0 || common2.length() == 0) {
            return 0.0f;
        }

        //check for same length common strings returning 0.0f is not the same
        if (common1.length() != common2.length()) {
            return 0.0f;
        }

        //get the number of transpositions
        int transpositions = 0;
        int n=common1.length();
        for (int i = 0; i < n; i++) {
            if (common1.charAt(i) != common2.charAt(i))
                transpositions++;
        }
        transpositions /= 2.0f;

        //calculate jaro metric
        return (common1.length() / ((float) string1.length()) +
                common2.length() / ((float) string2.length()) +
                (common1.length() - transpositions) / ((float) common1.length())) / 3.0f;
    }

    /**
     * returns a string buffer of characters from string1 within string2 if they are of a given
     * distance separation from the position in string1.
     *
     * @param string1
     * @param string2
     * @param distanceSep
     * @return a string buffer of characters from string1 within string2 if they are of a given
     *         distance separation from the position in string1
     */
    private static StringBuffer getCommonCharacters(final String string1, final String string2, final int distanceSep) {
        //create a return buffer of characters
        final StringBuffer returnCommons = new StringBuffer();
        //create a copy of string2 for processing
        final StringBuffer copy = new StringBuffer(string2);
        //iterate over string1
        int n=string1.length();
        int m=string2.length();
        for (int i = 0; i < n; i++) {
            final char ch = string1.charAt(i);
            //set boolean for quick loop exit if found
            boolean foundIt = false;
            //compare char with range of characters to either side

            for (int j = Math.max(0, i - distanceSep); !foundIt && j < Math.min(i + distanceSep, m - 1); j++) {
                //check if found
                if (copy.charAt(j) == ch) {
                    foundIt = true;
                    //append character found
                    returnCommons.append(ch);
                    //alter copied string2 for processing
                    copy.setCharAt(j, (char)0);
                }
            }
        }
        return returnCommons;
    }
}

I should mention that I create the instance of the class only once for the whole process:

jaro= new Jaro();

If you want to test it and need sample data so that you don't break the script, you will find it here, in another thread about Python optimization.


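Answer 1

Yes, but you're not going to enjoy it. Replace all those StringBuffers that are created with new on every call with char arrays that are allocated once in the constructor and never again, and use integer indices to keep track of what is in them.

This pending Commons-Lang patch will give you some of the flavor.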
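A minimal sketch of that idea follows. It is not the Commons-Lang patch and not a drop-in replacement for the posted class: the class name JaroArrays, the maxLength constructor parameter, and the ensureCapacity helper are assumptions made for illustration. The matching logic mirrors the posted code, except that the inner loop is capped at m rather than m - 1, as the other answer in this thread suggests.

```java
final class JaroArrays {
    // Scratch buffers, allocated once and reused on every call.
    // Reusing them means a single instance is NOT thread-safe.
    private char[] common1;
    private char[] common2;
    private char[] copy;

    // maxLength is a hypothetical sizing hint; buffers grow if it is exceeded.
    JaroArrays(int maxLength) {
        common1 = new char[maxLength];
        common2 = new char[maxLength];
        copy = new char[maxLength];
    }

    float getSimilarity(String s1, String s2) {
        int n1 = s1.length();
        int n2 = s2.length();
        ensureCapacity(Math.max(n1, n2));

        // Half the shorter length, rounded up (same window as the posted code).
        int min = Math.min(n1, n2);
        int halflen = min / 2 + min % 2;

        int c1 = commonChars(s1, s2, halflen, common1);
        int c2 = commonChars(s2, s1, halflen, common2);
        if (c1 == 0 || c1 != c2) {
            return 0.0f;
        }

        // Count positions where the two common sequences disagree.
        int mismatches = 0;
        for (int i = 0; i < c1; i++) {
            if (common1[i] != common2[i]) {
                mismatches++;
            }
        }
        int transpositions = mismatches / 2;

        return (c1 / (float) n1
                + c2 / (float) n2
                + (c1 - transpositions) / (float) c1) / 3.0f;
    }

    // Writes the characters of a that are found in b into out; returns how many.
    private int commonChars(String a, String b, int distanceSep, char[] out) {
        int n = a.length();
        int m = b.length();
        b.getChars(0, m, copy, 0); // refill the scratch copy, no allocation
        int count = 0;
        for (int i = 0; i < n; i++) {
            char ch = a.charAt(i);
            int end = Math.min(i + distanceSep, m);
            for (int j = Math.max(0, i - distanceSep); j < end; j++) {
                if (copy[j] == ch) {
                    out[count++] = ch;
                    copy[j] = 0; // mark as consumed, as the posted code does
                    break;
                }
            }
        }
        return count;
    }

    // Grows the buffers only when an input is longer than anything seen so far.
    private void ensureCapacity(int needed) {
        if (copy.length < needed) {
            common1 = new char[needed];
            common2 = new char[needed];
            copy = new char[needed];
        }
    }
}
```

With one instance created up front, for example new JaroArrays(64), the 150,000 comparisons would allocate nothing per call as long as no input is longer than the buffers. Whether that is enough of a speedup on a given Android device would have to be measured.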


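Answer 2

I know this question has probably been solved for some time, but I would like to comment on the algorithm itself. When a string is compared against itself, the answer turns out to be off by 1/|string|. When slightly different values are compared, the results also come out lower than they should.

The solution is to change 'm-1' to just 'm' in the inner for statement of the getCommonCharacters method. The code then works like a charm :)

See http://en.wikipedia.org/wiki/Jaro%E2%80%93Winkler_distance as well for some examples.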
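To see the effect of that bound in isolation, here is a small sketch. The class WindowBoundDemo and the upperLimit parameter are hypothetical and exist only for this demonstration; the loop otherwise follows getCommonCharacters from the question, with a counter in place of the returned buffer.

```java
final class WindowBoundDemo {
    /**
     * Counts the characters of s1 that are found in s2 inside the search window.
     * upperLimit is the exclusive cap on j: pass s2.length() - 1 to mimic the
     * posted loop, or s2.length() for the adjusted loop.
     */
    static int commonCount(String s1, String s2, int distanceSep, int upperLimit) {
        char[] copy = s2.toCharArray();
        int count = 0;
        for (int i = 0; i < s1.length(); i++) {
            char ch = s1.charAt(i);
            int end = Math.min(i + distanceSep, upperLimit);
            for (int j = Math.max(0, i - distanceSep); j < end; j++) {
                if (copy[j] == ch) {
                    count++;
                    copy[j] = 0; // mark as consumed
                    break;
                }
            }
        }
        return count;
    }
}
```

For "abc" compared with itself (window 2), commonCount("abc", "abc", 2, 2) returns 2 because the last character of the second string is never examined, while commonCount("abc", "abc", 2, 3) returns 3. Plugged into the Jaro formula from the question, two common characters give (2/3 + 2/3 + 1) / 3 = 7/9 instead of 1.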

