Most efficient way to search for unknown patterns in a string?

substring algorithm java

2022-09-03 00:18:07

I am trying to find patterns that:

occur more than once
are more than 1 character long
are not substrings of any other known pattern

without knowing any of the patterns that might occur.

For example:

The string "the boy fell by the bell" would return .'ell', 'the b', 'y '
The string "the boy fell by the bell, the boy fell by the bell" would return .'the boy fell by the bell'

Using double for-loops, it can be brute forced very inefficiently:

ArrayList<String> patternsList = new ArrayList<>();
int length = string.length();
for (int i = 0; i < length; i++) {
    int limit = (length - i) / 2;
    for (int j = limit; j >= 1; j--) {
        int candidateEndIndex = i + j;
        String candidate = string.substring(i, candidateEndIndex);

        if(candidate.length() <= 1) {
            continue;
        }

        if (string.substring(candidateEndIndex).contains(candidate)) {
            boolean notASubpattern = true;
            for (String pattern : patternsList) {
                if (pattern.contains(candidate)) {
                    notASubpattern = false;
                    break;
                }
            }

            if (notASubpattern) {
                patternsList.add(candidate);
            }
        }
    }
}

However, this is incredibly slow when searching large strings with tons of patterns.

答案 1

You can build a suffix tree for your string in linear time: https://en.wikipedia.org/wiki/Suffix_tree

The patterns you are looking for are the strings corresponding to internal nodes that have only leaf children.

答案 2

You could use n-grams to find patterns in a string. It would take O(n) time to scan the string for n-grams. When you find a substring by using a n-gram, put it into a hash table with a count of how many times that substring was found in the string. When you're done searching for n-grams in the string, search the hash table for counts greater than 1 to find recurring patterns in the string.

For example, in the string "the boy fell by the bell, the boy fell by the bell" using a 6-gram will find the substring "the boy fell by the bell". A hash table entry with that substring will have a count of 2 because it occurred twice in the string. Varying the number of words in the n-gram will help you discover different patterns in the string.

Dictionary<string, int>dict = new Dictionary<string, int>();
int count = 0;
int ngramcount = 6;
string substring = "";

// Add entries to the hash table
while (count < str.length) {
    // copy the words into the substring
    int i = 0;
    substring = "";
    while (ngramcount > 0 && count < str.length) {
        substring[i] = str[count];
        if (str[i] == ' ')
            ngramcount--;
        i++;
        count++;
    }
    ngramcount = 6;
    substring.Trim();  // get rid of the last blank in the substring
    // Update the dictionary (hash table) with the substring
    if (dict.Contains(substring)) {  // substring is already in hash table so increment the count
        int hashCount = dict[substring];
        hashCount++;
        dict[substring] = hashCount;
    }
    else
        dict[substring] = 1;
}

// Find the most commonly occurrring pattern in the string
// by searching the hash table for the greatest count.
int maxCount = 0;
string mostCommonPattern = "";
foreach (KeyValuePair<string, int> pair in dict) {
    if (pair.Value > maxCount) {
        maxCount = pair.Value;
        mostCommonPattern = pair.Key;
    }
}