Shortest possible resulting length after iterated string replacements

performance optimization algorithm java

2022-09-04 04:58:04

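How do I reasonably efficiently find the shortest possible output given by repeated application of replacements to the input sequence? I believe (please correct me if I am wrong) that this is exponential-time in the worst case, but I am not sure because of the second constraint below. The naive method certainly is.

I tried coding the naive method (for all possible replacements, for all valid positions, recurse on a copy of the input after applying the replacement at that position; return the shortest of all valid recursions and the input, with a cache on the function to catch equivalent replacement sequences), but it is (unworkably) slow, and I'm pretty sure it's an algorithmic issue rather than an implementation one.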
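
For reference, the naive method described above can be sketched roughly as follows. This is an illustrative Python sketch, not the asker's Java code; the name minimal_after_replacements is made up for the example, and tokens are assumed to be hashable.

```python
from functools import lru_cache

def minimal_after_replacements(tokens, replacements):
    """Brute-force search with memoization (hypothetical helper name)."""
    rules = [(tuple(lhs), tuple(rhs)) for lhs, rhs in replacements.items()]

    @lru_cache(maxsize=None)
    def shortest(seq):
        best = seq  # the input itself is always a candidate
        for lhs, rhs in rules:
            for p in range(len(seq) - len(lhs) + 1):
                if seq[p:p + len(lhs)] == lhs:
                    # Recurse on a copy with the replacement applied at p.
                    candidate = shortest(seq[:p] + rhs + seq[p + len(lhs):])
                    if len(candidate) < len(best):
                        best = candidate
        return best

    return shortest(tuple(tokens))

rules = {"aaba": "a", "aaa": "ab", "aba": "bb"}
print("".join(minimal_after_replacements("aaaaa", rules)))  # prints: a
```

The recursion terminates because every rule strictly shortens the sequence, but the number of distinct intermediate sequences can still grow exponentially.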

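A couple of things that may (or may not) make a difference:

Token is an enumerated type.
The length of the output of each entry in the map is strictly less than the length of that entry's input.
I do not need to know which replacements were done and where, just the resulting sequence.

So, as an example where each character is a token (for simplicity's sake), if I have the replacement map aaba -> a, aaa -> ab, and aba -> bb, and I apply minimalString('aaaaa'), I want to get 'a'.

The actual method signature is something along the lines of the following: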

List<Token> getMinimalAfterReplacements(List<Token> inputList, Map<List<Token>, List<Token>> replacements) {
    ?
}

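Is there a better method than brute force? If not, is there, for example, a SAT library or similar that could be harnessed? Is there any preprocessing of the map that could be done to make it faster when it is called multiple times with different token lists but the same replacement map?

Answer 1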

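The code below is a Python version that finds the shortest possible reduction. It is non-recursive, but not very far from the naive algorithm. In each step it tries all possible single reductions, thus obtaining a set of strings to reduce in the next step.

One optimization that helps when there are "symbol eating" rules like "aa" -> "a" is to check the next set of strings for duplicates.

Another optimization (not implemented in the code below) would be to process the replacement rules into a finite automaton that finds the locations of all possible single reductions in a single pass through the input string. This would not help with the exponential nature of the main tree search algorithm, though.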

class Replacer:
  def __init__(self, replacements):
    self.replacements = [[tuple(key), tuple(value)] for key, value in replacements.items()]

  def get_possible_replacements(self, input):
    "Return all possible variations where a single replacement was done to the input"
    result = []
    for replace_what, replace_with in self.replacements:
      #print replace_what, replace_with
      for p in range(1 + len(input) - len(replace_what)):
        if input[p : p + len(replace_what)] == replace_what:
          input_copy = list(input[:])
          input_copy[p : p + len(replace_what)] = replace_with
          result.append(tuple(input_copy))
    return result

  def get_minimum_sequence_list(self, input):
    "Return the shortest irreducible sequence that can be obtained from the given input"
    irreducible = []
    to_reduce = [tuple(input)]
    to_reduce_new = []
    step = 1
    while to_reduce:
      print("Reduction step %d, number of candidates to reduce: %d" % (step, len(to_reduce)))
      step += 1
      for current_input in to_reduce:
        reductions = self.get_possible_replacements(current_input)
        if not reductions:
          irreducible.append(current_input)
        else:
          to_reduce_new += reductions
      to_reduce = set(to_reduce_new[:]) # This dramatically reduces the tree width by removing duplicates
      to_reduce_new = []

    irreducible_sorted = sorted(set(irreducible), key = lambda x: len(x))
    #print "".join(input), "could be reduced to any of", ["".join(x) for x in irreducible_sorted]
    return irreducible_sorted[0]

  def get_minimum_sequence(self, input):
    return "".join(self.get_minimum_sequence_list(list(input)))

input = "aaaaa"

replacements = {
  "aaba" : "a",
  "aaa" : "ab",
  "aba" : "bb",
}

replacer = Replacer(replacements)
replaced = replacer.get_minimum_sequence(input)
print("The shortest string %s could be reduced to is %s" % (input, replaced))
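
The finite-automaton optimization mentioned above is not spelled out in the answer. One possible construction is an Aho-Corasick automaton built from the left-hand sides of the rules; the sketch below is an assumption about how that could look, and build_automaton and find_matches are hypothetical names. It only locates the candidate positions in one pass and does not change the exponential search itself.

```python
from collections import deque

def build_automaton(patterns):
    """Build an Aho-Corasick automaton over the rule left-hand sides."""
    goto = [{}]   # goto[state][symbol] -> next state
    fail = [0]    # failure link of each state
    out = [[]]    # patterns ending at each state
    for pattern in patterns:
        state = 0
        for symbol in pattern:
            if symbol not in goto[state]:
                goto.append({})
                fail.append(0)
                out.append([])
                goto[state][symbol] = len(goto) - 1
            state = goto[state][symbol]
        out[state].append(pattern)
    # Breadth-first pass: fill in failure links and merge outputs.
    queue = deque(goto[0].values())
    while queue:
        state = queue.popleft()
        for symbol, nxt in goto[state].items():
            queue.append(nxt)
            f = fail[state]
            while f and symbol not in goto[f]:
                f = fail[f]
            fail[nxt] = goto[f].get(symbol, 0)
            out[nxt] = out[nxt] + out[fail[nxt]]
    return goto, fail, out

def find_matches(sequence, automaton):
    """Return (start, pattern) for every occurrence, in one left-to-right pass."""
    goto, fail, out = automaton
    matches = []
    state = 0
    for i, symbol in enumerate(sequence):
        while state and symbol not in goto[state]:
            state = fail[state]
        state = goto[state].get(symbol, 0)
        for pattern in out[state]:
            matches.append((i - len(pattern) + 1, pattern))
    return matches

automaton = build_automaton(["aaba", "aaa", "aba"])
print(find_matches("aaaaa", automaton))  # [(0, 'aaa'), (1, 'aaa'), (2, 'aaa')]
```

Since the automaton depends only on the replacement map, it could be built once and reused across calls with different inputs.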

Answer 2

Just a simple idea which could reduce the branching: with rules like

ba -> c
ca -> b

and a string like

aaabaacaa
   ^  ^

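you can do two substitutions, and their order doesn't matter. This is already covered to some extent by memoization; however, generating the useless strings still has considerable overhead. So I'd suggest the following rule:

After a substitution at position p, consider only substitutions at positions q such that

q + length(lhs_of_the_rule) > p

i.e., such that the new substitution does not lie entirely to the left of the previous one: it either starts at or after p, or it overlaps it.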
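
A minimal sketch of how the rule above might be added to a breadth-first search. The function name shortest_with_ordering is hypothetical, and the sketch assumes the search state has to be the pair (sequence, p) rather than the sequence alone, because the same sequence reached with a different p allows a different set of follow-up substitutions.

```python
def shortest_with_ordering(tokens, replacements):
    """BFS that skips substitutions lying entirely left of the previous one."""
    rules = [(tuple(lhs), tuple(rhs)) for lhs, rhs in replacements.items()]
    start = (tuple(tokens), 0)  # (sequence, position p of the last substitution)
    frontier = [start]
    seen = {start}
    irreducible = set()
    while frontier:
        next_frontier = []
        for seq, p in frontier:
            reducible = False
            for lhs, rhs in rules:
                for q in range(len(seq) - len(lhs) + 1):
                    if seq[q:q + len(lhs)] != lhs:
                        continue
                    reducible = True  # some rule applies, pruned or not
                    if q + len(lhs) <= p:
                        continue      # entirely left of p: another order covers it
                    state = (seq[:q] + rhs + seq[q + len(lhs):], q)
                    if state not in seen:
                        seen.add(state)
                        next_frontier.append(state)
            if not reducible:
                irreducible.add(seq)
        frontier = next_frontier
    return min(irreducible, key=len)

print("".join(shortest_with_ordering("aaaaa", {"aaba": "a", "aaa": "ab", "aba": "bb"})))  # prints: a
```

The pruning should not lose any result: two substitutions where the second lies entirely to the left of the first touch disjoint parts of the string, so they can be swapped, and the swapped order is the one the search keeps.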

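As a simple low-level optimization, I'd suggest replacing the List<Token> with a String (or an encapsulated byte[] or short[] or whatever). The lower memory footprint should help caching, and you can index an array by one (or two) string elements in order to find out which rules may be applicable to it.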
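
The same idea can be illustrated in Python with bytes standing in for byte[]. This is only a sketch: encode_rules and single_replacements are hypothetical names, and it assumes at most 256 distinct tokens and non-empty left-hand sides.

```python
def encode_rules(replacements, alphabet):
    """Map tokens to small integers and group rules by their first symbol."""
    code = {token: i for i, token in enumerate(alphabet)}
    by_first = [[] for _ in alphabet]  # by_first[c] -> rules whose lhs starts with c
    for lhs, rhs in replacements.items():
        lhs_b = bytes(code[t] for t in lhs)
        rhs_b = bytes(code[t] for t in rhs)
        by_first[lhs_b[0]].append((lhs_b, rhs_b))
    return code, by_first

def single_replacements(seq, by_first):
    """Yield every sequence reachable from seq (a bytes object) in one step."""
    for p, symbol in enumerate(seq):  # iterating over bytes yields ints
        # Only rules whose left-hand side starts with this symbol can match here.
        for lhs, rhs in by_first[symbol]:
            if seq.startswith(lhs, p):
                yield seq[:p] + rhs + seq[p + len(lhs):]

alphabet = "ab"
code, by_first = encode_rules({"aaba": "a", "aaa": "ab", "aba": "bb"}, alphabet)
seq = bytes(code[t] for t in "aaaaa")
decoded = sorted("".join(alphabet[b] for b in r) for r in single_replacements(seq, by_first))
print(decoded)  # ['aaab', 'aaba', 'abaa']
```

The lookup table depends only on the replacement map, so it can be computed once and shared across calls.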