Java Matcher 中的零星堆栈溢出错误

2022-09-04 21:09:02

我有一些文件解析器代码,我偶尔会在m.matches()(其中m是匹配器)上获得堆栈溢出错误。

我再次运行我的应用程序,它解析相同的文件,没有堆栈溢出。

确实,我的模式有点复杂。它基本上是一堆可选的零长度正前视,其中包含命名组,以便我可以匹配一堆变量名称/值对,而不管它们的顺序如何。但是我预计,如果某个字符串会导致堆栈溢出错误,它总是会导致它...不仅仅是有时...任何想法?

我的模式的简化版本"prefix(?=\\s+user=(?<user>\\S+))?(?=\\s+repo=(?<repo>\\S+))?.*?"

完整的正则表达式是...

app=github(?=(?:[^"]|"[^"]*")*\s+user=(?<user>\S+))?(?=(?:[^"]|"[^"]*")*\s+repo=(?<repo>\S+))?(?=(?:[^"]|"[^"]*")*\s+remote_address=(?<ip>\S+))?(?=(?:[^"]|"[^"]*")*\s+now="(?<time>\S+)\+\d\d:\d\d")?(?=(?:[^"]|"[^"]*")*\s+url="(?<url>\S+)")?(?=(?:[^"]|"[^"]*")*\s+referer="(?<referer>\S+)")?(?=(?:[^"]|"[^"]*")*\s+status=(?<status>\S+))?(?=(?:[^"]|"[^"]*")*\s+elapsed=(?<elapsed>\S+))?(?=(?:[^"]|"[^"]*")*\s+request_method=(?<requestmethod>\S+))?(?=(?:[^"]|"[^"]*")*\s+created_at="(?<createdat>\S+)(?:-|\+)\d\d:\d\d")?(?=(?:[^"]|"[^"]*")*\s+pull_request_id=(?<pullrequestid>\d+))?(?=(?:[^"]|"[^"]*")*\s+at=(?<at>\S+))?(?=(?:[^"]|"[^"]*")*\s+fn=(?<fn>\S+))?(?=(?:[^"]|"[^"]*")*\s+method=(?<method>\S+))?(?=(?:[^"]|"[^"]*")*\s+current_user=(?<user2>\S+))?(?=(?:[^"]|"[^"]*")*\s+content_length=(?<contentlength>\S+))?(?=(?:[^"]|"[^"]*")*\s+request_category=(?<requestcategory>\S+))?(?=(?:[^"]|"[^"]*")*\s+controller=(?<controller>\S+))?(?=(?:[^"]|"[^"]*")*\s+action=(?<action>\S+))?.*?

堆栈顶部溢出错误堆栈...(大约9800行长)

Exception: java.lang.StackOverflowError
    at java.util.regex.Pattern$BranchConn.match(Pattern.java:4480)
    at java.util.regex.Pattern$CharProperty.match(Pattern.java:3706)
    at java.util.regex.Pattern$Branch.match(Pattern.java:4516)
    at java.util.regex.Pattern$GroupHead.match(Pattern.java:4570)
    at java.util.regex.Pattern$Loop.match(Pattern.java:4697)
    at java.util.regex.Pattern$GroupTail.match(Pattern.java:4629)
    at java.util.regex.Pattern$BranchConn.match(Pattern.java:4480)
    at java.util.regex.Pattern$CharProperty.match(Pattern.java:3706)
    at java.util.regex.Pattern$Branch.match(Pattern.java:4516)
    at java.util.regex.Pattern$GroupHead.match(Pattern.java:4570)
    at java.util.regex.Pattern$Loop.match(Pattern.java:4697)
    at java.util.regex.Pattern$GroupTail.match(Pattern.java:4629)
    at java.util.regex.Pattern$BranchConn.match(Pattern.java:4480)
    at java.util.regex.Pattern$CharProperty.match(Pattern.java:3706)
    at java.util.regex.Pattern$Branch.match(Pattern.java:4516)
    at java.util.regex.Pattern$GroupHead.match(Pattern.java:4570)
    at java.util.regex.Pattern$Loop.match(Pattern.java:4697)
    at java.util.regex.Pattern$GroupTail.match(Pattern.java:4629)
    at java.util.regex.Pattern$BranchConn.match(Pattern.java:4480)
    at java.util.regex.Pattern$CharProperty.match(Pattern.java:3706)
    at java.util.regex.Pattern$Branch.match(Pattern.java:4516)
    at java.util.regex.Pattern$GroupHead.match(Pattern.java:4570)
    at java.util.regex.Pattern$Loop.match(Pattern.java:4697)
    at java.util.regex.Pattern$GroupTail.match(Pattern.java:4629)
    at java.util.regex.Pattern$BranchConn.match(Pattern.java:4480)
    at java.util.regex.Pattern$CharProperty.match(Pattern.java:3706)
    at java.util.regex.Pattern$Branch.match(Pattern.java:4516)
    at java.util.regex.Pattern$GroupHead.match(Pattern.java:4570)
    at java.util.regex.Pattern$Loop.match(Pattern.java:4697)
    at java.util.regex.Pattern$GroupTail.match(Pattern.java:4629)
    at java.util.regex.Pattern$BranchConn.match(Pattern.java:4480)
    at java.util.regex.Pattern$CharProperty.match(Pattern.java:3706)
    at java.util.regex.Pattern$Branch.match(Pattern.java:4516)
    at java.util.regex.Pattern$GroupHead.match(Pattern.java:4570)
    at java.util.regex.Pattern$Loop.match(Pattern.java:4697)
    at java.util.regex.Pattern$GroupTail.match(Pattern.java:4629)
    at java.util.regex.Pattern$BranchConn.match(Pattern.java:4480)
    at java.util.regex.Pattern$CharProperty.match(Pattern.java:3706)
    at java.util.regex.Pattern$Branch.match(Pattern.java:4516)
    at java.util.regex.Pattern$GroupHead.match(Pattern.java:4570)
    at java.util.regex.Pattern$Loop.match(Pattern.java:4697)
    at java.util.regex.Pattern$GroupTail.match(Pattern.java:4629)

我遇到错误的行的示例。(虽然我已经运行了10次,但没有得到任何错误)

app=github env=production enterprise=true auth_fingerprint=\"token:6b29527b:9.99.999.99\" controller=\"Api::GitCommits\" path_info=\"/api/v3/repos/XYZ-ABCDE/abcdefg-abc/git/commits/77ae1376f969059f5f1e23cc5669bff8cca50563.diff\" query_string=nil version=v3 auth=oauth current_user=abcdefghijk oauth_access_id=24 oauth_application_id=0 oauth_scopes=\"gist,notifications,repo,user\" route=\"/repositories/:repository_id/git/commits/:id\" org=XYZ-ABCDE oauth_party=personal repo=XYZ-ABCDE/abcdefg-abc repo_visibility=private now=\"2015-09-24T13:44:52+00:00\" request_id=675fa67e-c1de-4bfa-a965-127b928d427a server_id=c31404fc-b7d0-41a1-8017-fc1a6dce8111 remote_address=9.99.999.99 request_method=get content_length=92 content_type=\"application/json; charset=utf-8\" user_agent=nil accept=application/json language=nil referer=nil x_requested_with=nil status=404 elapsed=0.041 url=\"https://git.abc.abcd.abc.com/api/v3/repos/XYZ-ABCDE/abcdefg-abc/git/commits/77ae1376f969059f5f1e23cc5669bff8cca50563.diff\" worker_request_count=77192 request_category=apiapp=github env=production enterprise=true auth_fingerprint=\"token:6b29527b:9.99.999.99\" controller=\"Api::GitCommits\" path_info=\"/api/v3/repos/XYZ-ABCDE/abcdefg-abc/git/commits/9bee255c7b13c589f4e9f1cb2d4ebb5b8519ba9c.diff\" query_string=nil version=v3 auth=oauth current_user=abcdefghijk oauth_access_id=24 oauth_application_id=0 oauth_scopes=\"gist,notifications,repo,user\" route=\"/repositories/:repository_id/git/commits/:id\" org=XYZ-ABCDE oauth_party=personal repo=XYZ-ABCDE/abcdefg-abc repo_visibility=private now=\"2015-09-24T13:44:52+00:00\" request_id=89fcb32e-9ab5-47f7-9464-e5f5cff175e8 server_id=1b74880a-5124-4483-adce-111b60dac111 remote_address=9.99.999.99 request_method=get content_length=92 content_type=\"application/json; charset=utf-8\" user_agent=nil accept=application/json language=nil referer=nil x_requested_with=nil status=404 elapsed=0.024 url=\"https://git.abc.abcd.abc.com/api/v3/repos/XYZ-ABCDE/abcdefg-abc/git/commits/9bee255c7b13c589f4e9f1cb2d4ebb5b8519ba9c.diff\" worker_request_count=76263 request_category=api

有趣。。。这行似乎是一个错误...日志似乎将换行符放在错误的位置,导致两个日志条目位于一行上,后跟一个空行。正是这一长串导致了错误...反正一次...现在它运行良好,没有堆栈溢出


答案 1

有两种方法可以解决您的问题:

  • 正确解析输入字符串并从 中获取键值。Map

    我强烈建议使用此方法,因为代码将更加干净,并且我们不再需要监视输入大小的限制。

  • 修改现有的正则表达式,以大大降低导致的实现缺陷的影响。StackOverflowError

分析输入字符串

您可以使用以下正则表达式分析输入字符串:

\G\s*+(\w++)=([^\s"]++|"[^"]*+")(?:\s++|$)
  • 所有的量词都是所有格的(而不是 ,而不是 ),因为我写的模式不需要回溯。*+*+++

  • 您可以在中间找到用于匹配键值对的基本正则表达式。(\w++)=([^\s"]++|"[^"]*+")

  • \G是为了确保比赛从最后一场比赛结束的地方开始。它用于防止发动机在不匹配时“颠簸”。

  • \s*+并用于消耗多余的空间。我指定而不是防止被识别为有效输入。(?:\s++|$)(?:\s++|$)\s*+key="value"key=value

完整的示例代码可以在下面找到:

private static final Pattern KEY_VALUE = Pattern.compile("\\G\\s*+(\\w++)=([^\\s\"]++|\"[^\"]*+\")(?:\\s++|$)");

public static Map<String, String> parseKeyValue(String kvString) {
    Matcher matcher = KEY_VALUE.matcher(kvString);

    Map<String, String> output = new HashMap<String, String>();
    int lastIndex = -1;

    while (matcher.find()) {
        output.put(matcher.group(1), matcher.group(2));
        lastIndex = matcher.end();
    }

    // Make sure that we match everything from the input string
    if (lastIndex != kvString.length()) {
        return null;
    }

    return output;
}

您可能希望取消引用这些值,具体取决于您的要求。

您还可以重写函数以传递要提取的密钥,并在while循环中将它们选中,以避免存储您不关心的密钥。List

修改正则表达式

问题是由于外部重复是用递归实现的,导致输入字符串足够长。(?:[^"]|"[^"]*")*StackOverflowError

具体而言,在每次重复中,它匹配带引号的标记或单个非引号字符。因此,堆栈随非引号字符的数量线性增长并爆炸。

可以将 的所有实例替换为 。堆栈现在将随着引用令牌的数量线性增长,因此StackOverflowError不会发生,除非您在输入字符串中有数千个引用的令牌。(?:[^"]|"[^"]*")*[^"]*(?:"[^"]*"[^"]*)*

Pattern KEY_CAPTURE = Pattern.compile("app=github(?=[^\"]*(?:\"[^\"]*\"[^\"]*)*\\s+user=(?<user>\\S+))?(?=[^\"]*(?:\"[^\"]*\"[^\"]*)*\\s+repo=(?<repo>\\S+))?(?=[^\"]*(?:\"[^\"]*\"[^\"]*)*\\s+remote_address=(?<ip>\\S+))?(?=[^\"]*(?:\"[^\"]*\"[^\"]*)*\\s+now=\"(?<time>\\S+)\\+\\d\\d:\\d\\d\")?(?=[^\"]*(?:\"[^\"]*\"[^\"]*)*\\s+url=\"(?<url>\\S+)\")?(?=[^\"]*(?:\"[^\"]*\"[^\"]*)*\\s+referer=\"(?<referer>\\S+)\")?(?=[^\"]*(?:\"[^\"]*\"[^\"]*)*\\s+status=(?<status>\\S+))?(?=[^\"]*(?:\"[^\"]*\"[^\"]*)*\\s+elapsed=(?<elapsed>\\S+))?(?=[^\"]*(?:\"[^\"]*\"[^\"]*)*\\s+request_method=(?<requestmethod>\\S+))?(?=[^\"]*(?:\"[^\"]*\"[^\"]*)*\\s+created_at=\"(?<createdat>\\S+)(?:-|\\+)\\d\\d:\\d\\d\")?(?=[^\"]*(?:\"[^\"]*\"[^\"]*)*\\s+pull_request_id=(?<pullrequestid>\\d+))?(?=[^\"]*(?:\"[^\"]*\"[^\"]*)*\\s+at=(?<at>\\S+))?(?=[^\"]*(?:\"[^\"]*\"[^\"]*)*\\s+fn=(?<fn>\\S+))?(?=[^\"]*(?:\"[^\"]*\"[^\"]*)*\\s+method=(?<method>\\S+))?(?=[^\"]*(?:\"[^\"]*\"[^\"]*)*\\s+current_user=(?<user2>\\S+))?(?=[^\"]*(?:\"[^\"]*\"[^\"]*)*\\s+content_length=(?<contentlength>\\S+))?(?=[^\"]*(?:\"[^\"]*\"[^\"]*)*\\s+request_category=(?<requestcategory>\\S+))?(?=[^\"]*(?:\"[^\"]*\"[^\"]*)*\\s+controller=(?<controller>\\S+))?(?=[^\"]*(?:\"[^\"]*\"[^\"]*)*\\s+action=(?<action>\\S+))?");

它遵循正则表达式→等效扩展。哪个用作A或B取决于它们的重复次数 - 无论重复次数越多,应该是A,另一个应该是B。(A|B)*A*(BA*)*

深入探讨实施

StackOverflowErrorin 是一个已知问题,当您的模式包含非确定性的1 捕获/非捕获组(即您案例中的子模式)的重复时,可能会发生这种情况。Pattern(?:[^"]|"[^"]*")*

1 这是 Pattern 源代码中使用的术语,它可能旨在作为模式长度固定的指标。但是,无论实际模式如何,该实现都将交替|视为非确定性的。

将非确定性捕获/非捕获组的贪婪或懒惰重复编译为 / 类,这些类通过递归实现重复。因此,这种模式非常容易触发,特别是当组包含一个分支时,一次只有一个字符匹配。LoopLazyLoopStackOverflowError

另一方面,确定性2重复,所有格重复和独立组(又名原子组或非回溯组)的重复被编译成/类,在大多数情况下,它用循环处理重复,因此不会有。(?>...)CurlyGroupCurlyStackOverflowError

2 重复模式是一个字符类,或一个固定长度的捕获/非捕获组,没有任何交替

您可以在下面看到原始正则表达式的片段是如何编译的。记下有问题的部分,该部分以 开头,并将其与堆栈跟踪进行比较。Loop

app=github(?=(?:[^"]|"[^"]*")*\s+user=(?<user>\S+))?(?=(?:[^"]|"[^"]*")*\s+repo=(?<repo>\S+))?
BnM. Boyer-Moore (BMP only version) (length=10)
  app=github
Ques. Greedy optional quantifier
  Pos. Positive look-ahead
    GroupHead. local=0
    Prolog. Loop wrapper
    Loop [1889ca51]. Greedy quantifier {0,2147483647}
      GroupHead. local=1
      Branch. Alternation (in printed order):
        CharProperty.complement. S̄:
          BitClass. Match any of these 1 character(s):
            "
        ---
        Single. Match code point: U+0022 QUOTATION MARK
        Curly. Greedy quantifier {0,2147483647}
          CharProperty.complement. S̄:
            BitClass. Match any of these 1 character(s):
              "
          Node. Accept match
        Single. Match code point: U+0022 QUOTATION MARK
        ---
      BranchConn [7e41986c]. Connect branches to sequel.
      GroupTail [47e1b36]. local=1, group=0. --[next]--> Loop [1889ca51]
    Curly. Greedy quantifier {1,2147483647}
      Ctype. POSIX (US-ASCII): SPACE
      Node. Accept match
    Slice. Match the following sequence (BMP only version) (length=5)
      user=
    GroupHead. local=3
    Curly. Greedy quantifier {1,2147483647}
      CharProperty.complement. S̄:
        Ctype. POSIX (US-ASCII): SPACE
      Node. Accept match
    GroupTail [732c7887]. local=3, group=2. --[next]--> GroupTail [6c9d2223]
    GroupTail [6c9d2223]. local=0, group=0. --[next]--> Node [4ea5d7f2]
    Node. Accept match
  Node. Accept match
Ques. Greedy optional quantifier
  Pos. Positive look-ahead
    GroupHead. local=4
    Prolog. Loop wrapper
    Loop [402c5f8a]. Greedy quantifier {0,2147483647}
      GroupHead. local=5
      Branch. Alternation (in printed order):
        CharProperty.complement. S̄:
          BitClass. Match any of these 1 character(s):
            "
        ---
        Single. Match code point: U+0022 QUOTATION MARK
        Curly. Greedy quantifier {0,2147483647}
          CharProperty.complement. S̄:
            BitClass. Match any of these 1 character(s):
              "
          Node. Accept match
        Single. Match code point: U+0022 QUOTATION MARK
        ---
      BranchConn [21347df0]. Connect branches to sequel.
      GroupTail [7d382897]. local=5, group=0. --[next]--> Loop [402c5f8a]
    Curly. Greedy quantifier {1,2147483647}
      Ctype. POSIX (US-ASCII): SPACE
      Node. Accept match
    Slice. Match the following sequence (BMP only version) (length=5)
      repo=
    GroupHead. local=7
    Curly. Greedy quantifier {1,2147483647}
      CharProperty.complement. S̄:
        Ctype. POSIX (US-ASCII): SPACE
      Node. Accept match
    GroupTail [71f111ba]. local=7, group=4. --[next]--> GroupTail [9c304c7]
    GroupTail [9c304c7]. local=4, group=0. --[next]--> Node [4ea5d7f2]
    Node. Accept match
  Node. Accept match
LastNode.
Node. Accept match

答案 2

最终答案:

将此功能与其他功能移动到交替组中
(?:[^"]|"[^"]*")*

示例:https://ideone.com/YuVcMg

它不能被打破!


附注 - 我注意到你说你删除了一个换行符,最终
一条记录的末尾没有分隔符,
就像这样request_category=apiapp=github

这没关系,但是当它们击中它时,这些正则表达式会主要被它吹走
\S+

因此,最好替换为 ,
这在下面的正则表达式中没有完成。下面是添加的那个:\S+(?:(?!app=github)\S)+

"(?s)app=github(?>\\s+user=(?<user>(?:(?!app=github)\\S)+)|\\s+repo=(?<repo>(?:(?!app=github)\\S)+)|\\s+remote_address=(?<ip>(?:(?!app=github)\\S)+)|\\s+now=\\\\?\"(?<time>(?:(?!app=github)\\S)+)\\+\\d\\d:\\d\\d\\\\?\"|\\s+url=\\\\?\"(?<url>(?:(?!app=github)\\S)+)\\\\?\"|\\s+referer=\\\\?\"(?<referer>(?:(?!app=github)\\S)+)\\\\?\"|\\s+status=(?<status>(?:(?!app=github)\\S)+)|\\s+elapsed=(?<elapsed>(?:(?!app=github)\\S)+)|\\s+request_method=(?<requestmethod>(?:(?!app=github)\\S)+)|\\s+created_at=\\\\?\"(?<createdat>(?:(?!app=github)\\S)+)[-+]\\d\\d:\\d\\d\\\\?\"|\\s+pull_request_id=(?<pullrequestid>\\d+)|\\s+at=(?<at>(?:(?!app=github)\\S)+)|\\s+fn=(?<fn>(?:(?!app=github)\\S)+)|\\s+method=(?<method>(?:(?!app=github)\\S)+)|\\s+current_user=(?<user2>(?:(?!app=github)\\S)+)|\\s+content_length=(?<contentlength>(?:(?!app=github)\\S)+)|\\s+request_category=(?<requestcategory>(?:(?!app=github)\\S)+)|\\s+controller=(?<controller>(?:(?!app=github)\\S)+)|\\s+action=(?<action>(?:(?!app=github)\\S)+)|\"[^\"]*\"|(?!app=github).)+"

以及使用它的该示例的链接:https://ideone.com/hdwufO


正则表达式

生:

(?s)app=github(?>\s+user=(?<user>\S+)|\s+repo=(?<repo>\S+)|\s+remote_address=(?<ip>\S+)|\s+now=\\?"(?<time>\S+)\+\d\d:\d\d\\?"|\s+url=\\?"(?<url>\S+)\\?"|\s+referer=\\?"(?<referer>\S+)\\?"|\s+status=(?<status>\S+)|\s+elapsed=(?<elapsed>\S+)|\s+request_method=(?<requestmethod>\S+)|\s+created_at=\\?"(?<createdat>\S+)[-+]\d\d:\d\d\\?"|\s+pull_request_id=(?<pullrequestid>\d+)|\s+at=(?<at>\S+)|\s+fn=(?<fn>\S+)|\s+method=(?<method>\S+)|\s+current_user=(?<user2>\S+)|\s+content_length=(?<contentlength>\S+)|\s+request_category=(?<requestcategory>\S+)|\s+controller=(?<controller>\S+)|\s+action=(?<action>\S+)|"[^"]*"|(?!app=github).)+

弦:

"(?s)app=github(?>\\s+user=(?<user>\\S+)|\\s+repo=(?<repo>\\S+)|\\s+remote_address=(?<ip>\\S+)|\\s+now=\\\\?\"(?<time>\\S+)\\+\\d\\d:\\d\\d\\\\?\"|\\s+url=\\\\?\"(?<url>\\S+)\\\\?\"|\\s+referer=\\\\?\"(?<referer>\\S+)\\\\?\"|\\s+status=(?<status>\\S+)|\\s+elapsed=(?<elapsed>\\S+)|\\s+request_method=(?<requestmethod>\\S+)|\\s+created_at=\\\\?\"(?<createdat>\\S+)[-+]\\d\\d:\\d\\d\\\\?\"|\\s+pull_request_id=(?<pullrequestid>\\d+)|\\s+at=(?<at>\\S+)|\\s+fn=(?<fn>\\S+)|\\s+method=(?<method>\\S+)|\\s+current_user=(?<user2>\\S+)|\\s+content_length=(?<contentlength>\\S+)|\\s+request_category=(?<requestcategory>\\S+)|\\s+controller=(?<controller>\\S+)|\\s+action=(?<action>\\S+)|\"[^\"]*\"|(?!app=github).)+"

格式 化:

 (?s)
 app = github
 (?>
      \s+ 
      user =
      (?<user> \S+ )                # (1)
   |  
      \s+  repo =
      (?<repo> \S+ )                # (2)
   |  
      \s+ remote_address =
      (?<ip> \S+ )                  # (3)
   |  
      \s+ now= \\? "
      (?<time> \S+ )                # (4)
      \+ \d\d : \d\d \\? "
   |  
      \s+ url = \\? "
      (?<url> \S+ )                 # (5)
      \\? "
   |  
      \s+ referer = \\? "
      (?<referer> \S+ )             # (6)
      \\? "
   |  
      \s+ status =
      (?<status> \S+ )              # (7)
   |  
      \s+ elapsed =
      (?<elapsed> \S+ )             # (8)
   |  
      \s+ request_method =
      (?<requestmethod> \S+ )       # (9)
   |  
      \s+ created_at = \\? "
      (?<createdat> \S+ )           # (10)
      [-+] 
      \d\d : \d\d \\? "
   |  
      \s+ pull_request_id =
      (?<pullrequestid> \d+ )       # (11)
   |  
      \s+ at=
      (?<at> \S+ )                  # (12)
   |  
      \s+ fn=
      (?<fn> \S+ )                  # (13)
   |  
      \s+ method =
      (?<method> \S+ )              # (14)
   |  
      \s+ current_user =
      (?<user2> \S+ )               # (15)
   |  
      \s+ content_length =
      (?<contentlength> \S+ )       # (16)
   |  
      \s+ request_categor y=
      (?<requestcategory> \S+ )     # (17)
   |  
      \s+ controller =
      (?<controller> \S+ )          # (18)
   |  
      \s+ action =
      (?<action> \S+ )              # (19)
   |  
      " [^"]* "                     # None of the above, give quotes a chance
   |  
      (?! app = github )            # Failsafe, consume a character, advance by 1
      . 
 )+