Welcome to ShenZhenJia Knowledge Sharing Community for programmer and developer-Open, Learning and Share
menu search
person
Welcome To Ask or Share your Answers For Others

Categories

My regex is taking increasingly long to match (about 30 seconds the 5th time) but needs to be applied for around 500 rounds of matches. I suspect catastrophic backtracking. Please help! How can I optimize this regex:

String regex = "<tr bgcolor="ffffff">\s*?<td width="20%"><b>((?:.|\s)+?): *?</b></td>\s*?<td width="80%">((?:.|\s)*?)(?=(?:</td>\s*?</tr>\s*?<tr bgcolor="ffffff">)|(?:</td>\s*?</tr>\s*?</table>\s*?<b>Tags</b>))";

EDIT: since it was not clear(my bad): i am trying to take a html formatted document and reformat by extracting the two search groups and adding formating afterwards.

See Question&Answers more detail:os

与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
thumb_up_alt 0 like thumb_down_alt 0 dislike
236 views
Welcome To Ask or Share your Answers For Others

1 Answer

The alternation (?:.|\s)+? is very inefficient, as it involves too much backtracking.

Basically, all variations of this pattern are extremely inefficient: (?:.|s)*?, (?:.| )*?, (?:.| )*? and there greedy counterparts, too ((?:.|s)*, (?:.| )*, (?:.| )*). (.|s)*? is probably the worst of them all.

Why?

The two alternatives, . and s may match the same text at the same location, the both match regular spaces at least. See this demo taking 3555 steps to complete and .*? demo (with s modifier) taking 1335 steps to complete.

Patterns like (?:.| )*? / (?:.| )* in Java often cause a Stack Overflow issue, and the main problem here is related to the use of alternation (that already alone causes backtracking) that matches char by char, and then the group is modified with a quantifier of unknown length. Although some regex engines can cope with this and do not throw errors, this type of pattern still causes slowdowns and is not recommended to use (only in ElasticSearch Lucene regex engine the (.| ) is the only way to match any char).

Solution

If you want to match any characters including whitespace with regex, do it with

[\s\S]*?

Or enable singleline mode with (?s) (or Pattern.DOTALL Matcher option) and just use . (e.g. (?s)start(.*?)end).

NOTE: To manipulate HTML, use a dedicated parser, like jsoup. Here is an SO post discussing Java HTML parsers.


与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
thumb_up_alt 0 like thumb_down_alt 0 dislike
Welcome to ShenZhenJia Knowledge Sharing Community for programmer and developer-Open, Learning and Share
...