I have a text scattered with various strings, dates, tab characters and language codes. I want to extract the strings that follow a date + tab combination and are followed by a language code like '[en]' and a tab character, but only when that tab is not followed by the string "BAD THINGS" (e.g. "2020-01-12\tSTRING WE NEED[en]\tGOOD THINGS", as opposed to "2020-01-12\tSTRING WE DON'T NEED[en]\tBAD THINGS", where \t stands for a tab character).
Here is a short example text of what I'm working with:
2021-01-12	This string is not needed [it]	Bad things	Bad things 2021-01-12	This string is also not needed [en]	Bad things	Bad things 2021-01-11	String 1 that is needed! [it]	String 1 that is needed! is repeated here	Not interesting here 2021-01-11	String 2 that is needed [fr]	String 2 that is needed is repeated here	Unnecessary string 2021-01-11	String 3 that is needed... [ru]	String 3 that is needed... is repeated here	Another part we're not interested in
I made this regex to capture all strings between a date and a language code:
(\d{4}-\d{2}-\d{2}\t)(.*?)(\[\w{2}\]\t)
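For reference, here is a minimal sketch of how this pattern can be run. It assumes Python's `re` module (the question itself is not tied to a language) and assumes the tabs sit after each date, after each language code, and between the last two fields of every record. The names `text`, `pattern` and `captured` are only for illustration.

```python
import re

# Sample text on a single line; the tab positions are an assumption
# about the original data (after the date, after the language code,
# and between the last two fields).
text = (
    "2021-01-12\tThis string is not needed [it]\tBad things\tBad things "
    "2021-01-12\tThis string is also not needed [en]\tBad things\tBad things "
    "2021-01-11\tString 1 that is needed! [it]\tString 1 that is needed! is repeated here\tNot interesting here "
    "2021-01-11\tString 2 that is needed [fr]\tString 2 that is needed is repeated here\tUnnecessary string "
    "2021-01-11\tString 3 that is needed... [ru]\tString 3 that is needed... is repeated here\tAnother part we're not interested in"
)

# Group 1: date + tab, group 2: lazily captured string,
# group 3: two-letter language code in brackets + tab.
pattern = re.compile(r"(\d{4}-\d{2}-\d{2}\t)(.*?)(\[\w{2}\]\t)")

# Collect group 2 of every match, trimming the space before the code.
captured = [m.group(2).strip() for m in pattern.finditer(text)]
for s in captured:
    print(s)
```

Without any lookahead this captures all five strings, including the two unwanted ones.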
This works fine (see here). However, when I add a negative lookahead to exclude the strings followed by "Bad things", my whole regex goes south:
(\d{4}-\d{2}-\d{2}\t)(.*?)(\[\w{2}\]\t)(?!Bad things)
You can see the result here. I understand that my lookahead somehow makes the regex behave greedily, but I have no idea how to avoid this; adding a ? after it doesn't work. Can you help me out here?
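To make the problem reproducible, here is a minimal sketch, again assuming Python's `re` module and the same assumed tab positions in the sample text. It first shows the over-long match: when the lookahead fails, the lazy `.*?` simply keeps expanding past the first language code until it reaches one that is not followed by "Bad things". It then tries one direction that might avoid this, a "tempered" dot that is not allowed to step over a language code + tab. That second pattern is only a guess on my part, not a confirmed solution, and the names `DATE`, `LANG`, `with_lookahead` and `tempered` are made up for the example.

```python
import re

# Same sample text, with assumed tab positions, on a single line.
text = (
    "2021-01-12\tThis string is not needed [it]\tBad things\tBad things "
    "2021-01-12\tThis string is also not needed [en]\tBad things\tBad things "
    "2021-01-11\tString 1 that is needed! [it]\tString 1 that is needed! is repeated here\tNot interesting here "
    "2021-01-11\tString 2 that is needed [fr]\tString 2 that is needed is repeated here\tUnnecessary string "
    "2021-01-11\tString 3 that is needed... [ru]\tString 3 that is needed... is repeated here\tAnother part we're not interested in"
)

DATE = r"\d{4}-\d{2}-\d{2}\t"   # date followed by a tab
LANG = r"\[\w{2}\]\t"           # two-letter code in brackets + tab

# Pattern from the question: lazy dot plus negative lookahead.
with_lookahead = re.compile("(" + DATE + ")(.*?)(" + LANG + ")(?!Bad things)")

# Possible workaround (an assumption, not a confirmed fix): each character
# consumed by group 2 must not be the start of a language code + tab,
# so the capture cannot run across a rejected record into the next one.
tempered = re.compile(
    "(" + DATE + ")((?:(?!" + LANG + ").)*?)(" + LANG + ")(?!Bad things)"
)

too_long = [m.group(2) for m in with_lookahead.finditer(text)]
wanted = [m.group(2).strip() for m in tempered.finditer(text)]

# The first capture of the original pattern swallows the two rejected records.
print(repr(too_long[0]))
for s in wanted:
    print(s)
```

On this sample the tempered version yields only the three needed strings, but I have not checked how it behaves on larger or differently laid out input.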