Welcome to ShenZhenJia Knowledge Sharing Community for programmer and developer-Open, Learning and Share

One particular quirk of the (otherwise quite powerful) re module in Python is that re.split() will never split a string on a zero-length match. For example, if I try to split a string along word boundaries, I get:

>>> re.split(r"\s+|\b", "Split along words, preserve punctuation!")
['Split', 'along', 'words,', 'preserve', 'punctuation!']

instead of

['', 'Split', 'along', 'words', ',', 'preserve', 'punctuation', '!']

Why does it have this limitation? Is it by design? Do other regex flavors behave like this?



1 Answer

It's a deliberate design decision, and it could have gone either way. Tim Peters explained the reasoning in this post:

For example, if you split "abc" by the pattern x*, what do you expect? The pattern matches (with length 0) at 4 places, but I bet most people would be surprised to get

['', 'a', 'b', 'c', '']

back instead of (as they do get)

['abc']
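The four zero-length matches in that example can be checked directly. The following is a minimal sketch, not part of the original post: `split_on_all_matches` is a hypothetical helper built on `re.finditer()`, so it does not rely on how `re.split()` treats empty matches in the installed Python version (`re.split()` itself began splitting on empty matches in Python 3.7).

```python
import re

def split_on_all_matches(pattern, text):
    """Hypothetical helper: split text at every match, including zero-length ones."""
    pieces = []
    last_end = 0
    for match in re.finditer(pattern, text):
        # Keep whatever lies between the previous match and this one.
        pieces.append(text[last_end:match.start()])
        last_end = match.end()
    # Keep the tail after the final match.
    pieces.append(text[last_end:])
    return pieces

# x* matches the empty string at each of the 4 positions in "abc".
positions = [m.start() for m in re.finditer(r"x*", "abc")]
print(positions)                            # [0, 1, 2, 3]
print(split_on_all_matches(r"x*", "abc"))   # ['', 'a', 'b', 'c', '']
```

This reproduces the list that Tim Peters expected most people to find surprising.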

Others disagree with him, though. Guido van Rossum doesn't want the behavior changed because of backwards-compatibility concerns. He did say:

I'm okay with adding a flag to enable this behavior though.

Edit:

There is a workaround posted by Jan Burgy:

>>> s = "Split along words, preserve punctuation!"
>>> re.sub(r"\s+|\b", '\f', s).split('\f')
['', 'Split', 'along', 'words', ',', 'preserve', 'punctuation', '!']

Here '\f' can be replaced by any character that does not occur in the string.
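A different technique, not from the original answer, is to match the tokens themselves instead of splitting between them. This is only a sketch under the assumption that "words" means runs of `\w` characters and that every other non-whitespace character counts as its own punctuation token. It may be worth considering because Python 3.7 and later also act on empty matches in `re.sub()` and `re.split()`, so the output of the workaround above may differ on newer versions.

```python
import re

s = "Split along words, preserve punctuation!"

# Match either a run of word characters or a single character that is
# neither a word character nor whitespace (i.e. punctuation).
tokens = re.findall(r"\w+|[^\w\s]", s)
print(tokens)
# ['Split', 'along', 'words', ',', 'preserve', 'punctuation', '!']
```

Unlike the split-based result, this list has no leading empty string, since nothing is produced for the boundary at the start of the text.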

