There are many questions and answers here on Stack Overflow that assume a "letter" can be matched in a regexp by [a-zA-Z]. However, with Unicode there are many more characters that most people would regard as letters (all the Greek letters, Cyrillic, and many more). Unicode defines many blocks, each of which may contain "letters".
The Java Pattern documentation defines POSIX character classes for things like alphabetic characters, but these are specified to work only with US-ASCII. The predefined character classes define word characters as [a-zA-Z_0-9], which also excludes many letters.
So how do you properly match against Unicode strings? Is there some other library that gets this right?
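To make the problem concrete, here is a minimal sketch (not a definitive answer) that runs the patterns mentioned above against ASCII, Greek, and Cyrillic words. For comparison it also tries two candidates that java.util.regex provides: the Unicode general category \p{L} ("Letter") and the Pattern.UNICODE_CHARACTER_CLASS flag (Java 7 and later). The class name UnicodeLetterDemo, the helper matchesAll, and the sample words are illustrative assumptions, not part of any library.

```java
import java.util.regex.Pattern;

public class UnicodeLetterDemo {
    // The patterns discussed in the question: all three are ASCII-only by default.
    static final Pattern ASCII_RANGE = Pattern.compile("[a-zA-Z]+");
    static final Pattern POSIX_ALPHA = Pattern.compile("\\p{Alpha}+");
    static final Pattern WORD = Pattern.compile("\\w+");

    // Candidates for comparison: the Unicode "Letter" general category, and
    // \w with the UNICODE_CHARACTER_CLASS flag (which also admits digits,
    // marks, and connector punctuation such as '_').
    static final Pattern UNICODE_LETTER = Pattern.compile("\\p{L}+");
    static final Pattern UNICODE_WORD =
            Pattern.compile("\\w+", Pattern.UNICODE_CHARACTER_CLASS);

    // Hypothetical helper: true if the whole input matches the pattern.
    static boolean matchesAll(Pattern p, String input) {
        return p.matcher(input).matches();
    }

    public static void main(String[] args) {
        // Unicode escapes keep the source file independent of its encoding.
        String greek = "\u03bb\u03cc\u03b3\u03bf\u03c2";          // Greek word "logos"
        String cyrillic = "\u041c\u043e\u0441\u043a\u0432\u0430"; // Russian word "Moskva"
        String[] samples = {"hello", greek, cyrillic};

        for (String s : samples) {
            System.out.println(s
                    + " [a-zA-Z]+=" + matchesAll(ASCII_RANGE, s)
                    + " \\p{Alpha}+=" + matchesAll(POSIX_ALPHA, s)
                    + " \\w+=" + matchesAll(WORD, s)
                    + " \\p{L}+=" + matchesAll(UNICODE_LETTER, s)
                    + " \\w+(UNICODE)=" + matchesAll(UNICODE_WORD, s));
        }
    }
}
```

In this sketch the first three patterns accept only "hello", while the last two accept all three samples. Whether \p{L} is the right notion of "letter" depends on the use case: it excludes combining marks, for example, so decomposed accented text may need \p{M} as well.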