Welcome to ShenZhenJia Knowledge Sharing Community for programmer and developer-Open, Learning and Share

Simple Unicode strings such as ??? or ???????? match in C# with the following regex pattern, but they do not match in Java.

Can anyone explain this? How do I change the pattern so that it works in Java?

 "\b[\w\p{M}\u200B\u200C\u00AC\u001F\u200D\u200E\u200F]+\b"

C# code (it matches the strings):

    private static readonly Regex s_regexEngine;

    private static readonly string s_wordPattern = @"[\w\p{M}\u200B\u200C\u00AC\u001F\u200D\u200E\u200F]+";

    static PersianWordTokenizer()
    {
        s_regexEngine = new Regex(s_wordPattern, RegexOptions.Multiline);
    }

    public static List<string> Tokenize(string text, bool removeSeparators, bool standardized)
    {
        List<string> tokens = new List<string>();

        int strIndex = 0;
        foreach (Match match in s_regexEngine.Matches(text))
        {
            // Execution enters this block
        }

Java code (it does not match the strings):

private static final String s_wordPattern = "\\b[\\w\\p{M}\\u200B\\u200C\\u00AC\\u001F\\u200D\\u200E\\u200F]+\\b";

static
{
    s_regexpattern = Pattern.compile(Pattern.quote(s_wordPattern));
}

public static java.util.ArrayList<String> Tokenize(String text, boolean removeSeparators, boolean standardized)
{
    java.util.ArrayList<String> tokens = new java.util.ArrayList<String>();

    int strIndex = 0;
    s_regexEngine = s_regexpattern.matcher(text);
    while (s_regexEngine.find())
    {
        // Execution never enters this block
    }
Question from: https://stackoverflow.com/questions/65917997/how-could-i-migrate-this-regex


1 Answer

Look at the "any letter" Unicode character class, \p{L}, or at the Pattern.UNICODE_CHARACTER_CLASS flag that can be passed to Java's Pattern.compile method.

The second option is Java-only, so it may not interest you, but it is worth mentioning.

import java.util.regex.Pattern;

/**
 * Compares \p{L} with \w under Pattern.UNICODE_CHARACTER_CLASS.
 *
 * @author Luc
 */
public class Test {

  public static void main(final String[] args) {
    test("Bonjour");
    test("????????");
    test("世界人权宣言 ");
  }

  private static void test(final String text) {
    // \p{L} matches any Unicode letter, with or without flags.
    showMatch(Pattern.compile("\\b\\p{L}+\\b"), text);
    // With UNICODE_CHARACTER_CLASS, \w also covers non-ASCII word characters.
    showMatch(Pattern.compile("\\b\\w+\\b", Pattern.UNICODE_CHARACTER_CLASS), text);
  }

  private static void showMatch(final Pattern pattern, final String text) {
    System.out.println("With pattern \"" + pattern + "\": " + text + " " + pattern.matcher(text).find());
  }
}

Results:

With pattern "\w+": Bonjour true
With pattern "\p{L}+": Bonjour true
With pattern "\w+": ???????? true
With pattern "\p{L}+": ???????? true
With pattern "\w+": 世界人权宣言  true
With pattern "\p{L}+": 世界人权宣言  true
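
One more thing that may be worth checking in the question's Java code: Pattern.quote turns its argument into a literal, so the compiled pattern searches for the pattern text itself instead of interpreting it as a regex. Below is a minimal sketch of the tokenizer with that call removed and UNICODE_CHARACTER_CLASS switched on. The class name WordTokenizerSketch, the tokenize method, and the sample text are made up for illustration; only the pattern string comes from the question.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical class name; a sketch, not the asker's actual tokenizer.
class WordTokenizerSketch {

    // The pattern from the question as a Java string literal (backslashes doubled).
    static final String WORD_PATTERN =
            "\\b[\\w\\p{M}\\u200B\\u200C\\u00AC\\u001F\\u200D\\u200E\\u200F]+\\b";

    // Compiled as a real regex, with Unicode-aware \w and \b.
    static final Pattern WORDS =
            Pattern.compile(WORD_PATTERN, Pattern.UNICODE_CHARACTER_CLASS);

    // Compiled as in the question: Pattern.quote makes the whole string a literal.
    static final Pattern QUOTED = Pattern.compile(Pattern.quote(WORD_PATTERN));

    // Collects every match of WORDS in the given text, in order.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        Matcher matcher = WORDS.matcher(text);
        while (matcher.find()) {
            tokens.add(matcher.group());
        }
        return tokens;
    }

    public static void main(String[] args) {
        // Assumed sample input: a Latin, an Arabic-script and a Chinese word.
        String text = "Bonjour \u0633\u0644\u0627\u0645 \u4E16\u754C";

        // Three tokens, one per word.
        System.out.println(tokenize(text));

        // false: the quoted pattern only matches the literal pattern text.
        System.out.println(QUOTED.matcher(text).find());

        // true: the literal pattern text contains itself.
        System.out.println(QUOTED.matcher(WORD_PATTERN).find());
    }
}
```

If the flag is not an option, replacing \w with \p{L} (plus \p{N} for digits, if needed) inside the character class is the portable route mentioned above, though the behaviour of \b without the flag is worth testing on the JDK version in use.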
