Welcome to ShenZhenJia Knowledge Sharing Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others


For example, how can I match "Nation" in "???ér???????????????" without extra modules? Is this possible in newer Perl versions (5.14, 5.15, etc.)?

I found an answer! Thanks to tchrist

The right solution, using a UCA match (thanks to https://stackoverflow.com/users/471272/tchrist).

# find start/end offsets of each non-overlapping occurrence of the matched UTF-8 substring
use 5.014;
use strict; 
use warnings;
use utf8;
use Unicode::Collate;
binmode STDOUT, ':encoding(UTF-8)';
my $str  = "???ér???????????????" x 2;
my $look = "Nation";
my $Collator = Unicode::Collate->new(
    normalization => undef, level => 1
   );

my @match = $Collator->match($str, $look);
if (@match) {
    my $found = $match[0];
    my $f_len  = length($found);
    say "match result: $found (length is $f_len)"; 
    my $offset = 0;
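    # note: index() only finds exact copies of the first matched substring
    while ((my $start = index($str, $found, $offset)) != -1) {
        my $end   = $start + $f_len;   # exclusive end offset
        say sprintf("found at: %s,%s", $start, $end);
        $offset = $end;                # resume right after this match
    }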
}
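
The loop above re-finds only exact copies of the first matched substring. As a rough sketch of an alternative, the `index` method of Unicode::Collate can be called in list context to get a position and a length for each match, so differently accented variants are also located. The helper name `find_offsets` and the sample string are assumptions made for illustration and are not part of the original post.

```perl
use strict;
use warnings;
use utf8;
use Unicode::Collate;

# Hypothetical helper (not from the original post): returns a list of
# [start, length] pairs for every non-overlapping match of $look in $str.
sub find_offsets {
    my ($collator, $str, $look) = @_;
    my @hits;
    my $from = 0;
    # In list context, index() returns (position, length), or an empty list
    # when nothing matches at or after $from.
    while (my ($pos, $len) = $collator->index($str, $look, $from)) {
        last unless $len;            # guard against zero-length matches
        push @hits, [$pos, $len];
        $from = $pos + $len;         # continue right after this match
    }
    return @hits;
}

# index() and match() refuse to run unless normalization is switched off.
my $collator = Unicode::Collate->new(normalization => undef, level => 1);

binmode STDOUT, ':encoding(UTF-8)';
my $sample = "Iñtërnâtiônàlizætiøn" x 2;   # illustrative string, an assumption
for my $hit (find_offsets($collator, $sample, "Nation")) {
    my ($pos, $len) = @$hit;
    printf "found at: %d,%d (%s)\n", $pos, $pos + $len, substr($sample, $pos, $len);
}
```

The length is taken from each match rather than from the first one because a match may cover a different number of characters, for example when the text contains combining marks.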

Wrong (but working) solution from http://www.perlmonks.org/?node_id=485681

Magic piece of code is:

    $str = Unicode::Normalize::NFD($str); $str =~ s/\pM//g;

code example:

    use 5.014;
    use utf8;
    use Unicode::Normalize;

    binmode STDOUT, ':encoding(UTF-8)';
    my $str  = "???ér???????????????";
    my $look = "Nation";
    say "before: $str\n";
    $str = NFD($str);
    # \pM is a short alias for \p{Mark} (http://perldoc.perl.org/perluniprops.html)
    $str =~ s/\pM//g; # remove "marks"
    say "after: $str";
    say "is_match: ", $str =~ /$look/i || 0;
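
One reason this approach is only "working" for some inputs: NFD can split off a mark only when a character has a canonical decomposition. Letters such as "ø" or "Ł" have none, so they survive the stripping unchanged. The sketch below illustrates that; the helper name `strip_marks` and the sample words are assumptions chosen for the demonstration.

```perl
use strict;
use warnings;
use utf8;
use Unicode::Normalize qw(NFD);

# Hypothetical helper: decompose the string, then delete every combining mark.
sub strip_marks {
    my ($s) = @_;
    $s = NFD($s);
    $s =~ s/\pM//g;
    return $s;
}

binmode STDOUT, ':encoding(UTF-8)';
# "ç", "ó" and "ź" decompose into base letter + mark; "ø" and "Ł" do not.
for my $word ("Besançon", "Søren", "Łódź") {
    printf "%s -> %s\n", $word, strip_marks($word);
}
```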

1 Answer

The right solution, using the UCA (thanks to tchrist):

# check whether $look occurs in $str, comparing base letters only (UCA level 1)
use 5.014;
use utf8;
use Unicode::Collate;
binmode STDOUT, ':encoding(UTF-8)';
my $str  = "???ér???????????????" x 2;
my $look = "Nation";
my $Collator = Unicode::Collate->new(
    normalization => undef, level => 1
   );

my @match = $Collator->match($str, $look);
say "match ok!" if @match;
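
To make the role of `level => 1` more concrete, here is a small sketch comparing the same pairs of strings at different collation levels: level 1 compares base letters only, level 2 also considers accents, and level 3 also considers case. The helper name `same_at_level` and the sample words are assumptions. Unlike `match`, the `eq` method works with the default normalization, so it is left switched on here.

```perl
use strict;
use warnings;
use utf8;
use Unicode::Collate;

# Hypothetical helper: are $x and $y equal at the given collation level?
sub same_at_level {
    my ($level, $x, $y) = @_;
    my $collator = Unicode::Collate->new(level => $level);
    return $collator->eq($x, $y) ? 1 : 0;
}

binmode STDOUT, ':encoding(UTF-8)';
for my $level (1, 2, 3) {
    printf "level %d: résumé/RESUME=%d resume/RESUME=%d\n",
        $level,
        same_at_level($level, "résumé", "RESUME"),
        same_at_level($level, "resume", "RESUME");
}
```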

P.S. "Code that assumes you can remove diacritics to get at base ASCII letters is evil, still, broken, brain-damaged, wrong, and justification for capital punishment." (tchrist, in "Why does modern Perl avoid UTF-8 by default?")

