Perl Tip


License: CC 4.0 BY-SA

Original link: https://my.oschina.net/kuerant/blog/115177


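Perl one-liner equivalent of iconv (UTF-8 to GBK)

Either command reads UTF-8 text from standard input, or from files named on the command line, and writes GBK:

perl -mEncode -pe 'Encode::from_to($_, "utf-8", "gbk")'

perl -mEncode -pe '$_=Encode::encode("gbk", Encode::decode("utf-8", $_))'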
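The two one-liners above should produce the same bytes. The following is a minimal sketch of that equivalence as a standalone script; the variable names are illustrative, and the sample text is written with byte escapes so the result does not depend on how the script file itself is saved.

```perl
use strict;
use warnings;
use Encode qw(encode decode from_to);

# The two characters "你好" as UTF-8 bytes, spelled out with escapes.
my $utf8_bytes = "\xE4\xBD\xA0\xE5\xA5\xBD";

# Approach 1: from_to() converts the byte buffer in place.
my $in_place = $utf8_bytes;
from_to($in_place, "utf-8", "gbk");

# Approach 2: decode to a character string, then encode to the target.
my $two_step = encode("gbk", decode("utf-8", $utf8_bytes));

# Show the resulting bytes as hex and check that both approaches agree.
print join(" ", map { sprintf "%02X", ord } split //, $in_place), "\n";
print $in_place eq $two_step ? "same bytes\n" : "different bytes\n";
```

from_to() is convenient in a -p loop because it edits $_ directly; the decode/encode pair is the better fit when the text has to be processed as characters in between.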

------------------------------------------------------------------------------

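use strict;
use warnings;
use Encode;

# Assumes this script is saved as cp936 (GBK), so the literal holds GBK bytes.
my $bytes = "abc你好wert";
my $text  = decode('cp936', $bytes);    # bytes -> character string
my ($han) = $text =~ m/(\p{Han}+)/;     # first run of Han characters
print encode('cp936', $han), "\n";      # characters -> GBK bytes for output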

Match any non-Han character: \P{Han}
Match any Han character: \p{Han}
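As a small illustration of the two properties, the sketch below splits a mixed string into Han and non-Han runs. The sample string and variable names are assumptions for the example; the Han characters are written as \x{...} escapes so the script works regardless of the file's encoding.

```perl
use strict;
use warnings;

binmode STDOUT, ':encoding(UTF-8)';

# "abc你好wert" as a character string (U+4F60 and U+597D are the two Han characters).
my $text = "abc\x{4F60}\x{597D}wert";

my @han_runs     = $text =~ /(\p{Han}+)/g;   # runs of Han characters
my @non_han_runs = $text =~ /(\P{Han}+)/g;   # runs of everything else

print "Han:     @han_runs\n";
print "Non-Han: @non_han_runs\n";
```

Both properties only behave as expected on decoded character strings, which is why the earlier snippet calls decode() before matching.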

The Perl FAQ entry "How do I strip blank space from the beginning/end of a string?" states that using

s/^\s+|\s+$//g;

is slower than doing it in two steps:

s/^\s+//;
s/\s+$//;

Why is this combined statement noticeably slower than the separate ones (for any input string)?

The Perl regex runtime runs much quicker when working with 'fixed' or 'anchored' substrings rather than 'floating' substrings. A substring is fixed when you can lock it to a certain place in the source string. Both '^' and '$' provide that anchoring. However, when you use alternation '|', the compiler doesn't recognize the choices as fixed, so it falls back to less optimized code that scans the whole string. In the end, looking for fixed strings twice is much, much faster than looking for a floating string once. On a related note, reading perl's regcomp.c will make you go blind.

Update: Here are some additional details. If perl was compiled with debugging support, you can run it with the '-Dr' flag and it will dump the regex compilation data. Here's what you get:

~# debugperl -Dr -e 's/^\s+//g'
Compiling REx `^\s+'
size 4 Got 36 bytes for offset annotations.
first at 2
synthetic stclass "ANYOF[\11\12\14\15 {unicode_all}]".
   1: BOL(2)
   2: PLUS(4)
   3:   SPACE(0)
   4: END(0)
stclass "ANYOF[\11\12\14\15 {unicode_all}]" anchored(BOL) minlen 1

~# debugperl -Dr -e 's/^\s+|\s+$//g'
Compiling REx `^\s+|\s+$'
size 9 Got 76 bytes for offset annotations.

   1: BRANCH(5)
   2:   BOL(3)
   3:   PLUS(9)
   4:     SPACE(0)
   5: BRANCH(9)
   6:   PLUS(8)
   7:     SPACE(0)
   8:   EOL(9)
   9: END(0)
minlen 1

Note the word 'anchored' in the first dump.
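The speed difference discussed above can be measured rather than taken on faith. The following is a rough sketch using the core Benchmark module; the input string and the sub names trim_combined and trim_two_step are made up for the example, and the actual ratio will vary with the perl version and the input.

```perl
use strict;
use warnings;
use Benchmark qw(cmpthese);

# Hypothetical test string: padded with whitespace on both ends.
my $input = "   " . ("some text " x 50) . "   \n";

# Single substitution with alternation.
sub trim_combined { my $s = shift; $s =~ s/^\s+|\s+$//g; return $s }

# Two anchored substitutions.
sub trim_two_step { my $s = shift; $s =~ s/^\s+//; $s =~ s/\s+$//; return $s }

# Run each variant for about one CPU second and print a comparison table.
cmpthese(-1, {
    combined => sub { trim_combined($input) },
    two_step => sub { trim_two_step($input) },
});
```

Both subs return the same trimmed string, so the table compares only the cost of the two ways of writing the substitution.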

How do I strip blank space from the beginning/end of a string?

(contributed by brian d foy)

A substitution can do this for you. For a single line, you want to replace all the leading or trailing whitespace with nothing. You can do that with a pair of substitutions:

 s/^\s+//;
 s/\s+$//;

You can also write that as a single substitution, although it turns out the combined statement is slower than the separate ones. That might not matter to you, though:

 s/^\s+|\s+$//g;

In this regular expression, the alternation matches either at the beginning or the end of the string, since the alternation has a lower precedence than the anchors: the pattern groups as (^\s+) or (\s+$). With the /g flag, the substitution makes all possible matches, so it gets both. Remember, the trailing newline matches the \s+, and the $ anchor can match at the absolute end of the string, so the newline disappears too. Just add the newline back to the output, which has the added benefit of preserving "blank" lines (lines consisting entirely of whitespace), which the ^\s+ would otherwise remove all by itself:

 while( <> ) {
     s/^\s+|\s+$//g;
     print "$_\n";
 }
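To see the effect on individual lines without reading from a file, here is a small sketch that applies the same substitution to a few hard-coded lines. The sample lines and the @trimmed array are assumptions made for the illustration.

```perl
use strict;
use warnings;

# Hypothetical input lines, each still carrying its trailing newline.
my @lines = ("  hello world  \n", "   \n", "\tindented\n");

my @trimmed;
for my $line (@lines) {
    (my $copy = $line) =~ s/^\s+|\s+$//g;   # strips both ends, newline included
    push @trimmed, $copy;
}

# Add the newline back on output; the whitespace-only line survives as an empty line.
print "[$_]\n" for @trimmed;
```

The whitespace-only line becomes an empty string rather than vanishing, which is why printing "$_\n" keeps blank lines in the output.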

For a multi-line string, you can apply the regular expression to each logical line in the string by adding the /m flag (for "multi-line"). With the /m flag, the $ matches before an embedded newline, so it doesn't remove it. This pattern still removes the newline at the end of the string:

 $string =~ s/^\s+|\s+$//gm;

Remember that lines consisting entirely of whitespace will disappear, since the first part of the alternation can match the entire string and replace it with nothing. If you need to keep embedded blank lines, you have to do a little more work. Instead of matching any whitespace (since that includes a newline), just match the other whitespace:

 $string =~ s/^[\t\f ]+|[\t\f ]+$//mg;
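The difference between the two multi-line patterns is easiest to see side by side. The sketch below runs both on the same string; the sample string and the variable names $all_ws and $horizontal are assumptions for the example.

```perl
use strict;
use warnings;

# Hypothetical multi-line string with a whitespace-only line in the middle.
my $string = "  first  \n\t\n  second  \n";

# \s also matches newlines, so the blank line and the final newline are lost.
(my $all_ws = $string) =~ s/^\s+|\s+$//gm;

# Matching only tab, form feed and space leaves every newline in place.
(my $horizontal = $string) =~ s/^[\t\f ]+|[\t\f ]+$//mg;

# Print with newlines made visible as \n.
for my $result ($all_ws, $horizontal) {
    (my $shown = $result) =~ s/\n/\\n/g;
    print "$shown\n";
}
```

With the \s version the middle line is swallowed because \s+ can run across the newline; the character-class version trims each line independently.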
