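# Return the length of a string, counted in characters
use strict;
use warnings;
use String::Multibyte;

# The literal below is only GBK if this source file itself is saved as GBK.
my $gbk_str = "上大";
my $gbk     = String::Multibyte->new('GBK');
my $gbk_len = $gbk->length($gbk_str);   # 2 when the file is GBK-encoded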
Constructor

$mbcs = String::Multibyte->new(CHARSET)
$mbcs = String::Multibyte->new(CHARSET, VERBOSE)

CHARSET is the charset name; strictly speaking, it is the file name of the definition file (without the .pm suffix). The constructor returns an instance that tells the methods in which charset the given strings should be handled.

CHARSET may also be a hashref; this is how to define a charset without a .pm file.

    # see perlfaq6 :-)
    my $martian = String::Multibyte->new({
        charset => "martian",
        regexp  => '[A-Z][A-Z]|[^A-Z]',
    });

If a true value is specified as VERBOSE, the called method (except islegal) checks its arguments and carps if any of them is not legally encoded. Otherwise no such check is carried out; this saves a bit of time but is unsafe, though you can call the islegal method yourself if necessary.
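As a rough illustration of the hashref form, the sketch below reuses the "martian" definition and asks the object to count, reverse and validate a string. It assumes String::Multibyte is installed from CPAN; the sample strings are invented for the example.

```perl
use strict;
use warnings;
use String::Multibyte;

# Hashref charset: a character is either two capitals or one non-capital byte.
my $martian = String::Multibyte->new({
    charset => "martian",
    regexp  => '[A-Z][A-Z]|[^A-Z]',
});

my $len   = $martian->length("ABxCD");   # characters: "AB", "x", "CD"
my $rev   = $martian->strrev("ABxCD");   # reversed character by character
my $legal = $martian->islegal("ABC");    # the lone "C" fits neither alternative

print "length:   $len\n";                           # 3
print "reversed: $rev\n";                           # CDxAB
print "legal:    ", ($legal ? "yes" : "no"), "\n";  # no
```

Passing a true second argument to new() would make length and strrev carp on a string such as "ABC" instead of processing it silently.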
Check Whether the String is Legal

Checks whether a string is legally encoded (for example, whether it is valid GBK).

$mbcs->islegal(LIST)

Returns a boolean indicating whether all the strings in the arguments are legally encoded in the charset concerned. Returns false if even one element of LIST is illegal.
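A small sketch of islegal on GBK data, assuming String::Multibyte is installed. The strings are written as byte escapes so that the encoding of the source file does not matter; "\xC9\xCF\xB4\xF3" is the GBK encoding of the two characters used in the opening snippet.

```perl
use strict;
use warnings;
use String::Multibyte;

my $gbk = String::Multibyte->new('GBK');

my $good = "\xC9\xCF\xB4\xF3";   # two complete double-byte characters
my $bad  = "\xC9\xCF\xB4";       # ends with a lead byte that has no trail byte

my $ok_good = $gbk->islegal($good);
my $ok_bad  = $gbk->islegal($bad);
my $ok_list = $gbk->islegal($good, "plain ASCII", $bad);   # one bad element is enough

printf "good: %s, bad: %s, list: %s\n",
    map { $_ ? "legal" : "illegal" } $ok_good, $ok_bad, $ok_list;
```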
Length

$mbcs->length(STRING)

Returns the length in characters of the specified string.
Reverse

Reverses a string.

$mbcs->strrev(STRING)

Returns the string reversed character by character.
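To make the difference from the built-in byte-oriented functions concrete, here is a minimal sketch comparing length and strrev with CORE::length and reverse on the same GBK bytes. It assumes String::Multibyte is installed; the hex dumps are only there to make the byte order visible.

```perl
use strict;
use warnings;
use String::Multibyte;

my $gbk = String::Multibyte->new('GBK');
my $str = "\xC9\xCF\xB4\xF3";             # two GBK characters, four bytes

my $chars  = $gbk->length($str);          # counted in characters
my $bytes  = CORE::length($str);          # counted in bytes
my $rev    = $gbk->strrev($str);          # characters swapped, each one intact
my $broken = scalar reverse $str;         # bytes swapped, characters destroyed

print "characters: $chars, bytes: $bytes\n";            # characters: 2, bytes: 4
print "strrev:  ", unpack("H*", $rev), "\n";            # b4f3c9cf
print "reverse: ", unpack("H*", $broken), "\n";         # f3b4cfc9
```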
Search

$mbcs->index(STRING, SUBSTR)
$mbcs->index(STRING, SUBSTR, POSITION)

Returns the position of the first occurrence of SUBSTR in STRING at or after POSITION. If POSITION is omitted, the search starts from the beginning of the string.

If the substring is not found, returns -1.

Reverse search

$mbcs->rindex(STRING, SUBSTR)
$mbcs->rindex(STRING, SUBSTR, POSITION)

Returns the position of the last occurrence of SUBSTR in STRING. If POSITION is specified, returns the last occurrence at or before that position.

If the substring is not found, returns -1.

$mbcs->strspn(STRING, SEARCHLIST)

Returns the position of the first occurrence of any character not contained in the search list.

    $mbcs->strspn("+0.12345*12", "+-.0123456789");
    # returns 8.

If the specified string does not contain any character in the search list, returns 0. If the string consists only of characters in the search list, the returned value equals the length of the string.

SEARCHLIST can be an ARRAYREF. E.g. if a charset treats CRLF as a single character, "\r\n" is a one-element list of only "\r\n". A two-element list of "\r" and "\n" can be given as ["\r", "\n"] (of course "\n\r" is also ok since the character order of SEARCHLIST doesn't matter in strspn).

$mbcs->strcspn(STRING, SEARCHLIST)

Returns the position of the first occurrence of any character contained in the search list.

If the specified string does not contain any character in the search list, the returned value equals the length of the string.

SEARCHLIST can be an ARRAYREF. E.g. if a charset treats CRLF as a single character, "\r\n" is a one-element list of only "\r\n". A two-element list of "\r" and "\n" can be given as ["\r", "\n"] (of course "\n\r" is also ok since the character order of SEARCHLIST doesn't matter in strcspn).
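The following sketch shows why character-aware searching matters: a byte-wise CORE::index can report a match that straddles two GBK characters, while index only matches on character boundaries. It assumes String::Multibyte is installed; the byte strings and the "*/" search list are chosen for the example, and the 'Bytes' charset is used for the strspn and strcspn calls.

```perl
use strict;
use warnings;
use String::Multibyte;

my $gbk = String::Multibyte->new('GBK');
my $str = "\xC9\xCF\xB4\xF3";      # two GBK characters: C9CF and B4F3
my $sub = "\xCF\xB4";              # trail byte of the first + lead byte of the second

my $byte_pos = CORE::index($str, $sub);            # false match inside the byte stream
my $char_pos = $gbk->index($str, $sub);            # no character-aligned match
my $second   = $gbk->index($str, "\xB4\xF3");      # position counted in characters
my $last     = $gbk->rindex("\xB4\xF3\xC9\xCF\xB4\xF3", "\xB4\xF3");

print "CORE::index: $byte_pos, index: $char_pos\n";   # CORE::index: 1, index: -1
print "index: $second, rindex: $last\n";              # index: 1, rindex: 2

my $bytes = String::Multibyte->new('Bytes');
my $span  = $bytes->strspn ("+0.12345*12", "+-.0123456789");   # first char NOT in the list
my $cspan = $bytes->strcspn("+0.12345*12", "*/");              # first char in the list
my $none  = $bytes->strcspn("abc", "xyz");                     # nothing found: the length

print "strspn: $span, strcspn: $cspan, no hit: $none\n";       # strspn: 8, strcspn: 8, no hit: 3
```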
Substring

$mbcs->substr(STRING or SCALAR REF, OFFSET)
$mbcs->substr(STRING or SCALAR REF, OFFSET, LENGTH)
$mbcs->substr(SCALAR, OFFSET, LENGTH, REPLACEMENT)

It works like CORE::substr, but uses the character semantics of the multibyte charset encoding.

If REPLACEMENT is specified as the fourth argument, replaces parts of the SCALAR and returns what was there before.

You can utilize the lvalue reference that is returned if a reference to a scalar variable is used as the first argument.

    ${ $mbcs->substr(\$str,$off,$len) } = $replace;

works like

    CORE::substr($str,$off,$len) = $replace;

The returned lvalue is not multibyte character-oriented but byte-oriented, so successive assignments may lead to odd results.
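A hedged sketch of the two- and three-argument forms and of the lvalue reference, assuming String::Multibyte is installed. Offsets and lengths are counted in GBK characters, not bytes; the sample string is invented for the example.

```perl
use strict;
use warnings;
use String::Multibyte;

my $gbk = String::Multibyte->new('GBK');
my $str = "\xC9\xCF\xB4\xF3" . "abc";     # 5 characters, 7 bytes

my $tail = $gbk->substr($str, 1);         # from the second character to the end
my $mid  = $gbk->substr($str, 1, 2);      # second character plus "a"

# Assign through the returned reference, as in the ${ ... } = $replace idiom.
my $copy = $str;
${ $gbk->substr(\$copy, 1, 1) } = "X";    # replace the second character only

print unpack("H*", $tail), "\n";          # b4f3616263
print unpack("H*", $mid),  "\n";          # b4f361
print unpack("H*", $copy), "\n";          # c9cf58616263
```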
Split

$mbcs->strsplit(SEPARATOR, STRING)
$mbcs->strsplit(SEPARATOR, STRING, LIMIT)

This function emulates CORE::split, but splits on the SEPARATOR string, not by a pattern.

If not in list context, it only returns the number of fields found and does not split into the @_ array.

If an empty string is specified as SEPARATOR, splits the specified string into characters.

    $bytes->strsplit('', 'This is perl.', 7);
    # ('T', 'h', 'i', 's', ' ', 'i', 's perl.')
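The sketch below contrasts a literal separator with CORE::split, which would read the same separator as a pattern, and also splits a GBK string into characters. It assumes String::Multibyte is installed; the sample strings are made up.

```perl
use strict;
use warnings;
use String::Multibyte;

my $bytes = String::Multibyte->new('Bytes');

my @fields = $bytes->strsplit('.', 'a.b.c');      # '.' is a literal separator here
my @two    = $bytes->strsplit('.', 'a.b.c', 2);   # at most two fields
my @core   = split '.', 'a.b.c';                  # '.' is a pattern: every field is empty

print join('|', @fields), "\n";                   # a|b|c
print join('|', @two), "\n";                      # a|b.c
print scalar(@core), " fields from CORE::split\n";   # 0 fields from CORE::split

my $gbk   = String::Multibyte->new('GBK');
my @chars = $gbk->strsplit('', "\xC9\xCF\xB4\xF3");  # one element per GBK character
print scalar(@chars), " characters\n";               # 2 characters
```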
Character Range

Returns the list of all characters within a given range of code values.

$mbcs->mkrange(CHARLIST, ALLOW_REVERSE)

Returns the character list (not in list context, as a concatenated string) gained by parsing the specified character range.

The result depends on the character order for the charset concerned. About the character order for each charset, see its definition file.

If the character order is undefined in the definition file, returns a string identical to the specified string.

A character range is specified with a hyphen ('-', but strictly speaking, $obj->{hyphen}).

The backslashed combinations '\-' and '\\' (strictly speaking, "$obj->{escape}$obj->{hyphen}" and "$obj->{escape}$obj->{escape}") are used instead of the characters '-' and '\', respectively. A hyphen at the beginning or the end of the range is also evaluated as the hyphen itself.

For example, $mbcs->mkrange('+\-0-9A-F') returns ('+', '-', '0', '1', '2', '3', '4', '5', '6', '7', '8', '9', 'A', 'B', 'C', 'D', 'E', 'F') and scalar $mbcs->mkrange('A-P') returns 'ABCDEFGHIJKLMNOP'.

If a true value is specified as the second argument, reverse character ranges such as '9-0' and 'Z-A' are allowed.

    $bytes = String::Multibyte->new('Bytes');
    $bytes->mkrange('p-e-r-l', 1); # ponmlkjihgfefghijklmnopqrqponml
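A short sketch of the context sensitivity and the escaped hyphen, assuming String::Multibyte is installed and using the 'Bytes' charset. The ranges are chosen for the example.

```perl
use strict;
use warnings;
use String::Multibyte;

my $bytes = String::Multibyte->new('Bytes');

my @list   = $bytes->mkrange('a-e');        # list context: one element per character
my $joined = $bytes->mkrange('a-e');        # scalar context: concatenated string
my @signed = $bytes->mkrange('+\-0-3');     # '\-' stands for a literal hyphen
my $down   = $bytes->mkrange('5-1', 1);     # reverse range, allowed by the true flag

print scalar(@list), " elements: @list\n";  # 5 elements: a b c d e
print "$joined\n";                          # abcde
print join('', @signed), "\n";              # +-0123
print "$down\n";                            # 54321
```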
Transliteration

Search and replace.

$mbcs->strtr(STRING or SCALAR REF, SEARCHLIST, REPLACEMENTLIST)
$mbcs->strtr(STRING or SCALAR REF, SEARCHLIST, REPLACEMENTLIST, MODIFIER)

Transliterates all occurrences of the characters found in the search list with the corresponding character in the replacement list.

If a reference to a scalar variable is specified as the first argument, returns the number of characters replaced or deleted; otherwise, returns the transliterated string and the specified string is unaffected.

If the 'h' modifier is specified, returns a hash of histogram in list context, or a reference to a hash of histogram in scalar context.

SEARCHLIST and REPLACEMENTLIST

Character ranges (internally utilizing mkrange()) are supported.

If the REPLACEMENTLIST is empty (specified as '', not undef, because the use of an uninitialized value causes a warning under the -w option), the SEARCHLIST is replicated.

If the replacement list is shorter than the search list, the final character in the replacement list is replicated until it is long enough (but this works differently when the 'd' modifier is used).

SEARCHLIST and REPLACEMENTLIST can be an ARRAYREF. E.g. if a charset treats "\r\n" (CRLF) as a single character, "\r\n" is a one-element list of only "\r\n". A two-element list of "\r" and "\n" should be given as ["\r", "\n"]. Of course "\n\r" is also ok but the character order is different; cf. strtr($str, ["\r", "\n"], ["\n", "\r"]), which swaps "\n" and "\r".

Each element of the ARRAYREF can include character ranges (the modifiers R and r affect their evaluation as usual). ["A-C", "h-z"] is evaluated like "A-Ch-z" if the charset does not include the grapheme "Ch". The former prevents "C" and "h" from being evaluated as "Ch" even if the charset includes the grapheme "Ch".

MODIFIER

    c   Complement the SEARCHLIST.
    d   Delete found but unreplaced characters.
    s   Squash duplicate replaced characters.
    h   Return a hash (or a hashref) of histogram.
    R   No use of character ranges.
    r   Allows the use of reverse character ranges.
    o   Caches the conversion table internally.

If the 'R' modifier is specified, '-' is not evaluated as a meta character but as the hyphen itself, like in tr'''. Compare:

    $mbcs->strtr("90 - 32 = 58", "0-9", "A-J");
    # output: "JA - DC = FI"

    $mbcs->strtr("90 - 32 = 58", "0-9", "A-J", "R");
    # output: "JA - 32 = 58"
    # cf. ($str = "90 - 32 = 58") =~ tr'0-9'A-J';
    # '0' to 'A', '-' to '-', and '9' to 'J'.

If the 'r' modifier is specified, reverse character ranges are allowed. E.g.

    $mbcs->strtr($str, "0-9", "9-0", "r")

is equivalent to

    $mbcs->strtr($str, "0123456789", "9876543210")

Caching the conversion table

If the 'o' modifier is specified, the conversion table is cached internally. E.g.

    foreach (@source_strings) {
        print $mbcs->strtr($_, $from_list, $to_list, 'o');
    }

will be almost as efficient as this:

    $trans = $mbcs->trclosure($from_list, $to_list);
    foreach (@source_strings) {
        print &$trans($_);
    }

You can use whichever you like.

Without 'o',

    foreach (@source_strings) {
        print $mbcs->strtr($_, $from_list, $to_list);
    }

will be very slow since the conversion table is rebuilt every time the function is called.
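The sketch below exercises the two calling styles (plain string versus scalar reference) and the rule for a short replacement list. It assumes String::Multibyte is installed and uses the 'Bytes' charset; the sample strings are invented for the example.

```perl
use strict;
use warnings;
use String::Multibyte;

my $bytes = String::Multibyte->new('Bytes');

# Plain string: the transliterated copy is returned, the argument is untouched.
my $upper = $bytes->strtr("hello, world", "a-z", "A-Z");

# Scalar reference: the variable is changed in place and a count is returned.
my $text  = "hello, world";
my $count = $bytes->strtr(\$text, "a-z", "A-Z");

# Replacement list shorter than the search list: its last character is repeated.
my $masked = $bytes->strtr("2024-06-30", "0-9", "#");

print "$upper\n";                 # HELLO, WORLD
print "$count changed: $text\n";  # 10 changed: HELLO, WORLD
print "$masked\n";                # ####-##-##
```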
Generation of the Closure to Transliterate

Returns a reference to a function that applies a given transliteration rule.

$mbcs->trclosure(SEARCHLIST, REPLACEMENTLIST)
$mbcs->trclosure(SEARCHLIST, REPLACEMENTLIST, MODIFIER)

Returns a closure to transliterate the specified string. The return value is just a code reference, not a blessed object. By using this code ref, you can save time because you need not specify the arguments every time.

    my $trans = $mbcs->trclosure($from_list, $to_list);
    print &$trans ($string); # ok to perl 5.003
    print $trans->($string); # perl 5.004 or better

The functionality of the closure made by trclosure() is equivalent to that of strtr(). Frankly speaking, strtr() calls trclosure() internally and uses the returned closure.

SEARCHLIST and REPLACEMENTLIST can be an ARRAYREF, the same as in strtr().
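As a concrete case, a ROT13 closure can be built once and then reused. This is a minimal sketch assuming String::Multibyte is installed; the name $rot13 and the sample text are made up for the example.

```perl
use strict;
use warnings;
use String::Multibyte;

my $bytes = String::Multibyte->new('Bytes');

# Build the conversion once; each call afterwards only needs the string.
my $rot13 = $bytes->trclosure("A-Za-z", "N-ZA-Mn-za-m");

my $encoded = $rot13->("Hello, World");
my $decoded = $rot13->($encoded);        # ROT13 applied twice restores the text

print "$encoded\n";   # Uryyb, Jbeyq
print "$decoded\n";   # Hello, World
```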
CAVEAT

This module supposes $[ is always equal to 0, never 1.
Grapheme manipulation

Since v. 1.01, manipulation of sequences of graphemes is to be supported.

In a grapheme-oriented manipulation, notice that the beginning and the end of a string are always on a grapheme boundary.

E.g. imagine a grapheme set where a grapheme comprises either a leading Latin capital letter followed by one or more Latin small letters, or a single byte. Such a set can be defined as below.

    $gra = String::Multibyte->new({
        regexp => '[A-Z][a-z]*|[\x00-\xFF]',
    });

Think about $gra->index("Perl", "Pe"). As both "Perl" and "Pe" are a single grapheme, they are not equal to each other. So the result of this must be -1 (meaning no match).
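To see how this grapheme set carves up a string, the sketch below counts graphemes with length. It assumes String::Multibyte is installed; the charset name 'capword' is a hypothetical label added for the example, and the test strings are invented.

```perl
use strict;
use warnings;
use String::Multibyte;

my $gra = String::Multibyte->new({
    charset => 'capword',                      # hypothetical name, for illustration only
    regexp  => '[A-Z][a-z]*|[\x00-\xFF]',
});

my $n_perl  = $gra->length("Perl");        # capital + small letters: one grapheme
my $n_pe    = $gra->length("Pe");          # also one grapheme, but a different one
my $n_lower = $gra->length("perl");        # no leading capital: four single bytes
my $n_two   = $gra->length("PerlScript");  # "Perl" and "Script"

print "$n_perl $n_pe $n_lower $n_two\n";   # 1 1 4 2
```

Because "Perl" and "Pe" are each one whole grapheme, neither can occur inside the other.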