use Encode;
use strict;
my $str = "中国";
Encode::_utf8_on($str);
print length($str) . "\n";
Encode::_utf8_off($str);
print length($str) . "\n";
Output:
2
6
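Here we use the Encode module's _utf8_on and _utf8_off functions to switch the utf8 flag of the string "中国" on and off. With the utf8 flag on, "中国" is treated as a string of UTF-8 characters, so its length is 2. With the utf8 flag off, "中国" is treated as octets (an array of bytes), so its length is 6. (My editor saves files as UTF-8; if your editor uses gb2312, the byte length would be 4.)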
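The same character/octet distinction can be observed without setting the flag by hand. The Encode documentation describes _utf8_on and _utf8_off as internal functions, and decode/encode is the usual route. The following is a minimal sketch, assuming the text is UTF-8; the octets are written as escapes so the result does not depend on how the file is saved, and the variable names are only illustrative.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# The six UTF-8 octets of "中国", written as escapes so the result
# does not depend on the encoding this file is saved in.
my $octets = "\xE4\xB8\xAD\xE5\x9B\xBD";

# decode() turns octets into Perl characters and sets the utf8 flag itself.
my $chars = decode("UTF-8", $octets);

print length($octets), "\n";                          # 6 (octets)
print length($chars),  "\n";                          # 2 (characters)
print utf8::is_utf8($octets) ? "on" : "off", "\n";    # off
print utf8::is_utf8($chars)  ? "on" : "off", "\n";    # on
print length(encode("gb2312", $chars)), "\n";         # 4 (gb2312 octets)
```

Unlike _utf8_on, decode validates its input, so malformed octets do not silently become a broken character string.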
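# This source file is saved as gb2312, so the literal below holds gb2312 bytes
use Encode;
my $bytes = "china----中国";
my $str = decode("gb2312",$bytes);   # gb2312 bytes -> Perl characters
$str =~ s/\W+//g;                    # remove non-word characters (the dashes)
print encode("gb2312",$str),"\n";    # Perl characters -> gb2312 bytes for output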
Output:
china中国
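As its name suggests, the encode function encodes a Perl string. It encodes the characters of a Perl string in the specified encoding and returns the result as a byte stream (octets), so you need it whenever text leaves Perl for the outside world. Its format is simple: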
$octets = encode(ENCODING, $string [, CHECK])
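The decode function, in turn, decodes a byte stream. It interprets the given octets according to the encoding you name and converts them into a Perl character string (Perl's internal format, which is based on UTF-8). As a rule, text data read from a terminal or a file should be converted to a Perl string with decode.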
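use Encode;
use Encode::CN; # optional: Encode loads it on demand
$dat="测试文本";               # assumes this file is saved as gb2312
$str=decode("gb2312",$dat);
@chars=split //,$str;          # one element per character, not per byte
foreach $char (@chars) {
print encode("gb2312",$char),"\n";
}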
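The optional CHECK argument in the signature above controls what happens when the input is malformed. The following is a minimal sketch, assuming UTF-8 input and using the Encode::FB_CROAK constant; the sample string and variable names are only illustrative.

```perl
use strict;
use warnings;
use Encode qw(decode);

# "\xFF" can never occur in well-formed UTF-8, so this input is invalid.
my $bad = "abc\xFF";

# Default CHECK: the malformed octet is replaced by U+FFFD.
my $lenient = decode("UTF-8", $bad);
printf "U+%04X\n", ord(substr($lenient, -1));   # U+FFFD

# CHECK = FB_CROAK: decode dies instead, and eval catches the error.
# A copy is passed because a non-zero CHECK may modify the source argument.
my $copy   = $bad;
my $strict = eval { decode("UTF-8", $copy, Encode::FB_CROAK) };
print defined $strict ? "decoded\n" : "rejected\n";   # rejected
```

Strict checking is usually the safer choice for data from files or the network, since the default quietly substitutes replacement characters.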
1. Listing available encodings
use Encode;
#Returns a list of canonical names of available encodings that have already been loaded
@list = Encode->encodings();
#get a list of all available encodings including those that have not yet been loaded
@all_encodings = Encode->encodings(":all");
#give the name of a specific module
@with_jp = Encode->encodings("Encode::JP");
@ebcdic = Encode->encodings("EBCDIC");
print "@list\n";
print "@all_encodings\n";
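One practical use of this list is to check whether an encoding is available before relying on it. This is a rough sketch; the names being tested are only examples. The list contains canonical names, so an alias such as "latin1" would not be found this way.

```perl
use strict;
use warnings;
use Encode;

# Build a lookup table of every canonical encoding name Encode knows about,
# including encodings whose modules have not been loaded yet.
my %available = map { $_ => 1 } Encode->encodings(":all");

for my $name (qw(utf8 euc-cn iso-8859-1 no-such-encoding)) {
    printf "%-18s %s\n", $name, $available{$name} ? "available" : "missing";
}
```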
2. Terminology: character, byte, octet
character: A character in the range 0 .. 2**32-1 (or more); what Perl's strings are made of.
byte: A character in the range 0..255; a special case of a Perl character.
octet: 8 bits of data, with ordinal values 0..255; the term for bytes passed to or from a non-Perl context, such as a disk file, standard I/O stream, database, command-line argument, environment variable, socket, etc.
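To make these terms concrete, the sketch below prints the ordinal values of a two-character string and of the octets produced by encoding it. The string "中国" and the choice of UTF-8 as the target encoding are only illustrative.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Two characters ("中国") given by code point; both ordinals exceed 255,
# so neither character is a "byte" in the sense defined above.
my $string   = "\x{4E2D}\x{56FD}";
my @ordinals = map { ord($_) } split //, $string;
print "characters: ", join(" ", map { sprintf("U+%04X", $_) } @ordinals), "\n";
# characters: U+4E2D U+56FD

# Encoding yields octets: every ordinal now fits in 0..255.
my $octets = encode("UTF-8", $string);
my @bytes  = map { ord($_) } split //, $octets;
print "octets: ", join(" ", map { sprintf("%02X", $_) } @bytes), "\n";
# octets: E4 B8 AD E5 9B BD
```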
3. Perl Encoding API: encode and decode
#convert a string from Perl's internal format into ISO-8859-1
$octets = encode("iso-8859-1", $string);
#convert ISO-8859-1 data into a string in Perl's internal format
$string = decode("iso-8859-1", $octets);
4. Perl Encoding API: find_encoding
Returns the encoding object corresponding to ENCODING. Returns undef if no matching ENCODING is found. The returned object is what does the actual encoding or decoding.
my $enc = find_encoding("iso-8859-1");
while(<>) {
my $utf8 = $enc->decode($_);
... # now do something with $utf8;
}
find_encoding("latin1")->name; # iso-8859-1
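Because find_encoding returns undef for an unknown name, it is worth checking the result before calling methods on it. The sketch below wraps that check in a hypothetical helper, encoding_or_die (not part of Encode), and reuses one encoding object for several lines. The sample data is illustrative.

```perl
use strict;
use warnings;
use Encode qw(find_encoding);

# Hypothetical helper: resolve the name once and fail early if it is unknown.
sub encoding_or_die {
    my ($name) = @_;
    my $enc = find_encoding($name);
    die "Unknown encoding: $name\n" unless defined $enc;
    return $enc;
}

my $latin1 = encoding_or_die("latin1");
print $latin1->name, "\n";                  # iso-8859-1

# ISO-8859-1 octets standing in for lines read from a file.
my @lines = ("caf\xE9\n", "na\xEFve\n");
my @texts = map { $latin1->decode($_) } @lines;   # same object reused per line
print length($_), "\n" for @texts;          # 5, then 6

my $found = eval { encoding_or_die("no-such-encoding"); 1 };
print $found ? "found\n" : "not found\n";   # not found
```

Resolving the object once outside the loop avoids repeating the name lookup for every line.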
5. Perl Encoding API: from_to
[$length =] from_to($octets, FROM_ENC, TO_ENC [, CHECK])
from_to($octets, "iso-8859-1", "cp1250");
from_to($data, "iso-8859-1", "utf8"); # equivalent to:
$data = encode("utf8", decode("iso-8859-1", $data));
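A point that is easy to miss: from_to converts its first argument in place and returns the length of the converted octets, not the converted data. A minimal sketch with an illustrative ISO-8859-1 string:

```perl
use strict;
use warnings;
use Encode qw(from_to encode decode);

my $data = "caf\xE9";      # "café" as ISO-8859-1 octets (4 octets)
my $copy = $data;          # untouched copy for comparison

# from_to rewrites $data in place; the return value is the new octet length.
my $length = from_to($data, "iso-8859-1", "utf8");
print "$length\n";         # 5, because the accented letter now takes two octets

# The explicit decode/encode round trip gives the same octets.
my $same = encode("utf8", decode("iso-8859-1", $copy));
print $data eq $same ? "identical\n" : "different\n";   # identical
```

For the same reason, writing `$data = from_to($data, ...)` would overwrite the converted text with a number.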