I just submitted a question on StackOverflow.
I wanted to extract content from a web page that contains many Unicode characters represented in the form "%xx". Since I was using the Perl module LWP to fetch the page, I naturally tried to handle these characters with a Perl regex, as below.
my $html = "%20%26%40 ";
$html =~ s#%([0-9a-f]+)#\x{\1}#ig;
print "$html\n";
But the above code doesn't work; it outputs nothing but "00". I'm stuck now... Any hint would be appreciated.
Some people replied very quickly. Below are their answers.
Perl already has functions for this in the URI::Escape module. You don't need to mess with regular expressions.
use URI::Escape;
my $encode = uri_unescape($string);
See this page for more
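For completeness, here is a minimal self-contained sketch of that suggestion. It assumes the URI distribution (which provides URI::Escape) is installed from CPAN; the variable name `$decoded` is mine, not from the answer.

```perl
use strict;
use warnings;
use URI::Escape qw(uri_unescape);

# Same sample string as in the question.
my $html = "%20%26%40 ";

# uri_unescape() replaces each %XX escape with the byte it encodes.
my $decoded = uri_unescape($html);

print "[$decoded]\n";    # prints "[ &@ ]"
```

If the escapes encode UTF-8 text, the result is a byte string and would still need decoding into characters, for example with Encode::decode('UTF-8', $decoded).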
Funny and ugly code:
my $html = "%20%26%40 ";
$html =~ s#%([0-9a-f]{2})#"chr(0x$1)"#igee;
print "$html\n";
Edit: (I'm obliged to say) this code may be cute, but do not use it in production! (There are many cases where it doesn't work.)
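A less fragile variant of the same idea, sketched here as my own assumption rather than something taken from the thread, uses a single /e with the core functions hex and chr, so no string is ever evaluated twice. As far as I can tell, my original attempt failed because the \x{...} escape is resolved when the replacement string is parsed, before the capture has a value; computing the character at substitution time avoids that.

```perl
use strict;
use warnings;

# Same sample string as in the question.
my $html = "%20%26%40 ";

# /e evaluates the replacement once as Perl code:
# hex($1) turns the two hex digits into a number, chr() turns that into a character.
# Copying into $decoded first leaves the original string untouched.
(my $decoded = $html) =~ s/%([0-9A-Fa-f]{2})/chr(hex($1))/ge;

print "[$decoded]\n";    # prints "[ &@ ]"
```

This only maps each %XX to a single character code, so multi-byte UTF-8 sequences would still need a separate decoding step; for real pages the URI::Escape route is probably the safer choice.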
You can read the whole discussion here: http://stackoverflow.com/questions/12144401/how-can-convert-character-xx-in-html-using-perl
I should say StackOverflow is indeed a great place for technical people :-)



