I'm trying to remove unused spans (i.e. those with no attribute) from HTML files, having already cleaned up all the attributes I didn't want with other regular expressions.
I'm having a problem with my regex not picking the correct pair of start and end tags to remove.
my $a = 'a b c de';
$a =~ s/<span>(.*?)<\/span>/$1/g;
print "$a\n";
returns
a b c de
but I want it to return
a b c de
Help appreciated.
4 answers
#1
11
#!/usr/bin/perl
use strict;
use warnings;
use HTML::Parser;

# One flag per open span: true if its start tag had attributes and was printed.
my @print_span;

my $p = HTML::Parser->new(
    start_h => [ sub {
        my ($text, $name, $attr) = @_;
        if ( $name eq 'span' ) {
            # A hash in scalar context is true only when it is non-empty.
            my $print_tag = %$attr;
            push @print_span, $print_tag;
            return if !$print_tag;    # drop a span with no attributes
        }
        print $text;
    }, 'text,tagname,attr'],
    end_h => [ sub {
        my ($text, $name) = @_;
        if ( $name eq 'span' ) {
            # Drop the end tag when its matching start tag was dropped.
            return if !pop @print_span;
        }
        print $text;
    }, 'text,tagname'],
    default_h => [ sub { print shift }, 'text'],
);
$p->parse_file(\*DATA) or die "Err: $!";
$p->eof;
__END__
This is a titleThis is a header
a b c de
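For anyone who wants to see the stack bookkeeping on its own, here is a minimal core-Perl sketch of the same idea. It is only an illustration, not a replacement for HTML::Parser: the tag-finding pattern is deliberately naive and assumes there is no `>` inside attribute values and no span-like text inside comments or scripts. The helper name `strip_bare_spans` and the sample markup (including the `class` attribute) are made up for this example.

```perl
use strict;
use warnings;

# Remove span start tags that carry no attributes, together with the
# end tags that close them. Hypothetical helper, naive tag detection.
sub strip_bare_spans {
    my ($html) = @_;
    my @keep;        # one flag per open span: true if its start tag was kept
    my $out = '';

    # split with a capturing group returns text and span tags alternately.
    for my $piece (split /(<\/?span\b[^>]*>)/i, $html) {
        if ($piece =~ /^<span\b([^>]*)>$/i) {
            my $attrs = $1;
            my $has_attrs = ($attrs =~ /\S/) ? 1 : 0;
            push @keep, $has_attrs;
            $out .= $piece if $has_attrs;
        }
        elsif ($piece =~ /^<\/span\b/i) {
            # A stray end tag with nothing open is left untouched.
            my $kept = @keep ? pop @keep : 1;
            $out .= $piece if $kept;
        }
        else {
            $out .= $piece;    # ordinary text or other markup
        }
    }
    return $out;
}

# Assumed sample input, not taken from the question.
print strip_bare_spans('a <span>b <span class="x">c</span> d</span>e'), "\n";
# prints: a b <span class="x">c</span> de
```

The stack is what makes the end tags come out right: each `</span>` is kept or dropped according to the start tag it actually closes, not the nearest one in the text.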
#2
9
Regex is insufficiently powerful to parse HTML (or XML). Any regex you can come up with will fail to match various formulations of even valid HTML (let alone real-world tag soup).
This is a nesting problem. Regex can't normally handle nesting at all, but Perl has a non-standard extension to support regex recursion: (?n), where n is the group number to recurse into. So something like this would match both spans in your example:
(<span[^>]*>.*+(?1)?.*+<\/span>)
See perlfaq 6.11.
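To make the nesting point concrete, here is a small sketch comparing a lazy match with a recursive one. It is my own illustration under stated assumptions: the sample markup and its `class` attribute are invented, the helper names `lazy_match` and `balanced_match` are hypothetical, and the recursive pattern is one possible formulation (needs Perl 5.10 or later), not the only one. It only shows that recursion can pair nested tags correctly; it does not by itself solve the removal problem.

```perl
use strict;
use warnings;

# Assumed sample input: the class attribute is made up for illustration.
my $html = 'a <span>b <span class="x">c</span> d</span>e';

# Lazy matching pairs the first <span> with the nearest </span>,
# which here belongs to the inner element.
sub lazy_match {
    my ($s) = @_;
    return $s =~ /(<span>.*?<\/span>)/ ? $1 : undef;
}

# Recursive pattern: group 1 is one balanced span element. Its body may
# contain plain text, nested spans via (?1), and tags that are not spans.
my $balanced = qr{
    (
        <span\b[^>]*>
        (?:
            [^<]++                  # plain text
          | (?1)                    # a nested, balanced span
          | <(?!/?span\b)[^>]*>     # some other tag
        )*
        </span>
    )
}x;

sub balanced_match {
    my ($s) = @_;
    return $s =~ $balanced ? $1 : undef;
}

print lazy_match($html), "\n";
# prints: <span>b <span class="x">c</span>
print balanced_match($html), "\n";
# prints: <span>b <span class="x">c</span> d</span>
```

The lazy version stops at the inner end tag, which is exactly the mispairing described in the question, while the recursive version consumes the whole outer element.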
Unfortunately this still isn't enough, because it needs to be able to count both attributed and unattributed start-tags, allowing the end-tag to close either one. I can't think of a way this can be done without also matching the attributed span start-tags.
You need an HTML parser for this, and you should be using one anyway because regex for HTML/XML is decidedly the Wrong Thing.
#3
6
Don't use regexps for processing (HTML ==) XML. You never know what input you'll get. Consider this, valid HTML:
a b c de
Would you have thought of that?
Use an XML processor instead.
Also see the Related Questions (to the right) for your question.
#4
1
With all your help I've published a script that does everything I need.