Removing HTML with Perl: how can I remove unused, nested HTML span tags with a Perl regular expression?

I'm trying to remove unused spans (i.e. those with no attribute) from HTML files, having already cleaned up all the attributes I didn't want with other regular expressions.


I'm having a problem with my regex not picking the correct pair of start and end tags to remove.


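# The span markup was lost from this copy; the attribute value below is illustrative.
my $a = 'a <span>b <span style="color:red;">c</span> d</span>e';
$a =~ s/<span>(.*?)<\/span>/$1/g;
print "$a\n";

returns

a b <span style="color:red;">c d</span>e

but I want it to return

a b <span style="color:red;">c</span> de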

Help appreciated.

4 answers

#!/usr/bin/perl
use strict;
use warnings;
use HTML::Parser;

# One flag per open <span>: true if its start tag was printed.
my @print_span;

my $p = HTML::Parser->new(
    start_h => [ sub {
        my ($text, $name, $attr) = @_;
        if ( $name eq 'span' ) {
            # Keep the start tag only if it carries at least one attribute.
            my $print_tag = %$attr;
            push @print_span, $print_tag;
            return if !$print_tag;
        }
        print $text;
    }, 'text,tagname,attr'],
    end_h => [ sub {
        my ($text, $name) = @_;
        if ( $name eq 'span' ) {
            # Drop the end tag if its matching start tag was dropped.
            return if !pop @print_span;
        }
        print $text;
    }, 'text,tagname'],
    default_h => [ sub { print shift }, 'text'],
);

$p->parse_file(\*DATA) or die "Err: $!";
$p->eof;

__END__

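<!-- sample markup; the style attribute is illustrative -->
<html>
<head><title>This is a title</title></head>
<body>
<h1>This is a header</h1>
a <span>b <span style="color:red;">c</span> d</span>e
</body>
</html>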

Regex is insufficiently powerful to parse HTML (or XML). Any regex you can come up with will fail to match various formulations of even valid HTML (let alone real-world tag soup).


This is a nesting problem. Regex can't normally handle nesting at all, but Perl has a non-standard extension to support regex recursion: (?n), where n is the group number to recurse into. So something like this would match both spans in your example:


(<span[^>]*>.*+(?1)?.*+<\/span>)

See perlfaq 6.11.
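As a rough illustration of the recursion idea, here is a minimal sketch that matches one balanced span element, nested spans included. It is a variation rather than the exact pattern quoted in this answer: instead of `.*+` it uses an explicit "run of text with no span tag in it" alternative, so that the closing tag is still available to match. The names `$balanced_span` and `outer_span` are made up for this example, and the pattern assumes well-behaved markup (no `>` inside attribute values, no span-like text inside comments).

```perl
use strict;
use warnings;

# Group 1 matches one balanced span element, nested spans included.
# Requires Perl 5.10 or later for (?1) and possessive quantifiers.
my $balanced_span = qr{
    (
        <span\b[^>]*>                  # start tag, with or without attributes
        (?:
            (?: (?! </?span\b ) . )++  # a run of text containing no span tag
          |
            (?1)                       # a nested, balanced span element
        )*
        </span>
    )
}xs;

# Hypothetical helper: return the first balanced span element, or undef.
sub outer_span {
    my ($html) = @_;
    return $html =~ $balanced_span ? $1 : undef;
}

print outer_span('a <span>b <span style="color:red;">c</span> d</span>e'), "\n";
# prints: <span>b <span style="color:red;">c</span> d</span>
```

This only finds balanced pairs. It does not by itself decide which of the pairs should be removed.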

Unfortunately this still isn't enough, because it needs to be able to count both attributed and unattributed start-tags, allowing the end-tag to close either one. I can't think of a way this can be done without also matching the attributed span start-tags.
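To make the counting problem concrete, here is a minimal sketch that tracks every span start-tag on a stack and lets each end-tag close whichever start-tag is on top. The function name `strip_bare_spans` is hypothetical, and the tokenizer is still a regex, so it assumes simple markup: it will be fooled by a `>` inside an attribute value or by span-like text inside a comment. That is exactly why a real parser is the safer choice.

```perl
use strict;
use warnings;

# Hypothetical helper: remove attribute-less <span> tags and their end tags.
sub strip_bare_spans {
    my ($html) = @_;
    my @keep;        # one flag per open span: true if its start tag was kept
    my $out = '';

    # Split on span tags, keeping the tags themselves in the token list.
    for my $tok ( split m{(</?span\b[^>]*>)}i, $html ) {
        if ( $tok =~ m{^<span\b([^>]*)>$}i ) {
            my $attrs    = $1;
            my $has_attr = $attrs =~ /\S/ ? 1 : 0;
            push @keep, $has_attr;
            next unless $has_attr;      # drop a bare start tag
        }
        elsif ( $tok =~ m{^</span\b}i ) {
            # An end tag with no open span (empty stack) is kept as-is.
            my $kept = @keep ? pop @keep : 1;
            next unless $kept;          # drop the end tag of a dropped span
        }
        $out .= $tok;
    }
    return $out;
}

print strip_bare_spans('a <span>b <span style="color:red;">c</span> d</span>e'), "\n";
# prints: a b <span style="color:red;">c</span> de
```

The stack plays the same role as the `@print_span` array in the HTML::Parser answer. The difference is only in how the tags are found.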

You need an HTML parser for this, and you should be using one anyway because regex for HTML/XML is decidedly the Wrong Thing.


Don't use regexps for processing (HTML ==) XML. You never know what input you'll get. Consider this, valid HTML:


a b c de

Would you have thought of that?


Use an XML processor instead.


Also see the Related Questions (to the right) for your question.


With all your help I've published a script that does everything I need.
