How can I remove unused, nested HTML span tags with a Perl regex?

I'm trying to remove unused spans (i.e. those with no attribute) from HTML files, having already cleaned up all the attributes I didn't want with other regular expressions.

I'm having a problem with my regex not picking the correct pair of start and end tags to remove.

my $a = 'a b c de';
$a =~ s/<span>(.*?)<\/span>/$1/g;
print "$a\n";

returns

a b c de

but I want it to return

a b c de
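
The mispairing can be reproduced with a minimal sketch. The input string below is a hypothetical stand-in (the `class="x"` attribute is an assumption, not the original data): an attribute-less span wrapping an attributed one. The non-greedy `.*?` stops at the first `</span>` it meets, which closes the inner span, so the wrong end tag is removed.

```perl
use strict;
use warnings;

# Hypothetical input: an attribute-less span wrapping an attributed one.
my $html = 'a <span>b <span class="x">c</span> d</span>e';

# The non-greedy pattern pairs the outer <span> with the FIRST </span>,
# which actually belongs to the inner, attributed span.
(my $naive = $html) =~ s/<span>(.*?)<\/span>/$1/g;

print "$naive\n";   # a b <span class="x">c d</span>e
```

The wanted result for this input would be `a b <span class="x">c</span> de`, with the inner span left intact.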

Help appreciated.

4 Answers

#1

11

#!/usr/bin/perl
use strict;
use warnings;
use HTML::Parser;

# One entry per open span: true if its start tag was printed.
my @print_span;

my $p = HTML::Parser->new(
    start_h => [ sub {
        my ($text, $name, $attr) = @_;
        if ( $name eq 'span' ) {
            # A hash in scalar context is true only if it has keys,
            # i.e. only if the span carries at least one attribute.
            my $print_tag = %$attr;
            push @print_span, $print_tag;
            return if !$print_tag;
        }
        print $text;
    }, 'text,tagname,attr'],
    end_h => [ sub {
        my ($text, $name) = @_;
        if ( $name eq 'span' ) {
            # Drop the end tag if the matching start tag was dropped.
            return if !pop @print_span;
        }
        print $text;
    }, 'text,tagname'],
    default_h => [ sub { print shift }, 'text'],
);

$p->parse_file(\*DATA) or die "Err: $!";
$p->eof;

__END__
This is a title
This is a header
a b c de

#2

9

Regex is insufficiently powerful to parse HTML (or XML). Any regex you can come up with will fail to match various formulations of even valid HTML (let alone real-world tag soup).

This is a nesting problem. Regex can't normally handle nesting at all, but Perl has a non-standard extension to support regex recursion: (?n), where n is the group number to recurse into. So something like this would match both spans in your example:

(<span[^>]*>.*+(?1)?.*+<\/span>)

See perlfaq 6.11.
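
As a rough illustration of regex recursion, here is a minimal sketch that matches one balanced span element, nested spans included. It is not the pattern quoted above: it uses a named group and `(?&span)` instead of `(?1)`, the variable names are made up for the example, and the input string is hypothetical. It requires Perl 5.10 or later.

```perl
use strict;
use warnings;

# Sketch: match one balanced span element, including any nested spans.
# (?&span) recurses into the named group "span".
my $balanced = qr{
    (?<span>
        <span\b [^>]* >             # a start tag, with or without attributes
        (?:
            [^<]++                  # plain text
          | < (?! /? span\b )       # a "<" that does not begin a span tag
          | (?&span)                # a nested, balanced span
        )*+
        </span>
    )
}x;

# Hypothetical input, not the original poster's data.
my $html = 'a <span>b <span class="x">c</span> d</span>e';

my ($outer) = $html =~ /$balanced/;
print "$outer\n";   # <span>b <span class="x">c</span> d</span>
```

This only copes with well-formed nesting. A `</span>` inside an attribute value or a comment would still confuse it, which is one reason a parser remains the safer tool.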

Unfortunately this still isn't enough, because it needs to be able to count both attributed and unattributed start-tags, allowing the end-tag to close either one. I can't think of a way this can be done without also matching the attributes span start-tags.

You need an HTML parser for this, and you should be using one anyway because regex for HTML/XML is decidedly the Wrong Thing.

#3

6

Don't use regexps for processing (HTML ==) XML. You never know what input you'll get. Consider this, valid HTML:

a b c de

Would you have thought of that?

Use an XML processor instead.

Also see the Related Questions (to the right) for your question.

#4

1

With all your help I've published a script that does everything I need.
