regex pattern in python for parsing html

本文介绍了一种使用Python正则表达式从网页中抓取标题的方法,并对比了不同正则表达式的匹配效果。同时推荐使用BeautifulSoup等HTML解析库来简化这一过程。

I am learning to use both the re module and the urllib module in python and attempting to write a simple web scraper. Here's the code I've written to scrape just the title of websites:

#!/usr/bin/python

import urllib
import re

urls=["http://google.com","https://facebook.com","http://reddit.com"]

i=0

these_regex="<title>(.+?)</title>"
pattern=re.compile(these_regex)

while(i<len(urls)):
        htmlfile=urllib.urlopen(urls[i])
        htmltext=htmlfile.read()
        titles=re.findall(pattern,htmltext)
        print titles
        i+=1

This gives the correct output for Google and Reddit but not for Facebook - like so:

['Google']
[]
['reddit: the front page of the internet']

This is because, I found that on Facebook's page the title tag is as follows: <title id="pageTitle">. To accomodate for the additional id=, I modified the these_regex variable as follows: these_regex="<title.+?>(.+?)</title>". But this gives the following output:

[]
['Welcome to Facebook \xe2\x80\x94 Log in, sign up or learn more']
[]

How would I combine both so that I can take into account any additional parameters passed within the title tag?

share improve this question
 
6 
You really want to use a proper HTML parser; I recommend you look at BeautifulSoup instead. –  Martijn Pieters  Nov 18 '13 at 10:53
2 
stackoverflow.com/a/1732454/72746 –  Axarydax  Nov 18 '13 at 10:54

3 Answers

up vote 9 down vote accepted

You are using a regular expression, and matching HTML with such expressions get too complicated, too fast.

Use a HTML parser instead, Python has several to choose from. I recommend you use BeautifulSoup, a popular 3rd party library.

BeautifulSoup example:

from bs4 import BeautifulSoup

response = urllib2.urlopen(url)
soup = BeautifulSoup(response.read(), from_encoding=response.info().getparam('charset'))
title = soup.find('title').text

Since a title tag itself doesn't contain other tags, you can get away with a regular expression here, but as soon as you try to parse nested tags, you will run into hugely complex issues.

Your specific problem can be solved by matching additional characters within the title tag, optionally:

r'<title[^>]*>([^<]+)</title>'

This matches 0 or more characters that are not the closing > bracket. The '0 or more' here lets you match both extra attributes and the plain <title> tag.

share improve this answer
 

It is recommended that you use beautiful soup or other parser to parse html but if you badly want regex the following piece of code would do the job

The regex code::

<title.*?>(.+?)</title>

How it works:

Regular expression visualization

Produces:

['Google']
['Welcome to Facebook - Log In, Sign Up or Learn More']
['reddit: the front page of the internet']
share improve this answer
【轴承故障诊断】基于融合鱼鹰和柯西变异的麻雀优化算法OCSSA-VMD-CNN-BILSTM轴承诊断研究【西储大学数据】(Matlab代码实现)内容概要:本文提出了一种基于融合鱼鹰和柯西变异的麻雀优化算法(OCSSA)优化变分模态分解(VMD)参数,并结合卷积神经网络(CNN)与双向长短期记忆网络(BiLSTM)的轴承故障诊断模型。该方法利用西储大学公开的轴承数据集进行验证,通过OCSSA算法优化VMD的分解层数K和惩罚因子α,有效提升信号分解精度,抑制模态混叠;随后利用CNN提取故障特征的空间信息,BiLSTM捕捉时间序列的动态特征,最终实现高精度的轴承故障分类。整个诊断流程充分结合了信号预处理、智能优化与深度学习的优势,显著提升了复杂工况下轴承故障诊断的准确性与鲁棒性。; 适合人群:具备一定信号处理、机器学习及MATLAB编程基础的研究生、科研人员及从事工业设备故障诊断的工程技术人员。; 使用场景及目标:①应用于旋转机械设备的智能运维与故障预警系统;②为轴承等关键部件的早期故障识别提供高精度诊断方案;③推动智能优化算法与深度学习在工业信号处理领域的融合研究。; 阅读建议:建议读者结合MATLAB代码实现,深入理解OCSSA优化机制、VMD参数选择策略以及CNN-BiLSTM网络结构的设计逻辑,通过复现实验掌握完整诊断流程,并可进一步尝试迁移至其他设备的故障诊断任务中进行验证与优化。
评论
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值