regex pattern in python for parsing html


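This article shows how to scrape page titles from websites with Python regular expressions and compares how different patterns match. It also recommends HTML parsing libraries such as BeautifulSoup to simplify the task.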

regex pattern in python for parsing HTML title tags


I am learning to use both the re module and the urllib module in python and attempting to write a simple web scraper. Here's the code I've written to scrape just the title of websites:

#!/usr/bin/python

import urllib
import re

urls=["http://google.com","https://facebook.com","http://reddit.com"]

i=0

these_regex="<title>(.+?)</title>"
pattern=re.compile(these_regex)

while(i<len(urls)):
        htmlfile=urllib.urlopen(urls[i])
        htmltext=htmlfile.read()
        titles=re.findall(pattern,htmltext)
        print titles
        i+=1

This gives the correct output for Google and Reddit but not for Facebook - like so:

['Google']
[]
['reddit: the front page of the internet']

This is because the title tag on Facebook's page is written as <title id="pageTitle">. To accommodate the additional id= attribute, I modified the these_regex variable as follows: these_regex="<title.+?>(.+?)</title>". But this gives the following output:

[]
['Welcome to Facebook \xe2\x80\x94 Log in, sign up or learn more']
[]

How would I combine both so that the pattern handles any additional attributes inside the title tag?

edited Nov 18 '13 at 11:06

Martijn Pieters ♦


asked Nov 18 '13 at 10:52

i.h4d35


You really want to use a proper HTML parser; I recommend you look at BeautifulSoup instead. – Martijn Pieters♦ Nov 18 '13 at 10:53

stackoverflow.com/a/1732454/72746 – Axarydax Nov 18 '13 at 10:54


3 Answers


Accepted answer (9 votes)

You are using a regular expression, and matching HTML with regular expressions gets too complicated, too fast.

Use an HTML parser instead; Python has several to choose from. I recommend BeautifulSoup, a popular third-party library.

BeautifulSoup example:

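import urllib2  # Python 2, matching the question's code
from bs4 import BeautifulSoup

url = "http://google.com"  # any of the URLs from the question
response = urllib2.urlopen(url)
# Pass the charset from the HTTP headers so BeautifulSoup decodes the page correctly
soup = BeautifulSoup(response.read(), from_encoding=response.info().getparam('charset'))
title = soup.find('title').text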
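If installing a third-party package is not an option, the standard library also ships a parser. The following is only a minimal sketch for Python 3 using html.parser; the names TitleParser and extract_title are made up for this illustration and are not part of the original answer.

```python
from html.parser import HTMLParser


class TitleParser(HTMLParser):
    """Collects the text inside the first <title> element."""

    def __init__(self):
        super().__init__()
        self.in_title = False
        self.done = False
        self.parts = []

    def handle_starttag(self, tag, attrs):
        # Attributes such as id="pageTitle" arrive in attrs and are simply ignored.
        if tag == "title" and not self.done:
            self.in_title = True

    def handle_endtag(self, tag):
        if tag == "title" and self.in_title:
            self.in_title = False
            self.done = True

    def handle_data(self, data):
        # Text may arrive in several chunks, so collect and join later.
        if self.in_title:
            self.parts.append(data)


def extract_title(html):
    """Return the text of the first <title> element, or '' if there is none."""
    parser = TitleParser()
    parser.feed(html)
    parser.close()
    return "".join(parser.parts).strip()


print(extract_title("<html><head><title>Google</title></head></html>"))
print(extract_title('<head><title id="pageTitle">Welcome to Facebook</title></head>'))
```

Because the parser reports the tag name separately from its attributes, the plain and the id="pageTitle" variants are handled by the same code path.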

Since a title tag itself doesn't contain other tags, you can get away with a regular expression here, but as soon as you try to parse nested tags, you will run into hugely complex issues.
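To illustrate the nested-tag problem mentioned above, here is a small sketch on a made-up HTML string (the div markup is an assumption chosen for the demonstration, not taken from the question). Neither the lazy nor the greedy pattern recovers the two div elements separately.

```python
import re

# Hypothetical input: one <div> nested inside another.
html = "<div>outer <div>inner</div> tail</div>"

# Lazy match stops at the first </div>, cutting the outer element short.
lazy = re.findall(r"<div>(.*?)</div>", html)

# Greedy match runs to the last </div>, swallowing the inner element whole.
greedy = re.findall(r"<div>(.*)</div>", html)

print(lazy)    # ['outer <div>inner']
print(greedy)  # ['outer <div>inner</div> tail']
```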

Your specific problem can be solved by matching additional characters within the title tag, optionally:

r'<title[^>]*>([^<]+)</title>'

Here [^>]* matches zero or more characters that are not the closing > bracket, and ([^<]+) captures the title text up to the next <. The 'zero or more' lets the pattern match both a plain <title> tag and one with extra attributes.
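A quick offline check of this pattern, written for Python 3 and run against hand-written snippets rather than the live pages (the sample strings and the helper name find_titles are assumptions for illustration):

```python
import re

# The pattern from the answer: optional attributes, then the title text.
TITLE_RE = re.compile(r"<title[^>]*>([^<]+)</title>")


def find_titles(html):
    """Return every title text the pattern finds in the given HTML string."""
    return TITLE_RE.findall(html)


samples = [
    "<head><title>Google</title></head>",
    '<head><title id="pageTitle">Welcome to Facebook</title></head>',
    "<head><title>reddit: the front page of the internet</title></head>",
]

for html in samples:
    print(find_titles(html))
```

Each sample yields a one-element list, so the same pattern covers the tag with and without an id attribute.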

edited Nov 18 '13 at 11:23

answered Nov 18 '13 at 10:56

Martijn Pieters ♦



Answer (8 votes)

It is recommended that you use Beautiful Soup or another parser to parse HTML, but if you really want a regex, the following pattern will do the job.

The regex:

<title.*?>(.+?)</title>

How it works:

[Figure: regular expression visualization]

Produces:

['Google']
['Welcome to Facebook - Log In, Sign Up or Learn More']
['reddit: the front page of the internet']
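The difference between the question's second attempt and this pattern comes down to .+? (at least one character) versus .*? (zero or more). A short Python 3 sketch on hand-written sample strings, which are assumptions rather than the real page sources, makes that visible:

```python
import re

# Hypothetical stand-ins for the two kinds of title tag.
plain = "<title>Google</title>"
with_attr = '<title id="pageTitle">Welcome to Facebook</title>'

# .+? demands at least one character between "<title" and ">".
one_or_more = re.compile(r"<title.+?>(.+?)</title>")

# .*? also accepts nothing at all, so a bare <title> still matches.
zero_or_more = re.compile(r"<title.*?>(.+?)</title>")

print(one_or_more.findall(plain))       # []
print(one_or_more.findall(with_attr))   # ['Welcome to Facebook']
print(zero_or_more.findall(plain))      # ['Google']
print(zero_or_more.findall(with_attr))  # ['Welcome to Facebook']
```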

edited Nov 18 '13 at 12:43

answered Nov 18 '13 at 11:06

K DawG
