I am learning to use both the re module and the urllib module in Python and am trying to write a simple web scraper. Here's the code I've written to scrape just the titles of websites:
#!/usr/bin/python
import urllib
import re

urls=["http://google.com","https://facebook.com","http://reddit.com"]
i=0
these_regex="<title>(.+?)</title>"
pattern=re.compile(these_regex)

while(i<len(urls)):
    htmlfile=urllib.urlopen(urls[i])
    htmltext=htmlfile.read()
    titles=re.findall(pattern,htmltext)
    print titles
    i+=1
This gives the correct output for Google and Reddit but not for Facebook - like so:
['Google']
[]
['reddit: the front page of the internet']
This is because the title tag on Facebook's page looks like this: <title id="pageTitle">. To accommodate the additional id= attribute, I changed the these_regex variable to: these_regex="<title.+?>(.+?)</title>". But this gives the following output:
[]
['Welcome to Facebook \xe2\x80\x94 Log in, sign up or learn more']
[]
How would I combine both so that the pattern matches a title tag with or without additional attributes?
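
One possible way to cover both cases, sketched in Python 3 rather than the Python 2 of the code above and run on hard-coded sample strings instead of live pages. The second pattern fails on a bare <title> because .+? needs at least one character between "<title" and ">"; [^>]* allows zero or more non-">" characters, so the attributes become optional. The names TITLE_RE, find_titles and samples are made up for this illustration, and the sample HTML is a simplified stand-in for the real pages.

```python
import re

# <title, then any attributes (zero or more non-'>' characters), then '>'.
# \b keeps the pattern from matching longer tag names that start with "title".
# DOTALL lets a title span several lines; IGNORECASE accepts <TITLE>.
TITLE_RE = re.compile(r"<title\b[^>]*>(.*?)</title>", re.IGNORECASE | re.DOTALL)


def find_titles(html):
    """Return the stripped text of every <title> element found in html."""
    return [title.strip() for title in TITLE_RE.findall(html)]


# Simplified sample pages (assumptions, not the real markup of these sites).
samples = {
    "plain": "<html><head><title>Google</title></head></html>",
    "with_id": '<html><head><title id="pageTitle">Welcome to Facebook</title></head></html>',
}

# The second pattern from the question, kept for comparison.
old_pattern = re.compile(r"<title.+?>(.+?)</title>")

for name, html in samples.items():
    print(name, "old:", old_pattern.findall(html), "new:", find_titles(html))
```

On these samples the old pattern finds nothing for the bare <title> while the new one returns a title in both cases. Regular expressions remain fragile on real-world HTML, so this is a sketch for simple pages rather than a general solution.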

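This article presents a method for scraping titles from web pages with Python regular expressions and compares how different regular expressions match. It also recommends using an HTML parsing library such as BeautifulSoup to simplify the process.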
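
To illustrate the parser-based approach mentioned in the summary, here is a minimal Python 3 sketch. It uses the standard library's html.parser module as a dependency-free stand-in for BeautifulSoup, so it is not the BeautifulSoup API itself. The class TitleParser and the function extract_titles are hypothetical names chosen for this example.

```python
from html.parser import HTMLParser


class TitleParser(HTMLParser):
    """Collects the text inside every <title> element, whatever its attributes."""

    def __init__(self):
        super().__init__()
        self._in_title = False
        self.titles = []

    def handle_starttag(self, tag, attrs):
        # Tag names arrive lowercased; attributes such as id= are simply ignored.
        if tag == "title":
            self._in_title = True
            self.titles.append("")

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        # Text may arrive in several chunks, so append to the current title.
        if self._in_title:
            self.titles[-1] += data


def extract_titles(html):
    """Return the stripped text of each <title> element in html."""
    parser = TitleParser()
    parser.feed(html)
    parser.close()
    return [title.strip() for title in parser.titles]


print(extract_titles('<head><title id="pageTitle">Welcome to Facebook</title></head>'))
print(extract_titles("<head><title>Google</title></head>"))
```

Because the parser handles tag attributes itself, no pattern has to anticipate them. With BeautifulSoup installed, a roughly equivalent lookup is BeautifulSoup(html, "html.parser").title.string.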

