Screen scraping 3

License: CC 4.0 BY-SA

Original link: http://www.cnblogs.com/bluescorpio/archive/2012/05/22/2513951.html

Tags: #python

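This article walks through an example that uses Python's BeautifulSoup library to scrape job postings from a given website. It parses the HTML document, extracts every job title together with its link, and prints the deduplicated job names in alphabetical order.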

Use BeautifulSoup

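from urllib.request import urlopen
from bs4 import BeautifulSoup as BS

# Download the jobs page as raw bytes.
text = urlopen("http://www.python.org/community/jobs/").read()
# Decode as UTF-8 (assumed page encoding), dropping any undecodable bytes,
# and parse with the parser from the standard library.
soup = BS(text.decode('utf-8', 'ignore'), 'html.parser')

jobs = set()
for header in soup('h2'):
    # Keep only headings that contain a link with the "reference" class.
    links = header('a', 'reference')
    if not links:
        continue
    link = links[0]
    jobs.add('%s (%s)' % (link.string, link['href']))

# Print one job per line, sorted case-insensitively.
print('\n'.join(sorted(jobs, key=lambda s: s.lower())))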

Storing the entries in a set eliminates duplicates, and sorted() with a lowercase key prints the names in case-insensitive alphabetical order.
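
To see why the lowercase key matters, here is a small sketch of the deduplicate-then-sort step on its own. The job entries are made up for illustration and do not come from the real page.

```python
# Hypothetical job entries; the duplicate and the mixed case are deliberate.
entries = [
    "Data Analyst (/jobs/3)",
    "apprentice tester (/jobs/1)",
    "Backend Engineer (/jobs/2)",
    "apprentice tester (/jobs/1)",
]

# A set keeps only one copy of the repeated entry.
jobs = set(entries)

# The default ordering compares code points, so every uppercase
# letter sorts before every lowercase letter.
plain = sorted(jobs)

# Sorting on the lowercased string gives a case-insensitive order.
ordered = sorted(jobs, key=str.lower)

print("\n".join(ordered))
```

With the default ordering the lowercase entry would end up last, which is usually not what a reader expects from an alphabetical list.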

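soup('h2'): returns a list of all h2 elements in the document. Calling the soup object is shorthand for soup.find_all('h2').
header('a', 'reference'): returns a list of the a elements nested inside header that have the CSS class reference.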
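
The two calls can be tried offline on a small HTML string. The sketch below assumes the bs4 package is installed; the markup and the variable names (html, found) are invented for illustration and only mimic the structure the script expects.

```python
from bs4 import BeautifulSoup

# Made-up markup: one heading with a matching link, one without
# a link, and one whose link has a different class.
html = """
<h2><a class="reference" href="/jobs/1">Web Developer</a></h2>
<h2>Section without a link</h2>
<h2><a class="other" href="/jobs/2">Ignored link</a></h2>
"""

soup = BeautifulSoup(html, "html.parser")

# Calling the soup object is shorthand for find_all().
headers = soup("h2")

found = []
for header in headers:
    # A string as the second positional argument filters on CSS class.
    links = header("a", "reference")
    if links:
        found.append((str(links[0].string), links[0]["href"]))

print(found)
```

Only the first heading contributes an entry, because the second has no link and the third link does not carry the reference class.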