Google Python Class: Extracting Data Fields from an HTML Page with Regular Expressions

Article link: https://blog.youkuaiyun.com/catharryy/article/details/48245539

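This post shows how to use Python regular expressions to extract the year, names, and ranks from a specific HTML file. The script scans the HTML line by line, picks out the data that matches the expected format, and collects it into a list that is easy to process further.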


Format of the content to extract:

Here's what the html looks like in the baby.html files:
...
<h3 align="center">Popularity in 1990</h3>
....
<tr align="right"><td>1</td><td>Michael</td><td>Jessica</td>
<tr align="right"><td>2</td><td>Christopher</td><td>Ashley</td>
<tr align="right"><td>3</td><td>Matthew</td><td>Brittany</td>
...
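
As a quick check of the format above, the two patterns can be tried directly on the sample lines. This is a minimal Python 3 sketch (the script later in this post targets Python 2). The variable name `sample_html` and the `\d`-based patterns are illustrative choices, not part of the exercise.

```python
import re

# The sample lines quoted above, pasted into a string for experimentation.
sample_html = '''<h3 align="center">Popularity in 1990</h3>
<tr align="right"><td>1</td><td>Michael</td><td>Jessica</td>
<tr align="right"><td>2</td><td>Christopher</td><td>Ashley</td>
<tr align="right"><td>3</td><td>Matthew</td><td>Brittany</td>
'''

# The year is the four digits after "Popularity in".
year_match = re.search(r'Popularity in\s(\d{4})', sample_html)
year = year_match.group(1) if year_match else None

# Each table row yields a (rank, boy_name, girl_name) tuple of strings.
rows = re.findall(r'<td>(\d+)</td><td>(\w+)</td><td>(\w+)</td>', sample_html)

print(year)
print(rows)
```

`re.findall` returns one tuple per row because the pattern has three capture groups, so rank and both names stay together.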

Output requirements:

"""
    Given a file name for baby.html, returns a list starting with the year string
    followed by the name-rank strings in alphabetical order.
    ['2006', 'Aaliyah 91', 'Aaron 57', 'Abagail 895', ...]
    """

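One way to assemble that list from (rank, boy, girl) tuples is sketched below in Python 3. The helper name `build_name_list` and the rule of keeping the first rank seen when a name occurs twice are assumptions made for illustration. The exercise text quoted here does not spell out how duplicates should be handled.

```python
def build_name_list(year, rows):
    """Return [year, 'name rank', ...] with the name-rank strings sorted.

    rows is an iterable of (rank, boy_name, girl_name) string tuples.
    """
    ranks = {}
    for rank, boy, girl in rows:
        for name in (boy, girl):
            # Assumption: keep the first rank seen for a repeated name.
            ranks.setdefault(name, rank)
    return [year] + sorted(name + ' ' + rank for name, rank in ranks.items())


rows = [('1', 'Michael', 'Jessica'), ('2', 'Christopher', 'Ashley')]
print(build_name_list('1990', rows))
# ['1990', 'Ashley 2', 'Christopher 2', 'Jessica 1', 'Michael 1']
```

The sort is a plain string sort on the combined `'name rank'` strings, and the year is prepended only after sorting so it always stays first.
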
Approach:

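Parse the command-line arguments in main(), read each named file in turn, match it line by line, collect the name-rank strings (the exercise suggests a dict; the code below uses a list), sort them by name, and write the result to a file.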
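
The first step, separating the optional `--summaryfile` flag from the file names, can be isolated into a small function. A Python 3 sketch follows. `parse_args` is a hypothetical helper name, not something defined in the exercise, and the file name in the demonstration call is made up.

```python
def parse_args(argv):
    """Split argv (without the script name) into (summary, filenames)."""
    args = list(argv)          # copy so the caller's list is not modified
    summary = False
    if args and args[0] == '--summaryfile':
        summary = True
        del args[0]
    return summary, args


# Demonstration with an illustrative argument list.
print(parse_args(['--summaryfile', 'baby1990.html']))
# (True, ['baby1990.html'])
```

Keeping the flag handling separate from the extraction makes it straightforward to decide later whether the result should be printed or written to a file.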

<span style="font-size:18px;">#!/usr/bin/python
# Copyright 2010 Google Inc.
# Licensed under the Apache License, Version 2.0
# http://www.apache.org/licenses/LICENSE-2.0

# Google's Python Class
# http://code.google.com/edu/languages/google-python-class/

import sys
import re


"""Baby Names exercise

Define the extract_names() function below and change main()
to call it.

For writing regex, it's nice to include a copy of the target
text for inspiration.

Here's what the html looks like in the baby.html files:
...
<h3 align="center">Popularity in 1990</h3>
....
<tr align="right"><td>1</td><td>Michael</td><td>Jessica</td>
<tr align="right"><td>2</td><td>Christopher</td><td>Ashley</td>
<tr align="right"><td>3</td><td>Matthew</td><td>Brittany</td>
...

Suggested milestones for incremental development:
 -Extract the year and print it
 -Extract the names and rank numbers and just print them
 -Get the names data into a dict and print it
 -Build the [year, 'name rank', ... ] list and print it
 -Fix main() to use the extract_names list
"""


def extract_names(filename):
    """
    Given a file name for baby.html, returns a list starting with the year string
    followed by the name-rank strings in alphabetical order.
    ['2006', 'Aaliyah 91', 'Aaron 57', 'Abagail 895', ...]
    """
    # +++your code here+++
    file_output = open('reTestFile/output.txt', 'a+')   # summary output file
    file_raw = open('reTestFile/'+filename, 'rU')       # input single file

    # extract year
    # <h3 align="center">Popularity in 1990</h3>
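    year = ''        # stays empty if the file has no "Popularity in" heading
    dict_show = []   # despite the name, a list of 'name rank' strings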
    for a_line in file_raw:
        match_year = re.search(r'>Popularity in\s(\w+)<', a_line)
        match_name_and_rank = re.search(r'<tr align="right"><td>(\w+)</td><td>(\w+)</td><td>(\w+)</td>', a_line)
        if match_year:
            year = match_year.group(1)                       # 1990
            # print >> file_output, year
        if match_name_and_rank:
            rank = match_name_and_rank.group(1)
            name = match_name_and_rank.group(2)              # boy's name; group(3), the girl's name, is ignored
            # print >> file_output, name+rank
            dict_show.append(name+' '+rank)
    dict_show.sort()
    dict_show.insert(0, year)
    print >> file_output, dict_show
    file_output.write('\n')
    file_raw.close()
    file_output.close()
    return dict_show


def main():
    # This command-line parsing code is provided.
    # Make a list of command line arguments, omitting the [0] element
    # which is the script itself.
    args = sys.argv[1:]

    if not args:
        print 'usage: [--summaryfile] file [file ...]'
        sys.exit(1)

    # Notice the summary flag and remove it from args if it is present.
    summary = False
    if args[0] == '--summaryfile':
        summary = True
        del args[0]

    # +++your code here+++
    # For each filename, get the names, then either print the text output
    # or write it to a summary file
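    # extract_names() appends each result to reTestFile/output.txt itself.
    # Note: without --summaryfile, no file is processed and nothing is printed.
    if summary:
        for a_file in args:
            extract_names(a_file)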


if __name__ == '__main__':
    main()
</span>
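
The script above is written for Python 2: the `print >> file_output, ...` form and the bare `print 'usage: ...'` statement are Python 2 syntax, and the `'U'` open mode was removed in Python 3.11. A minimal sketch of the equivalent file handling in Python 3 is given below. The helper names `append_summary` and `read_lines` and the temporary path are assumptions for illustration only.

```python
import os
import tempfile


def append_summary(names, out_path):
    """Append the list's repr plus a blank line to out_path."""
    # 'with' closes the file even if writing raises an exception.
    with open(out_path, 'a') as out:
        print(names, file=out)   # Python 3 form of: print >> out, names
        out.write('\n')


def read_lines(path):
    """Read a text file; Python 3 text mode already normalises newlines."""
    with open(path, 'r') as raw:
        return raw.readlines()


# Small demonstration in a temporary directory (the path is illustrative).
with tempfile.TemporaryDirectory() as tmp:
    out_path = os.path.join(tmp, 'output.txt')
    append_summary(['1990', 'Michael 1'], out_path)
    lines = read_lines(out_path)
    print(lines)
```

Using `with` blocks replaces the explicit `close()` calls, so the files are released even when a regular expression or write fails partway through.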