在使用 Python Beautiful Soup 解析 HTML 时,我遇到了一个问题。我想获取一个具有特定 ID 的表格中的数据,但我总是得到一个 None 结果。以下是问题的 HTML 代码示例:
<table cellspacing="0" cellpadding="3" border="0" id="ctl00_SPWebPartManager1_g_c001c0d9_0cb8_4b0f_b75a_7cc3b6f7d790_ctl00_HistoryData1_gridHistoryData_DataGrid1" style="width:100%;border-collapse:collapse;">
<tr class="gridHeader" valign="top">
<td class="titleGridRegNoB" align="center" valign="top"><span dir=RTL>שווי שוק (אלפי ש"ח)</span></td>
<td class="titleGridReg" align="center" valign="top">הון רשום למסחר</td>
<td class="titleGridReg" align="center" valign="top">שער נמוך</td><td class="titleGridReg" align="center" valign="top">שער גבוה</td>
<td class="titleGridReg" align="center" valign="top">שער בסיס</td>
<td class="titleGridReg" align="center" valign="top">שער פתיחה</td><td class="titleGridReg" align="center" valign="top"><span dir="rtl">שער נעילה (באגורות)</span></td>
<td class="titleGridReg" align="center" valign="top">שער נעילה מתואם</td><td class="titleGridReg" align="center" valign="top">תאריך</td>
</tr>
<tr onmouseover="this.style.backgroundColor='#FDF1D7'" onmouseout="this.style.backgroundColor='#ffffff'">
... And so on
以下是我的代码示例:
html = br.response().read()
soup = BeautifulSoup(html)
table = soup.find(lambda tag: tag.name=='table' and tag.has_key('id') and tag['id']=="ctl00_SPWebPartManager1_g_c001c0d9_0cb8_4b0f_b75a_7cc3b6f7d790_ctl00_HistoryData1_gridHistoryData_DataGrid1")
rows = table.findAll(lambda tag: tag.name=='tr')
In [100]: print table
None
- 解决方案
根据问题的分析,有两种不同的解决方案:
解决方案 1:
仔细检查代码后,发现问题出在 find() 方法的语法上。find() 方法的第一个参数应该是一个字符串,指定要查找的标签名称。在你的代码中,你使用了 lambda 表达式来指定标签名称,这可能导致问题。因此,你可以将 lambda 表达式替换为一个字符串,如下所示:
table = soup.find('table', id="ctl00_SPWebPartManager1_g_c001c0d9_0cb8_4b0f_b75a_7cc3b6f7d790_ctl00_HistoryData1_gridHistoryData_DataGrid1")
解决方案 2:
另一个问题可能是 HTML 代码的编码方式。如果 HTML 代码不是 UTF-8 编码的,你可能会遇到解析问题。因此,你可以尝试将 HTML 代码解码为 UTF-8,然后再解析,如下所示:
html = br.response().read().decode('utf-8')
soup = BeautifulSoup(html)
table = soup.find('table', id="ctl00_SPWebPartManager1_g_c001c0d9_0cb8_4b0f_b75a_7cc3b6f7d790_ctl00_HistoryData1_gridHistoryData_DataGrid1")
代码示例:
# coding=utf-8
from bs4 import BeautifulSoup
html = u"""
<table cellspacing="0" cellpadding="3" border="0" id="ctl00_SPWebPartManager1_g_c001c0d9_0cb8_4b0f_b75a_7cc3b6f7d790_ctl00_HistoryData1_gridHistoryData_DataGrid1" style="width:100%;border-collapse:collapse;">
<tr class="gridHeader" valign="top">
<td class="titleGridRegNoB" align="center" valign="top"><span dir=RTL>שווי שוק (אלפי ש"ח)</span></td><td class="titleGridReg" align="center" valign="top">הון רשום למסחר</td><td class="titleGridReg" align="center" valign="top">שער נמוך</td><td class="titleGridReg" align="center" valign="top">שער גבוה</td><td class="titleGridReg" align="center" valign="top">שער בסיס</td><td class="titleGridReg" align="center" valign="top">שער פתיחה</td><td class="titleGridReg" align="center" valign="top"><span dir="rtl">שער נעילה (באגורות)</span>
</td><td class="titleGridReg" align="center" valign="top">שער נעילה מתואם</td><td class="titleGridReg" align="center" valign="top">תאריך</td>
</tr><tr onmouseover="this.style.backgroundColor='#FDF1D7'" onmouseout="this.style.backgroundColor='#ffffff'">
"""
soup = BeautifulSoup(html)
print soup.find_all("table",
id="ctl00_SPWebPartManager1_g_c001c0d9_0cb8_4b0f_b75a_7cc3b6f7d790_ctl00_HistoryData1_gridHistoryData_DataGrid1")
我希望这些解决方案能够帮助你解决问题。