Part 1: Web Scraping
Disclaimer: The information in this post is for educational purposes only and should not be taken as financial or investment advice.
In a 3-part series of blog posts, I describe how to scrape a list of companies in the Russell 2000 index from CNN, store them in a database, and combine them with some data from the Finnhub API to find the biggest losers to short (no offense to anyone who works for these companies!).
In Part 1, I describe the process of creating a function to get the list of Russell 2000 companies and their data provided in tabular form and return a list of this data. In Part 2, I’ll describe how to store and retrieve this data in SQLite. Then in Part 3, I’ll introduce the Finnhub API that provides a wide assortment of stock market data for free and add some data to the companies that could be useful for determining a short position. I’ll also use termtables to make the output in the terminal a little easier to read.
I was looking for a web API which lists the companies in the Russell 2000 index. I didn’t find one, but what I did find is a list that CNN keeps updated with the statistics I was looking for:
- Symbol (Ticker)
- Price
- YTD % Change
Along with a few other useful numbers (P/E, Volume, etc). I’ve been writing a lot of Python lately and decided to try a web scraper. It turned out to be quite easy to scrape the data from the paginated table. The CNN Terms of Use say “You may download copyrighted material for your personal use only.” Since I’m using this scraper only for myself, onward.
From searching the internet I found that BeautifulSoup is the most popular Python library for web scraping, so I installed it with the pip package manager for this project. A popular library for making http requests is the aptly-named Requests library. These are the two libraries I brought into the project with pip.
The steps needed to scrape the data consisted of the following:
1. Iterate through the CNN pages. The one and only query parameter in the URLs for these pages is ?page=, so I need to retrieve each page number until receiving a 404 or a page without the table. I’ll use the Requests library to make the http request for each page of the list.
2. In each iteration (after each page request that returns content), use BeautifulSoup to find the table in the DOM and parse the table rows to create a list of values for that company. Note: if page=1, I also want to get the top table row with the headers.
3. Return a list of lists from the function: a list of the table rows converted to lists. I chose to make a list of lists rather than a list of dictionaries because each row has values for the same fields.
So… this is a very short script. You’ll be pleasantly surprised how little code is involved.
Explanation
Before starting the script I installed the libraries. In the terminal:
pip install requests
pip install bs4
The pip package manager comes with Python, so if you’ve installed Python you should have access to it. You may want to create a virtual environment, but I’m not going to include it here for the sake of simplicity.
import requests
from bs4 import BeautifulSoup
import time

CNN_URL = "https://money.cnn.com/data/markets/russell?page="
At the top of the file I’ve imported the Requests library, the BeautifulSoup web scraping library and the time module, which is already included with Python. The CNN_URL string will have a page number appended to it each time through a loop that fetches the pages.
There’s only one function in this file, predictably named scrape_russell_2000, which takes no arguments. In the function, I’ve set up a few variables ahead of the scraping loop:
def scrape_russell_2000():
    page_exists = True
    page_num = 1
    russell_2000_data = []
I’m setting up a page_exists variable, which will stay True until the scraper passes the last page and the while loop ends. The page_num variable will be appended to the CNN_URL string for each http request and then incremented for the next one. The empty russell_2000_data list will store all the lists of company data.
Now onto the while loop. While page_exists is True, do all the scraping stuff.
def scrape_russell_2000():
    page_exists = True
    page_num = 1
    russell_2000_data = []

    while page_exists:
        get_headers = (page_num == 1)
        resp = requests.get(CNN_URL + str(page_num))
        if resp.status_code == requests.codes.ok:
            page_num += 1
            soup = BeautifulSoup(resp.content, 'html.parser')
            table_div = soup.find(id="wsod_indexConstituents")
            if not table_div:
                page_exists = False
                return russell_2000_data
            ...
        else:
            page_exists = False
            return russell_2000_data
This while loop will run until one of two things happens:
a. It retrieves a page that doesn’t have an element with the id “wsod_indexConstituents” (found using the browser’s web development tools). This is the id of the element that holds the company data table.
b. The URL it tries to get doesn’t return a status code of 200, which is what requests.codes.ok equals.
If the page is found, it increments the page number by 1. Then, I create a BeautifulSoup object around the content of the response; this object provides a bunch of utility methods for working with the content. Finally, I create a variable to hold all the content in the parent element of the table, the element with id="wsod_indexConstituents".
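To make the parsing concrete, here is a small standalone example using a made-up HTML fragment shaped roughly like the CNN page (the real markup has more columns and attributes, and the values here are invented); it just demonstrates how find(id=...) and find_all('tr') behave:

from bs4 import BeautifulSoup

sample_html = """
<div id="wsod_indexConstituents">
  <table>
    <tr><th>Symbol</th><th>Price</th></tr>
    <tr><td>AEIS\xa0ADVANCED ENERGY INDS IN</td><td>7.50</td></tr>
  </table>
</div>
"""

soup = BeautifulSoup(sample_html, 'html.parser')
table_div = soup.find(id="wsod_indexConstituents")  # the wrapper element found via the dev tools
rows = table_div.find_all('tr')                     # header row plus one data row
print([th.text for th in rows[0].find_all('th')])   # ['Symbol', 'Price']
print([td.text for td in rows[1].find_all('td')])   # ['AEIS\xa0ADVANCED ENERGY INDS IN', '7.50']

Note the \xa0 character in the first cell; cleaning that up is covered further down.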
If the program doesn’t exit the while loop because of either of those two conditions, execution continues past the line that checks
if not table_div:
At this point I want all the rows in the table. But there’s one little complication. I want to send back the table headers along with the company data so that I don’t have to hard-code them into the program. So if it’s the first page of results, I want the first row (with index 0). Otherwise, I want all the rows starting with index 1.
            ...
            if get_headers:
                header_row = table_div.find_all('tr')[0]
                hds = header_row.find_all('th')
                headers = [h.text for h in hds]
                russell_2000_data.append(headers)
                print(headers)
There is a list comprehension in the block above:
headers = [h.text for h in hds]
This is a short-hand way to iterate through an iterable data type (in this case a list) and do something with each item. In this case, I’m pulling the inner text of the table header elements.
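For anyone new to comprehensions, that line is equivalent to the more verbose loop below:

headers = []
for h in hds:
    headers.append(h.text)  # same result as the comprehension above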
The next lines always execute whenever the loop hits a page that exists:
            ...
            company_rows = table_div.find_all('tr')[1:]
            for row in company_rows:
                company_data = []
                row_data = row.find_all('td')
                symbol = row_data[0].text.replace("\xa0", " ").split(" ")[0]
                name = " ".join(row_data[0].text.replace("\xa0", " ").split(" ")[1:])
                company_data.append(symbol)
                company_data.append(name)
                for td in row_data[1:]:
                    company_data.append(td.text)
                russell_2000_data.append(company_data)
                print(company_data)
The block above gets all the rows from the table except the first and appends the values of the columns to a new list called company_data. The only complications are:
1. The symbol and company name are in one string, and therefore in one column of the table.
2. There is a Unicode character, \xa0 (a non-breaking space), that has to be removed so the string containing the symbol and company name can be split.
symbol = row_data[0].text.replace("\xa0", " ").split(" ")[0]
In the line above, we know that both the symbol and the company name are in row_data[0], because that holds the text from the first ‘td’ element. The \xa0 character is replaced with a regular space and then the string is split at every space character. Then I grab the first piece of the string, at index 0.
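As a quick illustration, here is that expression applied to a made-up cell value (assuming the \xa0 sits between the symbol and the name, which is what the replace-then-split relies on):

cell_text = "AEIS\xa0ADVANCED ENERGY INDS IN"           # example value of row_data[0].text
symbol = cell_text.replace("\xa0", " ").split(" ")[0]   # replace the non-breaking space, split, take index 0
print(symbol)                                           # AEIS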
name = " ".join(row_data[0].text.replace("\xa0", " ").split(" ")[1:])
The next line does the same replace and split as the line above, but keeps all the string fragments from index 1 onward (denoted by [1:]) and then joins them with spaces again. So if the string fragments were, for example:
['AEIS', 'ADVANCED', 'ENERGY', 'INDS', 'IN']
Slicing from index 1 forward:
['AEIS', 'ADVANCED', 'ENERGY', 'INDS', 'IN'][1:]
= ['ADVANCED', 'ENERGY', 'INDS', 'IN']
Then rejoining with a space:
" ".join(['AEIS', 'ADVANCED', 'ENERGY', 'INDS', 'IN'])
gives
'ADVANCED ENERGY INDS IN'
The following for loop just grabs the text from all the other ‘td’ table data elements and appends it to the company_data list, since those values don’t need to be altered in any way.
for td in row_data[1:]:
    company_data.append(td.text)
Finally, the company_data list is appended to the russell_2000_data list.
russell_2000_data.append(company_data)
print(company_data)
You’ll notice I’m printing the headers and the company_data lists as they’re created just to verify the script is running correctly.
At the bottom of the while loop is a time.sleep(1) command. I added this so that the requests don’t hit CNN’s site too close together and get interpreted as a DDoS attack. It may not have caused an issue, but it’s included just to be safe. That was the sole reason for importing the time module.
The following code allows the script to run as a stand-alone program:
if __name__ == "__main__":
    scrape_russell_2000()
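For reference, here is the whole script assembled from the snippets above. The only piece that wasn’t shown explicitly earlier is the exact position of the time.sleep(1) call, so its placement at the bottom of the while loop body is my best guess based on the description:

import requests
from bs4 import BeautifulSoup
import time

CNN_URL = "https://money.cnn.com/data/markets/russell?page="


def scrape_russell_2000():
    page_exists = True
    page_num = 1
    russell_2000_data = []

    while page_exists:
        get_headers = (page_num == 1)
        resp = requests.get(CNN_URL + str(page_num))
        if resp.status_code == requests.codes.ok:
            page_num += 1
            soup = BeautifulSoup(resp.content, 'html.parser')
            table_div = soup.find(id="wsod_indexConstituents")
            if not table_div:
                page_exists = False
                return russell_2000_data

            if get_headers:
                # first page only: capture the column headers
                header_row = table_div.find_all('tr')[0]
                hds = header_row.find_all('th')
                headers = [h.text for h in hds]
                russell_2000_data.append(headers)
                print(headers)

            company_rows = table_div.find_all('tr')[1:]
            for row in company_rows:
                company_data = []
                row_data = row.find_all('td')
                symbol = row_data[0].text.replace("\xa0", " ").split(" ")[0]
                name = " ".join(row_data[0].text.replace("\xa0", " ").split(" ")[1:])
                company_data.append(symbol)
                company_data.append(name)
                for td in row_data[1:]:
                    company_data.append(td.text)
                russell_2000_data.append(company_data)
                print(company_data)
        else:
            page_exists = False
            return russell_2000_data
        time.sleep(1)  # pause between page requests so as not to hammer CNN's servers


if __name__ == "__main__":
    scrape_russell_2000()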
Final Thoughts
You’ll notice that the correctness of this script relies on something beyond my control: the id of the element containing the table. If that were to change, the script would fail. Alternatively, one could just find table elements in the page, but if there were more than one table and they didn’t have element IDs (as this table doesn’t), that would cause issues too.
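For comparison, the id-free approach mentioned above would look something like the sketch below; it only works if the company table happens to be the first (or only) table on the page, which is exactly the fragility being described:

import requests
from bs4 import BeautifulSoup

resp = requests.get("https://money.cnn.com/data/markets/russell?page=1")
soup = BeautifulSoup(resp.content, 'html.parser')
tables = soup.find_all('table')                  # every table on the page, with or without an id
company_table = tables[0] if tables else None    # hope the company table comes first
print(len(tables))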
Typically, one writes some tests on what the scraper returns. When those tests fail, it’s time to look at the DOM of the page being scraped and see what changed.
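A test like that isn’t part of this post, but a minimal structural check on the returned list might look something like the sketch below (the module name scrape_russell is hypothetical; adjust it to whatever the file is called):

from scrape_russell import scrape_russell_2000  # hypothetical module name for the script above


def test_scrape_russell_2000_shape():
    data = scrape_russell_2000()
    assert len(data) > 1                                    # header row plus at least one company row
    assert all(isinstance(cell, str) for cell in data[0])   # the first list holds the column headers
    row_lengths = {len(row) for row in data[1:]}
    assert len(row_lengths) == 1                            # every company row has the same number of fields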
In Part II, I’ll be importing this data into SQLite by calling this function from another script. Storing this data in a database will make retrieval a lot faster if I only want to update the data once a day or less frequently.
Thanks for reading! If you have any suggestions on optimizing the script, error corrections, or any insights in general, please let me know in the comments!