Data (Part 2)
Web scraping
- The goal is to extract data from websites
  - Noisy, weakly labeled, and can be spammy
  - Available at scale
  - E.g. price comparison/tracking websites
- Many ML datasets are obtained by web scraping
  - E.g. ImageNet, Kinetics
- Web crawling vs. scraping
  - Crawling: indexing whole pages across the Internet
  - Scraping: extracting particular data from the web pages of a website
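The difference between crawling and scraping can be sketched in a few lines. This is only a minimal illustration using `html.parser` from the Python standard library; the sample HTML, the `example.com` URLs, and the class names `LinkCollector` and `TitleScraper` are hypothetical and not part of the original notes.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Crawling step: collect every outgoing link so it can be visited later."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                # Resolve relative links against the URL of the current page.
                self.links.append(urljoin(self.base_url, href))


class TitleScraper(HTMLParser):
    """Scraping step: pull one specific field (the <h1> text) from a page."""

    def __init__(self):
        super().__init__()
        self._in_h1 = False
        self.title = ""

    def handle_starttag(self, tag, attrs):
        if tag == "h1":
            self._in_h1 = True

    def handle_endtag(self, tag):
        if tag == "h1":
            self._in_h1 = False

    def handle_data(self, data):
        if self._in_h1:
            self.title += data.strip()


# Hypothetical page content, used here instead of a live request.
SAMPLE_HTML = """
<html><body>
  <h1>Blue Widget</h1>
  <a href="/items/2">Next item</a>
  <a href="https://example.com/about">About</a>
</body></html>
"""

crawler = LinkCollector("https://example.com/items/1")
crawler.feed(SAMPLE_HTML)
scraper = TitleScraper()
scraper.feed(SAMPLE_HTML)

print(crawler.links)   # pages a crawler would visit next
print(scraper.title)   # the one field a scraper cares about
```

A crawler would keep feeding the collected links back into a queue of pages to visit, while a scraper stops once the wanted fields are extracted.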
Tools
- “curl” often doesn’t work
  - Website owners use various ways to stop bots
- Use a headless browser: a web browser without a GUI
- You need a lot of new IPs, which are easy to get through public clouds
  - Of all IPv4 addresses, AWS owns 1.75%, Azure 0.55%, GCP 0.25% (a simpler option is to use an IP proxy pool)
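from selenium import webdriver

# Configure Chrome to run headless (without a GUI).
chrome_options = webdriver.ChromeOptions()
chrome_options.add_argument("--headless")
chrome = webdriver.Chrome(options=chrome_options)

# get() only navigates to the page and returns None;
# the rendered HTML is available through page_source.
chrome.get(url)
page = chrome.page_source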
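To put the IPv4 shares quoted for AWS, Azure, and GCP in perspective, the short calculation below converts them into rough address counts. It assumes the percentages are taken over the full 32-bit address space (2^32 addresses, including reserved ranges), which the notes do not state explicitly.

```python
# Total number of IPv4 addresses in the 32-bit address space.
TOTAL_IPV4 = 2 ** 32

# Shares quoted in the notes (fractions of all IPv4 addresses).
cloud_share = {"AWS": 0.0175, "Azure": 0.0055, "GCP": 0.0025}

# Approximate address counts implied by those shares.
cloud_ips = {name: round(TOTAL_IPV4 * share) for name, share in cloud_share.items()}

for name, count in cloud_ips.items():
    print(f"{name}: about {count / 1e6:.1f} million addresses")
```

Under that assumption, even the smallest share corresponds to roughly ten million addresses, which is why public clouds are a convenient source of fresh IPs.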
Crawl individual pages
For bs4 (Beautiful Soup), you can refer directly to this article
Extract data
- Identify the HTML elements through Inspect (press F12 to open the browser's developer tools)
- Repeat the previous process to extract the other data fields
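The extraction step might look like the sketch below. To stay dependency-free it uses `html.parser` from the standard library rather than bs4; the sample markup, the class names `price` and `address`, and the `FieldScraper` helper are hypothetical stand-ins for whatever Inspect reveals on the real page.

```python
from html.parser import HTMLParser


class FieldScraper(HTMLParser):
    """Collect the text of elements whose class attribute names a wanted field."""

    def __init__(self, wanted_classes):
        super().__init__()
        self.fields = {name: [] for name in wanted_classes}
        self._current = None  # field whose text is being read, if any

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        for name in classes:
            if name in self.fields:
                self._current = name

    def handle_endtag(self, tag):
        # Assumes each field is a leaf element with no nested tags.
        self._current = None

    def handle_data(self, data):
        text = data.strip()
        if self._current and text:
            self.fields[self._current].append(text)


# Hypothetical markup, as it might appear in the developer tools.
SAMPLE_HTML = """
<div class="listing">
  <span class="price">$500,000</span>
  <span class="address">12 Example St</span>
</div>
<div class="listing">
  <span class="price">$725,000</span>
  <span class="address">34 Sample Ave</span>
</div>
"""

field_scraper = FieldScraper(["price", "address"])
field_scraper.feed(SAMPLE_HTML)

# Pair the fields up into one record per listing.
records = [
    {"price": price, "address": address}
    for price, address in zip(field_scraper.fields["price"],
                              field_scraper.fields["address"])
]
print(records)
```

Extracting another field is then a matter of finding its element in the developer tools and adding its class name to the list passed to `FieldScraper`.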