Machine Learning: Basic ML Modeling (Data, Part 2)

Data (Part 2)

Web scraping

The goal is to extract data from websites
- Noisy, weak labels, can be spammy
- Available at scale
- E.g. price comparison/tracking websites
Many ML datasets are obtained by web scraping
- E.g. ImageNet, Kinetics
Web crawling vs. scraping
- Crawling: indexing whole pages across the Internet
- Scraping: extracting particular data from the web pages of a website
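
To make the crawling/scraping distinction concrete, here is a minimal sketch using only Python's standard-library `html.parser`. The HTML snippet, the `price` class name, and the helper names `collect_links` and `scrape_prices` are all made up for illustration; a real site will have its own markup.

```python
from html.parser import HTMLParser

# Hypothetical page, inlined so the example runs without network access.
HTML = """
<html><body>
  <a href="/item/1">Item 1</a>
  <a href="/item/2">Item 2</a>
  <span class="price">19.99</span>
</body></html>
"""


class LinkCollector(HTMLParser):
    """Crawling step: collect every link so more pages can be visited."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)


class PriceScraper(HTMLParser):
    """Scraping step: pull out one specific field (here, the price)."""

    def __init__(self):
        super().__init__()
        self.in_price = False
        self.prices = []

    def handle_starttag(self, tag, attrs):
        # The "price" class is an assumption about the page's markup.
        if tag == "span" and dict(attrs).get("class") == "price":
            self.in_price = True

    def handle_endtag(self, tag):
        if tag == "span":
            self.in_price = False

    def handle_data(self, data):
        if self.in_price and data.strip():
            self.prices.append(float(data.strip()))


def collect_links(html):
    parser = LinkCollector()
    parser.feed(html)
    return parser.links


def scrape_prices(html):
    parser = PriceScraper()
    parser.feed(html)
    return parser.prices
```

A crawler would feed the collected links back into a queue of pages to visit, while a scraper only keeps the targeted field from each page.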

Tools

“curl” often doesn’t work
- Website owners use various ways to stop bots
Use a headless browser: a web browser without a GUI
You need many fresh IP addresses, which are easy to obtain through public clouds
- Of all IPv4 addresses, AWS owns 1.75%, Azure 0.55%, GCP 0.25% (a simpler option is to use an IP proxy pool)
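
As a rough illustration of the IP proxy pool idea, the sketch below rotates through a list of proxies in round-robin order. The addresses are placeholders from the 192.0.2.0/24 documentation range, and `proxy_rotation` is a hypothetical helper, not part of any library.

```python
from itertools import cycle

# Placeholder proxy addresses (documentation range, not real proxies).
PROXIES = [
    "http://192.0.2.10:8080",
    "http://192.0.2.11:8080",
    "http://192.0.2.12:8080",
]


def proxy_rotation(proxies):
    """Yield one proxies mapping per request, cycling round-robin."""
    for address in cycle(proxies):
        yield {"http": address, "https": address}


rotation = proxy_rotation(PROXIES)

# Take four mappings: the fourth wraps around to the first proxy again.
first_four = [next(rotation)["http"] for _ in range(4)]
```

Each yielded mapping has the shape that HTTP clients such as `requests` accept for their `proxies` argument, so successive requests would leave from different addresses.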

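from selenium import webdriver

# Run Chrome without a GUI (headless mode)
chrome_options = webdriver.ChromeOptions()
chrome_options.add_argument("--headless")
chrome = webdriver.Chrome(options=chrome_options)

# url is the address of the page to scrape.
# get() loads the page and returns None; the rendered HTML is in page_source.
chrome.get(url)
page = chrome.page_source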

Crawl individual pages

Spider Theory Series: bs4 (浅辄's technical blog on 51CTO)

For bs4, you can go straight to that article.

Extract data