Machine Learning: Basic ML Modeling (Data, Part 2)

Data (Part 2)

Web Scraping

  • The goal is to extract data from websites

    • The data is noisy, weakly labeled, and can be spammy

    • Available at scale

    • E.g. price comparison/tracking websites

  • Many ML datasets are obtained by web scraping

    • E.g. ImageNet, Kinetics

  • Web crawling vs. scraping

    • Crawling: indexing whole pages on the Internet

    • Scraping: extracting particular data from the web pages of a website
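The crawling half of that distinction can be sketched as a breadth-first walk over links. This is only a minimal sketch: the `SITE` dictionary is a hypothetical in-memory website standing in for real HTTP requests, and `LinkParser` and `crawl` are illustrative names, not part of any library.

```python
from collections import deque
from html.parser import HTMLParser

# Hypothetical in-memory "website": path -> HTML. Stands in for real HTTP fetches.
SITE = {
    "/": '<a href="/a">A</a> <a href="/b">B</a>',
    "/a": '<a href="/b">B</a> <a href="/c">C</a>',
    "/b": '<a href="/">home</a>',
    "/c": "<p>leaf page</p>",
}


class LinkParser(HTMLParser):
    """Collects the href of every <a> tag on a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)


def crawl(start):
    """Breadth-first crawl: visit every page reachable from `start` once."""
    seen, queue, order = {start}, deque([start]), []
    while queue:
        path = queue.popleft()
        order.append(path)
        parser = LinkParser()
        parser.feed(SITE.get(path, ""))
        for link in parser.links:
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return order


visited = crawl("/")
print(visited)
```

A crawler only cares about reaching every page; a scraper would instead pull specific fields out of each page it is pointed at.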

Tools

  • “curl” often doesn’t work

    • Website owners use various ways to stop bots

  • Use a headless browser: a web browser without a GUI

  • You need a lot of new IPs, which are easy to get through public clouds

    • Of all IPv4 addresses, AWS owns 1.75%, Azure 0.55%, GCP 0.25% (put simply: use an IP proxy pool)

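from selenium import webdriver

url = "https://example.com"  # placeholder URL; replace with the target page

# Configure Chrome to run without a GUI (headless)
chrome_options = webdriver.ChromeOptions()
chrome_options.add_argument("--headless")
chrome = webdriver.Chrome(options=chrome_options)

chrome.get(url)  # navigates to the page; get() itself returns None
page = chrome.page_source  # HTML of the page after the browser has loaded it
chrome.quit()  # close the browser when done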
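The point about needing many IPs is often handled with a rotating proxy pool. The following is a rough sketch only, using the standard library: the proxy addresses are made-up placeholders from a documentation range, and `build_opener_with_next_proxy` is a hypothetical helper name. No request is actually sent here.

```python
import itertools
import urllib.request

# Hypothetical proxy addresses (documentation range); replace with real ones.
PROXIES = [
    "http://203.0.113.10:8080",
    "http://203.0.113.11:8080",
    "http://203.0.113.12:8080",
]

# Round-robin iterator over the pool
proxy_cycle = itertools.cycle(PROXIES)


def build_opener_with_next_proxy():
    """Return (proxy, opener); the opener routes traffic via that proxy."""
    proxy = next(proxy_cycle)
    handler = urllib.request.ProxyHandler({"http": proxy, "https": proxy})
    opener = urllib.request.build_opener(handler)
    # Some sites reject the default Python User-Agent, so send a browser-like one.
    opener.addheaders = [("User-Agent", "Mozilla/5.0")]
    return proxy, opener


# Each call picks the next proxy in the pool, wrapping around at the end.
used = [build_opener_with_next_proxy()[0] for _ in range(4)]
print(used)
```

To fetch a page through the selected proxy, one would call `opener.open(url)` on the returned opener; with Selenium, the same idea is usually applied by passing a proxy setting to the browser instead.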

Crawl individual pages

For bs4, see the article "Spider Theory Series: bs4" on 浅辄's technical blog at 51CTO.

Extract data

  • Identify the HTML elements through the browser's Inspect tool

Press F12 to open the developer tools.

  • Repeat the previous process to extract the data for other fields
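The extraction steps above can be sketched as follows. To keep the example runnable without extra installs, it uses the standard library's `html.parser` as a stand-in for bs4. The HTML snippet, the class names, and the `FIELDS` mapping are all hypothetical, chosen to mimic what one might find in the Inspect panel.

```python
from html.parser import HTMLParser

# Hypothetical listing markup, as one might see in the Inspect panel.
HTML = """
<div class="listing">
  <span class="price">$350,000</span>
  <span class="address">12 Example St</span>
  <span class="beds">3</span>
</div>
"""

# field name -> CSS class found via Inspect; add one entry per extra field.
FIELDS = {"price": "price", "address": "address", "beds": "beds"}


class FieldExtractor(HTMLParser):
    """Captures the text of elements whose class matches a wanted field."""

    def __init__(self, fields):
        super().__init__()
        self.class_to_field = {cls: name for name, cls in fields.items()}
        self.current = None  # field currently being read, if any
        self.record = {}

    def handle_starttag(self, tag, attrs):
        cls = dict(attrs).get("class")
        self.current = self.class_to_field.get(cls)

    def handle_data(self, data):
        if self.current and data.strip():
            self.record[self.current] = data.strip()

    def handle_endtag(self, tag):
        self.current = None


extractor = FieldExtractor(FIELDS)
extractor.feed(HTML)
print(extractor.record)
```

"Repeating the process" for another field then amounts to inspecting the page once more and adding one more entry to `FIELDS`.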
