Performing HTTP requests in Python using urllib
We finally get to scraping data from web pages. As before, we first open an HTTP connection and send a request to the website.
*Remember to close the connection at the end.
# Import packages
from urllib.request import urlopen, Request
# Specify the url
url = "http://www.datacamp.com/teach/documentation"
# This packages the request: request
request = Request(url)
# Sends the request and catches the response: response
response = urlopen(request)
# Print the datatype of response
print(type(response))
# Be polite and close the response!
response.close()
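If we want the page itself and not just the response object, the response can be read like a file. A minimal follow-up sketch reusing the same url (decoding as UTF-8 is an assumption about the page's encoding):
# Send the request and catch the response again: response
response = urlopen(Request(url))
# read() returns the body as bytes; decode it to get the HTML as a string
html = response.read().decode('utf-8')
# Print the first 300 characters of the HTML
print(html[:300])
# Be polite and close the response!
response.close()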
Of course, we can also use the requests package to package the request, send it, and catch the response in a single line:
# Import package
import requests
# Specify the url: url
url = "http://www.datacamp.com/teach/documentation"
# Packages the request, send the request and catch the response: r
r = requests.get(url)
# Extract the response: text
text = r.text
# Print the html
print(text)
Scraping the web in Python
HTML: mix of unstructured and structured data.
Use the BeautifulSoup package to parse, prettify and extract information from HTML.
Turning a webpage into data using BeautifulSoup: getting the hyperlinks
Here we introduce BeautifulSoup (an odd name). There are many operations we can perform on the resulting soup object; one of them is searching the soup for tags and then extracting the URLs those tags contain, as follows:
# Import packages
import requests
from bs4 import BeautifulSoup
# Specify url
url = 'https://www.python.org/~guido/'
# Package the request, send the request and catch the response: r
r = requests.get(url)
# Extracts the response as html: html_doc
html_doc = r.text
# Create a BeautifulSoup object from the HTML, using Python's built-in parser: soup
soup = BeautifulSoup(html_doc, 'html.parser')
# Print the title of Guido's webpage
print(soup.title)
# Find all 'a' tags (which define hyperlinks): a_tags
a_tags = soup.find_all('a')
# Print the URLs to the shell
for link in a_tags:
    print(link.get('href'))
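Beyond hyperlinks, the same soup object supports the other operations mentioned above, such as prettifying the HTML and extracting the plain text. A minimal sketch continuing from the soup created above:
# Prettify the HTML and print the first few hundred characters
pretty_soup = soup.prettify()
print(pretty_soup[:500])
# Get Guido's text (the page contents without the HTML tags)
guido_text = soup.get_text()
print(guido_text)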
Introduction to APIs and JSONs
Next we introduce JSONs (JavaScript Object Notation), a format used for real-time server-to-browser communication that is also human-readable. Using type() we can see that json_data is a dict, with keys and values.
*File extension: .json
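As a minimal sketch, this is how a .json file could be loaded into a dict (the filename 'a_movie.json' is only an assumed example):
# Import package
import json
# Load JSON from the assumed example file into a dict: json_data
with open('a_movie.json', 'r') as json_file:
    json_data = json.load(json_file)
# json_data is a dict: print its type, then its keys and values
print(type(json_data))
for key, value in json_data.items():
    print(key + ':', value)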
For the API part, note the query string '?t=hackers' in url = 'http://www.omdbapi.com/?t=hackers': it means Return data for the movie with title (t) 'Hackers'.
Note: OMDb recently changed their API, so we also have to supply an API key. This means you'll have to add another argument to the URL: apikey=ff21610b
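Putting the query string and the API key together, a sketch of requesting the movie data and decoding the JSON response (using the apikey value given above):
# Import package
import requests
# Assign URL with API key and query string (title 'Hackers') to the variable: url
url = 'http://www.omdbapi.com/?apikey=ff21610b&t=hackers'
# Package the request, send the request and catch the response: r
r = requests.get(url)
# Decode the JSON data into a dictionary: json_data
json_data = r.json()
# Print each key-value pair in json_data
for key, value in json_data.items():
    print(key + ':', value)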
Diving deep into the Twitter API
Next we can use the Twitter API to connect to Twitter and post tweets, stream tweets, and so on. The package used here is tweepy: tweepy is great at handling all the Twitter API OAuth Authentication details; all we need to do is provide the authentication credentials.
# Import package
import tweepy
# Store OAuth authentication credentials in relevant variables
access_token = "1092294848-aHN7DcRP9B4VMTQIhwqOYiB14YkW92fFO8k8EPy"
access_token_secret = "X4dHmhPfaksHcQ7SCbmZa2oYBBVSD2g8uIHXsp5CTaksx"
consumer_key = "nZ6EA0FxZ293SxGNg8g8aP0HM"
consumer_secret = "fJGEodwe3KiKUnsYJC3VRndj7jevVvXbK2D5EiJ2nehafRgA6i"
# Pass OAuth details to tweepy's OAuth handler
auth = tweepy.OAuthHandler(consumer_key, consumer_secret)
auth.set_access_token(access_token, access_token_secret)
Now we use the predefined class MyStreamListener to "listen" to the stream, and we can filter by keywords to capture only the tweets we want. The US midterm elections have just ended, and the keywords used here are quite interesting :)
See the code of MyStreamListener below if needed.
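For reference, a sketch of what MyStreamListener could look like with the tweepy 3.x API used here (writing to tweets.txt and stopping after 100 tweets are assumptions, not details from the text):
# Import packages
import json
import tweepy

class MyStreamListener(tweepy.StreamListener):
    """Listener that writes incoming tweets to a file and stops after a fixed count."""
    def __init__(self, api=None):
        super().__init__(api)
        self.num_tweets = 0
        self.file = open('tweets.txt', 'w')

    def on_status(self, status):
        # Save the raw JSON of each tweet, one tweet per line
        tweet = status._json
        self.file.write(json.dumps(tweet) + '\n')
        self.num_tweets += 1
        if self.num_tweets < 100:
            return True
        self.file.close()
        return False

    def on_error(self, status):
        print(status)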
# Initialize Stream listener
l = MyStreamListener()
# Create your Stream object with authentication
stream = tweepy.Stream(auth, l)
# Filter Twitter Streams to capture data by the keywords:
stream.filter(track = ['clinton', 'trump', 'sanders', 'cruz'])
Skipping the intermediate step of storing the Twitter data, we can convert the Twitter data into a familiar pandas DataFrame, extracting two pieces of information from each tweet as separate columns: text and lang.
# Import package
import pandas as pd
# Build DataFrame of tweet texts and languages
df = pd.DataFrame(tweets_data, columns=['text','lang'])
# Print head of DataFrame
print(df.head())
Once everything is in place, we can do some light analysis of our data with charts and other visualizations, using the skills learned in earlier chapters. It's worth noting that, for better or worse, Trump had the highest mention rate.
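As a minimal sketch of how such a chart could be produced, continuing from the df built above (the word_in_text helper and the use of seaborn/matplotlib are assumptions):
# Import packages
import re
import matplotlib.pyplot as plt
import seaborn as sns

# Helper: check whether a word appears in a tweet's text (case-insensitive)
def word_in_text(word, text):
    return re.search(word.lower(), text.lower()) is not None

# Count mentions of each candidate across all tweets
candidates = ['clinton', 'trump', 'sanders', 'cruz']
counts = [df['text'].apply(lambda t: word_in_text(c, t)).sum() for c in candidates]

# Plot the counts as a bar chart
sns.barplot(x=candidates, y=counts, color='blue')
plt.ylabel('Tweet count')
plt.show()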