A First Look at Python Web Crawlers

Latest recommended article published on 2024-05-30 10:42:31

License: CC 4.0 BY-SA

Article tags:

#python #crawler #Baidu News

Everything below was implemented in Jupyter Notebook.

For installing Python and the crawler tools, see this blog post: https://blog.youkuaiyun.com/sinat_40471574/article/details/91354263

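I. Introduction to crawlers:

1. Unstructured data (data with no fixed format): material such as web page content must first be converted into structured data with an ETL (Extract, Transform, Load) tool before it can be used.

Raw Data --> ETL Script --> Tidy Data (structured data)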
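To make the Raw Data --> ETL Script --> Tidy Data flow concrete, here is a minimal sketch. It uses only the standard library's html.parser rather than the packages introduced later, so it runs without installing anything. The HTML fragment, the LinkExtractor class, and the etl function are all made up for illustration.

```python
from html.parser import HTMLParser

# Hypothetical raw input: a fragment of unstructured HTML (made up for this example).
RAW_HTML = """
<ul>
  <li><a href="https://example.com/a">First   headline</a></li>
  <li><a href="https://example.com/b">Second headline</a></li>
</ul>
"""


class LinkExtractor(HTMLParser):
    """Extract step: collect the href and text of every <a> tag that has an href."""

    def __init__(self):
        super().__init__()
        self.rows = []      # structured output, one dict per link
        self._href = None   # href of the <a> tag currently open, if any
        self._text = []     # text pieces seen inside that tag

    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            self._href = dict(attrs).get('href')
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == 'a' and self._href is not None:
            # Transform step: collapse runs of whitespace in the link text
            title = ' '.join(''.join(self._text).split())
            self.rows.append({'title': title, 'link': self._href})
            self._href = None


def etl(raw_html):
    parser = LinkExtractor()
    parser.feed(raw_html)
    parser.close()
    # Load step: this sketch simply returns the rows as a list of dicts
    return parser.rows


tidy = etl(RAW_HTML)
for row in tidy:
    print(row['title'], row['link'])
```

Each dict in tidy has the same two keys, which is what makes the result "structured": it could be written to a CSV file or a database table without further parsing.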

3. Crawler tools:

Chrome DevTools: Inspect --> Network --> (JS, CSS, Img, Doc) (then refresh the page)

Install the packages with pip: pip install requests

pip install BeautifulSoup4

infoLite
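The two pip packages are typically used together: requests downloads a page and BeautifulSoup parses it. The sketch below shows only the parsing half, on an inline HTML string, so that it runs offline. In a live crawler the string would come from requests.get(url).text. The markup, URLs, and variable names are invented for this example and are not taken from any real site.

```python
from bs4 import BeautifulSoup

# Made-up HTML standing in for the text of a downloaded page
html = """
<div class="result"><a href="https://example.com/1"> News one </a><p>Site A</p></div>
<div class="result"><a href="https://example.com/2"> News two </a><p>Site B</p></div>
"""

soup = BeautifulSoup(html, 'html.parser')

items = []
# select() takes a CSS selector; '.result' matches every element with class="result"
for block in soup.select('.result'):
    anchor = block.select('a')[0]            # first link inside the block
    source = block.select('p')[0].text       # text of the first paragraph
    items.append((anchor.text.strip(), anchor['href'], source))

for title, link, source in items:
    print(title, link, source)
```

The right selector for a real page is usually found with the Chrome DevTools workflow listed above: inspect the element, note its class or id, and pass that to select().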

II. Basic features:

III. Crawler example: get the user's current location, then fetch the past year's news about nearby schools, hospitals, and shopping malls.

import datetime
import json

import requests
from bs4 import BeautifulSoup

#Current date as a month count (year*12 + month), used to measure article age in months
now = datetime.datetime.now()
dateNow = now.year*12 + now.month


#Locate the user from their current IP address 1- -
def getIpAddress():
    res = requests.get('https://apis.map.qq.com/ws/location/v1/ip?key=E3YBZ-XBBKU-XPSVV-BXTQF-X26AS-7LFDD')
    res.encoding = 'utf-8'
    ipJson = json.loads(res.text)
    ipGet = ipJson['result']['ip']
    latGet = ipJson['result']['location']['lat']
    lngGet = ipJson['result']['location']['lng']
    adInfo = ipJson['result']['ad_info']
    locationGet = adInfo['nation'] + adInfo['province'] + adInfo['city']
    print('Location:',locationGet)
    print('Latitude:',latGet,'Longitude:',lngGet,'\n')
    #Return the most specific region name available: city, then province, then country
    if len(adInfo['city']) > 0:
        locationGet = adInfo['city']
    elif len(adInfo['province']) > 0:
        locationGet = adInfo['province']
    else:
        locationGet = '中国'  #"China"; kept in Chinese because getOrg passes it to the map API as a region name
    return locationGet
#Locate the user from their current IP address - -1


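#Get each news item's title, link, publisher, and time 2- -
def getNews(searchContent,page):

    #P1 Fetch the page source (Baidu's pn parameter is a result offset, 10 results per page)
    url = 'http://www.baidu.com/s?tn=news&rtt=4&bsst=1&cl=2&wd=' + searchContent + '&pn=' + str(page*10)
    res = requests.get(url)
    res.encoding = 'utf-8'
    soup = BeautifulSoup(res.text,'html.parser')

    #P2 Get the news list
    time = ''
    for news in soup.select('.result'):
        title = news.select('a')[0].text.strip()
        link = news.select('a')[0]['href']
        timeMedia = news.select('p')[0].text
        #Remove newlines, tabs, and spaces so the fixed-width slices below line up
        timeMedia = timeMedia.replace('\n','').replace('\t','').replace(' ','').strip()
        time = timeMedia[-16:]   #publication time: the last 16 characters
        media = timeMedia[:-17]  #publisher: everything before the time and its separator
        print(title,'\n',link,'\n')

    #P3 Check whether this is the last page, or whether the articles are older than one year
    switchPages = soup.select('#page')
    if not time or not switchPages or not switchPages[0].select('a'):
        return -1   #no results or no pager on this page: nothing more to load
    lastPage = switchPages[0].select('a')[-1].text.replace('>','')
    dateArticle = int(time[0:4])*12 + int(time[5:7])   #month count of the last article on the page
    dateInterval = dateNow - dateArticle
    #'下一页' ("next page") is the label of Baidu's next-page link
    if lastPage != '下一页' or dateInterval>=12:
        return -1
    else:
        return 0
#Get each news item's title, link, publisher, and time - -2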


#Look up shopping malls, schools, and hospitals near the user's location 3- -
def getOrg(keyword,page):
    address = getIpAddress()
    page = str(page)
    url = 'https://apis.map.qq.com/ws/place/v1/search?boundary=region('+ address + ',0)&keyword='+ keyword + '&page_size=20&page_index='+ page + '&orderby=_distance&key=E3YBZ-XBBKU-XPSVV-BXTQF-X26AS-7LFDD'
    res = requests.get(url)
    orgJson = json.loads(res.text)
    for item in orgJson['data']:
        print(item['title'],'\nLatitude:',item['location']['lat'],'\tLongitude:',item['location']['lng'],'\n')
#Look up shopping malls, schools, and hospitals near the user's location - -3
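The one-year cutoff in step P3 of getNews works by turning both dates into a month count (year*12 + month) and subtracting. The sketch below isolates that arithmetic so it can be tried without any network access. The helper names month_index and months_between are invented for this example, and the sample timestamps only assume the layout that the slices time[0:4] and time[5:7] expect: a 4-digit year, one separator character, then a 2-digit month.

```python
import datetime


def month_index(year, month):
    """Month count using the same year*12 + month formula as dateNow."""
    return year * 12 + month


def months_between(article_time, today):
    """Difference in calendar months between an article timestamp and today.

    article_time must start with a 4-digit year, one separator character,
    and a 2-digit month. The day of the month is ignored.
    """
    article = month_index(int(article_time[0:4]), int(article_time[5:7]))
    return month_index(today.year, today.month) - article


today = datetime.date(2019, 6, 15)   # fixed date so the example is reproducible

print(months_between('2019年05月30日 08:00', today))   # 1
print(months_between('2018年06月01日 12:30', today))   # 12, the value at which getNews stops
```

Because the day of the month is ignored, an article from 2018-06-30 already counts as 12 months old on 2019-06-01. That is coarse, but sufficient for a rough "past year" filter.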





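#1. Get the entity's location information:

#address = getIpAddress()


#2. Get information on nearby hospitals, schools, and shopping malls
#getOrg('医院',1)  #'医院' = "hospital"


#3. Get the past year's news

#isEnd = getNews('哈尔滨购物',0)  #'哈尔滨购物' = "Harbin shopping"
#if isEnd == -1:
#    print('All news loaded')
#else:
#    print('\nPreparing to load the next page...')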
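The commented-out driver above loads a single page and only reports whether more are available. One possible way to keep going until getNews signals the end is sketched below. The function crawl_all_pages and its max_pages safety cap are assumptions of this sketch, not part of the original code, and fake_get_news is a stand-in so the example runs offline.

```python
def crawl_all_pages(fetch_page, query, max_pages=50):
    """Call fetch_page(query, page) for page 0, 1, 2, ... until it returns -1.

    fetch_page is expected to follow the getNews convention: 0 means more
    pages are available, -1 means stop. max_pages caps the loop in case the
    stop signal never arrives. Returns the number of pages fetched.
    """
    for page in range(max_pages):
        if fetch_page(query, page) == -1:
            return page + 1
    return max_pages


# Stand-in for getNews: pretends the search has exactly three pages.
def fake_get_news(query, page):
    print('fetching page', page, 'for', query)
    return -1 if page >= 2 else 0


pages = crawl_all_pages(fake_get_news, 'demo')
print('pages fetched:', pages)
```

With the real function this would be called as crawl_all_pages(getNews, '哈尔滨购物'). Pausing briefly between requests, for example with time.sleep, is a sensible addition when crawling a live site.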