Day 3: Proxies and CSS Parsing

License: CC 4.0 BY-SA

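This article shows how to obtain proxy IPs and use them in a crawler, fetching data with Python's requests library. It also explains how to use the BeautifulSoup library to select tags with CSS selectors and read their content and attributes, and demonstrates reading and writing CSV files. It covers the basics of obtaining proxies, fetching web pages, parsing HTML, and processing data.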

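Getting proxies

Find a site that sells proxies; 蘑菇代理 (mogumiao.com) is recommended.
Request IP addresses from its API, then assign one of the returned addresses to proxies.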

import requests


def get_proxy_ips():
    """Request proxy addresses from the API; return a list of 'ip:port' strings, or None on failure."""
    api = 'http://piping.mogumiao.com/proxy/api/get_ip_bs?appKey=3ee6f035175f4b508d8a825da0fb3833&count=5&expiryDate=0&format=2&newLine=3'
    response = requests.get(api)
    if response.status_code == 200:
        # A body starting with '{' is treated as an error reply (e.g. requesting too often)
        if response.text.startswith('{'):
            print('Failed to get proxies: requested too frequently')
        else:
            # One address per line; skip blank lines and stray whitespace
            ips = [line.strip() for line in response.text.splitlines() if line.strip()]
            print(ips)
            return ips
    else:
        print('Failed to get proxies')
    return None


def get_net_data():
    url = 'https://movie.douban.com/top250'
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/90.0.4430.85 Safari/537.36'
    }
    # Proxy: use the first address returned by the API
    ips = get_proxy_ips()
    if ips:
        proxies = {
            'http': ips[0],  # 'http': 'ip:port'
            'https': ips[0]
        }
        response = requests.get(url, headers=headers, proxies=proxies)
        print(response.text)
    else:
        print('Could not get a proxy!')


if __name__ == '__main__':
    get_net_data()
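
The two steps described above (split the API reply into addresses, then build the proxies mapping) can be tried without any network access. The sketch below is only illustrative: `parse_proxy_lines` and `build_proxies` are hypothetical helper names, the sample addresses are made up, and it assumes the proxy itself speaks plain HTTP, so an explicit `http://` scheme is added to each address.

```python
def parse_proxy_lines(text):
    """Split a plain-text API reply into 'ip:port' strings, skipping blank lines."""
    return [line.strip() for line in text.splitlines() if line.strip()]


def build_proxies(address):
    """Return a requests-style proxies mapping for one 'ip:port' string."""
    # Assumption: the proxy accepts plain HTTP connections, which are also
    # used to tunnel HTTPS traffic, so both keys point at the same URL.
    if "://" not in address:
        address = "http://" + address
    return {"http": address, "https": address}


# Made-up sample reply using documentation-only IP addresses
sample = "192.0.2.10:8080\r\n192.0.2.11:3128\r\n"
addresses = parse_proxy_lines(sample)
print(addresses)
print(build_proxies(addresses[0]))
```

The resulting dictionary can be passed as the `proxies` argument of `requests.get`, in the same way as in the code above.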

Fetch several IP addresses at once and try them in turn while crawling.

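import time

import requests


def get_proxy_ips():
    """Request proxy addresses from the API; return a list of 'ip:port' strings, or None on failure."""
    api = 'http://piping.mogumiao.com/proxy/api/get_ip_bs?appKey=3ee6f035175f4b508d8a825da0fb3833&count=5&expiryDate=0&format=2&newLine=3'
    response = requests.get(api)
    if response.status_code == 200:
        # A body starting with '{' is treated as an error reply (e.g. requesting too often)
        if response.text.startswith('{'):
            print('Failed to get proxies: requested too frequently')
        else:
            ips = [line.strip() for line in response.text.splitlines() if line.strip()]
            print(ips)
            return ips
    else:
        print('Failed to get proxies')
    return None


def get_net_data():
    url = 'https://movie.douban.com/top250'
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/90.0.4430.85 Safari/537.36'
    }
    while True:
        # Fetch a batch of 5 proxy IPs; wait one second and retry if none came back
        ips = get_proxy_ips()
        if not ips:
            time.sleep(1)
            continue
        # Try each proxy in turn until one of them returns the page
        for ip in ips:
            proxies = {
                'http': ip,  # 'http': 'ip:port'
                'https': ip
            }
            try:
                response = requests.get(url, headers=headers, proxies=proxies, timeout=2)
            except requests.RequestException:
                # Timeouts and connection errors: move on to the next proxy
                print('Request failed or timed out, trying the next proxy!')
                continue
            if response.status_code == 200:
                print(response.text)
                return
            print('Data request failed!')


if __name__ == '__main__':
    get_net_data()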
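
The loop above retries forever if every proxy keeps failing. As one possible variation, the sketch below caps the number of attempts and keeps the rotation logic separate from the HTTP call so it can be run offline. `fetch_via_proxies` and `fake_fetch` are hypothetical names introduced for illustration, and the addresses are made up. It relies on the fact that `requests.RequestException` derives from `OSError`, so catching `OSError` also covers timeouts and connection errors raised by requests.

```python
def fetch_via_proxies(addresses, fetch, max_attempts=10):
    """Call fetch(address) for each proxy until one succeeds.

    Returns (address, result) on success, or None once the addresses or
    the attempt budget run out.
    """
    for attempt, address in enumerate(addresses, start=1):
        if attempt > max_attempts:
            break
        try:
            return address, fetch(address)
        except OSError as exc:
            # Network-style failures: report and move on to the next proxy
            print(f"{address} failed ({exc}); trying the next proxy")
    return None


# Stand-in for a real request so the sketch runs offline (hypothetical behaviour):
# the first proxy always times out, every other proxy succeeds.
def fake_fetch(address):
    if address.startswith("192.0.2.1:"):
        raise TimeoutError("timed out")
    return f"page fetched through {address}"


outcome = fetch_via_proxies(["192.0.2.1:8080", "192.0.2.2:8080"], fake_fetch)
print(outcome)
```

In a real crawler, `fetch` would wrap a `requests.get(url, headers=headers, proxies=..., timeout=2)` call and could call `response.raise_for_status()` so that non-200 replies are also treated as failures.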

Using bs4 (import the BeautifulSoup class from the bs4 package)

Creating the parser object

BeautifulSoup(HTML string to parse, parser name)

bs = BeautifulSoup(data, 'lxml')
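
The 'lxml' parser is a separate package and may not be installed. A minimal sketch, assuming it is acceptable to fall back to Python's built-in 'html.parser' in that case (`make_soup` is a hypothetical helper name):

```python
from bs4 import BeautifulSoup, FeatureNotFound


def make_soup(html):
    """Parse html with lxml when available, otherwise with the built-in parser."""
    try:
        return BeautifulSoup(html, "lxml")
    except FeatureNotFound:
        # Raised by bs4 when the requested parser is not installed
        return BeautifulSoup(html, "html.parser")


soup = make_soup("<p>hello</p>")
print(soup.p.string)
```

The two parsers can build slightly different trees for malformed HTML, so it is usually safer to pick one parser and keep it for the whole project.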

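Getting tags with CSS selectors
- select(css selector) - returns a list of all tags matched by the selector
- select_one(css selector) - returns the first tag matched by the selector (None if nothing matches)
Getting tag content
- tag.string - the tag's text content (returns None when the tag holds more than one child node, such as text mixed with sub-tags)
- tag.get_text() - the tag's text content (text inside sub-tags is included)
- tag.contents - the tag's text and sub-tags, returned as a list
Getting tag attributes
- tag.attrs[attribute name]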

img = bs.select_one('img')
print(img.attrs['src'])
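
The differences between these accessors are easiest to see on a small document. The sketch below uses a made-up HTML snippet and the built-in 'html.parser' so that it runs without lxml; the tag names and classes are illustrative only.

```python
from bs4 import BeautifulSoup

html = """
<div id="movie">
  <p class="title">Title <b>One</b></p>
  <p class="year">1994</p>
  <img src="poster.jpg" alt="poster">
</div>
"""
soup = BeautifulSoup(html, "html.parser")

paragraphs = soup.select("#movie p")   # list of every matching tag
title = soup.select_one("p.title")     # first match only
year = soup.select_one("p.year")
img = soup.select_one("img")

print(len(paragraphs))            # 2
print(year.string)                # 1994
print(title.string)               # None, because text and a <b> tag are mixed
print(title.get_text())           # Title One
print(title.contents)             # the text node followed by the <b> tag
print(img.attrs["src"])           # poster.jpg
print(img.get("title"))           # None; .get avoids a KeyError for a missing attribute
print(soup.select_one("video"))   # None when nothing matches
```

Because `select_one` returns None when nothing matches, checking the result before reading `.attrs` or `.string` avoids an AttributeError on pages whose layout differs from what was expected.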

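Getting sub-tags inside a given tag
- tag.select(css selector) - returns all tags inside the given tag that match the selector
- tag.select_one(css selector) - returns the first tag inside the given tag that matches the selector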

div = bs.select_one('div')
p1 = div.select_one('p')
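
Scoped selection is what makes it practical to turn a repeated block of HTML into rows of data: select each container first, then look up fields inside it. A minimal sketch with made-up markup (the ids and class names are assumptions, not taken from any real page):

```python
from bs4 import BeautifulSoup

html = """
<ul id="list">
  <li class="item"><span class="name">A</span><span class="score">9.7</span></li>
  <li class="item"><span class="name">B</span><span class="score">9.6</span></li>
</ul>
<span class="name">outside</span>
"""
soup = BeautifulSoup(html, "html.parser")

rows = []
for item in soup.select("#list > li.item"):
    # These lookups only search inside the current <li>,
    # so the <span class="name"> outside the list is never picked up.
    name = item.select_one(".name").get_text()
    score = float(item.select_one(".score").get_text())
    rows.append((name, score))

print(rows)
```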

Working with CSV files

Writing data to a CSV file

Writing rows as lists

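import csv

# The ./file directory must already exist; newline='' lets the csv module control line endings
with open('./file/test.csv', 'w', newline='', encoding='utf-8') as f:
    writer = csv.writer(f)
    # Header row: name, gender, age, score
    writer.writerow(['姓名', '性别', '年龄', '分数'])
    # Write several rows at once
    writer.writerows([
        ['张三', '男', 28, 98],
        ['小明', '男', 19, 72],
        ['小花', '女', 20, 99]
    ])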
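
One reason to use csv.writer instead of joining values with commas by hand is that it quotes fields containing commas or quotation marks. The sketch below writes to an in-memory buffer (an assumption made only so the example leaves no file behind) and reads the text back to show that the field survives the round trip; the sample values are made up.

```python
import csv
import io

buffer = io.StringIO(newline="")
writer = csv.writer(buffer)
writer.writerow(["title", "quote"])
# This field contains both a comma and double quotes
writer.writerow(["Movie A", 'He said "hi", then left'])

text = buffer.getvalue()
print(text)

# Reading the text back recovers the original field unchanged
rows = list(csv.reader(io.StringIO(text, newline="")))
print(rows[1])
```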

Writing rows as dictionaries

with open('./file/dictest.csv', 'w', newline='', encoding='utf-8') as f:
    # The second argument lists the field names, which also fixes the column order
    writer = csv.DictWriter(f, ['name', 'age', 'gender', 'score'])
    # First row: custom header labels (name, age, gender, score)
    writer.writerow({'name': '姓名', 'age': '年龄', 'gender': '性别', 'score': '分数'})
    # writer.writeheader()  # alternative: use the field names themselves as the first row
    # Write one row
    writer.writerow({'name': '张三', 'age': 28, 'gender': '男', 'score': 98})
    # Write several rows
    writer.writerows([
        {'name': '张三', 'age': 28, 'gender': '男', 'score': 98},
        {'name': '小明', 'age': 19, 'gender': '男', 'score': 72},
        {'name': '小花', 'age': 20, 'gender': '女', 'score': 99}
    ])
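
Scraped records often have extra or missing keys. A short sketch of how csv.DictWriter can be configured for that, assuming it is acceptable to drop unknown keys silently and to leave missing values empty (the field names and sample rows are made up):

```python
import csv
import io

fields = ["name", "age"]
buffer = io.StringIO(newline="")
# extrasaction="ignore" drops keys that are not in fields;
# the default, "raise", would raise ValueError instead.
writer = csv.DictWriter(buffer, fieldnames=fields, extrasaction="ignore")
writer.writeheader()                                    # header row built from fields
writer.writerow({"name": "Ann", "age": 28, "note": "dropped"})
writer.writerow({"name": "Bob"})                        # missing age is written as an empty field

lines = buffer.getvalue().splitlines()
print(lines)
```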

Reading a CSV file

Each row as a list

with open('./file/dictest.csv', 'rt', newline='', encoding='utf-8') as f:
    reader = csv.reader(f)
    print(next(reader))  # the first row (the header) as a list of strings
    print(list(reader))  # all remaining rows, each as a list of strings

Each row as a dictionary (the first row supplies the keys)

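with open('./file/dictest.csv', 'rt', newline='', encoding='utf-8') as f:
    reader = csv.DictReader(f)  # the first row is consumed as the keys
    # Each next() call returns one data row as a dict; the file written above has four data rows
    print(next(reader))
    print(next(reader))
    print(next(reader))
    print(next(reader))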
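
Calling next() a fixed number of times only works when the row count is known in advance, and one call too many raises StopIteration. A for loop handles any number of rows. The sketch below also converts the numeric columns, since the csv module returns every value as a string. It reads from an in-memory string with made-up data so that it runs without the files above.

```python
import csv
import io

data = "name,age,score\r\nAnn,28,98\r\nBob,19,72\r\n"
reader = csv.DictReader(io.StringIO(data, newline=""))
print(reader.fieldnames)  # the keys taken from the first row

people = []
for row in reader:
    # Every value arrives as a string, so numeric columns are converted explicitly
    people.append({
        "name": row["name"],
        "age": int(row["age"]),
        "score": int(row["score"]),
    })

average = sum(person["score"] for person in people) / len(people)
print(people)
print(average)
```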