Python web scraping: UnicodeEncodeError: 'gbk' codec can't encode character '\xa0' in posi

Latest recommended article published on 2024-01-31 09:00:00

布吃


Licensed under CC 4.0 BY-SA

Category: python. Tags: web crawler, format

Article link: https://blog.youkuaiyun.com/younger_to_older/article/details/98783495


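This article shows how to fetch web page source with Python and how to resolve the encoding problems that come up along the way. It covers two cases: fetching the source directly from a web page, and reading it from a local file.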



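(1) Encoding problems when scraping page source

import requests


def get_html(url):
    try:
        response = requests.get(url)
        if response.status_code == 200:
            # response.text is a str that requests has already decoded
            return response.text
        ......
        return get_html(url)


def get_index(keyword, page):
    ......
    html = get_html(url)
    # decode() exists only on bytes, so get_html must return bytes here
    print(html.decode('utf_8'))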

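After obtaining the response, change the return statement in get_html to:

return response.text.encode('utf-8')

This re-encodes the page text from its default encoding into UTF-8 bytes, which get_index then decodes with html.decode('utf_8').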
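
To see where the error in the title comes from, the following minimal sketch reproduces it without any network access. It assumes the output encoding is GBK (as on a Chinese-locale Windows console). The sample string and the helper name `to_console_safe` are made up for illustration and do not come from the original post.

```python
# Sample text containing a non-breaking space (U+00A0), common in HTML as &nbsp;
sample = "price:\xa0100"

# UTF-8 can represent every Unicode character, so this always succeeds.
utf8_bytes = sample.encode("utf-8")

# GBK has no mapping for U+00A0; this is the same failure print() hits
# when sys.stdout uses GBK.
try:
    sample.encode("gbk")
    gbk_failed = False
except UnicodeEncodeError:
    gbk_failed = True


def to_console_safe(text, encoding="gbk"):
    """Hypothetical helper: replace characters the target encoding cannot represent."""
    return text.encode(encoding, errors="replace").decode(encoding)


safe = to_console_safe(sample)
```

Note that encoding to UTF-8 and decoding again yields the same str, so printing it to a GBK console can still fail. On Python 3.7 and later, calling `sys.stdout.reconfigure(encoding='utf-8')` before printing is another option when the terminal can display UTF-8.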

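(2) If the page source has been saved to a file, specify the encoding when opening the file:

def get_html():
    # encoding='utf-8' overrides the platform default (GBK on Chinese-locale Windows)
    with open('principle_test.txt', "r", encoding='utf-8') as f:
        html = f.read()
    # no f.close() needed: the with block closes the file
    return html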


def get_index():
    ......
    html = get_html()
    # f.read() in text mode already returns str, so no decode() is needed
    print(html)
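
The file case can be checked end to end with a temporary file. This is a sketch under the assumption that the saved page is UTF-8. The file name `sample_page.txt`, the sample contents, and the function name `read_html` are placeholders, not the author's `principle_test.txt` or code.

```python
import os
import tempfile

# Sample page containing a non-breaking space (U+00A0)
page = "<p>hello\xa0world</p>"

# Write the sample page as UTF-8 into a temporary directory.
tmp_dir = tempfile.mkdtemp()
path = os.path.join(tmp_dir, "sample_page.txt")
with open(path, "w", encoding="utf-8") as f:
    f.write(page)


def read_html(file_path):
    # Passing encoding explicitly avoids relying on the platform default (e.g. GBK).
    with open(file_path, "r", encoding="utf-8") as f:
        return f.read()


html = read_html(path)
```

Because the file is opened in text mode, `read_html` returns a str and the caller does not need to call `decode()` on it.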