Error: can't open "" when converting with HTML Tidy; using tidylib to handle malformed web pages

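This article shows how to scrape air-conditioner product listings from hc360.com (慧聪网) with a Python crawler, collecting details such as name and price, and how to parse the page content.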
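The search URL in the script carries the keyword as `w=%BF%D5%B5%F7`. That value appears to be the GBK percent-encoding of "空调" (air conditioner), which matches the `hclastsearchkeyword=%u7A7A%u8C03` entry in the cookie string. The sketch below, using only the standard library, shows how such a query value could be produced and decoded. That the site expects GBK for other keywords as well is an assumption, not something the article states.

```python
from urllib.parse import quote, unquote

keyword = "空调"  # "air conditioner", the keyword searched in the script

# Percent-encode the keyword as GBK bytes instead of the default UTF-8.
encoded = quote(keyword, encoding="gbk")
search_url = "https://s.hc360.com/?w=" + encoded

# Decoding the value from the article's URL recovers the keyword.
decoded = unquote("%BF%D5%B5%F7", encoding="gbk")

print(search_url)
print(decoded)
```

Encoding with UTF-8 would yield a different byte sequence, so a search for another keyword would presumably need the same `encoding="gbk"` argument.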


from tidylib import tidy_document
import requests
from lxml import etree

# w=%BF%D5%B5%F7 is the percent-encoded search keyword
url = "https://s.hc360.com/?w=%BF%D5%B5%F7"
s = requests.Session()
# proxies = {
#     "https": "https://192.168.43.1:1800"
# }

headers = {
    "Host": "s.hc360.com",
    "user-agent": "Mozilla/5.0 (X11; Linux x86_64) "
                  "AppleWebKit/537.36 (KHTML, like Gecko) "
                  "Chrome/67.0.3396.87 Safari/537.36",
    # Sent under the key "cookies" as in the original script;
    # the standard HTTP request header name is "Cookie".
    "cookies": "hcsearchurlport=1; visitid_time=2018-8-27%2011%3A0%3A1; hc360visitid=C81E205C97B000016A91177717E91DA7; hc360first_time=2018-08-27; hcbrowserid=C81E205CA0100001BD5AC09D4D99F260; hckIndex=C81E205CA0A00001613715F0773DE730; topmatchkey=; Hm_lvt_1437b8f613f9fcba581e33d8d178e1f5=1535338807; hccordet=00; hcpreurl=; hclastsearchkeyword=%u7A7A%u8C03; Hm_lvt_e1e386be074a459371b2832363c0d7e7=1535338809; hc360sessionid=C81E205EE6000001B74C508019802980; hc360sessionid=C81E205EE6000001B74C508019802980; hc360firstvisittime=1535338811032; hc360firstvisittime=1535338811032; _ga=GA1.2.1558106052.1535338807; _gid=GA1.2.1254508720.1535338856; hc360analyid=C81E24E7EC600001801D413B1900ED60; hc360analycopyid=C81E24E7EC6000016E7E7D601F0098B0; Hm_lpvt_1437b8f613f9fcba581e33d8d178e1f5=1535345882; "
               "Hm_lpvt_e1e386be074a459371b2832363c0d7e7=1535345884; hc5minbeat=1535346223259",
}

# Fetch the search results page as text
response = s.get(url, headers=headers).text

# Repair the malformed HTML in memory with tidylib before parsing
response, errors = tidy_document(response)

res = etree.HTML(response)

# Each product is an <li> whose class contains "grid-list"
items = res.xpath('//div/ul/li[contains(@class,"grid-list")]')
print(response)
print(len(items))

for item in items:
    # Every xpath() call returns a list, so each field holds a list of matches
    d = dict()
    d['name'] = item.xpath('.//div[2]/dl/dd/p/a/text()')
    d['title'] = item.xpath('.//div[@class="NewItem"]/div/a/@title')
    d['price'] = item.xpath('.//div[2]/dl/dt/span/text()')
    d['bcid'] = item.xpath('.//div[@class="NewItem"]/@data-bcid')
    d['username'] = item.xpath('.//div[@class="NewItem"]/@data-username')
    d['tel'] = item.xpath('./@data-telphone')
    d['businid'] = item.xpath('./@data-businid')
    d['sellerproviderid'] = item.xpath('./@data-sellerproviderid')
    d['supcatid'] = item.xpath('./@data-supcatid')
    d['main_product'] = item.xpath('.//div[@class="NewItem"]/@data-obj')
    print(d)
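Each `item.xpath(...)` call in the loop returns a list, so every value in `d` is a list that may be empty or hold a single string. If scalar fields are more convenient downstream, a small helper along these lines could flatten a record. The names `first_text` and `flatten_record` and the sample data are hypothetical and not part of the original script.

```python
def first_text(values, default=None):
    """Return the first non-empty, stripped string from an XPath result list."""
    for value in values:
        text = str(value).strip()
        if text:
            return text
    return default


def flatten_record(record):
    """Collapse every list-valued field of a scraped record to a single value."""
    return {key: first_text(values) for key, values in record.items()}


# Hypothetical record in the same shape the loop produces (lists of strings).
sample = {"name": ["  Example Co.  "], "price": ["¥1999.00"], "tel": []}
flat = flatten_record(sample)
print(flat)
```

Fields with no match come back as `None` instead of an empty list, which makes missing data easier to spot when the records are written out.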
