[Case study: CNKI] Scraping all results for a given set of search criteria

Full source code: CSDN resources

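Client requirements

1. The data is for a research paper, so every result returned under the specified search criteria has to be scraped.
2. The deliverable is an Excel spreadsheet of bibliographic records with these columns: title (篇名), authors (作者), journal (刊名), publication date (发表时间), citation count (被引), and download count (下载).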


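Problems

1. A CAPTCHA appears after roughly 10 pages have been scraped.
2. The result rows sit at different positions in the HTML depending on whether the session is logged in.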

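Solution

1. Simulate the POST request to fetch the data straight from the search endpoint.
2. Use selenium to trigger the refresh button, which gets past the CAPTCHA.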
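The two steps above can be combined into a retry loop. The following is only a minimal sketch, not the author's implementation: `fetch_with_refresh`, `fetch`, `looks_like_captcha`, and `refresh` are hypothetical names. In practice `fetch` would wrap the POST request, `refresh` would wrap the selenium refresh step, and how a CAPTCHA page is recognised (here, a simple check supplied by the caller) is an assumption.

```python
import time
from typing import Callable


def fetch_with_refresh(
    fetch: Callable[[], str],
    looks_like_captcha: Callable[[str], bool],
    refresh: Callable[[], None],
    max_attempts: int = 3,
    pause: float = 0.0,
) -> str:
    """Call fetch(); if the body looks like a CAPTCHA page, refresh and retry.

    All three callables are hypothetical hooks supplied by the caller.
    """
    for _ in range(max_attempts):
        body = fetch()
        if not looks_like_captcha(body):
            return body
        refresh()          # e.g. the selenium refresh step (assumption)
        time.sleep(pause)  # optional pause before the next attempt
    raise RuntimeError(f"still blocked after {max_attempts} attempts")


# Usage with stand-in functions instead of real network calls.
_responses = iter(["captcha page", "<table>results</table>"])
body = fetch_with_refresh(
    fetch=lambda: next(_responses),
    looks_like_captcha=lambda text: "<table" not in text,
    refresh=lambda: None,
)
```

Passing the hooks in as arguments keeps the retry logic independent of requests and selenium, so it can be tried out without touching the site.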

Page-parsing source code

    import pandas as pd
    import requests
    from lxml import etree

    # Form data captured from the browser's search request. It must be rebuilt for every
    # page because CurPage depends on num (the current page number).
    # The SearchSql literal was split across two lines in the original listing and is joined here.
    data = {'IsSearch': 'false', 'QueryJson': '{"Platform":"","DBCode":"CFLQ","KuaKuCode":"","QNode":{"QGroup":[{"Key":"Subject","Title":"","Logic":4,"Items":[],"ChildItems":[{"Key":"input[data-tipid=gradetxt-1]","Title":"主题","Logic":0,"Items":[{"Key":"","Title":"深度学习","Logic":1,"Name":"SU","Operate":"%=","Value":"深度学习","ExtendType":1,"ExtendValue":"中英文对照","Value2":""}],"ChildItems":[]}]},{"Key":"ControlGroup","Title":"","Logic":1,"Items":[],"ChildItems":[{"Key":".tit-startend-yearbox","Title":"","Logic":1,"Items":[{"Key":".tit-startend-yearbox","Title":"出版年度","Logic":1,"Name":"YE","Operate":"","Value":"2000","ExtendType":2,"ExtendValue":"","Value2":"2020","BlurType":""}],"ChildItems":[]},{"Key":".extend-tit-checklist","Title":"","Logic":1,"Items":[{"Key":0,"Title":"CSCD","Logic":2,"Name":"CSD","Operate":"=","Value":"Y","ExtendType":14,"ExtendValue":"","Value2":"","BlurType":""}],"ChildItems":[]}]}]}}', 'SearchSql': '2827E4B6502D8710F4C63FA68A0E7A152D8972E3EF5541A46DDE8B3A62A549C1B093F0A10875FEBDE5B17F4F918A6F10CAA2EA622595552DEE59C9627930D8AAE3535B937469326E18AA84865BAD3DC6C1A7DA64988D2A81F1A429AD691BEC3C662D26E88914ED474507E583978013FB39E843499D82596B2618CD689FA015E1201CDC81DB346AAE41EC83FB19ABED536E48A1FE92964DABD22B02545450128F5FB427A463A372F788658E5867448A03EE1DCAA4FC961F876DD0AFD069A75A73EBFF54828C099FA915A530221B8DFCFA79F23DF449882ECC6212DF7BC60BE0614D31D7AB141699CFCCC4A03576FED38D858B28511956AEA35353EDCB665F5271D9D935D6E2468296DF5963EDDF6BAC5FABC0661D7469B83BA323A18B86B275DFF84E92A932F09456D32F6AF176EC9D9F0C350500201E46CEC205EB4D51FAFBC092B275896F59931EBAE053C2E8B1CE77BF4DCA8E9F6CBF81456E2AD2F3DB0E212F697F5264A17D7E9A9613EC93B411CDB13B390ACC56650E9492DF610561045B273B10A6A80E83D1D88FFE7E5CB6D2C2B77133B8D304412723AC0F1F26A92FB370EAED837121783CAD3D7EF825151B8D5C12816097DBB2C19346D397B0939E45E1728D11FB067C3CBA2A4748BAF8C435220CFBEA669916F87FEB510BC5187CBD839FDD6B3403C6004D18F7D65BFF580ED4F445153D890BBDFCB9F8CE455CDE0F9101C2CD8AA615FA83B88C413057454832A96F35AF8F05F750ADB59D5A5D7C5FA3A4B5E218921D843B47DFA6163495E51EC686D6D1F16493768169A1826D919D98EB0231B64C2609114348FCA76B4BCBB6411A9FC5E28939D3EDB77BF67509171407B061137932142C62CA99BB78C3B1056A5607154FDCEBE9D2A4746525F18A5D5E9E84E0AF241838F3761F35853D56F1F0D381471896F120936CDBB5F379F3014F50EF00613DDAC5CBEEF7C0710F865D7EC50F9FFBD1548373BA65DD1678DEBB6D3AB42F7610745C5E6A223BDF60D9988B5263E816F7D1AC220A6C1B412FA23F21A356D55B7BF8B8C58B276A71C2E1CA547376412D0548F4AB7301CFE6979378A7F8D4A4D3ADE9E01B31BCD4AD35E2B8024FB4276F88680D44751DD6A061DCCC7133A9FE1C7AC6DF0C89E9501C6F775E63721F7414ABC778337ADB31B101D115A177E3E645D1D9E93D72509A7D5C01E10D91EC4E13B3E69BC6AAA6BBE73FC80178E11A9257F36DF6750DABC4B3955218BCC8EA3F86015419CFE54116C2A78C4B1B3E9E2157A4E050BE969B079142C9C03164BEC2F54314E1050E27723B8A903A90EDCD8A02A9C57964D37EC8A55CE8A6D2514EC84097B977AC0B929A0FCA5E', 'PageName': 'DefaultResult', 'HandlerId': '3', 'DBCode': 'CFLQ', 'KuaKuCodes': '', 'CurPage': f'{num}', 'RecordsCntPerPage': '20', 'CurDisplayMode': 'listmode', 'CurrSortField': '%e5%8f%91%e8%a1%a8%e6%97%b6%e9%97%b4%2f(%e5%8f%91%e8%a1%a8%e6%97%b6%e9%97%b4%2c%27TIME%27)', 'CurrSortFieldType': 'desc', 'IsSortSearch': 'false', 'IsSentenceSearch': 'false'}
    # url, headers and num (the current page number) are defined elsewhere in the full source.
    res = requests.post(url=url, headers=headers, data=data)
    html = etree.HTML(res.text)
    # trs = html.xpath('//tbody//tr')  # when logged in
    trs = html.xpath('//table//tr')  # when not logged in
    rows = []
    for tr in trs[1:]:  # when logged in, use: for tr in trs:
        # A title can be split across several text nodes; join them with an empty string.
        name = ''.join(tr.xpath('.//td[@class = "name"]//text()')).strip()
        author = tr.xpath('.//td[@class = "author"]//a//text()')
        # Joining the text nodes yields '' for an empty cell instead of raising IndexError.
        source = ''.join(tr.xpath('.//td[@class = "source"]//a/text()')).strip()
        date = ''.join(tr.xpath('.//td[@class = "date"]//text()')).strip()
        quote = ''.join(tr.xpath('.//td[@class = "quote"]//text()')).strip()
        download = ''.join(tr.xpath('.//td[@class = "download"]//text()')).strip()

        rows.append({
            '篇名': name,                # title
            '作者': ','.join(author),    # author is a list, so convert it to a string
            '刊名': source,              # journal
            '发表时间': date,            # publication date
            '被引': quote,               # citation count
            '下载': download,            # download count
            '页码': num,                 # result page number
        })
    # Build the DataFrame once instead of concatenating one-row frames.
    res_pd = pd.DataFrame(rows)
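The snippet above handles a single result page, identified by `num` in the `CurPage` form field. As a rough sketch of how the paging around it might look (an assumption, since the loop is not shown in this excerpt), the helper below copies the form data for each page and stops at the first empty page. `page_payload`, `crawl_pages`, and `fetch_rows` are hypothetical names; `fetch_rows` stands in for the request-and-parse step.

```python
from typing import Callable, Dict, Iterable, List


def page_payload(base: Dict[str, str], page: int) -> Dict[str, str]:
    """Return a copy of the form data with CurPage set to the given page."""
    payload = dict(base)  # copy, so the captured form data is never mutated
    payload["CurPage"] = str(page)
    return payload


def crawl_pages(
    base: Dict[str, str],
    pages: Iterable[int],
    fetch_rows: Callable[[Dict[str, str]], List[dict]],
) -> List[dict]:
    """Collect rows page by page; stop at the first page with no rows."""
    all_rows: List[dict] = []
    for page in pages:
        rows = fetch_rows(page_payload(base, page))
        if not rows:  # assumption: an empty page means the results are exhausted
            break
        for row in rows:
            all_rows.append({**row, "页码": page})  # record the page number
    return all_rows


# Usage with a stand-in fetcher instead of a real POST request.
_fake_site = {"1": [{"篇名": "paper A"}], "2": [{"篇名": "paper B"}]}
collected = crawl_pages(
    {"CurPage": "1", "RecordsCntPerPage": "20"},
    range(1, 10),
    lambda form: _fake_site.get(form["CurPage"], []),
)
```

The collected list of dicts can be passed to `pd.DataFrame` in one call once all pages are done.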

Results


Similar requests

'''Data collection on request, custom crawlers and scripts. Inquiries welcome, free trial run.'''
