1. Crawling the web page
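First, open the Zhihu hot list page, https://www.zhihu.com/hot, and fetch it with the requests library. Note the following:
Set the User-Agent request header to Mozilla/5.0 so that the program presents itself as a browser. By default requests sends a User-Agent of the form python-requests/<version>, so the server can identify the program as a Python crawler, which interferes with the crawl.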
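The note above can be sketched roughly as follows. This is a minimal sketch, not a definitive implementation: the helper names `build_headers` and `fetch_hot_page`, the optional `cookie` parameter, and the 10-second timeout are all assumptions made for illustration, and the page may require a logged-in cookie before it returns the hot list.

```python
def build_headers(cookie=None):
    """Return request headers that present the script as a browser.

    `cookie` is a hypothetical optional argument: pass your own
    logged-in cookie string if the page requires one.
    """
    headers = {"User-Agent": "Mozilla/5.0"}
    if cookie:
        headers["cookie"] = cookie
    return headers


def fetch_hot_page(url="https://www.zhihu.com/hot", cookie=None, timeout=10):
    """Fetch the page with requests and return its HTML text."""
    import requests  # imported here so build_headers works without the library

    response = requests.get(url, headers=build_headers(cookie), timeout=timeout)
    response.raise_for_status()  # raise on 4xx/5xx instead of parsing an error page
    response.encoding = response.apparent_encoding  # guess the encoding from the body
    return response.text
```

Keeping the header construction separate from the network call makes it easy to check which headers will be sent before any request is made.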
url="https://www.zhihu.com/hot"
headers={'User-Agent':'Mozilla/5.0', 'cookie':'_xsrf=0NKUqgDc8ezRmsJGb1xC5ukDIxHhxeMq; _zap=bfe65d37-53d8-46d3-ac1e-784e06dcf8a9; d_c0="ALCgxU862A6PTvBEmeag_oGAglx-a-SfU-g=|1547808969"; z_c0="2|1:0|10:1547808985|4:z_c0|92:Mi4xNFpjWUJBQUFBQUFBc0tERlR6cllEaVlBQUFCZ0FsVk4yZjR1WFFCcy1xaEJMbWpyNGNUSkJSY1JacnlXYTJQUWhn|ec267d1d4420cb5fdcfbad75dcf91d0216c07ec185736bbb8595d4b82628cf41"; __utmv=51854390.100--|2=registration_date=20170209=1^3=entry_date=20170209=1; tst=r; __gads=ID=ac032a91f2a254f3:T=1553527557:S=ALNI_MahJ6DLYcUbqypiEeyeB2-kDGPIYg; q_c1=1d097810e84e467388c20b8f87e71621|1554087131000|1550025265000; __utmc=51854390; __utma=51854390.820497302.15500