For some reason this JS-rendered page can't be fetched with curl, so I copied the HTML node out of the browser and scraped it with the following code:
'''
//*[@id="catalog-undefined"]/span/a
/body/div[1]/div[5]/div/div/div[1]/div[2]/div[1]/div[1]/nav/div/div[2]/div/div/div/div[1]/div[3]/div[1]/div/div/div[8]/span[3]/span/div/span/span/a
/body/div[1]/div[5]/div/div/div[1]/div[2]/div[1]/div[1]/nav/div/div[2]/div/div/div/div[1]/div[3]/div[1]/div/div/div[11]/span[3]/span/div/span/span/a
/body/div[1]/div[5]/div/div/div[1]/div[2]/div[1]/div[1]/nav/div/div[2]/div/div/div/div[1]/div[3]/div[1]/div/div//span[3]/span/div/span/span/a
'''
from lxml import etree

# Absolute XPath copied from the browser's dev tools, with one step
# generalized (//) so it matches every sidebar entry, not just one.
xpath = '/html/body/div[1]/div[5]/div/div/div[1]/div[2]/div[1]/div[1]/nav/div/div[2]/div/div/div/div[1]/div[3]/div[1]/div/div//span[3]/span/div/span/span/a'
file = 'yashu_body.html'
base = 'https://www.yuque.com'

html = etree.parse(file, etree.HTMLParser())
for e in html.xpath(xpath):
    print(e.attrib['title'] + '\n' + base + e.attrib['href'])
I wonder if there's a better way to do this.
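One possible improvement: instead of the long absolute path copied from dev tools (which breaks as soon as the page layout shifts), match the links by an attribute they all share, such as `title` on the `<a>` nodes inside the nav. A minimal sketch, using a hypothetical stand-in for the copied Yuque sidebar HTML (the real structure may differ):

```python
from lxml import etree

base = 'https://www.yuque.com'

# Hypothetical fragment standing in for the copied sidebar HTML.
sample = '''
<nav>
  <span><a title="Chapter 1" href="/book/ch1">Chapter 1</a></span>
  <span><a title="Chapter 2" href="/book/ch2">Chapter 2</a></span>
</nav>
'''

html = etree.fromstring(sample, etree.HTMLParser())

# Relative XPath: any <a> with a title attribute anywhere under the nav.
# Much more robust than a 20-step absolute path.
links = [(a.attrib['title'], base + a.attrib['href'])
         for a in html.xpath('//nav//a[@title]')]

for title, url in links:
    print(title + '\n' + url)
```

For the original problem of the page not being fetchable with curl, the usual fix is a headless browser (e.g. Selenium or Playwright) that executes the JavaScript before you read the DOM, which would remove the copy-paste step entirely.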