Automatically downloading the papers accepted to ICML with a Python script

weixin_33735077

License: CC 4.0 BY-SA

Tags: python, web crawler

Original link: https://my.oschina.net/zhangwenwen/blog/513415

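I recently needed to download the papers accepted to ICML 2015, but the official site lists so many of them that clicking through one at a time would take forever. So I decided to use a Python script to process the listing page, extract the URL of each paper, and download them.

Looking at ICML's Accepted Papers page, its structure turns out to be quite regular. The fragment containing the information we need looks like this: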

<div class="paper">
    <p class="title">Approval Voting and Incentives in Crowdsourcing</p>
    <p class="details">
        <span class="authors">Nihar Shah,
        
            Dengyong Zhou,
        
            Yuval Peres</span>
    </p>
    <p class="links">[<a href="shaha15.html">abs</a>]
        [<a href="shaha15.pdf">pdf</a>]
        [<a href="shaha15-supp.pdf">supplementary</a>]
    </p>
</div>

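Once we have extracted the title and the link to each paper, the job is essentially done.

There are generally two ways to extract content from HTML:

1. Parse the HTML document
2. Match the content with regular expressions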
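
As a rough illustration of the first approach, the sketch below uses `html.parser.HTMLParser` from the Python 3 standard library to pull the title and the PDF link out of a `div.paper` fragment like the one shown earlier. The names `PaperParser` and `SAMPLE` are made up for this example, and the code assumes the markup looks exactly like that fragment.

```python
from html.parser import HTMLParser

# Shortened copy of the fragment from the Accepted Papers page.
SAMPLE = '''<div class="paper">
    <p class="title">Approval Voting and Incentives in Crowdsourcing</p>
    <p class="links">[<a href="shaha15.html">abs</a>]
        [<a href="shaha15.pdf">pdf</a>]
        [<a href="shaha15-supp.pdf">supplementary</a>]
    </p>
</div>'''


class PaperParser(HTMLParser):
    """Collect (title, pdf_href) pairs (hypothetical helper class)."""

    def __init__(self):
        super().__init__()
        self.papers = []        # finished (title, href) pairs
        self._in_title = False  # True while inside <p class="title">
        self._title = ''
        self._href = None       # href of the <a> that is currently open

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == 'p' and attrs.get('class') == 'title':
            self._in_title = True
            self._title = ''
        elif tag == 'a':
            self._href = attrs.get('href')

    def handle_data(self, data):
        if self._in_title:
            self._title += data
        elif self._href and data.strip() == 'pdf':
            # The link whose text is exactly "pdf" is the main paper.
            self.papers.append((self._title.strip(), self._href))

    def handle_endtag(self, tag):
        if tag == 'p':
            self._in_title = False
        elif tag == 'a':
            self._href = None


parser = PaperParser()
parser.feed(SAMPLE)
print(parser.papers)
```

The parser needs a small amount of state to remember which element it is inside, which is part of why the regular-expression route can feel shorter for a one-off job like this.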

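Parsing the HTML document is generally slower than matching with regular expressions, but for a small job like this speed is not the main requirement; what matters most is getting it to work. So I first tried HtmlPaper (presumably Python's HTMLParser), but it did not seem to be what I wanted and was awkward to use, so I switched to Python's regular expressions. To match the content shown above, I wrote the following regular expression, which captures the paper title and the URL as groups.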

<div.*?class="paper".*?>[\s\S]*?<p.*?class="title".*?>([\s\S]*?)</p>[\s\S]*?<a.*?href="(.*?\.pdf)">pdf</a>[\s\S]*?</div>
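
To see what the two groups capture, here is a minimal Python 3 sketch that applies this expression (with the dot before `pdf` escaped so that it only matches a literal `.`) to a small inline sample. The first block in `SAMPLE` is the fragment from the page; the second block, including its title and the file name `example15.pdf`, is invented purely for illustration.

```python
import re

PAPER_RE = re.compile(
    r'<div.*?class="paper".*?>[\s\S]*?<p.*?class="title".*?>([\s\S]*?)</p>'
    r'[\s\S]*?<a.*?href="(.*?\.pdf)">pdf</a>[\s\S]*?</div>',
    re.IGNORECASE)

SAMPLE = '''<div class="paper">
    <p class="title">Approval Voting and Incentives in Crowdsourcing</p>
    <p class="links">[<a href="shaha15.html">abs</a>]
        [<a href="shaha15.pdf">pdf</a>]
        [<a href="shaha15-supp.pdf">supplementary</a>]
    </p>
</div>
<div class="paper">
    <p class="title">A Made-Up Second Paper</p>
    <p class="links">[<a href="example15.html">abs</a>]
        [<a href="example15.pdf">pdf</a>]
    </p>
</div>'''

# group(1) is the title, group(2) the relative URL of the main PDF.
matches = [(m.group(1), m.group(2)) for m in PAPER_RE.finditer(SAMPLE)]
for title, href in matches:
    print('title:', title)
    print('href:', href)
```

Because `.` does not match a newline by default, `<a.*?href=...` stays on a single line, while `[\s\S]*?` is what lets the pattern skip across line breaks between the title and the link.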

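The overall flow of the Python script is:

1. Fetch the HTML document to process
2. Extract the title and URL of each paper
3. Download the resource at each URL and save it as a PDF named after the title

The complete code is as follows: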

# -*- coding: utf-8 -*-
# Python 2 script (urllib2 and the print statement do not exist in Python 3).
import re
import urllib2

BASE_URL = 'http://jmlr.org/proceedings/papers/v37/'

# Group 1 is the paper title, group 2 the relative URL of the main PDF.
PAPER_RE = re.compile(
    r'<div.*?class="paper".*?>[\s\S]*?<p.*?class="title".*?>([\s\S]*?)</p>'
    r'[\s\S]*?<a.*?href="(.*?\.pdf)">pdf</a>[\s\S]*?</div>',
    re.IGNORECASE)


def getDocument():
    """Fetch the HTML of the proceedings index page."""
    response = urllib2.urlopen(BASE_URL)
    try:
        return response.read()
    finally:
        response.close()


def download(url, file):
    """download the file

    @parameters
    url:the resource of the file
    file:the name to save the file (without the .pdf extension)
    """
    f = urllib2.urlopen(url)
    try:
        with open(file + '.pdf', 'wb') as output:
            output.write(f.read())
    finally:
        f.close()


def process(document):
    for i in PAPER_RE.finditer(document):
        # Titles may contain characters that are not allowed in file names.
        title = re.sub(r'[\\/:*?"<>|]', '-', i.group(1).strip())
        print 'title:', title
        print 'url:', BASE_URL + i.group(2)
        print 'downloading....'
        download(BASE_URL + i.group(2), title)


if __name__ == '__main__':
    process(getDocument())

After running the script above, the downloaded papers appear in the working folder, and opening one of them confirms that it displays correctly.

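PS: The one shortcoming is that some papers come with supplementary material, but I could not get that to work when writing the regular expression and did not dig any further; if anyone knows how, please let me know. Since some papers have a supplement and others do not, my idea is to match it if it exists and capture nothing if it does not. I am not very familiar with regular expressions, so I will stop here and update the post if I find a solution. There is nothing technically difficult here; this is just an everyday note.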
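
One possible way to get the "match it if it exists" behaviour is sketched below in Python 3. It is not a single regular expression but a two-step variant: first cut the page into `div.paper` blocks, then search each block separately for the title, the main PDF, and the optional supplement. The function name `extract_papers`, the four pattern names, and the second sample block are all invented for this example, and the sketch assumes that paper blocks contain no nested `<div>`.

```python
import re

BLOCK_RE = re.compile(r'<div[^>]*class="paper"[^>]*>([\s\S]*?)</div>', re.IGNORECASE)
TITLE_RE = re.compile(r'<p[^>]*class="title"[^>]*>([\s\S]*?)</p>', re.IGNORECASE)
PDF_RE = re.compile(r'<a[^>]*href="([^"]*\.pdf)"[^>]*>pdf</a>', re.IGNORECASE)
SUPP_RE = re.compile(r'<a[^>]*href="([^"]*)"[^>]*>supplementary</a>', re.IGNORECASE)


def extract_papers(document):
    """Return (title, pdf_href, supp_href_or_None) for each paper block."""
    papers = []
    for block in BLOCK_RE.finditer(document):
        body = block.group(1)
        title = TITLE_RE.search(body)
        pdf = PDF_RE.search(body)
        if not (title and pdf):
            continue  # skip blocks that lack a title or a main PDF
        supp = SUPP_RE.search(body)  # None when there is no supplement
        papers.append((title.group(1).strip(),
                       pdf.group(1),
                       supp.group(1) if supp else None))
    return papers


SAMPLE = '''<div class="paper">
    <p class="title">Approval Voting and Incentives in Crowdsourcing</p>
    <p class="links">[<a href="shaha15.html">abs</a>]
        [<a href="shaha15.pdf">pdf</a>]
        [<a href="shaha15-supp.pdf">supplementary</a>]
    </p>
</div>
<div class="paper">
    <p class="title">A Made-Up Second Paper</p>
    <p class="links">[<a href="example15.html">abs</a>]
        [<a href="example15.pdf">pdf</a>]
    </p>
</div>'''

papers = extract_papers(SAMPLE)
for paper in papers:
    print(paper)
```

Searching inside one block at a time keeps a lazy `[\s\S]*?` from running past the end of a paper that has no supplement and picking up the supplement link of a later paper, which is the main risk when an optional group is added to one long pattern.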

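Update:

In practice, downloading directly from Python turned out to be very slow. One workaround is to extract the URLs of all the papers and then batch-download them with Xunlei (Thunder). I saved the links to all the ICML 2015 papers in a file and shared it on Baidu Cloud for anyone who needs it.

Share link: http://pan.baidu.com/s/1hqJFV2s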
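
For anyone who wants to build such a link file themselves, the following Python 3 sketch shows one way it could be done. It is not the code used to produce the shared file: the function names `collect_pdf_urls` and `save_url_list` and the output file name `icml2015-urls.txt` are assumptions made for this example, and the sample string stands in for the real page so that nothing is fetched over the network.

```python
import re
from urllib.parse import urljoin

BASE_URL = 'http://jmlr.org/proceedings/papers/v37/'
PDF_LINK_RE = re.compile(r'<a[^>]*href="([^"]*\.pdf)"[^>]*>pdf</a>', re.IGNORECASE)


def collect_pdf_urls(document, base_url=BASE_URL):
    """Return the absolute URL of every link whose text is "pdf"."""
    return [urljoin(base_url, href) for href in PDF_LINK_RE.findall(document)]


def save_url_list(urls, path):
    """Write one URL per line so a download manager can import the file."""
    with open(path, 'w', encoding='utf-8') as out:
        out.write('\n'.join(urls) + '\n')


# Stand-in for the page content returned by getDocument().
SAMPLE = '[<a href="shaha15.html">abs</a>] [<a href="shaha15.pdf">pdf</a>]'
urls = collect_pdf_urls(SAMPLE)
save_url_list(urls, 'icml2015-urls.txt')
print(urls)
```

Most download managers accept a plain text file with one URL per line, so the resulting file can be imported as a batch task.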

Update:

I recently ran this small program on Python 3. The code that extracts the ICML paper titles and writes them to a file is as follows:

# -*- coding: utf-8 -*-
import re
import urllib.request

URL = 'http://jmlr.org/proceedings/papers/v37/'


def getDocument():
    """Fetch the proceedings index page and return it as text."""
    with urllib.request.urlopen(URL) as response:
        # Assumes the page is UTF-8; undecodable bytes are replaced.
        return response.read().decode('utf-8', errors='replace')


def download(url, file):
    """download the file

    @parameters
    url:the resource of the file
    file:the name to save the file
    """
    with urllib.request.urlopen(url) as f, open(file + '.pdf', 'wb') as output:
        output.write(f.read())


def process(document):
    # Group 1 is the text inside each <p class="title"> element.
    p = re.compile(r'<p.*?class="title".*?>([\s\S]*?)</p>', re.IGNORECASE)
    # Append the titles to ICML-title.txt, one per line.
    with open('ICML-title.txt', 'a', encoding='utf-8') as file:
        for i in p.finditer(document):
            print('title:', i.group(1))
            file.write(i.group(1) + '\n')


if __name__ == '__main__':
    process(getDocument())

Python 3 source for downloading the NIPS 2016 papers:

import os
import re
import urllib.request


def Main():
    host = 'https://papers.nips.cc'
    list_url = 'https://papers.nips.cc/book/advances-in-neural-information-processing-systems-29-2016'
    with urllib.request.urlopen(list_url) as file:
        content = file.read().decode('utf-8')
    # Group 1 is the paper's path under /paper/, group 2 the link text (title).
    p = re.compile(r'<a.*?href="/paper/([\s\S]*?)">([\s\S]*?)</a>', re.IGNORECASE)
    # Strips any nested element, together with its content, from the title.
    name_re = re.compile(r'<[\s\S]*?>[\s\S]*?</[\s\S]*?>')
    save_dir = 'NIPS2016/'
    os.makedirs(save_dir, exist_ok=True)  # make sure the target folder exists
    for i in p.finditer(content):
        url = i.group(1)
        name = i.group(2)
        # Clean the title so that it can be used as a file name.
        name = name.replace(':', '-')
        name = name_re.subn('', name)[0]
        name = name.replace('/', ' or ')
        print("url:" + url)
        print("name:" + name)
        download_url = host + '/paper/' + url + ".pdf"
        print("download_url:" + download_url)
        with urllib.request.urlopen(download_url) as paper_file, \
                open(save_dir + name + '.pdf', 'wb') as output:
            output.write(paper_file.read())


if __name__ == '__main__':
    Main()
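
The title clean-up inside the loop could also be pulled into a small helper. The Python 3 sketch below is one possible variant, not the code above: the name `safe_filename` is made up, it keeps the text inside nested tags instead of dropping it, and it replaces every character that Windows forbids in file names with `-` rather than handling only `:` and `/`.

```python
import re

TAG_RE = re.compile(r'<[^>]*>')            # any HTML tag
UNSAFE_RE = re.compile(r'[\\/:*?"<>|]')    # characters Windows rejects in names


def safe_filename(title):
    """Turn the text of a link into a string usable as a file name."""
    name = TAG_RE.sub('', title)       # drop markup but keep its text
    name = UNSAFE_RE.sub('-', name)    # replace reserved characters
    return re.sub(r'\s+', ' ', name).strip()


print(safe_filename('Learning <i>fast</i>: A/B tests?'))
```

Tags have to be removed before the reserved characters are replaced, otherwise the `/` in a closing tag such as `</i>` would be rewritten first and the tag pattern would no longer match.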
