A Golang crawler demo: scraping Douban review articles (movie, book, music)
Note: for learning and exchange only; the author bears no responsibility for any illegal use!
0x00. Background
In the previous post I introduced and analyzed how to fetch the data to be crawled and how to use the browser's inspector to locate each element in the HTML, and I uploaded the source code. This post is another hands-on exercise: crawling Douban review articles.
The demo is again written in Go, and again the full collection, design, and coding process is laid out in detail. It supports crawling book, movie, and music reviews, sorted by popularity or recency, with user-defined start and end pages, which makes it considerably more flexible.
0x01. Finding the Pattern
Douban reviews are different from Weibo. Weibo puts more emphasis on collecting and protecting personal information, while Douban is itself a platform for sharing reviews; sharing is its core feature. You do not even need to supply login credentials; you can crawl anonymously, and there seems to be no extra limit on volume (the author has not tested the maximum crawl volume).
Click [Books] -> [Book Reviews].
The URL of the most popular book reviews:
https://book.douban.com/review/best/?start=0 (first page)
https://book.douban.com/review/best/?start=20 (second page)
Compare this with the URL of the most recently posted reviews to find the keyword that changes:
https://book.douban.com/review/latest/?start=0 (first page)
https://book.douban.com/review/latest/?start=20 (second page)
// Pseudocode for the main page loop (note the step of 20 in the start parameter)
max := 10
kind := "best" // or "latest"
base := "https://book.douban.com/review/" + kind + "/?start="
var res []string
for i := 0; i < max; i++ {
	url := base + strconv.Itoa(i*20)
	r := RequestUrl(url) // placeholder for the real HTTP fetch shown later
	res = append(res, r)
}
0x02. Locating the HTML Elements
Then, from the first page of book reviews, grab each individual entry.
This gives the outer div of each entry, which can then be broken down step by step, peeled layer by layer like the TCP/IP stack. The complete div of one entry looks like this:
<div class="main review-item" id="13642776">
<a class="subject-img" href="https://book.douban.com/subject/35397746/"> <img alt="虚无时代" title="虚无时代" src="https://img1.doubanio.com/view/subject/m/public/s33901818.jpg" rel="v:image"> </a>
<header class="main-hd">
<a href="https://www.douban.com/people/weizhoushiwang/" class="avator">
<img width="24" height="24" src="https://img1.doubanio.com/icon/u1679535-8.jpg">
</a>
<a href="https://www.douban.com/people/weizhoushiwang/" class="name">维舟</a>
<span class="allstar50 main-title-rating" title="力荐"></span>
<span content="2021-06-29" class="main-meta">2021-06-29 07:49:56</span>
</header>
<div class="main-bd">
<h2><a href="https://book.douban.com/review/13642776/">一切坚固的东西都烟消云散了</a></h2>
<div id="review_13642776_short" class="review-short" data-rid="13642776">
<div class="short-content">
自从尼采喊出“上帝死了”以来,现代人一直生活在一种悬浮的失重状态之中。一旦作为世人心灵枢结的那个象征消亡,人们的行事也就失去了原有的分寸和准绳,因为已经没有什么最高律法能禁止你做任何事,于是“一切皆有可能”,这既给了人极大的自由,但以往被禁止的恶行也随之横...
(<a href="javascript:;" id="toggle-13642776-copy" class="unfold" title="展开">展开</a>)
</div>
</div>
<div id="review_13642776_full" class="hidden">
<div id="review_13642776_full_content" class="full-content"></div>
</div>
<div class="action">
<a href="javascript:;" class="action-btn up" data-rid="13642776" title="有用">
<img src="https://img3.doubanio.com/f/zerkalo/536fd337139250b5fb3cf9e79cb65c6193f8b20b/pics/up.png">
<span id="r-useful_count-13642776">
58
</span>
</a>
<a href="javascript:;" class="action-btn down" data-rid="13642776" title="没用">
<img src="https://img3.doubanio.com/f/zerkalo/68849027911140623cf338c9845893c4566db851/pics/down.png">
<span id="r-useless_count-13642776">
</span>
</a>
<a href="https://book.douban.com/review/13642776/#comments" class="reply ">4回应</a>
<a href="javascript:;;" class="fold hidden">收起</a>
</div>
</div>
</div>
With that, the elements are identified; all that remains is to match them.
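As a minimal sketch of that matching, assuming each list-page entry carries the <h2><a href="..."> line shown above, the review link and title can be pulled out with Go's regexp package (extractReviewLinks is a helper name introduced here purely for illustration, not part of the final demo):
package main

import (
	"fmt"
	"regexp"
)

// extractReviewLinks pulls every review URL and title out of a list-page HTML string.
func extractReviewLinks(html string) [][2]string {
	re := regexp.MustCompile(`<h2><a href="(.*?)">(.*?)</a></h2>`)
	var out [][2]string
	for _, m := range re.FindAllStringSubmatch(html, -1) {
		out = append(out, [2]string{m[1], m[2]}) // m[1] = review URL, m[2] = review title
	}
	return out
}

func main() {
	html := `<h2><a href="https://book.douban.com/review/13642776/">一切坚固的东西都烟消云散了</a></h2>`
	fmt.Println(extractReviewLinks(html))
}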
0x03. Design
First, a recap of the pagination pattern found above; movies, books, and music all follow the same scheme (a small URL-building sketch follows the list of URLs below).
The URL of the most popular book reviews:
https://book.douban.com/review/best/?start=0 (first page)
https://book.douban.com/review/best/?start=20 (second page)
And the URL of the most recently posted reviews:
https://book.douban.com/review/latest/?start=0 (first page)
https://book.douban.com/review/latest/?start=20 (second page)
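Since only the site section and the sort keyword change, the list URL can be assembled from three parameters; a minimal sketch (buildListURL is a helper name introduced here for illustration):
package main

import (
	"fmt"
	"strconv"
)

// buildListURL assembles a Douban review list URL from the site section
// (book / movie / music), the sort order (best / latest) and a 1-based page number.
func buildListURL(section, sort string, page int) string {
	return "https://" + section + ".douban.com/review/" + sort + "/?start=" + strconv.Itoa((page-1)*20)
}

func main() {
	fmt.Println(buildListURL("book", "best", 1))    // https://book.douban.com/review/best/?start=0
	fmt.Println(buildListURL("movie", "latest", 2)) // https://movie.douban.com/review/latest/?start=20
}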
Next, locate each field within a review:
Review link: <h2><a href="https://book.douban.com/review/9593425/">爱是天时地利的迷信</a></h2>
Title: <span property="v:summary">爱是天时地利的迷信</span>
Content: data-original="1"><p></p><p>喜欢乍见...<div class="copyright">
Author info: <header class="main-hd"> <a href="https://www.douban.com/people/3551583/"><span>慕容复:</span></a>
Reviewed work: <a href="https://book.douban.com/subject/30245411/">简·爱</a>
The crawled data is written to files. Since the reviews contain images, the text is saved in Markdown format, which keeps it clean and readable.
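The MdReplace filter shown in the next section only strips whitespace, so images stay as raw <img> tags. As a possible refinement that is not part of the original demo, those tags could be rewritten into Markdown image syntax so pictures render in the saved .md file; a minimal sketch:
package main

import (
	"fmt"
	"regexp"
)

// imgToMarkdown rewrites <img ... src="..."> tags into Markdown image syntax.
// Illustrative addition only; the original demo keeps the HTML tags as-is.
func imgToMarkdown(content string) string {
	re := regexp.MustCompile(`<img[^>]*src="([^"]+)"[^>]*>`)
	return re.ReplaceAllString(content, "![]($1)")
}

func main() {
	fmt.Println(imgToMarkdown(`<img src="https://img1.doubanio.com/view/subject/m/public/s33901818.jpg">`))
	// Output: ![](https://img1.doubanio.com/view/subject/m/public/s33901818.jpg)
}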
0x04. Coding
The program runs as-is; no cookie or other authentication is required.
package main

import (
	"fmt"
	"io"
	"net/http"
	"os"
	"regexp"
	"strconv"
	"strings"
)

// WriteFile saves one review as <path>\<title>.md
func WriteFile(title, result, path string) {
	title = strings.Replace(title, "|", "", -1)
	filename := path + "\\" + title + ".md"
	f, err := os.Create(filename)
	if err != nil {
		fmt.Println("os.Create err = ", err)
		return
	}
	f.Write([]byte(result))
	f.Close()
}

// Match the review title
func MatchTitle(data string) (title string) {
	retitle := regexp.MustCompile(`<span property="v:summary">(.*)</span>`)
	if retitle == nil {
		fmt.Println("regexp.MustCompile title err")
	}
	titles := retitle.FindAllStringSubmatch(data, 1)
	for _, data := range titles {
		title = data[1]
	}
	return
}

// Match the review content
func MatchContent(data string) (content string) {
	recontent := regexp.MustCompile(`data-original="(.*?)">(?s:(.*?))<div class="main-author">`)
	if recontent == nil {
		fmt.Println("regexp.MustCompile content err")
	}
	contents := recontent.FindAllStringSubmatch(data, 1)
	for _, data := range contents {
		content = data[2]
	}
	return
}

// Filter into Markdown format
func MdReplace(content string) (result string) {
	result = strings.Replace(content, "\t", "", -1)
	result = strings.Replace(result, "\n", "", -1)
	result = strings.Replace(result, " ", "", -1)
	return
}

// Filter into plain text
func ContentReplace(content string) (result string) {
	result = strings.Replace(content, "\t", "", -1)
	result = strings.Replace(result, "<p>", "", -1)
	result = strings.Replace(result, "</p>", "", -1)
	result = strings.Replace(result, "\n", "", -1)
	result = strings.Replace(result, " ", "", -1)
	return
}

// SpiderPage fetches one review page and returns its title and filtered content
func SpiderPage(url string) (title, content string) {
	result, err := GetPageClient(url)
	if err != nil {
		fmt.Println("GetPageClient err", err)
	}
	title = MatchTitle(result)
	content = MatchContent(result)
	// Filter the content
	content = MdReplace(content)
	content = "<h1>" + title + "</h1>\n<div>" + content
	return
}

func GetPageClient(url string) (result string, err error) {
	// Masquerade as a browser client
	client := &http.Client{}
	req, err1 := http.NewRequest("GET", url, nil)
	if err1 != nil {
		err = err1
		return
	}
	req.Header.Set("User-Agent", "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/66.0.3359.181 Safari/537.36")
	resp, err2 := client.Do(req)
	if err2 != nil {
		err = err2
		return
	}
	defer resp.Body.Close()
	for {
		buf := make([]byte, 1024*4)
		n, err := resp.Body.Read(buf)
		if n == 0 {
			break
		}
		if err != nil && err != io.EOF {
			fmt.Println("resp.Body.Read err = ", err)
			break
		}
		result += string(buf[:n])
	}
	return
}

// Crawl the URL of every review on one list page
func GetEveryPage(i int, book, best, path string, page chan int) {
	url := "https://" + book + ".douban.com/review/" + best + "/?start=" + strconv.Itoa((i-1)*20)
	fmt.Printf("page %d (%s) url = %s...\n", i, best, url)
	result, err := GetPageClient(url)
	if err != nil {
		fmt.Println("GetPageClient err", err)
	}
	// Match every review link on the list page
	re := regexp.MustCompile(`<h2><a href="(.*)"`)
	if re == nil {
		fmt.Println("regexp.MustCompile err")
	}
	singleUrls := re.FindAllStringSubmatch(result, -1)
	for _, data := range singleUrls {
		title, content := SpiderPage(data[1])
		WriteFile(title, content, path)
	}
	page <- i
}

// GetWork crawls pages start..end concurrently, one goroutine per page,
// and uses the channel to wait until every page has finished
func GetWork(book, best string, start, end int, path string) {
	page := make(chan int, 10)
	for i := start; i <= end; i++ {
		go GetEveryPage(i, book, best, path, page)
	}
	for i := start; i <= end; i++ {
		fmt.Printf("%s page %d fetched successfully...\n", book, <-page)
	}
}

// https://book.douban.com/review/best/?start=0
func main() {
	var start, end int
	var path, book, best, yes string
	for {
		fmt.Println("Enter the crawl type (book, movie, or music)")
		fmt.Scan(&book)
		fmt.Println("Enter the sort order (best or latest)")
		fmt.Scan(&best)
		fmt.Println("Enter the start page")
		fmt.Scan(&start)
		fmt.Println("Enter the end page")
		fmt.Scan(&end)
		fmt.Println(`Enter the save path (e.g. C:\Users\Desktop)`)
		fmt.Scan(&path)
		GetWork(book, best, start, end, path)
		fmt.Println("Done. Continue crawling? Y/N")
		fmt.Scan(&yes)
		if yes == "N" || yes == "n" {
			break
		}
	}
}
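With the code above, an interactive run looks roughly like the following (the inputs after each prompt are typed by the user, the save path is only an example, and the order of the progress lines may vary because pages are fetched concurrently):
Enter the crawl type (book, movie, or music)
book
Enter the sort order (best or latest)
best
Enter the start page
1
Enter the end page
2
Enter the save path (e.g. C:\Users\Desktop)
C:\Users\Desktop\douban
page 1 (best) url = https://book.douban.com/review/best/?start=0...
page 2 (best) url = https://book.douban.com/review/best/?start=20...
book page 1 fetched successfully...
book page 2 fetched successfully...
Done. Continue crawling? Y/N
N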
0x05. Converting to Markdown for Storage
This step mostly just replaces characters to turn the content into Markdown. The conversion is not perfect, but the result already looks quite good.
// Filter into Markdown format
func MdReplace(content string) (result string) {
	result = strings.Replace(content, "\t", "", -1)
	result = strings.Replace(result, "\n", "", -1)
	result = strings.Replace(result, " ", "", -1)
	return
}

// Filter into plain text
func ContentReplace(content string) (result string) {
	result = strings.Replace(content, "\t", "", -1)
	result = strings.Replace(result, "<p>", "", -1)
	result = strings.Replace(result, "</p>", "", -1)
	result = strings.Replace(result, "\n", "", -1)
	result = strings.Replace(result, " ", "", -1)
	return
}

func SpiderPage(url string) (title, content string) {
	result, err := GetPageClient(url)
	if err != nil {
		fmt.Println("GetPageClient err", err)
	}
	title = MatchTitle(result)
	content = MatchContent(result)
	// Filter the content
	content = MdReplace(content)
	content = "<h1>" + title + "</h1>\n<div>" + content
	return
}
0x06. Screenshots
Note: for learning and exchange only; the author bears no responsibility for any illegal use!