Java Crawler: A Sina Weibo Crawler Built on Spring Boot

License: CC 4.0 BY-SA

Tags: #Java #Crawler #WeiboCrawler #Jsoup

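Honestly, this title puzzles me too: why would I need a framework this heavy just to write a crawler? When I first got into crawlers, any crawler that fetched the data I wanted counted as a success: perfect and bug-free, no matter how slow it was or how tedious it was to operate.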

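Later, with time on my hands and Java having become a habit, I didn't feel like going back to Python for crawlers. So I wrote them in Java; they are all case by case anyway. While writing this crawler I asked myself more than once: in the time spent on the code, couldn't you have downloaded everything by hand?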

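I wrote two cases. The first crawls the mm131 site. It isn't hard: no login state is required and the server doesn't check request frequency. I still disguised the IP and ran 20 threads flat out. There was one small hiccup: a day or two after my first full crawl, which had saved a dozen or so GB of images, the site started returning 403 frequently, perhaps because of a new nginx configuration or some fault on their side. On closer inspection it wasn't serious: a failed request goes through after a few forced refreshes, so I simply called the method again from inside the catch block, and that worked reasonably well. The project is simple: edit the configuration file, create a folder for the images, and it is ready to crawl. If you're interested, clone it from https://github.com/gsy44355/mm131pic.git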
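The "retry inside catch" idea can be sketched as a bounded loop instead of an unbounded recursive call. This is a minimal illustration, not the code from the mm131pic repository: the class name `RetrySketch`, the method `retry`, and the attempt limit and pause are all assumptions made for the example.

```java
import java.io.IOException;
import java.util.concurrent.Callable;

public class RetrySketch {

    /**
     * Runs the task and retries it after a failure, up to maxAttempts times,
     * pausing pauseMillis between attempts. Rethrows the last failure.
     */
    public static <T> T retry(Callable<T> task, int maxAttempts, long pauseMillis) throws Exception {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be at least 1");
        }
        Exception last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return task.call();
            } catch (InterruptedException e) {
                // Do not retry when the thread was asked to stop.
                Thread.currentThread().interrupt();
                throw e;
            } catch (Exception e) {
                last = e;
                if (attempt < maxAttempts) {
                    Thread.sleep(pauseMillis);
                }
            }
        }
        throw last;
    }

    public static void main(String[] args) throws Exception {
        int[] calls = {0};
        // Stand-in for a download that answers 403 twice before succeeding.
        String result = retry(() -> {
            if (++calls[0] < 3) {
                throw new IOException("HTTP 403");
            }
            return "image bytes";
        }, 5, 10);
        System.out.println(result + " after " + calls[0] + " attempts");
    }
}
```

Capping the number of attempts keeps a permanently failing URL from looping forever or overflowing the stack, which an unbounded recursive retry could do.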

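The second case crawls Sina Weibo, and this one was a real headache. The background: a blogger I follow is so prolific that I couldn't possibly download all the original images by hand; who has that much time to scroll Weibo = =. So I wanted a crawler that grabs every image in one go. But Sina is a big company, after all. Crawling with 20 threads as before kept failing, always with an unauthorized error. Never keep resending requests when you hit this error: it can get your account's cookie banned. My cookie was banned twice and I had to change my password once = =. I was really worried I would lose the account.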

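So how should it be crawled? First, throttle the requests: a Thread.sleep(1000) between requests is enough. I added that and started over... and after just two images it died again, this time with a 405 error. Thinking it over, the real problem was that the links were kept only in memory, so after every failure the next run repeated work that had already been done. Fine: MyBatis + MySQL to persist the links, plus some logging while I was at it. After a few minutes of wiring that up by hand I concluded that Spring Boot is simpler: it integrates everything and you just use it, nothing complicated.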
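The point of persisting the links is that each URL has a small life cycle: it is claimed by a worker, deleted on success, and released for another attempt on failure. The sketch below models that life cycle in memory so it can be run without a database. It is only a stand-in for the MySQL table: the class `UrlQueueSketch` and its method names are hypothetical, and it assumes a claim picks any idle URL of the requested type.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class UrlQueueSketch {

    /** One row: the crawl stage ("type") of a URL and whether a worker currently holds it. */
    private static final class Row {
        final String type;
        boolean busy;

        Row(String type) {
            this.type = type;
        }
    }

    // The URL is the key, so a duplicate insert is rejected, like a primary key would.
    private final Map<String, Row> rows = new LinkedHashMap<>();

    /** Adds a URL unless it is already known; returns true if it was new. */
    public synchronized boolean add(String url, String type) {
        return rows.putIfAbsent(url, new Row(type)) == null;
    }

    /** Claims the first idle URL of the given type, or returns null when none is left. */
    public synchronized String claim(String type) {
        for (Map.Entry<String, Row> entry : rows.entrySet()) {
            Row row = entry.getValue();
            if (row.type.equals(type) && !row.busy) {
                row.busy = true;
                return entry.getKey();
            }
        }
        return null;
    }

    /** Success: the URL is finished and leaves the queue. */
    public synchronized void done(String url) {
        rows.remove(url);
    }

    /** Failure: the URL becomes idle again so a later pass can retry it. */
    public synchronized void release(String url) {
        Row row = rows.get(url);
        if (row != null) {
            row.busy = false;
        }
    }

    public synchronized int size() {
        return rows.size();
    }

    public static void main(String[] args) {
        UrlQueueSketch queue = new UrlQueueSketch();
        queue.add("https://example.com/page/1", "1");
        String url = queue.claim("1");
        queue.release(url);          // simulate a failed fetch
        url = queue.claim("1");      // the same URL is handed out again
        queue.done(url);             // simulate a successful fetch
        System.out.println("remaining URLs: " + queue.size());
    }
}
```

With the same states stored in a database table instead of a map, a crashed or banned run can resume from whatever rows are left rather than starting from the first page.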

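Below are the CrawlerBase class and the WeiboCrawler class for reference. Suggestions are welcome, because I want the base class to be as general as possible. The full code can be cloned from https://github.com/gsy44355/springboot-start.git, and I expect to keep maintaining it. The project contains a lot besides the crawler, so it isn't ideal for studying the crawler in isolation, but you can run whatever code you want directly from a Test, and startup currently still takes only seconds.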

package com.gsy.springboot.start.serviceImpl;

import com.gsy.springboot.start.mapper.TbCrawlerUrlCustomMapper;
import com.gsy.springboot.start.mapper.auto.TbCrawlerUrlMapper;
import com.gsy.springboot.start.pojo.TbCrawlerUrl;
import com.gsy.springboot.start.service.CrawlerBaseService;
import com.gsy.springboot.start.util.LogUtil;
import com.gsy.springboot.start.util.crawler.CrawlerSpecialFunc;
import com.gsy.springboot.start.util.crawler.CreateHeaderMap;
import com.gsy.springboot.start.util.crawler.WebCrawlerUtil;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.dao.DuplicateKeyException;
import org.springframework.stereotype.Service;
import org.springframework.transaction.annotation.Transactional;

/**
 * Created By Gsy on 2019/5/18
 */
@Service
public class CrawlerBaseServiceImpl implements CrawlerBaseService {
    @Autowired
    TbCrawlerUrlMapper tbCrawlerUrlMapper;
    @Autowired
    TbCrawlerUrlCustomMapper tbCrawlerUrlCustomMapper;
    @Override
    public synchronized String getUrl(String type) {
        TbCrawlerUrl tbCrawlerUrl = tbCrawlerUrlCustomMapper.getOneUrl(type);
        if (tbCrawlerUrl == null){
            return null;
        }
        tbCrawlerUrl.setBusy("1");
        tbCrawlerUrlMapper.updateByPrimaryKeySelective(tbCrawlerUrl);
        return tbCrawlerUrl.getUrl();
    }

    @Override
    public int updateUrlToNoUse(String url) {
        return tbCrawlerUrlMapper.updateByPrimaryKeySelective(new TbCrawlerUrl(url,"0"));
    }

    @Override
    public int addUrl(TbCrawlerUrl tbCrawlerUrl) {
        try{
            return tbCrawlerUrlMapper.insertSelective(tbCrawlerUrl);
        }catch (DuplicateKeyException e){
            // The URL is already queued; log it and report success so the caller carries on
            LogUtil.info(this.getClass(),"Crawler found a duplicate Url={}",tbCrawlerUrl.getUrl());
            return 1;
        }
    }

    @Override
    public int deleteUrl(String url) {
        return tbCrawlerUrlMapper.deleteByPrimaryKey(url);
    }

    @Override
    public int deleteAll() {
        return tbCrawlerUrlCustomMapper.deleteAll();

    }

    @Override
    public void doCrawler(String type,long sleepTime,CrawlerSpecialFunc crawlerSpecialFunc) {
        int errorCount = 0;
        while(true){
            String url = null;
            try {
                if(sleepTime != 0 ){
                    Thread.sleep(sleepTime);
                }
                url = this.getUrl(type);
                if(url == null){
                    break;
                }
                LogUtil.info(this.getClass(),"Fetched Url={}",url);
                crawlerSpecialFunc.specialFunc(url);
                this.deleteUrl(url);
            }catch (Exception e){
                errorCount++;
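                LogUtil.error(this.getClass(),"Crawl failed; decide how this should be handled",e);
                // url is still null if the failure happened before a URL was claimed
                if (url != null){
                    this.updateUrlToNoUse(url);
                }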
                if (errorCount >100){
                    break;
                }
            }
        }
    }
}

Below is the case-by-case Weibo crawler; startNew and reStart are its two entry methods.
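The startNew entry point seeds the queue by substituting a page number for the "@replace@" placeholder in a URL template read from the crawler/start resource bundle. That step can be sketched on its own as follows. The class `SeedUrlSketch` is hypothetical, and the template in main is only a guess based on the URL pattern the crawler matches; the real value lives in the configuration file, which is not shown here.

```java
import java.util.ArrayList;
import java.util.List;

public class SeedUrlSketch {

    /** Expands the "@replace@" placeholder for every page number in [start, endExclusive). */
    public static List<String> seedUrls(String template, int start, int endExclusive) {
        List<String> urls = new ArrayList<>();
        for (int page = start; page < endExclusive; page++) {
            urls.add(template.replace("@replace@", String.valueOf(page)));
        }
        return urls;
    }

    public static void main(String[] args) {
        // Assumed template; the actual mainUrl is defined in the crawler/start bundle.
        String template = "https://weibo.cn/u/6697930990?filter=2&page=@replace@";
        seedUrls(template, 1, 4).forEach(System.out::println);
    }
}
```

As in startNew, the end of the range is exclusive, so countEnd has to be one past the last page that should be crawled.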

==================================================================================

Embarrassingly, I had to revise this once: it turns out this thing can't simply be run multi-threaded directly - -

package com.gsy.springboot.start.serviceImpl;

import com.gsy.springboot.start.pojo.TbCrawlerUrl;
import com.gsy.springboot.start.service.CrawlerBaseService;
import com.gsy.springboot.start.service.WeiboCrawlerService;
import com.gsy.springboot.start.util.LogUtil;
import com.gsy.springboot.start.util.crawler.CreateHeaderMap;
import com.gsy.springboot.start.util.crawler.WebCrawlerUtil;
import org.apache.commons.lang3.StringUtils;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.scheduling.annotation.Async;
import org.springframework.scheduling.annotation.EnableAsync;
import org.springframework.stereotype.Service;

import java.util.ArrayList;
import java.util.List;
import java.util.ResourceBundle;

/**
 * Created By Gsy on 2019/5/18
 */
@Service
@EnableAsync
public class WeiboCrawlerServiceImpl implements WeiboCrawlerService {
    @Autowired
    CrawlerBaseService crawlerBaseService;
    @Override
    public void startNew() {
        crawlerBaseService.deleteAll();
        ResourceBundle resourceBundle = ResourceBundle.getBundle("crawler/start");
        for (int i = Integer.parseInt(resourceBundle.getString("countStart") ); i <Integer.parseInt(resourceBundle.getString("countEnd") ); i++) {
            crawlerBaseService.addUrl(new TbCrawlerUrl(resourceBundle.getString("mainUrl").replace("@replace@",""+i),"1","0"));
        }
        reStart();
    }

    @Override
    public void reStart() {
        getUrl();
        getPicUrl();
        List<Thread> list = new ArrayList<>();
        for (int i = 0; i < 20; i++) {
            LogUtil.info(this.getClass(),"Created thread={}",""+i);
            Thread thread = new Thread(() -> getPic());
            list.add(thread);
            thread.start();
        }
        for (Thread thread:list) {
            try {
                thread.join();
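            } catch (InterruptedException e) {
                // Restore the interrupt flag and stop waiting for the remaining workers
                Thread.currentThread().interrupt();
                LogUtil.error(this.getClass(),"Interrupted while waiting for download threads",e);
                break;
            }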
        }

    }

    /**
     * Collects the links to the original images, which come in two forms. Targets https://weibo.cn
     */
    public void getUrl() {
        crawlerBaseService.doCrawler("1",1000,(s) -> {
                String html = WebCrawlerUtil.getWebHtml(s, CreateHeaderMap.getMapByName("crawler/page"),"utf-8");
                Document document = Jsoup.parse(html);
                Elements elements = document.getElementsByTag("a");
                for (Element e : elements) {
                    String url = e.attr("href");
                    if (url.matches(".*?picAll.*?")) {
                        crawlerBaseService.addUrl(new TbCrawlerUrl(url,"1","0"));
                        LogUtil.info(this.getClass(),"Stored picAll Url={}", url);
                    } else if (url.matches(".*?oripic.*?")) {
                        if(!s.matches("https://weibo.cn/u/6697930990[?]filter=2&page=\\d+")){
                            url = "https://weibo.cn"+url;
                        }
                        crawlerBaseService.addUrl(new TbCrawlerUrl(url,"2","0"));
                        LogUtil.info(this.getClass(),"Stored Url={}", url);
                    }
                }
        });
    }

    /**
     * Resolves the real image link, which sits behind one 302 redirect
     */
    public void getPicUrl(){
        crawlerBaseService.doCrawler("2",1000,(url) -> {
            String picUrl = WebCrawlerUtil.get302Location(url,CreateHeaderMap.getMapByName("crawler/picR"));
            if(StringUtils.isNotEmpty(picUrl)){
                crawlerBaseService.addUrl(new TbCrawlerUrl(picUrl,"3","0"));
            }
        });
    }

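    /**
     * Downloads the actual images. No session is involved, so any number of threads can run this.
     * Note: @Async has no effect when getPic() is called from inside this class, because the
     * call bypasses the Spring proxy; reStart() therefore starts the worker threads itself.
     */
    @Async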
    public void getPic() {
        crawlerBaseService.doCrawler("3",0,url -> {
            WebCrawlerUtil.getWebPicture(url, url.substring(url.lastIndexOf("/")), CreateHeaderMap.getMapByNameWithRandomIp("crawler/picture"), ResourceBundle.getBundle("crawler/start").getString("dir"));
            LogUtil.info(this.getClass(),"Saved picture={}", url);
        });
    }
}
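The reStart method creates 20 threads by hand and joins them one by one. As an alternative, the same fan-out can be sketched with a fixed thread pool from java.util.concurrent. This is an illustration rather than the project's code: `WorkerPoolSketch`, `drain`, and the in-memory queue are assumptions that stand in for the database-backed getPic stage.

```java
import java.util.Queue;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Consumer;

public class WorkerPoolSketch {

    /**
     * Starts the given number of workers, lets each one pull URLs from the shared
     * queue until it is empty, and waits for them to finish.
     * Returns true if all workers finished before the timeout.
     * The queue must be thread-safe, for example a ConcurrentLinkedQueue.
     */
    public static boolean drain(Queue<String> urls, int workers,
                                Consumer<String> download, long timeoutSeconds)
            throws InterruptedException {
        ExecutorService pool = Executors.newFixedThreadPool(workers);
        for (int i = 0; i < workers; i++) {
            pool.submit(() -> {
                String url;
                while ((url = urls.poll()) != null) {
                    download.accept(url);
                }
            });
        }
        pool.shutdown(); // accept no new tasks; the submitted workers keep running
        return pool.awaitTermination(timeoutSeconds, TimeUnit.SECONDS);
    }

    public static void main(String[] args) throws InterruptedException {
        Queue<String> urls = new ConcurrentLinkedQueue<>();
        for (int i = 0; i < 100; i++) {
            urls.add("https://example.com/pic/" + i + ".jpg");
        }
        AtomicInteger downloaded = new AtomicInteger();
        // The download itself is replaced by a counter for this sketch.
        boolean finished = drain(urls, 20, url -> downloaded.incrementAndGet(), 30);
        System.out.println("finished=" + finished + ", downloaded=" + downloaded.get());
    }
}
```

In this sketch an exception thrown by the download callback ends that worker quietly, so a real crawler would catch and log failures inside the callback, as doCrawler does.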

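Once I have tidied things up, I will maintain the documentation and publish it on GitHub. Feel free to leave a comment if you have questions, and please point out anything I got wrong.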