Part 1: Getting the list of all projects to collect information about

License: CC 4.0 BY-SA

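This article describes a GitHub project crawler written in Java. It explains how to scrape the project list from the Awesome-java page, analyze the page structure, fetch and parse the page with OkHttp and Jsoup, and then call the GitHub API and parse its JSON responses with Gson to obtain each project's star, fork, and issue counts.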

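Get the list of all projects to collect information about
Traverse the project list and fetch each project's home page information, which gives us the project's star, fork, and issue counts
Store this data in MySQL
Write a simple server to display the data in the database (as charts, for a more intuitive view)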

1. Getting the list of all projects to collect information about

I wrote a crawler that visits the Awesome-java page and then extracts the link information for every project listed on it.

1.1 Observing the page structure

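I started by inspecting the page with Chrome's developer tools and found that the projects on Awesome-Java are laid out regularly as multiple ul tags, each containing li tags.
Each li is one of the individual projects I want to collect.
Inside each li there is an a tag carrying the hyperlink, and the text of the li is a short description of the project.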

1.2 Analyzing the page structure

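Having observed the page, we know how Awesome-Java is structured, so we can write a program to extract the relevant information from it.
----1.2.1 How to fetch the page content
Construct an HTTP request and send it to the server. Here I use the third-party library OkHttp to fetch the page.
I import it through Maven.
With OkHttp I can fetch the content of a page given its URL.

Let's try printing it: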

import okhttp3.Call;
import okhttp3.OkHttpClient;
import okhttp3.Request;
import okhttp3.Response;

import java.io.IOException;

public class Crawler {
    public static void main(String[] args) throws IOException {
        // Create the OkHttpClient
        OkHttpClient okHttpClient = new OkHttpClient();
        // Build a Request object
        Request request = new Request.Builder().url("https://github.com/akullpp/awesome-java/blob/master/README.md").build();
        // Create a Call object, which is responsible for performing one network request
        Call call = okHttpClient.newCall(request);
        // Submit the call to the server; try-with-resources closes the Response afterwards
        try (Response response = call.execute()) {
            // Check whether the request succeeded
            if (!response.isSuccessful()) {
                System.out.println("Request failed!");
                return;
            }
            System.out.println(response.body().string());
        }
    }
}

The response is HTML, which still looks quite complex, so we need to analyze it further and extract the parts we need.

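----1.2.2 Parsing the page structure
Analyzing the page as a raw string is tedious, so here I use the third-party library jsoup to parse the HTML.
Parsing the HTML we just fetched with the Jsoup class produces a Document object, which turns the string into a tree-structured document.
On the Document we can call getElementsByTag to fetch any kind of tag; each Element corresponds to one tag.
The content of each Element is the content of a project we want to rank.
At this point, create a class to represent a project: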

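public class Project {
    private String name;        // project name
    private String url;         // project URL
    private String description; // description

    private int stars;      // number of stars
    private int fork;       // number of forks
    private int openIssiue; // number of open issues (bugs or feature requests)

    // Setters used by the crawler code
    public void setName(String name) { this.name = name; }
    public void setUrl(String url) { this.url = url; }
    public void setDescription(String description) { this.description = description; }
    public void setStars(int stars) { this.stars = stars; }
    public void setFork(int fork) { this.fork = fork; }
    public void setOpenIssiue(int openIssiue) { this.openIssiue = openIssiue; }
}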

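Next, go through the li tags one by one (some li tags do not represent a project, so we need to filter those out).
After filtering, each remaining tag corresponds to the key information of one project.

The text of the li tag is the project's description, and the li tag contains a nested a tag.
The text of the a tag is the project name, and the href attribute of the a tag is the URL.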

import okhttp3.Call;
import okhttp3.OkHttpClient;
import okhttp3.Request;
import okhttp3.Response;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

import java.io.IOException;
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;

public class Crawler {
    // Blacklist of GitHub links that are not projects (site footer links and the like)
    private final HashSet<String> urlBlackList = new HashSet<>();
    {
        urlBlackList.add("https://github.com/events");
        urlBlackList.add("https://github.community");
        urlBlackList.add("https://github.com/about");
        urlBlackList.add("https://github.com/pricing");
        urlBlackList.add("https://github.com/contact");
        urlBlackList.add("https://github.com/security");
        urlBlackList.add("https://github.com/site/terms");
        urlBlackList.add("https://github.com/site/privacy");
    }

    public static void main(String[] args) throws IOException {
        Crawler crawler = new Crawler();
        String htmlBody = crawler.getPage("https://github.com/akullpp/awesome-java/blob/master/README.md");
        List<Project> list = crawler.parseProjectList(htmlBody);
        System.out.println(list);
    }

    public String getPage(String url) throws IOException {
        // Create the OkHttpClient
        OkHttpClient okHttpClient = new OkHttpClient();
        // Build a Request object
        Request request = new Request.Builder().url(url).build();
        // Create a Call object, which is responsible for performing one network request
        Call call = okHttpClient.newCall(request);
        // Submit the call to the server; try-with-resources closes the Response afterwards
        try (Response response = call.execute()) {
            // Check whether the request succeeded
            if (!response.isSuccessful()) {
                System.out.println("Request failed!");
                return null;
            }
            return response.body().string();
        }
    }

    public List<Project> parseProjectList(String htmlBody) {
        // Parse the page with Jsoup and collect every li tag
        List<Project> projects = new ArrayList<>();
        Document document = Jsoup.parse(htmlBody);
        Elements elements = document.getElementsByTag("li");
        for (Element element : elements) {
            Elements allElements = element.getElementsByTag("a");
            // Skip li tags that contain no link
            if (allElements.isEmpty()) {
                continue;
            }
            Element link = allElements.get(0);
            String name = link.text();
            String url = link.attr("href");
            String description = element.text();
            // Keep only links that point to GitHub
            if (!url.startsWith("https://github.com")) {
                continue;
            }
            // Skip GitHub links that are known not to be projects
            if (urlBlackList.contains(url)) {
                continue;
            }
            Project project = new Project();
            project.setName(name);
            project.setUrl(url);
            project.setDescription(description);
            projects.add(project);
        }
        return projects;
    }
}

At this point we have the full list of projects from Awesome-Java.
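
One detail worth noting about the filter above: the startsWith("https://github.com") check also matches https://github.community, which is why that URL has to appear in the blacklist. A complementary check, sketched here under the assumption that every project link has the shape https://github.com/<owner>/<repo>, compares the host exactly and counts path segments. The names RepoUrlFilter and isRepoUrl are hypothetical and not part of the original crawler. The blacklist is still needed, because two-segment pages such as https://github.com/site/terms pass this check too.

```java
import java.net.URI;
import java.net.URISyntaxException;
import java.util.Arrays;

// Hypothetical helper, not part of the original crawler.
class RepoUrlFilter {
    // Returns true only for URLs shaped like https://github.com/<owner>/<repo>.
    static boolean isRepoUrl(String url) {
        if (url == null) {
            return false;
        }
        URI uri;
        try {
            uri = new URI(url);
        } catch (URISyntaxException e) {
            return false;
        }
        // Compare the host exactly, so github.community is rejected.
        if (!"https".equals(uri.getScheme()) || !"github.com".equals(uri.getHost())) {
            return false;
        }
        String path = uri.getPath();
        if (path == null) {
            return false;
        }
        // Ignore empty segments so a trailing slash does not change the count.
        long segments = Arrays.stream(path.split("/"))
                .filter(s -> !s.isEmpty())
                .count();
        return segments == 2;
    }

    public static void main(String[] args) {
        System.out.println(isRepoUrl("https://github.com/doov-io/doov")); // true
        System.out.println(isRepoUrl("https://github.com/about"));        // false
        System.out.println(isRepoUrl("https://github.community"));        // false
    }
}
```

In parseProjectList, such a check could replace the startsWith test while the blacklist lookup stays in place.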

2. Traversing the project list and fetching each project's home page information to get its star, fork, and issue counts

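Again, start by observing, this time the HTML pages of the individual projects. GitHub actually exposes a set of APIs that make crawlers easier to implement and that also let it limit the crawl rate more effectively.
Requesting the HTML pages directly may get us blocked by anti-crawler measures, whereas the API delivers the data more reliably.
Through the GitHub API we can fetch information about a given project/repository. Here too we use the OkHttpClient object to call the GitHub API:
curl https://api.github.com/repos/doov-io/doov
The response is JSON, a format that organizes data as key-value pairs.
Here I use Gson to parse the JSON data.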
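
Since the API limits the crawl rate, it helps to know when to pause. GitHub reports the remaining quota in the response headers X-RateLimit-Remaining and X-RateLimit-Reset (the reset time as UTC epoch seconds). The following is only a sketch of how a crawler might turn those two values into a waiting time; the class and method names (RateLimitPlanner, pauseMillis) are hypothetical, and reading the headers from the response is left out.

```java
// Hypothetical helper, not part of the original crawler.
class RateLimitPlanner {
    // remaining:         value of the X-RateLimit-Remaining header
    // resetEpochSeconds: value of the X-RateLimit-Reset header (UTC epoch seconds)
    // nowEpochSeconds:   current time in UTC epoch seconds
    // Returns how many milliseconds to pause before sending the next request.
    static long pauseMillis(int remaining, long resetEpochSeconds, long nowEpochSeconds) {
        if (remaining > 0) {
            return 0L; // quota left, no need to wait
        }
        long seconds = resetEpochSeconds - nowEpochSeconds;
        // A reset time in the past means the quota has already been renewed.
        return Math.max(0L, seconds) * 1000L;
    }

    public static void main(String[] args) {
        long now = System.currentTimeMillis() / 1000L;
        // Quota exhausted, reset in 60 seconds: wait 60000 ms.
        System.out.println(pauseMillis(0, now + 60, now));
        // Quota still available: no wait.
        System.out.println(pauseMillis(42, now + 60, now));
    }
}
```

With OkHttp, the two header values can be read from the Response with response.header("X-RateLimit-Remaining") and response.header("X-RateLimit-Reset"), both of which return strings that need to be parsed first.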

2.1 A first look at Gson

JSON organizes data as key-value pairs, so a map converts to it directly:

import com.google.gson.Gson;
import com.google.gson.GsonBuilder;

import java.util.HashMap;

public class TestGson {
    public static void main(String[] args) {
        // 1. Create a Gson object
        Gson gson = new GsonBuilder().create();

        // 2. Convert key-value data into a JSON string
        HashMap<String,String> hashMap = new HashMap<>();
        hashMap.put("行者","武松");
        hashMap.put("花和尚","鲁智深");
        hashMap.put("及时雨","宋江");
        String result = gson.toJson(hashMap);
        System.out.println(result);
    }
}

{"行者":"武松","花和尚":"鲁智深","及时雨":"宋江"}

Converting a JSON string back into an object:

import com.google.gson.Gson;
import com.google.gson.GsonBuilder;

public class TestGson {
    static class Test{
        int aaa;
        int bbb;
    }
    public static void main(String[] args) {
        // 1. Create a Gson object
        Gson gson = new GsonBuilder().create();
        // 3. Convert the JSON string into an object
        String jsonString = "{ \"aaa\":1, \"bbb\":2}";
        // Test.class is the Class object that tells Gson which type to build
        Test t = gson.fromJson(jsonString,Test.class);
        System.out.println(t.aaa);
        System.out.println(t.bbb);
    }
}

2.2 Calling the GitHub API to get each project's repository information

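    // Extract the repository name ("owner/repo") from the URL
    private String getRepoName(String url) {
        int lastOne = url.lastIndexOf("/");
        int lastTwo = url.lastIndexOf("/",lastOne - 1);
        if (lastOne == -1 || lastTwo == -1 ){
            System.out.println("The current URL is invalid");
            return null;
        }
        return url.substring(lastTwo+1);
    }
    // Fetch the repository information from the GitHub API by repository name.
    // okHttpClient is assumed to be a field of the enclosing class.
    private String getRepoInfo(String repoName) throws IOException {
        // Assumption: the credentials are read from environment variables (hypothetical
        // names) instead of being hard-coded in the source. GitHub no longer accepts
        // account passwords for API authentication, so a personal access token takes
        // the place of the password.
        String username = System.getenv("GITHUB_USERNAME");
        String token = System.getenv("GITHUB_TOKEN");
        // Basic authentication: the username and token are Base64-encoded (not encrypted)
        // into one string, which goes into the HTTP Authorization header
        String credential = Credentials.basic(username,token);
        String url = "https://api.github.com/repos/" + repoName;
        Request request = new Request.Builder().url(url).header("Authorization",credential).build();

        Call call = okHttpClient.newCall(request);
        // try-with-resources closes the Response afterwards
        try (Response response = call.execute()) {
            if (!response.isSuccessful()){
                System.out.println("Request to the GitHub API failed!");
                return null;
            }
            return response.body().string();
        }
    }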

At this point we have each project's repository information from the API, so we can use Gson to extract the key fields and put them into the project list.
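
The getRepoName method above takes everything after the second-to-last slash, so a URL with a trailing slash such as https://github.com/doov-io/doov/ would yield "doov/" rather than "doov-io/doov". A possible variant that tolerates trailing slashes is sketched here. The names RepoNames and fromUrl are hypothetical, and like the original it does not check that the URL actually points to GitHub.

```java
// Hypothetical helper, not part of the original crawler.
class RepoNames {
    // Returns the last two slash-separated parts of the URL, e.g. "owner/repo".
    // Returns null when the trimmed string contains fewer than two '/' characters.
    static String fromUrl(String url) {
        if (url == null) {
            return null;
        }
        // Remove trailing slashes so they do not shift the indexes below.
        String trimmed = url;
        while (trimmed.endsWith("/")) {
            trimmed = trimmed.substring(0, trimmed.length() - 1);
        }
        int lastOne = trimmed.lastIndexOf('/');
        if (lastOne == -1) {
            return null;
        }
        int lastTwo = trimmed.lastIndexOf('/', lastOne - 1);
        if (lastTwo == -1) {
            return null;
        }
        return trimmed.substring(lastTwo + 1);
    }

    public static void main(String[] args) {
        System.out.println(fromUrl("https://github.com/doov-io/doov"));  // doov-io/doov
        System.out.println(fromUrl("https://github.com/doov-io/doov/")); // doov-io/doov
    }
}
```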

2.3 Parsing the repository API response with Gson to get stars and other key information

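This part of the project also relies on reflection: the GitHub API returns a repository's information as JSON, and a few of its key-value pairs are the ones I want.
I store the contents of the JSON in a HashMap.
Gson processes the JSON string using type information obtained at runtime.
gson.fromJson() receives the Type captured by the TypeToken, from which it learns that the target is a HashMap with String keys and Object values.
It then fills the HashMap with the contents of the JSON. Because the values are declared as Object, Gson deserializes JSON numbers as Double, which is why the code casts to Double.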

    // Uses Gson's runtime type information (captured by TypeToken) to parse the JSON
    public void parseRepoInfo(String jsonString,Project project){
        Type type = new TypeToken<HashMap<String,Object>>(){}.getType();
        HashMap<String,Object> hashMap = gson.fromJson(jsonString,type);
        // Numbers stored as Object values arrive as Double
        Double starCount =(Double) hashMap.get("stargazers_count");
        Double forkCount =(Double) hashMap.get("forks_count");
        Double openIssueCount =(Double) hashMap.get("open_issues_count");
        project.setStars(starCount.intValue());
        project.setFork(forkCount.intValue());
        project.setOpenIssiue(openIssueCount.intValue());
    }
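
The casts in parseRepoInfo assume that all three keys are present. If a key is missing (for example when the API returns an error object instead of repository data), hashMap.get returns null and the following intValue() call throws a NullPointerException. A more defensive way to read a count from the map is sketched here. The names JsonNumbers and intField are hypothetical, and the map is built by hand to stand in for what Gson would produce.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper, not part of the original crawler.
class JsonNumbers {
    // Returns the value stored under key as an int,
    // or fallback when the key is missing or the value is not a number.
    static int intField(Map<String, Object> map, String key, int fallback) {
        Object value = map.get(key);
        if (value instanceof Number) {
            return ((Number) value).intValue();
        }
        return fallback;
    }

    public static void main(String[] args) {
        // Hand-built stand-in for the parsed API response.
        Map<String, Object> repo = new HashMap<>();
        repo.put("stargazers_count", 1234.0); // a Double, as Gson would store it
        repo.put("name", "doov");

        System.out.println(intField(repo, "stargazers_count", 0)); // 1234
        System.out.println(intField(repo, "forks_count", 0));      // 0 (key missing)
        System.out.println(intField(repo, "name", 0));             // 0 (not a number)
    }
}
```

Checking against Number rather than Double also keeps the helper working if the map ever holds Integer or Long values.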