Seimi基础系列2-SeimiCrawler整合Mybatis存储数据

最新推荐文章于 2024-08-09 07:18:35 发布

weixin_33790053

最新推荐文章于 2024-08-09 07:18:35 发布

阅读量327

点赞数

CC 4.0 BY-SA版权

文章标签： java 爬虫数据库

原文链接：https://my.oschina.net/u/589889/blog/719305

本文介绍如何将SeimiCrawler与Mybatis进行整合，实现从网页抓取数据并存储到数据库的过程。SeimiCrawler是一款Java爬虫框架，能够简化爬虫系统的开发流程。文中详细展示了依赖配置、数据表结构定义、Model对象创建及Mybatis配置等步骤。

2019独角兽企业重金招聘Python工程师标准>>>

最近关注SeimiCrawler整合Mybatis的朋友比较多，故仅以此文抛砖引玉。如果是不了解SeimiCrawler的朋友也可以通过此文简单了解下SeimiCrawler。

SeimiCrawler简介

SeimiCrawler是一个敏捷的，独立部署的，支持分布式的Java爬虫框架，希望能在最大程度上降低新手开发一个可用性高且性能不差的爬虫系统的门槛，以及提升开发爬虫系统的开发效率。在SeimiCrawler的世界里，绝大多数人只需关心去写抓取的业务逻辑就够了，其余的Seimi帮你搞定。设计思想上SeimiCrawler受Python的爬虫框架Scrapy启发，同时融合了Java语言本身特点与Spring的特性，并希望在国内更方便且普遍的使用更有效率的XPath解析HTML，所以SeimiCrawler默认的HTML解析器是JsoupXpath(独立扩展项目，非jsoup自带),默认解析提取HTML数据工作均使用XPath来完成（当然，数据处理亦可以自行选择其他解析器）。并结合SeimiAgent彻底完美解决复杂动态页面渲染抓取问题。

项目源码

Github托管

下面正式开始整合Mybatis的内容。数据库以MySQL为例。

依赖

<dependency>
	<groupId>cn.wanghaomiao</groupId>
          <artifactId>SeimiCrawler</artifactId>
          <version>1.2.0</version>
</dependency>
<dependency>
	<groupId>org.apache.commons</groupId>
	<artifactId>commons-dbcp2</artifactId>
	<version>2.1.1</version>
</dependency>
<dependency>
	<groupId>org.apache.commons</groupId>
	<artifactId>commons-pool2</artifactId>
	<version>2.4.2</version>
</dependency>
<dependency>
	<groupId>mysql</groupId>
	<artifactId>mysql-connector-java</artifactId>
	<version>5.1.37</version>
</dependency>
<dependency>
	<groupId>org.mybatis</groupId>
	<artifactId>mybatis-spring</artifactId>
	<version>1.3.0</version>
</dependency>
<dependency>
	<groupId>org.mybatis</groupId>
	<artifactId>mybatis</artifactId>
	<version>3.4.1</version>
</dependency>

数据表结构

假设建有数据库，库名为xiaohuo，内含表结构如下：

CREATE TABLE `blog` (
  `id` int(11) NOT NULL AUTO_INCREMENT,
  `title` varchar(300) DEFAULT NULL,
  `content` text,
  `update_time` timestamp NOT NULL DEFAULT '0000-00-00 00:00:00' ON UPDATE CURRENT_TIMESTAMP,
  PRIMARY KEY (`id`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4;

对应的Model对象

package cn.wanghaomiao.model;

import cn.wanghaomiao.seimi.annotation.Xpath;
import org.apache.commons.lang3.StringUtils;
import org.apache.commons.lang3.builder.ToStringBuilder;

/**
 * Xpath语法可以参考 http://jsoupxpath.wanghaomiao.cn/
 * @since 2015/10/27.
 */
public class BlogContent {
    private Integer id;

    @Xpath("//h1[@class='postTitle']/a/text()|//a[@id='cb_post_title_url']/text()")
    private String title;

    //也可以这么写 @Xpath("//div[@id='cnblogs_post_body']//text()")
    @Xpath("//div[@id='cnblogs_post_body']/allText()")
    private String content;

    public Integer getId() {
        return id;
    }

    public void setId(Integer id) {
        this.id = id;
    }

    public String getTitle() {
        return title;
    }

    public void setTitle(String title) {
        this.title = title;
    }

    public String getContent() {
        return content;
    }

    public void setContent(String content) {
        this.content = content;
    }

    @Override
    public String toString() {
        if (StringUtils.isNotBlank(content)&&content.length()>100){
            //方便查看截断下
            this.content = StringUtils.substring(content,0,100)+"...";
        }
        return ToStringBuilder.reflectionToString(this);
    }
}

整合Mybatis的配置文件

resources下添加 mybatis-config.xml文件

一些基本的全局设置

<?xml version="1.0" encoding="UTF-8" ?>
<!DOCTYPE configuration
        PUBLIC "-//mybatis.org//DTD Config 3.0//EN"
        "http://mybatis.org/dtd/mybatis-3-config.dtd">
<configuration>
    <settings>
        <setting name="mapUnderscoreToCamelCase" value="true"/>
    </settings>
</configuration>

resources下添加seimi-mybatis.xml文件

<?xml version="1.0" encoding="UTF-8"?>
<beans xmlns="http://www.springframework.org/schema/beans"
       xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
       xmlns:context="http://www.springframework.org/schema/context"
       xsi:schemaLocation="http://www.springframework.org/schema/beans http://www.springframework.org/schema/beans/spring-beans.xsd http://www.springframework.org/schema/context http://www.springframework.org/schema/context/spring-context.xsd">


    <context:annotation-config />
    <bean id="mybatisDataSource" class="org.apache.commons.dbcp2.BasicDataSource">
        <property name="driverClassName" value="${database.driverClassName}"/>
        <property name="url" value="${database.url}"/>
        <property name="username" value="${database.username}"/>
        <property name="password" value="${database.password}"/>
    </bean>

    <bean id="sqlSessionFactory" class="org.mybatis.spring.SqlSessionFactoryBean" abstract="true">
        <property name="configLocation" value="classpath:mybatis-config.xml"/>
    </bean>
    <bean id="seimiSqlSessionFactory" parent="sqlSessionFactory">
        <property name="dataSource" ref="mybatisDataSource"/>
    </bean>
    <bean class="org.mybatis.spring.mapper.MapperScannerConfigurer">
        <property name="basePackage" value="cn.wanghaomiao.dao.mybatis"/>
        <property name="sqlSessionFactoryBeanName" value="seimiSqlSessionFactory"/>
    </bean>
</beans>

配置文件中的${database.driverClassName}是由于SeimiCrawler的demo工程还有动态配置的相关设置，此处亦可直接写死，不必再读其他配置。

在cn.wanghaomiao.dao.mybatis目录下添加DAO

package cn.wanghaomiao.dao.mybatis;

import cn.wanghaomiao.model.BlogContent;
import org.apache.ibatis.annotations.Insert;
import org.apache.ibatis.annotations.Options;
import org.apache.ibatis.annotations.Param;

/**
 * @since 2016/7/27.
 */
public interface MybatisStoreDAO {

    @Insert("insert into blog (title,content,update_time) values (#{blog.title},#{blog.content},now())")
    @Options(useGeneratedKeys = true, keyProperty = "blog.id")
    int save(@Param("blog") BlogContent blog);
}

至此，Mybatis部分的已经就绪了。

使用

package cn.wanghaomiao.crawlers;

import cn.wanghaomiao.dao.mybatis.MybatisStoreDAO;
import cn.wanghaomiao.model.BlogContent;
import cn.wanghaomiao.seimi.annotation.Crawler;
import cn.wanghaomiao.seimi.def.BaseSeimiCrawler;
import cn.wanghaomiao.seimi.struct.Request;
import cn.wanghaomiao.seimi.struct.Response;
import cn.wanghaomiao.xpath.model.JXDocument;
import org.springframework.beans.factory.annotation.Autowired;

import java.util.List;

/**
 * 将解析出来的数据直接存储到数据库中,整合mybatis实现
 *
 * @author 汪浩淼 [et.tw@163.com]
 * @since 2016/07/27.
 */
@Crawler(name = "mybatis")
public class DatabaseMybatisDemo extends BaseSeimiCrawler {
    @Autowired
    private MybatisStoreDAO storeToDbDAO;

    @Override
    public String[] startUrls() {
        return new String[]{"http://www.cnblogs.com/"};
    }

    @Override
    public void start(Response response) {
        JXDocument doc = response.document();
        try {
            List<Object> urls = doc.sel("//a[@class='titlelnk']/@href");
            logger.info("{}", urls.size());
            for (Object s : urls) {
                push(Request.build(s.toString(), "renderBean"));
            }
        } catch (Exception e) {
            e.printStackTrace();
        }
    }

    public void renderBean(Response response) {
        try {
            BlogContent blog = response.render(BlogContent.class);
            logger.info("bean resolve res={},url={}", blog, response.getUrl());
            //使用神器paoding-jade存储到DB
            int changeNum = storeToDbDAO.save(blog);
            int blogId = blog.getId();
            logger.info("store success,blogId = {},changeNum={}", blogId, changeNum);
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}

接下来简单启动下，

public class Boot {
    public static void main(String[] args){
        Seimi s = new Seimi();
        s.start("mybatis");
    }
}

可以看到如下日志：

00:25:18 INFO  c.w.crawlers.DatabaseMybatisDemo - store success,blogId = 257,changeNum=1
00:25:18 INFO  c.w.crawlers.DatabaseMybatisDemo - bean resolve res=cn.wanghaomiao.model.BlogContent@3edc08c3[id=<null>,title=CoordinatorLayout自定义Bahavior特效及其源码分析CoordinatorLayout自定义Bahavior特效及其源码分析,content=@[CoordinatorLayout, Bahavior] CoordinatorLayout是android support design包中可以算是最重要的一个东西，运用它可以做出一些不错的特效...],url=http://www.cnblogs.com/soaringEveryday/p/5711545.html
00:25:18 INFO  c.w.crawlers.DatabaseMybatisDemo - store success,blogId = 258,changeNum=1
00:25:18 INFO  c.w.crawlers.DatabaseMybatisDemo - store success,blogId = 259,changeNum=1
00:25:18 INFO  c.w.crawlers.DatabaseMybatisDemo - store success,blogId = 260,changeNum=1

整合完毕！