[Repost] When MongoDB Meets Spark

This post looks at the role MongoDB can play in the Spark ecosystem, shows how MongoDB can stand in for HDFS, and walks through the configuration of the MongoDB Spark Connector with example code.

Who this is for

  • Developers who are already using MongoDB

The traditional Spark ecosystem, and MongoDB's role in it

The traditional Spark ecosystem

(Figure: the Spark ecosystem)

So what role can MongoDB, as a database, play? It can take over the data-storage layer, which is the black HDFS circle in the diagram, as shown below.

The Spark ecosystem after replacing HDFS with MongoDB

(Figure: the Spark + MongoDB ecosystem)

Why replace HDFS with MongoDB

  1. Storage granularity: HDFS stores data as files split into blocks of roughly 64 MB–128 MB, while MongoDB, as a document database, works at the much finer granularity of individual documents.
  2. MongoDB supports indexes, a concept HDFS does not have, so reads are faster.
  3. MongoDB supports inserts, updates, and deletes, making data that has already been written far easier to modify than on HDFS.
  4. HDFS-based access typically responds at minute granularity, while MongoDB usually answers in milliseconds.
  5. If your data already lives in MongoDB, there is no need to copy it into HDFS first.
  6. You can use MongoDB's powerful aggregation framework to filter or pre-process data before it reaches Spark.
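Points 2 and 6 above work together: the Connector can push an aggregation pipeline down into MongoDB, so filtering and grouping happen server-side (using indexes where available) and only the reduced result set is shipped to Spark. A minimal sketch, assuming a `SparkContext` named `sc` that has already been configured with `spark.mongodb.input.uri`:

```scala
import com.mongodb.spark._
import org.bson.Document

// The pipeline below executes inside MongoDB, not in Spark:
// only the per-user counts for type "1" documents reach the RDD.
val preAggregated = MongoSpark.load(sc)
  .withPipeline(Seq(
    Document.parse("{ $match: { type: \"1\" } }"),
    Document.parse("{ $group: { _id: \"$userid\", count: { $sum: 1 } } }")
  ))
```

The field names `type` and `userid` here mirror the example later in this post; substitute your own schema.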

The MongoDB Spark Connector

  1. It supports both reads and writes, so computed results can be written back to MongoDB.
  2. It splits a query into n subtasks: for example, the Connector breaks a single $match into multiple subtasks handed to Spark, avoiding a full read of the data.
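How the query is split in point 2 is configurable. A hedged sketch of the relevant 2.x settings (the host, database, and partition key are placeholders for your own deployment):

```scala
import org.apache.spark.SparkConf

// Partitioner settings from the connector 2.x configuration options.
// MongoSamplePartitioner samples the collection to compute partition
// bounds; partitionSizeMB controls the target size of each partition.
val conf = new SparkConf()
  .set("spark.mongodb.input.uri", "mongodb://host:27017/inputDB.collectionName")
  .set("spark.mongodb.input.partitioner", "MongoSamplePartitioner")
  .set("spark.mongodb.input.partitionerOptions.partitionKey", "_id")
  .set("spark.mongodb.input.partitionerOptions.partitionSizeMB", "64")
```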

MongoDB Spark example code

Goal: count the characters of each type = 1 message and group the totals by userid.

Maven dependency configuration

This example uses version 2.0.0 of mongo-spark-connector_2.11 and version 2.0.2 of spark-core_2.11:

    <dependency>
        <groupId>org.mongodb.spark</groupId>
        <artifactId>mongo-spark-connector_2.11</artifactId>
        <version>2.0.0</version>
    </dependency>

    <dependency>
        <groupId>org.apache.spark</groupId>
        <artifactId>spark-core_2.11</artifactId>
        <version>2.0.2</version>
    </dependency>
Example code
    import com.mongodb.spark._
    import org.apache.spark.{SparkConf, SparkContext}
    import org.bson._


    val conf = new SparkConf()
      .setMaster("local")
      .setAppName("Mingdao-Score")
      // The mongo driver's readPreference is also supported, e.g. to read only from secondaries
      .set("spark.mongodb.input.uri", "mongodb://xxx.xxx.xxx.xxx:27017,xxx.xxx.xxx:27017,xxx.xxx.xxx:27017/inputDB.collectionName")
      .set("spark.mongodb.output.uri", "mongodb://xxx.xxx.xxx.xxx:27017,xxx.xxx.xxx:27017,xxx.xxx.xxx:27017/outputDB.collectionName")

    val sc = new SparkContext(conf)
    // Create the RDD
    val originRDD = MongoSpark.load(sc)

    // Build the query
    // (start and end are java.util.Date values defined elsewhere; dateQuery is a
    // date-range fragment that would be attached to a date field inside a $match)
    val dateQuery = new BsonDocument()
      .append("$gte", new BsonDateTime(start.getTime))
      .append("$lt", new BsonDateTime(end.getTime))
    val matchQuery = new Document("$match", BsonDocument.parse("{\"type\":\"1\"}"))

    // Build the projection
    val projection1 = new BsonDocument("$project", BsonDocument.parse("{\"userid\":\"$userid\",\"message\":\"$message\"}"))

    val aggregatedRDD = originRDD.withPipeline(Seq(matchQuery, projection1))

    // For example, compute the character count of each user's messages
    val rdd1 = aggregatedRDD.keyBy(x=>{
      Map(
        "userid" -> x.get("userid")
      )
    })

    val rdd2 = rdd1.groupByKey.map(t=>{
      (t._1, t._2.map(x => {
        x.getString("message").length
      }).sum)
    })

    rdd2.collect().foreach(x=>{
        println(x)
    })

    // Save the results to the database specified by spark.mongodb.output.uri.
    // MongoSpark.save expects an RDD of Documents, so convert the tuples first
    val resultDocs = rdd2.map(t => {
      new Document("userid", t._1("userid")).append("chars", t._2)
    })
    MongoSpark.save(resultDocs)
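The same aggregation can also be expressed with Spark SQL instead of the RDD operations above. A sketch, assuming spark-sql_2.11 (version 2.0.2) is added to the dependencies alongside spark-core; the connector infers the DataFrame schema by sampling documents:

```scala
import com.mongodb.spark.MongoSpark
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .master("local")
  .appName("Mingdao-Score-SQL")
  .config("spark.mongodb.input.uri", "mongodb://host:27017/inputDB.collectionName")
  .getOrCreate()

// Load the collection as a DataFrame and query it with SQL;
// the filter and grouping mirror the RDD version above
val df = MongoSpark.load(spark)
df.createOrReplaceTempView("messages")
spark.sql(
  "SELECT userid, SUM(LENGTH(message)) AS chars FROM messages " +
  "WHERE type = '1' GROUP BY userid"
).show()
```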

Summary

The MongoDB Connector's documentation only covers basic example code; for the details you will need to read the examples, and some of the source, on GitHub.
