- Querying data from Elasticsearch
Elasticsearch was running on JDK 1.7 while Spark was installed with JDK 1.8; it took two days of tinkering before Spark could finally talk to ES. You also need the elasticsearch-spark jar.
Create a new Maven project and add the dependencies:
<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-core_2.11</artifactId>
    <version>2.2.0</version>
</dependency>
<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-sql_2.11</artifactId>
    <version>2.2.0</version>
    <scope>provided</scope>
</dependency>
<dependency>
    <groupId>org.elasticsearch</groupId>
    <artifactId>elasticsearch-spark-20_2.11</artifactId>
    <version>5.5.1</version>
    <exclusions>
        <exclusion>
            <artifactId>log4j-over-slf4j</artifactId>
            <groupId>org.slf4j</groupId>
        </exclusion>
    </exclusions>
</dependency>
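Note that the _2.11 suffix on each artifact is the Scala version; it must match the Scala build of your Spark installation, and elasticsearch-spark-20 is the variant built for Spark 2.x.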
In fact we can access ES either through the elasticsearch-spark package or through Spark SQL using plain SQL. Let's look at the first approach.
- Method 1: JavaEsSpark
The code:
// Term query on the bizCode field
String query = "{\"query\":{\"term\":{\"bizCode\": \"140000000040\"}}}";
JavaPairRDD<String, Map<String, Object>> esRDD = JavaEsSpark.esRDD(sc, "cmall_order/order", query);
long count = esRDD.count();
System.out.println("total count:" + count);
The query can be supplied in three forms:
# uri (or parameter) query
es.query = ?q=costinl
# query dsl
es.query = { "query" : { "term" : { "user" : "costinl" } } }
# external resource
es.query = org/mypackage/myquery.json
The official docs recommend the third form: put the query in a JSON file and point es.query at its path; the JSON file gets packaged into the jar as a resource.
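A minimal sketch of the third form, assuming the query above is saved as src/main/resources/org/mypackage/myquery.json (the resource path follows the docs' example; the index and nodes reuse the values from the other snippets):
// myquery.json, packaged into the jar as a classpath resource:
//   { "query": { "term": { "bizCode": "140000000040" } } }
SparkConf conf = new SparkConf()
        .setAppName("es-query-from-resource")
        .set("es.nodes", "10.37.154.83")
        .set("es.port", "9200")
        .set("es.query", "org/mypackage/myquery.json"); // resource path, not inline DSL
JavaSparkContext sc = new JavaSparkContext(conf);
JavaPairRDD<String, Map<String, Object>> esRDD = JavaEsSpark.esRDD(sc, "cmall_order/order");
System.out.println("total count:" + esRDD.count());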
- Method 2: Spark SQL
private static void read_es_sql(JavaSparkContext sc) {
    SparkSession spark = SparkSession
            .builder()
            .appName("Java Spark SQL basic example")
            .config("pushdown", "true")           // push select/filter down to ES
            .config("es.nodes", "10.37.154.83")
            .config("es.port", "9200")
            .getOrCreate();
    // Load the index/type as a DataFrame; requires
    // import static org.apache.spark.sql.functions.col;
    Dataset<Row> rows = spark.read().format("org.elasticsearch.spark.sql").load("mydb/order")
            .select(col("skuId"), col("orderId")).filter(col("skuId").equalTo("140000000040"));
    rows.show();
    long count = rows.count();
    System.out.println("total count:" + count);
    // Register an ES-backed DataFrame as a view and query it with SQL
    Encoder<TestBean> testEncoder = Encoders.bean(TestBean.class);
    Dataset<Row> df = spark.read().format("org.elasticsearch.spark.sql").load("test/testbean");
    df.createOrReplaceGlobalTempView("table1");
    Dataset<TestBean> selects = spark.sql("SELECT myid,name,age FROM global_temp.table1 WHERE age > 13").as(testEncoder);
    selects.show();
    spark.close();
}
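The snippets reference a TestBean POJO whose source isn't shown. Here is a minimal sketch reconstructed from the constructor calls and the SQL query (the field names myid, name, age come from the snippets; everything else is assumption). Encoders.bean requires a public no-arg constructor plus getters and setters:
import java.io.Serializable;

// Hypothetical reconstruction of TestBean; fields inferred from the snippets
public class TestBean implements Serializable {
    private String myid;
    private String name;
    private int age;

    public TestBean() {}  // required by Encoders.bean

    public TestBean(String myid, String name, int age) {
        this.myid = myid;
        this.name = name;
        this.age = age;
    }

    public String getMyid() { return myid; }
    public void setMyid(String myid) { this.myid = myid; }
    public String getName() { return name; }
    public void setName(String name) { this.name = name; }
    public int getAge() { return age; }
    public void setAge(int age) { this.age = age; }
}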
- Writing data to Elasticsearch
- Method 1: writing via JavaEsSpark
SparkConf sparkConf = new SparkConf();
sparkConf.setAppName("Demo_Mysql2");
sparkConf.set("pushdown", "true");
sparkConf.set("es.nodes", "10.37.154.83");
sparkConf.set("es.port", "9200");
JavaSparkContext sc = null;
try {
    sc = new JavaSparkContext(sparkConf);
    insert_es(sc);     // do the write while the context is still open
} catch (Exception e) {
    e.printStackTrace();
} finally {
    if (sc != null) {
        sc.stop();     // stop the context only after the work is done
    }
}
private static void insert_es(JavaSparkContext sc) {
    TestBean b1 = new TestBean("1", "name1", 12);
    TestBean b2 = new TestBean("2", "name3", 34);
    JavaRDD<TestBean> javaRDD = sc.parallelize(ImmutableList.of(b1, b2));
    // Use the bean's myid field as the ES document _id
    Map<String, String> map = new HashMap<String, String>();
    map.put("es.mapping.id", "myid");
    JavaEsSpark.saveToEs(javaRDD, "test/testbean", map);
}
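JavaEsSpark can also index pre-serialized JSON strings with saveJsonToEs, which skips the bean-to-document conversion; a minimal sketch reusing the same index:
// Each element must be one complete JSON document
JavaRDD<String> jsonRDD = sc.parallelize(ImmutableList.of(
        "{\"myid\":\"3\",\"name\":\"name3\",\"age\":56}"));
JavaEsSpark.saveJsonToEs(jsonRDD, "test/testbean");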
- Method 2: writing via Spark SQL
// Create the SparkSession
SparkSession spark = SparkSession.builder()
        .appName("Java Spark SQL basic example")
        .config("pushdown", "true")
        .config("es.nodes", "10.37.154.83")
        .config("es.port", "9200")
        .getOrCreate();
// Bean encoder for TestBean
Encoder<TestBean> testEncoder = Encoders.bean(TestBean.class);
TestBean bean1 = new TestBean("4", "hello", 1222);
Dataset<TestBean> javaBeanDS = spark.createDataset(
        Collections.singletonList(bean1),
        testEncoder
);
// Set the primary key (_id) for ES
Map<String, String> map = new HashMap<String, String>();
map.put("es.mapping.id", "myid");
javaBeanDS.write().mode(SaveMode.Append).format("org.elasticsearch.spark.sql").options(map).save("test/testbean");
spark.close();
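With es.mapping.id set, ES uses myid as the document _id, so rerunning the job overwrites existing documents rather than piling up duplicates under auto-generated ids; leave it unset only when every run should append new documents.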