Example code
import java.util.Arrays;
import java.util.List;

import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.RowFactory;
import org.apache.spark.sql.SQLContext;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.StructField;
import org.apache.spark.sql.types.StructType;

import static org.apache.spark.sql.functions.col;

public class TextSearch {
    public static void main(String[] args) {
        // Loads master/appName from the configuration passed by spark-submit.
        JavaSparkContext sc = new JavaSparkContext();
        JavaRDD<String> textFile = sc.textFile("/home/spark_work/df.txt");
        // Wrap each line in a Row so a schema can be attached.
        JavaRDD<Row> rowRDD = textFile.map(RowFactory::create);
        // Single nullable string column named "line".
        List<StructField> fields = Arrays.asList(
                DataTypes.createStructField("line", DataTypes.StringType, true));
        StructType schema = DataTypes.createStructType(fields);
        SQLContext sqlContext = new SQLContext(sc);
        Dataset<Row> df = sqlContext.createDataFrame(rowRDD, schema);
        // Keep only the lines containing "INFO".
        Dataset<Row> infos = df.filter(col("line").like("%INFO%"));
        long count = infos.count();
        System.out.println("########################Text Search########################");
        System.out.println(String.format("info count: %d", count));
        df.printSchema();
        System.out.println("########################Text Search########################");
        sc.stop();
    }
}
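Besides the DataFrame API, the same filter can be expressed as a SQL query against a temporary view. A minimal sketch, reusing the df and sqlContext built in main above (the view name logs is arbitrary):

// Register the DataFrame as a temporary view so it can be queried with SQL.
df.createOrReplaceTempView("logs");
// Equivalent to df.filter(col("line").like("%INFO%")).
Dataset<Row> infosViaSql = sqlContext.sql(
        "SELECT line FROM logs WHERE line LIKE '%INFO%'");
System.out.println(String.format("info count (SQL): %d", infosViaSql.count()));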
Maven dependencies
Make sure the Spark version on the remote cluster matches your local dependencies exactly, and that every Spark artifact uses the same Scala build suffix (here _2.11).
<dependencies>
<dependency>
<groupId>org.projectlombok</groupId>
<artifactId>lombok</artifactId>
<version>1.18.8</version>
<scope>provided</scope>
</dependency>
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-core_2.11</artifactId>
<version>2.4.3</version>
</dependency>
<!-- https://mvnrepository.com/artifact/org.apache.spark/spark-sql -->
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-sql_2.11</artifactId>
<version>2.4.3</version>
</dependency>
</dependencies>
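If the job pulls in libraries beyond Spark itself (lombok above is compile-time only), the plain jar produced by mvn package will not contain them, since the cluster only provides Spark's own classes at runtime. One common approach is to build a fat jar; a minimal maven-shade-plugin sketch, where the plugin version is an assumption:

<build>
  <plugins>
    <!-- Bundle project dependencies into a single runnable jar at package time. -->
    <plugin>
      <groupId>org.apache.maven.plugins</groupId>
      <artifactId>maven-shade-plugin</artifactId>
      <version>3.2.1</version>
      <executions>
        <execution>
          <phase>package</phase>
          <goals>
            <goal>shade</goal>
          </goals>
        </execution>
      </executions>
    </plugin>
  </plugins>
</build>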
Packaging the code
mvn clean package
Uploading the jar
First copy the jar to the server; one way is scp, sketched below. Then submit it to Spark with the spark-submit command that follows, adjusting the class name and paths for your own code.
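A minimal copy command, assuming the jar was built under target/ (the user, host, and target directory are placeholders):

scp target/XXX.jar user@ip:/home/spark_work/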
spark-submit \
  --class com.bigdata.examples.TextSearch \
  --master spark://ip:port \
  --executor-memory 512M \
  --driver-java-options "-Xdebug -Xrunjdwp:transport=dt_socket,server=y,suspend=y,address=5005" \
  /home/spark_work/XXX.jar
Here ip and port are the host and port of the Spark master. address=5005 makes the driver JVM listen on port 5005 for a remote debugger, and suspend=y means the JVM waits for the debugger to attach before running.
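Because suspend=y blocks the job until a debugger connects, drop the --driver-java-options flag for a normal run; the rest of the command is unchanged:

spark-submit \
  --class com.bigdata.examples.TextSearch \
  --master spark://ip:port \
  --executor-memory 512M \
  /home/spark_work/XXX.jar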