1. Install Scala
Download the Scala 2.10.4 Windows zip package and extract it to the D: drive.
Set SCALA_HOME and add it to PATH.
2. Copy the Spark distribution from the CentOS cluster to the local D: drive.
Set SPARK_HOME and SPARK_CLASSPATH:
D:\spark-1.3.0-bin-2.5.0\lib\spark-assembly-1.3.0-hadoop2.5.0.jar
3. Delete the previous local Hadoop 2.7.2 and download the 2.5.0 CDH build used on CentOS to the local D: drive.
Re-download the Windows plugin and tools:
hadoop-eclipse-plugin-2.5.2 hadoop2.5.2(x64).zip
(My CSDN annual subscription had expired, so I had to renew it before I could download these...)
4. Reconfigure HADOOP_HOME.
5. Copy the bin folder from the extracted Windows tools into Hadoop 2.5.0's bin directory.
6. Copy hadoop-eclipse-plugin-2.5.2.jar into Eclipse's plugins directory and restart Eclipse.
7. Install the Scala IDE in Eclipse:
http://download.scala-ide.org/sdk/lithium/e44/scala211/dev/update-site.zip
Extract the zip and install it in Eclipse from the local archive.
8. Test a Scala project.
Create a new Scala project, sparktestscala, with a wordcount package.
Right-click the project → Configure → Convert to Maven Project.
Configure the pom:
<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
  xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
  <modelVersion>4.0.0</modelVersion>
  <groupId>lining</groupId>
  <artifactId>spark</artifactId>
  <version>0.0.1-SNAPSHOT</version>
  <dependencies>
    <dependency>
      <groupId>jdk.tools</groupId>
      <artifactId>jdk.tools</artifactId>
      <version>1.7</version>
      <scope>system</scope>
      <systemPath>${JAVA_HOME}/lib/tools.jar</systemPath>
    </dependency>
    <dependency>
      <groupId>org.apache.spark</groupId>
      <artifactId>spark-core_2.10</artifactId>
      <version>1.3.0</version>
    </dependency>
    <dependency>
      <groupId>org.apache.spark</groupId>
      <artifactId>spark-sql_2.10</artifactId>
      <version>1.3.0</version>
      <!-- <scope>provided</scope> -->
    </dependency>
    <dependency>
      <groupId>org.apache.spark</groupId>
      <artifactId>spark-hive_2.10</artifactId>
      <version>1.3.0</version>
      <!-- <scope>provided</scope> -->
    </dependency>
    <dependency>
      <groupId>org.scala-lang</groupId>
      <artifactId>scala-library</artifactId>
      <version>2.10.6</version>
    </dependency>
    <dependency>
      <groupId>org.scala-lang</groupId>
      <artifactId>scala-compiler</artifactId>
      <version>2.10.6</version>
    </dependency>
    <dependency>
      <groupId>org.scala-lang</groupId>
      <artifactId>scala-reflect</artifactId>
      <version>2.10.6</version>
    </dependency>
    <dependency>
      <groupId>org.apache.hadoop</groupId>
      <artifactId>hadoop-common</artifactId>
      <version>2.5.0</version>
    </dependency>
    <dependency>
      <groupId>org.apache.hadoop</groupId>
      <artifactId>hadoop-hdfs</artifactId>
      <version>2.5.0</version>
    </dependency>
    <dependency>
      <groupId>log4j</groupId>
      <artifactId>log4j</artifactId>
      <version>1.2.17</version>
    </dependency>
    <dependency>
      <groupId>mysql</groupId>
      <artifactId>mysql-connector-java</artifactId>
      <version>5.1.38</version>
    </dependency>
    <dependency>
      <groupId>org.apache.hive</groupId>
      <artifactId>hive-jdbc</artifactId>
      <version>1.1.0</version>
    </dependency>
  </dependencies>
  <build>
    <sourceDirectory>src</sourceDirectory>
    <plugins>
      <plugin>
        <artifactId>maven-compiler-plugin</artifactId>
        <version>3.3</version>
        <configuration>
          <source>1.7</source>
          <target>1.7</target>
        </configuration>
      </plugin>
      <plugin>
        <groupId>net.alchim31.maven</groupId>
        <artifactId>scala-maven-plugin</artifactId>
        <version>3.2.0</version>
      </plugin>
    </plugins>
  </build>
</project>
9. Set the project's Scala compiler to 2.10 (dynamic).
10. Write the WordCount object:
package wordcount

import org.apache.spark.SparkConf
import org.apache.spark.SparkContext

object WordCount {
  def main(args: Array[String]) {
    // Run on YARN in client mode; the input/output paths are on the cluster.
    val conf = new SparkConf()
    conf.setAppName("Sparkwordcountscala")
    conf.setMaster("yarn-client")
    val sc = new SparkContext(conf)

    val lines = sc.textFile("/opt/test/wordcount.txt")
    val words = lines.flatMap { line => line.split(" ") } // split each line into words
    val pairs = words.map { word => (word, 1) }           // pair each word with a count of 1
    val wordCounts = pairs.reduceByKey(_ + _)             // sum the counts per word
    wordCounts.saveAsTextFile("/opt/test/wordcountresult")
  }
}
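Before submitting to YARN, it can be handy to smoke-test the same logic from Eclipse on Windows with a local master and local files. This is only a sketch: the WordCountLocal object name and the D:/test paths are made up, and "local[*]" simply runs on all local cores.
// Local smoke-test variant of the job above (hypothetical object name and local paths).
import org.apache.spark.{SparkConf, SparkContext}

object WordCountLocal {
  def main(args: Array[String]) {
    val conf = new SparkConf().setAppName("SparkwordcountLocal").setMaster("local[*]")
    val sc = new SparkContext(conf)
    sc.textFile("D:/test/wordcount.txt")           // assumed local input file
      .flatMap(_.split(" "))
      .map((_, 1))
      .reduceByKey(_ + _)
      .saveAsTextFile("D:/test/wordcountresult")   // assumed local output directory
    sc.stop()
  }
}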
11. Build the jar with maven install.
12. Copy the jar to the Spark cluster and run it:
spark-submit --class wordcount.WordCount /opt/test/spark-0.0.1-SNAPSHOT.jar
13. Load the result data into Hive:
hive> load data inpath "/opt/test/wordcountresult" overwrite into table sparkresult;
14. Check the counts:
hive> select * from sparkresult;
OK
(gdsgsd,4)
(gdsfdsg,4)
(dsgadf,4)
(lingds,2)
(sdfgds,4)
(sdfgsdf,3)
(hrdgf,3)
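As shown above, each row is stored as the tuple's toString, i.e. "(word,count)" in a single column. If the Hive table should see two separate columns, one option is to format each pair as a tab-separated line before saving and declare the table with FIELDS TERMINATED BY '\t'. A minimal sketch; it reuses the wordCounts RDD from the WordCount object above, and the _tsv output path is made up.
// Sketch: write "word<TAB>count" lines instead of "(word,count)" tuples, so a Hive
// table declared with ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' splits them
// into two columns. The output path is an assumption.
val tsv = wordCounts.map { case (word, count) => word + "\t" + count }
tsv.saveAsTextFile("/opt/test/wordcountresult_tsv")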
15. Sort wordcountresult by word frequency:
package wordcount

import org.apache.spark.SparkConf
import org.apache.spark.SparkContext

object WordCount {
  def main(args: Array[String]) {
    val conf = new SparkConf()
    conf.setAppName("Sparkwordcountscala")
    conf.setMaster("yarn-client")
    val sc = new SparkContext(conf)

    val lines = sc.textFile("/opt/test/wordcount.txt")
    val words = lines.flatMap { line => line.split(" ") }
    val pairs = words.map { word => (word, 1) }
    val wordCounts = pairs.reduceByKey((x, y) => x + y)
    // Swap to (count, word) so sortByKey orders by frequency (ascending by default;
    // sortByKey(false) would sort descending), then swap back to (word, count).
    val wordcount02 = wordCounts.map(x => (x._2, x._1))
    val wordcount03 = wordcount02.sortByKey()
    val wordcount04 = wordcount03.map(x => (x._2, x._1))
    wordcount04.saveAsTextFile("/opt/test/wordcountsorted")
  }
}
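The swap → sortByKey → swap-back pattern above works; RDD.sortBy (available since Spark 1.0) expresses the same thing more compactly, and ascending = false puts the highest frequencies first. A sketch reusing the wordCounts RDD from the code above; the output path is made up.
// Sort the (word, count) pairs directly on the count field.
// ascending = false gives the most frequent words first.
val sortedDesc = wordCounts.sortBy(_._2, ascending = false)
sortedDesc.saveAsTextFile("/opt/test/wordcountsorted_desc")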
16. Test Spark SQL.
Configure hive-site.xml and add the metastore service address:
<property>
  <name>hive.metastore.uris</name>
  <value>thrift://lining05:9083</value>
</property>
Start the metastore service:
hive --service metastore &
17. Read the Hive table lining_test.mysql_member and create a DataFrame:
scala> val members = sqlContext.table("lining_test.mysql_member")
20/08/03 16:52:12 INFO hive.metastore: Trying to connect to metastore with URI thrift://lining05:9083
20/08/03 16:52:12 INFO hive.metastore: Connected to metastore.
20/08/03 16:52:12 INFO session.SessionState: No Tez session required at this point. hive.execution.engine=mr.
members: org.apache.spark.sql.DataFrame = [member_id: int, member_name: string, member_tel: string, member_birthday: string, member_sex: string, member_password: string, member_wxopenid: string, member_address: string, member_certificate: string, member_status: string, member_isadmin: string, member_ismanager: string, member_isdev: string, member_isteacher: string, member_isstudent: string, member_issaler: string]
scala> members.show()
member_id member_name member_tel member_birthday member_sex member_password member_wxopenid member_address member_certificate member_status member_isadmin member_ismanager member_isdev member_isteacher member_isstudent member_issaler
100003 徐子昂(顺顺) 15117901199 2014-02-10 男 123 null null null null null null
100005 安子骁 13911893905 2012-09-10 男 123 null null null null null null
100007 李东昊 15600602525 2011-01-07 男 123 null null null null null null
100008 赵子一 null 2012-02-21 男 123 删除 null null null null null null
100011 张三 null 2012-02-26 男 123 删除 null null null null null null
100013 刘泽铭(铭铭) 15801526279 2015-10-17 男 123 null null null null null null
100015 陆一鸣(王子) 18611571120 2015-01-23 男 123 null null null null null null
100016 tony 13716626053 2014-02-26 男 123 null null null null null null
100019 赵雅彤 18210319153 2013-04-11 女 123 null null null null null null
100021 李康宇 18610066815 2016-06-20 男 123 null null null null null null
100022 胡祥睿 13810538426 2015-05-11 男 123 null null null null null null
100024 姜昱洲 13260172546 2014-07-29 男 123 null null null null null null
100025 橙汁 15210888113 2016-04-28 男 123 null null null null null null
100026 宋易锦 13717631760 2012-08-28 男 123 null null null null null null
100027 张延润 13146269378 2012-03-07 男 123 null null null null null null
100028 丁诺一 15901252570 2015-03-22 女 123 null null null null null null
100029 刘乐 18612187699 2012-07-01 男 123 null null null null null null
100030 谷雨京 13810336840 2014-04-19 男 123 null null null null null null
100031 谷雨奇 13810336840 2014-04-19 男 123 null null null null null null
100032 刘婉茹 13661122326 2011-09-11 女 123 null null null null null null
scala>
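Besides sqlContext.table, the same sqlContext (a HiveContext in this shell) also accepts SQL strings directly. A sketch only; the GROUP BY query is just an illustration built on the member_sex column from the schema printed above.
// Run SQL against the Hive table through the same sqlContext.
val bySex = sqlContext.sql(
  "SELECT member_sex, count(*) AS cnt FROM lining_test.mysql_member GROUP BY member_sex")
bySex.show()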
18. Test a Java project.
Create a new Maven simple project, sparksql.
Add a resources folder and copy the core-site, hdfs-site, yarn-site, hive-site, and hbase-site
configuration files into resources.
Also copy the jars under SPARK_HOME/lib into resources.
In the project's Build Path, use Add Folder and Add External JARs to register them.
19. In Window → Preferences → Java → Installed JREs,
add the JDK path and select that JDK under Execution Environments;
otherwise maven install cannot package the Java project.
20. Modify the pom:
<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
  xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
  <modelVersion>4.0.0</modelVersion>
  <groupId>lining</groupId>
  <artifactId>sparksql01</artifactId>
  <version>0.0.1-SNAPSHOT</version>
  <build>
    <plugins>
      <plugin>
        <artifactId>maven-assembly-plugin</artifactId>
        <version>2.2</version>
        <configuration>
          <archive>
            <manifest>
              <!-- main class to run when this jar is executed -->
              <mainClass>lining.sparksql01</mainClass>
            </manifest>
          </archive>
          <descriptorRefs>
            <!-- must be written exactly like this -->
            <descriptorRef>jar-with-dependencies</descriptorRef>
          </descriptorRefs>
        </configuration>
      </plugin>
    </plugins>
  </build>
  <properties>
    <project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
    <spark.version>1.3.0</spark.version>
    <scala.version>2.10</scala.version>
    <hadoop.version>2.5.0</hadoop.version>
  </properties>
  <dependencies>
    <dependency>
      <groupId>jdk.tools</groupId>
      <artifactId>jdk.tools</artifactId>
      <version>1.7</version>
      <scope>system</scope>
      <systemPath>${JAVA_HOME}/lib/tools.jar</systemPath>
    </dependency>
    <dependency>
      <groupId>org.apache.hive</groupId>
      <artifactId>hive-metastore</artifactId>
      <version>1.2.2</version>
    </dependency>
  </dependencies>
</project>
Only hive-metastore needs to be added; the other jars are already included in the Spark assembly jar.
21. Create the Java class sparksql01:
package lining;

import java.util.HashMap;

import org.apache.hadoop.hive.conf.HiveConf;
import org.apache.hadoop.hive.metastore.HiveMetaStoreClient;
import org.apache.hadoop.hive.metastore.api.MetaException;
import org.apache.spark.SparkConf;
import org.apache.spark.SparkContext;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.sql.DataFrame;
import org.apache.spark.sql.SQLContext;
import org.apache.spark.sql.SaveMode;
import org.apache.spark.sql.hive.HiveContext;

public class sparksql01 {
    public static void main(String[] args) {
        // Check that the metastore configured in hive-site.xml is reachable.
        HiveConf hiveConf = new HiveConf();
        hiveConf.addResource("hive-site.xml");
        try {
            HiveMetaStoreClient client = new HiveMetaStoreClient(hiveConf);
        } catch (MetaException e) {
            e.printStackTrace();
        }

        SparkConf conf = new SparkConf();
        conf.setAppName("HiveDFTest");
        conf.setMaster("local");
        SparkContext sc = new SparkContext(conf);
        HiveContext sqlContext = new HiveContext(sc);
        System.out.println("well fffff!");

        // Load the Hive table as a DataFrame and average three card_* columns.
        DataFrame d = sqlContext.table("mysql_card");
        System.out.println("well ddddd!");
        HashMap<String, String> map = new HashMap<>();
        map.put("card_contractcoin", "avg");
        map.put("card_restcoin", "avg");
        map.put("card_restleavecoin", "avg");
        DataFrame d1 = d.agg(map);

        // Insert the single result row into the pre-created Hive table cardavg.
        d1.insertInto("cardavg");
        sc.stop();
    }
}
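For comparison, the same aggregation written against the Scala DataFrame API would look roughly like this. A sketch only: the CardAvg object name is made up, and it assumes the same mysql_card source table and pre-created cardavg target table.
// Scala sketch of the same job: average three card_* columns of mysql_card
// and insert the result row into the pre-created Hive table cardavg.
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.hive.HiveContext

object CardAvg {
  def main(args: Array[String]) {
    val sc = new SparkContext(new SparkConf().setAppName("HiveDFTest").setMaster("local"))
    val sqlContext = new HiveContext(sc)
    val avg = sqlContext.table("mysql_card").agg(Map(
      "card_contractcoin"  -> "avg",
      "card_restcoin"      -> "avg",
      "card_restleavecoin" -> "avg"))
    avg.insertInto("cardavg")
    sc.stop()
  }
}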
22. Use Eclipse's Export to package everything into a runnable jar, then upload it to /opt/test on lining05.
23. Run the jar with spark-submit:
spark-submit --class lining.sparksql01 /opt/test/liningsparksql.jar
24. Check the exported statistics in the Hive table cardavg.
Note: the table must be created in Hive first and the data then exported with the DataFrame's insertInto method; otherwise the written files are in Parquet format and Hive cannot query them.
CREATE TABLE `cardavg`(
`contractcoin` int,
`restcoin` int,
`restleavecoin` int)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' ;
hive> select * from cardavg;