1. Data classification
① Unstructured data: text, multimedia
② Structured data: databases, formatted text
③ Semi-structured data: key-value pairs, XML, tagged documents
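For example, the people.json sample that ships with Spark (used below) is semi-structured: each line is a self-describing JSON record, and the set of fields may differ from record to record, e.g.:
{"name":"Andy", "age":30}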
2.DataFrame和RDD的区别
DataFrame是带Schema的RDD
创建DataFrame的方法:
scala>val ssc = new org.apache.spark.sql.SQLContext(sc)
scala>val df = ssc.read.json("/home/hadoop/app/spark-1.6.3-bin-hadoop2.6/examples/src/main/resources/people.json")
scala>df.show
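Both halves of that definition can be inspected on the DataFrame itself (a brief sketch, assuming the df created above):
scala> df.schema    // the attached schema, a StructType
scala> df.rdd       // the underlying RDD[Row]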
3. Operations supported by DataFrame
① explain: prints the execution plan, which helps with analyzing and optimizing a query
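For example (a brief sketch using the df from section 2):
df.explain(true)   // true also prints the extended logical plans, not just the physical plan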
② select
df.select(df("name"),df("age")+1).show()
③ filter
df.filter(df("age")>=18).show()
④ groupBy
df.groupBy(df("age")).count().show()
⑤ View the table's schema
df.printSchema()
4. Converting an RDD to a DataFrame
① Creating a DataFrame via reflection-based schema inference
scala> val sqlContext = new org.apache.spark.sql.SQLContext(sc)
scala> import sqlContext.implicits._
scala> case class Person(name: String, age: Int)
scala> val people = sc.textFile("/home/hadoop/app/spark-1.6.3-bin-hadoop2.6/examples/src/main/resources/people.txt").map(_.split(",")).map(p=>Person(p(0),p(1).trim.toInt)).toDF()
scala> people.registerTempTable("people")
scala> val teenagers = sqlContext.sql("SELECT name, age FROM people WHERE age>=13 AND age<20")
scala> teenagers.map(t=>"Name is " + t(0)).collect().foreach(println)
scala> teenagers.map(t=>"Name is " + t.getAs[String]("name")).collect().foreach(println)
scala> teenagers.map(_.getValuesMap[Any](List("name","age"))).collect().foreach(println)
② Creating a DataFrame from a Row RDD with a programmatically specified schema
scala> val sqlContext = new org.apache.spark.sql.SQLContext(sc)
scala> val people = sc.textFile("/home/hadoop/app/spark-1.6.3-bin-hadoop2.6/examples/src/main/resources/people.txt")
scala> val schemaString = "name age"
scala> import org.apache.spark.sql.Row
scala> import org.apache.spark.sql.types.{StructType,StructField,StringType}
scala> val schema = StructType(schemaString.split(" ").map(fieldName=>StructField(fieldName,StringType,true)))
scala> val rowRDD = people.map(_.split(",")).map(p=>Row(p(0),p(1).trim))
scala> val peopleDataFrame = sqlContext.createDataFrame(rowRDD,schema)
scala> peopleDataFrame.registerTempTable("people")
scala> val result = sqlContext.sql("SELECT name FROM people")
scala> result.map(t=>"Name:"+t(0)).collect().foreach(println)
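In the example above every field is declared as StringType; a variant with typed fields could look like this (a sketch reusing the people RDD and sqlContext defined above; using IntegerType for age is an assumption about the data):
scala> import org.apache.spark.sql.types.IntegerType
scala> val typedSchema = StructType(Array(StructField("name",StringType,true),StructField("age",IntegerType,true)))   // age as IntegerType is assumed
scala> val typedRowRDD = people.map(_.split(",")).map(p=>Row(p(0),p(1).trim.toInt))
scala> val typedDF = sqlContext.createDataFrame(typedRowRDD,typedSchema)
scala> typedDF.printSchema()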
5. Ways to load data sources
① JSON
② Hive
③ JDBC (see the sketch after this list)
④ Other systems such as Flume
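A minimal JDBC read sketch (the URL, database, table, and credentials are placeholders for a MySQL instance; the MySQL driver jar must be on the classpath, as in section 6):
// all connection settings below are placeholders for illustration
val jdbcDF = sqlContext.read.format("jdbc").option("url","jdbc:mysql://hadoop000:3306/testdb").option("dbtable","people").option("user","root").option("password","123456").option("driver","com.mysql.jdbc.Driver").load()
jdbcDF.show()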
6. Loading data with Hive as the data source
Prerequisites: Hive is installed, MySQL stores the Hive metastore, and Hive's data is stored on HDFS.
① Copy Hive's hive-site.xml configuration file into Spark's conf directory.
② Make sure hive-site.xml specifies Hive's warehouse location on HDFS, because Spark's default storage path is a local path (file:/user/hive/warehouse) rather than a path on HDFS (hdfs://hadoop000:8020/user/hive/warehouse):
<property>
<name>hive.metastore.warehouse.dir</name>
<value>hdfs://hadoop000:8020/user/hive/warehouse</value>
</property>
③ To load data from Hive in spark-shell, the MySQL JDBC driver has to be on the classpath. One way is to pass it on the command line when starting spark-shell:
spark-shell --jars ~/app/hive-1.1.0-cdh5.7.0/lib/mysql-connector-java-5.1.27-bin.jar
or add the following line to spark-env.sh:
export SPARK_CLASSPATH=/home/hadoop/app/hive-1.1.0-cdh5.7.0/lib/mysql-connector-java-5.1.27-bin.jar
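Note: in a Hive-enabled Spark build, the sqlContext provided by spark-shell is already a HiveContext, so the statements below work as written; otherwise a HiveContext can be created explicitly (a one-line sketch):
val hiveContext = new org.apache.spark.sql.hive.HiveContext(sc)   // then issue the SQL below through hiveContext.sql(...)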
④ Create a table
sqlContext.sql("CREATE TABLE IF NOT EXISTS src(key INT, value STRING)")
⑤ Load data into the table
sqlContext.sql("LOAD DATA LOCAL INPATH '/home/hadoop/app/spark-1.6.3-bin-hadoop2.6/examples/src/main/resources/kv1.txt' INTO TABLE src")
⑥ Query the data
sqlContext.sql("SELECT key,value FROM src").collect.foreach(println)
7. Defining a UDF
sqlContext.udf.register("strLen",(s:String)=>s.length())
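A minimal usage sketch, assuming the people temp table registered in section 4 is still available in the same sqlContext:
sqlContext.sql("SELECT name, strLen(name) FROM people").show()   // strLen is the UDF registered above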