It would be good if our data analysts could simply run a "big SQL" query to process tables coming from MySQL, HBase, or other sources. Another important concern is performance: we should avoid running such "big SQL" directly on MySQL, especially when it joins many tables. One of our analysts once ran a query joining 33 tables, and it effectively ran our MySQL server to death. Another case was much simpler, a "select count(*)" on a table of about 3 million rows, which the analyst reported took more than half an hour.
To resolve this, one good choice is Spark: its Parquet support and DataFrame API address both problems, and Parquet in particular is a good choice from a performance point of view. Here are the steps:
- With sqlContext, the big MySQL tables can be loaded and saved as Parquet files in HDFS. Design this as a daily job. The code could look like the following.
Please notice that this code is based on Spark 1.3. From Spark 1.4 on, please use sqlContext.read instead; a sketch of the 1.4+ API follows the 1.3 code below.
import scala.collection.mutable.HashMap

// number of JDBC partitions, derived from the id range and the desired rows per partition
val partitions = (upper_bound - lower_bound) / lines_each_part

var options: HashMap[String, String] = new HashMap[String, String]()
options.put("driver", "com.mysql.jdbc.Driver")
options.put("url", url)
options.put("dbtable", table)
options.put("lowerBound", lower_bound.toString())
options.put("upperBound", upper_bound.toString())
// partitions are based on lower_bound and upper_bound
options.put("numPartitions", partitions.toString())
// id is the numeric column used to split the table
options.put("partitionColumn", id)

val jdbcDF = sqlContext.load("jdbc", options.toMap)
// saves under `output` in HDFS using the default data source (Parquet)
jdbcDF.save(output)
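For reference, here is a minimal sketch of the same dump written against the Spark 1.4+ API (sqlContext.read / DataFrameWriter) mentioned above, reusing the variables from the 1.3 snippet; treat it as a sketch rather than a tested implementation:
// Spark 1.4+ sketch: same JDBC-to-Parquet dump via DataFrameReader / DataFrameWriter
val jdbcDF14 = sqlContext.read
  .format("jdbc")
  .options(Map(
    "driver"          -> "com.mysql.jdbc.Driver",
    "url"             -> url,
    "dbtable"         -> table,
    "partitionColumn" -> id,
    "lowerBound"      -> lower_bound.toString,
    "upperBound"      -> upper_bound.toString,
    "numPartitions"   -> partitions.toString))
  .load()
// write.parquet makes the Parquet target explicit instead of relying on the default data source
jdbcDF14.write.parquet(output)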
- Load the MySQL data back from the HDFS Parquet files like the following:
sqlContext.parquetFile(base_dir + "/" + table).toDF().registerTempTable(table)
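If the daily job dumps several tables, they can all be registered back in one pass. A minimal sketch, assuming a hypothetical list of table names:
// hypothetical table names produced by the daily dump job
val tables = Seq("orders", "users", "payments")
tables.foreach { t =>
  sqlContext.parquetFile(base_dir + "/" + t).registerTempTable(t)
}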
- Load the HBase table as a DataFrame and register it as a table:
import org.apache.hadoop.fs.Path
import org.apache.hadoop.hbase.HBaseConfiguration
import org.apache.hadoop.hbase.client.HBaseAdmin
import org.apache.hadoop.hbase.mapreduce.TableInputFormat

var config = HBaseConfiguration.create()
config.addResource(new Path(System.getenv("HBASE_HOME") + "/conf/hbase-site.xml"))
try {
  HBaseAdmin.checkHBaseAvailable(config)
  System.out.println("Detected HBase is running")
} catch {
  case e => e.printStackTrace()
}
config.set(TableInputFormat.INPUT_TABLE, hbase_table)
config.set(TableInputFormat.SCAN_COLUMN_FAMILY, columnF)
......
// hc is a small Java helper class (its methods are shown below); jsc is the JavaSparkContext
sqlContext.createDataFrame(hc.toRowRDD(hc.createPairRDD(jsc, config)), hc.schema()).toDF().registerTempTable(table)
//hc.createPairRDD:
public JavaPairRDD<ImmutableBytesWritable, Result> createPairRDD(JavaSparkContext jsc, Configuration conf) {
    // scan the HBase table through TableInputFormat and cache the raw (rowkey, Result) pairs
    return jsc.newAPIHadoopRDD(conf,
            TableInputFormat.class,
            ImmutableBytesWritable.class,
            Result.class).cache();
}
//hc.toRowRDD:
public JavaRDD<Row> toRowRDD(JavaPairRDD<ImmutableBytesWritable, Result> pairRDD) {
    // `schema` is a field of this helper class describing the column family and columns
    return pairRDD.map(new Function<Tuple2<ImmutableBytesWritable, Result>, Row>() {
        private static final long serialVersionUID = -4887770613787757684L;

        public Row call(Tuple2<ImmutableBytesWritable, Result> re) throws Exception {
            Result result = re._2();
            Row row = null;
            if (schema.getColumns().length == 0) {
                row = getAll(result);
            } else {
                row = get(result);
            }
            return row;
        }

        // read only the columns declared in the schema
        public Row get(Result result) throws Exception {
            List<Object> values = new ArrayList<Object>();
            for (String col : schema.getColumns()) {
                byte[] b = result.getValue(schema.getFamily().getBytes(), col.getBytes());
                if (b == null) {
                    values.add("0");
                    continue;
                }
                values.add(new String(b));
            }
            Row row = RowFactory.create(values.toArray(new Object[values.size()]));
            return row;
        }

        // read every column of the family when no columns are declared
        public Row getAll(Result result) throws Exception {
            NavigableMap<byte[], byte[]> map = result.getFamilyMap(schema.getFamily().getBytes());
            List<Object> values = new ArrayList<Object>();
            for (byte[] key : map.keySet()) {
                values.add(new String(map.get(key)));
            }
            Row row = RowFactory.create(values.toArray(new Object[values.size()]));
            return row;
        }
    });
}
//hc.schema():
public StructType schema() {
    final List<StructField> keyFields = new ArrayList<StructField>();
    for (String fieldName : this.hbase_columns) { // hbase_columns is String[]
        keyFields.add(DataTypes.createStructField(fieldName, DataTypes.StringType, true));
    }
    return DataTypes.createStructType(keyFields);
}
- Run the SQL as follows and save the result in HDFS:
val rdd_parquet = sqlContext.sql(sql)
rdd_parquet.rdd.saveAsTextFile(output)
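To tie the steps together, here is a hypothetical "big SQL" run against the temp tables registered above; the table and column names are purely illustrative:
// illustrative join between a MySQL-derived Parquet table and the HBase-backed table
val reportSql =
  """SELECT o.user_id, COUNT(*) AS order_cnt
    |FROM orders o
    |JOIN user_profile u ON o.user_id = u.user_id
    |GROUP BY o.user_id""".stripMargin
val report = sqlContext.sql(reportSql)
report.rdd.saveAsTextFile(output + "/order_cnt")
// or keep the result columnar for further queries (Spark 1.3 API):
// report.saveAsParquetFile(output + "/order_cnt_parquet")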
In summary, this post looked at how Spark and Parquet files can noticeably improve the performance of pulling data out of MySQL, HBase, and similar stores for big-data analysis. The key points were dumping big MySQL tables into Parquet files in HDFS and loading an HBase table as a DataFrame with Spark, which removes the performance bottleneck of querying the sources directly.