[Original] How to fix slow Hive writes from Spark
When writing to Hive from Spark, the most time-consuming part turned out to be writing the produced results back into Hive. For example, for a join of a 3 GB table with a 1 GB table, writing the result directly into the Hive table in the following way took more than half an hour (a fuller sketch of this pattern follows this entry): dataframe.registerTempTable("result") sql(s"""INSERT OVERWRITE Table $outputTable PARTITION (dt ='$output
2016-05-31 10:13:23
15667
3
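A minimal sketch of the slow write path described above, assuming a Spark 1.x HiveContext; the join query, output table, and partition value are placeholders rather than the post's own, and the post's actual speed-up is cut off in the excerpt, so only the measured slow pattern is reconstructed here.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.hive.HiveContext

object SlowHiveWrite {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("slow-hive-write"))
    val hc = new HiveContext(sc)

    // Result of the large join (placeholder query standing in for the 3 GB x 1 GB join).
    val dataframe = hc.sql(
      "SELECT a.*, b.value FROM warehouse.big_a a JOIN warehouse.big_b b ON a.key = b.key")

    // The pattern the post times: register the DataFrame as a temp table and
    // INSERT OVERWRITE into a partitioned Hive table -- this is the slow step.
    dataframe.registerTempTable("result")
    val outputTable = "warehouse.join_result"   // placeholder
    val outputDate  = "2016-05-31"              // placeholder
    hc.sql(s"INSERT OVERWRITE TABLE $outputTable PARTITION (dt = '$outputDate') " +
      "SELECT * FROM result")
  }
}
```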
[Original] How to use Zeppelin for big data visualization
Zeppelin is a Spark-based data visualization solution. It supports Scala, so any job that runs on Spark can run on this platform, and it also supports visualizing table data. Visualization of other data sources can be added through interpreters; for example, there is a MySQL interpreter on GitHub. The following focuses on writing code in a Zeppelin notebook. Scala: the benefit of this interpreter is that data from the various sources can be
2016-05-25 15:28:35
9680
[Original] Methods for handling data skew in Spark SQL
Definition and symptoms: data skew is caused by uneven data partitioning; some Spark tasks end up carrying far too much data, so the overall run time becomes excessive. It usually appears when joining large tables. In the data it shows up as the large table's join keys being concentrated on a few values; at runtime the job lingers on one or a few tasks for a very long time (more than 0.5 hour). (A broadcast-join sketch follows this entry.) The cases are generally large table join large table and large table join small table; for large table join small table
2016-05-25 15:09:03
11370
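The excerpt cuts off before the post's remedies. For the large-table-join-small-table case it mentions, one standard treatment is to broadcast the small side; the sketch below assumes the Spark 1.5+ DataFrame API and placeholder table and key names, and is not necessarily the post's exact method.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.hive.HiveContext
import org.apache.spark.sql.functions.broadcast

object SkewedJoin {
  def main(args: Array[String]): Unit = {
    val hc = new HiveContext(new SparkContext(new SparkConf().setAppName("skewed-join")))

    // Broadcasting the small table turns the shuffle join into a map-side join,
    // so the skewed join keys of the big table are no longer funneled into a few tasks.
    val bigDF   = hc.table("warehouse.big_table")   // placeholder names
    val smallDF = hc.table("warehouse.small_dim")
    val joined  = bigDF.join(broadcast(smallDF), Seq("join_key"))

    joined.count()  // force execution for illustration
  }
}
```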
[Reposted] Common billing models for internet and mobile advertising, including CPC, CPM, CPA, CPD, CPS, and dCPM
Common billing models for internet and mobile advertising, including CPC, CPM, CPA, CPD, CPS, and dCPM (2013-01-09 17:52:15). CPC is pay per click, from the English Cost Per Click, the cost per click; for online advertising, each
2016-02-18 12:59:18
24742
[Original] A data warehouse framework based on Spark DataFrames
The diversity of data storage brings many inconveniences to data analysis and mining. The bottlenecks show up in two ways (a cross-source query sketch follows this entry): 1. Traditional databases such as MySQL have limited processing power; as data volumes grow, join, group by, and order by become extremely slow, sometimes exhausting machine resources and failing to run, while moving the storage to a distributed system such as HDFS is too costly. 2. No access across data sources, for example Hive tables, HTables, mys
2015-11-30 10:55:15
6610
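A minimal sketch of the cross-source idea raised above: expose a MySQL table and a Hive table to the same HiveContext and join them in one SQL statement. The connection details, table names, and query are illustrative assumptions, not the post's framework.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.hive.HiveContext

object CrossSourceQuery {
  def main(args: Array[String]): Unit = {
    val hc = new HiveContext(new SparkContext(new SparkConf().setAppName("cross-source")))

    // Load a MySQL table as a DataFrame via the JDBC data source (placeholder connection).
    val orders = hc.read.format("jdbc").options(Map(
      "url"     -> "jdbc:mysql://mysql-host:3306/shop?user=reader&password=secret",
      "driver"  -> "com.mysql.jdbc.Driver",
      "dbtable" -> "orders")).load()
    orders.registerTempTable("orders")

    // Hive tables are already visible to the HiveContext, so one query now spans both stores.
    hc.sql(
      "SELECT u.user_id, o.amount FROM default.users u JOIN orders o ON u.user_id = o.user_id")
      .show()
  }
}
```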
原创 How-to: resolve "java.io.NotSerializableException" issue during spark reading hbase table
During reading htable via spark scala code, the following error happened:15/10/28 16:39:00 WARN scheduler.TaskSetManager: Lost task 0.0 in stage 10.0 (TID 2536, slave14.dc.tj): java.lang.Runtime
2015-10-29 11:24:19
2282
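The excerpt truncates before the fix. A common workaround for non-serializable HBase records, shown here only as an assumed direction rather than the post's resolution, is to switch to Kryo serialization and register the HBase classes that travel inside the RDD.

```scala
import org.apache.hadoop.hbase.HBaseConfiguration
import org.apache.hadoop.hbase.client.Result
import org.apache.hadoop.hbase.io.ImmutableBytesWritable
import org.apache.hadoop.hbase.mapreduce.TableInputFormat
import org.apache.hadoop.hbase.util.Bytes
import org.apache.spark.{SparkConf, SparkContext}

object ReadHTable {
  def main(args: Array[String]): Unit = {
    // Register the non-serializable HBase classes with Kryo (a common workaround,
    // not necessarily the post's exact fix).
    val sparkConf = new SparkConf().setAppName("read-htable")
      .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
      .registerKryoClasses(Array(classOf[ImmutableBytesWritable], classOf[Result]))
    val sc = new SparkContext(sparkConf)

    val hbaseConf = HBaseConfiguration.create()
    hbaseConf.set(TableInputFormat.INPUT_TABLE, "my_table")   // placeholder table name

    val rows = sc.newAPIHadoopRDD(hbaseConf, classOf[TableInputFormat],
      classOf[ImmutableBytesWritable], classOf[Result])

    // Converting to plain types right away also keeps HBase objects out of later shuffles.
    val rowKeys = rows.map { case (key, _) => Bytes.toString(key.get()) }
    println(rowKeys.count())
  }
}
```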
[Original] How-to: resolve spark streaming "Not enough space to cache input-0-* in memory!"
Error:15/09/22 21:52:51 WARN storage.MemoryStore: Failed to reserve initial memory threshold of 1024.0 KB for computing block input-0-1442929965600 in memory.15/09/22 21:52:51 WARN storage.MemoryS
2015-09-24 16:53:14
4917
[Original] How-to: use Spark to support queries across MySQL tables and HBase tables
It would be good for a data analyst to just run "big SQL" to process tables from MySQL, HBase, or other stores. One more important thing is performance: we should avoid running "big s
2015-09-18 18:04:40
848
[Original] How-to: write your own Kafka Partitioner based on requirements
Kafka's default partitioner is based on hashing the first element, which is generated by splitting the log on tabs. In our usage this is not the normal case (a custom-partitioner sketch follows this entry): our logs are plain logs whose first element should b
2015-09-14 11:12:31
588
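A hedged sketch of a custom partitioner for the old Kafka 0.8 Scala producer API, which matches this post's era; the package, class name, and key handling are illustrative, not the post's code.

```scala
package com.example

import kafka.producer.Partitioner
import kafka.utils.VerifiableProperties

// Route each message by a hash of the key the producer supplies, instead of relying
// on the first tab-separated field of the log line. Kafka instantiates the class by
// reflection, so the VerifiableProperties constructor parameter is required.
class KeyHashPartitioner(props: VerifiableProperties = null) extends Partitioner {
  override def partition(key: Any, numPartitions: Int): Int = {
    if (key == null) 0
    else (key.toString.hashCode & Integer.MAX_VALUE) % numPartitions
  }
}
```

The producer would then pick it up through the `partitioner.class` property, e.g. `props.put("partitioner.class", "com.example.KeyHashPartitioner")`.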
[Reposted] Memory allocation in Spark on YARN
http://blog.javachen.com/2015/06/09/memory-in-spark-on-yarn.html A very good article, recommended. This post mainly looks at memory allocation in the Spark on YARN deployment mode. Since I have not studied the Spark source code in depth, I can only follow the logs into the relevant source code to understand "why it is this way, why it is that way". Note: depending on how the Spark application's driver is distributed
2015-09-02 19:04:30
1994
[Original] Two ways to load MySQL tables into HDFS via Spark
There are two ways to load MySQL tables into HDFS via Spark and then process the data (a JdbcRDD sketch follows this entry). Load the MySQL tables: use JdbcRDD directly. package org.apache.spark.examples.sql import org.apache.spark.s
2015-08-25 18:00:09
999
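A sketch of the first approach named in the excerpt: read the MySQL table with JdbcRDD and save it to HDFS. The connection string, query, bounds, and output path are placeholders; the query must keep the two '?' placeholders JdbcRDD uses for its partition bounds.

```scala
import java.sql.DriverManager
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.JdbcRDD

object MysqlToHdfs {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("mysql-to-hdfs"))

    // JdbcRDD splits the id range [1, 1000000] into 10 partitions and runs the
    // bounded query once per partition.
    val rows = new JdbcRDD(
      sc,
      () => DriverManager.getConnection(
        "jdbc:mysql://mysql-host:3306/shop?user=reader&password=secret"),
      "SELECT id, name FROM users WHERE id >= ? AND id <= ?",
      lowerBound = 1, upperBound = 1000000, numPartitions = 10,
      mapRow = rs => (rs.getLong("id"), rs.getString("name")))

    rows.map { case (id, name) => s"$id\t$name" }
      .saveAsTextFile("hdfs:///tmp/users")   // placeholder output path
  }
}
```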
[Original] How-to: resolve crontab not working
The environment crontab runs in is different from the current user's. With the current user I could run my shell script, but under crontab it does not work. Resolution: add the following to the shell script: . /etc/profile
2015-08-20 18:27:35
664
[Original] How-to: set YARN MapReduce memory properties
The memory for containers, the NodeManager, map tasks, and reduce tasks should be based on the system memory and the number of CPUs.
2015-08-12 14:40:44
541
[Original] How-to: control tasks for each stage (partitions for each RDD)
The way to control the number of tasks is to control the parallelism of each stage. There are two ways to control Spark's parallelism (a sketch follows this entry). The difference between the two is that repartition will perform
2015-08-07 18:47:02
868
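A small sketch contrasting the two controls the excerpt refers to, with placeholder paths and partition counts.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object ControlPartitions {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("control-partitions"))
    val lines = sc.textFile("hdfs:///tmp/input")   // placeholder input path

    // repartition(n) always runs a full shuffle, so it can grow or shrink the
    // number of partitions -- and therefore the number of tasks in the next stage.
    val wide = lines.repartition(200)

    // coalesce(n) merges existing partitions without a shuffle; it is cheaper but
    // can only reduce the partition count.
    val narrow = wide.coalesce(50)

    println(s"wide = ${wide.partitions.length}, narrow = ${narrow.partitions.length}")
  }
}
```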
[Original] How-to: enable Hive job support in Spark
There were multiple issues while running the Spark Hive example HiveFromSpark: copy hive-site.xml to ${SPARK_HOME}/conf/. Correct command: ${SPARK_HOME}/bin/spark-submit --master yarn-cluster --driv
2015-08-06 18:00:26
941
[Original] Spark Streaming batch-interval performance testing
The shortest interval Spark Streaming can sustain depends on how fast the data source produces data and on the operations applied to the RDDs. This post tests, against the same data source (logs collected in real time by Spark), the appropriate interval for several RDD operations. The interval is applied to Spark Streaming as: new StreamingContext(sparkConf, Milliseconds(time.toLong)) Test data source: log
2015-08-05 13:58:53
5961
原创 How-to: resolve spark "/usr/bin/python: No module named pyspark" issue
Error: Error from python worker:/usr/bin/python: No module named pysparkPYTHONPATH was:/home/hadoop/tmp/nm-local-dir/usercache/chenfangfang/filecache/43/spark-assembly-1.3.0-cdh5.4.1-hadoop2.
2015-08-05 13:39:59
2870
[Original] How-to: enable Spark SQL in the CDH version of Spark
Cloudera's Spark build does not support Spark SQL. Here I take the cdh-5.4.1 Spark as an example of enabling Spark SQL. The overall steps will be: update the Hive version, resolve compile issues, and update the Spark packag
2015-08-04 12:13:50
1064
[Original] How-to: install Hive with a MySQL metastore
Install MySQL. Install the MySQL connector: $ curl -L 'http://www.mysql.com/get/Downloads/Connector-J/mysql-connector-java-5.1.31.tar.gz/from/http://mysql.he.net/' | tar xz $ sudo cp mysql-connector-java-
2015-08-04 11:49:35
462
[Original] How-to: configure Hadoop rack awareness
Update/generate the following files in ${HADOOP_CONF_DIR}: core-site.xml: net.topology.script.file.name = ${HADOOP_CONF_DIR}/rack-topology.sh
2015-07-17 17:24:04
683
[Original] Hadoop cluster security 2: How to enable Hadoop Service Level Authorization
Reference: http://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-common/ServiceLevelAuth.html Steps: add the following in core-site.xml: hadoop.s
2015-07-17 11:48:36
988
[Original] How-to: enable the fair scheduler in Hadoop
Introduction and reference: http://hadoop.apache.org/docs/r2.7.1/hadoop-yarn/hadoop-yarn-site/FairScheduler.html Steps: add the following in yarn-site.xml: yarn.resou
2015-07-17 11:33:32
732
[Original] How-to: enable HBase ACLs and verify
Add the following configuration in hbase-site.xml: hbase.security.authorization = true, hbase.coprocessor.master.classes = org.apache.hadoop.hbase.security.access.AccessController, hbase.copr
2015-07-15 18:36:05
1901
[Original] How-to: transfer HBase data between two Hadoop clusters
Overall: use the HBase Export/Import tool. Reference: http://hbase.apache.org/book.html#tools Steps: at the old cluster run: hbase org.apache.hadoop.hbase.mapreduce.Export table_name hdfs://ne
2015-07-15 18:20:17
593
原创 How-to: resolve "java.lang.NoClassDefFoundError: org/htrace/Trace" when hbase Export
Error:Caused by: java.lang.NoClassDefFoundError: org/htrace/Trace at org.apache.hadoop.hbase.client.ResultBoundedCompletionService.submit(ResultBoundedCompletionService.java:142) a
2015-07-15 17:34:21
4203
[Original] How-to: deploy a Hadoop client with a special user on an ACL-enabled cluster
This is about deploying a Hadoop client with a user other than the Hadoop admin user, to connect to an ACL-enabled (user access permissions configured) Hadoop cluster. At the client node: useradd the client user. Here clie
2015-07-15 13:38:53
906
[Original] How-to: effectively store Kafka data into HDFS via Spark Streaming
This is an improvement on "How-to: make spark streaming collect data from Kafka topics and store data into hdfs" (a sketch follows this entry). In How-to: make spark streaming collect data from Kafka topics and store data
2015-07-13 15:59:17
970
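The excerpt does not show the post's refinement, so the sketch below only assumes one common improvement over the basic pipeline: write each batch through foreachRDD and coalesce it first, so every interval produces a small number of HDFS files. Hosts, topic, and paths are placeholders (Spark 1.x receiver-based Kafka stream).

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils

object KafkaToHdfsCompact {
  def main(args: Array[String]): Unit = {
    val ssc = new StreamingContext(
      new SparkConf().setAppName("kafka-to-hdfs-compact"), Seconds(60))

    val messages = KafkaUtils.createStream(
      ssc, "zk-host:2181", "hdfs-writer", Map("logs" -> 2)).map(_._2)

    // Coalescing before the save keeps each interval's output to a single file,
    // and skipping empty batches avoids writing empty directories.
    messages.foreachRDD { (rdd, time) =>
      if (!rdd.isEmpty()) {
        rdd.coalesce(1).saveAsTextFile(s"hdfs:///data/logs/${time.milliseconds}")
      }
    }

    ssc.start()
    ssc.awaitTermination()
  }
}
```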
原创 How-to: Resolve "Datanode denied communication with namenode because hostname cannot be resolved (ip
Error:org.apache.hadoop.hdfs.server.protocol.DisallowedDatanodeException: Datanode denied communication with namenode because hostname cannot be resolved (ip=172.31.34.70, hostname=172.31.34.70):
2015-07-02 13:58:15
8106
1
[Original] How-to: enable YARN ResourceManager HA
Update yarn-site.xml. Add the following in yarn-site.xml (master.chff.dc and slave01.chff.dc are the ResourceManager nodes): yarn.resourcemanager.ha.enabled = true
2015-06-25 14:52:07
1138
原创 How-to: resolve hbase "org.apache.hadoop.hbase.TableExistsException: hbase:namespace"
Error:2015-06-24 13:34:05,251 FATAL [master:60000.activeMasterManager] master.HMaster: Failed to become active masterorg.apache.hadoop.hbase.TableExistsException: hbase:namespace at org.a
2015-06-24 14:36:13
3659
[Original] How-to: enable HDFS HA on a new cluster
Deploy the Hadoop cluster without HA first and make sure Hadoop works normally. Configuration update: hdfs-site.xml: dfs.nameservices = dfscluster
2015-06-24 13:05:52
923
[Original] How-to: resolve an HBase shutdown caused by "KeeperErrorCode = ConnectionLoss for /hbase"
Error log:2015-06-23 17:35:05,995 WARN [main] zookeeper.RecoverableZooKeeper: Possibly transient ZooKeeper, quorum=slave.chff.dc:2183,master.chff.dc:2183,slave01.chff.dc:2183, exception=org.apache.
2015-06-23 18:22:14
4465
[Original] How-to: make spark streaming collect data from Kafka topics and store data into hdfs
Development steps: develop a class that connects to the Kafka topics and stores the data into HDFS (a sketch follows this entry). In the Spark project: ./examples/src/main/scala/org/apache/spark/examples/streaming/Kafka.scala package o
2015-06-18 15:49:43
887
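A minimal sketch of the basic pipeline this post describes, assuming the Spark 1.x receiver-based Kafka stream and saving every batch to HDFS; the ZooKeeper quorum, consumer group, topic map, and output prefix are placeholders.

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils

object KafkaToHdfs {
  def main(args: Array[String]): Unit = {
    val ssc = new StreamingContext(
      new SparkConf().setAppName("kafka-to-hdfs"), Seconds(30))

    // Receiver-based stream: (zookeeper quorum, consumer group, topic -> receiver threads).
    val lines = KafkaUtils.createStream(
      ssc, "zk-host:2181", "hdfs-writer", Map("logs" -> 1)).map(_._2)

    // Each batch is written under a timestamped directory with this prefix.
    lines.saveAsTextFiles("hdfs:///data/kafka/logs")

    ssc.start()
    ssc.awaitTermination()
  }
}
```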
[Original] How-to: enable HMaster HA (high availability) and highly available reads
First, make sure the backup HMaster node's hostname is configured in the /etc/hosts file of all HBase nodes. Then add the following in hbase-site.xml on every HBase node (including the HMaster, re
2015-06-16 12:38:53
1678
[Original] How-to: resolve a regionserver dying with "No lease on /hbase/oldWALs/..."
Error log from the dead regionserver: 2015-06-11 16:23:03,072 ERROR [regionserver/slave04/172.31.34.64:60020] regionserver.HRegionServer: Shutdown / close of WAL failed: org.apache.hadoop.hdfs.server.na
2015-06-16 11:10:08
3520
[Original] How-to: install Puppet via yum
Master: sudo rpm -ivh http://yum.puppetlabs.com/puppetlabs-release-el-5.noarch.rpm; sudo yum install puppet-server. Add the following in /etc/puppet/puppet.conf [main]: dns_alt_names = master.chff.dc su
2015-06-15 13:12:55
451
原创 how-to: resolve "Connection refused" during connext hiveserver2 via beeline
Issue:At hiveserver2 node, could connect hiveserver2 via localhost/127.0.0.1, but connection refused when using ip.At other node, could not access hiveserver2 via ipConnect command is like:/op
2015-06-11 11:13:35
1493
原创 how-to: resolve "java.lang.OutOfMemoryError: Java heap space" during using beeline && hiveserver2
Error log in hive.log file:2015-06-10 00:33:18,207 ERROR [HiveServer2-Handler-Pool: Thread-47]: thrift.ProcessFunction (ProcessFunction.java:process(41)) - Internal error processing OpenSession
2015-06-11 11:12:29
3711
原创 How-to: resolve "Unapproved licenses:" issue during building with mvn+rat
This issue happened during building flume, but this has nothing to do with flume.mvn verbose output will mention about what files is marked "Unapproved license" from target/rat.txt. Like:
2015-06-09 12:29:15
913
原创 How-to: resolve " java.lang.OutOfMemoryError: unable to create new native thread" for hbase thrift
Error:Caused by: java.lang.OutOfMemoryError: unable to create new native thread at java.lang.Thread.start0(Native Method) at java.lang.Thread.start(Thread.java:714) at jav
2015-06-08 17:49:15
2100