[Original] How to fix slow Hive writes from Spark
When writing to Hive from Spark, the most time-consuming part turned out to be writing the produced results back into Hive. For example, for a join of a 3 GB table with a 1 GB table, writing the result directly into the Hive table in the following way took more than half an hour (a fuller sketch of this pattern follows this entry): dataframe.registerTempTable("result") sql(s"""INSERT OVERWRITE Table $outputTable PARTITION (dt ='$output
2016-05-31 10:13:23
15667
3
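A minimal sketch of the slow write path described above, assuming a Spark 1.x HiveContext; the join query, output table, and partition value are placeholders rather than the post's own, and the post's actual speed-up is cut off in the excerpt, so only the measured slow pattern is reconstructed here.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.hive.HiveContext

object SlowHiveWrite {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("slow-hive-write"))
    val hc = new HiveContext(sc)

    // Result of the large join (placeholder query standing in for the 3 GB x 1 GB join).
    val dataframe = hc.sql(
      "SELECT a.*, b.value FROM warehouse.big_a a JOIN warehouse.big_b b ON a.key = b.key")

    // The pattern the post times: register the DataFrame as a temp table and
    // INSERT OVERWRITE into a partitioned Hive table -- this is the slow step.
    dataframe.registerTempTable("result")
    val outputTable = "warehouse.join_result"   // placeholder
    val outputDate  = "2016-05-31"              // placeholder
    hc.sql(s"INSERT OVERWRITE TABLE $outputTable PARTITION (dt = '$outputDate') " +
      "SELECT * FROM result")
  }
}
```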
[Original] How to use Zeppelin for big data visualization
Zeppelin is a Spark-based data visualization solution. It supports Scala, so any job that runs on Spark can run on this platform, and it also supports visualizing table data. Visualization of other data sources can be added through interpreters; for example, there is a MySQL interpreter on GitHub. The following focuses on writing code in a Zeppelin notebook. Scala: the benefit of this interpreter is that data from the various sources can be
2016-05-25 15:28:35
9680
[Original] Methods for handling data skew in Spark SQL
Definition and symptoms: data skew is caused by uneven data partitioning; some Spark tasks end up carrying far too much data, so the overall run time becomes excessive. It usually appears when joining large tables. In the data it shows up as the large table's join keys being concentrated on a few values; at runtime the job lingers on one or a few tasks for a very long time (more than 0.5 hour). (A broadcast-join sketch follows this entry.) The cases are generally large table join large table and large table join small table; for large table join small table
2016-05-25 15:09:03
11370
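The excerpt cuts off before the post's remedies. For the large-table-join-small-table case it mentions, one standard treatment is to broadcast the small side; the sketch below assumes the Spark 1.5+ DataFrame API and placeholder table and key names, and is not necessarily the post's exact method.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.hive.HiveContext
import org.apache.spark.sql.functions.broadcast

object SkewedJoin {
  def main(args: Array[String]): Unit = {
    val hc = new HiveContext(new SparkContext(new SparkConf().setAppName("skewed-join")))

    // Broadcasting the small table turns the shuffle join into a map-side join,
    // so the skewed join keys of the big table are no longer funneled into a few tasks.
    val bigDF   = hc.table("warehouse.big_table")   // placeholder names
    val smallDF = hc.table("warehouse.small_dim")
    val joined  = bigDF.join(broadcast(smallDF), Seq("join_key"))

    joined.count()  // force execution for illustration
  }
}
```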
[Reposted] Common billing models for internet and mobile advertising, including CPC, CPM, CPA, CPD, CPS, and dCPM
Common billing models for internet and mobile advertising, including CPC, CPM, CPA, CPD, CPS, and dCPM (2013-01-09 17:52:15). CPC is pay per click, from the English Cost Per Click, the cost per click; for online advertising, each
2016-02-18 12:59:18
24742
[Original] A data warehouse framework based on Spark DataFrames
The diversity of data storage brings many inconveniences to data analysis and mining. The bottlenecks show up in two ways (a cross-source query sketch follows this entry): 1. Traditional databases such as MySQL have limited processing power; as data volumes grow, join, group by, and order by become extremely slow, sometimes exhausting machine resources and failing to run, while moving the storage to a distributed system such as HDFS is too costly. 2. No access across data sources, for example Hive tables, HTables, mys
2015-11-30 10:55:15
6610
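A minimal sketch of the cross-source idea raised above: expose a MySQL table and a Hive table to the same HiveContext and join them in one SQL statement. The connection details, table names, and query are illustrative assumptions, not the post's framework.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.hive.HiveContext

object CrossSourceQuery {
  def main(args: Array[String]): Unit = {
    val hc = new HiveContext(new SparkContext(new SparkConf().setAppName("cross-source")))

    // Load a MySQL table as a DataFrame via the JDBC data source (placeholder connection).
    val orders = hc.read.format("jdbc").options(Map(
      "url"     -> "jdbc:mysql://mysql-host:3306/shop?user=reader&password=secret",
      "driver"  -> "com.mysql.jdbc.Driver",
      "dbtable" -> "orders")).load()
    orders.registerTempTable("orders")

    // Hive tables are already visible to the HiveContext, so one query now spans both stores.
    hc.sql(
      "SELECT u.user_id, o.amount FROM default.users u JOIN orders o ON u.user_id = o.user_id")
      .show()
  }
}
```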
原创 How-to: resolve "java.io.NotSerializableException" issue during spark reading hbase table
During reading htable via spark scala code, the following error happened:15/10/28 16:39:00 WARN scheduler.TaskSetManager: Lost task 0.0 in stage 10.0 (TID 2536, slave14.dc.tj): java.lang.Runtime
2015-10-29 11:24:19
2282
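The excerpt truncates before the fix. A common workaround for non-serializable HBase records, shown here only as an assumed direction rather than the post's resolution, is to switch to Kryo serialization and register the HBase classes that travel inside the RDD.

```scala
import org.apache.hadoop.hbase.HBaseConfiguration
import org.apache.hadoop.hbase.client.Result
import org.apache.hadoop.hbase.io.ImmutableBytesWritable
import org.apache.hadoop.hbase.mapreduce.TableInputFormat
import org.apache.hadoop.hbase.util.Bytes
import org.apache.spark.{SparkConf, SparkContext}

object ReadHTable {
  def main(args: Array[String]): Unit = {
    // Register the non-serializable HBase classes with Kryo (a common workaround,
    // not necessarily the post's exact fix).
    val sparkConf = new SparkConf().setAppName("read-htable")
      .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
      .registerKryoClasses(Array(classOf[ImmutableBytesWritable], classOf[Result]))
    val sc = new SparkContext(sparkConf)

    val hbaseConf = HBaseConfiguration.create()
    hbaseConf.set(TableInputFormat.INPUT_TABLE, "my_table")   // placeholder table name

    val rows = sc.newAPIHadoopRDD(hbaseConf, classOf[TableInputFormat],
      classOf[ImmutableBytesWritable], classOf[Result])

    // Converting to plain types right away also keeps HBase objects out of later shuffles.
    val rowKeys = rows.map { case (key, _) => Bytes.toString(key.get()) }
    println(rowKeys.count())
  }
}
```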
[Original] How-to: resolve spark streaming "Not enough space to cache input-0-* in memory!"
Error:15/09/22 21:52:51 WARN storage.MemoryStore: Failed to reserve initial memory threshold of 1024.0 KB for computing block input-0-1442929965600 in memory.15/09/22 21:52:51 WARN storage.MemoryS
2015-09-24 16:53:14
4917
[Original] How-to: use Spark to support queries across MySQL tables and HBase tables
It would be good for a data analyst to just run "big SQL" to process tables from MySQL, HBase, or other stores. One more important thing is performance: we should avoid running "big s
2015-09-18 18:04:40
848
[Original] How-to: write your own Kafka Partitioner based on requirements
Kafka's default partitioner is based on hashing the first element, which is generated by splitting the log on tabs. In our usage this is not the normal case (a custom-partitioner sketch follows this entry): our logs are plain logs whose first element should b
2015-09-14 11:12:31
588
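A hedged sketch of a custom partitioner for the old Kafka 0.8 Scala producer API, which matches this post's era; the package, class name, and key handling are illustrative, not the post's code.

```scala
package com.example

import kafka.producer.Partitioner
import kafka.utils.VerifiableProperties

// Route each message by a hash of the key the producer supplies, instead of relying
// on the first tab-separated field of the log line. Kafka instantiates the class by
// reflection, so the VerifiableProperties constructor parameter is required.
class KeyHashPartitioner(props: VerifiableProperties = null) extends Partitioner {
  override def partition(key: Any, numPartitions: Int): Int = {
    if (key == null) 0
    else (key.toString.hashCode & Integer.MAX_VALUE) % numPartitions
  }
}
```

The producer would then pick it up through the `partitioner.class` property, e.g. `props.put("partitioner.class", "com.example.KeyHashPartitioner")`.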
[Reposted] Memory allocation in Spark on YARN
http://blog.javachen.com/2015/06/09/memory-in-spark-on-yarn.html A very good article, recommended. This post mainly looks at memory allocation in the Spark on YARN deployment mode. Since I have not studied the Spark source code in depth, I can only follow the logs into the relevant source code to understand "why it is this way, why it is that way". Note: depending on how the Spark application's driver is distributed
2015-09-02 19:04:30
1994
[Original] Two ways to load MySQL tables into HDFS via Spark
There are two ways to load MySQL tables into HDFS via Spark and then process the data (a JdbcRDD sketch follows this entry). Load the MySQL tables: use JdbcRDD directly. package org.apache.spark.examples.sql import org.apache.spark.s
2015-08-25 18:00:09
999
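A sketch of the first approach named in the excerpt: read the MySQL table with JdbcRDD and save it to HDFS. The connection string, query, bounds, and output path are placeholders; the query must keep the two '?' placeholders JdbcRDD uses for its partition bounds.

```scala
import java.sql.DriverManager
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.JdbcRDD

object MysqlToHdfs {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("mysql-to-hdfs"))

    // JdbcRDD splits the id range [1, 1000000] into 10 partitions and runs the
    // bounded query once per partition.
    val rows = new JdbcRDD(
      sc,
      () => DriverManager.getConnection(
        "jdbc:mysql://mysql-host:3306/shop?user=reader&password=secret"),
      "SELECT id, name FROM users WHERE id >= ? AND id <= ?",
      lowerBound = 1, upperBound = 1000000, numPartitions = 10,
      mapRow = rs => (rs.getLong("id"), rs.getString("name")))

    rows.map { case (id, name) => s"$id\t$name" }
      .saveAsTextFile("hdfs:///tmp/users")   // placeholder output path
  }
}
```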
[Original] How-to: resolve crontab not working
The environment crontab runs in is different from the current user's. With the current user I could run my shell script, but under crontab it does not work. Resolution: add the following to the shell script: . /etc/profile
2015-08-20 18:27:35
664
[Original] How-to: set YARN MapReduce memory properties
The memory for containers, the NodeManager, map tasks, and reduce tasks should be based on the system memory and the number of CPUs.
2015-08-12 14:40:44
541
[Original] How-to: control tasks for each stage (partitions for each RDD)
The way to control the number of tasks is to control the parallelism of each stage. There are two ways to control Spark's parallelism (a sketch follows this entry). The difference between the two is that repartition will perform
2015-08-07 18:47:02
868
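A small sketch contrasting the two controls the excerpt refers to, with placeholder paths and partition counts.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object ControlPartitions {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("control-partitions"))
    val lines = sc.textFile("hdfs:///tmp/input")   // placeholder input path

    // repartition(n) always runs a full shuffle, so it can grow or shrink the
    // number of partitions -- and therefore the number of tasks in the next stage.
    val wide = lines.repartition(200)

    // coalesce(n) merges existing partitions without a shuffle; it is cheaper but
    // can only reduce the partition count.
    val narrow = wide.coalesce(50)

    println(s"wide = ${wide.partitions.length}, narrow = ${narrow.partitions.length}")
  }
}
```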
[Original] How-to: enable Hive job support in Spark
There were multiple issues while running the Spark Hive example HiveFromSpark: copy hive-site.xml to ${SPARK_HOME}/conf/. Correct command: ${SPARK_HOME}/bin/spark-submit --master yarn-cluster --driv
2015-08-06 18:00:26
941
[Original] Spark Streaming batch-interval performance testing
The shortest interval Spark Streaming can sustain depends on how fast the data source produces data and on the operations applied to the RDDs. This post tests, against the same data source (logs collected in real time by Spark), the appropriate interval for several RDD operations. The interval is applied to Spark Streaming as: new StreamingContext(sparkConf, Milliseconds(time.toLong)) Test data source: log
2015-08-05 13:58:53
5961
原创 How-to: resolve spark "/usr/bin/python: No module named pyspark" issue
Error: Error from python worker:/usr/bin/python: No module named pysparkPYTHONPATH was:/home/hadoop/tmp/nm-local-dir/usercache/chenfangfang/filecache/43/spark-assembly-1.3.0-cdh5.4.1-hadoop2.
2015-08-05 13:39:59
2870
[Original] How-to: enable Spark SQL in the CDH version of Spark
Cloudera's Spark build does not support Spark SQL. Here I take the cdh-5.4.1 Spark as an example of enabling Spark SQL. The overall steps will be: update the Hive version, resolve compile issues, and update the Spark packag
2015-08-04 12:13:50
1064
[Original] How-to: install Hive with a MySQL metastore
Install MySQL. Install the MySQL connector: $ curl -L 'http://www.mysql.com/get/Downloads/Connector-J/mysql-connector-java-5.1.31.tar.gz/from/http://mysql.he.net/' | tar xz $ sudo cp mysql-connector-java-
2015-08-04 11:49:35
462
[Original] How-to: configure Hadoop rack awareness
Update/generate the following files in ${HADOOP_CONF_DIR}: core-site.xml: net.topology.script.file.name = ${HADOOP_CONF_DIR}/rack-topology.sh
2015-07-17 17:24:04
683
[Original] Hadoop cluster security 2: How to enable Hadoop Service Level Authorization
Reference: http://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-common/ServiceLevelAuth.html Steps: add the following in core-site.xml: hadoop.s
2015-07-17 11:48:36
988
[Original] How-to: enable the fair scheduler in Hadoop
Introduction and reference: http://hadoop.apache.org/docs/r2.7.1/hadoop-yarn/hadoop-yarn-site/FairScheduler.html Steps: add the following in yarn-site.xml: yarn.resou
2015-07-17 11:33:32
732
[Original] How-to: enable HBase ACLs and verify
Add the following configuration in hbase-site.xml: hbase.security.authorization = true, hbase.coprocessor.master.classes = org.apache.hadoop.hbase.security.access.AccessController, hbase.copr
2015-07-15 18:36:05
1901
[Original] How-to: transfer HBase data between two Hadoop clusters
Overall: use the HBase Export/Import tool. Reference: http://hbase.apache.org/book.html#tools Steps: at the old cluster run: hbase org.apache.hadoop.hbase.mapreduce.Export table_name hdfs://ne
2015-07-15 18:20:17
593
原创 How-to: resolve "java.lang.NoClassDefFoundError: org/htrace/Trace" when hbase Export
Error:Caused by: java.lang.NoClassDefFoundError: org/htrace/Trace at org.apache.hadoop.hbase.client.ResultBoundedCompletionService.submit(ResultBoundedCompletionService.java:142) a
2015-07-15 17:34:21
4203
[Original] How-to: deploy a Hadoop client with a special user on an ACL-enabled cluster
This is about deploying a Hadoop client with a user other than the Hadoop admin user, to connect to an ACL-enabled (user access permissions configured) Hadoop cluster. At the client node: useradd the client user. Here clie
2015-07-15 13:38:53
906
[Original] How-to: effectively store Kafka data into HDFS via Spark Streaming
This is an improvement on "How-to: make spark streaming collect data from Kafka topics and store data into hdfs" (a sketch follows this entry). In How-to: make spark streaming collect data from Kafka topics and store data
2015-07-13 15:59:17
970
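The excerpt does not show the post's refinement, so the sketch below only assumes one common improvement over the basic pipeline: write each batch through foreachRDD and coalesce it first, so every interval produces a small number of HDFS files. Hosts, topic, and paths are placeholders (Spark 1.x receiver-based Kafka stream).

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils

object KafkaToHdfsCompact {
  def main(args: Array[String]): Unit = {
    val ssc = new StreamingContext(
      new SparkConf().setAppName("kafka-to-hdfs-compact"), Seconds(60))

    val messages = KafkaUtils.createStream(
      ssc, "zk-host:2181", "hdfs-writer", Map("logs" -> 2)).map(_._2)

    // Coalescing before the save keeps each interval's output to a single file,
    // and skipping empty batches avoids writing empty directories.
    messages.foreachRDD { (rdd, time) =>
      if (!rdd.isEmpty()) {
        rdd.coalesce(1).saveAsTextFile(s"hdfs:///data/logs/${time.milliseconds}")
      }
    }

    ssc.start()
    ssc.awaitTermination()
  }
}
```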
原创 How-to: Resolve "Datanode denied communication with namenode because hostname cannot be resolved (ip
Error:org.apache.hadoop.hdfs.server.protocol.DisallowedDatanodeException: Datanode denied communication with namenode because hostname cannot be resolved (ip=172.31.34.70, hostname=172.31.34.70):
2015-07-02 13:58:15
8106
1
[Original] How-to: enable YARN ResourceManager HA
Update yarn-site.xml. Add the following in yarn-site.xml (master.chff.dc and slave01.chff.dc are the ResourceManager nodes): yarn.resourcemanager.ha.enabled = true
2015-06-25 14:52:07
1138
原创 How-to: resolve hbase "org.apache.hadoop.hbase.TableExistsException: hbase:namespace"
Error:2015-06-24 13:34:05,251 FATAL [master:60000.activeMasterManager] master.HMaster: Failed to become active masterorg.apache.hadoop.hbase.TableExistsException: hbase:namespace at org.a
2015-06-24 14:36:13
3659
[Original] How-to: enable HDFS HA on a new cluster
Deploy the Hadoop cluster without HA first and make sure Hadoop works normally. Configuration update: hdfs-site.xml: dfs.nameservices = dfscluster
2015-06-24 13:05:52
923
[Original] How-to: resolve an HBase shutdown caused by "KeeperErrorCode = ConnectionLoss for /hbase"
Error log:2015-06-23 17:35:05,995 WARN [main] zookeeper.RecoverableZooKeeper: Possibly transient ZooKeeper, quorum=slave.chff.dc:2183,master.chff.dc:2183,slave01.chff.dc:2183, exception=org.apache.
2015-06-23 18:22:14
4465
[Original] How-to: make spark streaming collect data from Kafka topics and store data into hdfs
Development steps: develop a class that connects to the Kafka topics and stores the data into HDFS (a sketch follows this entry). In the Spark project: ./examples/src/main/scala/org/apache/spark/examples/streaming/Kafka.scala package o
2015-06-18 15:49:43
887
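A minimal sketch of the basic pipeline this post describes, assuming the Spark 1.x receiver-based Kafka stream and saving every batch to HDFS; the ZooKeeper quorum, consumer group, topic map, and output prefix are placeholders.

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils

object KafkaToHdfs {
  def main(args: Array[String]): Unit = {
    val ssc = new StreamingContext(
      new SparkConf().setAppName("kafka-to-hdfs"), Seconds(30))

    // Receiver-based stream: (zookeeper quorum, consumer group, topic -> receiver threads).
    val lines = KafkaUtils.createStream(
      ssc, "zk-host:2181", "hdfs-writer", Map("logs" -> 1)).map(_._2)

    // Each batch is written under a timestamped directory with this prefix.
    lines.saveAsTextFiles("hdfs:///data/kafka/logs")

    ssc.start()
    ssc.awaitTermination()
  }
}
```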
[Original] How-to: enable HMaster HA (high availability) and highly available reads
First, make sure the backup HMaster node's hostname is configured in the /etc/hosts file of all HBase nodes. Then add the following in hbase-site.xml on every HBase node (including the HMaster, re
2015-06-16 12:38:53
1678
[Original] How-to: resolve a regionserver dying with "No lease on /hbase/oldWALs/..."
Error log from the dead regionserver: 2015-06-11 16:23:03,072 ERROR [regionserver/slave04/172.31.34.64:60020] regionserver.HRegionServer: Shutdown / close of WAL failed: org.apache.hadoop.hdfs.server.na
2015-06-16 11:10:08
3520
[Original] How-to: install Puppet via yum
Master: sudo rpm -ivh http://yum.puppetlabs.com/puppetlabs-release-el-5.noarch.rpm; sudo yum install puppet-server. Add the following in /etc/puppet/puppet.conf [main]: dns_alt_names = master.chff.dc su
2015-06-15 13:12:55
451
原创 how-to: resolve "Connection refused" during connext hiveserver2 via beeline
Issue:At hiveserver2 node, could connect hiveserver2 via localhost/127.0.0.1, but connection refused when using ip.At other node, could not access hiveserver2 via ipConnect command is like:/op
2015-06-11 11:13:35
1493
原创 how-to: resolve "java.lang.OutOfMemoryError: Java heap space" during using beeline && hiveserver2
Error log in hive.log file:2015-06-10 00:33:18,207 ERROR [HiveServer2-Handler-Pool: Thread-47]: thrift.ProcessFunction (ProcessFunction.java:process(41)) - Internal error processing OpenSession
2015-06-11 11:12:29
3711
原创 How-to: resolve "Unapproved licenses:" issue during building with mvn+rat
This issue happened during building flume, but this has nothing to do with flume.mvn verbose output will mention about what files is marked "Unapproved license" from target/rat.txt. Like:
2015-06-09 12:29:15
913
原创 How-to: resolve " java.lang.OutOfMemoryError: unable to create new native thread" for hbase thrift
Error:Caused by: java.lang.OutOfMemoryError: unable to create new native thread at java.lang.Thread.start0(Native Method) at java.lang.Thread.start(Thread.java:714) at jav
2015-06-08 17:49:15
2100