zhixingheyi_tian
Intel Big Data. Spark
Articles in this column
Janino
[Code] Janino. Original post, 2025-06-02 14:47:05 · 20 reads · 0 comments
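Janino is the embedded Java compiler Spark relies on for whole-stage code generation. As a hedged illustration only (the post's own snippet is not shown here), a minimal Scala sketch of Janino's ExpressionEvaluator API:

import org.codehaus.janino.ExpressionEvaluator

object JaninoDemo {
  def main(args: Array[String]): Unit = {
    // Compile a Java expression at runtime, the way Spark codegen does via Janino.
    val ee = new ExpressionEvaluator()
    ee.setParameters(Array("a", "b"), Array[Class[_]](classOf[Int], classOf[Int]))
    ee.setExpressionType(classOf[Int])
    ee.cook("a + b")
    // Evaluate the compiled expression with concrete arguments.
    println(ee.evaluate(Array[AnyRef](Int.box(3), Int.box(4)))) // 7
  }
}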
Design Patterns: Listener vs Visitor
Listener: one event is consumed by multiple listeners (listener interface, event source class, listener implementations, usage example). Visitor: one visitor handles multiple element types (visitor interface, element interface, concrete elements A and B, concrete visitor, usage example); see the sketch below. Original post, 2025-05-19 17:41:19 · 30 reads · 0 comments
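The post's code is Java; here is a compact Scala sketch of both shapes (an assumed reconstruction, not the original listing):

// Listener: one event fanned out to many listeners.
trait Listener { def onEvent(event: String): Unit }

class EventSource {
  private var listeners = List.empty[Listener]
  def addListener(l: Listener): Unit = listeners ::= l
  def fire(event: String): Unit = listeners.foreach(_.onEvent(event))
}

// Visitor: one visitor dispatched over many element types.
trait Visitor { def visit(a: ElementA): Unit; def visit(b: ElementB): Unit }
sealed trait Element { def accept(v: Visitor): Unit }
class ElementA extends Element { def accept(v: Visitor): Unit = v.visit(this) }
class ElementB extends Element { def accept(v: Visitor): Unit = v.visit(this) }

class PrintVisitor extends Visitor {
  def visit(a: ElementA): Unit = println("visited A")
  def visit(b: ElementB): Unit = println("visited B")
}

object PatternDemo extends App {
  val src = new EventSource
  src.addListener(e => println("listener-1 got " + e)) // SAM syntax, Scala 2.12+
  src.fire("started")
  List(new ElementA, new ElementB).foreach(_.accept(new PrintVisitor))
}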
CPU Performance Engineering
"CPU bound" is a term describing computational tasks whose performance bottleneck is primarily the CPU's compute capability. Specifically, calling a task CPU bound means its speed and efficiency are limited mainly by CPU computing power rather than by other factors such as I/O, network latency, or memory bandwidth. Original post, 2024-08-09 10:46:06 · 172 reads · 0 comments
Data Warehouse Notes
In Alibaba's data architecture, the data warehouse is divided into three layers, from bottom to top: the data staging layer (ODS, Operation Data Store), the common data layer (CDM, Common Data Model), and the application layer (ADS, Application Data Service). The common summary fact layer (DWS) is modeled around analysis subjects: driven by the metric requirements of upstream applications and products, it builds summary fact tables at a common granularity, physically materialized as wide tables, which lowers the risk of inconsistent calculation calibers and algorithms. Tables in the common dimension layer are also called logical dimension tables; dimensions and dimension logical tables usually correspond one to one. Original post, 2023-11-16 16:42:18 · 1526 reads · 0 comments
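As a hedged sketch of what a DWS wide-table build can look like in Spark SQL (the table and column names ods_trade_order / dws_trade_user_1d are invented for illustration):

import org.apache.spark.sql.SparkSession

object DwsBuild {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("dws-build").enableHiveSupport().getOrCreate()
    // Summarize a raw ODS order table into a common-granularity DWS wide table.
    spark.sql(
      """
        |CREATE TABLE IF NOT EXISTS dws_trade_user_1d AS
        |SELECT user_id,
        |       dt,
        |       COUNT(order_id)   AS order_cnt_1d,
        |       SUM(order_amount) AS order_amount_1d
        |FROM ods_trade_order
        |GROUP BY user_id, dt
      """.stripMargin)
  }
}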
Introduction to SIMD
The AVX instruction set was introduced with the Sandy Bridge and Larrabee architectures. AVX widens the earlier 128-bit SIMD (Single Instruction, Multiple Data) registers to 256 bits. Because Sandy Bridge's SIMD execution units were extended to 256 bits and data transfer improved along with them, the theoretical peak floating-point performance of a CPU core doubled. The Intel AVX instruction set strengthens SIMD compute performance while remaining compatible with the earlier MMX/SSE instruction sets; unlike MMX/SSE, though, the enhanced AVX instructions differ substantially even at the instruction-format level. Original post, 2021-10-18 15:56:28 · 5335 reads · 0 comments
Spark Basic Concepts
Datasets and DataFrames: A Dataset is a distributed collection of data. Dataset is a new interface added in Spark 1.6 that provides the benefits of RDDs (strong typing, ability to use powerful lambda f... Original post, 2018-11-16 17:12:55 · 166 reads · 0 comments
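A minimal typed-Dataset sketch (assuming a local SparkSession; names are illustrative):

import org.apache.spark.sql.SparkSession

object DatasetDemo {
  case class Person(name: String, age: Long)

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("dataset-demo").master("local[*]").getOrCreate()
    import spark.implicits._

    // Strongly typed: lambdas keep compile-time types.
    val ds = Seq(Person("Andy", 32), Person("Justin", 19)).toDS()
    ds.filter(p => p.age >= 21).show()

    // The untyped view: a DataFrame is just a Dataset[Row].
    ds.toDF().printSchema()
  }
}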
Exploring Dataset
Overview: Before Spark 2.0, the main programming interface of Spark was the Resilient Distributed Dataset (RDD). After Spark 2.0, RDDs are replaced by Dataset, which is strongly-typed like an RDD, but with r... Original post, 2018-12-20 17:24:43 · 328 reads · 0 comments
Installing ZooKeeper
Cluster installation. Edit the configuration file: download zookeeper-3.4.13.tar.gz, unpack it, and enter the conf directory. Run cp zoo_sample.cfg zoo.cfg, then edit zoo.cfg:
# The number of milliseconds of each tick
tickTime=2000
# The number of ticks that the initial
# synchroniza...
Original post, 2018-12-26 15:50:55 · 209 reads · 0 comments
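For reference, a minimal three-node zoo.cfg could look like this sketch (hostnames are borrowed from the Kafka entry below; paths and hosts are placeholders):

tickTime=2000
initLimit=10
syncLimit=5
dataDir=/var/lib/zookeeper
clientPort=2181
# one line per ensemble member: server.<myid>=<host>:<peer-port>:<election-port>
server.1=sr541:2888:3888
server.2=sr553:2888:3888
server.3=sr554:2888:3888

Each node additionally needs a myid file under dataDir containing its own server number.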
kafka 安装
Installing Kafka
For a single-node Kafka installation, see the official docs. Distributed installation: download the latest stable Kafka release, http://mirror.bit.edu.cn/apache/kafka/2.1.0/kafka_2.11-2.1.0.tgz, unpack it, and edit the configuration file config/server.properties, changing two entries: broker.id=541 and zookeeper.connect=sr541:2181,sr553:2181,sr554... Original post, 2018-12-26 16:12:04 · 163 reads · 0 comments
Reading and Writing Parquet in Spark
SQLConf:
// This is used to set the default data source
val DEFAULT_DATA_SOURCE_NAME = buildConf("spark.sql.sources.default")
  .doc("The default data source to use in input/output.")
  .stringCo...
Original post, 2018-12-10 15:47:41 · 3202 reads · 1 comment
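Since parquet is that default data source, round-tripping a DataFrame is a one-liner each way; a minimal sketch (the path is illustrative):

import org.apache.spark.sql.SparkSession

object ParquetDemo {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("parquet-demo").master("local[*]").getOrCreate()
    import spark.implicits._

    val df = Seq((1, "a"), (2, "b")).toDF("id", "value")
    df.write.mode("overwrite").parquet("/tmp/demo.parquet")

    // Because parquet is spark.sql.sources.default, save()/load() without a format behave the same.
    spark.read.parquet("/tmp/demo.parquet").show()
  }
}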
Spark: FileFormat
Every FileFormat implements inferSchema, but it is invoked only once, at initialization. ParquetFileFormat: Spark obtains the Parquet schema by launching a job:
/**
 * Figures out a merged Parquet schema with a distributed Spark job.
 *
 * Note that lo...
Original post, 2018-12-17 11:43:04 · 545 reads · 0 comments
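That distributed merge job is what runs when schema merging is requested; a sketch, assuming an active SparkSession spark (the path is illustrative):

// Merge per-file Parquet schemas across part files (off by default since Spark 1.5).
val merged = spark.read
  .option("mergeSchema", "true")
  .parquet("/tmp/table_with_evolving_schema")
merged.printSchema()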
spark-shell
spark-shell is a script that delegates to spark-submit:
function main() {
  if $cygwin; then
    # Workaround for issue involving JLine and Cygwin
    # (see http://sourceforge.net/p/jline/bugs/40/).
    # If you're usi...
Original post, 2018-12-23 11:43:12 · 488 reads · 0 comments
OAP read parquet
Spark 2.1, FileScanRDD:
private def nextIterator(): Boolean = {
  ...
  currentIterator = readFunction(currentFile)
  ...
}
OptimizedParquetFileFormat:
override def buildReaderWithPartitionValues(
    sparkS...
Original post, 2018-12-12 13:26:02 · 374 reads · 0 comments
Spark Shared Variables
Broadcast variables and accumulators. Normally, when a function passed to a Spark operation (such as map or reduce) is executed on a remote cluster node, it works on separate copies of all the variable... Original post, 2018-12-20 16:45:57 · 255 reads · 0 comments
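A minimal sketch of both shared-variable kinds (assuming a local SparkSession; names are illustrative):

import org.apache.spark.sql.SparkSession

object SharedVarsDemo {
  def main(args: Array[String]): Unit = {
    val sc = SparkSession.builder().appName("shared-vars").master("local[*]").getOrCreate().sparkContext

    // Broadcast: ship a read-only value to every executor once.
    val lookup = sc.broadcast(Map(1 -> "a", 2 -> "b"))
    // Accumulator: executors add to it; only the driver reads the result.
    val misses = sc.longAccumulator("misses")

    val mapped = sc.parallelize(Seq(1, 2, 3)).map { k =>
      if (!lookup.value.contains(k)) misses.add(1)
      lookup.value.getOrElse(k, "?")
    }
    mapped.collect().foreach(println)
    println(s"keys without a mapping: ${misses.value}")
  }
}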
Spark: SparkContext
Initializing Spark: The first thing a Spark program must do is to create a SparkContext object, which tells Spark how to access a cluster. To create a SparkContext you first need to build a SparkConf o... Original post, 2018-12-20 09:33:09 · 276 reads · 1 comment
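The canonical construction order, as a runnable sketch (master and app name are illustrative):

import org.apache.spark.{SparkConf, SparkContext}

object ContextDemo {
  def main(args: Array[String]): Unit = {
    // Build a SparkConf first, then the SparkContext from it.
    val conf = new SparkConf()
      .setAppName("context-demo")
      .setMaster("local[2]") // usually supplied by spark-submit instead
    val sc = new SparkContext(conf)
    println(sc.version)
    sc.stop()
  }
}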
Exploring RDDs
Overview: At a high level, every Spark application consists of a driver program that runs the user's main function and executes various parallel operations on a cluster. RDD: The main abstraction Spark provi... Original post, 2018-12-19 15:28:23 · 345 reads · 1 comment
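A minimal parallel operation on an RDD, assuming an active SparkContext sc (for example from the sketch above):

// Distribute a local collection, transform it in parallel, and reduce on the driver.
val distData = sc.parallelize(1 to 100)
val sumOfSquares = distData.map(x => x * x).reduce(_ + _)
println(sumOfSquares) // 338350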
Algorithm Notes
An ordinary hash scheme, such as modulo placement, takes the key modulo the number of server nodes. Consistent hashing instead takes the hash modulo 2^32: the entire hash value space is organized as a virtual ring. Suppose a hash function H has the value space 0 to 2^32-1 (hash values are 32-bit unsigned integers). The ring is laid out clockwise, with 0 at the top of the circle, the first point to its right representing 1, and so on through 2, 3, 4, 5, 6... up to 2^32-1; this circle of 2^32 points is called the hash ring... Original post, 2018-11-19 10:52:18 · 398 reads · 1 comment
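A hypothetical minimal ring in Scala (CRC32 stands in for the hash H above; node names are placeholders):

import java.util.zip.CRC32
import scala.collection.immutable.TreeMap

class HashRing(nodes: Seq[String]) {
  private def hash(key: String): Long = {
    val crc = new CRC32()
    crc.update(key.getBytes("UTF-8"))
    crc.getValue // 0 .. 2^32 - 1, matching the ring described above
  }

  private val ring: TreeMap[Long, String] =
    TreeMap(nodes.map(n => hash(n) -> n): _*)

  // Walk clockwise: the first node at or after the key's hash owns it.
  def nodeFor(key: String): String = {
    val it = ring.iteratorFrom(hash(key))
    if (it.hasNext) it.next()._2 else ring.head._2 // wrap around the ring
  }
}

object RingDemo extends App {
  val ring = new HashRing(Seq("sr541", "sr553", "sr554"))
  Seq("user-1", "user-2", "user-3").foreach(k => println(k + " -> " + ring.nodeFor(k)))
}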
Spark Job Submission
On the driver side, job submission performs the following work: analyze RDD dependencies to produce the DAG; split the job into multiple stages according to the DAG; once the stages are settled, generate the corresponding tasks and dispatch them to executors for execution. The entry point of the implementation is in SparkContext.scala:
/**
 * Run a job on all partitions in an RDD and return t...
Original post, 2018-11-19 14:48:03 · 288 reads · 1 comment
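Actions bottom out in the runJob family; a sketch of a direct call, assuming an active SparkContext sc:

// count() and collect() ultimately funnel into SparkContext.runJob, roughly like:
val rdd = sc.parallelize(1 to 10, numSlices = 2)
val perPartitionSizes: Array[Int] = sc.runJob(rdd, (iter: Iterator[Int]) => iter.size)
println(perPartitionSizes.sum) // 10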
Dataset schema
/**
 * Returns the schema of this Dataset.
 *
 * @group basic
 * @since 1.6.0
 */
def schema: StructType = queryExecution.analyzed.schema
Original post, 2018-12-04 13:20:02 · 632 reads · 0 comments
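Calling it from user code, assuming an active SparkSession spark:

// schema returns the StructType of the analyzed plan; printSchema pretty-prints it.
val df = spark.range(3).toDF("user_id")
println(df.schema)
df.printSchema()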
Scala Keyword: case
Benefits of declaring a class with case: it creates the case class and its companion object; the generated apply method lets you construct instances without new; every parameter in the primary constructor's parameter list gets val by default; and natural hashCode, equals, and toString methods are added. Since == in Scala always means equals, case class instances are always comparable. The following three operations are equivalent: val ... Original post, 2018-12-04 14:16:02 · 791 reads · 0 comments
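The truncated equivalence likely refers to the three construction forms; a sketch:

case class Point(x: Int, y: Int)

object CaseDemo extends App {
  // These three are equivalent: the companion apply means no `new` is needed.
  val p1 = Point(1, 2)
  val p2 = Point.apply(1, 2)
  val p3 = new Point(1, 2)

  println(p1.x)     // constructor parameters are vals
  println(p1 == p2) // true: structural equality via the generated equals
  println(p1)       // Point(1,2): the generated toString
}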
Scala Keyword: lazy
First, an example:
scala> val a = { println("I'am a"); "aaa"}
I'am a
a: String = aaa

scala> a
res8: String = aaa

scala> lazy val a = { println("I'am a"); "aaa"}
Original post, 2018-11-28 11:46:21 · 346 reads · 0 comments
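The transcript breaks off at the lazy case; the expected continuation (reconstructed, not from the post) is that the initializer block does not run until the value is first accessed:

scala> lazy val a = { println("I'am a"); "aaa"}
a: String = <lazy>

scala> a
I'am a
res9: String = aaa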
Steps to Implement Spark DataSourceV2
Extend DataSourceV2:
class SimpleWritableDataSource extends DataSourceV2 with ReadSupport with WriteSupport {
  override def createReader()
  override def createWriter()
}
Construct the Reader:
class Reader(path: St...
Original post, 2018-11-28 14:21:01 · 847 reads · 0 comments
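Loading such a source goes through the ordinary DataFrameReader; a sketch, assuming an active SparkSession spark (the class name comes from the excerpt above; the path is illustrative):

// In Spark 2.x, a DataSourceV2 implementation is addressed by its class name.
val df = spark.read
  .format(classOf[SimpleWritableDataSource].getName)
  .load("/tmp/dsv2-demo")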
Spark Deploy Modes and Launched Processes
Spark Standalone Mode. Launching Spark applications: The spark-submit script provides the most straightforward way to submit a compiled Spark application to the cluster. For standalone cl... Original post, 2018-12-12 15:45:56 · 588 reads · 0 comments
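A typical standalone-cluster submission looks like this sketch (master URL and jar path are placeholders):

./bin/spark-submit \
  --class org.apache.spark.examples.SparkPi \
  --master spark://sr541:7077 \
  --deploy-mode cluster \
  --executor-memory 2G \
  /path/to/examples.jar 1000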
spark on yarn
Apache Hadoop YARN, the concept: The fundamental idea of YARN is to split up the functionalities of resource management and job scheduling/monitoring into separate daemons... Original post, 2018-12-13 12:53:55 · 184 reads · 0 comments
Spark Components
SparkEnv: SparkEnv is Spark's execution-environment object; it lives in the driver and executor processes. BlockManager: both the driver application and the executors create a BlockManager, a manager running on every node (driver and executors) which provides interfaces... Original post, 2018-12-18 15:05:15 · 206 reads · 0 comments
Spark: persist
persist:
/**
 * Set this RDD's storage level to persist its values across operations after the first time
 * it is computed. This can only be used to assign a new storage level if the RDD does not...
Original post, 2018-12-18 16:29:09 · 663 reads · 1 comment
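Typical usage, assuming an active SparkContext sc (the path is illustrative):

import org.apache.spark.storage.StorageLevel

// Cache the RDD after its first computation; cache() is persist(MEMORY_ONLY).
val words = sc.textFile("/tmp/words.txt").flatMap(_.split(" "))
words.persist(StorageLevel.MEMORY_AND_DISK)
println(words.count()) // first action computes and populates the cache
println(words.count()) // subsequent actions reuse the cached blocks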
Databricks IO (DBIO) cache
Databricks IO Cache: The Databricks IO cache accelerates data reads by creating copies of remote files in nodes' local storage using a fast intermediate data format. The data is cached automatically whe... Original post, 2019-01-04 09:40:54 · 681 reads · 0 comments
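On Databricks runtimes the cache can also be toggled per session; a sketch, assuming an active SparkSession spark on such a cluster:

// Enable the IO cache for this session (Databricks runtime only).
spark.conf.set("spark.databricks.io.cache.enabled", "true")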
The Kubernetes Operator for Apache Spark (spark-on-k8s-operator)
Kubernetes Operator for Apache Spark, design. Introduction: In Spark 2.3, Kubernetes became an official scheduler backend for Spark, in addition to the standalone scheduler, Mesos, and Yarn. Compared w... Original post, 2019-04-16 20:23:02 · 843 reads · 0 comments
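With the operator installed, an application is described declaratively rather than via spark-submit; a minimal SparkApplication manifest might look like this sketch (image, jar path, and names are placeholders modeled on the operator's examples):

apiVersion: "sparkoperator.k8s.io/v1beta2"
kind: SparkApplication
metadata:
  name: spark-pi
  namespace: default
spec:
  type: Scala
  mode: cluster
  image: "gcr.io/spark-operator/spark:v2.4.0"
  mainClass: org.apache.spark.examples.SparkPi
  mainApplicationFile: "local:///opt/spark/examples/jars/spark-examples_2.11-2.4.0.jar"
  sparkVersion: "2.4.0"
  driver:
    cores: 1
    memory: "512m"
    serviceAccount: spark
  executor:
    instances: 1
    cores: 1
    memory: "512m"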
spark start-thriftserver.sh & Kubernetes
Launch command:
sh sbin/start-thriftserver.sh \
  --master k8s://https://192.168.99.108:8443 \
  --name spark-thriftserver \
  --conf spark.executor.instances=1 \
  --conf spark.kubernetes.container.image=zhixingheyitian/spark...
Original post, 2019-04-19 14:05:40 · 1573 reads · 1 comment
spark sql examples on kubernetes
Submit SQL to the Thrift server via beeline. Run the Thrift server in a pod:
sh sbin/start-thriftserver.sh \
  --master k8s://https://kubernetes.default.svc.cluster.local:443 \
  --name spark-thriftserver \
  ...
Original post, 2019-05-07 20:51:37 · 695 reads · 0 comments
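Once the Thrift server is up, beeline connects over JDBC; a sketch (the service hostname is a placeholder; 10000 is the usual default port):

bin/beeline -u jdbc:hive2://spark-thriftserver:10000
0: jdbc:hive2://spark-thriftserver:10000> show tables;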
OAP UI with Different Storage Media
bin/spark-sql \
  --master k8s://https://192.168.99.108:8443 \
  --deploy-mode client \
  --name spark-sql \
  --conf spark.executor.instances=2 \
  --conf spark.kubernetes.container.image=z...
Original post, 2019-05-13 16:49:52 · 222 reads · 0 comments
Repairing a Hadoop Cluster
Repairing a single datanode. First set up passwordless login: append the content of the namenode's ~/.ssh/id_rsa.pub to ~/.ssh/authorized_keys on the datanode. Then update the configuration: if the datanode's IP changed, update $HADOOP_HOME/etc/slaves, and update /etc/hosts as well... Original post, 2019-06-22 09:27:20 · 261 reads · 0 comments
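The key append can be done in one line from the namenode (sketch; datanode1 is a placeholder hostname):

cat ~/.ssh/id_rsa.pub | ssh datanode1 'cat >> ~/.ssh/authorized_keys'
# or equivalently: ssh-copy-id datanode1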
OAP FileFormat
OAP File:
// OAP Data File V1 Meta Part
// ..
// Field                        Length In Byte
// Meta
//   Magic and Version            4
//   Row Count In Each Row Group  4
//   ...
Original post, 2019-06-18 10:38:36 · 411 reads · 0 comments
Spark on Yarn
Deploy modes: There are two deploy modes that can be used to launch Spark applications on YARN. In cluster mode, the Spark driver runs inside an application master process which is managed by YARN on t... Original post, 2019-07-06 10:01:53 · 181 reads · 0 comments
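Submitting in each mode, as a sketch (the jar path is a placeholder):

# cluster mode: the driver runs inside the YARN ApplicationMaster
./bin/spark-submit --master yarn --deploy-mode cluster \
  --class org.apache.spark.examples.SparkPi /path/to/examples.jar

# client mode: the driver runs in the submitting process
./bin/spark-submit --master yarn --deploy-mode client \
  --class org.apache.spark.examples.SparkPi /path/to/examples.jar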
Experiments with Spark Deploy Modes
YARN cluster mode: the driver side runs inside the ApplicationMaster process on a NodeManager, which is where the driver's stdout appears. Original post, 2019-08-29 10:28:10 · 1107 reads · 0 comments
Spark Source Analysis: SparkSubmit.scala
clusterManager, deployMode:
// Spark 2.3.2 SparkSubmit.scala
private def doPrepareSubmitEnvironment(
    args: SparkSubmitArguments,
    conf: Option[HadoopConfiguration] = None)
  : (Seq[Stri...
Original post, 2019-04-16 15:36:53 · 437 reads · 1 comment
Spark, Hadoop, and YARN Cluster Notes
Running jps on the Hadoop master node shows the following processes. Hadoop: SecondaryNameNode, NameNode. YARN: ResourceManager. Spark's job history service (started with sbin/start-history-server.sh): HistoryServer. Hive: RunJar... Original post, 2019-04-09 17:07:29 · 176 reads · 0 comments
Spark: Strategy
package object sql {
  /**
   * Converts a logical plan into zero or more SparkPlans. This API is exposed for experimenting
   * with the query planner and is not designed to be stable across spark ...
Original post, 2019-01-05 21:21:08 · 544 reads · 0 comments
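A custom Strategy is plugged in through the experimental methods; a minimal sketch (this strategy declines every plan, so the built-in planner still handles everything):

import org.apache.spark.sql.{SparkSession, Strategy}
import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan
import org.apache.spark.sql.execution.SparkPlan

// Returning Nil tells the planner this strategy does not apply.
object MyStrategy extends Strategy {
  override def apply(plan: LogicalPlan): Seq[SparkPlan] = Nil
}

object StrategyDemo extends App {
  val spark = SparkSession.builder().appName("strategy-demo").master("local[*]").getOrCreate()
  spark.experimental.extraStrategies = Seq(MyStrategy)
}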
FileSourceScanExec
FileSourceScanExec is a leaf physical operator (a DataSourceScanExec) that represents a scan over collections of files (including Hive tables). FileSourceScanExec is created exclusively for a Logic... Original post, 2019-01-13 10:18:44 · 568 reads · 0 comments
OAP ParquetDataFile and Cache
ParquetDataFile.scala:
val iterator = reader.iteratorWithRowIds(requiredIds, rowIds)
  .asInstanceOf[OapCompletionIterator[InternalRow]]
val result = ArrayBuffer[Int]()
while (iterator.hasN...
Original post, 2019-02-01 16:17:34 · 231 reads · 0 comments