clusterdata-2011-2: Google Cluster Data Analysis (Part 1)

License: CC 4.0 BY-SA

Article tags:

#google #cluster #google-cluster-data #trace

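This article introduces the six main tables of the Google cluster data: machine events, machine attributes, job events, task events, task constraints, and task resource usage. It walks through the fields of each table, such as timestamp, event type, priority, and resource requests, and discusses the five categories of task priority and what they mean. The dataset can be used to gain a deeper understanding of how Google's clusters work.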

Data download:

Link: https://pan.baidu.com/s/1r0AOSstlLV1YSetwbdwJcg
Extraction code: 0ob8

Documentation download:

Link: https://pan.baidu.com/s/1h10kaiS89sfsPSjfcB7G6g
Extraction code: rd3p

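The Google cluster data consists of six main tables with a total size of 41 GB. As a first step, this post summarizes the field names of each table and what they mean.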

Machine events:

1. timestamp 2. machine ID 3. event type 4. platform ID 5. capacity: CPU 6. capacity: memory

Timestamps are in microseconds. The event type takes the values 0 (ADD), 1 (REMOVE), and 2 (UPDATE). The platform ID is an opaque string.
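To make the field list concrete, here is a minimal sketch of parsing machine event rows with Python's standard `csv` module. The column names, the `parse_machine_events` helper, and the sample rows are made up for illustration; the sketch also assumes the CSV has no header row and that capacity fields may be empty.

```python
import csv
import io

# Column names chosen here to mirror the six fields listed above (hypothetical names).
MACHINE_EVENT_COLUMNS = [
    "timestamp", "machine_id", "event_type",
    "platform_id", "cpu_capacity", "memory_capacity",
]
MACHINE_EVENT_TYPES = {0: "ADD", 1: "REMOVE", 2: "UPDATE"}


def parse_machine_events(lines):
    """Yield one dict per row, converting the timestamp from microseconds to seconds."""
    reader = csv.DictReader(lines, fieldnames=MACHINE_EVENT_COLUMNS)
    for row in reader:
        yield {
            "time_s": int(row["timestamp"]) / 1_000_000,
            "machine_id": row["machine_id"],
            "event": MACHINE_EVENT_TYPES[int(row["event_type"])],
            "platform_id": row["platform_id"],  # opaque string, kept as-is
            # Assumption: an empty capacity field is treated as "unknown".
            "cpu": float(row["cpu_capacity"]) if row["cpu_capacity"] else None,
            "memory": float(row["memory_capacity"]) if row["memory_capacity"] else None,
        }


# Hypothetical sample rows, not taken from the real trace.
SAMPLE = io.StringIO(
    "0,5,0,platformA,0.5,0.25\n"
    "2500000,5,1,platformA,,\n"
)
events = list(parse_machine_events(SAMPLE))
for event in events:
    print(event)
```

For a real file, an open file object can be passed in place of the `io.StringIO` sample.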

Machine attributes:

1. timestamp 2. machine ID 3. attribute name 4. attribute value 5. attribute deleted

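The attribute name is an opaque string, the attribute value is a number or a string, and attribute deleted is a Boolean indicating whether the attribute was deleted.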

Job events table:

1. timestamp 2. missing info 3. job ID 4. event type 5. user name 6. scheduling class 7. job name 8. logical job name

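The event type field takes values 0 to 8, which mean:

0 submit, 1 schedule, 2 evict (preempted), 3 fail, 4 finish, 5 kill, 6 lost, 7 update pending, 8 update running.

Scheduling class: this roughly indicates how latency-sensitive a job is. It is a single number, where 3 denotes a more latency-sensitive job and 0 denotes a non-production task (for example, non-business-critical analysis). Note that scheduling class is not priority, although more latency-sensitive tasks tend to have higher priorities. Scheduling class affects machine-local policy for resource access, whereas priority determines whether a task is scheduled on a machine.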
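The event type codes can be captured in a small enum so that later analysis works with names rather than bare numbers. This is a sketch only: the `JobEventType` class and the `summarize_events` helper are names introduced here for illustration, and the input list is invented.

```python
from collections import Counter
from enum import IntEnum


class JobEventType(IntEnum):
    """Event type codes 0 to 8 as listed above."""
    SUBMIT = 0
    SCHEDULE = 1
    EVICT = 2
    FAIL = 3
    FINISH = 4
    KILL = 5
    LOST = 6
    UPDATE_PENDING = 7
    UPDATE_RUNNING = 8


def summarize_events(codes):
    """Count how often each event type occurs in a sequence of numeric codes."""
    return Counter(JobEventType(code).name for code in codes)


# Hypothetical event codes for illustration.
summary = summarize_events([0, 1, 4, 0, 1, 5])
print(summary)
```

An unknown code raises `ValueError` from the enum constructor, which is a convenient way to notice malformed rows early.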

Task events table:

1. timestamp 2. missing info 3. job ID 4. task index - within the job 5. machine ID 6. event type 7. user name 8. scheduling class 9. priority 10. resource request for CPU cores 11. resource request for RAM 12. resource request for local disk space 13. different-machine constraint

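Job ID and user name are related: each job ID maps to exactly one user name, while one user name can map to many job IDs.

The task index is an integer that identifies a task within its job. The number of distinct task indices in a job therefore shows how many tasks the job was split into, which can also be read as its degree of parallelism, since these tasks generally run in parallel on different machines.

The priority field gives each task's priority. Values range from 0 to 11 and fall into five priority classes: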
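Both observations, one user per job and the number of tasks per job, can be checked directly against task event rows. The following is a rough sketch: the helper name `tasks_and_users_per_job` and the sample rows are invented for illustration, and the column positions simply follow the field order listed above.

```python
import csv
import io
from collections import defaultdict

# 0-based positions of the fields used here, following the task events field list.
JOB_ID, TASK_INDEX, USER_NAME = 2, 3, 6


def tasks_and_users_per_job(lines):
    """Return (number of distinct task indices per job, set of user names per job)."""
    task_indices = defaultdict(set)
    users = defaultdict(set)
    for row in csv.reader(lines):
        task_indices[row[JOB_ID]].add(row[TASK_INDEX])
        users[row[JOB_ID]].add(row[USER_NAME])
    task_counts = {job: len(indices) for job, indices in task_indices.items()}
    return task_counts, dict(users)


# Hypothetical sample rows (13 fields each), not taken from the real trace.
SAMPLE = io.StringIO(
    "0,,100,0,1,0,userA,2,9,0.1,0.05,0.001,0\n"
    "0,,100,1,2,0,userA,2,9,0.1,0.05,0.001,0\n"
    "5,,100,1,2,1,userA,2,9,0.1,0.05,0.001,0\n"
    "0,,200,0,3,0,userA,0,0,0.02,0.01,0.0,0\n"
)
task_counts, users = tasks_and_users_per_job(SAMPLE)
print(task_counts)
print(all(len(names) == 1 for names in users.values()))
```

Counting distinct task indices rather than rows matters because a single task usually appears in several event rows (submit, schedule, finish, and so on).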

infrastructure (11)—this is the highest (most entitled to get resources) priority in the trace and accounts for most of the recorded disk I/O, so we speculate it includes some storage services;
monitoring (10)
normal production (9)—this is the lowest (and most occupied) of the priorities labeled ‘production’. The trace providers indicate that jobs at this priority and higher which are latency-sensitive should not be “evicted due to over-allocation of machine resources” .
other (2-8) — we speculate that these priorities are dominated by batch jobs;
gratis (free) (0-1) — the trace providers indicate that resources used by tasks at these priorities are generally not charged.
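The five classes above amount to a simple lookup from the numeric priority. A minimal sketch, where `priority_class` is a helper name introduced here for illustration:

```python
def priority_class(priority):
    """Map a priority value (0 to 11) to one of the five classes listed above."""
    if not 0 <= priority <= 11:
        raise ValueError(f"priority out of range: {priority}")
    if priority == 11:
        return "infrastructure"
    if priority == 10:
        return "monitoring"
    if priority == 9:
        return "normal production"
    if priority >= 2:
        return "other"
    return "gratis"


classes = [priority_class(p) for p in (0, 1, 2, 8, 9, 10, 11)]
print(classes)
```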

Task constraints table:

1. timestamp 2. job ID 3. task index 4. attribute name -- corresponds to machine attribute table 5. attribute value -- either an opaque string or an integer or the empty string 6. comparison operator

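The sixth field is the comparison operator. The attribute value in the dataset is either a string or a number, and it is not obvious at first how the comparison is carried out. The operators are defined as follows:

Less than (2), greater than (3): the machine attribute is interpreted as an integer (or 0 if the attribute is absent) and compared with the supplied attribute value. These comparisons are strictly less than and strictly greater than.
Equal (0), not equal (1): the machine attribute is interpreted as a string (or the empty string if it is absent) and compared with the supplied attribute value.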
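The comparison rules can be written out as a small function. This is only a sketch of the rules as described here, not the scheduler's actual code: the `satisfies` helper is hypothetical, a missing machine attribute is represented as `None`, and the constraint value is assumed to be a valid integer string for the less-than and greater-than operators.

```python
EQUAL, NOT_EQUAL, LESS_THAN, GREATER_THAN = 0, 1, 2, 3


def satisfies(machine_value, operator, constraint_value):
    """Check one constraint against a machine attribute value (None if absent)."""
    if operator in (EQUAL, NOT_EQUAL):
        # String comparison; a missing attribute counts as the empty string.
        left = "" if machine_value is None else str(machine_value)
        is_equal = left == str(constraint_value)
        return is_equal if operator == EQUAL else not is_equal
    if operator in (LESS_THAN, GREATER_THAN):
        # Integer comparison; a missing attribute counts as 0.
        left = 0 if machine_value is None else int(machine_value)
        right = int(constraint_value)
        return left < right if operator == LESS_THAN else left > right
    raise ValueError(f"unknown comparison operator: {operator}")


print(satisfies(None, LESS_THAN, "1"))
print(satisfies("5", GREATER_THAN, "5"))
print(satisfies(None, EQUAL, ""))
```

Note how a missing attribute still satisfies some constraints: it compares as 0 for the integer operators and as the empty string for the equality operators.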

Task resource usage table:

1. start time of the measurement period 2. end time of the measurement period 3. job ID 4. task index 5. machine ID 6. mean CPU usage rate 7. canonical memory usage 8. assigned memory usage 9. unmapped page cache memory usage 10. total page cache memory usage 11. maximum memory usage 12. mean disk I/O time 13. mean local disk space used 14. maximum CPU usage 15. maximum disk IO time 16. cycles per instruction (CPI)

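This table is especially important: its records show how a job's execution progresses over time. A detailed discussion follows in the next post.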
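As a small preview of reading execution progress from this table, the sketch below groups usage records by task and orders them by measurement start time. The `usage_timeline` helper and the sample rows are invented for illustration; the column positions follow the field list above, and the times are assumed to be in microseconds like the other timestamps.

```python
import csv
import io
from collections import defaultdict

# 0-based positions of the fields used here, following the task resource usage field list.
START, END, JOB_ID, TASK_INDEX, MEAN_CPU = 0, 1, 2, 3, 5


def usage_timeline(lines):
    """Group usage samples by (job ID, task index), sorted by start time in seconds."""
    timeline = defaultdict(list)
    for row in csv.reader(lines):
        key = (row[JOB_ID], row[TASK_INDEX])
        sample = (
            int(row[START]) / 1_000_000,  # microseconds -> seconds
            int(row[END]) / 1_000_000,
            float(row[MEAN_CPU]),
        )
        timeline[key].append(sample)
    for samples in timeline.values():
        samples.sort()  # tuples sort by start time first
    return dict(timeline)


# Hypothetical sample rows (16 fields each), deliberately out of time order.
SAMPLE = io.StringIO(
    "600000000,900000000,100,0,1,0.03,0.01,0.02,0.001,0.002,0.015,0.0005,0.0001,0.05,0.001,1.5\n"
    "300000000,600000000,100,0,1,0.01,0.01,0.02,0.001,0.002,0.012,0.0004,0.0001,0.02,0.001,1.4\n"
)
timeline = usage_timeline(SAMPLE)
for key, samples in timeline.items():
    print(key, samples)
```

Each task's list then reads as a time series of mean CPU usage per measurement period, which is one way to follow how a job runs.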