Build in the Cloud: How the Build System works

本文详细介绍了谷歌如何利用云平台进行大规模的软件构建和测试工作,通过将源代码划分为更小的包,使用BUILD文件描述依赖关系,以及采用非文件依赖的规则来提高构建效率。谷歌的构建系统能够智能地组织并执行构建任务,即使在大量并行操作下也能保持高效。

This is the second in a four part series describing how we use the cloud to scale building and testing of software at Google. This series elaborates on a presentation given during the Pre-GTAC 2010 event in Hyderabad. Please see our first post in the series for a description how we access our code repository in a scalable way. To get a sense of the scale and details on the types of problems we are solving in Engineering Tools at Google take a look at this post.


As you have learned in the previous post, at Google every product is constantly building from head. In this article we will go into a little more detail in how we tell the build system about the source code and its dependencies. In a later article we will explain how we take advantage of having this exact knowledge to distribute the builds at Google across a large cluster of machines and share build results between developers.

From some of the feedback we got for our first posts it is clear that you want to see how we actually go about describing the dependencies we use to drive our build and test support.

At Google we partition our source code into smaller entities that we call packages. You can think of a package as a directory containing a number of source files and a description of how these source files should get translated into build outputs. The files describing that are all called BUILD and the existence of a BUILD file in a directory makes that directory into a package.

All packages reside in a single file tree and the relative path from the root of that tree to the directory containing the BUILD file is used as a globally unique identifier for the package. This means that there is a 1:1 relationship between package names and directory names.

Within a BUILD file we have a number of named rules describing the build outputs of the package. The combination of package name and rule name uniquely identifies the rule. We call this combination a label and we use labels to describe dependencies across rules.

Let’s look at an example to make this a little more concrete:

/search/BUILD:
cc_binary(name = ‘google_search_page’,
deps = [ ‘:search’,
‘:show_results’])

cc_library(name = ‘search’,
srcs = [ ‘search.h’,‘search.cc’],
deps = [‘//index:query’])

/index/BUILD:

cc_library(name = ‘query’,
srcs = [ ‘query.h’, ‘query.cc’, ‘query_util.cc’],
deps = [‘:ranking’,
‘:index’])

cc_library(name = ‘ranking’,
srcs = [‘ranking.h’, ‘ranking.cc’],
deps = [‘:index’,
‘//storage/database:query’])

cc_library(name = ‘index’,
srcs = [‘index.h’, ‘index.cc’],
deps = [‘//storage/database:query’])

This example shows two BUILD files. The first one describes a binary and a library in the package //search and the second one describes several libraries in the package //index. All rules are named using the name attribute, You can see the dependencies expressed in the deps attribute of each rule. The ‘:’ separates the package name from the name of the rule. Dependencies on rules in a different package start with a “//” and dependencies on rules in the same package can just omit the package name itself and start with a ‘:’. You can clearly see the dependencies between the rules. If you look carefully you will notice that several rules depend on the same rule //storage/database:query, which means that the dependencies form a directed graph. We require that the graph is acyclic, so that we can create an order in which to build the individual targets.

 

Here is a graphical representation of the dependencies described above:

 

You can also see that instead of referring to concrete build outputs as typically happens in make based build systems, we refer to abstract entities. In fact a rule like cc_library does not necessarily produce any output at all. It is simply a way to logically organize your source files. This has an important advantage that our build system uses to make the build faster. Even though there are dependencies between the different cc_library rules, we have the freedom to compile all of the source files in arbitrary order. All we need to compile ‘query.cc’ and ‘query_util.cc’ are the header files of the dependencies ‘ranking.h’ and ‘index.h’. In practice we will compile all these source files at the same time, which is always possible unless there is a dependency on an actual output file of another rule.


In order to actually perform the build steps necessary we decompose each rule into one or more discrete steps called actions. You can think of an action as a command line and a list of the input and output files. Output files of one action can be the input files to another action. That way all the actions in a build form a bipartite graph consisting of actions and files. All we have to do to build our target is to walk this graph from the leaves - the source files that do not have an action that produces them - to the root and executing all the actions in that order. This will guarantee that we will always have all the input files to an action in place prior to executing that action.

 

An example of the bipartite action graph for a small number of targets. Actions are shown in green and files in yellow color:

 


We also make sure that the command-line arguments of every action only contain file names using relative paths - using the relative path from the root of our directory hierarchy for source files. This and the fact that we know all input and output files for an action allows us to easily execute any action remotely: we just copy all of the input files to the remote machine, execute the command on that machine and copy the output files back to the user’s machine. We will talk about some of the ways we make remote builds more efficient in a later post.

All of this is not much different from what happens in make. The important difference lies in the fact that we don’t specify the dependencies between rules in terms of files, which gives the build system a much larger degree of freedom to choose the order in which to execute the actions. The more parallelism we have in a build (i.e. the wider the action graph) the faster we can deliver the end results to the user. At Google we usually build our solutions in a way that can scale simply by adding more machines. If we have a sufficiently large number of machines in our datacenter, the time it takes to perform a build should be dominated only by the height of the action graph.

What I describe above is how a clean build works at Google. In practice most builds during development will be incremental. This means that a developer will make changes to a small number of source files in between builds. Building the whole action graph would be wasteful in this case, so we will skip executing an action unless one or more of its input files change compared to the previous build. In order to do that we keep track of the content digest of each input file whenever we execute an action. As we mentioned in the previous blog post, we keep track of the content digest of source files and we use the same content digest to track changes to files. The FUSE filesystem described in the previous post exposes the digest of each file as an extended attribute. This makes determining the digest of a file very cheap for the build system and allows us to skip executing actions if none of the input files have changed.

Stay tuned! In this post we explained how the build system works and some of the important concepts that allow us to make the build faster. Part three in this series describes how we made remote execution really efficient and how it can be used to improve build times even more!

- Christian Kemper

先展示下效果 https://pan.quark.cn/s/a4b39357ea24 遗传算法 - 简书 遗传算法的理论是根据达尔文进化论而设计出来的算法: 人类是朝着好的方向(最优解)进化,进化过程中,会自动选择优良基因,淘汰劣等基因。 遗传算法(英语:genetic algorithm (GA) )是计算数学中用于解决最佳化的搜索算法,是进化算法的一种。 进化算法最初是借鉴了进化生物学中的一些现象而发展起来的,这些现象包括遗传、突变、自然选择、杂交等。 搜索算法的共同特征为: 首先组成一组候选解 依据某些适应性条件测算这些候选解的适应度 根据适应度保留某些候选解,放弃其他候选解 对保留的候选解进行某些操作,生成新的候选解 遗传算法流程 遗传算法的一般步骤 my_fitness函数 评估每条染色体所对应个体的适应度 升序排列适应度评估值,选出 前 parent_number 个 个体作为 待选 parent 种群(适应度函数的值越小越好) 从 待选 parent 种群 中随机选择 2 个个体作为父方和母方。 抽取父母双方的染色体,进行交叉,产生 2 个子代。 (交叉概率) 对子代(parent + 生成的 child)的染色体进行变异。 (变异概率) 重复3,4,5步骤,直到新种群(parentnumber + childnumber)的产生。 循环以上步骤直至找到满意的解。 名词解释 交叉概率:两个个体进行交配的概率。 例如,交配概率为0.8,则80%的“夫妻”会生育后代。 变异概率:所有的基因中发生变异的占总体的比例。 GA函数 适应度函数 适应度函数由解决的问题决定。 举一个平方和的例子。 简单的平方和问题 求函数的最小值,其中每个变量的取值区间都是 [-1, ...
评论
成就一亿技术人!
拼手气红包6.0元
还能输入1000个字符
 
红包 添加红包
表情包 插入表情
 条评论被折叠 查看
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值