Citymobil — a manual for improving availability amid business growth for startups. Part 2

This is the second article in the series «Citymobil — a manual for improving availability amid business growth for startups». You can read the first part here. Let's continue talking about the way we managed to improve the availability of Citymobil services. In the first article, we learned how to count the lost trips. Ok, so we are counting them. What now? Now that we are equipped with an understandable tool for measuring the lost trips, we can move on to the most interesting part: how do we decrease the losses? Without slowing down our current growth! Since it seemed to us that the lion's share of the technical problems causing the trip loss had something to do with the backend, we decided to turn our attention to the backend development process first. Jumping ahead, I'll say that we were right — the backend became the main site of the battle for the lost trips.

1. How the development process works

Problems are usually caused by code deployment and other manual actions. Services that are never changed or touched by hand do sometimes malfunction as well; however, that's the exception that proves the rule.

In my experience, the most interesting and unusual exception was the following. Way back in 2006, when I worked at a small webmail service, there was a gateway that proxied all the traffic and made sure the IP addresses weren't on any blacklists. The service ran on FreeBSD and it worked well. But one day it just stopped working. Guess why? The disk in that machine had failed (bad blocks had been forming for a while until the inevitable happened), and that had happened three years before the service failure. Everything kept running on the failed disk. And then FreeBSD, for reasons known only to itself, suddenly decided to access the failed disk and halted as a result.

When I was a child, 10-12 years old, I went hiking in the woods with my dad and heard a phrase from him that I never forgot: «all you need to do to keep the bonfire burning is not to touch it». I believe most of us can remember a situation when we fed some wood to an already burning fire and it went out for no apparent reason.

The bottom line is that problems are created by humans' manual actions: for example, when you feed wood to an already well-burning bonfire, thus cutting off the oxygen and killing the fire, or when you deploy code with bugs into production. Therefore, in order to understand what causes the service issues, we need to understand how the deployment and development process works.

At Citymobil the process was fully oriented toward fast development and organized in the following way:

  • 20-30 releases per day.

  • Developers perform deployment by themselves.

  • Quick testing in a test environment by the developer.

  • Minimal automated/unit tests, minimal code review.

The developers worked in rough conditions: no QA support and an enormous flow of very important tasks and experiments from the product team. They worked as intently and consistently as they could, solved hard tasks in a simple way, made sure the code didn't turn into spaghetti, understood the business problematics, treated changes responsibly and quickly rolled back what didn't work. There's nothing new here. There was a similar situation at the Mail.Ru email service 8 years ago when I started working there. We started Mail.ru Cloud up quickly and easily, with no prelude. We'd be changing our process down the road to achieve better availability.

I bet you've noticed it yourself: when there are no holds barred, when it's just you and production, when you're carrying a heavy burden of responsibility, you do wonders. I've had an experience like that. A long time ago I was pretty much the only developer at the Newmail.Ru webmail service (it was acquired a while ago and later taken down); I performed deployments by myself and conducted production testing on myself via if (!strcmp(username, "danikin")) { … some code… }. So, I was familiar with this situation.

I wouldn't be surprised to find out that such a «quick and dirty» approach has been utilized by many startups, both successful and not, all driven by the same passion: the desire for rapid business growth and market share.

Why did Citymobil have such a process? There were very few developers to begin with. They had been working for the company for a while and knew the code and the business very well. The process worked ideally under those conditions.

2. Why did the availability threat come along?

Growth of investment into product development made our product plans more aggressive, and we started hiring more developers. The number of deployments per day kept increasing, but their quality naturally decreased, since the new people had to dive into the system and the business under field conditions. The increase in the number of developers resulted in not just a linear but a quadratic drop in availability: the number of deployments was growing linearly, and the quality of an average deployment was dropping linearly, so «linear» * «linear» = «quadratic».

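To make «linear» * «linear» = «quadratic» concrete, here is a back-of-the-envelope model (my own illustration of the arithmetic, with hypothetical coefficients). Suppose the number of deployments per day grows linearly with team size, D(t) = a·t, and the probability that an average deployment causes an incident also grows linearly as the share of newcomers rises, p(t) = b·t. Then the expected number of deployment-caused incidents per day is:

    E(t) = D(t) · p(t) = a·b·t²

That is quadratic in time, which is exactly why a process that had worked fine for a small team stopped scaling as the team grew.
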
Obviously, we couldn't keep going that way. The process just wasn't built for these new conditions. However, we had to modify it without compromising time-to-market; that is, while keeping 20-30 releases per day (and expecting that number to grow along with the team). We were growing rapidly: we conducted many experiments, promptly evaluated the results and conducted new experiments. We quickly tested product and business hypotheses, learned from them and made new hypotheses that we promptly tested again, and so on and so forth. Under no circumstances would we slow down. Moreover, we wanted to speed up and hire developers even faster. So, our actions aimed at business growth created the availability threat, but we had absolutely no intention of changing those actions.

3. Ok, the task is set, the process is clear. What's next?

Having the experience of working at the Mail.Ru email service and Mail.Ru Cloud, where availability had at some point been made the number one priority, where deployments took place once a week, where everything was covered by automated and unit tests and the code was reviewed at least once, and sometimes even three times, I faced a totally different situation.

You'd think everything was quite simple: we could replicate the Mail.Ru email/cloud process at Citymobil and thereby increase the service availability. However, as they say, the devil is in the details:

  1. deployments in Mail.Ru email/cloud are conducted once a week, not 30 times a day; at Citymobil we didn't want to sacrifice the number of releases;

  2. in Mail.Ru email/cloud the code is covered by auto/unit tests, and at Citymobil we had neither the time nor the resources for that; we hurled all our backend development effort into hypotheses and product improvement testing.

That said, we were short-handed in terms of backend developers, even though they were being hired promptly (a special thanks to Citymobil recruiters — the best recruiters in the world! I think there’s going to be a separate article about our recruitment process), so there was no way we could address testing and reviewing issues without slowing down.

4. When you don't know what to do, learn from mistakes

So, what is the magical thing we've done at Citymobil? We decided to learn from mistakes. The learn-from-mistakes method of service improvement is as old as time. If the system works well, that's good. If the system doesn't work well, that's also good, since we can learn from the mistakes. Easier said than done… Actually, it can be done easily, too. The key is to set a goal.

How did we learn? First, we started to religiously write down information about every single outage, big and small. To be honest, I really didn't feel like doing that at first, as I was hoping for a miracle and thought that the outages would just stop by themselves. Obviously, nothing stopped. The new reality mercilessly demanded changes.

We started logging all the outages in a Google Docs table. For every outage we recorded the following short information (a sample entry follows the list):

  • date, time, duration;

  • the root cause;

  • what was done to fix the problem;

  • business impact (number of lost trips, other outcomes);

  • takeaways.

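Here is what a single entry might look like (a made-up example for illustration, not an actual Citymobil record):

    Date, time, duration: 2019-03-05, 14:02, 7 minutes
    Root cause: debug output written straight into a database table overloaded the database
    Fix: rolled back the release
    Business impact: N lost trips (counted with the method from the first article)
    Takeaways: debug output must go to a local file, shipped to the database by a single-threaded cron script
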
For every big outage, we would create a separate big file with a detailed minute-by-minute description from the moment the outage began till the moment it ended: what we did, what decisions were made. This is usually called a post-mortem. We would add links to these post-mortems to the general table.

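A post-mortem might be structured like this (a hypothetical skeleton, not one of our actual documents):

    14:02 - release deployed to production
    14:03 - monitoring shows the API error rate spiking; the database is saturated
    14:05 - on-call engineer paged; rollback initiated
    14:09 - rollback complete; metrics back to normal
    Root cause: …
    Business impact: …
    Takeaways: …
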
There was one reason for creating such a file: to come up with conclusions aimed at decreasing the number of lost trips. It was very important to be very specific about what «the root cause» is and what «the takeaways» are. The meaning of these words seems clear; however, everyone can understand them differently.

5. Example of an outage we've learned from

The root cause is an issue that needs to be fixed in order to avoid such accidents in the future. The takeaways are the ways to eliminate the root cause or to reduce the likelihood of its resurgence.

The root cause is always deeper than it seems. The takeaways are always more complicated than they seem. You should never be satisfied with a supposedly found root cause, and never be satisfied with the alleged conclusions, so that you don't relax and stop at what merely seems right. This dissatisfaction creates a spark for further analysis.

Let me give you a real-world example: we deployed code, everything went down, we rolled it back, everything worked again. What's the root cause of the problem? You'd say: deployment. If we hadn't deployed the code, there wouldn't have been an accident. So, what's the takeaway: no more deployments? That's not a very good takeaway. So, most likely, that wasn't the root cause; we need to dig deeper. A deployment with a bug. Is that the root cause? Alright, how do we fix it? You'd say: by testing. What kind of testing? For instance, a full regression test of all functionality. That's a good takeaway; let's remember it. But we need to increase availability here and now, before the full regression test is implemented. We need to dig even deeper. A deployment with a bug caused by debug printing into a database table: we overloaded the database, and it went down under the load. That sounds better. Now it becomes clear that even a full regression test wouldn't save us from this issue, since the test database never sees a workload similar to the production one.

What's the root cause of this problem, if we dig even deeper? We had to talk to the engineers to find that out. It turned out that the engineer had gotten used to the database being able to handle any workload. However, due to the rapid growth of the workload, the database could no longer handle what it had handled the day before. Very few of us ever get a chance to work on projects with a 50% monthly growth rate. For me, for instance, this was the first project like that. Having plunged into a project like that, you begin to comprehend new realities. You never know this is out there until you come across it.

The engineer came up with the correct way to fix it: debug printing must go to a file, which is then written to the database by a cron script in a single thread. With this scheme, even if there's too much debug output, the database won't go down; the debug data will simply appear in it sooner or later. This engineer has obviously learned from his mistake and won't make it again. But the other engineers should also know about it. How? They need to be told. How do we make them listen? By telling them the whole story from beginning to end, by laying out the consequences and proposing the correct way of doing it, and also by listening to and answering their questions.

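Here is a minimal sketch of that pattern, written in Python purely for illustration (hypothetical names and paths; the article doesn't show Citymobil's actual code, and sqlite3 merely stands in for the real database driver). The request path only appends to a local file; a cron job ships the accumulated lines to the database in one thread:

    # A sketch of "write debug output to a file, ship it to the DB by cron".
    # Hypothetical names and paths; sqlite3 stands in for the real DB driver.
    import fcntl
    import os
    import sqlite3

    DEBUG_LOG = "/var/log/app/debug.log"

    def debug_print(line: str) -> None:
        # Request path: append to a local file, never touch the database.
        with open(DEBUG_LOG, "a") as f:
            fcntl.flock(f, fcntl.LOCK_EX)  # serialize concurrent writers
            f.write(line.rstrip("\n") + "\n")
            fcntl.flock(f, fcntl.LOCK_UN)

    def ship() -> None:
        # Cron entry point: single process, single thread, so the database
        # sees a bounded, sequential write load.
        if not os.path.exists(DEBUG_LOG):
            return
        batch = DEBUG_LOG + ".shipping"
        os.rename(DEBUG_LOG, batch)  # atomically grab the accumulated lines
        db = sqlite3.connect("/var/lib/app/debug.db")
        with db:  # commits on success, rolls back on error
            db.execute("CREATE TABLE IF NOT EXISTS debug_log (line TEXT)")
            with open(batch) as f:
                db.executemany(
                    "INSERT INTO debug_log (line) VALUES (?)",
                    ((line.rstrip("\n"),) for line in f),
                )
        db.close()
        os.remove(batch)

    if __name__ == "__main__":
        ship()

If shipping falls behind, the worst case is that the debug data shows up in the database later; the production database never goes down because of debug output.
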
6. What else can we learn from this mistake, or «do's & don'ts»

Ok, let's keep analyzing this outage. The company is growing rapidly, new engineers keep coming in. How are they going to learn from this mistake? Should we personally tell every new engineer about it? Obviously, there'll be more and more mistakes — how do we make everyone learn from them? The answer is almost obvious: create a do's and don'ts file. We write all the takeaways into this file. We show this file to all our new engineers, and every time the do's & don'ts is updated, we post it to the team work chat for all our current engineers, strongly urging everyone to read it again (to brush up on the old information and see the new).

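An entry in that file might look something like this (a hypothetical example, phrased in the spirit of the outage described above):

    DON'T: write debug output straight into the production database from the request path.
    DO: write debug output to a local file and ship it to the database with a single-threaded cron script.
    WHY: a debug print once overloaded the production database and took the whole service down.
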
You might say that not everyone will read it carefully. You might say that the majority will forget it right after reading. And you'd be right on both counts. However, you can't deny that something will stick in someone's head. And that's good enough. In Citymobil's experience, the engineers take this file very seriously, and situations where some lesson was forgotten occurred very rarely. The very fact that a lesson was forgotten can itself be seen as a problem: we should draw a conclusion and analyze the details to figure out how to change things in the future. This kind of digging leads to more precise and accurate wording in the do's and don'ts.

The takeaway from the above-described outage: create a do's and don'ts file, write everything we've learned into it, show the file to the whole team, request that every newcomer study it, and encourage people to ask questions.

A general piece of advice that we derived from the outage reviews: we should never use the phrase «shit happens». As soon as you say it out loud, everyone decides that nothing needs to be done and no conclusions are necessary, since humans have always made mistakes, are making them now and will keep making them in the future. Therefore, instead of saying that phrase, you should make a specific conclusion. A conclusion is maybe a small step, but still a step towards improving the development process, the monitoring systems and the automated tools. Such small steps result in a more stable service!

7. In lieu of an epilogue

In the next parts, I'm going to talk about the types of outages in Citymobil's experience, go into detail about every outage type, and tell you about the conclusions we drew from the outages, how we modified the development process and what automation we introduced. Stay tuned!

Source: https://habr.com/en/company/mailru/blog/449310/
