Time Series Database – InfluxDB
Time series database (TSDB)
- A time series database is a category of database that emerged as data volumes grew dramatically over time. A TSDB behaves like a blend of the relational and non-relational worlds: it does not require a rigid schema; one representative TSDB, InfluxDB, is driven by a SQL-like query language (a short InfluxQL sample appears after the comparison table below); and, like NoSQL databases, TSDBs handle very high write concurrency.
- A time series database stores data organized around time: every record carries a timestamp, which makes this kind of database especially well suited to data that changes over time. With appropriate tooling, you can analyze how the data trends as time passes.
- Compared with relational databases, the underlying storage structure is different: relational databases use B+ tree storage, while time series databases use LSM-tree storage. One major benefit is support for far higher write concurrency than relational databases, or even than NoSQL stores.
- So where, after all this, are time series databases actually useful?
- Industrial manufacturing was the first application area: storing measurements streamed from sensors efficiently in order to monitor environmental variables.
- IoT: every device carries a sensor, and paired with IPv6, each device can push its data into the database.
- Autonomous driving: combined with 5G, the massive volumes of data from self-driving vehicles are stored in a TSDB.
- Monitoring: operations generates a great deal of monitoring data about compute infrastructure, covering networks, data centers, clusters, hosts, virtual machines, storage, and more.
- Finance: dedicated time series databases (DolphinDB, for example) have even been built specifically to store financial data in support of quantitative analysis.
- References
- Time series database rankings
- DB-Engines time series DBMS ranking
- Ranking as of July 2019
- Strengths and weaknesses of time series databases
- Data from 时序数据库特点与对比 (a Chinese-language feature comparison of time series databases)
| Time series database | Strengths | Weaknesses |
|---|---|---|
| OpenTSDB | Metric + tags model; mature clustering story (HBase); efficient writes (LSM-trees) | Limited query functions; depends on HBase; complex to operate; weak aggregation and analysis |
| Graphite | Rich built-in functions; automatic downsampling; best-in-class Grafana support; simple to maintain | Whisper storage engine is IOPS-heavy; the Carbon component has high CPU usage; weak aggregation and analysis |
| InfluxDB | Metrics + tags model; simple, dependency-free deployment; real-time downsampling; efficient storage | Open-source version has no clustering; compatibility issues between versions; storage engine still in flux |
| Prometheus | Metric + tags model; well suited to container monitoring; rich query language; simple to maintain; integrated monitoring and alerting | No clustering solution; weak aggregation and analysis |
| Druid | Columnar storage with nested-data support; powerful multi-dimensional aggregation; real-time high-performance ingestion; distributed, fault-tolerant architecture; SQL-like queries | Raw data generally not queryable; poor fit for very high-cardinality dimensions; time windows limit data completeness; relatively complex to operate |
| Elasticsearch | Columnar storage with nested-data support; full-text search; raw data queryable; very flexible; active community; rich ecosystem | No columnar storage for analyzed fields; demanding on hardware; cluster maintenance is relatively complex |
| ClickHouse | Powerful multi-dimensional aggregation; real-time high-performance reads and writes; SQL-like queries; rich built-in functions; distributed, fault-tolerant architecture; raw data queryable; suited to high-cardinality dimension analysis | Young project with a smaller ecosystem and less active community; no support for updating or deleting data; weak clustering |
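To make the SQL-like query model concrete, here is a minimal InfluxQL sketch (InfluxDB 1.x; the measurement, tag, and field names are illustrative assumptions):

```sql
-- Average idle CPU per host over the last hour, in 10-second buckets.
-- "cpu" is the measurement; "host" and "region" are tags; "usage_idle" is a field.
SELECT MEAN("usage_idle")
FROM "cpu"
WHERE "region" = 'us-west' AND time > now() - 1h
GROUP BY time(10s), "host"
```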
- "Why Build a Time Series Data Platform?": an article by InfluxData founder Paul Dix, reproduced below.
Why Build a Time Series Data Platform?
In the middle of the big data hype cycle, before IoT was on the top of everyone’s buzz list, before Cloud Native was common lingo and before large enterprises were starting the work to de-silo their infrastructure monitoring and metrics data, Paul Dix, Founder of InfluxData, began building a purpose-built Time Series Platform. Flash forward to today, time series is now the fastest growing database segment, and the market is clearly moving beyond the re-purposed Cassandra and HBase implementations that defined the segment at that time. The following post is a firsthand account from Paul Dix outlining the problems he witnessed and why he built a modern Time Series Platform.
I am frequently asked: “Why build a database specifically for time series?” The implication was that a general SQL database can act as a TSDB by ordering on some time column. Or you can build on top of a distributed database like Cassandra. While it’s possible to use these solutions for solving time series problems, they’re incredibly time consuming and require significant development effort… I talked to other engineers to see what they had done and found that there was a common set of tasks that led to the need for a common Time Series Platform. Everyone seemed to be reinventing the wheel, so it looked like there was a gap in the market for something built specifically for time series.
In this post, I’ll define the time series problem, lay out what differentiates time series from other use cases and database workloads and look at other approaches I’ve seen taken to handle the unique requirements of time series data. Finally, I’ll look at the advantages of building specifically for time series.
Defining The Time Series Problem
Let’s first define time series data and then look at how others have tried to solve for this before moving to a Time Series Platform.
When I refer to time series data, I think of two different types of time series: regular and irregular.
- Regular time series are familiar to developers in the DevOps or metrics space. These are simply measurements that are taken at fixed intervals of time, like every 10 seconds. This is also seen quite frequently in sensor data use cases, like taking a reading at regular intervals from a sensor. The important thing about regular time series is that they represent a summarization of some underlying raw event stream or distribution. Summarization is very useful when looking for patterns or visualizing data sets that have more events than you have pixels to draw.
- The second type of time series, irregular, corresponds to discrete events. These could be requests to an API, trades in a stock market, or really any kind of event that you’d want to track in time. It’s possible to induce a regular time series from an irregular one. For example, if you want to calculate the average response time from an API in 1 minute intervals, you can aggregate the individual requests to produce the regular time series.
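A minimal InfluxQL sketch of that exact aggregation, assuming a hypothetical `api_requests` measurement with a `response_time` field:

```sql
-- Collapse irregular request events into a regular 1-minute series.
SELECT MEAN("response_time") AS "avg_response_time"
FROM "api_requests"
WHERE time > now() - 1d
GROUP BY time(1m)
```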
My belief is a modern TSDB needs to be able to handle both regular and irregular events and metrics.
The other part of time series is that it’s common to have metadata that describe the series that users may want to query on later. This could be a hostname, application, region, sensor ID, building, stock name, portfolio name or really any dimension on which a time series might want to be queried on. Adding metadata to time series allows you to slice and dice them and create summaries across different dimensions. This means that a series is the metadata that describes it and ordered time, value pair tuples. Metadata is represented as a measurement name, tag key/value pairs, and a field name.
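For reference, InfluxDB's line protocol encodes exactly this model: measurement name, tag key/value pairs, field(s), and a timestamp (the names below are illustrative):

```
cpu,host=server01,region=us-west usage_idle=87.5 1465839830100400200
```

Here `cpu` is the measurement, `host` and `region` are tags, `usage_idle` is the field, and the trailing integer is a nanosecond-precision timestamp.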
Time Series Applications & Scale
Now that we’ve defined what time series are, let’s dig into what makes them different from other database use cases and workloads.
- Time series data needs to focus on fast ingestion. That is, you’re always inserting new data. Most often, these are append operations where you’re adding only recent time series data—although users do sometimes need historical backfill, and with sensor data use cases, we frequently see lagged data collection. Even with the latter, you’re usually appending recent data to each individual series.
- High-precision data is kept for some short period of time with longer retention periods for summary data at medium or lower precision. One way to think about this is the raw high-precision samples and summaries for 5 minute and 1 hour intervals. Operationally this means that you must be constantly deleting data from the database. The high-precision data is resident for a short window and then should be evicted. This is a very different workload than what a normal database is designed to handle.
- An agent or the database itself must continuously compute summaries from the high-precision data for longer term storage. These could be simple aggregates like first, last, min, max, sum, count or could include more complex computations like percentiles or histograms. (A short InfluxQL sketch of this retention-and-summarization pattern follows this list.)
- The query pattern of time series can be quite different from other database workloads. In most cases, a query will pull a range of data back for a requested time range. For databases that can compute aggregates and downsamples on the fly, they will frequently churn through many records to pull back the result set for a query. Quickly iterating through many records to compute an aggregate is critical for the time series use case.
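Here is a minimal InfluxQL sketch of the retention-plus-downsampling pattern described above, assuming an InfluxDB 1.x database named `telegraf` (all names are illustrative):

```sql
-- Keep raw high-precision data for two days, then evict it automatically.
CREATE RETENTION POLICY "raw_2d" ON "telegraf" DURATION 2d REPLICATION 1 DEFAULT

-- Keep downsampled summaries for a year.
CREATE RETENTION POLICY "long_term" ON "telegraf" DURATION 52w REPLICATION 1

-- Continuously roll raw data up into 5-minute means in long-term storage.
CREATE CONTINUOUS QUERY "cq_cpu_5m" ON "telegraf"
BEGIN
  SELECT MEAN("usage_idle") AS "usage_idle"
  INTO "telegraf"."long_term"."cpu_5m"
  FROM "cpu"
  GROUP BY time(5m), *
END
```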
The three primary applications where we see users solving these problems are:
- Server and application monitoring
- Real-time analytics
- IoT sensor data monitoring and control
The data for each of these is different, but they frequently take the same general shape. In the server monitoring case we’re taking regular measurements for tracking things like CPU, hard disk, network, and memory utilization. It’s also common to take measurements to instrument third-party services like Apache, Redis, NGINX, MySQL, and many others. Series usually have metadata information like the server name, the region, the service name, and the metric being measured. It’s not uncommon to have 200 or more measurements (unique series) per server. Let’s get a rough idea of a DevOps data set for a day. Say we have 100 servers and each has 200 unique measurements to collect. That means we have 20,000 unique series. Further, let’s say that we’re collecting this data every 10 seconds. That means in the course of a day we’re collecting 86,400 / 10 = 8,640 values per series for a total of 20,000 * 8,640 = 172,800,000 values for each day.
The Problem of Using a SQL Database for Time Series
Many of our users started off working with time series by storing their data in common SQL RDBMSes like PostgreSQL or MySQL. Generally they find this works for a time, but things start to fall apart as the scale of the data increases. If we take our server monitoring example from before, there are a few ways to structure things, but there are some challenges.
| Structure options | Challenges |
|---|---|
| Create a single table to store everything with the series name, the value, and a time. | A separate lookup index is needed to search on anything other than the series name (like server, metric, service, etc.). This naive implementation would have a table that gets 172M new records per day, which quickly causes problems because of the sheer size of the table. With time series, it's common to keep high-precision data around for only a short period, so soon you'll be doing just as many deletes as inserts, which isn't something a traditional DB is designed to handle well. |
| Create a separate table per day or some other period of time. | Requires the developer to write application code to tie the data from the different tables together. More code must be written to compute summary statistics for lower-precision data and to periodically drop old tables. Then there's the issue of scaling past what a single SQL server can handle; sharding segments of the time series to different servers is a common technique but requires yet more application-level code. |
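For illustration, the naive single-table option might look like this in PostgreSQL (a sketch only; the table and column names are assumptions, and this is precisely the design that breaks down):

```sql
-- One row per observation; every series shares this one table.
CREATE TABLE metrics (
    series_name text             NOT NULL,  -- e.g. 'server42.cpu.usage_idle'
    value       double precision NOT NULL,
    time        timestamptz      NOT NULL
);

-- Required for per-series range scans; searching on anything else
-- (server, metric, service, ...) needs further indexes.
CREATE INDEX metrics_series_time_idx ON metrics (series_name, time);
```

At roughly 172M inserts per day, the table and its indexes balloon, and the matching bulk deletes of expired high-precision data are exactly the workload a general-purpose RDBMS is not tuned for.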
Conclusion: Relational technologies were not designed to solve the specific time series issues, and trying to get them to solve them is impractical.
Building on Distributed Databases
After initially working with a more standard relational database, many will look at distributed databases like Cassandra or HBase. As with the SQL variant, building a time series solution on top of Cassandra requires quite a bit of application-level code.
First, you need to decide how to structure the data. Rows in Cassandra get stored to one replication group, which means that you need to think about how to structure your row keys to ensure that the cluster is properly utilized without creating hot spots for writes and reads. Then, once you’ve decided how to arrange the data, you need to write application logic to do additional query processing for the time series use case. You’ll also end up writing downsampling logic to handle creating lower-precision samples that can be used for longer-term visualizations. Finally, once you have the basics wired up, it will be a continual chore to ensure that you get the query performance you need when querying many time series and computing aggregates across different dimensions.
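A hedged CQL sketch of that row-key decision (the day-bucketing scheme and all names are assumptions; real designs vary widely):

```sql
-- Partition by (series, day) so one series doesn't grow a single
-- unbounded partition and become a hot spot; cluster by time for range scans.
CREATE TABLE metrics_by_day (
    series_id text,
    day       date,
    ts        timestamp,
    value     double,
    PRIMARY KEY ((series_id, day), ts)
) WITH CLUSTERING ORDER BY (ts DESC);
```

Even with a sound schema, the query routing, downsampling, and cross-dimension aggregation described above all remain application code you have to write and maintain.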
Conclusion: Writing all of this application code is frequently a multi-month project requiring competent backend engineers.
Advantages of Building Specifically for Time Series
So this brings us back around to the point of this post: Why build a Time Series Data Platform?
Developer Happiness
One of our goals we envisioned when making a Time Series Platform was optimizing for a user’s or developer’s time to value. That is, the faster they get their problem solved and are up and running, the better the experience will be. That means that if we see users frequently writing code or creating projects to solve the same problems, we’ll try to pull that into our platform or database. The less code a developer has to write to solve their problem, the faster they’ll be done.
Time is Peculiar
Other than the obvious usability goals, we also saw that we could optimize the database around some of the peculiarities of time series. It’s insert only, we need to aggregate and downsample, we need to automatically evict high-precision data in the cases where users want to free up space. We could also build compression that was optimized for time series data. We also organized the data in a way that would index tag data for efficient queries. At the database level, there were many optimizations we could get.
Going Beyond a Database to Make Development Easier
The other advantage in building specifically for time series is that we could go beyond the database. We’ve found that most users run into a common set of problems they need to solve—how to collect the data, how to store it, how to process and monitor it, and how to visualize it.
We’ve also found that having a common API makes it easier for the community to build solutions around our stack. We have the line protocol to represent time series data, our HTTP API for writing and querying, and Kapacitor for processing. This means that over time, we can have pre-built components for the most common use cases.
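As a sketch of what that write path looks like against a local InfluxDB 1.x instance (the database name `mydb` is an assumption), a single HTTP request carries line-protocol points:

```
POST /write?db=mydb HTTP/1.1
Host: localhost:8086

cpu,host=server01 usage_idle=92.6
```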
We find that we can get better performance than more generalized databases while also reducing the developer effort to get a solution up by at least an order of magnitude. Doing something that might have taken months to get running on Cassandra or MySQL could take as little as an afternoon using our stack. And that’s exactly what we’re trying to achieve.
By focusing on time series, we can solve problems for application developers so that they can focus on the code that creates unique value inside their app.