OpenTSDB will automatically aggregate all of the time series for the metric in a query if no tags are given. If one or more tags are defined, the aggregate will include all time series that match on that tag, regardless of other tags. With the query sum:sys.cpu.user{host=webserver01}, we would include sys.cpu.user host=webserver01,cpu=0 as well as sys.cpu.user host=webserver01,cpu=0,manufacturer=Intel, sys.cpu.user host=webserver01,foo=bar and sys.cpu.user host=webserver01,cpu=0,datacenter=lax,department=ops. The moral of this example is: be careful with your naming schema.
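As a concrete sketch of such a query over the HTTP API (hostname and port are placeholders), all four series above would be folded into one aggregated result:

    http://tsd.example.com:4242/api/query?start=1h-ago&m=sum:sys.cpu.user{host=webserver01}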
Time Series Cardinality
A critical aspect of any naming schema is to consider the cardinality of your time series. Cardinality is defined as the number of unique items in a set. In OpenTSDB's case, this means the number of items associated with a metric, i.e. all of the possible tag name and value combinations, as well as the number of unique metric names, tag names and tag values. Cardinality is important for two reasons outlined below.
Limited Unique IDs (UIDs)
There is a limited number of unique IDs to assign for each metric, tag name and tag value. By default there are just over 16 million possible IDs per type. If, for example, you ran a very popular web service and tried to track the IP address of clients as a tag, e.g. web.app.hits clientip=38.26.34.10, you may quickly run into the UID assignment limit as there are over 4 billion possible IP version 4 addresses. Additionally, this approach would lead to creating a very sparse time series as the user at address 38.26.34.10 may only use your app sporadically, or perhaps never again from that specific address.
The UID limit is usually not an issue, however. A tag value is assigned a UID that is completely disassociated from its tag name. If you use numeric identifiers for tag values, the number is assigned a UID once and can be used with many tag names. For example, if we assign a UID to the number 2, we could store time series with the tag pairs cpu=2, interface=2, hdd=2 and fan=2 while consuming only 1 tag value UID (2) and 4 tag name UIDs (cpu, interface, hdd and fan).
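To make the UID accounting concrete, here is a sketch with telnet-style put commands (the metric names are made up for illustration); together these consume a single tag value UID for the string "2" and four tag name UIDs:

    put sys.cpu.user 1356998400 42 cpu=2
    put net.iface.bytes 1356998400 1024 interface=2
    put sys.disk.temp 1356998400 38 hdd=2
    put sys.fan.rpm 1356998400 3000 fan=2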
If you think that the UID limit may impact you, first think about the queries that you want to execute. If we look at the web.app.hits example above, you probably only care about the total number of hits to your service and rarely need to drill down to a specific IP address. In that case, you may want to store the IP address as an annotation. That way you could still benefit from low cardinality but if you need to, you could search the results for that particular IP using external scripts. (Note: Support for annotation queries is expected in a future version of OpenTSDB.)
If you desperately need more than 16 million values, you can increase the number of bytes that OpenTSDB uses to encode UIDs from 3 bytes up to a maximum of 8 bytes. This change would require modifying the value in source code, recompiling, deploying your customized code to all TSDs which will access this data, and maintaining this customization across all future patches and releases.
Warning
It is possible that your situation requires this value to be increased. If you choose to modify this value, you must start with fresh data and a new UID table. Any data written with a TSD expecting 3-byte UID encoding will be incompatible with this change, so ensure that all of your TSDs are running the same modified code and that any data you have stored in OpenTSDB prior to making this change has been exported to a location where it can be manipulated by external tools. See the TSDB.java file for the values to change.
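For reference, in the 2.x source the UID widths are plain constants near the top of TSDB.java; they look roughly like the sketch below, though you should verify the exact field names against your release before patching:

    // net.opentsdb.core.TSDB (sketch of the 3-byte defaults)
    public static final short METRICS_WIDTH = 3;    // bytes used to encode a metric UID
    public static final short TAG_NAME_WIDTH = 3;   // bytes used to encode a tag name UID
    public static final short TAG_VALUE_WIDTH = 3;  // bytes used to encode a tag value UID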
Query Speed
Cardinality also affects query speed a great deal, so consider the queries you will be performing frequently and optimize your naming schema for those. OpenTSDB creates a new row per time series per hour. If we have the time series sys.cpu.user host=webserver01,cpu=0 with data written every second for 1 day, that would result in 24 rows of data. However if we have 8 possible CPU cores for that host, now we have 192 rows of data. This looks good because we can easily get a sum or average of CPU usage across all cores by issuing a query like start=1d-ago&m=avg:sys.cpu.user{host=webserver01}.
However what if we have 20,000 hosts, each with 8 cores? Now we will have 3.8 million rows per day due to the high cardinality of host values. Queries for the average core usage on host webserver01 will be slower, as they must pick out 192 rows out of 3.8 million.
The benefits of this schema are that you have very deep granularity in your data, e.g., storing usage metrics on a per-core basis. You can also easily craft a query to get the average usage across all cores and all hosts: start=1d-ago&m=avg:sys.cpu.user. However queries against that particular metric will take longer as there are more rows to sift through.
Here are some common means of dealing with cardinality:
Pre-Aggregate - In the example above with sys.cpu.user, you generally care about the average usage on the host, not the usage per core. While the data collector may send a separate value per core with the tagging schema above, the collector could also send one extra data point such as sys.cpu.user.avg host=webserver01. Now you have a completely separate time series that would only have 24 rows per day, and with 20K hosts, only 480K rows to sift through. Queries will be much more responsive for the per-host average and you still have per-core data to drill down to separately.
Shift to Metric - What if you really only care about the metrics for a particular host and don't need to aggregate across hosts? In that case you can shift the hostname to the metric. Our previous example becomes sys.cpu.user.websvr01 cpu=0. Queries against this schema are very fast as there would only be 192 rows per day for the metric. However to aggregate across hosts you would have to execute multiple queries and aggregate outside of OpenTSDB. (Future work will include this capability).
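As a sketch of both strategies with telnet-style put commands (the values and the rolled-up metric name are illustrative): the first three lines follow the pre-aggregation approach, keeping per-core points while also emitting a host-level average, and the last line shows the shift-to-metric form:

    put sys.cpu.user 1356998400 80 host=webserver01 cpu=0
    put sys.cpu.user 1356998400 60 host=webserver01 cpu=1
    put sys.cpu.user.avg 1356998400 70 host=webserver01
    put sys.cpu.user.websvr01 1356998400 80 cpu=0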
Naming Conclusion
When you design your naming schema, keep these suggestions in mind:
- Be consistent with your naming to reduce duplication. Always use the same case for metrics, tag names and values.
- Use the same number and type of tags for each metric. E.g. don't store my.metric host=foo and my.metric datacenter=lga.
- Think about the most common queries you'll be executing and optimize your schema for those queries
- Think about how you may want to drill down when querying
- Don't use too many tags, keep it to a fairly small number, usually up to 4 or 5 tags (By default, OpenTSDB supports a maximum of 8 tags).
Data Specification
Every time series data point requires the following data:
- metric - A generic name for the time series such as sys.cpu.user, stock.quote or env.probe.temp.
- timestamp - A Unix/POSIX Epoch timestamp in seconds or milliseconds defined as the number of seconds that have elapsed since January 1st, 1970 at 00:00:00 UTC time. Only positive timestamps are supported at this time.
- value - A numeric value to store at the given timestamp for the time series. This may be an integer or a floating point value.
- tag(s) - A key/value pair consisting of a tagk (the key) and a tagv (the value). Each data point must have at least one tag.
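Assembled into the JSON body accepted by the HTTP /api/put endpoint, one data point carrying all four required fields might look like this (the host and value are illustrative):

    {
        "metric": "sys.cpu.user",
        "timestamp": 1356998400,
        "value": 42.5,
        "tags": {
            "host": "webserver01",
            "cpu": "0"
        }
    }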
Timestamps
Data can be written to OpenTSDB with second or millisecond resolution. Timestamps must be integers and be no longer than 13 digits (see the first note below). Millisecond timestamps must be of the format 1364410924250, where the final three digits represent the milliseconds. Applications that generate timestamps with more than 13 digits (i.e., greater than millisecond resolution) must round them to a maximum of 13 digits before submitting, or an error will be generated.
Timestamps with second resolution are stored on 2 bytes while timestamps with millisecond resolution are stored on 4. Thus if you do not need millisecond resolution, or all of your data points fall on 1 second boundaries, we recommend that you submit 10-digit timestamps with second resolution so that you can save on storage space. It's also a good idea to avoid mixing second and millisecond timestamps for a given time series: doing so will slow down queries, as iterating across mixed timestamps takes longer than if you only record one type or the other. OpenTSDB will store whatever you give it.
Note
When writing to the telnet interface, timestamps may optionally be written in the form 1364410924.250, where three digits representing the milliseconds are placed after a period. Timestamps sent to the /api/put endpoint over HTTP must be integers and may not have periods. Data with millisecond resolution can only be extracted via the /api/query endpoint or CLI command at this time. See query/index for details.
Note
Providing millisecond resolution does not necessarily mean that OpenTSDB supports write speeds of 1 data point per millisecond over many time series. While a single TSD may be able to handle a few thousand writes per second, that would only cover a few time series if you're trying to store a point every millisecond. Instead OpenTSDB aims to provide greater measurement accuracy and you should generally avoid recording data at such a speed, particularly for long running time series.
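To make the accepted formats concrete, the three telnet-style writes below use, in order: a 10-digit second-resolution timestamp, a 13-digit millisecond timestamp, and the dotted form that only the telnet interface accepts:

    put sys.cpu.user 1364410924 42 host=webserver01
    put sys.cpu.user 1364410924250 42 host=webserver01
    put sys.cpu.user 1364410924.250 42 host=webserver01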
Metrics and Tags
The following rules apply to metric and tag values:
- Strings are case sensitive, i.e. "Sys.Cpu.User" will be stored separately from "sys.cpu.user"
- Spaces are not allowed
- Only the following characters are allowed: a to z, A to Z, 0 to 9, -, _, ., / or Unicode letters (as per the specification)
Metric and tags are not limited in length, though you should try to keep the values fairly short.
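A minimal Java sketch of a client-side sanity check for the ASCII portion of this rule (it intentionally omits the Unicode-letter allowance, so it is stricter than OpenTSDB itself):

    import java.util.regex.Pattern;

    public class NameCheck {
        // Allowed ASCII characters for metrics, tag names and tag values.
        private static final Pattern ALLOWED = Pattern.compile("[a-zA-Z0-9\\-_./]+");

        public static boolean isValidAscii(String name) {
            return ALLOWED.matcher(name).matches();
        }

        public static void main(String[] args) {
            System.out.println(isValidAscii("sys.cpu.user")); // true
            System.out.println(isValidAscii("sys cpu user")); // false: spaces are not allowed
        }
    }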
Integer Values
If the value from a put command is parsed without a decimal point (.), it will be treated as a signed integer. Integers are stored, unsigned, with variable length encoding so that a data point may take as little as 1 byte of space or up to 8 bytes. This means a data point can have a minimum value of -9,223,372,036,854,775,808 and a maximum value of 9,223,372,036,854,775,807 (inclusive). Integers cannot have commas or any character other than digits and the dash (for negative values). For example, in order to store the maximum value, it must be provided in the form 9223372036854775807.
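For example, both of the following telnet-style writes (the metric name is hypothetical) parse as signed integers because they contain only digits and an optional leading dash; the second stores the maximum allowed value:

    put my.metric 1356998400 -42 host=web01
    put my.metric 1356998400 9223372036854775807 host=web01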
Floating Point Values
If the value from a put command is parsed with a decimal point (.) it will be treated as a floating point value. Currently all floating point values are stored on 4 bytes, single-precision, with support for 8 byte double-precision in 2.4 and later. Floats are stored in IEEE 754 floating-point "single format" with positive and negative value support. Infinity and Not-a-Number values are not supported and will throw an error if supplied to a TSD. See Wikipedia and the Java Documentation for details.
Note
Because OpenTSDB only supports floating point values, it is not suitable for storing measurements that require exact values like currency. This is why, when storing a value like 15.2 the database may return 15.199999809265137.
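A small Java sketch of why this happens: 15.2 has no exact representation in IEEE 754 single precision, and widening the stored float back to a double exposes the rounding error:

    public class FloatPrecision {
        public static void main(String[] args) {
            float stored = 15.2f;                 // single precision, as OpenTSDB stores it
            System.out.println(stored);           // 15.2 (float printing hides the error)
            System.out.println((double) stored);  // 15.199999809265137
        }
    }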
Ordering
Unlike other solutions, OpenTSDB allows for writing data for a given time series in any order you want. This enables significant flexibility in writing data to a TSD, allowing for populating current data from your systems, then importing historical data at a later time.
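As a quick illustration with telnet-style commands (timestamps and values are arbitrary), the second write lands an hour before the first and is still accepted:

    put sys.cpu.user 1356998400 42 host=webserver01
    put sys.cpu.user 1356994800 40 host=webserver01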
(To be continued...)