A Commute in Data: The comma2k19 Dataset
https://arxiv.org/pdf/1812.05752v1
Abstract: comma.ai presents comma2k19, a dataset of over 33 hours of commuting on California's Highway 280. It consists of 2019 segments, each 1 minute long, recorded on a 20 km section of highway between San Jose and San Francisco. The dataset was collected using comma EONs, which have sensors similar to those of any modern smartphone, including a road-facing camera, phone GPS, thermometers and a 9-axis IMU. Additionally, using a comma grey panda, the EON captures raw GNSS measurements and all CAN data sent by the car. Laika, an open-source GNSS processing library, is also introduced here. Laika produces 40% more accurate positions than the GNSS module used to collect the raw data. The dataset includes pose (position + orientation) estimates of the recording camera in a global reference frame. These poses were computed with a tightly coupled INS/GNSS/Vision optimizer that relies on data processed by Laika. comma2k19 is ideal for the development and validation of tightly coupled GNSS algorithms and mapping algorithms that work with commodity sensors.
I. INTRODUCTION
“Quality over quantity”, or that’s what they say anyway, but is this true in the world of data? The reality is that collecting data with high-end sensors is expensive because dedicated hardware is needed, and this quickly becomes infeasible for larger datasets. Affordable sensors, on the other hand, are ubiquitous and already continuously logging data on billions of devices around the world. The world is a noisy place, and some trends require big data to become obvious. To find such trends, algorithms need to be developed that can deal with huge amounts of less-than-perfect data. It is this core idea that motivates comma.ai’s strategy of collecting data with scalability as a priority.
The dataset released here, comma2k19, contains data collected by an EON and a grey panda during 2019 minutes of driving sampled from a Californian commute (Figure 1). There are logs of a road-facing camera, a 9-axis IMU, the vehicle’s transmitted CAN messages and raw GNSS measurements. This makes the dataset uniquely valuable for the development of mapping algorithms that require dense data and can use raw GNSS data.
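As a quick sanity check, the "over 33 hours" quoted in the abstract follows directly from the 2019 one-minute segments. The short Python sketch below only reproduces that arithmetic; the constant and function names are illustrative and are not part of the dataset or of any comma.ai tooling.

```python
# Illustrative arithmetic only: names here are hypothetical, not a dataset API.
SEGMENT_COUNT = 2019      # number of segments, as stated in the abstract
SEGMENT_MINUTES = 1.0     # each segment is one minute long


def total_hours(segment_count: int, segment_minutes: float) -> float:
    """Convert a number of fixed-length segments into total hours."""
    return segment_count * segment_minutes / 60.0


hours = total_hours(SEGMENT_COUNT, SEGMENT_MINUTES)
print(f"{hours:.2f} hours")  # prints "33.65 hours"
```

The result, 33.65 hours, is consistent with the abstract's "over 33 hours".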