On-Device Real-Time Hand Tracking
Paper: https://arxiv.org/abs/2006.10214v1
Demo: https://hand.mediapipe.dev/
Institution: Google Research
Conference: CVPR 2020
Before diving in, here is the model's pipeline flow diagram, to give an overall picture of the system architecture:

[Figure: overall pipeline flow diagram]
0. Abstract
We present a real-time on-device hand tracking solution that predicts a hand skeleton of a human from a single RGB camera for AR/VR applications. Our pipeline consists of two models: 1) a palm detector, that is providing a bounding box of a hand to, 2) a hand landmark model, that is predicting the hand skeleton. It is implemented via MediaPipe[12], a framework for building cross-platform ML solutions. The proposed model and pipeline architecture demonstrate real-time inference speed on mobile GPUs with high prediction quality. MediaPipe Hands is open sourced at https://mediapipe.dev
We present a real-time on-device hand tracking solution that predicts a person's hand skeleton from a single RGB image and can be used for AR/VR applications. The pipeline consists of two models:
- (1) A palm detector, which provides a bounding box of the hand.
- (2) A hand landmark model, which predicts the hand skeleton.
The solution is implemented on top of MediaPipe, a framework for building cross-platform machine-learning solutions.
It achieves real-time inference speed and high prediction quality on mobile GPUs; the open-source code is available as MediaPipe Hands.
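To make the two-model pipeline concrete, here is a minimal sketch of calling the open-source MediaPipe Hands solution from Python (the palm detector and hand landmark model run internally as the two stages described above). The file name hand.jpg is a hypothetical placeholder, and the parameter values are illustrative, not prescribed by the paper:

```python
# Minimal single-image sketch using the MediaPipe Hands Python API.
# Assumes `mediapipe` and `opencv-python` are installed; "hand.jpg" is a
# hypothetical input image.
import cv2
import mediapipe as mp

mp_hands = mp.solutions.hands

# static_image_mode=True runs the palm detector on every image instead of
# relying on landmark tracking between video frames.
with mp_hands.Hands(static_image_mode=True,
                    max_num_hands=2,
                    min_detection_confidence=0.5) as hands:
    image = cv2.imread("hand.jpg")
    # MediaPipe expects RGB input; OpenCV loads images as BGR.
    results = hands.process(cv2.cvtColor(image, cv2.COLOR_BGR2RGB))

    if results.multi_hand_landmarks:
        for hand_landmarks in results.multi_hand_landmarks:
            # 21 landmarks per hand: x/y normalized to [0, 1], z is a
            # relative depth with the wrist near the origin ("2.5D").
            wrist = hand_landmarks.landmark[0]
            print(f"wrist: x={wrist.x:.3f} y={wrist.y:.3f} z={wrist.z:.3f}")
```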
1. Introduction
Hand tracking is a vital component to provide a natural way for interaction and communication in AR/VR, and has been an active research topic in the industry. Vision-based hand pose estimation has been studied for many years. A large portion of previous work requires specialized hardware, e.g. depth sensors. Other solutions are not lightweight enough to run real-time on commodity mobile devices and thus are limited to platforms equipped with powerful processors. In this paper, we propose a novel solution that does not require any additional hardware and performs in real-time on mobile devices. Our main contributions are:
• An efficient two-stage hand tracking pipeline that can track multiple hands in real-time on mobile devices.
• A hand pose estimation model that is capable of predicting 2.5D hand pose with only RGB input.
• An open source hand tracking pipeline as a ready-to-go solution on a variety of platforms, including Android, iOS, Web (Tensorflow.js) and desktop PCs.
Hand tracking is a vital component of AR/VR: it provides the most natural way to interact and communicate, and it has long been an active research topic in industry.
Vision-based hand pose estimation has been studied for many years, but previous work has two main limitations:
- (1) A large portion of it requires specialized hardware, such as depth sensors.
- (2) Other solutions are not lightweight enough to run in real time on commodity mobile devices, and are therefore limited to platforms equipped with powerful processors.
This paper addresses both limitations, proposing a solution that requires no additional hardware and runs in real time on mobile devices. The main contributions are:
- (1) An efficient two-stage hand tracking pipeline that can track multiple hands in real time on mobile devices.
- (2) A hand pose estimation model that predicts 2.5D hand pose from RGB input alone (see the sketch after this list).
- (3) An open-source, ready-to-go hand tracking pipeline for a variety of platforms, including Android, iOS, Web (Tensorflow.js), and desktop PCs.
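To make the "2.5D" output concrete: the model reports 21 landmarks per hand, each with normalized x/y image coordinates plus a relative depth z measured with the wrist near the origin. Below is a small hypothetical helper (the name to_pixels is ours; hand_landmarks is assumed to be the per-hand result object from the MediaPipe Hands Python API) that converts the normalized coordinates to pixels:

```python
# Hypothetical helper: convert MediaPipe's normalized 2.5D landmarks to
# pixel coordinates plus relative depth.
import mediapipe as mp

mp_hands = mp.solutions.hands

def to_pixels(hand_landmarks, image_width, image_height):
    """Return (name, px, py, z_rel) for each of the 21 hand landmarks."""
    points = []
    for idx, lm in enumerate(hand_landmarks.landmark):
        points.append((
            mp_hands.HandLandmark(idx).name,  # e.g. WRIST, INDEX_FINGER_TIP
            int(lm.x * image_width),          # x in pixels
            int(lm.y * image_height),         # y in pixels
            lm.z,                             # relative depth (wrist ~ 0)
        ))
    return points
```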
2. Architecture
Our hand tracking solution utilizes an ML pipeline consisting of two models working together:
• A palm detector that operates on a full input image and locates palms via an oriented hand bounding box.
• A hand landmark model that operates on the cropped hand bounding box provided by the palm detector and returns high-fidelity 2.5D landmarks.
Providing the accurately cropped palm image to the hand landmark model drastically reduces the need for data augmentation (e.g. rotations, translation and scale) and allows the network to dedicate most of its capacity towards landmark localization accuracy. In a real-time tracking scenario, we derive a bounding box from the landmark prediction of the previous frame as input for the current frame, thus avoiding applying the detector on every frame. Instead, the detector is only applied on the first frame or when the hand prediction indicates that the hand is lost.
Our hand tracking solution uses an ML pipeline in which two models work together:
- (1) A palm detector that operates on the full input image and locates palms via an oriented hand bounding box.
- (2) A hand landmark model that operates on the cropped hand region provided by the palm detector and returns high-fidelity 2.5D landmarks.
Feeding an accurately cropped palm image to the landmark model drastically reduces the need for data augmentation (e.g. rotation, translation, and scale) and lets the network devote most of its capacity to landmark localization accuracy. In a real-time tracking scenario, the bounding box for the current frame is derived from the previous frame's landmark prediction, so the detector does not have to run on every frame; it runs only on the first frame, or when the landmark prediction indicates that the hand has been lost.
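This detect-once-then-track scheme is what the MediaPipe Hands API exposes as video mode. Below is a minimal webcam sketch (assuming the mediapipe and opencv-python packages and a default camera at index 0): with static_image_mode=False, the palm detector runs only on the first frame or when tracking confidence falls below min_tracking_confidence; otherwise the hand crop is derived from the previous frame's landmarks, matching the scheme described above.

```python
# Minimal real-time webcam loop using MediaPipe Hands in video mode.
import cv2
import mediapipe as mp

mp_hands = mp.solutions.hands
mp_drawing = mp.solutions.drawing_utils

cap = cv2.VideoCapture(0)  # default webcam (assumed to exist)
with mp_hands.Hands(static_image_mode=False,  # video mode: detect, then track
                    max_num_hands=2,
                    min_detection_confidence=0.5,
                    min_tracking_confidence=0.5) as hands:
    while cap.isOpened():
        ok, frame = cap.read()
        if not ok:
            break
        results = hands.process(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
        if results.multi_hand_landmarks:
            for hand_landmarks in results.multi_hand_landmarks:
                # Draw the 21 landmarks and the skeleton connections.
                mp_drawing.draw_landmarks(
                    frame, hand_landmarks, mp_hands.HAND_CONNECTIONS)
        cv2.imshow("MediaPipe Hands", frame)
        if (cv2.waitKey(1) & 0xFF) == 27:  # press Esc to quit
            break
cap.release()
cv2.destroyAllWindows()
```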
