Image Classification, Object Detection, Semantic Segmentation, Instance Segmentation, and Related Concepts

Original link: https://blog.youkuaiyun.com/u010821666/article/details/78697723

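Reposted from: https://blog.youkuaiyun.com/u010821666/article/details/78697723

The article explains these concepts in plain, accessible terms, so I am reposting it here for my own future review.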

Image Classification

The task of object classification requires binary labels indicating whether objects are present in an image.[1]

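Image classification requires us to label the objects that appear in an image. For example, given 1000 object classes, each class is either present in the image or absent. In practice: given a test image, output the set of candidate object classes that appear in it.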
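As a minimal sketch of the output format described above, the snippet below turns per-class scores into binary presence labels and a candidate set. The class names, the scores, and the 0.5 threshold are made-up assumptions for illustration; none of them come from the original article.

```python
# Hypothetical per-class confidence scores for one test image.
# Class names, scores, and the threshold are illustrative assumptions.
scores = {"person": 0.91, "sheep": 0.78, "dog": 0.66, "car": 0.03}


def presence_labels(scores, threshold=0.5):
    """Map each class to a binary label: 1 if judged present, else 0."""
    return {name: int(score >= threshold) for name, score in scores.items()}


def candidate_set(scores, threshold=0.5):
    """Return the classes judged present, highest score first."""
    present = [name for name, score in scores.items() if score >= threshold]
    return sorted(present, key=lambda name: scores[name], reverse=True)


labels = presence_labels(scores)
candidates = candidate_set(scores)
print(labels)      # one binary label per class
print(candidates)  # candidate classes for the image
```

A real classifier would produce the scores; only the thresholding step is shown here.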

Object Detection

Detecting an object entails both stating that an object belonging to a specified class is present, and localizing it in the image. The location of an object is typically represented by a bounding box.

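Object detection involves two problems: first, deciding whether an object of a given class appears in the image; second, localizing that object, most commonly with a bounding box. In practice: given a test image, output the class and location of each detected object.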
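To make the "class plus bounding box" output concrete, here is a small sketch. The `Detection` type and the sample boxes are hypothetical. Intersection over union (IoU) is a common way to compare a predicted box with an annotated one; it is added here for illustration and is not discussed in the original article.

```python
from dataclasses import dataclass


@dataclass
class Detection:
    """One detected object: a class label plus an axis-aligned bounding box."""
    label: str
    x_min: float
    y_min: float
    x_max: float
    y_max: float

    def area(self) -> float:
        width = max(0.0, self.x_max - self.x_min)
        height = max(0.0, self.y_max - self.y_min)
        return width * height


def iou(a: Detection, b: Detection) -> float:
    """Intersection over union of two boxes, a value in [0, 1]."""
    inter_w = min(a.x_max, b.x_max) - max(a.x_min, b.x_min)
    inter_h = min(a.y_max, b.y_max) - max(a.y_min, b.y_min)
    inter = max(0.0, inter_w) * max(0.0, inter_h)
    union = a.area() + b.area() - inter
    return inter / union if union > 0 else 0.0


# Made-up boxes in pixel coordinates (illustrative values only).
predicted = Detection("sheep", 10, 10, 50, 50)
ground_truth = Detection("sheep", 30, 30, 70, 70)
print(iou(predicted, ground_truth))
```

An IoU of 1.0 means the boxes coincide, and 0.0 means they do not overlap at all.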

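Semantic Segmentation (Semantic Scene Labeling)

The task of labeling semantic objects in a scene requires that each pixel of an image be labeled as belonging to a category, such as sky, chair, floor, street, etc. In contrast to the detection task, individual instances of objects do not need to be segmented.

Semantic labeling/segmentation: this task assigns an object category to every pixel in the image. Different instances of the same category do not need to be segmented separately. For the example image, the labels are person, sheep, dog, and grass, rather than sheep 1, sheep 2, sheep 3, sheep 4, and sheep 5.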
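A semantic segmentation result can be pictured as a label map with one category id per pixel. The sketch below uses a tiny made-up 4x6 "image"; the class ids and names are assumptions chosen for illustration. Note that both sheep share the same id, so nothing in the map tells them apart.

```python
# A tiny 4x6 "image" as a per-pixel label map (made-up example data).
# Every pixel gets exactly one category; the two sheep are not told apart.
GRASS, SHEEP, PERSON = 0, 1, 2
CLASS_NAMES = {GRASS: "grass", SHEEP: "sheep", PERSON: "person"}

semantic_map = [
    [0, 1, 1, 0, 0, 2],
    [0, 1, 1, 0, 0, 2],
    [0, 0, 0, 1, 1, 0],
    [0, 0, 0, 1, 1, 0],
]


def pixels_per_class(label_map):
    """Count how many pixels carry each category label."""
    counts = {}
    for row in label_map:
        for class_id in row:
            name = CLASS_NAMES[class_id]
            counts[name] = counts.get(name, 0) + 1
    return counts


print(pixels_per_class(semantic_map))
```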

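Instance Segmentation

Instance segmentation combines object detection and semantic segmentation. Compared with the bounding boxes of object detection, instance segmentation is accurate down to the object's outline; compared with semantic segmentation, it distinguishes the different individuals of the same category in the image (sheep 1, sheep 2, sheep 3, ...).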
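One way to see how instance segmentation relates to the other two tasks is to store an instance id per pixel and derive the other outputs from it. This is only a sketch on made-up data: the id convention (0 for "not a sheep") and the helper names are assumptions, not part of any particular dataset format.

```python
# Per-pixel instance ids for the "sheep" category (made-up example data).
# 0 means "not a sheep"; 1 and 2 are two different sheep.
instance_map = [
    [0, 1, 1, 0, 0, 0],
    [0, 1, 1, 0, 0, 0],
    [0, 0, 0, 2, 2, 0],
    [0, 0, 0, 2, 2, 0],
]


def bounding_boxes(instance_map):
    """Derive (x_min, y_min, x_max, y_max) per instance from its pixel mask."""
    boxes = {}
    for y, row in enumerate(instance_map):
        for x, instance_id in enumerate(row):
            if instance_id == 0:
                continue
            if instance_id not in boxes:
                boxes[instance_id] = (x, y, x, y)
            else:
                x0, y0, x1, y1 = boxes[instance_id]
                boxes[instance_id] = (min(x0, x), min(y0, y), max(x1, x), max(y1, y))
    return boxes


def semantic_from_instances(instance_map):
    """Collapse instance ids into a binary sheep / not-sheep map."""
    return [[1 if instance_id else 0 for instance_id in row] for row in instance_map]


print(bounding_boxes(instance_map))           # detection-style boxes
print(semantic_from_instances(instance_map))  # semantic-style label map
```

Going from instances to boxes or to a semantic map loses information; the reverse direction is not possible without extra knowledge.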

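The figure shows example dataset annotations for the four tasks. The annotations become progressively more complex, but the results become progressively more useful.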

Object Segmentation

One of the reasons this term has fallen out of favor in the research community is that it is problematically vague. Object segmentation used to mean simply finding a single object, or a small number of objects, in an image and drawing a boundary around them, and for most purposes you can still assume it means this. However, it also came to mean segmentation of blobs that might be objects, segmentation of objects from the background (now more commonly called background subtraction, background segmentation, or foreground detection), and in some cases it was even used interchangeably with object recognition using bounding boxes. That last usage quickly stopped with the advent of deep neural network approaches to object recognition; before then, object recognition could also mean simply labeling an entire image with the object in it.

What makes “segmentation” “semantic”?

Simply put, each segment, or in the case of deep methods each pixel, is given a class label based on a category. Segmentation in general is just the division of the image by some rule. Mean shift segmentation, for example, at a very high level divides the data according to changes in the energy of the image. Graph-cut-based segmentation is similarly not learned but derived directly from the properties of each image, independently of all other images. More recent (neural-network-based) methods use labeled pixels to learn the local features associated with specific classes, and then classify each pixel according to which class has the highest confidence for that pixel. In this sense, “pixel labeling” is actually a more honest name for the task, and the “segmentation” component is emergent.
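The per-pixel classification step described above amounts to an argmax over class confidences at each pixel. Here is a minimal sketch, assuming a network has already produced the confidences; the class list and the numbers are made up for illustration.

```python
# Hypothetical per-pixel class confidences for a 2x2 image.
# scores[y][x] holds one confidence per class, in CLASS_NAMES order.
CLASS_NAMES = ["sky", "grass", "sheep"]

scores = [
    [[0.8, 0.1, 0.1], [0.6, 0.3, 0.1]],
    [[0.1, 0.7, 0.2], [0.1, 0.3, 0.6]],
]


def label_pixels(scores):
    """Assign each pixel the index of its highest-confidence class (argmax)."""
    labels = []
    for row in scores:
        labels.append([max(range(len(px)), key=px.__getitem__) for px in row])
    return labels


label_map = label_pixels(scores)
named = [[CLASS_NAMES[i] for i in row] for row in label_map]
print(named)
```

No explicit "segment" is ever computed: regions appear only because neighboring pixels happen to receive the same label, which is the sense in which the segmentation is emergent.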

Instance Segmentation

Arguably the most difficult, most relevant, and original meaning of object segmentation, “instance segmentation” means segmenting the individual objects within a scene, regardless of whether they are of the same type. One of the reasons this is so difficult is that, from a vision perspective (and in some ways a philosophical one), what makes an “object” instance is not entirely clear. Are body parts objects? Should such “part-objects” be segmented at all by an instance segmentation algorithm? Should they be segmented only if they are seen separate from the whole? What about compound objects: should two things that are clearly adjoined but separable be one object or two? (Is a rock glued to the top of a stick an ax, a hammer, or just a stick and a rock unless properly made?) It also isn’t clear how to distinguish instances. Is a wall a separate instance from the other walls it is attached to? In what order should instances be counted? As they appear? By proximity to the viewpoint? In spite of these difficulties, segmentation of objects is still a big deal, because as humans we interact with objects all the time regardless of their “class label” (using random objects around us as paperweights, sitting on things that are not chairs). Some datasets do attempt to get at this problem, but the main reason it has not received much attention yet is that it isn’t well enough defined.
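The difficulty of telling instances apart can be illustrated with a deliberately naive approach: treating each connected region of a class mask as one instance. This connected-component labeling is a swapped-in illustration, not a method from the original article, and the masks are made up. It separates two sheep that stand apart but merges them as soon as they touch.

```python
from collections import deque


def connected_components(mask):
    """Label 4-connected regions of truthy pixels; returns (label_map, count)."""
    height, width = len(mask), len(mask[0])
    labels = [[0] * width for _ in range(height)]
    count = 0
    for y in range(height):
        for x in range(width):
            if not mask[y][x] or labels[y][x]:
                continue
            # Found an unvisited foreground pixel: flood-fill a new region.
            count += 1
            labels[y][x] = count
            queue = deque([(y, x)])
            while queue:
                cy, cx = queue.popleft()
                for ny, nx in ((cy - 1, cx), (cy + 1, cx), (cy, cx - 1), (cy, cx + 1)):
                    inside = 0 <= ny < height and 0 <= nx < width
                    if inside and mask[ny][nx] and not labels[ny][nx]:
                        labels[ny][nx] = count
                        queue.append((ny, nx))
    return labels, count


# Made-up "sheep" masks: two separated sheep, then two sheep that touch.
separated = [
    [1, 1, 0, 1, 1],
    [1, 1, 0, 1, 1],
]
touching = [
    [1, 1, 1, 1, 1],
    [1, 1, 0, 1, 1],
]
print(connected_components(separated)[1])  # number of regions found
print(connected_components(touching)[1])   # touching sheep merge into one region
```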

Scene Parsing / Scene Labeling

Scene parsing is the strict segmentation approach to scene labeling, which also has some vagueness problems of its own. Historically, scene labeling meant dividing the entire “scene” (image) into segments and giving them all a class label. However, it was also used to mean giving class labels to areas of the image without explicitly segmenting them. With respect to segmentation, “semantic segmentation” does not imply dividing the entire scene. For semantic segmentation, the algorithm is intended to segment only the objects it knows, and it will be penalized by its loss function for labeling pixels that don’t have any label. For example, the MS-COCO dataset is a dataset for semantic segmentation where only some objects are segmented.
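The difference between a full scene parse and a partial semantic annotation can be sketched with two small checks. Everything here is an assumption made for illustration: the `BACKGROUND` id for "no label", the helper names, and the toy label maps. Counting wrongly labeled background pixels is only a stand-in for the penalty a real loss function would apply.

```python
BACKGROUND = 0  # assumed id for "no label"; a convention chosen for this sketch


def coverage(label_map):
    """Fraction of pixels that carry a real class label."""
    total = sum(len(row) for row in label_map)
    labeled = sum(1 for row in label_map for v in row if v != BACKGROUND)
    return labeled / total


def false_positive_pixels(predicted, annotated):
    """Count pixels given a class by the prediction but unlabeled in the annotation."""
    return sum(
        1
        for pred_row, ann_row in zip(predicted, annotated)
        for p, a in zip(pred_row, ann_row)
        if a == BACKGROUND and p != BACKGROUND
    )


# Made-up label maps: a partial annotation, a prediction, and a full parse.
annotated = [[0, 0, 1, 1], [0, 0, 1, 1]]
predicted = [[0, 1, 1, 1], [0, 0, 1, 0]]
full_parse = [[2, 2, 1, 1], [3, 3, 1, 1]]

print(coverage(annotated))    # partial annotation: only some pixels labeled
print(coverage(full_parse))   # scene parse: every pixel labeled
print(false_positive_pixels(predicted, annotated))
```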
