[Result Analysis] Analysis of the MUREL project results

Article link: https://blog.youkuaiyun.com/snow_maple521/article/details/110001560

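This post covers the MUREL project's multimodal relational reasoning for the VQA task and presents validation-set results at epoch 20. Overall accuracy is 64.5%. Question types such as 'what room is', 'is it', and 'what sport is' score high, while 'why', 'what number is', and 'what is the name' score low. Using the accuracy distribution across question types and answer types, the author exposes the model's limitations on specific question types and argues that data bias may affect the model's generalization. Visualizations of per-question-type accuracy are also provided.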


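0. Preface

The question-type statistics for the training-set and validation-set annotation files are as follows:

0.1 Training set

According to the official site, the training set contains 443757 questions in total: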

{
	'how many': 42339,
	'is the': 34927,
	'what': 34608,
	'what color is the': 27962,
	'what is the': 24502,
	'none of the above': 16973,
	'is this': 16444,
	'is this a': 16024,
	'what is': 13561,
	'what kind of': 11192,
	'are the': 10701,
	'is there a': 9982,
	'what type of': 7962,
	'is it': 7345,
	'what are the': 7225,
	'where is the': 6734,
	'is there': 6513,
	'what color are the': 6183,
	'does the': 6103,
	'is': 6079,
	'are there': 5877,
	'are these': 5782,
	'which': 5382,
	'what is the man': 5238,
	'is the man': 4972,
	'are': 4912,
	'how': 4740,
	'does this': 4396,
	'how many people are': 4276,
	'what is on the': 4254,
	'what does the': 4075,
	'what is in the': 3990,
	'what is this': 3970,
	'why': 3347,
	'what are': 3277,
	'are they': 3074,
	'what color': 3032,
	'do': 3012,
	'what time': 2914,
	'are there any': 2790,
	'what color is': 2649,
	'is he': 2534,
	'what sport is': 2527,
	'where are the': 2161,
	'who is': 2154,
	'how many people are in': 2071,
	'what animal is': 2001,
	'is this an': 1981,
	'do you': 1971,
	'is the woman': 1938,
	'has': 1827,
	'is this person': 1756,
	'what is the color of the': 1750,
	'what is the person': 1729,
	'can you': 1728,
	'what is the woman': 1706,
	'could': 1698,
	'is the person': 1694,
	'what number is': 1668,
	'what room is': 1647,
	'what is the name': 1618,
	'what brand': 1600,
	'is that a': 1585,
	'was': 1551,
	'why is the': 1544
}

0.2 Validation set

According to the official site, the validation set contains 214354 questions in total:

{
	'how many': 20462,
	'is the': 17265,
	'what': 15897,
	'what color is the': 14061,
	'what is the': 11353,
	'none of the above': 8550,
	'is this': 7841,
	'is this a': 7492,
	'what is': 6328,
	'what kind of': 5840,
	'are the': 5264,
	'is there a': 4679,
	'what type of': 4040,
	'where is the': 3716,
	'is it': 3566,
	'what are the': 3282,
	'does the': 3183,
	'is': 3169,
	'is there': 3120,
	'what color are the': 3118,
	'are these': 2839,
	'are there': 2771,
	'what is the man': 2663,
	'is the man': 2511,
	'which': 2448,
	'how': 2422,
	'are': 2359,
	'does this': 2227,
	'what is on the': 2174,
	'how many people are': 2005,
	'what does the': 1970,
	'what time': 1746,
	'what is in the': 1733,
	'what is this': 1696,
	'what are': 1556,
	'do': 1503,
	'why': 1438,
	'what color': 1428,
	'what color is': 1335,
	'are they': 1335,
	'are there any': 1330,
	'where are the': 1313,
	'is he': 1087,
	'what sport is': 1086,
	'who is': 1070,
	'is the woman': 992,
	'has': 946,
	'what brand': 935,
	'how many people are in': 905,
	'what is the person': 900,
	'is this an': 890,
	'can you': 872,
	'what is the woman': 853,
	'what animal is': 833,
	'what is the color of the': 826,
	'was': 818,
	'is the person': 794,
	'what is the name': 780,
	'what room is': 762,
	'is this person': 734,
	'do you': 724,
	'is that a': 714,
	'what number is': 673,
	'could': 618,
	'why is the': 514
}

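1. MUREL: Multimodal Relational Reasoning for Visual Question Answering (VQA)

Project introduction
These results come from epoch 20 and use the results file for the validation set. Because the run was interrupted midway, evaluation on the test set was not performed.

2. Result analysis

We evaluate with the eval code provided on the official site.

The eval demo is as follows: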

from block.external.VQA.PythonHelperTools.vqaTools.vqa import VQA
from block.external.VQA.PythonEvaluationTools.vqaEvaluation.vqaEval import VQAEval
import matplotlib.pyplot as plt
import json
import random

# set up file names and paths
# versionType ='v2_' # this should be '' when using VQA v2.0 dataset
taskType    ='OpenEnded' # 'OpenEnded' only for v2.0. 'OpenEnded' or 'MultipleChoice' for v1.0
dataType    ='mscoco'  # 'mscoco' only for v1.0. 'mscoco' for real and 'abstract_v002' for abstract for v1.0.
dataSubType ='val2014'
dir_rslt = 'D:/论文项目/bootstrap项目/murel.bootstrap.pytorch/logs/vqa2/murel/results/val/epoch,20/'
dataDir = 'D:/论文项目/bootstrap项目/murel.bootstrap.pytorch/data/vqa/vqa2/raw/annotations/'
annFile     ='%s%s_%s_annotations.json'%(dataDir, dataType, dataSubType)
quesFile    ='%s%s_%s_%s_questions.json'%(dataDir, taskType, dataType, dataSubType)
# imgDir      ='%s/Images/%s/%s/' %(dataDir, dataType, dataSubType)
resultType  ='model'
fileTypes   = ['results', 'accuracy', 'evalQA', 'evalQuesType', 'evalAnsType']

# An example result json file has been provided in './Results' folder.

[resFile, accuracyFile, evalQAFile, evalQuesTypeFile, evalAnsTypeFile] = ['%s/%s_%s_%s_%s_%s.json'%(dir_rslt, taskType, dataType, dataSubType, \
resultType, fileType) for fileType in fileTypes]

# create vqa object and vqaRes object
vqa = VQA(annFile, quesFile)
vqaRes = vqa.loadRes(resFile, quesFile)

# create vqaEval object by taking vqa and vqaRes
vqaEval = VQAEval(vqa, vqaRes, n=2)   #n is precision of accuracy (number of places after decimal), default is 2

# evaluate results
"""
If you have a list of question ids on which you would like to evaluate your results, pass it as a list to below function
By default it uses all the question ids in annotation file
"""
vqaEval.evaluate()

# print accuracies
import copy
print ("Overall Accuracy is: %.02f\n" %(vqaEval.accuracy['overall']))
print ("Per Question Type Accuracy is the following:")
tmp_vqa_50d = copy.deepcopy(vqaEval.accuracy['perQuestionType'])
for quesType in vqaEval.accuracy['perQuestionType']:
	print ("%s : %.02f" %(quesType, vqaEval.accuracy['perQuestionType'][quesType]))
# tmp_vqa_50s = []
# for quesType in vqaEval.accuracy['perQuestionType']:
# 	if  vqaEval.accuracy['perQuestionType'][quesType]>=60:
# 		tmp_vqa_50d.pop(quesType)
print ("Per Answer Type Accuracy is the following:")
for ansType in vqaEval.accuracy['perAnswerType']:
	print ("%s : %.02f" %(ansType, vqaEval.accuracy['perAnswerType'][ansType]))
# demo how to use evalQA to retrieve low score result
evals = [quesId for quesId in vqaEval.evalQA if vqaEval.evalQA[quesId]<35]   #35 is per question percentage accuracy
# Randomly pick one question whose accuracy is below 35 and inspect the answer the model produced

if len(evals) > 0:
	print ('ground truth answers')
	randomEval = random.choice(evals)
	randomAnn = vqa.loadQA(randomEval)
	vqa.showQA(randomAnn)

	print ('generated answer (accuracy %.02f)'%(vqaEval.evalQA[randomEval]))
	ann = vqaRes.loadQA(randomEval)[0]
	print ("Answer:   %s\n" %(ann['answer']))



# plot accuracy for various question types
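c_l = plt.bar(range(len(tmp_vqa_50d)), list(tmp_vqa_50d.values()), align='center')
for k in c_l:
    height = k.get_height()
    # Label each bar with its accuracy value
    plt.text(k.get_x() + k.get_width() / 2, height, str(height), fontsize=5, ha="center", va="bottom")
# rotation must be a number (or 'vertical'/'horizontal'); the string '90' fails on recent matplotlib
plt.xticks(range(len(tmp_vqa_50d)), list(tmp_vqa_50d.keys()), rotation=90, fontsize=10)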
plt.title('Per Question Type Accuracy', fontsize=10)
plt.xlabel('Question Types', fontsize=10)
plt.ylabel('Accuracy', fontsize=10)
plt.show()
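
Section 4.2 reads the per-question-type accuracies from a JSON file, while the demo above only prints them. The following is a minimal sketch of one way to persist them, not part of the official evaluation tools: the helper name `save_sorted_accuracy` and the output file name are hypothetical, and the sample dictionary stands in for `vqaEval.accuracy['perQuestionType']`.

```python
import json
import os
import tempfile


def save_sorted_accuracy(per_type_acc, path):
    """Sort accuracies in descending order and write them to a JSON file."""
    ordered = dict(sorted(per_type_acc.items(), key=lambda kv: kv[1], reverse=True))
    with open(path, 'w') as f:
        json.dump(ordered, f, indent=2)
    return ordered


# A few validation-set values reported in this post, used as a stand-in
# for vqaEval.accuracy['perQuestionType'] (assumption: same dict layout)
sample_acc = {'why': 21.02, 'what room is': 92.49, 'how many': 53.92}

# Hypothetical output location; replace with your own results directory
out_path = os.path.join(tempfile.mkdtemp(), 'acc_question_type.json')
saved = save_sorted_accuracy(sample_acc, out_path)
print(list(saved))
```

In the real script the call would presumably be `save_sorted_accuracy(vqaEval.accuracy['perQuestionType'], some_path)` after `vqaEval.evaluate()` has run.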

2.1 Results

Training-set results, sorted

Overall Accuracy is: 87.93
{
	'what room is': 98.84,
	'what sport is': 98.15,
	'is it': 97.28,
	'are there any': 97.03,
	'do you': 96.64,
	'could': 96.34,
	'is there a': 96.23,
	'are there': 96.07,
	'is there': 95.5,
	'is this a': 95.48,
	'is this': 95.47,
	'is that a': 95.21,
	'is this an': 95.16,
	'does this': 94.97,
	'is he': 94.87,
	'is this person': 94.78,
	'are these': 94.65,
	'was': 94.61,
	'are they': 94.57,
	'is': 94.43,
	'is the man': 94.26,
	'is the': 93.96,
	'what animal is': 93.92,
	'is the person': 93.1,
	'is the woman': 93.09,
	'are the': 92.96,
	'can you': 92.92,
	'does the': 92.81,
	'has': 92.8,
	'do': 92.57,
	'what color is': 92.23,
	'what is the person': 91.71,
	'what is the color of the': 91.52,
	'what is this': 90.71,
	'what color is the': 90.64,
	'what color are the': 90.44,
	'are': 90.08,
	'what is the man': 89.14,
	'what color': 88.89,
	'what are': 88.29,
	'what is the woman': 87.91,
	'none of the above': 86.5,
	'what are the': 86.07,
	'what is in the': 86.03,
	'what is the': 85.67,
	'what brand': 85.4,
	'what is': 84.86,
	'what is on the': 84.83,
	'what is the name': 84.06,
	'what does the': 83.67,
	'what': 83.2,
	'what kind of': 83.17,
	'what type of': 83.1,
	'how many people are in': 79.54,
	'what time': 79.01,
	'how many people are': 78.4,
	'how many': 76.82,
	'which': 74.06,
	'who is': 72.02,
	'what number is': 71.92,
	'where are the': 69.05,
	'where is the': 68.43,
	'how': 66.53,
	'why is the': 63.14,
	'why': 62.75
}
Per Answer Type Accuracy is the following:
other : 85.41
yes/no : 94.64
number : 76.26
ground truth answers
Question: Is a woman sitting in a chair?
Answer 1: yes
Answer 2: yes
Answer 3: yes
Answer 4: yes
Answer 5: yes
Answer 6: yes
Answer 7: yes
Answer 8: yes
Answer 9: yes
Answer 10: yes
generated answer (accuracy 0.00)
Answer:   no

Validation-set results, sorted

Overall Accuracy is: 64.50
{
	'what room is': 92.49,
	'is it': 91.27,
	'what sport is': 90.7,
	'are there any': 89.35,
	'are there': 84.89,
	'was': 83.51,
	'is there': 83.47,
	'is this': 83.06,
	'is this a': 82.93,
	'is he': 82.5,
	'is the man': 81.91,
	'is there a': 81.63,
	'could': 81.63,
	'do you': 81.49,
	'is that a': 81.39,
	'is the': 81.04,
	'is this person': 80.89,
	'does this': 80.61,
	'is the person': 80.42,
	'what is the color of the': 80.31,
	'are these': 80.29,
	'are they': 80.01,
	'is this an': 79.76,
	'what color is': 79.43,
	'is': 78.99,
	'has': 78.94,
	'are the': 78.66,
	'does the': 78.65,
	'what color is the': 78.43,
	'can you': 78.22,
	'is the woman': 77.91,
	'are': 76.78,
	'do': 76.55,
	'what color are the': 75.05,
	'what animal is': 72.73,
	'what color': 72.38,
	'what is this': 64.36,
	'what is the person': 64.34,
	'how many people are in': 63.54,
	'what is the man': 62.81,
	'how many people are': 59.54,
	'what are': 58.88,
	'none of the above': 58.44,
	'what kind of': 56.49,
	'what is the woman': 56.05,
	'what type of': 55.0,
	'how many': 53.92,
	'what is in the': 51.48,
	'what are the': 50.41,
	'what is the': 48.72,
	'what': 46.44,
	'what is on the': 45.4,
	'which': 43.81,
	'what is': 43.08,
	'what brand': 43.04,
	'who is': 38.73,
	'where are the': 38.67,
	'where is the': 34.9,
	'how': 30.92,
	'what does the': 26.02,
	'what time': 25.25,
	'why is the': 24.2,
	'why': 21.02,
	'what is the name': 13.9,
	'what number is': 6.29
}
Per Answer Type Accuracy is the following:
other : 55.41
yes/no : 82.41
number : 47.40
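
To make the difference between the two result sets easier to see, the gap between training and validation accuracy can be computed per question type. This is only an illustrative sketch: it copies five values from the two sorted listings above rather than loading the full result files, and the variable names are hypothetical.

```python
# Values copied from the training-set and validation-set listings in this post
train_acc = {
    'what room is': 98.84,
    'is it': 97.28,
    'why': 62.75,
    'what is the name': 84.06,
    'what number is': 71.92,
}
val_acc = {
    'what room is': 92.49,
    'is it': 91.27,
    'why': 21.02,
    'what is the name': 13.9,
    'what number is': 6.29,
}

# Train-minus-validation gap, in accuracy points
gaps = {q: round(train_acc[q] - val_acc[q], 2) for q in train_acc}

# Largest gaps first: these are the types where the model generalizes worst
ranked = sorted(gaps.items(), key=lambda kv: kv[1], reverse=True)
for question_type, gap in ranked:
    print('%s : %.02f' % (question_type, gap))
```

On these five types the gap ranges from about 6 points ('is it', 'what room is') to about 70 points ('what is the name'), so the weak validation types are exactly those with the largest drop from training.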

######### Randomly pick one question whose per-question accuracy is below 35 and print its ground-truth answers and the generated answer ###################
ground truth answers:
Answer 1: 2
Answer 2: 2
Answer 3: 2
Answer 4: 2
Answer 5: 3
Answer 6: 2
Answer 7: 2
Answer 8: 2
Answer 9: 2
Answer 10: 2
Question: How many orange cones do you see in the picture?
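generated answer (accuracy 30.00) # Because the answer 3 appears once among the ground-truth answers, the accuracy is 30; an answer that appears more than 3 times scores 100%. (The formula is given below.)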
Answer:   3

[Image: accuracy formula (not available)]
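
The score of 30.00 can be reproduced by hand. The sketch below assumes the usual VQA rule, min(1, matches / 3), averaged over every subset obtained by leaving out one of the 10 annotators, which is how the official evaluation script is commonly described. It skips the answer normalization (punctuation, articles, number words) that the official script applies first, and the function name `vqa_accuracy` is hypothetical.

```python
def vqa_accuracy(gt_answers, predicted, n=2):
    """Leave-one-out VQA accuracy (in percent) for a single question."""
    per_subset = []
    for i in range(len(gt_answers)):
        # Drop the i-th annotator and count matches among the remaining ones
        others = gt_answers[:i] + gt_answers[i + 1:]
        matches = sum(1 for answer in others if answer == predicted)
        per_subset.append(min(1.0, matches / 3))
    return round(100 * sum(per_subset) / len(per_subset), n)


# Ground-truth answers of the "orange cones" question shown above
gt = ['2', '2', '2', '2', '3', '2', '2', '2', '2', '2']
print(vqa_accuracy(gt, '3'))  # the model's answer
print(vqa_accuracy(gt, '2'))  # the majority answer
```

For the prediction '3', nine of the ten subsets contain one match (1/3 each) and one subset contains none, so the average is 0.3, i.e. 30.00. Under the same rule an answer given by exactly three annotators scores 90, and one given by more than three scores 100.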

2.2 Visualization

Output produced by the evaluation demo:

2.2.1 Results for the results.json file on the train set

2.2.2 Results on the val set

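3. Summary

Computing accuracy per question type shows that 'why', 'what number is', and 'what is the name' score low, while some counting types such as 'how many' score comparatively higher. Another post, on the TDIUC evaluation method, suggests a possible reason: for counting questions, a single answer occurs so frequently that the data is biased, and this bias makes the predictions look accurate. If the accuracy of a question type computed after removing this bias differs substantially from the accuracy computed without removing it, that question type depends on the answer distribution, which in turn indicates that the model does not generalize well.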
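
The comparison described above can be illustrated with a toy example. One simple way to remove answer-frequency bias is to compute accuracy separately for each distinct ground-truth answer and then average, so that a frequent answer counts no more than a rare one. Whether this matches the exact procedure in the referenced post is an assumption. The data below is made up for illustration, and exact string matching replaces the VQA soft accuracy.

```python
from collections import defaultdict


def plain_accuracy(pairs):
    """Fraction of (ground_truth, prediction) pairs that match."""
    return sum(gt == pred for gt, pred in pairs) / len(pairs)


def answer_normalized_accuracy(pairs):
    """Mean of per-answer accuracies: each distinct answer weighs the same."""
    hits = defaultdict(int)
    totals = defaultdict(int)
    for gt, pred in pairs:
        totals[gt] += 1
        hits[gt] += int(gt == pred)
    return sum(hits[a] / totals[a] for a in totals) / len(totals)


# Hypothetical counting questions: '2' dominates the ground truth
# and the model answers '2' every time
pairs = [('2', '2')] * 8 + [('1', '2'), ('3', '2')]

plain = plain_accuracy(pairs)
normalized = answer_normalized_accuracy(pairs)
print(plain, normalized)
```

Here plain accuracy is 0.8 while the answer-normalized accuracy is 1/3: the large difference is the signal that the score rests on the answer distribution rather than on the image.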

4. Appendix

4.1 Code for counting and sorting question types

The tricky part is sorting a dictionary.

import json
from collections import defaultdict

with open('mscoco_val2014_annotations.json', 'r') as f:
    data_loader = json.load(f)
data_dict_list = data_loader['annotations']  # the 'annotations' field: a list of dicts

# Count how many annotations fall under each question type
question_types_count = defaultdict(int)
for dict1 in data_dict_list:  # each item is one annotation (a dict)
    question_types_count[dict1['question_type']] += 1

# Print the per-type counts
print(question_types_count)
# Print the total number of questions
print(sum(question_types_count.values()))
# Sort the dict by value: convert it to a list of (key, value) tuples first
list_1 = list(question_types_count.items())
list_1.sort(key=lambda x: x[1], reverse=True)
list2 = dict(list_1)  # dicts preserve insertion order in Python 3.7+
print(list2)
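
The count-then-sort steps above can also be written with `collections.Counter` from the standard library, whose `most_common()` method returns the entries already sorted by count in descending order. This is an alternative sketch, not the code used for the statistics in this post; the short `annotations` list is a made-up stand-in for `data_loader['annotations']`.

```python
from collections import Counter

# Hypothetical stand-in for data_loader['annotations']
annotations = [
    {'question_type': 'is the'},
    {'question_type': 'how many'},
    {'question_type': 'how many'},
    {'question_type': 'why'},
    {'question_type': 'is the'},
    {'question_type': 'how many'},
]

# Count question types in one pass
counts = Counter(a['question_type'] for a in annotations)

# most_common() with no argument returns every (type, count) pair,
# ordered from the most frequent to the least frequent
sorted_counts = dict(counts.most_common())

print(sorted_counts)
print(sum(counts.values()))  # total number of questions
```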

4.2 Code for visualizing per-question-type accuracy

import json
import matplotlib.pyplot as plt

with open('murel_20_val_acc_question_type.json', 'r') as f:
    data_loader = json.load(f)
list_1 = list(data_loader.items())
list_1.sort(key=lambda x: x[1], reverse=True)
list_2 = dict(list_1)  # question-type accuracies, sorted in descending order

# Split the question types into four accuracy bands
dict_0_40 = {}
dict_40_60 = {}
dict_60_80 = {}
dict_80_100 = {}

for key in list_2:
    if list_2[key] < 40:  # also covers an accuracy of exactly 0
        dict_0_40[key] = list_2[key]
    elif list_2[key] < 60:
        dict_40_60[key] = list_2[key]
    elif list_2[key] < 80:
        dict_60_80[key] = list_2[key]
    else:
        dict_80_100[key] = list_2[key]

print(list_2)

# Visualization: plot one band (here 80-100) as a bar chart
c_l = plt.bar(range(len(dict_80_100)), list(dict_80_100.values()), color="green", align='center')
for k in c_l:
    height = k.get_height()
    # Label each bar with its accuracy value
    plt.text(k.get_x() + k.get_width() / 2, height, str(height), fontsize=6, ha="center", va="bottom")
# rotation must be a number (or 'vertical'/'horizontal'); the string '90' fails on recent matplotlib
plt.xticks(range(len(dict_80_100)), list(dict_80_100.keys()), rotation=90, fontsize=10)
plt.title('Per Question Type Accuracy', fontsize=10)
plt.xlabel('Question Types', fontsize=10)
plt.ylabel('Accuracy', fontsize=10)
plt.show()
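
The four hand-written band dictionaries can be generalized with `bisect` from the standard library, which finds the band index for each value from a list of band edges. This is a sketch under the same band boundaries (40, 60, 80, lower bound inclusive); the function name `bucket_by_accuracy` is hypothetical and the sample values are taken from the validation-set results in this post.

```python
from bisect import bisect_right


def bucket_by_accuracy(acc, edges=(40, 60, 80)):
    """Group question types into accuracy bands; each band includes its lower edge."""
    bounds = (0,) + tuple(edges) + (100,)
    labels = ['%d-%d' % (lo, hi) for lo, hi in zip(bounds, bounds[1:])]
    buckets = {label: {} for label in labels}
    for question_type, value in acc.items():
        # bisect_right counts how many edges are <= value, which is the band index
        buckets[labels[bisect_right(edges, value)]][question_type] = value
    return buckets


# A few validation-set values from this post
sample = {
    'what room is': 92.49,
    'what animal is': 72.73,
    'what is the': 48.72,
    'why': 21.02,
}
bands = bucket_by_accuracy(sample)
for label, members in bands.items():
    print(label, members)
```

Each entry of `bands` can then be passed to the plotting code above in place of `dict_80_100` to draw one chart per band.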