Accuracy is one of the most important data quality dimensions, and its assessment is a key issue in data management. Most current studies focus on qualitative analysis of the accuracy dimension, an analysis that depends heavily on experts' knowledge; little work addresses how to quantify accuracy automatically. Based on the Jensen–Shannon divergence (JSD) measure, we propose that the accuracy of data can be quantified automatically by comparing the data with its entity's closest approximation in the available context. To identify this closest approximation quickly in large-scale data sources, locality-sensitive hashing (LSH) is employed to extract approximations at multiple levels, namely the column, record, and field levels. Our approach not only gives each data source an objective accuracy score very quickly, as long as context members are available, but also avoids laborious human interaction. As an automatic accuracy assessment solution in a multiple-source environment, our approach is distinctive, especially for large-scale data sources. Theory and experiments show that our approach performs well in producing metadata on the accuracy dimension.
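The Jensen–Shannon divergence named above is a symmetrized, smoothed variant of the Kullback–Leibler divergence, bounded in [0, 1] when computed with base-2 logarithms, which makes it usable directly as a dissimilarity score between a data item's value distribution and that of its closest approximation. The following is a minimal sketch of the JSD computation over discrete distributions, not the paper's implementation; the function names and the list-of-probabilities representation are illustrative assumptions.

```python
import math

def kl_divergence(p, q):
    """Kullback-Leibler divergence KL(p || q) for discrete distributions
    given as aligned lists of probabilities. Terms with p_i = 0 contribute 0."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def jsd(p, q):
    """Jensen-Shannon divergence: average KL divergence of p and q
    from their midpoint distribution m. With log base 2, JSD lies in [0, 1]."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)

# Identical distributions have zero divergence; disjoint ones reach 1.0.
print(jsd([0.5, 0.5], [0.5, 0.5]))  # 0.0
print(jsd([1.0, 0.0], [0.0, 1.0]))  # 1.0
```

Because the midpoint m is positive wherever p or q is, the KL terms are always well defined, which is the property that makes JSD a robust comparison measure here.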
Published in Expert Systems with Applications (Elsevier). The online version is available at http://dx.doi.org/10.1016/j.eswa.2009.08.023.
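The abstract's use of locality-sensitive hashing to find an entity's closest approximation can be sketched with MinHash, a standard LSH scheme for set (Jaccard) similarity: records are tokenized into sets, each set is reduced to a short signature, and the fraction of matching signature positions estimates the similarity. This is a generic MinHash sketch under assumed names and parameters, not the multi-level (column/record/field) scheme of the paper.

```python
import random
import zlib

def minhash_signature(tokens, num_hashes=64, seed=0):
    """MinHash signature of a token set: for each of num_hashes random
    hash functions h(x) = (a*x + b) mod p, keep the minimum hash value."""
    rng = random.Random(seed)  # fixed seed so signatures are comparable
    p = (1 << 61) - 1          # a large Mersenne prime
    params = [(rng.randrange(1, p), rng.randrange(p)) for _ in range(num_hashes)]
    # Deterministic base hash per token (CRC32), unlike Python's salted hash().
    hashed = [zlib.crc32(t.encode("utf-8")) for t in tokens]
    return [min((a * x + b) % p for x in hashed) for a, b in params]

def estimated_jaccard(sig1, sig2):
    """Fraction of matching positions estimates the Jaccard similarity."""
    return sum(a == b for a, b in zip(sig1, sig2)) / len(sig1)

# Similar records collide on many positions; dissimilar ones on almost none.
s1 = minhash_signature({"john", "smith", "new", "york"})
s2 = minhash_signature({"john", "smith", "new", "jersey"})
s3 = minhash_signature({"completely", "different", "record"})
```

In practice the signatures would also be banded into hash buckets so that only records sharing a bucket are compared, which is what lets LSH scale the closest-approximation search to large data sources.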