Paper notes: LSHTC: A Benchmark for Large-Scale Text Classification (2015)

License: CC 4.0 BY-SA

Tags: #LSHTC #text classification

For more about LSHTC, see the official website.

Contents

title
abstract
dataset
LSHTC dataset overview
Evaluation measures
References (papers on the algorithms mentioned in the paper)

title

LSHTC: A Benchmark for Large-Scale Text Classification

abstract

LSHTC is a series of challenges which aims to assess the performance
of classification systems in large-scale classification with a large number of
classes (up to hundreds of thousands). This paper describes the datasets
that have been released along the LSHTC series. The paper details the
construction of the datasets and the design of the tracks, as well as the
evaluation measures that we implemented and a quick overview of the
results. All of these datasets are available online and runs may still be
submitted on the online server of the challenges.

dataset

[Image not available in the source. The numbered links below are its footnotes.]
1 http://www.bioasq.org
2 http://www.image-net.org/challenges/LSVRC/2014/
3 http://research.microsoft.com/en-us/um/people/manik/events/xc13/
4 http://lshtc.iit.demokritos.gr/WSDM_WS
5 http://lshtc.iit.demokritos.gr/
6 http://dbpedia.org/About
7 http://www.dmoz.org/

LSHTC dataset overview

LSHTC1

The tracks of the first year of the challenge were based on the DMOZ dataset
(tree hierarchy) and used only single-label instances. The challenge was split into
4 tracks, each built from a different combination of Content and Description
vectors. Because both types of vectors were used in this challenge, only the
intersection of the two sets of instances was kept, that is, only instances that
had both a Content and a Description vector.
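
The intersection step described above can be illustrated with a minimal sketch. The instance ids, feature ids, and the dictionary layout are hypothetical and are not the challenge's actual file format; the only point is that an instance is kept when it appears in both collections.

```python
# Hypothetical data: instance id -> sparse vector (feature id -> weight).
content_vectors = {
    "doc1": {3: 2.0, 17: 1.0},
    "doc2": {5: 1.0},
    "doc3": {8: 4.0},
}
description_vectors = {
    "doc2": {5: 1.0, 9: 1.0},
    "doc3": {2: 1.0},
    "doc4": {1: 3.0},
}


def instances_with_both(content, description):
    """Return the sorted ids that have both a Content and a Description vector."""
    # Dictionary key views support set intersection directly.
    return sorted(content.keys() & description.keys())


shared_ids = instances_with_both(content_vectors, description_vectors)
print(shared_ids)  # ['doc2', 'doc3']
```

Instances such as `doc1` and `doc4`, which have only one of the two vector types, are dropped.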

LSHTC2

During LSHTC2, we used multi-label instances and added non-tree hierarchies.
For DMOZ, instead of using the intersection between the instances with Content
and Description vectors, we decided to keep only one of the two. We kept the Content
vectors, since they did not require a human annotator in order to be created. Since
we decided to move to multi-label classification, we used all the Content vectors
that we had.
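
To make the difference between a tree and a non-tree hierarchy concrete, here is a small sketch. It assumes the hierarchy is given as a list of `(parent, child)` pairs, which is an assumption made for illustration, and the category names are invented. In a tree every category has at most one parent; as soon as some category has two parents, the hierarchy is no longer a tree.

```python
from collections import defaultdict


def parents_by_node(edges):
    """Map each child to the set of its parents, given (parent, child) edges."""
    parents = defaultdict(set)
    for parent, child in edges:
        parents[child].add(parent)
    return dict(parents)


def has_single_parents(edges):
    """True if no node has more than one parent.

    This is a necessary condition for a tree; cycles and disconnected
    parts are not checked in this sketch.
    """
    return all(len(p) <= 1 for p in parents_by_node(edges).values())


# Hypothetical hierarchies.
tree_edges = [("root", "arts"), ("root", "science"), ("science", "physics")]
non_tree_edges = tree_edges + [("arts", "physics")]  # "physics" gets a second parent

print(has_single_parents(tree_edges))      # True
print(has_single_parents(non_tree_edges))  # False

# A multi-label instance simply carries a set of categories instead of one.
multi_label_instance = {"labels": {"physics", "arts"}, "vector": {5: 1.0}}
```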

LSHTC3 & LSHTC4

The two DBpedia datasets were also used, as Track 1, during the third iteration of the LSHTC challenges (LSHTC3). The only addition concerned the
Medium DBpedia dataset, where we also provided the original text of the instances, without any pre-processing. During LSHTC4, only the Large DBpedia dataset was used, for the first track, called "Very Large Supervised Learning",
which was evaluated on Kaggle (http://www.kaggle.com/).

Evaluation measures

During the classification tracks of all LSHTC challenges, we used two types of
measures in order to evaluate the participating systems, flat and hierarchical.
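
The exact formulas are not reproduced in the text of these notes, so the sketch below uses common textbook choices picked purely for illustration: micro- and macro-averaged F1 as flat measures, and an ancestor-augmented F1 as one possible hierarchical measure. They are not necessarily the measures used in the challenge. The `parent` map, category names, and function names are hypothetical, and the hierarchical part assumes a tree (one parent per category) with the root left out.

```python
def f1(tp, fp, fn):
    """F1 from true positives, false positives and false negatives."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def micro_macro_f1(true_sets, pred_sets):
    """Flat measures: every label is scored independently of the hierarchy."""
    labels = set().union(*true_sets, *pred_sets)
    counts = {label: [0, 0, 0] for label in labels}  # [tp, fp, fn] per label
    for t, p in zip(true_sets, pred_sets):
        for label in t & p:
            counts[label][0] += 1
        for label in p - t:
            counts[label][1] += 1
        for label in t - p:
            counts[label][2] += 1
    if not counts:
        return 0.0, 0.0
    macro = sum(f1(*c) for c in counts.values()) / len(counts)
    totals = [sum(c[i] for c in counts.values()) for i in range(3)]
    return f1(*totals), macro


def with_ancestors(labels, parent):
    """Add every ancestor of each label (tree assumed: one parent per node)."""
    augmented = set()
    for label in labels:
        while label is not None and label not in augmented:
            augmented.add(label)
            label = parent.get(label)
    return augmented


def hierarchical_f1(true_sets, pred_sets, parent):
    """Hierarchical measure: a prediction near the truth gets partial credit."""
    tp = fp = fn = 0
    for t, p in zip(true_sets, pred_sets):
        t_aug, p_aug = with_ancestors(t, parent), with_ancestors(p, parent)
        tp += len(t_aug & p_aug)
        fp += len(p_aug - t_aug)
        fn += len(t_aug - p_aug)
    return f1(tp, fp, fn)


# Hypothetical hierarchy (child -> parent) and predictions.
parent = {"physics": "science", "chemistry": "science", "painting": "arts"}
true_sets = [{"physics"}, {"painting"}]
pred_sets = [{"chemistry"}, {"painting"}]

micro, macro = micro_macro_f1(true_sets, pred_sets)
hier = hierarchical_f1(true_sets, pred_sets, parent)
print(round(micro, 3), round(macro, 3), round(hier, 3))  # 0.5 0.333 0.75
```

In this toy case the flat measures treat predicting `chemistry` instead of `physics` as a complete miss, while the hierarchical measure gives partial credit because both categories share the parent `science`.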

Best results:
[Image not available in the source: table of best results]

References (papers on the algorithms mentioned in the paper)

[1] Christophe Brouard. Echo at the LSHTC Pascal Challenge 2. PASCAL Workshop on Large-Scale Hierarchical Classification, ECML/PKDD 2011, pages 49-57, 2011.
[2] Xiaogang Han, Shaohua Li, and Zhiqi Shen. A k-NN method for large scale hierarchical text classification at LSHTC3. Discovery Challenge Workshop on Large Scale Hierarchical Classification, ECML/PKDD 2012, 2012.
[3] Aris Kosmopoulos, Ioannis Partalas, Eric Gaussier, Georgios Paliouras, and Ion Androutsopoulos. Evaluation measures for hierarchical classification: a unified view and novel approaches. Data Mining and Knowledge Discovery, pages 1-46, 2014.
[4] Dong-Hyun Lee. Multi-stage Rocchio classification for large-scale multilabeled text data. Discovery Challenge Workshop on Large Scale Hierarchical Classification, ECML/PKDD 2012, 2012.
[5] Xiao-Lin Wang, Hai Zhao, and Bao-Liang Lu. A meta-top-down method for large-scale hierarchical classification. Knowledge and Data Engineering, IEEE Transactions on, 26(3):500-513, March 2014.
[6] Omid Madani and Jian Huang. Large-scale many-class prediction via flat techniques. In Large-Scale Hierarchical Classification Workshop of ECIR, 2010.
[7] Youdong Miao and Xipeng Qiu. Hierarchical centroid-based classifier for large scale text classification. Large Scale Hierarchical Text Classification (LSHTC) Pascal Challenge, 18, 2009.
[8] Antti Puurula and Albert Bifet. Ensembles of sparse multinomial classifiers for scalable text classification. Discovery Challenge Workshop on Large Scale Hierarchical Classification, ECML/PKDD 2012, 2012.
[9] Yutaka Sasaki and Davy Weissenbacher. TTI's system for the LSHTC3 challenge. Discovery Challenge Workshop on Large Scale Hierarchical Classification, ECML/PKDD 2012, 2012.
[10] Grigorios Tsoumakas and Ioannis Vlahavas. Random k-labelsets: An ensemble method for multilabel classification. In Machine Learning: ECML 2007, volume 4701 of Lecture Notes in Computer Science, pages 406-417, 2007.
[11] Xiao-Lin Wang, Hai Zhao, and Bao-Liang Lu. Enhance k-nearest neighbour algorithm for large-scale multi-labeled hierarchical classification. PASCAL Workshop on Large-Scale Hierarchical Classification, ECML/PKDD 2011, pages 58-67, 2011.
[12] Yiming Yang and Xin Liu. A re-examination of text categorization methods. In Proceedings of the 22nd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR '99, pages 42-49. ACM Press, 1999.