Exploring the Power of Links in Data Mining: Notes from Jiawei Han's Talk

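Professor Jiawei Han presented his group's recent research in data mining, covering four pieces of work: classification using links, user-guided clustering, link-based clustering, and object distinction analysis. The methods showed strong results across a range of tasks.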

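Jiawei Han (韩家炜) is a towering figure in data mining whose name everyone in the field knows, and today I was lucky enough to see him in person. My first impression, oddly enough, was that he must have been quite handsome when he was young (embarrassing to admit!). Of course, he is still a sharp and energetic scholar today.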

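The talk was titled "Exploring the Power of Links in Data Mining". It covered four papers, all by his PhD student Xiaoxin Yin. Most of this work was inspired by link-analysis algorithms such as PageRank and HITS. By exploiting the links among data objects, we can extract the information we care about more effectively. In comparisons with related algorithms, the algorithms proposed in all four papers came out clearly ahead.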

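1. CrossMine: propagation along links is controlled, so some of the weaker links are not followed. This greatly improves running time while keeping accuracy largely intact. With few relations the advantage is not obvious, but with many relations it becomes substantial.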

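2. User-Guided Clustering: similar to semi-supervised learning. The user specifies a feature they consider important, and the data is then clustered. Here an entire feature column is taken as the guidance. The user-supplied feature serves only as a soft hint, a point of reference; other factors still have to be considered.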

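3. LinkClus: for example, the papers people publish can be used to find how related different conferences are, since conferences in which the same authors publish are strongly linked. Earlier algorithms were very slow. LinkClus exploits the power-law distribution of links: it looks for the dense links, and because there are relatively few of them, analyzing only those gives a large speedup. At the same time, the vast majority of the information is contained in those dense links, so accuracy remains good.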

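4. Object distinction: how do we tell apart papers written by different people who share a name? This is especially common for Chinese names, which often collide once romanized. Wei Wang (王伟), for instance, refers to as many as 14 different people, so distinguishing them is a real problem. The approach uses the coauthor information in the papers. It first trains on names that are very unlikely to be shared, treating them as clean data, and then works outward from those to classify the rest.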
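
To make the coauthor idea in item 4 concrete, here is a minimal sketch. It is not the method from the talk (there is no training step on clean data here); it only assumes that two papers under the same ambiguous name belong to the same person if they share at least one coauthor, directly or through a chain of papers. The paper IDs and coauthor labels are made up for illustration.

```python
def split_by_coauthors(papers):
    """Group papers written under one ambiguous name.

    papers: list of (paper_id, set_of_coauthor_names).
    Assumption (for illustration only): papers that share a coauthor,
    directly or transitively, were written by the same person.
    """
    groups = []  # each entry: (set of paper ids, set of coauthors seen)
    for paper_id, coauthors in papers:
        ids, names = {paper_id}, set(coauthors)
        untouched = []
        for group_ids, group_names in groups:
            if group_names & names:
                # Shared coauthor: fold this group into the current one.
                ids |= group_ids
                names |= group_names
            else:
                untouched.append((group_ids, group_names))
        untouched.append((ids, names))
        groups = untouched
    return sorted(sorted(group_ids) for group_ids, _ in groups)


# Hypothetical papers, all signed by the same ambiguous name.
papers = [
    ("p1", {"A", "B"}),
    ("p2", {"B", "C"}),
    ("p3", {"X"}),
    ("p4", {"X", "Y"}),
    ("p5", {"Z"}),
]
clusters = split_by_coauthors(papers)
```

On this toy input, p1 and p2 end up together through coauthor B, p3 and p4 through X, and p5 stays alone. A real system would need something softer than a hard "shares a coauthor" rule, which is presumably where the training on clean data comes in.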
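
Going back to item 3, the intuition that shared authors link conferences can also be sketched in a few lines. This is not LinkClus itself and ignores the power-law trick entirely; it simply computes the Jaccard similarity between the author sets of each pair of conferences. The author names and publication records are invented.

```python
from itertools import combinations

# Hypothetical (author, conference) publication records.
records = [
    ("alice", "KDD"), ("alice", "ICDM"),
    ("bob", "KDD"), ("bob", "ICDM"),
    ("carol", "SIGMOD"), ("carol", "VLDB"),
    ("dave", "VLDB"), ("dave", "KDD"),
]


def conference_similarity(records):
    """Jaccard similarity between the author sets of every conference pair."""
    authors = {}
    for author, conference in records:
        authors.setdefault(conference, set()).add(author)
    sims = {}
    for a, b in combinations(sorted(authors), 2):
        shared = authors[a] & authors[b]
        union = authors[a] | authors[b]
        sims[(a, b)] = len(shared) / len(union)
    return sims


sims = conference_similarity(records)
```

Comparing every pair like this is exactly the kind of cost that grows quickly with the number of objects, which is why restricting attention to the relatively few dense links matters.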

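Finally, he described Xiaoxin Yin's most recent research direction: telling true information from false on web pages. It rests on the assumption that a fact has only one true version, while false versions vary endlessly.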
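
Under that assumption, the simplest possible baseline is a vote: false claims scatter across many different values, so the value that the most sources agree on is the best guess. The sketch below does only that. It is my own illustration, not the algorithm from the talk, and the sources and values are made up.

```python
from collections import Counter

# Hypothetical claims: each source reports one value for the same fact.
claims = {
    "site_a": "Springfield",
    "site_b": "Springfield",
    "site_c": "Shelbyville",
    "site_d": "Springfield",
    "site_e": "Ogdenville",
}


def most_supported_value(claims_by_source):
    """Return the most frequently claimed value and its share of sources."""
    counts = Counter(claims_by_source.values())
    value, votes = counts.most_common(1)[0]
    return value, votes / len(claims_by_source)


value, support = most_supported_value(claims)
```

A plain vote treats every site as equally reliable, which is an obvious weakness when some sources copy each other or are consistently wrong.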

Finally, once again, hats off to a master!

Here are the talk abstract and Professor Han's bio:

ABSTRACT
Algorithms like PageRank and HITS have been developed in late 1990s to
explore links among Web pages to discover authoritative pages and hubs.
Links have also been popularly used in citation analysis and social network
analysis. We show that the power of links can be explored thoroughly at
data mining in classification, clustering, information integration, and
other interesting tasks. Some recent results of our research that explore
the crucial information hidden in links will be introduced, including (1)
multi-relational classification, (2) user-guided clustering, (3) link-based
clustering, and (4) object distinction analysis. The power of links in
other analysis tasks will also be discussed in the talk.
------------------------
Short bio:
Jiawei Han, Professor, Department of Computer Science, University of
Illinois at Urbana-Champaign. He has been working on research into data
mining, data warehousing, database systems, data mining from spatiotemporal
data, multimedia data, stream and RFID data, Web data, social network data,
and biological data, with over 300 journal and conference publications. He
has chaired or served on over 100 program committees of international
conferences and workshops, including PC co-chair of 2005 (IEEE)
International Conference on Data Mining (ICDM), Americas Coordinator of
2006 International Conference on Very Large Data Bases (VLDB). He is also
serving as the founding Editor-In-Chief of ACM Transactions on Knowledge
Discovery from Data. He is an ACM Fellow and has received 2004 ACM SIGKDD
Innovations Award and 2005 IEEE Computer Society Technical Achievement
Award. His book "Data Mining: Concepts and Techniques" (2nd ed., Morgan
Kaufmann, 2006) has been popularly used as a textbook worldwide.