Text Similarity: the Shingling and MinHash Algorithms
Contents:
1. Test case:
2. Program flow:
3. Source code example:
4. Results:
1. Test case:
Use shingling and MinHash to estimate the Jaccard similarity of the following two passages:
(1)IELTS (International English Language Testing System) conducted by the British Council, University of Cambridge Local Examinations Syndicate and International Development Program of Australian Universities and College: providing grade 6.5 or higher (i.e. 7, 8, 9) overall has been obtained with a breakdown of 6.0 in reading and writing and 5.5 in listening and speaking.
(2)IELTS / UKVI –IELTS 6.5 overall with 6.0 in reading and writing, 5.5 in listening and speaking for Law, Psychology, Architecture, English, Accounting and Finance
(3) Hash functions used:
h1(r)=(3r +1) mod 7
h2(r)=(5r +1) mod 7
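The test case can be sketched in Python as follows. This is only a minimal illustration under stated assumptions, not necessarily the program described in the rest of this post: the word-level shingles, the shingle length k = 2, the tokenizing regular expression, and the numbering of rows by sorted shingle order are all assumptions made here, and the function names are hypothetical. Only the two passages and the hash functions h1 and h2 come from the test case.

```python
import re

# The two passages from the test case.
TEXT_1 = (
    "IELTS (International English Language Testing System) conducted by the "
    "British Council, University of Cambridge Local Examinations Syndicate and "
    "International Development Program of Australian Universities and College: "
    "providing grade 6.5 or higher (i.e. 7, 8, 9) overall has been obtained with "
    "a breakdown of 6.0 in reading and writing and 5.5 in listening and speaking."
)
TEXT_2 = (
    "IELTS / UKVI –IELTS 6.5 overall with 6.0 in reading and writing, 5.5 in "
    "listening and speaking for Law, Psychology, Architecture, English, "
    "Accounting and Finance"
)

# Hash functions from the test case; r is a row index of the characteristic matrix.
HASH_FUNCS = [
    lambda r: (3 * r + 1) % 7,  # h1
    lambda r: (5 * r + 1) % 7,  # h2
]


def shingles(text, k=2):
    """Return the set of k-word shingles of text (assumed: lowercase, word-level)."""
    # Keep tokens such as "6.5" together; drop all other punctuation.
    words = re.findall(r"\w+(?:\.\w+)*", text.lower())
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard(a, b):
    """Exact Jaccard similarity |a & b| / |a | b| (0.0 if both sets are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def minhash_signature(shingle_set, universe, hash_funcs=HASH_FUNCS):
    """For each hash function, take the minimum h(r) over the rows r in the set."""
    row_of = {s: r for r, s in enumerate(universe)}  # assumed row numbering
    return [
        min((h(row_of[s]) for s in shingle_set), default=float("inf"))
        for h in hash_funcs
    ]


def estimated_jaccard(sig_a, sig_b):
    """Fraction of signature positions on which the two signatures agree."""
    return sum(x == y for x, y in zip(sig_a, sig_b)) / len(sig_a)


set_1, set_2 = shingles(TEXT_1), shingles(TEXT_2)
universe = sorted(set_1 | set_2)
sig_1 = minhash_signature(set_1, universe)
sig_2 = minhash_signature(set_2, universe)

print("exact Jaccard:   ", jaccard(set_1, set_2))
print("signatures:      ", sig_1, sig_2)
print("MinHash estimate:", estimated_jaccard(sig_1, sig_2))
```

With only two hash functions the estimate can take just three values: 0, 0.5, or 1. Both functions also map every row into the range 0 to 6, so once the shingle universe has more than seven rows they are no longer permutations of the rows, and the minima are likely to reach 0 for both passages. The estimate from this sketch should therefore be read as a rough illustration of the mechanism rather than an accurate similarity score.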
2. Program flow:
Figure 2.1 Program flowchart
3. Source code example: