TFIDF based on MapReduce


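This article describes a way to implement the TF-IDF algorithm with the MapReduce framework. The method processes a document collection with three chained jobs and produces the TF-IDF value of every word in every document. Job1 counts how often each word occurs in each document, Job2 turns those counts into term frequencies, and Job3 computes the document frequencies and the final TF-IDF values.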


Job1:

Map:

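input: (document, each line of the document) # TextInputFormat; the key it supplies is the byte offset of the line, so the document name has to be taken from the input split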

output: (word@document, 1)

Reducer:

output: ((word@document), n)

n = sum of the values for each key (word@document), i.e. the number of times the word occurs in the document

the implicit step is:

all pairs with the same key (word@document) are sent to the same reducer (in the shuffle phase)
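Job1 can be simulated in memory to check the key layout. The following is a minimal sketch in plain Python, not Hadoop code: the names `job1_map`, `job1_reduce`, and `run_job1`, the whitespace tokenization with lowercasing, and the dict-shaped corpus are all assumptions made for illustration.

```python
from collections import defaultdict

def job1_map(document, line):
    """Emit (word@document, 1) for every token of one line."""
    # Assumption: tokens are split on whitespace and lowercased.
    for word in line.lower().split():
        yield f"{word}@{document}", 1

def job1_reduce(key, values):
    """Sum the ones collected for a single word@document key."""
    return key, sum(values)

def run_job1(corpus):
    """corpus: hypothetical dict mapping document name -> list of lines."""
    grouped = defaultdict(list)  # stands in for the shuffle phase
    for document, lines in corpus.items():
        for line in lines:
            for key, value in job1_map(document, line):
                grouped[key].append(value)
    return dict(job1_reduce(key, values) for key, values in grouped.items())

corpus = {"d1": ["a b a"], "d2": ["b c"]}
counts = run_job1(corpus)
print(counts)
```

The `grouped` dictionary plays the role of the shuffle: every value with the same `word@document` key ends up in one list before the reducer sees it.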

Job2:

Map:

1. input: ((word@document), n)

2. Re-key each pair by document, so that all word counts of one document reach the same reducer

3. output: (document, word=n)

Reducer:

output: ((word@document), n/N)

N = total number of words in the document = sum of n over all (word=n) values of that document
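The re-keying in Job2 can be sketched the same way. This is an in-memory illustration in plain Python under stated assumptions: the function names are hypothetical, the `word=n` value is represented as a `(word, n)` tuple rather than a string, and the input counts are made-up sample data in the shape Job1 produces.

```python
from collections import defaultdict

def job2_map(key, n):
    """Re-key (word@document, n) by document: emit (document, (word, n))."""
    word, document = key.rsplit("@", 1)
    yield document, (word, n)

def job2_reduce(document, values):
    """Sum n over the document to get N, then emit (word@document, n/N)."""
    values = list(values)  # needed twice: once for N, once for the output
    N = sum(n for _, n in values)
    for word, n in values:
        yield f"{word}@{document}", n / N

def run_job2(counts):
    """counts: dict of word@document -> n, as produced by Job1."""
    grouped = defaultdict(list)  # stands in for the shuffle phase
    for key, n in counts.items():
        for document, value in job2_map(key, n):
            grouped[document].append(value)
    result = {}
    for document, values in grouped.items():
        result.update(job2_reduce(document, values))
    return result

# Sample Job1 output (illustrative data only).
counts = {"a@d1": 2, "b@d1": 1, "b@d2": 1, "c@d2": 1}
tfs = run_job2(counts)
print(tfs)
```

The reducer has to see all values of a document before it can emit anything, because N is only known once every n has been summed. In a real reducer that means buffering the values or iterating twice.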

Job3:

Map:

1. input: ((word@document), n/N)

2. Re-key each pair by word, since we need to count the number of documents in which the word occurs

3. output: (word, document=n/N)

Reducer:

output: ((word@document), d/D, n/N, tfidf)

D = total number of documents in corpus, which can be set in the configuration

d = number of documents in corpus where the word appears

TFIDF = n/N * log(D/d)
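The last job can be simulated in the same style. This is a minimal sketch in plain Python, with hypothetical function names and made-up sample input. It assumes the natural logarithm (the text does not state a base) and passes D as an argument in place of a configuration value. The output is a small dict per key rather than the exact `d/D, n/N, tfidf` record.

```python
import math
from collections import defaultdict

def job3_map(key, tf):
    """Re-key (word@document, n/N) by word: emit (word, (document, n/N))."""
    word, document = key.rsplit("@", 1)
    yield word, (document, tf)

def job3_reduce(word, values, D):
    """Count the documents containing the word (d), then emit TF-IDF per document."""
    values = list(values)
    d = len(values)        # one value per document the word appears in
    idf = math.log(D / d)  # assumption: natural logarithm
    for document, tf in values:
        yield f"{word}@{document}", {"d": d, "D": D, "tf": tf, "tfidf": tf * idf}

def run_job3(tfs, D):
    """tfs: dict of word@document -> n/N; D: total number of documents."""
    grouped = defaultdict(list)  # stands in for the shuffle phase
    for key, tf in tfs.items():
        for word, value in job3_map(key, tf):
            grouped[word].append(value)
    result = {}
    for word, values in grouped.items():
        result.update(job3_reduce(word, values, D))
    return result

# Sample Job2 output (illustrative data only) for a corpus of D = 2 documents.
tfs = {"a@d1": 2 / 3, "b@d1": 1 / 3, "b@d2": 0.5, "c@d2": 0.5}
scores = run_job3(tfs, D=2)
print(scores)
```

A word that occurs in every document has d = D, so log(D/d) = 0 and its TF-IDF is 0 regardless of its term frequency. In the sample data this is the case for `b`.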
