TF-IDF based on MapReduce


The TF-IDF MapReduce Phases by Ricky Ho

 

Job1:

Map:

input: (document, each line of the document) # TextInputFormat

output: (word@document, 1)

Reducer:

output: ((word@document), n)

n = sum of the values for each key (word@document)

The implicit process is:

all records with the same key (word@document) are sent to the same reducer (in the shuffle phase)
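
As a rough illustration of Job1, the map, shuffle, and reduce steps can be simulated in memory with plain Python. This is only a sketch, not Hadoop code: the function names (`job1_map`, `job1_reduce`, `run_job1`), the lowercase-and-split-on-whitespace tokenization, and the sample documents are all assumptions made for the example.

```python
from collections import defaultdict


def job1_map(document, line):
    """Emit (word@document, 1) for every word in one line of a document."""
    # Assumption: words are lowercased and separated by whitespace.
    for word in line.lower().split():
        yield f"{word}@{document}", 1


def job1_reduce(key, values):
    """Sum the 1s collected for a single word@document key."""
    return key, sum(values)


def run_job1(lines):
    """Run Job1 over an iterable of (document, line) pairs."""
    # The dictionary stands in for the shuffle phase: every value
    # emitted under the same key ends up in the same list.
    groups = defaultdict(list)
    for document, line in lines:
        for key, value in job1_map(document, line):
            groups[key].append(value)
    return dict(job1_reduce(key, values) for key, values in groups.items())


# Hypothetical two-document corpus.
counts = run_job1([("d1", "a b a"), ("d2", "b c")])
print(counts)
```

In a real cluster the grouping is done by the framework, so only the map and reduce functions would need to be written.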

 

Job2:

Map:

1. input: ((word@document), n)

2. Re-key each record by document, so that all word counts of one document reach the same reducer

3. output: (document, word=n)

Reducer:

     output: ((word@document), n/N)

     N = total number of words in the document = sum of n over all (word=n) values received for that document
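
Job2 can be sketched the same way. The snippet below is an in-memory simulation under stated assumptions: the function names are hypothetical, the input dictionary is a made-up Job1 result, document names are assumed not to contain `@`, and the reducer emits n and N as a pair `(n, N)` rather than a pre-divided number so the two parts stay visible.

```python
from collections import defaultdict


def job2_map(key, n):
    """Turn (word@document, n) into (document, (word, n))."""
    # Assumption: the document name contains no '@', so the last '@'
    # separates the word from the document.
    word, document = key.rsplit("@", 1)
    yield document, (word, n)


def job2_reduce(document, word_counts):
    """Emit (word@document, (n, N)) for every word of one document."""
    word_counts = list(word_counts)
    total = sum(n for _, n in word_counts)  # N: total words in the document
    for word, n in word_counts:
        yield f"{word}@{document}", (n, total)


def run_job2(job1_output):
    """Run Job2 over a dict mapping word@document to n."""
    groups = defaultdict(list)  # stands in for the shuffle phase
    for key, n in job1_output.items():
        for document, value in job2_map(key, n):
            groups[document].append(value)
    result = {}
    for document, values in groups.items():
        for key, value in job2_reduce(document, values):
            result[key] = value
    return result


# Hypothetical Job1 output for two small documents.
tf = run_job2({"a@d1": 2, "b@d1": 1, "b@d2": 1, "c@d2": 1})
print(tf)
```

Note that the reducer has to see every word of a document before it knows N, which is why the records are buffered in a list before anything is emitted.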

 

Job3:

Map:

1. input: ((word@document), n/N)

2. Re-key each record by word, since we need to count the number of documents in which the word occurs

3. output: (word, document=n/N)

 

Reducer:

     output: ((word@document), d/D, n/N, tfidf)

     D = total number of documents in corpus, which can be set in the configuration

     d = number of documents in corpus where the word appears

             TFIDF = n/N * log(D/d)
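
A minimal sketch of Job3, again simulated in memory rather than on Hadoop. The function names, the sample term frequencies, and passing D as a plain argument (instead of reading it from the job configuration) are assumptions for illustration. The natural logarithm is used here; the source does not say which base is intended.

```python
import math
from collections import defaultdict


def job3_map(key, tf):
    """Turn (word@document, n/N) into (word, (document, n/N))."""
    # Assumption: the document name contains no '@'.
    word, document = key.rsplit("@", 1)
    yield word, (document, tf)


def job3_reduce(word, doc_tfs, total_docs):
    """Emit (word@document, tfidf) for every document containing the word."""
    doc_tfs = list(doc_tfs)
    d = len(doc_tfs)  # each document appears once per word, so this is d
    idf = math.log(total_docs / d)  # log(D/d)
    for document, tf in doc_tfs:
        yield f"{word}@{document}", tf * idf


def run_job3(job2_output, total_docs):
    """Run Job3 over a dict mapping word@document to n/N."""
    groups = defaultdict(list)  # stands in for the shuffle phase
    for key, tf in job2_output.items():
        for word, value in job3_map(key, tf):
            groups[word].append(value)
    result = {}
    for word, values in groups.items():
        for key, score in job3_reduce(word, values, total_docs):
            result[key] = score
    return result


# Hypothetical Job2 output (already divided) for a corpus with D = 2.
scores = run_job3(
    {"a@d1": 2 / 3, "b@d1": 1 / 3, "b@d2": 1 / 2, "c@d2": 1 / 2},
    total_docs=2,
)
print(scores)
```

In this toy corpus the word `b` occurs in both documents, so log(D/d) = log(1) = 0 and its TF-IDF score is zero in each of them, while `a` and `c` keep a positive score.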
