TF-IDF Implementation with C++

TF-IDF weight is widely used in text mining. It measures the importance of a word to a document in a corpus. Recently I was working on music recommendation algorithms, and I found that many papers use TF-IDF to measure the lyric similarity between songs. I searched but did not find a suitable TF-IDF library, so I decided to write one myself.

Basics

TF-IDF weight is calculated from 2 components, Term Frequency (TF) and Inverse Document Frequency (IDF). The definition of the TF-IDF weight of a term $j$ in document $i$ is shown below.

$$w_{ij} = tf_{ij} \times \log\frac{N}{n_j}$$

where $tf_{ij}$ is the frequency of term $j$ in document $i$, $N$ is the total number of documents, and $n_j$ is the number of documents containing term $j$.
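As a quick worked example with made-up numbers: suppose a term appears 3 times in a 100-word document, and appears in 5 out of $N = 25$ documents. Then $tf = 3/100 = 0.03$, $idf = \log(25/5) = \log 5 \approx 1.609$ (natural log, matching the code below), and the weight is $0.03 \times 1.609 \approx 0.048$.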

For more details, please refer to TF-IDF Tutorial.

Code

Assume we have 25 text files, each of which is one document. The code here computes the TF-IDF weights for these 25 documents.

Firstly, we need to split the contents of each input file into words. Here I use boost::tokenizer, which splits a std::string on every character that is not a letter or a digit.

#include <string>
#include <vector>
#include <boost/tokenizer.hpp>

// Split a string into tokens on every non-alphanumeric character.
std::vector<std::string> textParse(std::string & bigString)
{
    std::vector<std::string> vec;
    boost::tokenizer<> tok(bigString);
    for (boost::tokenizer<>::iterator beg = tok.begin(); beg != tok.end(); ++beg)
    {
        vec.push_back(*beg);
    }
    return vec;
}

The return value is a std::vector<std::string> of tokens; the token vectors of all documents are collected in a std::vector<std::vector<std::string>> named rawDataSet.
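For completeness, here is a minimal sketch of how rawDataSet might be filled. The file naming scheme (doc1.txt ... doc25.txt) and the loadDataSet() helper are assumptions for illustration, not part of the original code.

#include <fstream>
#include <sstream>

std::vector<std::vector<std::string>> rawDataSet;

// Hypothetical loader: reads doc1.txt ... doc25.txt and tokenizes each file.
void loadDataSet()
{
    for (int i = 1; i <= 25; ++i)
    {
        std::ostringstream name;
        name << "doc" << i << ".txt";
        std::ifstream in(name.str().c_str());
        std::stringstream buffer;
        buffer << in.rdbuf(); // slurp the whole file into one string
        std::string contents = buffer.str();
        rawDataSet.push_back(textParse(contents));
    }
}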

Next, we need a word list containing every unique word across all documents. In this example, the unique words are first gathered in a std::set<std::string> vocabListSet and then copied into a std::vector<std::string> vocabList.

std::vector<std::string> vocabList;

void createVocabList()
{
    std::set<std::string> vocabListSet;
    for (const std::vector<std::string> & document : rawDataSet)
    {
        for (const std::string & word : document)
            vocabListSet.insert(word); // std::set keeps the words unique and sorted
    }
    std::copy(vocabListSet.begin(), vocabListSet.end(), std::back_inserter(vocabList));
}

Using the word list, we can convert each document to a numeric vector with the Bag-of-words model. The function below converts the std::vector of words of one document to a bag-of-words vector.

Eigen::VectorXf bagOfWords2VecMN(std::vector<std::string> & inputSet)
{
    std::vector<float> returnVec(vocabList.size(), 0);
    for (const std::string & word : inputSet)
    {
        size_t idx = std::find(vocabList.begin(), vocabList.end(), word) - vocabList.begin();
        if (idx == vocabList.size())
            std::cout << "word: " << word << " not found" << std::endl;
        else
            returnVec.at(idx) += 1;
    }
    Eigen::Map<Eigen::VectorXf> v(returnVec.data(), returnVec.size());
    return v;
}
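One caveat: std::find scans vocabList linearly, so each lookup is O(V) in the vocabulary size. Since vocabList is sorted (it comes from a std::set), std::lower_bound would give O(log V); a hash map gives amortized O(1). Here is a sketch of the latter, where wordIndex and buildWordIndex() are helpers I am introducing for illustration:

#include <unordered_map>

std::unordered_map<std::string, size_t> wordIndex; // assumed helper, built once

void buildWordIndex()
{
    for (size_t i = 0; i < vocabList.size(); ++i)
        wordIndex[vocabList[i]] = i; // map each word to its column index
}

// Inside bagOfWords2VecMN, the lookup would then become:
// std::unordered_map<std::string, size_t>::iterator it = wordIndex.find(word);
// if (it != wordIndex.end())
//     returnVec.at(it->second) += 1;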

The return type of bagOfWords2VecMN() is Eigen::VectorXf. After converting all documents in rawDataSet, the vectors of all documents are stacked into an Eigen::MatrixXf dataMat.

void vec2mat()
{
    std::vector<Eigen::VectorXf> vec;
    for (std::vector<std::string> & document : rawDataSet)
    {
        vec.push_back(bagOfWords2VecMN(document));
    }
    ncol = vec[0].size(); // vocabulary size
    nrow = vec.size();    // number of documents
    dataMat.resize(nrow, ncol);
    for (int i = 0; i < nrow; ++i)
    {
        dataMat.row(i) = vec[i].transpose(); // each document becomes one row
    }
    rawDataSet.clear(); // the raw tokens are no longer needed
}

After the functions above, we have a matrix containing the number of occurrences of each term in each document. To get the nj in the weight equation, one possible solution is:
* Create another matrix using the set-of-words model, which marks 1 when a document contains the term, no matter how many times the term occurs.
* Sum all the rows of that matrix; the result is a vector whose j-th entry is the number of documents containing term j.

The code here converts each bag-of-words row to a set-of-words row and sums them all together.

Eigen::MatrixXf dataMat2(dataMat); // make a copy of dataMat
Eigen::VectorXf termCount = Eigen::VectorXf::Zero(ncol); // must start from zero before summing
for (int i = 0; i != nrow; ++i)
{
    for (int j = 0; j != ncol; ++j)
    { // for each element in dataMat2
        if (dataMat2(i,j) > 1) // only keep 1 and 0
            dataMat2(i,j) = 1;
    }
    termCount += dataMat2.row(i).transpose(); // no. of docs each term appears in
}
dataMat2.resize(0,0); // release the copy
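Alternatively, the per-element loop and the extra copy can be avoided with Eigen's array operations. A minimal sketch, equivalent under the same assumptions:

// Count, for each column j, how many documents have a nonzero entry for term j.
Eigen::VectorXf termCount(ncol);
for (int j = 0; j < ncol; ++j)
    termCount(j) = static_cast<float>((dataMat.col(j).array() > 0.0f).count());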

Finally, the weight of each term j in document i is

double tf = dataMat(i,j) / (dataMat.row(i).sum());
double idf = log((double)nrow / (termCount(j))); // nrow is number of rows of dataMat
double weight = tf * idf;
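Putting the pieces together, here is a minimal sketch that fills a full weight matrix; weightMat is a name I am introducing here, and dataMat, termCount, nrow, ncol are assumed to be set up as above.

#include <cmath>

Eigen::MatrixXf weightMat(nrow, ncol);
for (int i = 0; i < nrow; ++i)
{
    float rowSum = dataMat.row(i).sum(); // total term count of document i
    for (int j = 0; j < ncol; ++j)
    {
        float tf = dataMat(i, j) / rowSum;                // term frequency
        float idf = std::log((float)nrow / termCount(j)); // inverse document frequency
        weightMat(i, j) = tf * idf;
    }
}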

See the complete code here
