Mahout: Clustering - Representing data

This article introduces how data is vectorized in Apache Mahout, covering the implementations of dense vectors, random-access sparse vectors, and sequential-access sparse vectors. For text data, it explains the vector space model (VSM) and discusses weighting schemes such as term frequency (TF) and term frequency–inverse document frequency (TF-IDF).

Transforming data into vectors

In Mahout, vectors are implemented as three different classes:

  • DenseVector can be thought of as an array of doubles, whose size is the number of features in the data. Because all the entries in the array are preallocated regardless of whether the value is 0 or not, we call it dense.
  • RandomAccessSparseVector is implemented as a HashMap from an integer to a double, where only nonzero-valued features are allocated. Hence it's called a sparse vector.
  • SequentialAccessSparseVector is implemented as two parallel arrays, one of integers and the other of doubles. Only nonzero-valued entries are kept. Unlike RandomAccessSparseVector, which is optimized for random access, this one is optimized for linear reading.
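The three layouts can be sketched conceptually as follows. This is a minimal Python illustration of the storage schemes, not Mahout's Java implementation; the `get` helper is hypothetical and only shows why random lookup in the sequential layout is slower than in the hash-map layout.

```python
import bisect

# Dense: every position is preallocated, zeros included.
dense = [0.0, 3.0, 0.0, 0.0, 1.5]          # size == number of features

# Random-access sparse: a map from index to nonzero value,
# giving O(1) reads/writes at arbitrary positions.
random_access = {1: 3.0, 4: 1.5}

# Sequential-access sparse: two parallel arrays of indices and values,
# fast for in-order iteration (dot products, distance computations).
seq_indices = [1, 4]
seq_values = [3.0, 1.5]

def get(indices, values, i):
    # Random lookup needs a search over the index array, which is why
    # this layout is preferred only for linear reading.
    pos = bisect.bisect_left(indices, i)
    if pos < len(indices) and indices[pos] == i:
        return values[pos]
    return 0.0

print(get(seq_indices, seq_values, 4))   # 1.5
print(get(seq_indices, seq_values, 2))   # 0.0
```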



One possible problem with our chosen mapping to dimension values is that the values in dimension 1 are much larger than the others. If we applied a simple distance-based metric to determine similarity between these vectors, color differences would dominate the results: a relatively small color difference of 10 nm is treated as equal to a huge size difference of 10. Weighting the different dimensions solves this problem.
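A small sketch of this effect, using hypothetical two-dimensional feature vectors (color in nm, then size; the names and numbers are illustrative, not from the original example):

```python
import math

apple_a = [630.0, 10.0]   # [color in nm, size]
apple_b = [640.0, 1.0]    # slightly different color, very different size

def euclidean(x, y, weights=None):
    """Weighted Euclidean distance; unit weights by default."""
    w = weights or [1.0] * len(x)
    return math.sqrt(sum(wi * (xi - yi) ** 2 for wi, xi, yi in zip(w, x, y)))

# Unweighted: the 10 nm color gap contributes as much as the
# 9-unit size gap, so color differences can dominate.
print(euclidean(apple_a, apple_b))                       # ≈ 13.45

# Down-weighting the color dimension restores the balance.
print(euclidean(apple_a, apple_b, weights=[0.01, 1.0]))  # ≈ 9.06
```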

 

 

Representing text documents as vectors

The vector space model (VSM) is the common way of vectorizing text documents. First, imagine the set of all words that could be encountered in a series of documents being vectorized. This set might be all words that appear at least once in any of the documents. Imagine each word being assigned a number, which is the dimension it’ll occupy in document vectors.

  • Term frequency (TF)     The value of the vector dimension for a word is usually the number of occurrences of the word in the document. This is known as term-frequency (TF) weighting.
  • Term frequency–inverse document frequency (TF-IDF)  TF-IDF weighting is a widely used improvement on simple term-frequency weighting. The IDF part is the improvement: instead of simply using term frequency as the value in the vector, that value is multiplied by the inverse of the term's document frequency. That is, the value is reduced more for words used frequently across all the documents in the dataset than for infrequently used words.
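The two schemes can be contrasted with a toy corpus. This is a minimal sketch using a common log-scaled IDF form; the documents and the exact IDF formula are assumptions for illustration, not Mahout's internals.

```python
import math

docs = [
    "coca cola is a drink",
    "pepsi cola is a drink",
    "mahout clusters vectors",
]
tokenized = [d.split() for d in docs]
n_docs = len(tokenized)

# Document frequency: in how many documents each word appears.
df = {}
for tokens in tokenized:
    for w in set(tokens):
        df[w] = df.get(w, 0) + 1

def tfidf(word, tokens):
    tf = tokens.count(word)              # plain term frequency
    return tf * math.log(n_docs / df[word])  # scaled down by document frequency

# "cola" appears in 2 of 3 documents, "coca" in only 1,
# so "coca" ends up with the larger weight.
print(tfidf("cola", tokenized[0]))   # ≈ 0.405
print(tfidf("coca", tokenized[0]))   # ≈ 1.099
```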



 
The basic assumption of the vector space model (VSM) is that the words are dimensions and are therefore orthogonal to each other. In other words, VSM assumes that the occurrences of words are independent of each other, in the same sense that a point's x coordinate is entirely independent of its y coordinate in two dimensions. Intuitively, you know this assumption is wrong in many cases. For example, the word Cola has a higher probability of occurring alongside the word Coca, so these words aren't completely independent. Other models try to account for word dependencies. One well-known technique is latent semantic indexing (LSI), which detects dimensions that seem to go together and merges them into a single one.

 

In Mahout, text documents are converted to vectors with TF-IDF weighting and n-gram collocation detection using the DictionaryVectorizer class.

 

Generating vectors from documents

 

 

mvn -e -q exec:java \
  -Dexec.mainClass="org.apache.lucene.benchmark.utils.ExtractReuters" \
  -Dexec.args="reuters/ reuters-extracted/"

 

 

mahout seqdirectory -c UTF-8 \
  -i examples/reuters-extracted/ -o reuters-seqfiles

 

 

mahout seq2sparse -i reuters-seqfiles/ -o reuters-vectors -ow

 

  • In the first step, the text documents are tokenized: they're split into individual words using the Lucene StandardAnalyzer and stored in the tokenized-documents/ folder.
  • The word-counting step (the n-gram generation step, which in this case only counts unigrams) iterates through the tokenized documents and generates a set of important words from the collection.
  • The third step converts the tokenized documents into vectors using the term-frequency weight, creating TF vectors.
  • Because the vectorizer uses TF-IDF weighting by default, two more steps happen after this: the document-frequency (DF) counting job and the TF-IDF vector creation.
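The stages above can be mirrored in a toy, single-machine sketch. This is not how seq2sparse is implemented (the real job runs each stage as a MapReduce pass over SequenceFiles), and the sample documents and log-scaled IDF form are illustrative assumptions.

```python
import math
from collections import Counter

docs = ["oil prices rise", "oil exports fall", "grain prices rise"]

# Stage 1: tokenization (Mahout uses Lucene's StandardAnalyzer; we just split).
tokenized = [d.lower().split() for d in docs]

# Stage 2: build the dictionary of important terms (unigrams here),
# assigning each term its vector dimension.
dictionary = sorted({w for tokens in tokenized for w in tokens})
index = {w: i for i, w in enumerate(dictionary)}

# Stage 3: TF vectors, one per document.
tf_vectors = [Counter(tokens) for tokens in tokenized]

# Stage 4: document-frequency counts over the whole collection.
df = Counter(w for tokens in tokenized for w in set(tokens))

# Stage 5: TF-IDF vectors, stored sparsely as {dimension: weight}.
n = len(docs)
tfidf_vectors = [
    {index[w]: tf * math.log(n / df[w]) for w, tf in tf_vec.items()}
    for tf_vec in tf_vectors
]
print(tfidf_vectors[0])
```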

 

 
