HOW TO SET UP APACHE MAHOUT

This post walks through large-scale text analysis with Apache Mahout: extracting SequenceFiles from an email data set, converting them to sparse vectors, and applying the LDA algorithm for topic analysis.

Apache Mahout is a set of machine learning tools covering classification, clustering, recommendations, and other related tasks. We just bought a new book called Mahout in Action, which is full of good examples and general machine learning advice; you can find it here. Mahout is pretty neat and growing quickly, so I decided to take the time to learn about it.

Mahout functions as a set of MapReduce jobs. It integrates cleanly with Hadoop, and this makes it very attractive for doing text analysis on a large scale. Simpler queries, for instance getting the average response time for a customer, are probably better suited to Hive.

Most examples I’ve seen use Mahout as sort of a black box. The command line just forwards arguments to various Driver classes, which then work their magic. All input and output seems to be through HDFS, and Mahout also uses intermediate temp directories inside HDFS. I tried changing one of the Driver classes to work with HBase data, but the amount of work that seemed to be necessary was non-trivial.
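
To make that concrete: the lda step I run later boils down to a driver class parsing the same arguments and submitting the MapReduce jobs itself, so you can call it from your own Java code with the same arguments you'd pass on the command line. This is only a rough sketch, assuming Mahout 0.5's org.apache.mahout.clustering.lda.LDADriver and that it exposes the usual static main; driver class names move around between Mahout versions, so check driver.classes.props in your distribution.

    import org.apache.mahout.clustering.lda.LDADriver;

    public class RunLdaLikeTheCli {
      public static void main(String[] args) throws Exception {
        // Equivalent to "./bin/mahout lda -i ... -o ... -x 20 -k 10":
        // the driver parses the arguments and submits the MapReduce jobs itself.
        // All of the paths below are HDFS paths.
        LDADriver.main(new String[] {
            "-i", "/data/enron_vec_tf/tf-vectors",
            "-o", "/data/enron_lda",
            "-x", "20",
            "-k", "10"
        });
      }
    }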

EXAMPLE

I decided to work with the Enron email data set because it's reasonably large and it tells a story of fraud and corruption. Their use of keywords like ‘Raptor’ and ‘Death Star’ in place of more descriptive phrases makes topic analysis pretty interesting.

Please read ‘Important things to watch out for’ at the bottom of this post first if you want to follow along.

This is what I did to get the Enron mail set analyzed using the LDA algorithm (Latent Dirichlet Allocation), which looks for common topics in a corpus of text data:

  • The Enron emails are stored in the maildir format, a directory tree of text emails. In order to process the text, it first needs to be converted to SequenceFiles. A SequenceFile is a file format used extensively by Hadoop, and it contains a series of key/value pairs. One way to convert a directory of text to SequenceFiles is to use Mahout’s seqdirectory command:
    ./bin/mahout seqdirectory -i file:///home/georges/enron_mail_20110402 -o /data/enron_seq

    This can take a little while for large amounts of text, maybe 15 minutes. The SequenceFiles produced hold key/value pairs where the key is the path of the file and the value is the text from that file. (There's a short sketch of how to peek inside these files after this list.)

  • Later on I wrote my own Java code to parse out the mail headers so they wouldn't interfere with the results. It is fairly simple to write a MapReduce job that quickly produces your own SequenceFiles; a rough sketch of one appears after this list. Also note that there are many other possible sources of text data, for instance Lucene indexes. There’s a list of ways to input text data here.
  • I needed to tokenize the text in the SequenceFiles and turn it into vectors. Vectors in text analysis are a technical idea that I won’t get into, but these particular vectors are just simple term frequencies.
    ./bin/mahout seq2sparse -i /data/enron_seq -o /data/enron_vec_tf --norm -wt tf -seq

    This command may need changing depending on which text analysis algorithm you’re using. Most algorithms would require tf-idf instead (pass -wt tfidf rather than -wt tf), which weighs a term's frequency in an email against how common the term is across the whole corpus. This took 5 minutes on a 10-node AWS Hadoop cluster. (I set the cluster up using StarCluster, another neat tool for managing EC2 instances.)

  • I ran the LDA algorithm:
    ./bin/mahout lda -i /data/enron_vec_tf/tf-vectors -o /data/enron_lda -x 20 -k 10

    -x is the maximum number of iterations for the algorithm, and -k is the number of topics to extract from the corpus. This took a little under 2 hours on my cluster.

  • List the LDA topics:
    ./bin/mahout ldatopics -i /data/enron_lda/state- --dict /data/enron_vec_tf/dictionary.file-0 -w 5 --dictionaryType sequencefile

    This command is a bit of a pain because it doesn’t really error out when you pass an incorrect parameter; it just does nothing. Here’s some of the output I got:

    MAHOUT_LOCAL is not set; adding HADOOP_CONF_DIR to classpath.
    Running on hadoop, using HADOOP_HOME=/usr/lib/hadoop-0.20
    HADOOP_CONF_DIR=/usr/lib/hadoop-0.20/conf
    MAHOUT-JOB: /data/mahout-distribution-0.5/examples/target/mahout-examples-0.6-SNAPSHOT-job.jar
    Topic 0
    ===========
    i [p(i|topic_0) = 0.023824791149925677
    information [p(information|topic_0) = 0.004141992353710214
    i'm [p(i'm|topic_0) = 0.0012614859683494856
    i'll [p(i'll|topic_0) = 7.433430267661564E-4
    i've [p(i've|topic_0) = 4.22765928967555E-4
    Topic 1
    ===========
    you [p(you|topic_1) = 0.013807669181244436
    you're [p(you're|topic_1) = 3.431068629183266E-4
    you'll [p(you'll|topic_1) = 1.0412948245383297E-4
    you'd [p(you'd|topic_1) = 8.39664771688153E-5
    you'all [p(you'all|topic_1) = 1.5437174634592594E-6
    Topic 2
    ===========
    you [p(you|topic_2) = 0.03938587430317399
    we [p(we|topic_2) = 0.010675333661142919
    your [p(your|topic_2) = 0.0038312042763726448
    meeting [p(meeting|topic_2) = 0.002407369369715602
    message [p(message|topic_2) = 0.0018055376982080878
    Topic 3
    ===========
    you [p(you|topic_3) = 0.036593494258252174
    your [p(your|topic_3) = 0.003970284840960353
    i'm [p(i'm|topic_3) = 0.0013595988902916712
    i'll [p(i'll|topic_3) = 5.879175074800994E-4
    i've [p(i've|topic_3) = 3.9887853536102604E-4
    Topic 4
    ===========
    i [p(i|topic_4) = 0.027838628233581693
    john [p(john|topic_4) = 0.002320786569676983
    jones [p(jones|topic_4) = 6.79365597839018E-4
    jpg [p(jpg|topic_4) = 1.5296038761774956E-4
    johnson [p(johnson|topic_4) = 9.771211326361852E-5
  • Looks like the data needs a lot of munging to provide more useful results. Still, you can see the relationship between some of the words in each topic.
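
As mentioned in the first step above, here's a rough sketch of how to peek inside the SequenceFiles that seqdirectory writes, using the plain Hadoop SequenceFile API, just to confirm the key/value layout (file path as the key, file text as the value). The chunk-0 file name is a guess; list /data/enron_seq first to see what's actually there. The tf-vectors output from seq2sparse can be inspected the same way, with org.apache.mahout.math.VectorWritable as the value class.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.SequenceFile;
    import org.apache.hadoop.io.Text;

    public class PeekAtSequenceFile {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        // One of the files written by seqdirectory (the name is a guess; check the directory).
        Path path = new Path("/data/enron_seq/chunk-0");
        SequenceFile.Reader reader = new SequenceFile.Reader(fs, path, conf);
        Text key = new Text();    // path of the original mail file
        Text value = new Text();  // full text of that file
        int shown = 0;
        while (shown++ < 5 && reader.next(key, value)) {
          String text = value.toString();
          System.out.println(key + " => " + text.substring(0, Math.min(80, text.length())));
        }
        reader.close();
      }
    }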
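
And here's the kind of map-only job I was referring to for stripping the mail headers: it reads the seqdirectory output, throws away everything up to the first blank line of each message (the headers), and writes the path/body pairs back out as a SequenceFile. It's a sketch under those assumptions rather than the exact code I used, and the output path is made up for the example.

    import java.io.IOException;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat;

    public class StripMailHeaders {

      // Input keys are file paths and values are raw message text, as produced by seqdirectory.
      public static class StripMapper extends Mapper<Text, Text, Text, Text> {
        @Override
        protected void map(Text path, Text message, Context ctx)
            throws IOException, InterruptedException {
          String raw = message.toString();
          // Mail headers end at the first blank line; keep only the body.
          int split = raw.indexOf("\n\n");
          String body = split >= 0 ? raw.substring(split + 2) : raw;
          ctx.write(path, new Text(body));
        }
      }

      public static void main(String[] args) throws Exception {
        Job job = new Job(new Configuration(), "strip mail headers");
        job.setJarByClass(StripMailHeaders.class);
        job.setMapperClass(StripMapper.class);
        job.setNumReduceTasks(0);  // map-only: one output record per email
        job.setInputFormatClass(SequenceFileInputFormat.class);
        job.setOutputFormatClass(SequenceFileOutputFormat.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);
        SequenceFileInputFormat.addInputPath(job, new Path("/data/enron_seq"));
        SequenceFileOutputFormat.setOutputPath(job, new Path("/data/enron_seq_bodies"));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
      }
    }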

I recommend playing around with the examples in the examples/bin directory in the Mahout folder.

IMPORTANT THINGS TO WATCH OUT FOR
  • I ran out of heap space once I asked Mahout to do some real work. I needed to increase the heap size for child MapReduce processes. How to do this is basically described here. You only need the -Xmx option, and I went for 2 gigabytes:
    <property>
      <name>mapred.child.java.opts</name>
      <value>-Xmx2048M</value>
    </property>

    You may also want to set MAHOUT_HEAPSIZE to 2048, but I’m not sure how much this matters.

  • Some environment variables weren’t set on my StarCluster instance by default, and the warnings are subtle. HADOOP_HOME is particularly important: if it isn’t set, MapReduce jobs run as local jobs, you get weird exceptions accessing HDFS, and your jobs won’t show up in the job tracker. The console output for the job does warn you, but it’s easy to miss. JAVA_HOME is also important, but that one will explicitly error and tell you to set it. HADOOP_CONF_DIR should be set to $HADOOP_HOME/conf; for some reason it assumes you want $HADOOP_HOME/src/conf if you don’t specify. Also set MAHOUT_HOME to your Mahout directory, so Mahout can add its jar files to the CLASSPATH correctly.
  • I ended up compiling Mahout from source. The stable version of Mahout had errors I couldn’t really explain. File system mismatches or vector mismatches or something like that. I’m not 100% sure that it’s necessary, but it probably won’t hurt. Compilation is pretty simple, ‘mvn clean install’, but you will probably want to add ‘-DskipTests’ because the tests take a long time.