Classification(3)Generate Features and Stem Adjust the Model System

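This article covers practical Scala techniques such as string operations, Map merging, and sliding windows, and describes in detail how to generate features for a classification task, including text preprocessing, tokenization, stop word removal, and stemming.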

1. Scala Operation
String Method - contains
scala> val longContent = "carl love to study python, scala"
longContent: String = carl love to study python, scala

scala> longContent.contains("python")
res0: Boolean = true

Map Merge Function
Run this from the project directory, which already has the jar dependencies.
> sbt console
scala> import scalaz.Scalaz._
import scalaz.Scalaz._

scala> val m1 = Map("0"->0, "1" ->1)
m1: scala.collection.immutable.Map[String,Int] = Map(0 -> 0, 1 -> 1)

scala> val m2 = Map("2"->2)
m2: scala.collection.immutable.Map[String,Int] = Map(2 -> 2)

scala> val m3 = m1 |+| m2
m3: scala.collection.immutable.Map[String,Int] = Map(0 -> 0, 1 -> 1, 2 -> 2)
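Note that |+| combines values through the semigroup of the value type, so with Int values a key present in both maps has its values added together, while ++ would overwrite the left value. A minimal plain Scala sketch of that behaviour, without scalaz; mergeSum is a hypothetical helper name used only for illustration:

```scala
object MapMergeSketch {
  // Merge two maps; when a key is present in both, add the values together.
  def mergeSum(a: Map[String, Int], b: Map[String, Int]): Map[String, Int] =
    b.foldLeft(a) { case (acc, (key, value)) =>
      acc.updated(key, acc.getOrElse(key, 0) + value)
    }
}
```

This is handy when merging per-partition counts, for example term counts collected from different documents.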

Map Operation
scala> m3
res1: scala.collection.immutable.Map[String,Int] = Map(0 -> 0, 1 -> 1, 2 -> 2)

scala> m3 - "0"
res2: scala.collection.immutable.Map[String,Int] = Map(1 -> 1, 2 -> 2)

Magic scalaz
https://github.com/scalaz/scalaz

Sliding
scala> (1 to 5).iterator.sliding(3).toList
res3: List[Seq[Int]] = List(List(1, 2, 3), List(2, 3, 4), List(3, 4, 5))

List Operation
scala> List(1,2,3).zip(List("one","two","three"))
res8: List[(Int, String)] = List((1,one), (2,two), (3,three))

Run with Assembly Jar
./spark-submit --num-executors 2 --driver-memory 2G --class com.sillycat.jobs.GenerateFeatureMap ${path_to_jar}

Nice Configuration in build.sbt
// There's a problem with jackson 2.5+ with Spark 1.4.1
dependencyOverrides ++= Set(
"com.fasterxml.jackson.core" % "jackson-databind" % "2.4.4"
)

When we build the assembly jar, Spark Core and the related Spark libraries can be marked as provided:
"org.apache.spark" %% "spark-core" % "1.4.1" % "provided", // Apache v2
"org.apache.spark" %% "spark-mllib" % "1.4.1" % "provided", // Apache v2

2. Detail Operations
GenerateFeatureMap
step1. Load the job info from S3 (only title and description) and cache() it
step2. Place the title and description in an object, using a regex to find the title and description again
step3. Normalize the String
For title: toLower -> filter all HTML -> stripChars, only keep [a-zA-Z\d\-]
For description: toLower -> filter URL -> filter HTML -> stripChar -> stripNumber
step4. Tokenize the String
We predefined a list of 2-word and 3-word phrases and stored it in a text file.
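The two normalization chains in step3 could look roughly like the sketch below. The object name, the function names, and the regular expressions are assumptions made for illustration, not the original job code. It also assumes that spaces are kept as word separators, which the character class above does not state.

```scala
object NormalizeSketch {
  // Assumed patterns; the real job may use stricter or different ones.
  private val UrlPattern  = """https?://\S+""".r
  private val HtmlPattern = """<[^>]*>""".r

  // toLower -> filter HTML -> keep only letters, digits, hyphen (and spaces)
  def normalizeTitle(raw: String): String = {
    val lower    = raw.toLowerCase
    val noHtml   = HtmlPattern.replaceAllIn(lower, " ")
    val stripped = noHtml.replaceAll("""[^a-z\d\- ]""", " ")
    stripped.trim.replaceAll("""\s+""", " ")
  }

  // toLower -> filter URL -> filter HTML -> strip chars -> strip numbers
  def normalizeDescription(raw: String): String = {
    val lower     = raw.toLowerCase
    val noUrl     = UrlPattern.replaceAllIn(lower, " ")
    val noHtml    = HtmlPattern.replaceAllIn(noUrl, " ")
    val stripped  = noHtml.replaceAll("""[^a-z\d\- ]""", " ")
    val noNumbers = stripped.replaceAll("""\d+""", " ")
    noNumbers.trim.replaceAll("""\s+""", " ")
  }
}
```

URLs are removed before HTML tags so that a link inside an attribute does not survive as loose text.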
For Title:
Find the phrases in the string which are contained in the pre-defined list.
Convert the string to a list of words and phrases
eg: big data software engineer -> big, data, software, engineer, big data, software engineer
(big data and software engineer are pre-defined in the list)
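A minimal sketch of the title tokenization, reusing sliding from the first section to build the 2-word and 3-word candidates. The object name, the tokenize signature, and the small in-memory phrase set are assumptions; in the job the phrase list comes from the text file.

```scala
object TitleTokenizerSketch {
  // Hypothetical stand-in for the phrase list loaded from the text file.
  val phrases: Set[String] = Set("big data", "software engineer")

  def tokenize(normalized: String, knownPhrases: Set[String]): List[String] = {
    val words = normalized.split(" ").filter(_.nonEmpty).toList
    // Candidate 2-word and 3-word phrases, built with sliding windows.
    // Windows shorter than n (very short titles) are dropped.
    val candidates = List(2, 3).flatMap { n =>
      words.sliding(n).filter(_.size == n).map(_.mkString(" ")).toList
    }
    // Keep every single word, then append the phrases found in the list.
    words ++ candidates.filter(knownPhrases.contains)
  }
}
```

With the phrase set above, "big data software engineer" yields big, data, software, engineer, big data, software engineer, which matches the example.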

For description:
Find the phrases in the string which are contained in the pre-defined list.
Remove the stop words, using a pre-defined stop word list.
Apply the Porter Stemming Algorithm (https://github.com/dlwh/epic, PorterStemmer.scala)
Convert the string to a list of words and phrases
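The description pipeline can be sketched in the same way. The stemmer is passed in as a function so that the Porter implementation from epic can be plugged in. The toyStem function below is only a placeholder and is NOT the Porter algorithm. The names, the signature, and the choice to detect phrases before stop word removal and to leave phrases unstemmed are all assumptions.

```scala
object DescriptionTokenizerSketch {
  def tokenize(
      normalized: String,
      knownPhrases: Set[String],
      stopWords: Set[String],
      stem: String => String): List[String] = {
    val words = normalized.split(" ").filter(_.nonEmpty).toList
    // Look for pre-defined 2-word and 3-word phrases in the raw word order.
    val foundPhrases = List(2, 3).flatMap { n =>
      words.sliding(n).filter(_.size == n).map(_.mkString(" ")).toList
    }.filter(knownPhrases.contains)
    // Drop stop words, then stem the remaining single words.
    val stemmedWords = words.filterNot(stopWords.contains).map(stem)
    stemmedWords ++ foundPhrases
  }

  // Toy placeholder, NOT the Porter algorithm: strips one trailing "s".
  def toyStem(word: String): String =
    if (word.length > 3 && word.endsWith("s")) word.dropRight(1) else word
}
```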
step5. Calculate IDF
TF-IDF http://sillycat.iteye.com/blog/2231432
The document frequency DF(t, D) is the number of documents that contain term t.
|D| is the total number of documents in the corpus.
IDF(t, D) = log((|D| + 1) / (DF(t, D) + 1))
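The formula can be checked on a small in-memory corpus. This sketch uses local collections rather than RDDs, and the object and function names are hypothetical:

```scala
object IdfSketch {
  // docs: one token list per document.
  def idf(docs: Seq[Seq[String]]): Map[String, Double] = {
    val totalDocs = docs.size
    // DF(t, D): count each term at most once per document.
    val df: Map[String, Int] =
      docs.flatMap(_.distinct).groupBy(identity).map { case (term, hits) => (term, hits.size) }
    // IDF(t, D) = log((|D| + 1) / (DF(t, D) + 1))
    df.map { case (term, count) => (term, math.log((totalDocs + 1.0) / (count + 1.0))) }
  }
}
```

A term that appears in every document gets an IDF of log(1) = 0, so it contributes nothing as a feature weight.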
step6. Save File on S3
key, index, IDF
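One way to produce the key, index, IDF records is zipWithIndex, a close relative of the zip shown earlier. The comma separator, the alphabetical ordering, and the names below are assumptions, not the format the job actually writes:

```scala
object FeatureMapSketch {
  // Assign each term a stable index (alphabetical order) and format one line per term.
  def toLines(idf: Map[String, Double]): List[String] =
    idf.toList.sortBy(_._1).zipWithIndex.map { case ((term, weight), index) =>
      s"$term,$index,$weight"
    }
}
```

Sorting before indexing keeps the index of a term reproducible between runs over the same vocabulary.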

3. Classifier Model Training
step1. Load the featureMap which was pre-calculated in the previous operation
step2. Binary Feature Extractor
step3. Load List of Jobs
step4. Train Minor
step5. Train Arbitrator
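
For step2, a binary feature extractor can be read as marking a feature 1.0 when the term occurs in the document and 0.0 otherwise. That reading is an assumption, as are the names below and the requirement that the feature indices run from 0 to size - 1:

```scala
object BinaryFeatureSketch {
  // featureIndex: term -> position, assumed to cover exactly 0 until featureIndex.size.
  def extract(tokens: Seq[String], featureIndex: Map[String, Int]): Array[Double] = {
    val vector = Array.fill(featureIndex.size)(0.0)
    // Set the slot to 1.0 for every known term; unknown terms are ignored.
    tokens.foreach { token =>
      featureIndex.get(token).foreach(i => vector(i) = 1.0)
    }
    vector
  }
}
```

Repeated terms do not increase the value, which is what separates a binary feature from a term frequency count.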

4. Classification System
MajorGroupClassificationSystem
MinorGroupClassificationSystem

References:
http://sillycat.iteye.com/blog/2230117
http://sillycat.iteye.com/blog/2231432

http://www.scalanlp.org/