Classification(3)Generate Features and Stem Adjust the Model System

License: CC 4.0 BY-SA


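This article covers practical Scala techniques such as string operations, Map merging, and sliding windows, and describes in detail how to generate features for a classification task, including text preprocessing, tokenization, stop word removal, and stemming.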

1. Scala Operation
String Method - contains
scala> val longContent = "carl love to study python, scala"
longContent: String = carl love to study python, scala

scala> longContent.contains("python")
res0: Boolean = true

Map Merge Function
Start the console directly under the project that already has the scalaz jar dependency.
> sbt console
scala> import scalaz.Scalaz._
import scalaz.Scalaz._

scala> val m1 = Map("0"->0, "1" ->1)
m1: scala.collection.immutable.Map[String,Int] = Map(0 -> 0, 1 -> 1)

scala> val m2 = Map("2"->2)
m2: scala.collection.immutable.Map[String,Int] = Map(2 -> 2)

scala> val m3 = m1 |+| m2
m3: scala.collection.immutable.Map[String,Int] = Map(0 -> 0, 1 -> 1, 2 -> 2)
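The keys above do not overlap, so |+| looks like a plain union here. When a key exists in both maps, scalaz's |+| combines the values (for Int, it adds them), while the standard ++ keeps only the right-hand value. Below is a minimal sketch of that behaviour in plain Scala, without scalaz; the MapMerge object and the merge name are made up for illustration.

```scala
object MapMerge {
  // Merge two maps; when a key exists in both, add the two values together.
  def merge(a: Map[String, Int], b: Map[String, Int]): Map[String, Int] =
    b.foldLeft(a) { case (acc, (key, value)) =>
      acc.updated(key, acc.getOrElse(key, 0) + value)
    }
}
```

For example, MapMerge.merge(Map("a" -> 1, "b" -> 2), Map("b" -> 3)) gives Map(a -> 1, b -> 5).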

Map Operation
scala> m3
res1: scala.collection.immutable.Map[String,Int] = Map(0 -> 0, 1 -> 1, 2 -> 2)

scala> m3 - "0"
res2: scala.collection.immutable.Map[String,Int] = Map(1 -> 1, 2 -> 2)

Magic scalaz
https://github.com/scalaz/scalaz

Sliding
scala> (1 to 5).iterator.sliding(3).toList
res3: List[Seq[Int]] = List(List(1, 2, 3), List(2, 3, 4), List(3, 4, 5))

List Operation
scala> List(1,2,3).zip(List("one","two","three"))
res8: List[(Int, String)] = List((1,one), (2,two), (3,three))

Run with Assembly Jar
./spark-submit --num-executors 2 --driver-memory 2G --class com.sillycat.jobs.GenerateFeatureMap ${path_to_jar}

Nice Configuration in build.sbt
// There's a problem with jackson 2.5+ with Spark 1.4.1
dependencyOverrides ++= Set(
"com.fasterxml.jackson.core" % "jackson-databind" % "2.4.4"
)

When we build the assembly jar, Spark Core and the related libraries can be marked as provided so they are not bundled into it
"org.apache.spark" %% "spark-core" % "1.4.1" % "provided", // Apache v2
"org.apache.spark" %% "spark-mllib" % "1.4.1" % "provided", // Apache v2

2. Detail Operations
GenerateFeatureMap
step1. Load the job info from S3 (title and description only), cache()
step2. Place the title and description in an object, using a regex to find the title and description again
step3. Normalize the String
For title: toLower -> filter all HTML -> stripChars, only keep [a-zA-Z\d\-]
For description: toLower -> filter URL -> filter HTML -> stripChar -> stripNumber
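The two normalization chains can be sketched with regular expressions. This is only a rough sketch, not the actual job code: the Normalizer object, its method names, and the exact regex patterns are assumptions, and disallowed characters are replaced by a space so that words do not get glued together.

```scala
object Normalizer {
  // All patterns below are illustrative assumptions, not the original ones.
  private val Url = """https?://\S+""".r
  private val HtmlTag = """<[^>]*>""".r
  private val NotAllowed = """[^a-z\d\-]+""".r // applied after lowercasing
  private val Digits = """\d+""".r
  private val Spaces = """\s+""".r

  // Collapse runs of whitespace and trim both ends.
  private def squeeze(s: String): String = Spaces.replaceAllIn(s, " ").trim

  // Title: toLower -> filter HTML -> keep only letters, digits and '-'.
  def normalizeTitle(raw: String): String = {
    val noHtml = HtmlTag.replaceAllIn(raw.toLowerCase, " ")
    squeeze(NotAllowed.replaceAllIn(noHtml, " "))
  }

  // Description: toLower -> filter URL -> filter HTML -> strip chars -> strip numbers.
  def normalizeDescription(raw: String): String = {
    val noUrl = Url.replaceAllIn(raw.toLowerCase, " ")
    val noHtml = HtmlTag.replaceAllIn(noUrl, " ")
    val stripped = NotAllowed.replaceAllIn(noHtml, " ")
    squeeze(Digits.replaceAllIn(stripped, " "))
  }
}
```

With these assumptions, Normalizer.normalizeTitle("<b>Senior</b> Java Engineer (J2EE)!") returns "senior java engineer j2ee".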
step4. Tokenize the String
We predefine a list of 2-word and 3-word phrases and store it in a text file.
For Title:
Find the phrases in the string that are contained in the pre-defined list.
Convert the string to a list of words and phrases
eg: big data software engineer -> big, data, software, engineer, big data, software engineer
("big data" and "software engineer" are pre-defined in the list)
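A small sketch of the title tokenization, reusing sliding from section 1 to build the 2-word and 3-word candidates. The TitleTokenizer object and its signature are hypothetical, and the phrase list is passed in as a Set instead of being read from the text file.

```scala
object TitleTokenizer {
  // Returns the single words, followed by every 2- or 3-word window
  // that appears in the pre-defined phrase set.
  def tokenize(normalized: String, phrases: Set[String]): List[String] = {
    val words = normalized.split("\\s+").filter(_.nonEmpty).toList
    val candidates = for {
      n <- List(2, 3)
      window <- words.sliding(n).toList
      if window.size == n // sliding yields one short window when words.size < n
    } yield window.mkString(" ")
    words ++ candidates.filter(phrases.contains)
  }
}
```

Calling TitleTokenizer.tokenize("big data software engineer", Set("big data", "software engineer")) returns List(big, data, software, engineer, big data, software engineer), which matches the example above.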

For description:
Find the phrases in the string that are contained in the pre-defined list.
Remove stop words using a pre-defined stop word list.
Apply the Porter Stemming Algorithm (https://github.com/dlwh/epic, PorterStemmer.scala)
Convert the string to a list of words and phrases
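The stop word and stemming part for descriptions can be sketched as below. The DescriptionTokenizer object is hypothetical, phrase detection is left out, and the stemmer is passed in as a plain String => String function so that the sketch does not depend on the exact API of PorterStemmer.scala.

```scala
object DescriptionTokenizer {
  // Split on whitespace, drop stop words, then stem each remaining word.
  // `stem` is a stand-in for the Porter stemmer; any String => String works here.
  def tokenize(normalized: String,
               stopWords: Set[String],
               stem: String => String): List[String] =
    normalized.split("\\s+").toList
      .filter(word => word.nonEmpty && !stopWords.contains(word))
      .map(stem)
}
```

For a quick check, a toy stemmer such as (w: String) => if (w.length > 3 && w.endsWith("s")) w.dropRight(1) else w can be passed in; it is not the Porter algorithm, only a placeholder.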
step5. Calculate IDF
TF-IDF http://sillycat.iteye.com/blog/2231432
The document frequency DF(t, D) is the number of documents that contain term t.
|D| is the total number of documents in the corpus.
IDF(t, D) = log((|D| + 1) / (DF(t, D) + 1))
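The formula can be checked on a small in-memory corpus. This is a sketch on plain Scala collections, not the Spark job itself; the Idf object is made up for illustration and the natural logarithm is assumed.

```scala
object Idf {
  // docs: every document is given as its list of tokens.
  def compute(docs: Seq[Seq[String]]): Map[String, Double] = {
    val totalDocs = docs.size
    // DF(t, D): count each term at most once per document.
    val docFreq: Map[String, Int] =
      docs.flatMap(_.distinct).groupBy(identity).map { case (term, hits) => term -> hits.size }
    // IDF(t, D) = log((|D| + 1) / (DF(t, D) + 1))
    docFreq.map { case (term, df) => term -> math.log((totalDocs + 1.0) / (df + 1.0)) }
  }
}
```

With three documents, a term that appears in all of them gets log(4/4) = 0, and a term that appears in only one gets log(4/2) = log 2, so common terms carry less weight.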
step6. Save File on S3
key, index, IDF

3. Classifier Model Training
step1. Load the featureMap which was pre-calculated in the previous operation
step2. Binary Feature Extractor
step3. Load List of Jobs
step4. Train Minor
step5. Train Arbitrator
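The extractor itself is not shown here, so the following is only a guess at what step2 does: assuming "binary" means a feature is 1 when the term is present in the job and 0 otherwise, and assuming the featureMap gives each term an index in 0 until featureMap.size. The BinaryFeatures object and its method names are hypothetical.

```scala
object BinaryFeatures {
  // Sorted indices of the known features that occur in the tokens.
  // Tokens that are missing from the feature map are ignored.
  def activeIndices(tokens: Seq[String], featureIndex: Map[String, Int]): Array[Int] =
    tokens.flatMap(token => featureIndex.get(token)).distinct.sorted.toArray

  // Dense 0/1 vector; assumes the indices cover 0 until featureIndex.size.
  def toDense(tokens: Seq[String], featureIndex: Map[String, Int]): Array[Double] = {
    val vector = Array.fill(featureIndex.size)(0.0)
    activeIndices(tokens, featureIndex).foreach(i => vector(i) = 1.0)
    vector
  }
}
```

The sorted index array is a convenient input when building a sparse vector for training, and repeated tokens count only once.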

4. Classification System
MajorGroupClassificationSystem
MinorGroupClassificationSystem

References:
http://sillycat.iteye.com/blog/2230117
http://sillycat.iteye.com/blog/2231432

http://www.scalanlp.org/