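Machine Learning Ch. 3, Classification: Spam Filtering (text mining with the tm package, spam classification with the naive Bayes algorithm, etc.)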

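This post walks through the implementation of a spam classifier based on the naive Bayes algorithm. It builds a term-document matrix to quantify how often feature terms occur in the emails, then uses naive Bayes for training and classification. It shows in detail how to read the email data, build the training data sets, and classify unseen emails.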

Machine learning notes
Textbook: Machine Learning for Hackers (机器学习：实用案例解析)
Code: GitHub link
The code in this post has been slightly modified

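1. Bayes spam classifier

1.1 Setting the paths

There are three kinds of email:
1. Easily recognized legitimate email: easy_ham.
2. Hard-to-recognize legitimate email: hard_ham.
3. Spam: spam.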

# Load libraries
library('tm')
library('ggplot2')

# Set the global paths
spam.path <- file.path("data", "spam")
spam2.path <- file.path("data", "spam_2")
easyham.path <- file.path("data", "easy_ham")
easyham2.path <- file.path("data", "easy_ham_2")
hardham.path <- file.path("data", "hard_ham")
hardham2.path <- file.path("data", "hard_ham_2")

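1.2 Reading the text after the first empty line ("") of each email as a character vector

With the original code, reading some files failed at the which(text == "")[1] line (typically because no empty line is found, so the index is NA and seq() raises an error). tryCatch is used here to catch that error instead of stopping and printing it.
The last line of the function joins the body lines with \n, so each email becomes a single string and all.spam ends up with one element per email.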

get.msg <- function(path)
{
    con <- file(path, open = "rt", encoding = "latin1")
    text <- readLines(con)
    close(con)
    # The message always begins after the first full line break.
    # tryCatch added: if no empty line is found, seq() fails and we
    # fall back to an empty body instead of returning the error object
    msg <- tryCatch(text[seq(which(text == "")[1] + 1, length(text), 1)],
                    error = function(e) character(0)) # from the first "" to the last line
    return(paste(msg, collapse = "\n")) # join the body lines into a single string
}
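
To see what the header/body split does without touching the data files, the same indexing can be tried on a small in-memory message. This is only an illustrative sketch: raw.lines, first.blank, body.lines, and body.text are made-up names, and the message is invented rather than taken from the corpus.

```r
# A made-up raw email: header lines, one empty line, then the body
raw.lines <- c("From: someone@example.com",
               "Subject: hello",
               "",
               "First body line",
               "Second body line")

# Index of the first empty line; the body starts on the next line
first.blank <- which(raw.lines == "")[1]
body.lines <- raw.lines[seq(first.blank + 1, length(raw.lines), 1)]

# Collapse the body into one string, joined by newlines
body.text <- paste(body.lines, collapse = "\n")
print(body.text)
```

Everything before the first empty line (the headers) is dropped, and the remaining lines are joined into one string per email.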


# Get all the SPAM-y email into a single vector
spam.docs <- dir(spam.path)
spam.docs <- spam.docs[which(spam.docs != "cmds")]
all.spam <- sapply(spam.docs,
                   function(p) get.msg(file.path(spam.path, p))) # each element of all.spam is the body of one email

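1.3 Building a text corpus

One way to quantify how often feature terms occur in spam is to build a term-document matrix (TDM). A TDM is an N*M matrix whose rows correspond to the terms extracted from all documents of a given corpus and whose columns correspond to the documents of that corpus. The element at position [i, j] is the number of times term i occurs in document j.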
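
As a small illustration of this definition, a TDM for two invented one-line documents can be assembled with base R alone. This is a sketch for intuition only: it does not use tm, it does no cleaning of the text, and the names docs, tokens, terms, and toy.tdm are hypothetical.

```r
# Two invented documents
docs <- c(d1 = "free money free", d2 = "meeting about money")

# Split each document into words and collect the vocabulary
tokens <- strsplit(docs, " ", fixed = TRUE)
terms <- sort(unique(unlist(tokens)))

# One column per document: how often each vocabulary term occurs in it
toy.tdm <- vapply(tokens,
                  function(tok) as.integer(table(factor(tok, levels = terms))),
                  integer(length(terms)))
dimnames(toy.tdm) <- list(terms, names(docs))
print(toy.tdm)
```

Rows are terms and columns are documents, so toy.tdm["free", "d1"] is the number of times "free" occurs in the first document.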

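Building the corpus (I do not understand the corpus object yet!)
This example uses the VectorSource function to build a source object, so that the corpus can be built from the vector of emails.
The control argument tells tm how to clean and normalize the text:
1. stopwords = TRUE: remove the 488 most common English stop words.
2. removePunctuation, removeNumbers: remove punctuation and numbers.
3. minDocFreq = 2: only terms that occur more than once in the text appear as rows of the TDM.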

get.tdm <- function(doc.vec) # doc.vec holds the text of all documents
{
    # Options that tell tm how to extract and clean the text
    control <- list(stopwords = TRUE,
                    removePunctuation = TRUE,
                    removeNumbers = TRUE,
                    minDocFreq = 2)
    doc.corpus <- Corpus(VectorSource(doc.vec))
    doc.dtm <- TermDocumentMatrix(doc.corpus, control)
    return(doc.dtm) # despite the variable name, this is a TermDocumentMatrix (TDM)
}


# Create a TermDocumentMatrix from that vector
spam.tdm <- get.tdm(all.spam)

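1.4 Building spam training data from the TDM

as.matrix converts the TDM into a standard R matrix.
rowSums creates a vector holding the total frequency of each feature term across all documents.
When building the data frame, note stringsAsFactors = FALSE.
occurrence: the fraction of all documents in which a given feature term appears.
density: the frequency of each term relative to the total term count of the whole corpus (not used for classification here, but comparing frequencies is quite useful for checking whether certain words influence the result).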

# Create a data frame that provides the feature set from the training SPAM data
spam.matrix <- as.matrix(spam.tdm) # standard R matrix; rows: terms, columns: count of the term in each document
spam.counts <- rowSums(spam.matrix) # total count of each term over the 500 documents
spam.df <- data.frame(cbind(names(spam.counts),
                            as.numeric(spam.counts)),
                      stringsAsFactors = FALSE)
names(spam.df) <- c("term", "frequency")

spam.df$frequency <- as.numeric(spam.df$frequency)
spam.occurrence <- sapply(1:nrow(spam.matrix), # number of documents containing the term / 500 documents
                          function(i)
                          {
                              length(which(spam.matrix[i, ] > 0)) / ncol(spam.matrix)
                          })
spam.density <- spam.df$frequency / sum(spam.df$frequency) # share of the term in the total term count

# Add the term density and occurrence rate
spam.df <- transform(spam.df,
                     density = spam.density,
                     occurrence = spam.occurrence)

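The training data for legitimate email is built in the same way as for spam; just use data/easy_ham instead. The code that follows assumes the result is stored in easyham.df.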
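
Because the spam and ham data frames are built with identical steps, those steps could be wrapped in one helper. The sketch below is an assumption about how one might do that, not code from the book: make.term.df is a hypothetical name, and the toy matrix is invented so the numbers can be checked by hand.

```r
# Hypothetical helper: turn a term-by-document count matrix into the
# term / frequency / density / occurrence data frame used for training
make.term.df <- function(term.matrix)
{
    counts <- as.numeric(rowSums(term.matrix))
    data.frame(term = rownames(term.matrix),
               frequency = counts,
               # share of the term in the total term count
               density = counts / sum(counts),
               # share of documents that contain the term at least once
               occurrence = as.numeric(rowSums(term.matrix > 0)) / ncol(term.matrix),
               stringsAsFactors = FALSE)
}

# Toy matrix: 3 terms (rows) by 4 documents (columns)
toy.matrix <- matrix(c(2, 0, 1, 0,
                       1, 1, 1, 1,
                       0, 0, 0, 3),
                     nrow = 3, byrow = TRUE,
                     dimnames = list(c("free", "money", "offer"),
                                     paste0("doc", 1:4)))
toy.df <- make.term.df(toy.matrix)
print(toy.df)
```

With the real data, a call such as make.term.df(as.matrix(get.tdm(all.easyham))) would mirror the spam steps, where all.easyham is assumed to be read from easyham.path the same way all.spam was read from spam.path.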

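1.5 Defining the classifier and testing it on hard ham

To compute the probability that an email is spam or ham, we first find the terms shared by the email to classify and the training set. The probabilities of these terms are then used to compute the conditional probability that the email belongs to the class of that training set. When a term in the email does not appear in the training set, we can assign a fixed value, draw a probability from some distribution, or use natural language processing techniques to estimate the "spamminess" of a term in a given context. Here a fixed value of 1e-6 is used.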
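
Written out, the score computed for one email against one training set can be summarized as follows. The symbols are notation introduced here for illustration: $T$ is the set of distinct terms in the email, $M \subseteq T$ is the subset that also appears in the training set, $\mathrm{occ}(w)$ is the occurrence value of term $w$ in that training set, $p$ is the prior (0.5 by default), and $c$ is the fixed value 1e-6.

```latex
\mathrm{score} \;=\; p \cdot \prod_{w \in M} \mathrm{occ}(w) \cdot c^{\,|T| - |M|}
```

When no term matches, $M$ is empty, the product is 1, and the score reduces to $p \cdot c^{|T|}$. The email is assigned to whichever training set gives the larger score.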

classify.email <- function(path, training.df, prior = 0.5, c = 1e-6)
{
  # Here, we use many of the support functions to get the
  # email text data in a workable format
  msg <- get.msg(path) # read the body of the email to classify
  msg.tdm <- get.tdm(msg) # build its TDM
  msg.freq <- rowSums(as.matrix(msg.tdm)) # count of each term in the message
  # Find intersections of words
  msg.match <- intersect(names(msg.freq), training.df$term) # terms shared with the training set
  # Now, we just perform the naive Bayes calculation
  if(length(msg.match) < 1)
  {
    return(prior * c ^ (length(msg.freq))) # no shared terms: every term gets the fixed probability c
  }
  else
  {
    match.probs <- training.df$occurrence[match(msg.match, training.df$term)]
    return(prior * prod(match.probs) * c ^ (length(msg.freq) - length(msg.match)))
  }
}
# The result is the score of the test email under the given training set
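
One practical caveat: multiplying many values as small as 1e-6 can underflow to exactly 0 in double precision, in which case the spam and ham scores of a long email would tie at 0. A common workaround is to compare sums of logarithms instead. The sketch below is not part of the original code; email.log.score is a hypothetical name, and it takes the matched occurrence values and the number of distinct terms directly so it can run without tm.

```r
# Hypothetical log-space variant of the score computed in classify.email
email.log.score <- function(match.probs, n.terms, prior = 0.5, c = 1e-6)
{
    # Terms of the email that were not found in the training set
    n.unmatched <- n.terms - length(match.probs)
    log(prior) + sum(log(match.probs)) + n.unmatched * log(c)
}

# An email with 400 distinct terms, none of them in the training set
raw.score <- 0.5 * (1e-6) ^ 400                           # underflows to 0
safe.score <- email.log.score(numeric(0), n.terms = 400)  # stays finite
print(c(raw = raw.score, log = safe.score))
```

Since the logarithm is monotonic, comparing log scores gives the same decision as comparing the raw products whenever those products are representable.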

Running the classification

hardham.docs <- dir(hardham.path)
hardham.docs <- hardham.docs[which(hardham.docs != "cmds")]

hardham.spamtest <- sapply(hardham.docs,
                           function(p) classify.email(file.path(hardham.path, p), training.df = spam.df))

hardham.hamtest <- sapply(hardham.docs,
                          function(p) classify.email(file.path(hardham.path, p), training.df = easyham.df))

hardham.res <- hardham.spamtest > hardham.hamtest # TRUE: classified as spam
summary(hardham.res)

1.6 Classifying all the emails

spam.classifier <- function(path) # classify the email at path against both training sets
{
  pr.spam <- classify.email(path, spam.df)
  pr.ham <- classify.email(path, easyham.df)
  return(c(pr.spam, pr.ham, ifelse(pr.spam > pr.ham, 1, 0)))
}

# Get lists of all the email messages
easyham2.docs <- dir(easyham2.path)
easyham2.docs <- easyham2.docs[which(easyham2.docs != "cmds")]

hardham2.docs <- dir(hardham2.path)
hardham2.docs <- hardham2.docs[which(hardham2.docs != "cmds")]

spam2.docs <- dir(spam2.path)
spam2.docs <- spam2.docs[which(spam2.docs != "cmds")]

# Classify them all!
# Easy ham
easyham2.class <- suppressWarnings(lapply(easyham2.docs,
                                   function(p)
                                   {
                                     spam.classifier(file.path(easyham2.path, p))
                                   }))
# Hard ham
hardham2.class <- suppressWarnings(lapply(hardham2.docs,
                                   function(p)
                                   {
                                     spam.classifier(file.path(hardham2.path, p))
                                   }))
# Spam
spam2.class <- suppressWarnings(lapply(spam2.docs,
                                function(p)
                                {
                                  spam.classifier(file.path(spam2.path, p))
                                }))

# Create a single, final, data frame with all of the classification data in it
easyham2.matrix <- do.call(rbind, easyham2.class)
easyham2.final <- cbind(easyham2.matrix, "EASYHAM")

hardham2.matrix <- do.call(rbind, hardham2.class)
hardham2.final <- cbind(hardham2.matrix, "HARDHAM")

spam2.matrix <- do.call(rbind, spam2.class)
spam2.final <- cbind(spam2.matrix, "SPAM")

class.matrix <- rbind(easyham2.final, hardham2.final, spam2.final)
class.df <- data.frame(class.matrix, stringsAsFactors = FALSE)
names(class.df) <- c("Pr.SPAM" ,"Pr.HAM", "Class", "Type")
class.df$Pr.SPAM <- as.numeric(class.df$Pr.SPAM)
class.df$Pr.HAM <- as.numeric(class.df$Pr.HAM)
class.df$Class <- as.logical(as.numeric(class.df$Class))
class.df$Type <- as.factor(class.df$Type)
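
Once class.df is assembled, a natural next step is to look at what share of each email type was labelled as spam. The sketch below does this on a small invented data frame with the same Class and Type columns. toy.class.df and spam.rate are hypothetical names, and the values are made up for illustration; they are not results from the data.

```r
# Toy stand-in for class.df (invented values, not real results)
toy.class.df <- data.frame(
    Class = c(FALSE, FALSE, TRUE, FALSE, TRUE, TRUE, TRUE, FALSE),
    Type = factor(c("EASYHAM", "EASYHAM", "EASYHAM", "HARDHAM",
                    "HARDHAM", "SPAM", "SPAM", "SPAM")))

# Share of each type that was labelled spam (Class == TRUE)
spam.rate <- tapply(toy.class.df$Class, toy.class.df$Type, mean)
print(spam.rate)
```

For the two ham types this share is the false positive rate, and for SPAM it is the share of spam that was caught.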