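This exercise covers loading and cleaning the “” data:
For details on downloading the data, see: NLP practice data (NLP练习数据)
Before starting the exercise, download the data and load the required packages. The dataset is very large and comes in four locales (LOCALE): en_US, de_DE, ru_RU and fi_FI. The exercises below use only the English data (en_US), and it is not always necessary to load the entire dataset to build the algorithm.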
library(tm)
## Loading required package: NLP
setwd("C:\\Coursera-SwiftKey\\final\\en_US")
For more on processing this data, see: NLP in practice, a predictive input method (NLP实践-预测输入法)
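Since the full dataset does not have to be loaded, one option is to read only the first lines of a file, or to keep a random fraction of the lines already read. The sketch below shows both ideas on a small temporary file. The helper names `read_head` and `sample_lines`, the default sizes and the seed are illustrative assumptions, not part of the original exercise.

```r
# Hypothetical helper: read only the first `n` lines of a text file.
read_head <- function(path, n = 1000) {
  con <- file(path, open = "r")
  on.exit(close(con))  # the connection is closed even if readLines() fails
  readLines(con, n = n, encoding = "UTF-8", skipNul = TRUE)
}

# Hypothetical helper: keep a random fraction of lines that are already in memory.
sample_lines <- function(lines, fraction = 0.1, seed = 42) {
  set.seed(seed)  # fixed seed so the sample is reproducible
  keep <- sample.int(length(lines), size = floor(length(lines) * fraction))
  lines[keep]
}

# Demo on a small temporary file standing in for en_US.twitter.txt
tmp <- tempfile(fileext = ".txt")
writeLines(sprintf("line %d", 1:100), tmp)
head_lines <- read_head(tmp, n = 10)
subset_lines <- sample_lines(readLines(tmp), fraction = 0.2)
```

With the real data, `read_head("./en_US.twitter.txt", n = 10000)` would be one way to get a quick look at the file without reading all of it.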
Q1:
The en_US.twitter.txt has how many lines of text?
twitter <- readLines(con <- file("./en_US.twitter.txt"), encoding = "UTF-8", skipNul = TRUE)
###Remember to close con every time you are done with it!
close(con)
#Checking the length
length(twitter)
## [1] 2360148
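If memory is tight, the line count can also be obtained without holding the whole file in memory, by reading the file in chunks and adding up the chunk sizes. This is a minimal sketch under that assumption; `count_lines` and `chunk_size` are hypothetical names, and the demo uses a temporary file rather than the real data.

```r
# Hypothetical helper: count lines by reading the file chunk by chunk.
count_lines <- function(path, chunk_size = 10000) {
  con <- file(path, open = "r")
  on.exit(close(con))  # always release the connection
  total <- 0
  repeat {
    # On an open connection, readLines() continues where the last call stopped.
    chunk <- readLines(con, n = chunk_size, warn = FALSE, skipNul = TRUE)
    if (length(chunk) == 0) break
    total <- total + length(chunk)
  }
  total
}

# Demo on a temporary file with 25 lines
tmp <- tempfile(fileext = ".txt")
writeLines(sprintf("tweet %d", 1:25), tmp)
n_lines <- count_lines(tmp, chunk_size = 10)
```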
Q2:
What is the length of the longest line seen in any of the three en_US data sets?
#Load the blogs data:
fileName="en_US.blogs.txt"
con=file(fileName,open="r")
#Read the lines into lineBlogs and store the number of lines in longBlogs:
lineBlogs=readLines(con)
longBlogs=length(lineBlogs)
#Close the connection once it has been used
close(con)
#Load the news data:
fileName="en_US.news.txt"
con=file(fileName,open="r")
#Read the lines into lineNews and store the number of lines in longNews:
lineNews=readLines(con)
## Warning in readLines(con): incomplete final line found on 'en_US.news.txt'
longNews=length(lineNews)
#Close the connection once it has been used
close(con)
fileName="en_US.twitter.txt"
con=file(fileName,open="r")
lineTwitter=readLines(con)
## Warning in readLines(con): line 167155 appears to contain an embedded nul
## Warning in readLines(con): line 268547 appears to contain an embedded nul
## Warning in readLines(con): line 1274086 appears to contain an embedded nul
## Warning in readLines(con): line 1759032 appears to contain an embedded nul
longTwitter=length(lineTwitter)
close(con)
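#We need the length of the longest line in each vector.
#First attempt (incorrect): longBlogs is a single number here, not the text lines,
#so nchar() only counts its digits and the result is not a line length.
longBlogs = nchar(longBlogs)
max(nchar(longBlogs))
## [1] 1
#Correct approach: measure every line of lineBlogs with stri_length(), then take the maximum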
require(stringi)
## Loading required package: stringi
longBlogs<-stri_length(lineBlogs)
max(longBlogs)
## [1] 40835
#Longest line in lineNews
longNews<-stri_length(lineNews)
max(longNews)
## [1] 5760
#Longest line in lineTwitter
longTwitter<-stri_length(lineTwitter)
max(longTwitter)
## [1] 213
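The three steps above repeat the same pattern, so they could be folded into one function applied to each file. The sketch below is one possible way to do that. It swaps in base R's `nchar()` for `stri_length()` so that it runs without extra packages, which means it may not reproduce the numbers above exactly on the real files. The name `longest_line` is a hypothetical helper, and the demo runs on temporary files.

```r
# Hypothetical helper: length (in characters) of the longest line in a file.
longest_line <- function(path) {
  # Passing a file name lets readLines() open and close the file itself.
  lines <- readLines(path, encoding = "UTF-8", skipNul = TRUE, warn = FALSE)
  # allowNA = TRUE returns NA instead of an error for invalid strings.
  max(nchar(lines, type = "chars", allowNA = TRUE), na.rm = TRUE)
}

# Demo on two temporary files standing in for the en_US data sets
f1 <- tempfile(fileext = ".txt")
f2 <- tempfile(fileext = ".txt")
writeLines(c("a", "abcd"), f1)
writeLines(c("hello world", "hi"), f2)
longest <- sapply(c(f1, f2), longest_line)
```

On the real data the call would look like `sapply(c("en_US.blogs.txt", "en_US.news.txt", "en_US.twitter.txt"), longest_line)`.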
Q3:
In the en_US twitter data set, if you divide the number of lines where the word “love” (all lowercase) occurs by the number of lines the word “hate” (all lowercase) occurs, about what do you get?
#Word "love"
loveTwitter<-grep("love",lineTwitter)
length(loveTwitter)
## [1] 90956
#Word "hate"
hateTwitter<-grep("hate",lineTwitter)
length(hateTwitter)
## [1] 22138
#Divide the number of "love" lines by the number of "hate" lines
length(loveTwitter)/length(hateTwitter)
## [1] 4.108592
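Note that `grep("love", ...)` matches substrings, so lines containing "lovely" or "gloves" are counted as well. The sketch below contrasts substring matching with whole-word matching on a small made-up vector. The helper `line_ratio`, its `whole_word` argument and the sample tweets are assumptions for illustration only, and no claim is made about how the result on the real Twitter data would change.

```r
# Hypothetical helper: ratio of lines matching `a` to lines matching `b`.
line_ratio <- function(lines, a, b, whole_word = FALSE) {
  if (whole_word) {
    # \\b marks a word boundary, so "love" no longer matches "gloves".
    a <- paste0("\\b", a, "\\b")
    b <- paste0("\\b", b, "\\b")
  }
  sum(grepl(a, lines, perl = TRUE)) / sum(grepl(b, lines, perl = TRUE))
}

# Made-up sample, not taken from the dataset
tweets <- c("i love R", "gloves are warm", "i hate bugs",
            "love and hate", "whatever")

ratio_substring <- line_ratio(tweets, "love", "hate")                      # 3 lines / 2 lines
ratio_word      <- line_ratio(tweets, "love", "hate", whole_word = TRUE)   # 2 lines / 2 lines
```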