Data Science Capstone-Quiz 1

kidpea_lau

License: CC 4.0 BY-SA

Category: R language. Tags: R, NLP

Article link: https://blog.youkuaiyun.com/kidpea_lau/article/details/83476773

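This article walks through downloading, loading, and cleaning the data for an NLP exercise. Using the tm library and the NLP package, it shows how to handle a large data set, including reading text from different sources such as Twitter, blogs, and news. It also looks at the length of the longest line in the data sets and compares how often the words "love" and "hate" occur.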

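This is an exercise on loading and cleaning the data:

For details on downloading the data, see: NLP Exercise Data

Before starting the exercise, download the data and load the required packages. The dataset is very large. LOCALE is one of four locales: en_US, de_DE, ru_RU, and fi_FI. The exercise below uses only the English database, and you do not necessarily need to load the entire dataset to build the algorithm.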

library(tm)
## Loading required package: NLP

setwd("C:\\Coursera-SwiftKey\\final\\en_US")
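
Since the note above says the full dataset does not always have to be loaded, here is a minimal sketch of reading only the first few lines of a file. The helper name read_first_lines and the temporary demo file are assumptions made for illustration; they are not part of the original exercise.

```r
# Hypothetical helper (not from the original post): read only the first n lines.
read_first_lines <- function(path, n = 1000) {
  con <- file(path, open = "r")
  on.exit(close(con))  # make sure the connection is closed, even on error
  readLines(con, n = n, encoding = "UTF-8", skipNul = TRUE)
}

# Demo on a small temporary file so the sketch runs without the real data set
tmp <- tempfile(fileext = ".txt")
writeLines(sprintf("sample line %d", 1:50), tmp)
first_lines <- read_first_lines(tmp, n = 5)
length(first_lines)
```

With the real data, the same call could be pointed at "./en_US.twitter.txt" to get a quick look at the file before deciding how much of it to load.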

For more processing of this data, see: NLP in Practice: Predictive Input Method

Q1:

The en_US.twitter.txt has how many lines of text?

twitter <- readLines(con <- file("./en_US.twitter.txt"), encoding = "UTF-8", skipNul = TRUE)
### Always remember to close con once you are done with it!
close(con)
#Checking the length

length(twitter)
## [1] 2360148
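
The line count can also be obtained without holding the whole file in memory. The following is a rough sketch that reads the file in chunks; the function name count_lines, the chunk size, and the temporary demo file are illustrative assumptions rather than the method used in this post.

```r
# Hypothetical helper: count lines chunk by chunk instead of loading everything.
count_lines <- function(path, chunk_size = 10000) {
  con <- file(path, open = "rb")  # binary mode, so the bytes are read as-is
  on.exit(close(con))             # close the connection even on error
  total <- 0
  repeat {
    chunk <- readLines(con, n = chunk_size, skipNul = TRUE, warn = FALSE)
    if (length(chunk) == 0) break  # nothing left to read
    total <- total + length(chunk)
  }
  total
}

# Demo on a temporary file with a known number of lines
tmp_count <- tempfile(fileext = ".txt")
writeLines(sprintf("tweet %d", 1:25), tmp_count)
n_lines <- count_lines(tmp_count, chunk_size = 10)
n_lines
```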

Q2:

What is the length of the longest line seen in any of the three en_US data sets?

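# Load the blogs data:
fileName="en_US.blogs.txt"
con=file(fileName,open="r")

# Read all lines into lineBlogs and store the number of lines in longBlogs:
lineBlogs=readLines(con) 
# Use length(lineBlogs), not length(line): `line` is a function from the stats package, so length(line) is just 1
longBlogs=length(lineBlogs)
# Close the connection once it is no longer needed
close(con)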

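# Load the news data:
fileName="en_US.news.txt"
con=file(fileName,open="r")
# Read all lines into lineNews and store the number of lines in longNews:
lineNews=readLines(con) 

## Warning in readLines(con): incomplete final line found on 'en_US.news.txt'
# Note: this warning can mean that reading stopped before the end of the file;
# opening the connection in binary mode (open="rb") is a common workaround.
# Use length(lineNews), not length(line): `line` is a function from the stats package
longNews=length(lineNews)
# Close the connection once it is no longer needed
close(con)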

fileName="en_US.twitter.txt"
con=file(fileName,open="r")
lineTwitter=readLines(con) 

## Warning in readLines(con): line 167155 appears to contain an embedded nul
## Warning in readLines(con): line 268547 appears to contain an embedded nul
## Warning in readLines(con): line 1274086 appears to contain an embedded nul
## Warning in readLines(con): line 1759032 appears to contain an embedded nul


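# The embedded nul warnings above can be avoided with skipNul = TRUE, as in Q1.
# Use length(lineTwitter), not length(line): `line` is a function from the stats package
longTwitter=length(lineTwitter)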
close(con)

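# We need the longest line in each vector.
# First attempt: longBlogs holds a single number here, not the text lines,
# so nchar() only counts that number's digits and the result is not a line length.
longBlogs = nchar(longBlogs)
max(nchar(longBlogs))

## [1] 1
# The maximum line length of lineBlogs is computed below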


require(stringi)

## Loading required package: stringi


longBlogs<-stri_length(lineBlogs)
max(longBlogs)
## [1] 40835
# The maximum line length of lineNews is computed below


longNews<-stri_length(lineNews)
max(longNews)
## [1] 5760
# The maximum line length of lineTwitter is computed below


longTwitter<-stri_length(lineTwitter)
max(longTwitter)
## [1] 213
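
The three nearly identical blocks above could be folded into one function. The sketch below is one possible way to do that using base R's nchar instead of stringi; the helper name longest_line and the small temporary demo files are assumptions for illustration, and the demo values have nothing to do with the real data set.

```r
# Hypothetical helper: length (in characters) of the longest line in a file.
longest_line <- function(path) {
  con <- file(path, open = "rb")
  on.exit(close(con))  # close the connection even on error
  lines <- readLines(con, encoding = "UTF-8", skipNul = TRUE, warn = FALSE)
  # allowNA = TRUE returns NA instead of failing on invalid multibyte strings
  max(nchar(lines, type = "chars", allowNA = TRUE), na.rm = TRUE)
}

# Demo on three small temporary files standing in for blogs, news and twitter
demo_dir <- tempfile("demo_")
dir.create(demo_dir)
files <- c(blogs   = file.path(demo_dir, "blogs.txt"),
           news    = file.path(demo_dir, "news.txt"),
           twitter = file.path(demo_dir, "twitter.txt"))
writeLines(c("a short line", "a considerably longer line of text"), files[["blogs"]])
writeLines(c("headline", "a medium sized sentence"), files[["news"]])
writeLines(c("hi", "short tweet"), files[["twitter"]])

longest <- sapply(files, longest_line)
longest
names(longest)[which.max(longest)]  # which file has the longest line
```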

Q3:

In the en_US twitter data set, if you divide the number of lines where the word “love” (all lowercase) occurs by the number of lines the word “hate” (all lowercase) occurs, about what do you get?

#Word "love"
loveTwitter<-grep("love",lineTwitter)
length(loveTwitter)
## [1] 90956
#Word "hate"
hateTwitter<-grep("hate",lineTwitter)
length(hateTwitter)
## [1] 22138
#Divide love by hate
length(loveTwitter)/length(hateTwitter)
## [1] 4.108592
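
Note that grep("love", ...) also counts lines where "love" only appears inside a longer word. As a small sketch of the difference, the example below compares a substring match with a whole-word match using the \\b word boundary. The tweets vector is made-up demo data, and whether the quiz intends whole-word matching is an assumption not settled by this post.

```r
# Made-up demo data, not taken from the real Twitter file
tweets <- c("i love this", "gloves are warm", "i hate mondays",
            "love and hate", "whatever")

# Substring match: also counts "gloves"
substring_love <- sum(grepl("love", tweets, fixed = TRUE))

# Whole-word match: \\b marks a word boundary
word_love <- sum(grepl("\\blove\\b", tweets, perl = TRUE))
word_hate <- sum(grepl("\\bhate\\b", tweets, perl = TRUE))

substring_love
word_love
word_love / word_hate
```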