Data Warehouse Data Cleaning
- Removing duplicate values
- Removing null values
- Changing column names to readable, understandable, consistently formatted names
- Removing commas from numeric values, e.g. 1,000,657 becomes 1000657
- Converting columns to the data types appropriate for analysis
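As a rough illustration of the cleaning steps listed above, here is a minimal base R sketch. The sales data frame, its values, and its column names are made up for this example; they are not part of the dataset used in this article.

```r
# Hypothetical toy data: one duplicate row, one missing value,
# unfriendly column names, and numbers stored as text with commas.
sales <- data.frame(
  "Cust ID"   = c(1, 2, 2, 3),
  "Total.Amt" = c("1,000,657", "2,500", "2,500", NA),
  check.names = FALSE,
  stringsAsFactors = FALSE
)

sales <- unique(sales)    # remove duplicate rows
sales <- na.omit(sales)   # remove rows that contain missing values

# Rename the columns to readable names
names(sales) <- c("customer_id", "total_amount")

# Strip the thousands separators, then convert text to numeric
sales$total_amount <- gsub(",", "", sales$total_amount, fixed = TRUE)
sales$total_amount <- as.numeric(sales$total_amount)

sales
```

The order matters in the last two lines: calling as.numeric() on a string that still contains commas produces NA with a warning, so the commas have to go first.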
The Experiment:
The data used here comes from the UCI Machine Learning Repository. In the experiment, a group of 30 volunteers (aged 19–48 years) performed six activities (WALKING, WALKING_UPSTAIRS, WALKING_DOWNSTAIRS, SITTING, STANDING, LAYING) while wearing a Samsung Galaxy S smartphone. The data collected from the embedded accelerometers was divided into training and test sets.
Step 1: Retrieving Data from URL
The first step is to obtain the data. To avoid the headache of manually downloading thousands of files, data is often downloaded with a small code snippet. Since this dataset comes as a zipped folder, it also has to be unzipped after the download.
Data Reference:
http://archive.ics.uci.edu/ml/datasets/Human+Activity+Recognition+Using+Smartphones
download.file("https://d396qusza40orc.cloudfront.net/getdata%2Fprojectfiles%2FUCI%20HAR%20Dataset.zip", destfile = "files", method = "curl", mode = "wb")
unzip("files")
Step 2: Reading the files into R
features <- read.table("...\\UCI\\features.txt", col.names = c("serial", "Functions"))
features
activities <- read.table("...\\UCI\\activity_labels.txt", col.names = c("serial", "Activity"))
activities
x_test <- read.table("...\\test\\X_test.txt", col.names = features$Functions)
x_test
y_test <- read.table("...\\test\\y_test.txt", col.names = "serial")
y_test
subject_test <- read.table("...\\test\\subject_test.txt", col.names = "subject")
subject_test
subject_train <- read.table("...UCI\\train\\subject_train.txt", col.names = "subject")
subject_train
x_train <- read.table("...\\UCI\\train\\X_train.txt", col.names = features$Functions)
x_train
y_train <- read.table("...\\UCI\\train\\y_train.txt", col.names = "serial")
y_train
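To see what the col.names argument does in the calls above, here is a small self-contained sketch. The inline text stands in for features.txt and X_test.txt; the two feature names follow the naming pattern of the dataset, but the numbers are invented for illustration.

```r
# Hypothetical two-line excerpt standing in for features.txt
features_demo <- read.table(
  text = "1 tBodyAcc-mean()-X\n2 tBodyAcc-std()-X",
  col.names = c("serial", "Functions"),
  stringsAsFactors = FALSE
)

# Hypothetical measurements standing in for X_test.txt
x_demo <- read.table(
  text = "0.25 -0.98\n0.27 -0.97",
  col.names = features_demo$Functions
)

# read.table() makes the names syntactically valid by default
# (check.names = TRUE), so "-", "(" and ")" are replaced with "."
names(x_demo)
```

This is why the column names seen later in the analysis contain runs of dots where the original feature names had dashes and parentheses.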
Note: It might be difficult at first to understand what the data means and which column names to use, but after a while it will start making sense.
The way the files are organized implies two things:
- I had to merge the training and test sets by row-binding them.
- I had to merge the different attributes of the subjects by column-binding them.
Step 3: Merging the tables intelligently
binded_x <- rbind(x_test, x_train)
binded_y <- rbind(y_test, y_train)
subject <- rbind(subject_test, subject_train)
# Next, I used the cbind() function to attach the columns as well.
raw_data_combined <- cbind(subject, binded_x, binded_y)
raw_data_combined
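The row-binding and column-binding pattern can be tried out on a miniature example first. All of the tables below are hypothetical stand-ins with invented values; only the shape of the operation mirrors the real data.

```r
# Hypothetical miniature versions of the test and training tables
x_test_small        <- data.frame(f1 = c(0.1, 0.2), f2 = c(0.3, 0.4))
x_train_small       <- data.frame(f1 = c(0.5, 0.6, 0.7), f2 = c(0.8, 0.9, 1.0))
y_test_small        <- data.frame(serial = c(5, 4))
y_train_small       <- data.frame(serial = c(1, 2, 3))
subject_test_small  <- data.frame(subject = c(2, 2))
subject_train_small <- data.frame(subject = c(1, 1, 3))

# Stack rows: always test first, then train, in all three tables,
# so that row i refers to the same observation everywhere.
bound_x       <- rbind(x_test_small, x_train_small)
bound_y       <- rbind(y_test_small, y_train_small)
bound_subject <- rbind(subject_test_small, subject_train_small)

# Sanity check before gluing the columns together
stopifnot(nrow(bound_x) == nrow(bound_y), nrow(bound_y) == nrow(bound_subject))

combined_small <- cbind(bound_subject, bound_x, bound_y)
dim(combined_small)
```

cbind() matches rows purely by position, so using the same test-then-train order in every rbind() call is what keeps each subject and activity code attached to the right measurements.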
Step 4: Filtering out only the mean and std columns
One thing to understand is that the dataset is large, so we may need to filter it to extract only the attributes we need. Here I kept only the columns whose names contain 'mean' or 'std'. I used the select() function for this, which keeps the code much tidier than indexing columns by hand.
Note: Install the "dplyr" package and then load it with library(dplyr) to use its functions such as select(), arrange(), mutate(), filter(), and summarise().
#install.packages("dplyr")
library(dplyr)
analysis <- raw_data_combined %>% select(serial, subject, contains("mean"), contains("std"))
analysis
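For readers who want to check what this selection keeps without installing dplyr, the following is a base R sketch of a similar filter. The demo data frame and its values are invented; the column names only imitate the style of the real ones. Note that contains() ignores case by default, which is why the sketch sets ignore.case = TRUE.

```r
# Hypothetical miniature table with column names in the style of the dataset
demo <- data.frame(
  serial  = c(1, 2),
  subject = c(1, 1),
  tBodyAcc.mean...X = c(0.28, 0.27),
  tBodyAcc.std...X  = c(-0.99, -0.98),
  tBodyAcc.max...X  = c(-0.93, -0.94),
  angle.tBodyAccMean.gravity. = c(-0.11, 0.05)
)

# Keep the two id columns plus every column whose name mentions mean or std
keep <- names(demo) %in% c("serial", "subject") |
  grepl("mean|std", names(demo), ignore.case = TRUE)

analysis_demo <- demo[, keep]
names(analysis_demo)
```

Unlike the select() call, which places all the mean columns before the std columns, this version leaves the columns in their original order. It also shows that a case-insensitive match picks up columns such as angle.tBodyAccMean.gravity., which may or may not be wanted.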
Step 5: Changing the activity labels from numeric codes to descriptive values
activities
"activity_labels.txt" assigns the numbers 1–6 to the six activities, and the data uses these codes instead of the activity names. For better readability, I changed them into descriptive values using the following commands:
analysis$serial[analysis$serial == "1"] <- "WALKING"
analysis$serial[analysis$serial == "2"] <- "WALKING_UPSTAIRS"
analysis$serial[analysis$serial == "3"] <- "WALKING_DOWNSTAIRS"
analysis$serial[analysis$serial == "4"] <- "SITTING"
analysis$serial[analysis$serial == "5"] <- "STANDING"
analysis$serial[analysis$serial == "6"] <- "LAYING"
analysis
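The same relabeling can also be driven by the lookup table itself instead of six separate assignments. The sketch below is one possible alternative, not the approach used above; activities_demo is rebuilt by hand here so the example runs on its own, and the codes vector is invented.

```r
# Hand-built copy of the code-to-name lookup table
activities_demo <- data.frame(
  serial   = 1:6,
  Activity = c("WALKING", "WALKING_UPSTAIRS", "WALKING_DOWNSTAIRS",
               "SITTING", "STANDING", "LAYING"),
  stringsAsFactors = FALSE
)

# Hypothetical activity codes as they appear in the y files
codes <- c(5, 1, 6, 3)

# factor() maps each code to the label at the same position in the table
labels <- as.character(
  factor(codes, levels = activities_demo$serial, labels = activities_demo$Activity)
)
labels
```

Because the mapping comes from the table, a typo in a single code cannot silently leave one activity unlabeled.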
Step 6: Changing column names to enhance readability
- names() returns the column names of the dataset you pass to it.
- gsub() replaces every match of a pattern with the replacement string you pass to it.
names(analysis) <- gsub("Acc", "Accelerometer", names(analysis))
names(analysis) <- gsub("tBody", "time", names(analysis))
names(analysis) <- gsub("fBody", "frequency", names(analysis))
names(analysis) <- gsub("Gyro", "Gyroscope", names(analysis))
names(analysis) <- gsub("BodyBody", "Body", names(analysis))
names(analysis) <- gsub("Mag", "Magnitude", names(analysis))
names(analysis) <- gsub("serial", "Activity", names(analysis))
names(analysis)
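Since gsub() treats its first argument as a regular expression, patterns can be anchored to the start or end of a name. The sketch below is a variant of the renaming above, applied to a few hypothetical column names; it expands only the leading t or f prefix and therefore keeps the word Body in the result, which the commands above do not.

```r
# Hypothetical column names in the style of the dataset
demo_names <- c("serial", "tBodyAcc.mean...X", "fBodyGyro.std...Z", "tBodyAccMag.mean..")

demo_names <- gsub("Acc", "Accelerometer", demo_names)
demo_names <- gsub("Gyro", "Gyroscope", demo_names)
demo_names <- gsub("Mag", "Magnitude", demo_names)

# "^" anchors the match to the first character of the name
demo_names <- gsub("^t", "time", demo_names)
demo_names <- gsub("^f", "frequency", demo_names)

# "^...$" matches the whole name only
demo_names <- gsub("^serial$", "Activity", demo_names)

demo_names
```

Anchoring avoids accidental replacements in the middle of a name, for example a column that merely contains the letters "serial" somewhere inside it.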
Step 7: Creating an independent tidy data set with the average of each variable for each activity and each subject
Put simply, we need to take the mean of each feature in the 'analysis' dataset for every combination of subject and activity.
tidy_data <- analysis %>% group_by(subject, Activity) %>% summarise_all(list(mean))
tidy_data
The group_by() function groups your data by the columns you pass to it, and summarise_all() applies the function you pass to it (in this case mean()) to every remaining column within each group.
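To make the grouping step concrete, here is a base R sketch that computes per-group means with aggregate() on a tiny invented table. It is meant as an equivalent illustration, not as the method used above; the column names mean_x and std_x are placeholders.

```r
# Hypothetical miniature version of the analysis table
analysis_small <- data.frame(
  subject  = c(1, 1, 1, 2, 2),
  Activity = c("WALKING", "WALKING", "SITTING", "WALKING", "WALKING"),
  mean_x   = c(0.25, 0.75, 0.125, 0.5, 1.0),
  std_x    = c(-1.0, -0.5, -1.0, -0.25, -0.75),
  stringsAsFactors = FALSE
)

# Average every remaining column for each (subject, Activity) pair
tidy_small <- aggregate(. ~ subject + Activity, data = analysis_small, FUN = mean)

# Sort so the rows come out in a predictable order
tidy_small <- tidy_small[order(tidy_small$subject, tidy_small$Activity), ]
tidy_small
```

The five input rows collapse to three, one per subject and activity combination that actually occurs in the data.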
This concludes the project: the raw data has been transformed into a tidy data set that can be used for later analysis.
More such simplified Data Science concepts will follow. If you liked this or have some feedback or follow-up questions please comment below.
Thanks for Reading!
Translated from: https://medium.com/analytics-vidhya/data-cleaning-in-r-for-data-science-dbfd75819bd6