Weather Data Analysis Example: Part 2

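This article walks through the preprocessing of a weather data set, including variable type conversion and basic data checks, with hands-on demonstrations in R.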
Part 2: Data Preparation

In Part 1 I introduced the weather data set we will be using in this series of tutorials. We are now going to prepare the data for the subsequent exploratory data analysis (EDA). We will recode and transform variables, change their types, and perform some basic data checks. Feel free to follow along with the analysis (click here to download the weather data), bearing in mind that you can type ?function_name to get help on a specific R function, for instance, ?head.

Importing the data


To start off, let's read in the data to an R data frame and run some basic commands:

# Make sure the file is in your current working directory
 
> weather <- read.csv("weather_2014.csv",sep=";",stringsAsFactors=FALSE)
 
> dim(weather)
[1] 365  14

> names(weather)
 [1] "day.count"      "day"            "month"          "season"        
 [5] "l.temp"         "h.temp"         "ave.temp"       "l.temp.time"   
 [9] "h.temp.time"    "rain"           "ave.wind"       "gust.wind"     
[13] "gust.wind.time" "dir.wind"
 
> head(weather)
  day.count day month season l.temp h.temp ave.temp l.temp.time h.temp.time rain
1         1   1     1 Winter   12.7   14.0     13.4       01:25       23:50 32.0
2         2   2     1 Winter   11.3   14.7     13.5       07:30       11:15 64.8
3         3   3     1 Winter   12.6   14.7     13.6       21:00       14:00 12.7
4         4   4     1 Winter    7.7   13.9     11.3       10:35       01:50 20.1
5         5   5     1 Winter    8.8   14.6     13.0       01:40       12:55  9.4
6         6   6     1 Winter   11.8   14.4     13.1       19:35       00:05 38.9
  ave.wind gust.wind gust.wind.time dir.wind
1     11.4      53.1          15:45        S
2      5.6      41.8          22:25        S
3      4.3      38.6          00:00      SSW
4     10.3      66.0          09:05       SW
5     11.6      51.5          13:50      SSE
6      9.9      57.9          08:10      SSE 
 
 
It seems we have correctly loaded the data into R. Notice that the variables in the data file are separated not by a comma, but by a semicolon, and hence the need to set the sep = ";" argument. We also told R not to import strings as factors (i.e., categorical variables). In many cases some character variables are indeed strings and some integer variables are actually factors. This means there is almost always manual work to be done after importing, and therefore I prefer to read in the variables without any initial processing.
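
To see why the separator matters, here is a minimal sketch that reads a few semicolon-separated lines through the text argument of read.csv() instead of the real file. The three lines of data are only an illustration (a cut-down version of the first rows shown above), and the object names ok and wrong are made up for this example.

```r
# Three semicolon-separated lines standing in for the real file (illustrative only)
csv_text <- "day;season;l.temp\n1;Winter;12.7\n2;Winter;11.3"

# With sep = ";" each field lands in its own column
ok <- read.csv(text = csv_text, sep = ";", stringsAsFactors = FALSE)

# With the default sep = "," every line is read as one single field
wrong <- read.csv(text = csv_text, stringsAsFactors = FALSE)

ncol(ok)          # 3
ncol(wrong)       # 1
class(ok$season)  # "character", because stringsAsFactors = FALSE
```

If dim() reports a single column after an import, a wrong separator is one of the first things worth checking.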

Let's now have a look at the structure of the weather data frame, using one of the most useful R functions, str().

> str(weather)
'data.frame': 365 obs. of  14 variables:
 $ day.count     : int  1 2 3 4 5 6 7 8 9 10 ...
 $ day           : int  1 2 3 4 5 6 7 8 9 10 ...
 $ month         : int  1 1 1 1 1 1 1 1 1 1 ...
 $ season        : chr  "Winter" "Winter" "Winter" "Winter" ...
 $ l.temp        : num  12.7 11.3 12.6 7.7 8.8 11.8 11.4 12.4 9.2 8.3 ...
 $ h.temp        : num  14 14.7 14.7 13.9 14.6 14.4 14.8 15.6 18.4 14.8 ...
 $ ave.temp      : num  13.4 13.5 13.6 11.3 13 13.1 13.5 14.1 12.9 11 ...
 $ l.temp.time   : chr  "01:25" "07:30" "21:00" "10:35" ...
 $ h.temp.time   : chr  "23:50" "11:15" "14:00" "01:50" ...
 $ rain          : num  32 64.8 12.7 20.1 9.4 38.9 2 1.5 0 0 ...
 $ ave.wind      : num  11.4 5.6 4.3 10.3 11.6 9.9 6.6 5.9 0.2 1.4 ...
 $ gust.wind     : num  53.1 41.8 38.6 66 51.5 57.9 38.6 33.8 16.1 24.1 ...
 $ gust.wind.time: chr  "15:45" "22:25" "00:00" "09:05" ...
 $ dir.wind      : chr  "S" "S" "SSW" "SW" ...

Based on the output of this function, and bearing in mind the goal of producing visualisations and potentially building models, how would you change the variables in the data set? Here are my thoughts:
  • Day and month are coded as integers but they should be factors (categorical variables); the same applies to the character variables representing the season and wind direction;
  • Looking at the first values for the wind direction - "S" "S" "SSW" "SW" - it seems that a 16-wind compass rose has been used. Since there are only 365 days in the year, do we have sufficient observations for each of the 16 groups, or could we try to group them into 8 principal winds or even only the 4 cardinal directions?
  • The day count (number of days since the beginning of the year) is useful, but we would like to have dates on the x axis when plotting instead of an index. This variable should therefore be transformed;
  • The three time variables show the exact minute at which the corresponding event occurred. We would most likely benefit from some aggregation, and rounding to the nearest hour seems a good option. After that we would convert the hour variable to a factor.

It is also quite common to check for missing values (coded as NA by default). As with almost everything, there are many ways to do it in R. Here are just two of them:

# One way (sum of NA over the entire data set)
> sum(is.na(weather))
[1] 0

# Another way (number of complete observations)
> nrow(weather)
[1] 365 
 
> sum(complete.cases(weather))
[1] 365 
 
> nrow(weather) == sum(complete.cases(weather))
[1] TRUE
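
Our data set has no missing values, so the checks above have nothing to locate. When there are some, it usually helps to know where they are. The sketch below uses a small made-up data frame (toy is not part of the weather data) to count the NA values per column and to find the incomplete rows.

```r
# Toy data frame with a few deliberately missing values (not the weather data)
toy <- data.frame(l.temp = c(12.7, NA, 12.6),
                  rain   = c(32.0, 64.8, NA),
                  season = c("Winter", "Winter", "Winter"))

# Number of missing values in each column
colSums(is.na(toy))

# Row numbers of the incomplete observations
which(!complete.cases(toy))
```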
 

 

Create factors

 

Variables can be easily coerced to factors using the as.factor() function, which is an abbreviated form of the main function, factor(). The first one, however, will order the levels alphabetically when the original variable is a string (class character in R), which might be something we don't want for some variables, for example, the season of the year. The code below shows how factors are created and why they differ from the other types of variables.

> # Before (365 independent strings)
> class(weather$season)
[1] "character"

> summary(weather$season)
   Length     Class      Mode 
      365 character character 

> weather$season <- factor(weather$season,
                    levels = c("Spring","Summer","Autumn","Winter"))

> # After (4 categories, ordered by "levels")
> class(weather$season)
[1] "factor"

> summary(weather$season)
Spring Summer Autumn Winter 
    92     92     91     90
 
> # Using as.factor() when the order doesn't matter or original var. is integer

> weather$day <- as.factor(weather$day)
> weather$month <- as.factor(weather$month)
> weather$dir.wind <- as.factor(weather$dir.wind)
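
The difference in level ordering is easier to see on a short vector. This is a small sketch on made-up values (season and season.f are local to the example), comparing the default alphabetical levels with an explicit levels argument.

```r
season <- c("Winter", "Spring", "Summer", "Autumn", "Winter")

# as.factor() sorts the levels alphabetically
levels(as.factor(season))

# factor() with an explicit levels argument keeps the order we ask for
season.f <- factor(season, levels = c("Spring", "Summer", "Autumn", "Winter"))
levels(season.f)

# Behind the labels, a factor stores integer codes that index into the levels
as.integer(season.f)
```

The integer codes are what str() prints after the level names, which is why the winter days appear as 4 once the season variable has been converted.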
 

 

Dealing with the wind


Let's start by checking whether there are actually 16 directions in the dir.wind variable and, if so, determine whether we have a sufficient number of observations in each group.

> # Number of unique values
> length(unique(weather$dir.wind))
[1] 16 
 
> # Absolute frequency (table function)
> table(weather$dir.wind)

  E ENE ESE   N  NE NNE NNW  NW   S  SE SSE SSW  SW   W WNW WSW 
 11  15   2  18  25   8  37 108  26  24  31  17  11   5  24   3 
 
> # Making it relative (prop.table function)
> rel <- round(prop.table(table(weather$dir.wind))*100,1)
> rel

   E  ENE  ESE    N   NE  NNE  NNW   NW    S   SE  SSE  SSW   SW    W  WNW  WSW 
 3.0  4.1  0.5  4.9  6.8  2.2 10.1 29.6  7.1  6.6  8.5  4.7  3.0  1.4  6.6  0.8 
 
> # Bringing some order to the table
> sort(rel,decreasing = TRUE)

  NW  NNW  SSE    S   NE   SE  WNW    N  SSW  ENE    E   SW  NNE    W  WSW  ESE 
29.6 10.1  8.5  7.1  6.8  6.6  6.6  4.9  4.7  4.1  3.0  3.0  2.2  1.4  0.8  0.5
 
It can be seen that the relative frequency is less than 5% for more than half of the groups. Unfortunately, we don't have the actual wind direction in degrees. But, making use of some domain knowledge and the information in the table above, let's try to give a reasonable answer to the following question: is a value in the NNW group more likely to be closer to NW or to N (assuming the direction isn't exactly the midpoint between NW and N, in which case it would be a pure NNW)? The numbers show that NW would be far more likely, since NW is by far the most frequent direction while N is comparatively rare. Even though this logic doesn't necessarily apply to all of the directions we are trying to eliminate, let's adopt this criterion and recode them as the closest ordinal direction (by definition, "NW", "NE", "SE" and "SW" are called ordinal) in a new variable.

As long as the analyst is able to explain why and how a new variable was created, and bears in mind that it may lack accuracy, it is perfectly fine to add it to the data set. It may turn out to be useful or useless, and that is something they will try to figure out during the visualisation or modelling stages.

> # Transforming wind direction variable: from 16 to 8 principal winds 
 
> # Create a copy from the original variable...
> weather$dir.wind.8 <- weather$dir.wind 
 
> # ...and then simply recode some of the values
> weather$dir.wind.8 <- ifelse(weather$dir.wind %in%  c("NNE","ENE"),
                               "NE",as.character(weather$dir.wind.8)) 
 
> weather$dir.wind.8 <- ifelse(weather$dir.wind %in% c("NNW","WNW"),
                               "NW",as.character(weather$dir.wind.8)) 
 
> weather$dir.wind.8 <- ifelse(weather$dir.wind %in% c("WSW","SSW"),
                               "SW",as.character(weather$dir.wind.8)) 
 
> weather$dir.wind.8 <- ifelse(weather$dir.wind %in% c("ESE","SSE"),
                               "SE",as.character(weather$dir.wind.8)) 
 
> # create factors, ordered by "levels" 
> weather$dir.wind.8 <- factor(weather$dir.wind.8,
                        levels = c("N","NE","E","SE","S","SW","W","NW"))
 
 
> # Checking the length of the new variable 
> length(unique(weather$dir.wind.8))
[1] 8 
 
> # A 2-way table (direction vs season), with relative frequencies calculated
> # over margin = 2 (i.e., the columns)
 
> round(prop.table(table(weather$dir.wind.8,weather$season),margin = 2)*100,1)
    
     Spring Summer Autumn Winter
  N     1.1    3.3   12.1    3.3
  NE   14.1    5.4   20.9   12.2
  E     0.0    0.0    5.5    6.7
  SE   13.0   14.1   20.9   14.4
  S     5.4   12.0    4.4    6.7
  SW    6.5    8.7    2.2   16.7
  W     2.2    0.0    1.1    2.2
  NW   57.6   56.5   33.0   37.8 
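
As an aside, the four recoding calls used above to build dir.wind.8 can also be collapsed into a single lookup. The sketch below is one possible alternative, not the approach used in this tutorial. The names dir16 and to8 are made up for the illustration, and it runs on a handful of sample values rather than the full data set.

```r
# A few sample directions (illustrative, not the full variable)
dir16 <- c("S", "SSW", "NNW", "ENE", "WNW", "N")

# Named vector: names are the directions to replace, values their replacements
to8 <- c(NNE = "NE", ENE = "NE", NNW = "NW", WNW = "NW",
         WSW = "SW", SSW = "SW", ESE = "SE", SSE = "SE")

# Replace only the values found in the lookup, keep the others unchanged
dir8 <- ifelse(dir16 %in% names(to8), to8[dir16], dir16)
dir8
```

The result is the same as with the chained calls, but the mapping lives in one place and is easier to review.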


The function ifelse() is one of the classical ways to populate a new variable based on the value of one or more other variables, or on a calculation over them. When the original variable is numeric, cut() is often simpler and is used instead.
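
For completeness, here is a minimal sketch of cut() on a numeric variable. The break points (0 and 10) and the labels are arbitrary choices made for this illustration, not thresholds used anywhere else in this series.

```r
# A few rain values (illustrative)
rain <- c(0, 0.5, 12.7, 64.8)

# Intervals are closed on the right by default: (-Inf, 0], (0, 10], (10, Inf)
rain.level <- cut(rain,
                  breaks = c(-Inf, 0, 10, Inf),
                  labels = c("none", "light", "heavy"))

rain.level
table(rain.level)
```

The result is already a factor, with the levels in the order given by labels.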

Just as a side note, it is uncommon and considered bad practice to use for loops and if statements in R when the goal is to loop through the rows of a column and apply some function when a certain condition is met. R supports vectorisation, which is a much more efficient way to accomplish the same thing. In fact, the ifelse() function is the vectorised way that makes the for-if-else construct unnecessary.
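
To make the comparison concrete, the sketch below performs the same recoding twice on a few sample values: first with an explicit for loop and if statement, then with ifelse(). The object names are local to the example.

```r
dir <- c("NNW", "S", "WNW", "NE")

# Loop version: visit each element and test it individually
out.loop <- character(length(dir))
for (i in seq_along(dir)) {
  if (dir[i] %in% c("NNW", "WNW")) {
    out.loop[i] <- "NW"
  } else {
    out.loop[i] <- dir[i]
  }
}

# Vectorised version: one call that operates on the whole vector
out.vec <- ifelse(dir %in% c("NNW", "WNW"), "NW", dir)

identical(out.loop, out.vec)  # TRUE
```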

 

 

We need a date


To create a date in R, we just need to pass a suitably formatted string to the as.Date() function. We can then add and subtract days using the usual arithmetic operators. Here is the code to calculate the date based on the day.count variable.

> first.day <- "2014-01-01"
> class(first.day)
[1] "character" 
 
> first.day <- as.Date(first.day)
> class(first.day)
[1] "Date" 
 
> # Here is where we actually calculate the date
> weather$date  <- first.day + weather$day.count - 1 
 
> head(weather$day.count)
[1] 1 2 3 4 5 6 
 
> head(weather$date)
[1] "2014-01-01" "2014-01-02" "2014-01-03" "2014-01-04" "2014-01-05" "2014-01-06"
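
A short sketch of the same arithmetic on three selected day counts (1, 32 and 365, chosen only for the illustration) shows that the result is a true date that handles month boundaries for us.

```r
first.day <- as.Date("2014-01-01")
day.count <- c(1, 32, 365)

# Day 1 is the first day itself, hence the "- 1"
dates <- first.day + day.count - 1
dates  # "2014-01-01" "2014-02-01" "2014-12-31"

# Subtracting two dates gives a difference in days
as.numeric(dates[3] - dates[1])  # 364

# format() pulls individual parts back out, here the month number
format(dates, "%m")  # "01" "02" "12"
```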

 

 

Not so hard times


The last thing we need to do is to round (to the nearest hour) the time at which a certain event occurred (lowest temperature, highest temperature, and wind gust). Working with times in R is a bit more complicated than working with dates, because there are two alternative classes to represent them: POSIXct and POSIXlt. The first stores the date and time as a single number, representing the seconds since the UNIX epoch (Jan 1, 1970); the second stores the date and time as a list, with components for seconds, minutes, hours and years, among others. Since we are interested in extracting the hour information, after rounding, we will use the more complete POSIXlt class.
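
The difference between the two classes can be seen on a single timestamp. In this sketch the time zone is fixed with tz = "UTC", which is an assumption made here so that the result does not depend on the local settings; the object names are local to the example.

```r
# tz = "UTC" is an assumption of this example, not something the data set specifies
t.ct     <- as.POSIXct("2014-01-01 01:25", tz = "UTC")
midnight <- as.POSIXct("2014-01-01 00:00", tz = "UTC")

# POSIXct is a single number of seconds under the hood
as.numeric(t.ct) - as.numeric(midnight)  # 5100 seconds, i.e. 1 h 25 min

# POSIXlt exposes the individual components by name
t.lt <- as.POSIXlt(t.ct)
t.lt$hour  # 1
t.lt$min   # 25
```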

> # Store date and time as POSIXlt class
> l.temp.time.date <- as.POSIXlt(paste(weather$date,weather$l.temp.time))
> head(l.temp.time.date)
[1] "2014-01-01 01:25:00 GMT" "2014-01-02 07:30:00 GMT" "2014-01-03 21:00:00 GMT"
[4] "2014-01-04 10:35:00 GMT" "2014-01-05 01:40:00 GMT" "2014-01-06 19:35:00 GMT" 
 
> # Round to the nearest hour
> l.temp.time.date <- round(l.temp.time.date,"hours")
> head(l.temp.time.date)
[1] "2014-01-01 01:00:00 GMT" "2014-01-02 08:00:00 GMT" "2014-01-03 21:00:00 GMT"
[4] "2014-01-04 11:00:00 GMT" "2014-01-05 02:00:00 GMT" "2014-01-06 20:00:00 GMT" 
 
> # Which attributes are stored in the POSIXlt time variable?
> attributes(l.temp.time.date)
$names
 [1] "sec"    "min"    "hour"   "mday"   "mon"    "year"   "wday"   "yday"   "isdst" 
[10] "zone"   "gmtoff"

$class
[1] "POSIXlt" "POSIXt" 

$tzone
[1] ""    "GMT" "BST"

> # Extract the value of the hour component as a number and add it to the data set
> weather$l.temp.hour <- l.temp.time.date$hour
> head(weather$l.temp.hour)
[1]  1  8 21 11  2 20 
 
> # Lastly, the integer is converted to factor
> weather$l.temp.hour <- as.factor(weather$l.temp.hour)
> head(weather$l.temp.hour)
[1] 1  8  21 11 2  20
Levels: 0 1 2 3 4 5 6 7 8 9 10 11 12 17 18 19 20 21 22 23
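
The same steps have to be repeated for the other two time variables. One way to avoid copying the code is to wrap it in a small function. The sketch below is only a suggestion: hour_factor is a hypothetical helper, tz = "UTC" is an assumption that sidesteps daylight-saving gaps, and the toy data frame holds just the first three rows shown earlier.

```r
# Hypothetical helper: combine a date with an "HH:MM" string, round to the
# nearest hour and return the hour as a factor
hour_factor <- function(date, time) {
  stamp <- as.POSIXlt(paste(date, time), tz = "UTC")  # tz is an assumption
  as.factor(round(stamp, "hours")$hour)
}

# First three rows of the two remaining time variables
toy <- data.frame(date           = as.Date("2014-01-01") + 0:2,
                  h.temp.time    = c("23:50", "11:15", "14:00"),
                  gust.wind.time = c("15:45", "22:25", "00:00"),
                  stringsAsFactors = FALSE)

toy$h.temp.hour    <- hour_factor(toy$date, toy$h.temp.time)
toy$gust.wind.hour <- hour_factor(toy$date, toy$gust.wind.time)
toy
```

Note that 23:50 rounds forward to midnight, so its hour is 0 rather than 24.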

 

 

The prepared data set


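After all the processing we have done, let's call str() again to see what our final data set looks like. The rounding steps shown for l.temp.time were applied in the same way to h.temp.time and gust.wind.time, which is where h.temp.hour and gust.wind.hour come from. We now have a date variable (to plot a time series) and several factor variables (commonly used to identify different groups of a numerical variable when creating a data visualisation).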

> str(weather)
'data.frame': 365 obs. of  19 variables:
 $ day.count     : int  1 2 3 4 5 6 7 8 9 10 ...
 $ day           : Factor w/ 31 levels "1","2","3","4",..: 1 2 3 4 5 6 7 8 9 10 ...
 $ month         : Factor w/ 12 levels "1","2","3","4",..: 1 1 1 1 1 1 1 1 1 1 ...
 $ season        : Factor w/ 4 levels "Spring","Summer",..: 4 4 4 4 4 4 4 4 4 4 ...
 $ l.temp        : num  12.7 11.3 12.6 7.7 8.8 11.8 11.4 12.4 9.2 8.3 ...
 $ h.temp        : num  14 14.7 14.7 13.9 14.6 14.4 14.8 15.6 18.4 14.8 ...
 $ ave.temp      : num  13.4 13.5 13.6 11.3 13 13.1 13.5 14.1 12.9 11 ...
 $ l.temp.time   : chr  "01:25" "07:30" "21:00" "10:35" ...
 $ h.temp.time   : chr  "23:50" "11:15" "14:00" "01:50" ...
 $ rain          : num  32 64.8 12.7 20.1 9.4 38.9 2 1.5 0 0 ...
 $ ave.wind      : num  11.4 5.6 4.3 10.3 11.6 9.9 6.6 5.9 0.2 1.4 ...
 $ gust.wind     : num  53.1 41.8 38.6 66 51.5 57.9 38.6 33.8 16.1 24.1 ...
 $ gust.wind.time: chr  "15:45" "22:25" "00:00" "09:05" ...
 $ dir.wind      : Factor w/ 16 levels "E","ENE","ESE",..: 9 9 12 13 11 11 10 10 4 7 ...
 $ dir.wind.8    : Factor w/ 8 levels "N","NE","E","SE",..: 5 5 6 6 4 4 4 4 1 8 ...
 $ date          : Date, format: "2014-01-01" "2014-01-02" ...
 $ l.temp.hour   : Factor w/ 20 levels "0","1","2","3",..: 2 9 18 12 3 17 8 1 8 9 ...
 $ h.temp.hour   : Factor w/ 19 levels "0","1","2","3",..: 1 8 11 3 10 1 12 10 11 9 ...
 $ gust.wind.hour: Factor w/ 24 levels "0","1","2","3",..: 17 23 1 10 15 9 13 15 15 15 ...

  

 

Final notes on data "wrangling" and R

As we saw in Part 1, cleaning and transforming the raw data is almost always required before starting the actual analysis. In fact, in many real-life cases, the visualisation and modelling stages are easier and less time-consuming than getting the data ready to be explored.

The R language is extremely powerful for creating visualisations and building models. It is often considered, however, that the language is not the most user-friendly when it comes to preparing the data (still true, but not as much as a few years ago). Here are some of the alternatives we, the analysts, have at our disposal:
  • For reasonably simple and tidy data sets, like the one we have been using in this tutorial, Excel is usually sufficient; a combination of Pivot Tables, Vlookup() and/or Index()/Match() and a few basic formatting functions would have accomplished what we have done here;
  • When some more advanced processing is required (for instance, the use of regex), and even though R supports regular expressions and provides functions for them in its base package, some prefer to use other languages (in this case, Perl or Python would be good options);
  • An increasingly popular alternative is to use other R packages that provide wrapper functions for the ones in base R. For instance, we would probably have dealt with our three time variables more easily using the lubridate package than using the base functions; had we needed to format strings, the stringr package would have provided us with several consistent wrappers, making it simpler to process the text.

Onward and upward to Part 3 of this series of tutorials, where we will be creating a few visualisations to gain insight from our weather data.