Weather Data Analysis Example:Part 2

本文介绍了一个天气数据集的预处理过程,包括变量类型转换、数据检查等步骤,并使用R语言进行实际操作演示。
Part 2: Data Preparation

In Part 1 I have introduced the weather data set we will be using in this series of tutorials. We are now going to have the data prepared for the subsequent EDA. We will recode and transform variables, change their types, and perform some basic data checks. Feel free to follow along with the analysis ( click here to download the weather data), bearing in mind you can type ?function_name to get help about some specific R function, for instance, ?head

Importing the data


To start off, let's read in the data to an R data frame and run some basic commands:

# Make sure the file is in your current working directory
 
> weather <- read.csv("weather_2014.csv",sep=";",stringsAsFactors=FALSE)
 
> dim(weather)
[1] 365  14

> names(weather)
 [1] "day.count"      "day"            "month"          "season"        
 [5] "l.temp"         "h.temp"         "ave.temp"       "l.temp.time"   
 [9] "h.temp.time"    "rain"           "ave.wind"       "gust.wind"     
[13] "gust.wind.time" "dir.wind"
 
> head(weather)
  day.count day month season l.temp h.temp ave.temp l.temp.time h.temp.time rain
1         1   1     1 Winter   12.7   14.0     13.4       01:25       23:50 32.0
2         2   2     1 Winter   11.3   14.7     13.5       07:30       11:15 64.8
3         3   3     1 Winter   12.6   14.7     13.6       21:00       14:00 12.7
4         4   4     1 Winter    7.7   13.9     11.3       10:35       01:50 20.1
5         5   5     1 Winter    8.8   14.6     13.0       01:40       12:55  9.4
6         6   6     1 Winter   11.8   14.4     13.1       19:35       00:05 38.9
  ave.wind gust.wind gust.wind.time dir.wind
1     11.4      53.1          15:45        S
2      5.6      41.8          22:25        S
3      4.3      38.6          00:00      SSW
4     10.3      66.0          09:05       SW
5     11.6      51.5          13:50      SSE
6      9.9      57.9          08:10      SSE 
 
 
It seems we have correctly loaded the data into R. Notice that the variables in the data file are separated not by a comma, but by a semicolon, and hence the need to set the sep = ";" argument. We also told R not to import strings as factors (i.e., categorical variables). In many cases some character variables are indeed strings and some integer variables are actually factors. This means there is almost always manual work to be done after importing, and therefore I prefer to read in the variables without any initial processing.

Let's now have a look at the structure of the weather data frame, using one of the most useful R functions, str().

> str(weather)
'data.frame': 365 obs. of  14 variables:
 $ day.count     : int  1 2 3 4 5 6 7 8 9 10 ...
 $ day           : int  1 2 3 4 5 6 7 8 9 10 ...
 $ month         : int  1 1 1 1 1 1 1 1 1 1 ...
 $ season        : chr  "Winter" "Winter" "Winter" "Winter" ...
 $ l.temp        : num  12.7 11.3 12.6 7.7 8.8 11.8 11.4 12.4 9.2 8.3 ...
 $ h.temp        : num  14 14.7 14.7 13.9 14.6 14.4 14.8 15.6 18.4 14.8 ...
 $ ave.temp      : num  13.4 13.5 13.6 11.3 13 13.1 13.5 14.1 12.9 11 ...
 $ l.temp.time   : chr  "01:25" "07:30" "21:00" "10:35" ...
 $ h.temp.time   : chr  "23:50" "11:15" "14:00" "01:50" ...
 $ rain          : num  32 64.8 12.7 20.1 9.4 38.9 2 1.5 0 0 ...
 $ ave.wind      : num  11.4 5.6 4.3 10.3 11.6 9.9 6.6 5.9 0.2 1.4 ...
 $ gust.wind     : num  53.1 41.8 38.6 66 51.5 57.9 38.6 33.8 16.1 24.1 ...
 $ gust.wind.time: chr  "15:45" "22:25" "00:00" "09:05" ...
 $ dir.wind      : chr  "S" "S" "SSW" "SW" ...

Based on the output of this function, and having in the mind the goal of producing visualisations and potentially build models, how would you change the variables in the data set? Here are my thoughts:
  • Day and month are coded as integers but they should be factors (categorical variables); the same applies to the character variables representing the season and wind direction;
  • Looking at the first values for the wind direction - "S" "S" "SSW" "SW" - it seems that a 16-wind compass rose has been used. Since there are only 365 days in the year, do we have sufficient observations for each of the 16 groups, or could we try to group them into 8 principal winds or even only the 4 cardinal directions?
  • The day count (number of days since the beginning of the year) is useful, but we would like to have dates on the x axis when plotting instead of an index. This variable should therefore be transformed;
  • The three time variables show the exact minute where the corresponding event occurred. We would most likely benefit by doing some aggregation, and rounding to the nearest hour seems a good option. After that we would convert the hour variable to factor.
It is also quite common to check for missing values (coded as NA by default). As with almost everything, there are many ways to do it in R. Here are just two of them:

# One way (sum of NA over the entire data set)
> sum(is.na(weather))
[1] 0

# Another way (number of complete observations)
> nrow(weather)
[1] 365 
 
> sum(complete.cases(weather))
[1] 365 
 
> nrow(weather) == sum(complete.cases(weather))
[1] TRUE
 

 

Create factors

 

Variables can be easily coerced to factors using the as.factor() function, which is an abbreviated form of the main function, factor(). The first one, however, will order the levels by the alphabet when the original variable is a string (class character in R), which might be something we don't want for some variables, for example, the season of the year. The code below shows how factors are created and why they differ from the other types of variables.

> # Before (365 independent strings)
> class(weather$season)
[1] "character"

> summary(weather$season)
   Length     Class      Mode 
      365 character character 

> weather$season <- factor(weather$season,
                    levels = c("Spring","Summer","Autumn","Winter"))

> # After (4 categories, ordered by "levels")
> class(weather$season)
[1] "factor"

> summary(weather$season)
Spring Summer Autumn Winter 
    92     92     91     90
 
> # Using as.factor() when the order doesn't matter or original var. is integer

> weather$day <- as.factor(weather$day)
> weather$month <- as.factor(weather$month)
> weather$dir.wind <- as.factor(weather$dir.wind)
 

 

Dealing with the wind


Let's start by checking whether there are actually 16 directions in the dir.windvariable and, if so, determine whether we have sufficient number of observations in each group.

> # Number of unique values
> length(unique(weather$dir.wind))
[1] 16 
 
> # Absolute frequency (table function)
> table(weather$dir.wind)

  E ENE ESE   N  NE NNE NNW  NW   S  SE SSE SSW  SW   W WNW WSW 
 11  15   2  18  25   8  37 108  26  24  31  17  11   5  24   3 
 
> # Making it relative (prop.table function)
> rel <- round(prop.table(table(weather$dir.wind))*100,1)
> rel

   E  ENE  ESE    N   NE  NNE  NNW   NW    S   SE  SSE  SSW   SW    W  WNW  WSW 
 3.0  4.1  0.5  4.9  6.8  2.2 10.1 29.6  7.1  6.6  8.5  4.7  3.0  1.4  6.6  0.8 
 
> # Bringing some order to the table
> sort(rel,decreasing = TRUE)

  NW  NNW  SSE    S   NE   SE  WNW    N  SSW  ENE    E   SW  NNE    W  WSW  ESE 
29.6 10.1  8.5  7.1  6.8  6.6  6.6  4.9  4.7  4.1  3.0  3.0  2.2  1.4  0.8  0.5
 
It can be seen that the relative frequency is less than 5% for more than half of the groups. Unfortunately, we don't have the actual wind direction in degrees. But, making use of some domain knowledge and the information in the table above, let's try to give a reasonable answer to the following question: Is it more likely for a value in the NNW group to be closer to NW to N?
(assuming the direction isn't exactly the midpoint between NW and N, in which case it would be a pure NNW). The numbers show that NW would be far more likely. Even though this logic doesn't necessarily apply to all of the directions we are trying to eliminate, let's assume this criterion of recode them as the closest ordinal direction (by definition, "NW","NE","SE", "SW" are called ordinal) in a new variable.

As long as the analyst knows to explain why and how some new variable was created, and bears in mind it may lack accuracy, it is perfectly fine to add it to the data set. It may be useful or useless, and that is something he will try to figure out during the stages of visualisation or modelling.

> # Transforming wind direction variable: from 16 to 8 principal winds 
 
> # Create a copy from the original variable...
> weather$dir.wind.8 <- weather$dir.wind 
 
> # ...and then simply recode some of the variables
> weather$dir.wind.8 <- ifelse(weather$dir.wind %in%  c("NNE","ENE"),
                               "NE",as.character(weather$dir.wind.8)) 
 
> weather$dir.wind.8 <- ifelse(weather$dir.wind %in% c("NNW","WNW"),
                               "NW",as.character(weather$dir.wind.8)) 
 
> weather$dir.wind.8 <- ifelse(weather$dir.wind %in% c("WSW","SSW"),
                               "SW",as.character(weather$dir.wind.8)) 
 
> weather$dir.wind.8 <- ifelse(weather$dir.wind %in% c("ESE","SSE"),
                               "SE",as.character(weather$dir.wind.8)) 
 
> # create factors, ordered by "levels" 
> weather$dir.wind.8 <- factor(weather$dir.wind.8,
                        levels = c("N","NE","E","SE","S","SW","W","NW"))
 
 
> # Checking the length of the new variable 
> length(unique(weather$dir.wind.8))
[1] 8 
 
> # A 2-way table (direction vs season), with relative frequencies calculated
    over margin = 2 (i.e., the columns)  
 
> round(prop.table(table(weather$dir.wind.8,weather$season),margin = 2)*100,1)
    
     Spring Summer Autumn Winter
  N     1.1    3.3   12.1    3.3
  NE   14.1    5.4   20.9   12.2
  E     0.0    0.0    5.5    6.7
  SE   13.0   14.1   20.9   14.4
  S     5.4   12.0    4.4    6.7
  SW    6.5    8.7    2.2   16.7
  W     2.2    0.0    1.1    2.2
  NW   57.6   56.5   33.0   37.8 


The function ifelse() is one of the classical ways to populate a new variable based on the value, or calculation over the value, of any other(s). When the original variable is numeric, cut() is often simpler and used instead.

Just as a side note, it is uncommon and considered bad practice to use for loops and if statements in R when the goal is to loop through the rows of a column and apply some function when a certain condition is met. R supports vectorisation, which is a much more efficient way to accomplish the same thing. In fact, the ifelse() function is the vectorised way that makes the for-if-else construct unnecessary.

 

 

We need a date


To create a date in R, we just need to pass to the function as.Date() a string with an appropriate format. We can then add and subtract days using the usual math operators. Here is the code to calculate the date based on the day.count variable.

> first.day <- "2014-01-01"
> class(first.day)
[1] "character" 
 
> first.day <- as.Date(first.day)
> class(first.day)
[1] "Date" 
 
> Here is where we actually calculate the date 
> weather$date  <- first.day + weather$day.count - 1 
 
> head(weather$day.count)
[1] 1 2 3 4 5 6 
 
> head(weather$date)
[1] "2014-01-01" "2014-01-02" "2014-01-03" "2014-01-04" "2014-01-05" "2014-01-06"

 

 

Not so hard times


The last thing we need to do is to round (to the nearest hour) the time at which a certain event occurred  (lower temperature, higher temperature, and wind gust). Working with times in R is a bit more complicated than working with dates, with the former having two alternative classes to represent it: POSIXct and POSIXlt. The first stores the date and time as a simple number, representing the seconds since the UNIX epoch (Jan 1, 1970); the second stores the date and time in a list, with elements for seconds, hours, years, among others. Since we are interested in extracting the hour information, after rounding, we will use the more complete POSIXlt class.

> # Store date and time as POSIXlt class
> l.temp.time.date <- as.POSIXlt(paste(weather$date,weather$l.temp.time))
> head(l.temp.time.date)
[1] "2014-01-01 01:25:00 GMT" "2014-01-02 07:30:00 GMT" "2014-01-03 21:00:00 GMT"
[4] "2014-01-04 10:35:00 GMT" "2014-01-05 01:40:00 GMT" "2014-01-06 19:35:00 GMT" 
 
> # Round to the nearest hour
> l.temp.time.date <- round(l.temp.time.date,"hours")
> head(l.temp.time.date)
[1] "2014-01-01 01:00:00 GMT" "2014-01-02 08:00:00 GMT" "2014-01-03 21:00:00 GMT"
[4] "2014-01-04 11:00:00 GMT" "2014-01-05 02:00:00 GMT" "2014-01-06 20:00:00 GMT" 
 
> # Which attributes are stored in the POSIXlt time variable?
> attributes(l.temp.time.date)
$names
 [1] "sec"    "min"    "hour"   "mday"   "mon"    "year"   "wday"   "yday"   "isdst" 
[10] "zone"   "gmtoff"

$class
[1] "POSIXlt" "POSIXt" 

$tzone
[1] ""    "GMT" "BST"

> # Extract the value of the hour attribute as a number and add it to the data set
> weather$l.temp.hour <- l.temp.time.date [["hour"]]
> head(weather$l.temp.hour)
[1]  1  8 21 11  2 20 
 
> # Lastly, the integer is converted to factor
> weather$l.temp.hour <- as.factor(weather$l.temp.hour)
> head(weather$l.temp.hour)
[1] 1  8  21 11 2  20
Levels: 0 1 2 3 4 5 6 7 8 9 10 11 12 17 18 19 20 21 22 23

 

 

The prepared data set


After all the processing we have done, let's call str() again to see what our final data set looks like. We now have a date variable (to plot a time series) and several factor variables (commonly used to identify different groups of a numerical variable when creating a data visualisation).

> str(weather)
'data.frame': 365 obs. of  19 variables:
 $ day.count     : int  1 2 3 4 5 6 7 8 9 10 ...
 $ day           : Factor w/ 31 levels "1","2","3","4",..: 1 2 3 4 5 6 7 8 9 10 ...
 $ month         : Factor w/ 12 levels "1","2","3","4",..: 1 1 1 1 1 1 1 1 1 1 ...
 $ season        : Factor w/ 4 levels "Spring","Summer",..: 4 4 4 4 4 4 4 4 4 4 ...
 $ l.temp        : num  12.7 11.3 12.6 7.7 8.8 11.8 11.4 12.4 9.2 8.3 ...
 $ h.temp        : num  14 14.7 14.7 13.9 14.6 14.4 14.8 15.6 18.4 14.8 ...
 $ ave.temp      : num  13.4 13.5 13.6 11.3 13 13.1 13.5 14.1 12.9 11 ...
 $ l.temp.time   : chr  "01:25" "07:30" "21:00" "10:35" ...
 $ h.temp.time   : chr  "23:50" "11:15" "14:00" "01:50" ...
 $ rain          : num  32 64.8 12.7 20.1 9.4 38.9 2 1.5 0 0 ...
 $ ave.wind      : num  11.4 5.6 4.3 10.3 11.6 9.9 6.6 5.9 0.2 1.4 ...
 $ gust.wind     : num  53.1 41.8 38.6 66 51.5 57.9 38.6 33.8 16.1 24.1 ...
 $ gust.wind.time: chr  "15:45" "22:25" "00:00" "09:05" ...
 $ dir.wind      : Factor w/ 16 levels "E","ENE","ESE",..: 9 9 12 13 11 11 10 10 4 7 ...
 $ dir.wind.8    : Factor w/ 8 levels "N","NE","E","SE",..: 5 5 6 6 4 4 4 4 1 8 ...
 $ date          : Date, format: "2014-01-01" "2014-01-02" ...
 $ l.temp.hour   : Factor w/ 20 levels "0","1","2","3",..: 2 9 18 12 3 17 8 1 8 9 ...
 $ h.temp.hour   : Factor w/ 19 levels "0","1","2","3",..: 1 8 11 3 10 1 12 10 11 9 ...
 $ gust.wind.hour: Factor w/ 24 levels "0","1","2","3",..: 17 23 1 10 15 9 13 15 15 15 ...

  

 

Final notes on data "wrangling" and R

As we have seen in Part 1, the process of cleaning and transforming the raw data is almost always required prior to starting the actual analysis. In fact, in many real life cases, the visualisation and modeling stages are easier and less time-consuming than having the data ready to be explored.

The R language is extremely powerful to create visualisations and build models. It is often considered, however, that the language is not the most user-friendly when it comes to prepare the data (still true, but not as much as a few years ago). Here are some of the alternatives we, the analysts, have at our disposal:
  • For reasonably simple and tidy data sets, like the one we have been using in this tutorial, Excel is usually sufficient; a combination of Pivot Tables,Vlookup() and/or Index()/Match() and a few basic formatting functions would have accomplished the same we have done here;
  • When some more advanced processing is required (for instance, the use of regex), and even though R supports regular expressions and provides functions in its base package, some prefer to use other languages (in this case, Perl or Python would be good options);
  • An ever increasing alternative option is to use other R packages that provide wrapper functions for the ones in its base. For instance, we would probably have dealt with our three time variables easily using thelubridate package than using the base functions; had we needed to format strings and thestringr package would provide us with several consistent wrappers, making it simpler to process the text.
Onward and upward to Part 3 of this series of tutorials, where we will be creating a few  visualisations to gain insight from our weather data.        
内容概要:本文系统介绍了算术优化算法(AOA)的基本原理、核心思想及Python实现方法,并通过图像分割的实际案例展示了其应用价值。AOA是一种基于种群的元启发式算法,其核心思想来源于四则运算,利用乘除运算进行全局勘探,加减运算进行局部开发,通过数学优化器加速函数(MOA)和数学优化概率(MOP)动态控制搜索过程,在全局探索与局部开发之间实现平衡。文章详细解析了算法的初始化、勘探与开发阶段的更新策略,并提供了完整的Python代码实现,结合Rastrigin函数进行测试验证。进一步地,以Flask框架搭建前后端分离系统,将AOA应用于图像分割任务,展示了其在实际工程中的可行性与高效性。最后,通过收敛速度、寻优精度等指标评估算法性能,并提出自适应参数调整、模型优化和并行计算等改进策略。; 适合人群:具备一定Python编程基础和优化算法基础知识的高校学生、科研人员及工程技术人员,尤其适合从事人工智能、图像处理、智能优化等领域的从业者;; 使用场景及目标:①理解元启发式算法的设计思想与实现机制;②掌握AOA在函数优化、图像分割等实际问题中的建模与求解方法;③学习如何将优化算法集成到Web系统中实现工程化应用;④为算法性能评估与改进提供实践参考; 阅读建议:建议读者结合代码逐行调试,深入理解算法流程中MOA与MOP的作用机制,尝试在不同测试函数上运行算法以观察性能差异,并可进一步扩展图像分割模块,引入更复杂的预处理或后处理技术以提升分割效果。
评论
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值