Weather Data Analysis Example: Part 3b

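This article uses exploratory data analysis (EDA) with ggplot2 to examine how different factors relate to the amount of rain. It finds a notable correlation between wind speed and rainfall, and this correlation persists across the seasons.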
Part 3b: EDA with ggplot2

In Part 3a I introduced the plotting system ggplot2. I discussed its concepts and syntax in some detail, and then created a few general plots using the weather data set we've been working with in this series of tutorials. My goal was to show that, in ggplot2, a nice-looking and interesting graph can usually be created with just a few lines of code. Given the positive feedback that post received, I believe I had some success. Who knows, some readers may decide to give ggplot2 a try with their own data. It is often said that a picture is worth a thousand words, yet I often see highly experienced R users, after doing really complex data analyses, plot the results in a way that falls short of the quality of their analytical work.

In Part 3b, we will continue to rely on visualisations to explore the data, but now with the goal of tackling the following question: Are there any good predictors, in our data, for the occurrence of rain on a given day? The numeric variable representing the amount of rain will be our dependent variable, also known as response or outcome variable, and all the remaining ones will be potential independent variables, also known as predictors or explanatory variables.

After framing the question, and before fitting any model, EDA techniques (such as visualisation) are used to gain insight from the data. Here are the steps I usually take at this stage:
  • Analyse the (continuous) dependent variable - at least, make a histogram of its distribution; when applicable, plot the variable against time to see the trend. Can the continuous variable be transformed into a binary one (i.e., did it rain or not on a particular day?), or into a multicategorical one (i.e., was the rain none, light, moderate, or heavy on a particular day?)?
  • Search for correlations between the dependent variable and the continuous independent variables - are there any strong correlations? Are they linear or non-linear?
  • Do the same as above but try to control for other variables (faceting in ggplot2 is very useful for this), in order to assess confounding and effect modification. Does the association between two continuous variables hold for different levels of a third variable, or is it modified by them? (e.g., if there is a strong positive correlation between the rain amount and the wind gust maximum speed, does that hold regardless of the season of the year, or does it happen only in the winter?)
  • Search for associations between the dependent variable and the categorical independent variables (factors) - does the mean or median of the dependent variable change depending on the level of the factor? What about the outliers: are they evenly distributed across all levels, or do they seem to be present in only a few of them?
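As a minimal sketch of the multicategorical option mentioned in the first step, the base R function cut() can bin a continuous amount into ordered classes. The vector and the 1, 10 and 30 mm thresholds below are illustrative assumptions, not values taken from the weather data set or from any official classification.

```r
# Hypothetical daily rain amounts in mm (not the tutorial's data)
rain_mm <- c(0, 0.3, 0.8, 1.2, 4.9, 12.5, 30.1, 74.9)

# Illustrative cut-offs only: [0,1) none, [1,10) light, [10,30) moderate, 30+ heavy
rain_level <- cut(rain_mm,
                  breaks = c(-Inf, 1, 10, 30, Inf),
                  labels = c("None", "Light", "Moderate", "Heavy"),
                  right  = FALSE)   # intervals are closed on the left

# Number of days in each class
table(rain_level)
```

Setting right = FALSE makes a day with exactly 1 mm fall in the "Light" class, which matches the "at least 1 mm" convention used later in this post.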
So, let's now do some analyses, having the framework described above in mind.

 

Exploring the dependent variable - daily rain amount

 

 

# Time series of the daily rain amount, with smoother curve

ggplot(weather, aes(date,rain)) +
  geom_point(aes(colour = rain)) +
  geom_smooth(colour = "blue", size = 1) +
  scale_colour_gradient2(low = "green", mid = "orange",high = "red", midpoint = 20) +
  scale_y_continuous(breaks = seq(0,80,20)) +
  xlab("Date") +
  ylab("Rain (mm)") +
  ggtitle("Daily rain amount")
 
  

 

# Histogram of the daily rain amount

ggplot(weather,aes(rain)) + 
  geom_histogram(binwidth = 1,colour = "blue", fill = "darkgrey") +
  scale_x_continuous(breaks = seq(0,80,5)) +
  scale_y_continuous(breaks = seq(0,225,25)) +
  xlab("Rain (mm)") +
  ylab ("Frequency (days)") +
  ggtitle("Daily rain amount distribution")
 
 
    The time series plot shows that the daily rain amount varies wildly throughout the year. There are many dry days interspersed with stretches of consecutive wet ones, often severe, especially in the autumn and winter seasons. The histogram not only confirms what was said above, but also shows that the distribution is extremely right-skewed. As shown below, both informally (comparing the mean to the median), and formally (calculating the actual value), the positive skewness remains even after removing all the days where it did not rain.

    > # Heavily right-skewed distribution
    > summary(weather$rain)
       Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
      0.000   0.000   0.300   5.843   5.300  74.900 
     
    > # Right-skewness is still there after removing all the dry days
    > summary(subset(weather, rain > 0)$rain)
       Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
       0.30    1.15    5.10   11.17   16.10   74.90 
     
    > # Formal calculation of skewness (e1071 package)
    > library(e1071) 
     
    > skewness(weather$rain)
    [1] 2.99
    > skewness(subset(weather, rain >0)$rain)
    [1] 2.04
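
For readers who want to see what the skewness figure measures, here is a rough base R sketch of the moment-based coefficient (the third central moment divided by the second central moment raised to the power 1.5). The helper name skew_g1 and the toy vectors are made up for illustration; e1071::skewness() applies a small sample-size adjustment by default, so its values can differ slightly from this version.

```r
# Moment-based skewness: g1 = m3 / m2^(3/2), with m_k the k-th central moment
# (skew_g1 is a hypothetical helper, not part of any package)
skew_g1 <- function(x) {
  x <- x[!is.na(x)]          # drop missing values
  d <- x - mean(x)           # deviations from the mean
  mean(d^3) / mean(d^2)^1.5
}

x_symmetric <- c(1, 2, 3, 4, 5)     # symmetric around 3
x_right     <- c(0, 0, 0, 1, 10)    # a long right tail, like rain amounts

skew_g1(x_symmetric)   # zero for a symmetric sample
skew_g1(x_right)       # positive for a right-skewed sample
```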
     
    It should be clear at this point that one possible approach would be to dichotomise the dependent variable (rain vs. no rain). Note that it is common to count as rainy only those days where the total amount was at least 1 mm (to allow for measurement errors), and that's the criterion we will adopt here. Here is the code to do it and a few interesting summaries.

    > # Create binary outcome (rained = {Yes, No})
     
    > # Number of dry days
    > nrow(subset(weather, rain == 0))
    [1] 174 
     
    > # Number of days that will also be considered dry days
    > nrow(subset(weather, rain <1 & rain >0))
    [1] 45
     
    > # The new binary variable is called "rained"
    > weather$rained <- ifelse(weather$rain >= 1, "Yes", "No")
     
    > # Dry and wet days (absolute)
    > table(rained = weather$rained) 
     
    rained
     No Yes 
    219 146 
     
    > # Dry and wet days (relative)
    > prop.table(table(rained = weather$rained)) 
     
    rained
     No Yes 
    0.6 0.4
     
    Porto is one of the wettest cities in Europe for a reason. There was rain on exactly 40% of the days of the year, and this is counting the days with rain < 1 mm as dry. Had we set the cutoff at 0 mm, more than 50% of the days in 2014 would have been considered wet.
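
The two percentages follow directly from the counts printed above (174 days with no rain at all, 45 days with some rain below 1 mm, 365 days in total). A quick check of the arithmetic, with variable names chosen here only for illustration:

```r
# Counts taken from the console output earlier in this section
n_days  <- 365   # 219 dry + 146 wet days
n_zero  <- 174   # days with rain == 0
n_trace <- 45    # days with 0 < rain < 1 mm

# Share of wet days with the 1 mm cutoff used in this post
wet_at_1mm <- (n_days - n_zero - n_trace) / n_days

# Share of wet days if any rain above 0 mm counted as wet
wet_at_0mm <- (n_days - n_zero) / n_days

round(c(cutoff_1mm = wet_at_1mm, cutoff_0mm = wet_at_0mm), 3)
```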

     

    Looking at the association between rain and season of the year



    The time series plot seems to indicate that season of the year might be a good predictor for the occurrence of rain. Let's start by investigating this relation, with both the continuous rain variable and the binary one.

    Rain amount (continuous) by season

    # Jitter plot - Rain amount by season 
     
     ggplot(weather, aes(season,rain)) +
      geom_jitter(aes(colour=rain), position = position_jitter(width = 0.2)) +
      scale_colour_gradient2(low = "blue", mid = "red",high = "black", midpoint = 30) +
      scale_y_continuous(breaks = seq(0,80,20)) +
      xlab("Season") +
      ylab ("Rain (mm)") +
      ggtitle("Daily rain amount by season")
     
     

    > # Rain amount by season
    > # {tapply(x,y,z) applies function z to x, for each level of y} 
     
    > tapply(weather$rain,weather$season,summary) 
     
    $Spring
       Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
      0.000   0.000   0.000   2.934   2.550  28.200 
    
    $Summer
       Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
      0.000   0.000   0.000   2.936   1.350  68.300 
    
    $Autumn
       Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
      0.000   0.000   0.300   7.164   6.450  74.900 
    
    $Winter
       Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
       0.00    0.00    3.95   10.45   14.65   64.80

    We can see that most of the extreme values (outliers) are in the winter and autumn. There are still, however, many dry days in both these seasons, and hence the means are too close to each other. Fitting a model to predict the actual amount of rain (not the same as the probability of occurrence) based on the season alone might not give the greatest results.

    Rain occurrence (binary) by season

    # Bar plot - dry and wet days by season (relative)
    ggplot(weather,aes(season)) +
      geom_bar(aes(fill = rained), position = "fill") +
      geom_hline(aes(yintercept = prop.table(table(weather$rained))["No"]),
                 colour = "blue",linetype = "dashed", size = 1) +
      annotate("text", x = 1, y = 0.65, label = "yr. w/o = 0.60", colour = "blue") +
      xlab("Season") +
      ylab ("Proportion") +
      ggtitle("Proportion of days without and with rain, by season")
     
    > round(prop.table(table(season = weather$season, rained = weather$rained), 1), 2)
     
             rained
    season     No  Yes
      Spring 0.74 0.26
      Summer 0.72 0.28
      Autumn 0.57 0.43
      Winter 0.37 0.63

    It appears that, when it comes to calculating the likelihood of rain on a particular day, the season of the year may have some predictive power. This is especially true for the winter season, where the proportion of rainy days (63%) is well above the yearly average (40%).

     

    Looking at the correlations between rain and all numeric variables


    We are now going to calculate the linear (Pearson) correlations between the continuous outcome variable (daily rain amount) and all the numeric variables - both the actual numeric ones (such as the temperature and wind speed), and those that are actually factor variables, but still make sense if we want to see them as numeric (for instance, the hour of the day in which some event occurred).

    Note that we are not modelling at this stage, just trying to gain some insight from the data. We are therefore concerned neither with whether the correlations are significant (p-values and/or confidence intervals) nor with whether the relation between any two variables is in fact linear (we will be able to check this later, anyway, when we create the scatter plots with the lowess curves on top).

    Here is the code (the comments should make it easier to understand what we are trying to do).

    > # Create a new data frame with only the variables that can be numeric
    > weather.num <- weather[c("rain","l.temp","h.temp","ave.temp","ave.wind","gust.wind",
    + "l.temp.hour","h.temp.hour","gust.wind.hour")] 
     
    > # Convert the following factor variables to numeric
    > weather.num$l.temp.hour <- as.numeric(weather.num$l.temp.hour)
    > weather.num$h.temp.hour <- as.numeric(weather.num$h.temp.hour)
    > weather.num$gust.wind.hour <- as.numeric(weather.num$gust.wind.hour) 
     
    > # Pass the entire data frame to the cor() function to create a big correlation matrix
    > # Pick only the first row of the matrix [1,] ==> correlation between rain and all the other variables
    > round(cor(weather.num),2)[1,]
              rain         l.temp         h.temp       ave.temp       ave.wind 
              1.00          -0.14          -0.29          -0.21           0.48 
         gust.wind    l.temp.hour    h.temp.hour gust.wind.hour 
              0.61           0.19          -0.25          -0.16
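
One caveat about the conversion step above: as.numeric() applied to a factor returns the internal level codes rather than the labels. Whether that matters here depends on how the hour factors are stored, which is not shown in this post. If the levels are the hours in increasing order with none missing, the codes are only a shifted version of the hours and the correlations are unaffected; otherwise, converting through as.character() is the safer route. The toy factor below is hypothetical.

```r
# Hypothetical factor of hours; levels are sorted as text: "10", "2", "23"
hour <- factor(c("10", "2", "23", "2"))

# Level codes: positions in levels(hour), not the hours themselves
hour_codes <- as.numeric(hour)

# Actual values: go through the labels first
hour_values <- as.numeric(as.character(hour))

levels(hour)
hour_codes
hour_values
```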

    There seems to be a promising positive correlation between the rain amount and the wind variables, especially the wind gust maximum speed (moderate correlation value of 0.61), i.e., higher wind speeds tend to be associated with higher amounts of precipitation. Always bear in mind that correlation does not imply causation: while it is true that the wind correlates with rain, this does not necessarily mean that the wind itself is causing the rain. It could actually be the other way around or, as is very common, both variables may be driven by a third one that we are not even considering.

    There are also some negative correlations with the temperatures (especially the daily high) that, even though not as strong as the wind ones, are still worth looking at. It appears that higher amounts of rain are correlated with lower daily high temperatures. But let's think about it for a minute: we saw that it is in the winter when it rains the most, and in Part 3a we also saw that the temperatures were lower in the winter. Here is an example of a potential (yet to be confirmed) interaction with a third variable: it may not be the lower temperature by itself that causes more rain, but the fact that both the precipitation and the lower temperatures tend to occur during a specific period of the year.

    Since the season seems to have an impact on these variables, I would like to explore it a bit further by calculating all these correlations by season and checking whether the values hold. If the correlation between rain and some other variable differs markedly from one season to another, that is evidence of an interaction.

    > # Let's split the data frame in four parts, one for each season
    > weather.num.season <- split(weather.num,weather$season) 
     
    > # The result is a list...
    > class(weather.num.season)
    [1] "list" 
     
    > # ...with 4 elements, where...
    > length(weather.num.season)
    [1] 4 
     
    > # ...each element of the list is a data frame (the seasons), with nine variables
    > summary(weather.num.season)
           Length Class      Mode
    Spring 9      data.frame list
    Summer 9      data.frame list
    Autumn 9      data.frame list
    Winter 9      data.frame list 
     
    > # Here are the names of each of the data frames of the list
    > attributes(weather.num.season)
    $names
    [1] "Spring" "Summer" "Autumn" "Winter"
    
    > # *apply family of functions are arguably the most powerful in base R, but also the most difficult to master
    > # {sapply(x,z) applies function z to each element of x}
    > # First go over the elements of the list and calculate the correlation matrix (all against all)
    > # For each season, return only the correlation between "rain" and everything else
    > sapply(weather.num.season, function (x) round(cor(x)["rain",],2))
                   Spring Summer Autumn Winter
    rain             1.00   1.00   1.00   1.00
    l.temp          -0.33   0.06   0.13   0.05
    h.temp          -0.45  -0.26  -0.07  -0.26
    ave.temp        -0.43  -0.14   0.06  -0.09
    ave.wind         0.34   0.50   0.38   0.66
    gust.wind        0.37   0.45   0.66   0.71
    l.temp.hour      0.24   0.16   0.01   0.33
    h.temp.hour     -0.13  -0.22  -0.16  -0.30
    gust.wind.hour  -0.26  -0.34  -0.18   0.06 
     
    > # Not going to run it, but here is an alternative that might be easier (or not) to understand
    > # It actually shows the correlations for each element of the list
    > # lapply(weather.num.season, function (x) round(cor(x)["rain",],2)) 

    What do you conclude from the table above? 

    The correlation between rain and wind varies, but remains moderately strong regardless of the season of the year, making this the most promising variable in our data set. The correlation between rain and the daily high temperature does not confirm what I had hypothesised above. In fact, the correlation is even stronger in the spring than in the winter, and we would have to dig deeper if we really needed to understand what is going on (keep in mind we are analysing just one of the possible interactions - the season - but in practice there can be several). For the purpose of what we are doing now, it is enough to be aware that this correlation is not stable throughout the year, and actually goes to nearly zero during the autumn. Lastly, the correlations between rain and the hour of the events (low and high temperatures, and wind gust) are rather weak but show some stability (see l.temp.hour and h.temp.hour). They might have some predictive power, at least in a linear model.
    Now that we know that the wind has the strongest correlation with the rain, and that it holds true across all seasons, it's time to plot these variables, because we want to learn something we still don't know: what is the shape of this relation? Linear, piecewise linear, curvilinear? Let's find out using ggplot2 and the concept of faceting.

     

    Looking deeper at the wind and rain correlation - faceting technique



    The idea of faceting (also called Trellis plots) is a simple but important one. A graphic, of any type and mapping any number of variables, can be divided into several panels, depending on the value of a conditioning variable. This value is either each of the levels of a categorical variable, or one of a number of ranges of a numeric variable. This helps us check whether there are consistent patterns across all panels. In ggplot2 there are two functions to create facets - facet_wrap() and facet_grid() - that are used when we have one or two conditioning variables, respectively. Like everything in ggplot2, this is just another layer in the plot, which means all geometries, mappings, and settings will be replicated across all panels.

    Let's then assess the linearity of the correlation of the amount of rain and the maximum wind gust speed, conditioning on the season of the year.

    # Amount of rain vs. wind, by season
     
    ggplot(weather,aes(gust.wind,rain)) +
      geom_point(colour = "firebrick") +
      geom_smooth(size = 0.75, se = FALSE) +
      facet_wrap(~season) +
      xlab("Maximum wind speed (km/h)") +
      ylab ("Rain (mm)") +
      ggtitle("Amount of rain vs. maximum wind speed, by season")
     
     

    This plot confirms what we had already discovered: there is a positive correlation between rain and wind, and the association holds regardless of the season. But now we know more: this correlation is non-linear. In fact, if we were to generalise, we could say there is no correlation at all when the maximum wind speed is below 25 km/h. For values higher than that, there seems to be a linear association in the autumn and winter, not so linear in the spring, and definitely non-linear during the summer. If we wanted to model this relation, we would either fit a non-linear model (such as a regression tree) or we could try to force a piecewise linear model (linear spline), where the equation relating the outcome to the predictors would, itself, be different depending on the value of the wind.
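
To make the piecewise linear idea concrete, here is a minimal sketch on simulated data (not the weather data set). It assumes a single knot at 25 km/h, taken from the reading of the plot above, and adds a "hinge" term pmax(gust - 25, 0) to an ordinary lm() fit, so that the slope is allowed to change at the knot. The variable names and the simulated relation are assumptions for illustration only.

```r
set.seed(42)

knot <- 25                                  # assumed change point (km/h)
gust <- runif(200, min = 5, max = 80)       # simulated maximum wind speeds

# Simulated truth: flat below the knot, slope of 0.5 above it, plus noise
# (a toy relation, so small negative "rain" values can occur)
rain <- 0.5 * pmax(gust - knot, 0) + rnorm(200, sd = 0.5)

toy <- data.frame(gust = gust, rain = rain)

# Hinge term: zero below the knot, distance above the knot otherwise
toy$hinge <- pmax(toy$gust - knot, 0)

# Linear spline with one knot: the slope is b_gust below the knot
# and b_gust + b_hinge above it
fit <- lm(rain ~ gust + hinge, data = toy)
coef(fit)
```

In this parameterisation the coefficient of hinge is the change in slope at the knot, so a value near zero would suggest that a single straight line is enough.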


     

    Occurrence of rain - more complex faceting to visualise variable association

     


    To finish this part of the series, let's check whether the variable that seemed to have some predictive power for the amount of rain (maximum wind speed) is also useful for a binary outcome (occurrence of rain). This time we will control not only for the season, but also for the daily high temperature (because, as we have seen before, this variable was associated with both the rain and the season). We will do this simultaneously, using the faceting technique on two variables. Since the daily high temperature is continuous, we first need to transform it into a categorical variable. A common strategy is to split a continuous variable into four groups of (roughly) the same size, using the quartiles as cut points. This is very simple to do in R, combining the cut() and quantile() functions.

    > # With its defaults, quantile() returns the five cut points that delimit 4 intervals (quartiles)
    > quantile(weather$h.temp)
      0%  25%  50%  75% 100% 
     9.8 14.4 19.1 23.3 31.5 
     
    > # All we need to do is to define the quartiles as the breaks of the cut function
    > # and label the intervals accordingly
    > weather$h.temp.quant <- cut(weather$h.temp, breaks = quantile(weather$h.temp),
                                labels = c("Cool","Mild","Warm","Hot"), include.lowest = TRUE)
     
    > # The result 
    > table(weather$h.temp.quant)
    
    Cool Mild Warm  Hot 
      92   91   94   88  

      

    # Occurrence of rain, by season and daily high temperature 
    
    ggplot(weather,aes(rained,gust.wind)) +
      geom_boxplot(aes(colour=rained)) +
      facet_grid(h.temp.quant~season) +
      xlab("Occurrence of rain") +
      ylab ("Maximum wind speed (km/h)") +
      ggtitle("Occurrence of rain, by season and daily high temperature")
     
     
      

    The graph reveals a clear pattern: the median of the maximum wind speed is always higher when it rains, and this is not affected by the range of the daily high temperature, even after controlling for the temperature variation within each season of the year.
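
A visual reading like this one can also be checked numerically. As a rough sketch, tapply() accepts a list of factors and then returns one median per combination of levels. The small data frame below is invented for illustration (two seasons only, no temperature bands) and is not the weather data set.

```r
# Hypothetical toy data: wind gusts on dry and wet days in two seasons
toy <- data.frame(
  season    = rep(c("Winter", "Summer"), each = 4),
  rained    = rep(c("No", "No", "Yes", "Yes"), times = 2),
  gust.wind = c(20, 30, 50, 60, 15, 25, 35, 45)
)

# Median gust for each season (rows) and rain outcome (columns)
med <- tapply(toy$gust.wind, list(toy$season, toy$rained), median)
med

# TRUE if the median is higher on wet days in every season
all(med[, "Yes"] > med[, "No"])
```

With the real data, adding a third factor to the list (the temperature band) would give one table of medians per band.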

    I think we now have a much better understanding of the data. We know which variables matter the most, and which ones seem to be useless, when it comes to predicting the rain, either the actual amount or the probability of its occurrence. Please note that it would be impossible to write about the analysis of every single variable and show every plot; behind the scenes, much more work than what I've shown here has been done.

    In Part 4, the last of this series, we will be using a few machine learning algorithms to find out how well the rain can be predicted, and the contribution of each variable to the accuracy and precision of each model.

    In the meantime, if you have done any kind of analysis with this data set, feel free to share your own findings, either by commenting on the post or by sending me a private message.