The data is huge and messy, and there are loads of ways to preprocess it.
The way I dealt with it is probably not the best, but it still shows what I need.
When I started preprocessing the data, I ran into a lot of different problems. For example, the tail of the data is a real mess:
> tail(storm)
X.STATE__.
1769564 68 kt (78 mph) at the Cape Lisburne AWOS.
1769565 Zone 202: Blizzard conditions were observed at Barrow from approximately 1021AKST through 1700AKST on the 9th. The visibility was frequently reduced to one quarter mile or less in blowing snow. There was a peak wind gust to 46 kt (53 mph) at the Barrow ASOS.
1769566 Zone 207: Blizzard conditions were observed at Kivalina from approximately 0400AKST through 1230AKST on the 9th. The visibility was frequently reduced to one quarter of a mile in snow and blowing snow. There was a peak wind gust to 61 kt (70 mph) at the Kivalina ASOS. The doors to the village transportation shed were blown out to sea. Many homes lost portions of their tin roofing
1769567 1.00
1769568 with rainfall remaining light to moderate during most its duration. The rainfall resulted in minor river flooding along the Little River
1769569 The rain mixed with and changed to snow across north Alabama during the afternoon and evening hours of the 28th
X.BGN_DATE.
1769564
1769565
1769566 and satellite dishes were ripped off of roofs. One home had its door blown off. At Point Hope
1769567 11/28/2011 0:00:00
1769568 Big Wills Creek and Paint Rock. A landslide occurred on Highway 35 just north of Section in Jackson County. A driver was trapped in his vehicle
1769569 and lasted into the 29th. The heaviest bursts of snow occurred in northwest Alabama during the afternoon and evening hours
..........
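A possible explanation for this mess, and it is only a guess since the reading step is not shown here: column names like X.STATE__. and values that keep their quote characters are what read.csv() produces when quoting is turned off (quote = ""). The sketch below uses a tiny made-up CSV, not the real storm file, to show the difference; the file contents and variable names are purely illustrative.

```r
# Made-up CSV in a similar style: quoted text fields, unquoted numbers.
csv_lines <- c(
  '"EVTYPE","FATALITIES","PROPDMGEXP"',
  '"TORNADO",1.00,"K"',
  '"FLOOD",0.00,"M"'
)
path <- tempfile(fileext = ".csv")
writeLines(csv_lines, path)

# Quoting disabled: the quote characters stay in the header and the values,
# and the header names get rewritten into valid R names.
raw <- read.csv(path, quote = "", stringsAsFactors = FALSE)
raw_names <- names(raw)        # "X.EVTYPE." "X.FATALITIES." "X.PROPDMGEXP."
raw_exp <- raw$X.PROPDMGEXP.   # values still wrapped in literal quotes

# Default quoting: quotes are treated as delimiters and removed.
clean <- read.csv(path, stringsAsFactors = FALSE)
clean_names <- names(clean)    # "EVTYPE" "FATALITIES" "PROPDMGEXP"
clean_exp <- clean$PROPDMGEXP  # plain K and M
```

If the real file was read this way, free-text remarks containing commas would also be split across columns, which would match the shifted rows in the tail above.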
Then I started to clean the data like this:
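# Keep only the columns needed for the health analysis.
healthData <- storm[, c("X.EVTYPE.", "X.BGN_DATE.", "X.FATALITIES.", "X.INJURIES.")]
# Go through as.character() so a factor column gives its values, not its level codes.
# Malformed rows become NA and are dropped by subset().
healthData$FATALITIES <- as.numeric(as.character(healthData$X.FATALITIES.))
healthData$INJURIES <- as.numeric(as.character(healthData$X.INJURIES.))
# Note: this keeps only events with at least one fatality AND at least one injury.
healthData <- subset(healthData, FATALITIES > 0 & INJURIES > 0)
# Drop the raw columns now that the numeric copies exist.
healthData <- healthData[, c("X.EVTYPE.", "X.BGN_DATE.", "FATALITIES", "INJURIES")]
healthData$total <- healthData$FATALITIES + healthData$INJURIES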
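A small aside on the as.numeric() conversions: if a column was read as a factor (the default for text columns in older versions of R), as.numeric() on its own returns the internal level codes rather than the numbers, so going through as.character() first is the safer pattern. A minimal sketch with a made-up vector standing in for a column like X.FATALITIES.:

```r
# Made-up values stored as a factor, as an older read.csv() might produce.
f <- factor(c("0.00", "3.00", "10.00", "3.00"))

# Direct conversion returns the position of each value among the sorted levels.
codes <- as.numeric(f)

# Converting to character first recovers the actual numbers.
values <- as.numeric(as.character(f))

print(codes)
print(values)
```

If the column is already character, the extra as.character() call changes nothing, so it is harmless either way.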
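# Same approach for property damage.
propData <- storm[, c("X.EVTYPE.", "X.BGN_DATE.", "X.PROPDMG.", "X.PROPDMGEXP.")]
# as.character() first, so a factor column gives its values, not its level codes.
propData$pronum <- as.numeric(as.character(propData$X.PROPDMG.))
propData <- subset(propData, pronum > 0)
propData <- propData[, -3]  # drop the raw X.PROPDMG. column
library(plyr)
# The exponent values still contain literal quote characters, hence the escaped quotes.
# H = hundreds, K = thousands, M = millions, B = billions; anything else leaves pronum as is.
propData <- mutate(propData, PropertyDamage =
  ifelse(toupper(X.PROPDMGEXP.) == "\"K\"", pronum * 1e3,
  ifelse(toupper(X.PROPDMGEXP.) == "\"M\"", pronum * 1e6,
  ifelse(toupper(X.PROPDMGEXP.) == "\"B\"", pronum * 1e9,
  ifelse(toupper(X.PROPDMGEXP.) == "\"H\"", pronum * 100, pronum)))))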
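The nested ifelse() calls above map each exponent code to a multiplier. As an alternative sketch, not what the post itself uses, the same mapping can be written with a named lookup vector in base R. The toy data frame below is made up for illustration and only borrows the column names from propData.

```r
# Made-up rows standing in for propData; the exponent values keep their
# literal quote characters, as in the real data.
toy <- data.frame(
  pronum = c(25, 2.5, 1, 3, 7),
  X.PROPDMGEXP. = c('"K"', '"m"', '"B"', '"H"', ''),
  stringsAsFactors = FALSE
)

# Multiplier for each known exponent code (the names include the quotes).
multiplier <- setNames(c(1e3, 1e6, 1e9, 1e2),
                       c('"K"', '"M"', '"B"', '"H"'))

# Look up each row's multiplier; unknown or empty codes fall back to 1,
# which matches the last branch of the nested ifelse().
code <- toupper(as.character(toy$X.PROPDMGEXP.))
m <- unname(multiplier[code])
m[is.na(m)] <- 1

toy$PropertyDamage <- toy$pronum * m
print(toy)
```

Adding another code then only means adding one entry to the lookup vector instead of another level of nesting.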
After that, the result is much easier to analyze.
To see more, feel free to check out my RPubs page: storm data analysis