First Example: New York City Taxi Fare Prediction

New York City Taxi Fare Prediction: Data Mining

Reading the Data and a First Exploration

Dataset used: https://github.com/woshizhangrong/train_raw

First, with a new dataset, I like to explore the data. This means investigating the number of features, their data types, their meaning, and their statistics.

In [1]:

# load some default Python modules
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
plt.style.use('seaborn-whitegrid')

In [2]:

# read data in pandas dataframe
df_train =  pd.read_csv('E:/NYC_Fare/train_raw.csv', parse_dates=["pickup_datetime"])

# list first few rows (datapoints)
df_train.head()

Out[2]:

                             key  fare_amount     pickup_datetime  pickup_longitude  pickup_latitude  dropoff_longitude  dropoff_latitude  passenger_count
0    2009-06-15 17:26:21.0000001          4.5 2009-06-15 17:26:21        -73.844311        40.721319         -73.841610         40.712278                1
1    2010-01-05 16:52:16.0000002         16.9 2010-01-05 16:52:16        -74.016048        40.711303         -73.979268         40.782004                1
2   2011-08-18 00:35:00.00000049          5.7 2011-08-18 00:35:00        -73.982738        40.761270         -73.991242         40.750562                2
3    2012-04-21 04:30:42.0000001          7.7 2012-04-21 04:30:42        -73.987130        40.733143         -73.991567         40.758092                1
4  2010-03-09 07:51:00.000000135          5.3 2010-03-09 07:51:00        -73.968095        40.768008         -73.956655         40.783762                1

In [3]:

# check datatypes
df_train.dtypes

Out[3]:

key                          object
fare_amount                 float64
pickup_datetime      datetime64[ns]
pickup_longitude            float64
pickup_latitude             float64
dropoff_longitude           float64
dropoff_latitude            float64
passenger_count               int64
dtype: object

In [4]:

# check statistics of the features
df_train.describe()

Out[4]:

         fare_amount  pickup_longitude  pickup_latitude  dropoff_longitude  dropoff_latitude  passenger_count
count  200000.000000     200000.000000    200000.000000      199999.000000     199999.000000    200000.000000
mean       11.342877        -72.506121        39.922326         -72.518673         39.925579         1.682445
std         9.837855         11.608097        10.048947          10.724226          6.751120         1.306730
min       -44.900000       -736.550000     -3116.285383       -1251.195890      -1189.615440         0.000000
25%         6.000000        -73.992050        40.735007         -73.991295         40.734092         1.000000
50%         8.500000        -73.981743        40.752761         -73.980072         40.753225         1.000000
75%        12.500000        -73.967068        40.767127         -73.963508         40.768070         2.000000
max       500.000000       2140.601160      1703.092772          40.851027        404.616667         6.000000

Things I notice (when using the 500K dataset):
- The minimum fare amount is negative. As this does not seem realistic, I will remove those rows from the dataset.
- Some of the minimum and maximum longitude/latitude coordinates are way off. These datapoints will also be removed (I will define a bounding box for the coordinates, see further below).
- The mean fare is about USD 11.4 with a standard deviation of about USD 9.9. When building a predictive model, we want to do better than USD 9.9.
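To make the "better than USD 9.9" target concrete: a baseline model that always predicts the mean fare has an RMSE equal to the standard deviation of fare_amount. A quick sketch of that baseline (not a cell from the original notebook):

In [ ]:

# baseline: always predict the mean fare; the RMSE of this
# constant prediction equals the standard deviation of fare_amount
baseline_pred = df_train.fare_amount.mean()
baseline_rmse = np.sqrt(((df_train.fare_amount - baseline_pred)**2).mean())
print('Baseline RMSE (always predict the mean): $%0.2f' % baseline_rmse)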

In [5]:

print('Old size: %d' % len(df_train))
df_train = df_train[df_train.fare_amount>=0]
print('New size: %d' % len(df_train))
Old size: 200000
New size: 199987

In [6]:

# plot histogram of fare
df_train[df_train.fare_amount<100].fare_amount.hist(bins=100, figsize=(14,3))
plt.xlabel('fare $USD')
plt.title('Histogram');

In the histogram of the fare amount, there are a few small spikes between USD 40 and USD 60. These could indicate fixed fares (for example, trips to/from an airport). This will be explored further below.
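To see these spikes more clearly, the histogram can be zoomed in on that range; a small sketch (not part of the original notebook):

In [ ]:

# zoom in on the $40-$60 range with fine bins to expose possible fixed fares
df_train[(df_train.fare_amount >= 40) & (df_train.fare_amount <= 60)] \
    .fare_amount.hist(bins=80, figsize=(14,3))
plt.xlabel('fare $USD')
plt.title('Histogram, zoomed in on $40-$60');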
Removing missing data
Always check for missing data. As this dataset is large, removing datapoints with missing values will probably have no effect on the model being trained.

In [7]:

print(df_train.isnull().sum())
key                  0
fare_amount          0
pickup_datetime      0
pickup_longitude     0
pickup_latitude      0
dropoff_longitude    1
dropoff_latitude     1
passenger_count      0
dtype: int64

In [8]:

print('Old size: %d' % len(df_train))
df_train = df_train.dropna(how = 'any', axis = 'rows')
print('New size: %d' % len(df_train))
Old size: 199987
New size: 199986

Test data

Read the test data, check its statistics, and compare them with the training set.
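The cells with the test-set comparison did not survive in this copy. A minimal sketch of what that step looks like, assuming a test_raw.csv next to the training file (the path is hypothetical):

In [ ]:

# read test data (hypothetical path) and inspect its statistics
df_test = pd.read_csv('E:/NYC_Fare/test_raw.csv', parse_dates=["pickup_datetime"])
df_test.describe()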

Location data

As we are dealing with location data, I like to plot the coordinates on a map. This gives a much better view of the data. For this I used the following websites:
- Easy to use map & GPS tools: https://www.gps-coordinates.net/
- Calculate the distance between places: https://www.travelmath.com/flying-distance/
- OpenStreetMap, to grab a map for a given bounding box: https://www.openstreetmap.org/export#map=8/52.154/5.295

The New York City coordinates (https://www.travelmath.com/cities/New+York,+NY) are longitude = -74.0063889, latitude = 40.7141667. I defined a bounding box of interest [long_min, long_max, lat_min, lat_max] using the minimum and maximum coordinates from the test set. This way I am sure the model is trained for the full range of pickup/dropoff coordinates in the test set. I grabbed the corresponding map from OpenStreetMap, and I drop all datapoints outside this box.
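A sketch of how such a bounding box can be derived from the test set (assuming the df_test dataframe from the sketch above; the notebook itself uses the hard-coded box below):

In [ ]:

# bounding box spanning all pickup and dropoff coordinates in the test set
long_min = min(df_test.pickup_longitude.min(), df_test.dropoff_longitude.min())
long_max = max(df_test.pickup_longitude.max(), df_test.dropoff_longitude.max())
lat_min  = min(df_test.pickup_latitude.min(),  df_test.dropoff_latitude.min())
lat_max  = max(df_test.pickup_latitude.max(),  df_test.dropoff_latitude.max())
print('Bounding box: (%0.2f, %0.2f, %0.2f, %0.2f)' % (long_min, long_max, lat_min, lat_max))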

In [9]:

# this function will also be used with the test set below
def select_within_boundingbox(df, BB):
    return (df.pickup_longitude >= BB[0]) & (df.pickup_longitude <= BB[1]) & \
           (df.pickup_latitude >= BB[2]) & (df.pickup_latitude <= BB[3]) & \
           (df.dropoff_longitude >= BB[0]) & (df.dropoff_longitude <= BB[1]) & \
           (df.dropoff_latitude >= BB[2]) & (df.dropoff_latitude <= BB[3])
            
# load image of NYC map
BB = (-74.5, -72.8, 40.5, 41.8)
nyc_map = plt.imread('https://aiblog.nl/download/nyc_-74.5_-72.8_40.5_41.8.png')

# load extra image to zoom in on NYC
BB_zoom = (-74.3, -73.7, 40.5, 40.9)
nyc_map_zoom = plt.imread('https://aiblog.nl/download/nyc_-74.3_-73.7_40.5_40.9.png')

In [10]:

print('Old size: %d' % len(df_train))
df_train = df_train[select_within_boundingbox(df_train, BB)]
print('New size: %d' % len(df_train))
Old size: 199986
New size: 195801

In [11]:

# this function will be used more often to plot data on the NYC map
def plot_on_map(df, BB, nyc_map, s=10, alpha=0.2):
    fig, axs = plt.subplots(1, 2, figsize=(16,10))
    axs[0].scatter(df.pickup_longitude, df.pickup_latitude, zorder=1, alpha=alpha, c='r', s=s)
    axs[0].set_xlim((BB[0], BB[1]))
    axs[0].set_ylim((BB[2], BB[3]))
    axs[0].set_title('Pickup locations')
    axs[0].imshow(nyc_map, zorder=0, extent=BB)

    axs[1].scatter(df.dropoff_longitude, df.dropoff_latitude, zorder=1, alpha=alpha, c='r', s=s)
    axs[1].set_xlim((BB[0], BB[1]))
    axs[1].set_ylim((BB[2], BB[3]))
    axs[1].set_title('Dropoff locations')
    axs[1].imshow(nyc_map, zorder=0, extent=BB)

In [12]:

# plot training data on map
plot_on_map(df_train, BB, nyc_map, s=1, alpha=0.3)

In [13]:

# plot training data on map zoomed in
plot_on_map(df_train, BB_zoom, nyc_map_zoom, s=1, alpha=0.3)

Removing datapoints located in water

As can be seen from the scatter plots on the maps above, some datapoints are located in the water. These are clearly noisy datapoints. To remove them, I created a boolean land/water mask from the NYC map. For this I used Photoshop to select the blue water and thresholded it into a clean land/water map. The resulting map is shown below.
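For readers without Photoshop: a rough programmatic alternative is to threshold the map image directly. A sketch assuming water pixels are close to the standard OpenStreetMap water tint #aad3df (both the colour and the tolerance are guesses that would need tuning):

In [ ]:

# rough land/water mask: classify pixels close to the assumed OSM water colour
rgb = plt.imread('https://aiblog.nl/download/nyc_-74.5_-72.8_40.5_41.8.png')[:,:,:3]
water_rgb = np.array([0xaa, 0xd3, 0xdf]) / 255.0        # assumed OSM water tint
water = np.linalg.norm(rgb - water_rgb, axis=2) < 0.05  # tolerance is a guess
land_mask = ~water  # land = True, water = False, same convention as nyc_mask below
plt.figure(figsize=(8,8))
plt.imshow(land_mask, cmap='gray');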

In [14]:

# read nyc mask and turn into boolean map with
# land = True, water = False
nyc_mask = plt.imread('https://aiblog.nl/download/nyc_mask-74.5_-72.8_40.5_41.8.png')[:,:,0] > 0.9

plt.figure(figsize=(8,8))
plt.imshow(nyc_map, zorder=0)
plt.imshow(nyc_mask, zorder=1, alpha=0.7); # note: True is shown in black, False in white.

Next, I need to translate the longitude/latitude coordinates into image xy pixel coordinates. The function lonlat_to_xy implements this translation. Note that the y coordinate needs to be reversed, as the image y axis is oriented from top to bottom.
Once the xy pixel coordinates are calculated for all datapoints, a boolean index is computed using the NYC mask.

In [15]:

# translate longitude/latitude coordinate into image xy coordinate
def lonlat_to_xy(longitude, latitude, dx, dy, BB):
    return (dx*(longitude - BB[0])/(BB[1]-BB[0])).astype('int'), \
           (dy - dy*(latitude - BB[2])/(BB[3]-BB[2])).astype('int')

In [16]:

pickup_x, pickup_y = lonlat_to_xy(df_train.pickup_longitude, df_train.pickup_latitude, 
                                  nyc_mask.shape[1], nyc_mask.shape[0], BB)
dropoff_x, dropoff_y = lonlat_to_xy(df_train.dropoff_longitude, df_train.dropoff_latitude, 
                                  nyc_mask.shape[1], nyc_mask.shape[0], BB)

In [17]:

idx = (nyc_mask[pickup_y, pickup_x] & nyc_mask[dropoff_y, dropoff_x])
print("Number of trips in water: {}".format(np.sum(~idx)))
Number of trips in water: 34

In [18]:

def remove_datapoints_from_water(df):
    def lonlat_to_xy(longitude, latitude, dx, dy, BB):
        return (dx*(longitude - BB[0])/(BB[1]-BB[0])).astype('int'), \
               (dy - dy*(latitude - BB[2])/(BB[3]-BB[2])).astype('int')

    # define bounding box
    BB = (-74.5, -72.8, 40.5, 41.8)
    
    # read nyc mask and turn into boolean map with
    # land = True, water = False
    nyc_mask = plt.imread('https://aiblog.nl/download/nyc_mask-74.5_-72.8_40.5_41.8.png')[:,:,0] > 0.9
    
    # calculate for each lon,lat coordinate the xy coordinate in the mask map
    pickup_x, pickup_y = lonlat_to_xy(df.pickup_longitude, df.pickup_latitude, 
                                      nyc_mask.shape[1], nyc_mask.shape[0], BB)
    dropoff_x, dropoff_y = lonlat_to_xy(df.dropoff_longitude, df.dropoff_latitude, 
                                      nyc_mask.shape[1], nyc_mask.shape[0], BB)    
    # calculate boolean index
    idx = nyc_mask[pickup_y, pickup_x] & nyc_mask[dropoff_y, dropoff_x]
    
    # return only datapoints on land
    return df[idx]

In [19]:

print('Old size: %d' % len(df_train))
df_train = remove_datapoints_from_water(df_train)
print('New size: %d' % len(df_train))
Old size: 195801
New size: 195767

In [22]:

# plot training data
plot_on_map(df_train, BB_zoom, nyc_map_zoom, s=1, alpha=0.3)

Data density per square mile

The scatter plots of the pickup and dropoff locations give a quick impression of the density. However, counting the number of datapoints per area visualizes the density more precisely. The code below counts the number of pickups and dropoffs per square mile, which gives a better view of the "hotspots".

In [23]:

# For this plot and further analysis, we need a function to calculate the distance in miles between locations in lon,lat coordinates.
# This function is based on https://stackoverflow.com/questions/27928/
# calculate-distance-between-two-latitude-longitude-points-haversine-formula 
# return distance in miles
def distance(lat1, lon1, lat2, lon2):
    p = 0.017453292519943295 # Pi/180
    a = 0.5 - np.cos((lat2 - lat1) * p)/2 + np.cos(lat1 * p) * np.cos(lat2 * p) * (1 - np.cos((lon2 - lon1) * p)) / 2
    return 0.6213712 * 12742 * np.arcsin(np.sqrt(a)) # 2*R*asin...

# First calculate two arrays with datapoint density per sq mile
n_lon, n_lat = 200, 200 # number of grid bins per longitude, latitude dimension
density_pickup, density_dropoff = np.zeros((n_lat, n_lon)), np.zeros((n_lat, n_lon)) # prepare arrays

# To calculate the number of datapoints in a grid area, the numpy.digitize() function is used. 
# This function needs an array with the (location) bins for counting the number of datapoints
# per bin.
bins_lon = np.zeros(n_lon+1) # bin
bins_lat = np.zeros(n_lat+1) # bin
delta_lon = (BB[1]-BB[0]) / n_lon # bin longitude width
delta_lat = (BB[3]-BB[2]) / n_lat # bin latitude height
bin_width_miles = distance(BB[2], BB[1], BB[2], BB[0]) / n_lon # bin width in miles
bin_height_miles = distance(BB[3], BB[0], BB[2], BB[0]) / n_lat # bin height in miles
for i in range(n_lon+1):
    bins_lon[i] = BB[0] + i * delta_lon
for j in range(n_lat+1):
    bins_lat[j] = BB[2] + j * delta_lat
    
# Digitize per longitude, latitude dimension
inds_pickup_lon = np.digitize(df_train.pickup_longitude, bins_lon)
inds_pickup_lat = np.digitize(df_train.pickup_latitude, bins_lat)
inds_dropoff_lon = np.digitize(df_train.dropoff_longitude, bins_lon)
inds_dropoff_lat = np.digitize(df_train.dropoff_latitude, bins_lat)

# Count per grid bin
# note: as the density_pickup will be displayed as image, the first index is the y-direction, 
#       the second index is the x-direction. Also, the y-direction needs to be reversed for
#       properly displaying (therefore the (n_lat-j) term)
dxdy = bin_width_miles * bin_height_miles
for i in range(n_lon):
    for j in range(n_lat):
        density_pickup[j, i] = np.sum((inds_pickup_lon==i+1) & (inds_pickup_lat==(n_lat-j))) / dxdy
        density_dropoff[j, i] = np.sum((inds_dropoff_lon==i+1) & (inds_dropoff_lat==(n_lat-j))) / dxdy

In [24]:

# Plot the density arrays
fig, axs = plt.subplots(2, 1, figsize=(18, 24))
axs[0].imshow(nyc_map, zorder=0, extent=BB);
im = axs[0].imshow(np.log1p(density_pickup), zorder=1, extent=BB, alpha=0.6, cmap='plasma')
axs[0].set_title('Pickup density [datapoints per sq mile]')
cbar = fig.colorbar(im, ax=axs[0])
cbar.set_label('log(1 + #datapoints per sq mile)', rotation=270)

axs[1].imshow(nyc_map, zorder=0, extent=BB);
im = axs[1].imshow(np.log1p(density_dropoff), zorder=1, extent=BB, alpha=0.6, cmap='plasma')
axs[1].set_title('Dropoff density [datapoints per sq mile]')
cbar = fig.colorbar(im, ax=axs[1])
cbar.set_label('log(1 + #datapoints per sq mile)', rotation=270)

These plots clearly show that the datapoints are concentrated around Manhattan and the three airports (JFK, EWR, LGA). There is also a hotspot near Seymour (upper right corner). As I am not an American: does anybody know what is special about this location?

Distance and time observations

Before building a model, I want to test some basic "intuitions":
- The longer the distance between the pickup and dropoff locations, the higher the fare.
- Some trips, such as trips to/from an airport, have a fixed fee.
- Fares at night differ from fares during the day.

So, let's check these.

The longer the distance between pickup and dropoff locations, the higher the fare

To visualize the distance vs. fare relationship, we first need to calculate the distance of each trip.

In [25]:

# add new column to dataframe with distance in miles
df_train['distance_miles'] = distance(df_train.pickup_latitude, df_train.pickup_longitude, \
                                      df_train.dropoff_latitude, df_train.dropoff_longitude)

df_train.distance_miles.hist(bins=50, figsize=(12,4))
plt.xlabel('distance miles')
plt.title('Histogram ride distances in miles')
df_train.distance_miles.describe()

Out[25]:

count    195767.000000
mean          2.070060
std           2.364879
min           0.000000
25%           0.780658
50%           1.338753
75%           2.427781
max          64.644331
Name: distance_miles, dtype: float64

It seems most rides are short trips, with a small peak around 13 miles. This peak is probably due to airport rides.
Let's look at the influence of the number of passengers next.
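The passenger-count cells are missing from this copy; a minimal sketch of such a check (a simple groupby, not the original notebook's code):

In [ ]:

# average fare per passenger count: a first look at whether group size matters
print(df_train.groupby('passenger_count')['fare_amount'].agg(['count', 'mean']))
df_train.groupby('passenger_count')['fare_amount'].mean().plot(kind='bar', figsize=(8,3))
plt.xlabel('passenger count')
plt.ylabel('average fare $USD');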
