NYC Taxi Fare Prediction: Data Mining
Reading the data and a first exploration
Dataset used: https://github.com/woshizhangrong/train_raw
First, I like to explore a new dataset: investigating the number of features, their data types, their meaning, and their statistics.
In [1]:
# load some default Python modules
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
plt.style.use('seaborn-whitegrid')
In [2]:
# read data in pandas dataframe
df_train = pd.read_csv('E:/NYC_Fare/train_raw.csv', parse_dates=["pickup_datetime"])
# list first few rows (datapoints)
df_train.head()
Out[2]:
| | key | fare_amount | pickup_datetime | pickup_longitude | pickup_latitude | dropoff_longitude | dropoff_latitude | passenger_count |
|---|---|---|---|---|---|---|---|---|
| 0 | 2009-06-15 17:26:21.0000001 | 4.5 | 2009-06-15 17:26:21 | -73.844311 | 40.721319 | -73.841610 | 40.712278 | 1 |
| 1 | 2010-01-05 16:52:16.0000002 | 16.9 | 2010-01-05 16:52:16 | -74.016048 | 40.711303 | -73.979268 | 40.782004 | 1 |
| 2 | 2011-08-18 00:35:00.00000049 | 5.7 | 2011-08-18 00:35:00 | -73.982738 | 40.761270 | -73.991242 | 40.750562 | 2 |
| 3 | 2012-04-21 04:30:42.0000001 | 7.7 | 2012-04-21 04:30:42 | -73.987130 | 40.733143 | -73.991567 | 40.758092 | 1 |
| 4 | 2010-03-09 07:51:00.000000135 | 5.3 | 2010-03-09 07:51:00 | -73.968095 | 40.768008 | -73.956655 | 40.783762 | 1 |
In [3]:
# check datatypes
df_train.dtypes
Out[3]:
key object
fare_amount float64
pickup_datetime datetime64[ns]
pickup_longitude float64
pickup_latitude float64
dropoff_longitude float64
dropoff_latitude float64
passenger_count int64
dtype: object
In [4]:
# check statistics of the features
df_train.describe()
Out[4]:
| | fare_amount | pickup_longitude | pickup_latitude | dropoff_longitude | dropoff_latitude | passenger_count |
|---|---|---|---|---|---|---|
| count | 200000.000000 | 200000.000000 | 200000.000000 | 199999.000000 | 199999.000000 | 200000.000000 |
| mean | 11.342877 | -72.506121 | 39.922326 | -72.518673 | 39.925579 | 1.682445 |
| std | 9.837855 | 11.608097 | 10.048947 | 10.724226 | 6.751120 | 1.306730 |
| min | -44.900000 | -736.550000 | -3116.285383 | -1251.195890 | -1189.615440 | 0.000000 |
| 25% | 6.000000 | -73.992050 | 40.735007 | -73.991295 | 40.734092 | 1.000000 |
| 50% | 8.500000 | -73.981743 | 40.752761 | -73.980072 | 40.753225 | 1.000000 |
| 75% | 12.500000 | -73.967068 | 40.767127 | -73.963508 | 40.768070 | 2.000000 |
| max | 500.000000 | 2140.601160 | 1703.092772 | 40.851027 | 404.616667 | 6.000000 |
Things I noticed (when using the 500K dataset):

- The minimum fare amount is negative. As this seems unrealistic, I will remove these rows from the dataset.
- Some of the minimum and maximum longitude/latitude coordinates are way off. These datapoints will also be removed (I will define a bounding box for the coordinates, see further below).
- The mean fare is about $11.4, with a standard deviation of about $9.9. When building a predictive model we want to do better than $9.9:
In [5]:
print('Old size: %d' % len(df_train))
df_train = df_train[df_train.fare_amount>=0]
print('New size: %d' % len(df_train))
Old size: 200000
New size: 199987
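A model that always predicts the mean fare has an RMSE equal to the standard deviation of the fares, which is exactly the ~$9.9 baseline mentioned above. A minimal sketch with a few made-up fares (standing in for df_train.fare_amount):

```python
import numpy as np

# Illustrative only: made-up fares stand in for df_train.fare_amount
fares = np.array([4.5, 16.9, 5.7, 7.7, 5.3])

# The constant-mean predictor...
baseline_pred = fares.mean()
rmse = np.sqrt(np.mean((fares - baseline_pred) ** 2))

# ...has an RMSE identical to the (population) standard deviation
print(rmse, fares.std())
```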
In [6]:
# plot histogram of fare
df_train[df_train.fare_amount<100].fare_amount.hist(bins=100, figsize=(14,3))
plt.xlabel('fare $USD')
plt.title('Histogram');
The histogram of the fare amount shows some small spikes between $40 and $60. These could indicate fixed fares (e.g. to an airport). This is explored further below.
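One way to inspect those spikes is to isolate the $40-$60 band and count the most frequent exact amounts; a fixed fare shows up as one value repeated many times. A sketch with synthetic fares standing in for df_train.fare_amount (the repeated $52 here is only an illustrative flat rate, not taken from the dataset):

```python
import pandas as pd

# Synthetic fares for illustration; in the notebook this would be df_train.fare_amount
fares = pd.Series([8.5, 52.0, 52.0, 45.0, 52.0, 9.0, 57.33, 12.5])

# Isolate the $40-$60 band; a fixed (e.g. airport) fare appears as
# a single exact value with a high count
band = fares[(fares >= 40) & (fares <= 60)]
print(band.value_counts().head())
```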
Removing missing data

Always check for missing data. As this dataset is large, dropping the datapoints with missing data will probably have no effect on the model being trained.
In [7]:
print(df_train.isnull().sum())
key 0
fare_amount 0
pickup_datetime 0
pickup_longitude 0
pickup_latitude 0
dropoff_longitude 1
dropoff_latitude 1
passenger_count 0
dtype: int64
In [8]:
print('Old size: %d' % len(df_train))
df_train = df_train.dropna(how = 'any', axis = 'rows')
print('New size: %d' % len(df_train))
Old size: 199987
New size: 199986
Test data

Read the test data, check its statistics, and compare them with the training set.
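That check can be sketched with a tiny synthetic stand-in for the real test file (the columns mirror the training set; the values are made up; real code would use pd.read_csv on the test file):

```python
import pandas as pd

# Hypothetical stand-in for the test set, for illustration only
df_test = pd.DataFrame({
    'pickup_longitude':  [-73.99, -73.95, -74.01],
    'pickup_latitude':   [40.75, 40.77, 40.71],
    'dropoff_longitude': [-73.98, -73.96, -73.99],
    'dropoff_latitude':  [40.74, 40.78, 40.72],
})

# The coordinate ranges of the test set define the bounding box of interest
BB_test = (df_test.pickup_longitude.min(), df_test.pickup_longitude.max(),
           df_test.pickup_latitude.min(), df_test.pickup_latitude.max())
print(BB_test)
```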
Location data

As we are dealing with location data, I like to plot the coordinates on a map. This gives a much better view of the data. For this I use the following websites:
- Easy-to-use map and GPS tools: https://www.gps-.s.net/
- Calculating the distance between locations: https://www.travelmath.com/flying-./
- OpenStreetMap, for grabbing a map using a bounding box: https://www.openstreetmap.org/export#map=8/52.154/5.295
The New York City coordinates (https://w.WW.TraceM.COM/cities/New+York,+NY) are long. = -74.0063889, lat. = 40.7141667. I defined a bounding box of interest [long_min, long_max, lat_min, lat_max] using the minimum and maximum coordinates from the test set. This way I am sure the model is trained for the full range of pickup/dropoff coordinates of the test set.

I grabbed the map from OpenStreetMap and drop any datapoint outside this box.
In [9]:
# this function will also be used with the test set below
def select_within_boundingbox(df, BB):
    return (df.pickup_longitude >= BB[0]) & (df.pickup_longitude <= BB[1]) & \
           (df.pickup_latitude >= BB[2]) & (df.pickup_latitude <= BB[3]) & \
           (df.dropoff_longitude >= BB[0]) & (df.dropoff_longitude <= BB[1]) & \
           (df.dropoff_latitude >= BB[2]) & (df.dropoff_latitude <= BB[3])
# load image of NYC map
BB = (-74.5, -72.8, 40.5, 41.8)
nyc_map = plt.imread('https://aiblog.nl/download/nyc_-74.5_-72.8_40.5_41.8.png')
# load extra image to zoom in on NYC
BB_zoom = (-74.3, -73.7, 40.5, 40.9)
nyc_map_zoom = plt.imread('https://aiblog.nl/download/nyc_-74.3_-73.7_40.5_40.9.png')
In [10]:
print('Old size: %d' % len(df_train))
df_train = df_train[select_within_boundingbox(df_train, BB)]
print('New size: %d' % len(df_train))
Old size: 199986
New size: 195801
In [11]:
# this function will be used more often to plot data on the NYC map
def plot_on_map(df, BB, nyc_map, s=10, alpha=0.2):
    fig, axs = plt.subplots(1, 2, figsize=(16, 10))
    axs[0].scatter(df.pickup_longitude, df.pickup_latitude, zorder=1, alpha=alpha, c='r', s=s)
    axs[0].set_xlim((BB[0], BB[1]))
    axs[0].set_ylim((BB[2], BB[3]))
    axs[0].set_title('Pickup locations')
    axs[0].imshow(nyc_map, zorder=0, extent=BB)
    axs[1].scatter(df.dropoff_longitude, df.dropoff_latitude, zorder=1, alpha=alpha, c='r', s=s)
    axs[1].set_xlim((BB[0], BB[1]))
    axs[1].set_ylim((BB[2], BB[3]))
    axs[1].set_title('Dropoff locations')
    axs[1].imshow(nyc_map, zorder=0, extent=BB)
In [12]:
# plot training data on map
plot_on_map(df_train, BB, nyc_map, s=1, alpha=0.3)
In [13]:
# plot training data on map zoomed in
plot_on_map(df_train, BB_zoom, nyc_map_zoom, s=1, alpha=0.3)
Removing datapoints in the water

As can be seen from the map and scatter plots above, some datapoints are located in the water. These are obviously noisy datapoints. To remove them, I create a boolean land/water map from the NYC map. For this I used Photoshop to select the blue water color and threshold the map into a clean mask. The resulting map is shown below.
In [14]:
# read nyc mask and turn into boolean map with
# land = True, water = False
nyc_mask = plt.imread('https://aiblog.nl/download/nyc_mask-74.5_-72.8_40.5_41.8.png')[:,:,0] > 0.9
plt.figure(figsize=(8,8))
plt.imshow(nyc_map, zorder=0)
plt.imshow(nyc_mask, zorder=1, alpha=0.7); # note: True is shown in black, False in white.
Next, I need to translate the longitude/latitude coordinates into xy pixel coordinates. The function lonlat_to_xy implements this translation. Note that the y coordinate needs to be reversed, as the image y axis is oriented from top to bottom.

Once the xy pixel coordinates are calculated for all datapoints, the NYC mask is used to compute a boolean index (land = True).
In [15]:
# translate longitude/latitude coordinate into image xy coordinate
def lonlat_to_xy(longitude, latitude, dx, dy, BB):
    return (dx*(longitude - BB[0])/(BB[1]-BB[0])).astype('int'), \
           (dy - dy*(latitude - BB[2])/(BB[3]-BB[2])).astype('int')
In [16]:
pickup_x, pickup_y = lonlat_to_xy(df_train.pickup_longitude, df_train.pickup_latitude,
nyc_mask.shape[1], nyc_mask.shape[0], BB)
dropoff_x, dropoff_y = lonlat_to_xy(df_train.dropoff_longitude, df_train.dropoff_latitude,
nyc_mask.shape[1], nyc_mask.shape[0], BB)
In [17]:
idx = (nyc_mask[pickup_y, pickup_x] & nyc_mask[dropoff_y, dropoff_x])
print("Number of trips in water: {}".format(np.sum(~idx)))
Number of trips in water: 34
In [18]:
def remove_datapoints_from_water(df):
    def lonlat_to_xy(longitude, latitude, dx, dy, BB):
        return (dx*(longitude - BB[0])/(BB[1]-BB[0])).astype('int'), \
               (dy - dy*(latitude - BB[2])/(BB[3]-BB[2])).astype('int')
    # define bounding box
    BB = (-74.5, -72.8, 40.5, 41.8)
    # read nyc mask and turn into boolean map with
    # land = True, water = False
    nyc_mask = plt.imread('https://aiblog.nl/download/nyc_mask-74.5_-72.8_40.5_41.8.png')[:,:,0] > 0.9
    # calculate for each lon,lat coordinate the xy coordinate in the mask map
    pickup_x, pickup_y = lonlat_to_xy(df.pickup_longitude, df.pickup_latitude,
                                      nyc_mask.shape[1], nyc_mask.shape[0], BB)
    dropoff_x, dropoff_y = lonlat_to_xy(df.dropoff_longitude, df.dropoff_latitude,
                                        nyc_mask.shape[1], nyc_mask.shape[0], BB)
    # calculate boolean index
    idx = nyc_mask[pickup_y, pickup_x] & nyc_mask[dropoff_y, dropoff_x]
    # return only datapoints on land
    return df[idx]
In [19]:
print('Old size: %d' % len(df_train))
df_train = remove_datapoints_from_water(df_train)
print('New size: %d' % len(df_train))
Old size: 195801
New size: 195767
In [22]:
# plot training data
plot_on_map(df_train, BB_zoom, nyc_map_zoom, s=1, alpha=0.3)
Data density per square mile

The scatter plots of the pickup and dropoff locations give a quick impression of the density. However, counting the number of datapoints per area visualizes the density more precisely. The code below counts the number of pickups and dropoffs per square mile, which gives a better view of the "hotspots".
In [23]:
# For this plot and further analysis, we need a function to calculate the distance in miles between locations in lon,lat coordinates.
# This function is based on https://stackoverflow.com/questions/27928/
# calculate-distance-between-two-latitude-longitude-points-haversine-formula
# return distance in miles
def distance(lat1, lon1, lat2, lon2):
    p = 0.017453292519943295 # Pi/180
    a = 0.5 - np.cos((lat2 - lat1) * p)/2 + np.cos(lat1 * p) * np.cos(lat2 * p) * (1 - np.cos((lon2 - lon1) * p)) / 2
    return 0.6213712 * 12742 * np.arcsin(np.sqrt(a)) # 2*R*asin...
# First calculate two arrays with datapoint density per sq mile
n_lon, n_lat = 200, 200 # number of grid bins per longitude, latitude dimension
density_pickup, density_dropoff = np.zeros((n_lat, n_lon)), np.zeros((n_lat, n_lon)) # prepare arrays
# To calculate the number of datapoints in a grid area, the numpy.digitize() function is used.
# This function needs an array with the (location) bins for counting the number of datapoints
# per bin.
bins_lon = np.zeros(n_lon+1) # bin
bins_lat = np.zeros(n_lat+1) # bin
delta_lon = (BB[1]-BB[0]) / n_lon # bin longitude width
delta_lat = (BB[3]-BB[2]) / n_lat # bin latitude height
bin_width_miles = distance(BB[2], BB[1], BB[2], BB[0]) / n_lon # bin width in miles
bin_height_miles = distance(BB[3], BB[0], BB[2], BB[0]) / n_lat # bin height in miles
for i in range(n_lon+1):
    bins_lon[i] = BB[0] + i * delta_lon
for j in range(n_lat+1):
    bins_lat[j] = BB[2] + j * delta_lat
# Digitize per longitude, latitude dimension
inds_pickup_lon = np.digitize(df_train.pickup_longitude, bins_lon)
inds_pickup_lat = np.digitize(df_train.pickup_latitude, bins_lat)
inds_dropoff_lon = np.digitize(df_train.dropoff_longitude, bins_lon)
inds_dropoff_lat = np.digitize(df_train.dropoff_latitude, bins_lat)
# Count per grid bin
# note: as the density_pickup will be displayed as image, the first index is the y-direction,
# the second index is the x-direction. Also, the y-direction needs to be reversed for
# properly displaying (therefore the (n_lat-j) term)
dxdy = bin_width_miles * bin_height_miles
for i in range(n_lon):
    for j in range(n_lat):
        density_pickup[j, i] = np.sum((inds_pickup_lon==i+1) & (inds_pickup_lat==(n_lat-j))) / dxdy
        density_dropoff[j, i] = np.sum((inds_dropoff_lon==i+1) & (inds_dropoff_lat==(n_lat-j))) / dxdy
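The distance() helper above can be sanity-checked against a well-known figure: one degree of latitude spans roughly 69 miles anywhere on Earth. A self-contained sketch (the function is repeated here so the snippet runs on its own):

```python
import numpy as np

# Haversine-style distance in miles, repeated from the cell above
def distance(lat1, lon1, lat2, lon2):
    p = 0.017453292519943295  # Pi/180
    a = 0.5 - np.cos((lat2 - lat1) * p)/2 \
        + np.cos(lat1 * p) * np.cos(lat2 * p) * (1 - np.cos((lon2 - lon1) * p)) / 2
    return 0.6213712 * 12742 * np.arcsin(np.sqrt(a))  # 2*R*asin(...), km -> miles

# One degree of latitude is roughly 69 miles
print(distance(40.0, -74.0, 41.0, -74.0))
```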
In [24]:
# Plot the density arrays
fig, axs = plt.subplots(2, 1, figsize=(18, 24))
axs[0].imshow(nyc_map, zorder=0, extent=BB);
im = axs[0].imshow(np.log1p(density_pickup), zorder=1, extent=BB, alpha=0.6, cmap='plasma')
axs[0].set_title('Pickup density [datapoints per sq mile]')
cbar = fig.colorbar(im, ax=axs[0])
cbar.set_label('log(1 + #datapoints per sq mile)', rotation=270)
axs[1].imshow(nyc_map, zorder=0, extent=BB);
im = axs[1].imshow(np.log1p(density_dropoff), zorder=1, extent=BB, alpha=0.6, cmap='plasma')
axs[1].set_title('Dropoff density [datapoints per sq mile]')
cbar = fig.colorbar(im, ax=axs[1])
cbar.set_label('log(1 + #datapoints per sq mile)', rotation=270)
These plots clearly show that the datapoints are concentrated around Manhattan and the three airports (JFK, EWR, LGA). There is also a hotspot near Seymour (top right). As I'm not from the US: does anybody know what is special about this location?
Distance and time intuitions

Before building a model, I like to test some basic "intuitions":

- The longer the distance between the pickup and dropoff location, the higher the fare.
- Some trips, such as those to the airports, have a fixed fare.
- Fares at night differ from fares during the day.

So, let's check these.

The longer the distance between the pickup and dropoff location, the higher the fare

To visualize the distance-fare relationship, we first need to calculate the distance of each trip.
In [25]:
# add new column to dataframe with distance in miles
df_train['distance_miles'] = distance(df_train.pickup_latitude, df_train.pickup_longitude, \
df_train.dropoff_latitude, df_train.dropoff_longitude)
df_train.distance_miles.hist(bins=50, figsize=(12,4))
plt.xlabel('distance miles')
plt.title('Histogram ride distances in miles')
df_train.distance_miles.describe()
Out[25]:
count 195767.000000
mean 2.070060
std 2.364879
min 0.000000
25% 0.780658
50% 1.338753
75% 2.427781
max 64.644331
Name: distance_miles, dtype: float64
It seems most rides are short trips, with a small peak around 13 miles. This peak is probably due to airport rides.
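If the ~13-mile peak really is airport traffic, fares at that distance should cluster around a flat rate. A sketch with made-up data (the $52 value is an assumed flat fare for illustration, not taken from this dataset):

```python
import pandas as pd

# Made-up rides for illustration; in the notebook this would be df_train
df = pd.DataFrame({'distance_miles': [1.2, 13.1, 12.8, 2.5, 13.4],
                   'fare_amount':    [6.5, 52.0, 52.0, 9.0, 52.0]})

# Select rides near the 13-mile peak and look at their fare distribution;
# a tight cluster suggests a fixed fare
airport_like = df[(df.distance_miles > 12) & (df.distance_miles < 14)]
print(airport_like.fare_amount.median())
```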
Let's look at the effect of the passenger count.
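A minimal sketch of that check (synthetic data standing in for df_train): group the fares by passenger_count and compare the group means.

```python
import pandas as pd

# Synthetic rides for illustration; the notebook would use df_train directly
df = pd.DataFrame({'passenger_count': [1, 1, 2, 2, 3],
                   'fare_amount':     [8.0, 10.0, 9.0, 11.0, 12.0]})

# Mean fare per passenger count; a flat profile would suggest
# passenger_count has little effect on the fare
print(df.groupby('passenger_count').fare_amount.mean())
```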