Inferring home and work locations using GPS trajectories and DBSCAN

This article describes a method for inferring a person's home and work locations from GPS trajectory data using the DBSCAN algorithm. By analyzing user 001's GPS trajectory, four main activity areas are identified, and their time distributions are then used to infer the user's work location and residence.



In this post, I will demonstrate how to use a mobile user's GPS trajectory to infer her home and work locations. The data come from the GeoLife GPS Trajectories Dataset released by Microsoft Research Asia (download link). The dataset contains GPS trajectories of 182 users collected over three years. In this demo I use the data of user 001.

# Define the path to the data files
user = '001'
userdata = '/home/data/mobile/gps/Geolife Trajectories 1.3/Data/' + user + '/Trajectory/'

First, we read user 001's trajectory data. Each data point carries latitude, longitude, altitude, date, and time, as shown in the table below. We can plot these GPS points on a map to get a rough idea of the user's activity area.

import numpy as np               # scientific computing
import matplotlib.pyplot as plt  # plotting
import pandas as pd              # data analysis
import os                        # operating-system interface (directory listing)

# Enable inline plotting
%matplotlib inline

filelist = os.listdir(userdata)  # names of all files under the user's Trajectory directory
names = ['lat', 'lng', 'zero', 'alt', 'days', 'date', 'time']
# Skip the header lines at the top of each GeoLife .plt file and apply our own
# column names; index_col=False keeps the default integer index
df_list = [pd.read_csv(userdata + f, header=6, names=names, index_col=False) for f in filelist]

df = pd.concat(df_list, ignore_index=True)  # concatenate all per-file DataFrames into one

# Delete unused columns ('zero' and 'days')
df.drop(['zero', 'days'], axis=1, inplace=True)  # drop() removes rows by default; axis=1 removes columns

# Data is recorded every 1-5 seconds, which is too frequent; reduce it to roughly
# one point per minute (a time-based alternative is sketched after the point count below)
df_min = df.iloc[::12, :]  # keep every 12th row

df_min.head(10)  # view the first 10 rows
            lat         lng  alt        date      time
0     40.013635  116.306926  -59  2008-12-10  23:55:24
12    40.013896  116.307355 -282  2008-12-10  23:55:43
24    40.014095  116.306127  150  2008-12-11  00:06:46
36    40.014539  116.305681  169  2008-12-11  00:07:46
48    40.015257  116.305642  169  2008-12-11  00:08:46
60    40.015921  116.305866  145  2008-12-11  00:09:46
72    40.015933  116.306230  124  2008-12-11  00:10:36
84    40.016083  116.307086   94  2008-12-11  00:11:36
96    40.016240  116.308069   96  2008-12-11  00:12:36
108   40.016419  116.308915  112  2008-12-11  00:13:36
print('Total GPS points: ' + str(df_min.shape[0]))  # df.shape gives (rows, columns)
Total GPS points: 9045
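Taking every 12th row is only a rough approximation of one point per minute, since the raw logging interval varies between 1 and 5 seconds. As an alternative (my own sketch, not part of the original post; df_ts and df_min_alt are illustrative names), we can build real timestamps from the date and time columns and resample on them:

# Optional alternative to the every-12th-row slice above: combine 'date' and
# 'time' into a DatetimeIndex and keep the first fix in each calendar minute.
df_ts = df.copy()
df_ts.index = pd.to_datetime(df_ts['date'] + ' ' + df_ts['time'])
df_min_alt = df_ts.resample('1T').first().dropna(subset=['lat', 'lng'])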
import gmplot
# Declare the center of the map and the zoom level
gmap = gmplot.GoogleMapPlotter(df_min.lat[0], df_min.lng[0], 11)
gmap.plot(df_min.lat, df_min.lng)  # draw the trajectory
gmap.draw("user001_map.html")      # write the map to an HTML file

 

 

We then use the DBSCAN algorithm to identify clusters in this dataset. DBSCAN is a density-based clustering algorithm that is particularly useful for clustering spatial data with many outliers (this post gives a good explanation of DBSCAN). Using this algorithm, we identify 4 clusters in user 001's GPS trajectory. Intuitively, a cluster forms where the user visits a particular area frequently, so I assume that the user's home and work locations fall within these 4 clusters.

Credit: the code below is adapted from Geoff Boeing's post.

 

from sklearn.cluster import DBSCAN
from sklearn import metrics

# Represent the GPS points as (lat, lng) pairs
coords = df_min.as_matrix(columns=['lat', 'lng'])  # on newer pandas, use df_min[['lat', 'lng']].to_numpy()

# Earth's radius in kilometers
kms_per_radian = 6371.0088
# Define epsilon as 0.5 km, converted to radians for use with the haversine metric
epsilon = 0.5 / kms_per_radian  # roughly 7.85e-5 radians

# eps is the maximum distance between two points for them to be considered part of the same cluster
# min_samples is the minimum cluster size (everything else is classified as noise)
db = DBSCAN(eps=epsilon, min_samples=100, algorithm='ball_tree', metric='haversine').fit(np.radians(coords))
cluster_labels = db.labels_  # cluster label of each point
# Get the number of clusters (ignore noise samples, which are labeled -1)
num_clusters = len(set(cluster_labels) - set([-1]))

print('Clustered ' + str(len(df_min)) + ' points to ' + str(num_clusters) + ' clusters')

# Turn the clusters into a pandas Series of coordinate arrays
clusters = pd.Series([coords[cluster_labels == n] for n in range(num_clusters)])
print(clusters)

 

Clustered 9045 points to 4 clusters

0    [[40.013635, 116.306926], [40.013896, 116.3073...
1    [[40.069887, 116.334147], [40.069912, 116.3357...
2    [[39.966773, 116.432298], [39.966313, 116.4328...
3    [[39.909055, 116.411944], [39.90893, 116.41128...
dtype: object
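As a quick sanity check (my addition, not part of the original post), we can look at the size of each cluster, the number of noise points, and the silhouette coefficient of the non-noise points, reusing the metrics module imported above; passing metric='haversine' to silhouette_score requires a reasonably recent scikit-learn.

# Sanity check: points per cluster, noise count, and the silhouette coefficient
# of the non-noise points (computed on radian coordinates with haversine distances)
coords_rad = np.radians(coords)
non_noise = cluster_labels != -1
for n in range(num_clusters):
    print('cluster ' + str(n) + ': ' + str((cluster_labels == n).sum()) + ' points')
print('noise points: ' + str((cluster_labels == -1).sum()))
print('silhouette coefficient: ' +
      str(metrics.silhouette_score(coords_rad[non_noise], cluster_labels[non_noise], metric='haversine')))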
from shapely.geometry import MultiPoint
from geopy.distance import great_circle
def get_centermost_point(cluster):
    # compute the centroid of the cluster, then return the actual GPS point closest to it
    centroid = (MultiPoint(cluster).centroid.x, MultiPoint(cluster).centroid.y)
    centermost_point = min(cluster, key=lambda point: great_circle(point, centroid).m)
    return tuple(centermost_point)

# Get the representative (centermost) point of each cluster
centermost_points = clusters.map(get_centermost_point)  # apply get_centermost_point to every cluster
lats, lons = zip(*centermost_points)                    # unzip into separate latitude and longitude sequences
rep_points = pd.DataFrame({'lon': lons, 'lat': lats})   # one row per cluster center
# e.g. plt.subplots(1, 3, figsize=(15, 7)) would create one row of three 15x7 subplots
fig, ax = plt.subplots(figsize=[10, 6])  # one figure with a single axes
# scatter arguments: x/y coordinates, c = marker color, edgecolor = marker edge color,
# s = marker size, alpha = opacity between 0 (transparent) and 1 (opaque)
rs_scatter = ax.scatter(rep_points['lon'][0], rep_points['lat'][0], c='#99cc99', edgecolor='None', alpha=0.7, s=450)
ax.scatter(rep_points['lon'][1], rep_points['lat'][1], c='#99cc99', edgecolor='None', alpha=0.7, s=250)
ax.scatter(rep_points['lon'][2], rep_points['lat'][2], c='#99cc99', edgecolor='None', alpha=0.7, s=250)
ax.scatter(rep_points['lon'][3], rep_points['lat'][3], c='#99cc99', edgecolor='None', alpha=0.7, s=150)
df_scatter = ax.scatter(df_min['lng'], df_min['lat'], c='k', alpha=0.9, s=3)  # the raw GPS points
ax.set_title('Full GPS trace vs. DBSCAN clusters')
ax.set_xlabel('Longitude')
ax.set_ylabel('Latitude')
# Legend for the GPS points and the cluster centers; loc sets the legend position
ax.legend([df_scatter, rs_scatter], ['GPS points', 'Cluster centers'], loc='upper right')

labels = ['cluster{0}'.format(i) for i in range(1, num_clusters+1)]  # one label per cluster
for label, x, y in zip(labels, rep_points['lon'], rep_points['lat']):
    plt.annotate(
        label,                          # annotation text
        xy = (x, y), xytext = (-25, -30),  # xy: arrow tip; xytext: offset of the label text
        textcoords = 'offset points', ha = 'right', va = 'bottom',
        bbox = dict(boxstyle = 'round,pad=0.5', fc = 'white', alpha = 0.5),
        arrowprops = dict(arrowstyle = '->', connectionstyle = 'arc3,rad=0'))

plt.show()

 

To infer the home and work locations, I use a very simple heuristic here: time. Below I plot the hour-of-day distribution of the GPS points in each of the four clusters. We can see that from 9 am to 6 pm the user stays in the cluster 1 area, while from midnight to 8 am she tends to stay in clusters 2 and 3. I therefore infer that user 001's work location is in cluster 1 and her home is in cluster 2; cluster 3 is probably a secondary residence.

Of course, more sophisticated heuristics could be applied to infer home and work locations. For example, we could also compare the user's whereabouts on weekdays and weekends for additional clues (a sketch of this idea follows the histogram code below).

# Get the hour of day for each GPS point in each cluster
M = []
def myfunc(row):
    # look up the original record for this (lat, lng) pair and return its hour of day
    t = df_min[(df_min['lat']==row[0]) & (df_min['lng']==row[1])]['time'].iloc[0]
    return t[:t.index(':')]
for i in range(num_clusters):
    hours = np.apply_along_axis(myfunc, 1, clusters[i]).tolist()
    M.append(list(map(int, hours)))
# Create four subplots sharing the x-axis; figsize sets the overall figure size
f, axarr = plt.subplots(4, sharex=True, figsize=(6,10))
axarr[0].hist(M[0])
axarr[0].text(20, 1600, "cluster 1")  # place a text label at the given data coordinates
axarr[1].hist(M[1])
axarr[1].text(20, 50, "cluster 2")
axarr[2].hist(M[2])
axarr[2].text(20, 40, "cluster 3")
axarr[3].hist(M[3])
axarr[3].text(20, 50, "cluster 4")
axarr[3].set_xlabel("Hours of a day")  # x-axis label
plt.xticks(np.arange(0, 25, 2.0))
# Shared y-axis label, centered and rotated vertically
f.text(0.04, 0.5, '# of GPS points', va='center', rotation='vertical')
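As mentioned above, weekday versus weekend presence gives additional clues. The following sketch (my addition, not from the original post) matches each clustered point back to its row in df_min, the same lookup myfunc() uses, and counts weekday versus weekend fixes per cluster; a cluster visited mostly during weekday office hours is more plausibly the workplace.

# For each cluster, count GPS fixes that fall on weekdays (Mon-Fri) vs. weekends
for i in range(num_clusters):
    weekday_count = weekend_count = 0
    for lat, lng in clusters[i]:
        rec = df_min[(df_min['lat'] == lat) & (df_min['lng'] == lng)].iloc[0]
        if pd.Timestamp(rec['date']).dayofweek < 5:  # 0 = Monday ... 6 = Sunday
            weekday_count += 1
        else:
            weekend_count += 1
    print('cluster ' + str(i + 1) + ': ' + str(weekday_count) + ' weekday / '
          + str(weekend_count) + ' weekend points')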

fig, ax = plt.subplots(figsize=[10, 6])
rs_scatter = ax.scatter(rep_points['lon'][0], rep_points['lat'][0], c='#99cc99', edgecolor='None', alpha=0.7, s=450)
ax.scatter(rep_points['lon'][1], rep_points['lat'][1], c='#99cc99', edgecolor='None', alpha=0.7, s=250)
ax.scatter(rep_points['lon'][2], rep_points['lat'][2], c='#99cc99', edgecolor='None', alpha=0.7, s=250)
ax.scatter(rep_points['lon'][3], rep_points['lat'][3], c='#99cc99', edgecolor='None', alpha=0.7, s=150)
df_scatter = ax.scatter(df_min['lng'], df_min['lat'], c='k', alpha=0.9, s=3)
ax.set_title('Full GPS trace vs. DBSCAN clusters')
ax.set_xlabel('Longitude')
ax.set_ylabel('Latitude')
ax.legend([df_scatter, rs_scatter], ['GPS points', 'Cluster centers'], loc='upper right')

labels = ['Work', 'Home', 'Home 2']
for label, x, y in zip(labels, rep_points['lon'][:num_clusters-1], rep_points['lat'][:num_clusters-1]):
    plt.annotate(
        label, 
        xy = (x, y), xytext = (-25, -30),
        textcoords = 'offset points', ha = 'right', va = 'bottom',
        bbox = dict(boxstyle = 'round,pad=0.5', fc = 'white', alpha = 0.5),
        arrowprops = dict(arrowstyle = '->', connectionstyle = 'arc3,rad=0'))

plt.show()

 

Finally, we mark these inferred locations (work, home, home 2) back on the map.

gmap = gmplot.GoogleMapPlotter(rep_points['lat'][0], rep_points['lon'][0], 11)
gmap.plot(df_min.lat, df_min.lng)
gmap.heatmap(rep_points['lat'][:3], rep_points['lon'][:3], radius=20)
gmap.draw("user001_work_home.html")

 
