Taxi Trip Time Winners' Interview: 3rd place, BlueTaxi

Data-driven methods and machine learning in the taxi trip prediction competitions
This post describes in detail how the BlueTaxi team, made up of data scientists from different countries, used data exploration, feature engineering, and machine learning to take third place in the Kaggle-hosted taxi trip time prediction competition. The team members share their backgrounds, competition strategy, and technical details, including feature creation, preprocessing, model selection, and performance evaluation. The post also highlights the importance of data exploration, feature engineering, and model selection, as well as the use of open-source tools.


This spring, Kaggle hosted two competitions with the ECML PKDD conference in Porto, Portugal. The competitions shared a dataset but focused on different problems. Taxi Trajectory asked participants to predict where a taxi would drop off a customer given partial information on their journey, while Taxi Trip Time's goal was to predict the amount of time a journey would take given the same dataset.


418 players on 345 teams competed to predict the time a taxi journey would take.

Team BlueTaxi finished 3rd in Taxi Trip Time and 7th in Taxi Trajectory. This blog outlines how their team of data scientists from five different countries came together, and their winning approach to the Taxi Trip Time competition.

The Basics

The BlueTaxi Team

The BlueTaxi team is very multicultural: we are Ernesto (El Salvador), Lam (Vietnam), Alessandra (Italy), Bei (China), and Yiannis (Greece). We had great fun participating and winning third place in this ECML/PKDD Discovery Challenge, which was organized as the Kaggle Taxi Trip Time competition. In this post we would like to share with you how we did it.

What made you decide to enter this competition?

Ernesto: The Discovery Challenges, organized annually by ECML/PKDD, are always very interesting, and this year was no exception, especially with the challenge hosted on top of Kaggle. After a small chat with Lam, we decided to go for it and enter as the BlueTaxi team in both tracks of the challenge.

I took the lead for the trip time prediction and Lam led the destination prediction.

We invited Alessandra, Bei, and Yiannis to consolidate the final team of five.

Lam: I decided to enter because the competition was hosted by Kaggle, and the ECML/PKDD 2015 conference will be a great opportunity for researchers to benefit from exchanging ideas and experiences during the workshop session held in conjunction with the main conference in Porto next September.

Alessandra: Lam and Ernesto presented the challenge to me and I thought it was very interesting so I decided to join.

Bei: For me that was also the case.

Yiannis: The problem of predicting the destination and trip time for a taxi route seemed very challenging and that's why I decided to participate and join the BlueTaxi team.

What was your background prior to entering this challenge?

Lam: I did my PhD in pattern mining for data streams at Technische Universiteit Eindhoven (TU/e) and joined the IBM Research Lab in Ireland about a year and a half ago. My research interests include mining big and fast data on big data platforms, with applications in telcos and transportation under the smarter city project.

Alessandra: My background is in transportation analytics. Specifically, I work on estimation and prediction of traffic and on urban traffic control systems. I hold a PhD degree in Information Technology from Politecnico di Milano, and currently I am a Research Scientist at IBM Research - Ireland (also known as the IBM Smarter Cities Technology Centre in Dublin, Ireland).

Bei: I am a statistician with primary interests in time series analysis, forecasting, resampling/subsampling methods for dependent data and financial econometrics. My recent work focuses on statistical methods in urban applications. I received my PhD in Statistics from the University of Waterloo, Canada. I joined IBM Research, Ireland, in late 2012.

Yiannis: I am a Research Software Engineer at the Smarter Cities Technology Centre, IBM Research - Ireland. I hold a Master's degree in Computer Science from the Athens University of Economics and Business. Lately, I have been working with spatio-temporal data on various projects focusing on data curation, efficient storage, and analysis. Moreover, I have experience with visualisation of this type of data, helping data scientists gain insights.

Ernesto: I hold a PhD in Computer Science from the L3S Research Center at the University of Hannover, Germany. My background is in supervised machine learning applied to Web Science, Social Media Analytics, and Recommender Systems. I joined IBM Research, Ireland, in early 2014.

Do you have any prior experience or domain knowledge that helped you succeed in this competition?

Lam: At the IBM Research lab I have been working on several projects with similar data, e.g., GPS traces from buses used to predict bus arrival times at bus stops; see our related paper on this topic (Flexible Sliding Window for Kernel Regression Based Bus Arrival Time Prediction) in the industry track at ECML/PKDD 2015.

Ernesto: I did not have any particular domain knowledge in transportation systems, but my experience in machine learning, data science and analytics were of course valuable for the competition.

Alessandra: Yes, my knowledge in the transportation field helped me in the challenge.

Yiannis: My experience with spatio-temporal data helped me in the competition.

Bei: I had some experience analyzing transportation data, which helped me in this challenge.

How did you get started competing on Kaggle?

Ernesto: I joined Kaggle a couple of years ago during my PhD and entered some competitions before. The datasets available on Kaggle are usually very interesting, and some of them were very useful for my research. But to be honest, I never got good traction in a competition until this time.

Lam: I also joined Kaggle 4 years ago, when I was a PhD student, but I hadn't really competed until now.

Bei: I joined Kaggle in 2012, and this was my second competition since then.

Alessandra and Yiannis: For us this was the first time that we had entered a Kaggle competition :)

Let's get technical

What preprocessing and supervised learning methods did you use?

First, we created our local training set by selecting cut-off times matching the five snapshots in the test set (with the same weekdays as well). We also observed that the 14th of August is the day before a big holiday in Portugal and the 21st of December is the last Sunday before Christmas, both very particular days.
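
As a rough illustration, the snapshot construction could look something like the sketch below. This is not our exact pipeline: the column names (TRIP_ID, TIMESTAMP, POLYLINE) and the 15-second GPS sampling interval follow the competition's data description, while make_snapshot and its internals are hypothetical.

```python
import json
import pandas as pd

def make_snapshot(train_df, cutoff_ts):
    """Truncate trips that are still running at cutoff_ts and keep the remaining
    travel time ("delta") as the regression target. Assumes TIMESTAMP is in unix
    seconds and POLYLINE is a JSON list of GPS points sampled every 15 seconds."""
    rows = []
    for _, trip in train_df.iterrows():
        points = json.loads(trip["POLYLINE"])
        if len(points) < 2:
            continue
        start = trip["TIMESTAMP"]
        end = start + 15 * (len(points) - 1)
        if start <= cutoff_ts < end:                    # trip in progress at the cut-off
            seen = int((cutoff_ts - start) // 15) + 1   # GPS updates observed so far
            rows.append({
                "TRIP_ID": trip["TRIP_ID"],
                "partial_polyline": points[:seen],
                "delta_time": end - cutoff_ts,          # seconds still to travel
            })
    return pd.DataFrame(rows)
```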

We created a bunch of features as follows:

1. Features from 10-NN. For every test trip, we find its 10 nearest neighbours w.r.t. the Euclidean distance and consider the durations of those trips as predictors.

2. Features from kernel regression. Similar to 10-NN, kernel regression was used to predict the duration. Recall that kernel regression is a smoothed version of kNN; these features yielded very good results.

3. Features from the partial trips: travelled distance, number of GPS updates, last GPS coordinates, and average speeds and accelerations over different parts of the trip.

When matching a test trip with the training trips, we only considered matching the last 100, 200, 300, 400, 500, and 1000 meters, as well as the full trips, because the later part of the trip is more important in some cases. Our models showed that the last 500 meters of the trip are very important. A minimal sketch of these matching-based features is given below.
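
The sketch below illustrates the kNN and kernel regression features under simplifying assumptions: each (possibly truncated) trajectory is linearly resampled to a fixed number of points so trips can be compared with a plain Euclidean distance, and the kernel estimate is a Gaussian-weighted (Nadaraya-Watson) average of the neighbours' durations. The function names and the bandwidth value are hypothetical, not our tuned configuration; to match only the last N meters, one would trim each trajectory to its trailing segment before resampling.

```python
import numpy as np

def resample(points, k=20):
    """Linearly resample a trajectory (list of [lon, lat] pairs) to k points
    and flatten it, so trips of different lengths become comparable vectors."""
    pts = np.asarray(points, dtype=float)
    idx = np.linspace(0, len(pts) - 1, k)
    lo = np.floor(idx).astype(int)
    hi = np.ceil(idx).astype(int)
    w = (idx - lo)[:, None]
    return ((1 - w) * pts[lo] + w * pts[hi]).ravel()

def knn_kernel_features(test_traj, train_trajs, train_durations, k=10, bandwidth=0.01):
    """Durations of the k nearest training trips plus a Nadaraya-Watson
    (Gaussian kernel) estimate, all usable as features for the final model."""
    durations = np.asarray(train_durations, dtype=float)
    X = np.stack([resample(t) for t in train_trajs])
    q = resample(test_traj)
    dist = np.linalg.norm(X - q, axis=1)               # Euclidean distance between trips
    nn = np.argsort(dist)[:k]                          # indices of the k nearest trips
    weights = np.exp(-(dist[nn] / bandwidth) ** 2)
    kernel_pred = np.dot(weights, durations[nn]) / (weights.sum() + 1e-12)
    return np.concatenate([durations[nn], [kernel_pred]])
```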

trajectory visualization

Two trips with different starting points but the same destination (Porto Airport). The later parts of the trajectories are very close to each other. Therefore, via trip matching we can guess the destination of a trip if we know the destination of trips with a similar route.

We also considered contextual matching (matching only trips with the same taxi id, the same weekday, the same call id, etc.) because we observed different destination distributions for these contexts. Kernel regression on the taxi id context produced the best results.
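
A sketch of how such contextual pools might be formed, assuming the competition's ORIGIN_CALL and TAXI_ID columns; the fallback threshold of 10 trips is an arbitrary illustrative choice, not a rule we actually used.

```python
import pandas as pd

def contextual_candidates(train_df, test_trip, min_trips=10):
    """Restrict the neighbour pool to trips sharing context with the test trip
    (same caller, then same taxi), falling back to the full pool when the
    context has too little history."""
    for col in ("ORIGIN_CALL", "TAXI_ID"):
        if pd.notna(test_trip[col]):
            pool = train_df[train_df[col] == test_trip[col]]
            if len(pool) >= min_trips:
                return pool
    return train_df
```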

When modelling, we did not predict the duration of the whole trip but instead predicted the additional ("delta") travel time with respect to the cut-off timestamp. Since the evaluation metric was RMSLE, we log-transformed the target labels, i.e., we used the log of the additional delta time.
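
A minimal sketch of this target transformation, using log1p/expm1 as one reasonable convention (the exact offsets we used are an implementation detail not covered in this post):

```python
import numpy as np

def make_target(total_duration, elapsed_at_cutoff):
    """Train on the log of the remaining ('delta') travel time past the cut-off."""
    delta = np.maximum(np.asarray(total_duration) - np.asarray(elapsed_at_cutoff), 0)
    return np.log1p(delta)

def predict_total_duration(model, X, elapsed_at_cutoff):
    """Invert the log transform and add back the time already elapsed."""
    delta_pred = np.expm1(model.predict(X))
    return np.asarray(elapsed_at_cutoff) + delta_pred
```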

Outlier handling: we found that trips with missing values (identified at speed limits of 160, 140, and 100 km/h) are more difficult to predict. We tried to recover this information on the test set by looking at the gap between the cut-off timestamp and the start timestamp, but this information turned out not to be reliable, so we decided to remove outliers based on the number of GPS updates, using an absolute deviation from the median of 3.5.
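
One common reading of "an absolute deviation from the median of 3.5" is the MAD-based modified z-score with a 3.5 cut-off; here is a minimal sketch under that assumption.

```python
import numpy as np

def keep_mask(n_gps_updates, threshold=3.5):
    """Flag trips whose number of GPS updates deviates too far from the median,
    using the MAD-based modified z-score (one reading of the 3.5 cut-off)."""
    x = np.asarray(n_gps_updates, dtype=float)
    med = np.median(x)
    mad = np.median(np.abs(x - med))
    modified_z = 0.6745 * (x - med) / (mad + 1e-9)  # 0.6745 makes MAD comparable to std
    return np.abs(modified_z) <= threshold          # boolean mask of trips to keep
```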

The test dataset for this competition was very small (320 instances), which made it very prone to overfitting. Our final solution was a robust ensemble of several models: Random Forests, Gradient Boosted Trees, and Extremely Randomized Trees.
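
A scikit-learn sketch of such an ensemble; the hyper-parameters and the simple averaging of predictions in log space are illustrative assumptions, not our tuned configuration.

```python
import numpy as np
from sklearn.ensemble import (RandomForestRegressor,
                              GradientBoostingRegressor,
                              ExtraTreesRegressor)

# Hypothetical settings; the post does not state how the models were combined.
models = {
    "rf":  RandomForestRegressor(n_estimators=500, n_jobs=-1, random_state=0),
    "gbt": GradientBoostingRegressor(n_estimators=500, learning_rate=0.05, random_state=0),
    "ert": ExtraTreesRegressor(n_estimators=500, n_jobs=-1, random_state=0),
}

def fit_ensemble(X, y_log_delta):
    for m in models.values():
        m.fit(X, y_log_delta)

def predict_ensemble(X):
    preds = np.column_stack([m.predict(X) for m in models.values()])
    return preds.mean(axis=1)   # simple average in log space
```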

Was BlueTaxi the overall winning approach? We expected some variation in the top ranking positions given the small test set, but we were quite surprised by all the shuffling we observed from the public to the final private leaderboard. Note that if we average the performance over 100% of the test set (public and private scores), the BlueTaxi approach is actually the most robust one, with the best overall RMSLE score.

What was your most important insight into the data?

We found many interesting patterns during data exploration, e.g., some trips with the same call id are very regular, and trips to the airport are very easy to predict.

We also used some open-source journey planners to predict travel time, for example feeding them the destination predictions from the other track of the challenge, but we found that the results were not very good for the taxi trips. This tells us that journey planners should incorporate real-time traffic information to improve their travel time estimates.

Destination visualization

Heatmap of the destinations of 155 trips with call id 59708. There are only 8 major destinations. Using the call id information, one can narrow down the possible destinations of a given taxi.

Which tools did you use?

R, Python (Numpy, Scipy, Sklearn and Pandas), Matlab, and JavaScript with Leaflet for visualisation.

How did you spend your time on this competition?

A rough estimate would be 80% on feature engineering, 10% on data exploration, and 10% on modelling.

What was the run time for both training and prediction of your winning solution?

Most features were based on kNN search, which is very time consuming in general, but we implemented a novel indexing technique to solve the efficiency issues; the running time was reduced significantly with the index.
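
Our custom index is not reproduced here, but a standard way to avoid brute-force kNN scans is a tree-based index such as scikit-learn's BallTree over the resampled trip vectors; the sketch below shows that generic alternative, not our technique.

```python
import numpy as np
from sklearn.neighbors import BallTree

def build_index(train_vectors):
    """Build a tree-based index over one fixed-length vector per trip."""
    return BallTree(np.asarray(train_vectors))

def query_knn(tree, test_vector, k=10):
    """Return the indices and distances of the k nearest training trips."""
    dist, idx = tree.query(np.asarray(test_vector).reshape(1, -1), k=k)
    return idx[0], dist[0]
```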

For the modelling part, the most time consuming step is always model parameter selection, which we conducted using cross validation. The final training is relatively fast. We included Extremely Randomized Trees in our ensemble, which by their nature are much faster to train than Random Forests or Gradient Boosted Trees, and their individual results were quite competitive on our validation set.
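
A sketch of a cross-validated parameter search for the Extremely Randomized Trees component; the grid itself is hypothetical, since the post does not list the parameters that were tuned.

```python
from sklearn.ensemble import ExtraTreesRegressor
from sklearn.model_selection import GridSearchCV

# Hypothetical grid, for illustration only.
param_grid = {
    "n_estimators": [200, 500, 1000],
    "max_features": ["sqrt", 0.5, None],
    "min_samples_leaf": [1, 5, 10],
}

search = GridSearchCV(
    ExtraTreesRegressor(n_jobs=-1, random_state=0),
    param_grid,
    scoring="neg_mean_squared_error",  # RMSE on the log target approximates RMSLE
    cv=5,
)
# search.fit(X_train, y_log_delta)     # fit with the engineered features and log targets
```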

Words of wisdom

What have you taken away from this competition?

In the end, our data-driven and machine learning solution led to significantly better results than other models based solely on transportation domain knowledge.

We learnt that data exploration, feature extraction and engineering are very important. Unfortunately, this task still requires a lot of human effort. Modelling is much easier now with many open-source off-the-shelf machine learning tools.

This encourages us to continue working on research directions that try to find a generic solution to automate, or at least semi-automate, this process, i.e. to bring ML and AI closer to each other.

Do you have any advice for those just getting started in data science?

Be always curious and open your mind to what the data has to tell you.

Make data science your passion!

In Kaggle competitions, what really matters is your passion to learn and discover new things, not what you already know.

Teamwork

How did competing on a team help you succeed?

We think that competing as a team was the fundamental key to success. Remember that we were participating in both tracks of the ECML/PKDD challenge, basically two Kaggle competitions at the same time, and this was our side project and hobby.

In the end, BlueTaxi finished 3rd in the taxi trip time prediction track and 7th in the destination prediction track. We would not have been able to achieve such high rankings if we had not competed as a team.

Bio

Ernesto: Ernesto is passionate about Web Science, Big Data analytics and the power of Machine Learning to support decision making processes. His research focus is to contribute novel and intelligent filtering approaches that leverage social interactions, multidimensional relationships, and other ubiquitous data in the Social Web in order to relieve people's information overload. Ernesto's research has advanced the state of the art in personalized ranking and recommender systems and has proven valuable for industrial practitioners as well. Ernesto holds a PhD from the L3S Research Center at the University of Hannover, Germany, and is currently a Research Scientist in Machine Learning and Data Science at IBM Research - Ireland.

Lam: Lam completed his PhD in pattern mining for data streams at Technische Universiteit Eindhoven in December 2012. He joined the IBM Research lab in Dublin, Ireland right after that and has been working on different research projects, proposing novel machine learning and statistical modelling approaches to solve real-world problems and improve city life.

Alessandra: Alessandra Pascale received the B.Sc. degree (with honors) from Politecnico di Bari in 2006 and the M.Sc. degree (mark 110/110) from Politecnico di Milano in 2009, both in Telecommunication Engineering, and the Ph.D. degree in Information Technology from Politecnico di Milano in 2013. She was also enrolled in the ASP (Alta Scuola Politecnica) program, a school of excellence run jointly by Politecnico di Milano and Politecnico di Torino. She was a visiting researcher at the University of California, Berkeley, in the TOPL group, from February to July 2012. In 2013, she worked for 6 months within the European research project DIWINE, "Dense cooperative wireless cloud network".

At the present time she works as Research Scientist at the IBM Smarter Cities Technology Centre in Dublin, Ireland. Her research interests are in the field of signal processing, in particular statistical estimation and prediction of vehicular traffic, distributed/cooperative estimation approaches and urban traffic control systems.

Bei: Bei Chen received her BMath in Statistics and Actuarial Science (with Dean's Honours List and President's Award), and her MMath and PhD in Statistics (with Outstanding Achievement in Graduate Studies Award) from the University of Waterloo, Canada. She is the recipient of the 2011 Pierre Robillard Award of the Statistical Society of Canada, which recognizes the best thesis in probability and statistics defended at a Canadian university.

Prior to joining IBM, Bei Chen was a tenure-track Assistant Professor (September 2011 - October 2012) in the Department of Mathematics and Statistics at McMaster University, Canada. Her research was funded by various grants, including NSERC, ECR and MITACS.

Yiannis: Yiannis Gkoufas holds a Master's degree in Computer Science from the Athens University of Economics and Business. He has been working as a Research Software Engineer in the High-Performance Systems Group in Dublin, Ireland since 2012. His main focus is building domain-specific applications and architectures on top of distributed computing frameworks. Moreover, he is responsible for developing the middleware supporting web-based applications. He has been involved in a wide variety of funded projects revolving around Energy and Demand Forecast Analytics, Transportation Monitoring and Planning, and Event Detection from Social Media.

Disclaimer

Disclaimer from the authors: This post is our own and doesn't necessarily represent IBM's positions, strategies or opinions.
