Accurate prediction of the future number of taxi transactions enables taxi hailing companies to allocate drivers, design incentives and set prices accordingly. In this project, I showed that a Random Forest Regressor achieved about 80% lower root mean squared error (RMSE) than the baseline time series model on the task of predicting the future number of taxi transactions. Exploratory analyses were first conducted on taxi transaction records in New York City from 2015 to 2021, and temporal regularities were found. Hourly numbers of transactions in December 2019 in 8 taxi zones were used to test the performance of different algorithms. Using hour, day, month and the number of transactions at previous time points as features, Random Forest Regression outperformed the other algorithms (time series models, Long Short-Term Memory) and reached an RMSE of 9.6, nearly 80% below the RMSE of the exponential smoothing baseline. Slides and code for this project can be found on GitHub.
Background
Predicting the future number of taxi transactions with fine spatial and temporal resolution is helpful for taxi hailing companies. This problem is a good example of time series forecasting, and the approach can also be applied to other time series problems such as predicting future biological indices, product demand, stock prices or pollution levels. In this project, I first conducted exploratory analysis to look for patterns in the data. Then, I tried candidate algorithms for the prediction task (e.g. time series models, Random Forest Regressors, LSTMs) and compared their results. The findings from the exploratory analysis were used to add features to the machine learning models. The code can be found in my GitHub repository and the results are summarized in the slides.
Dataset
Taxi transaction records from the New York City Taxi and Limousine Commission (TLC), covering 7 consecutive years (2015-2021), were used for the analysis. The records are available from the TLC download page.
Exploratory Analysis: Finding Patterns of Taxi Demand
1. Number of taxi transactions follows a temporal trend: When the transaction records are aggregated, the number of taxi transactions fluctuates monthly, daily and hourly in recurring patterns. The annual number of transactions dropped dramatically in 2020-2021, presumably due to the COVID-19 pandemic.
2. Trends vary in different taxi zones: The pattern of taxi transactions varies from zone to zone. For example, the number of taxi transactions peaks in the evening in some zones (e.g. Zones 25, 228 and 83). In other zones, there is one peak in the morning and another in the evening (e.g. Zone 166). A few zones have exceptionally high numbers of transactions, while most taxi zones have much smaller numbers of transactions in a given period of time.
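The hourly profiles described above come from aggregating individual records by zone and hour. A minimal sketch of that aggregation follows, using only the standard library. The `(pickup_time, zone_id)` record layout, the function names and the sample records are assumptions made for illustration; they are not the raw TLC schema or the project's code.

```python
from collections import Counter
from datetime import datetime


def hourly_counts(records):
    """Count pickups per (zone, date, hour).

    `records` is an iterable of (pickup_time, zone_id) pairs. This layout
    is a hypothetical simplification of the raw TLC files.
    """
    counts = Counter()
    for pickup_time, zone_id in records:
        counts[(zone_id, pickup_time.date(), pickup_time.hour)] += 1
    return counts


def mean_by_hour_of_day(counts, zone_id):
    """Average one zone's hourly counts over the days on which it had pickups."""
    totals, days = Counter(), set()
    for (zone, day, hour), n in counts.items():
        if zone == zone_id:
            totals[hour] += n
            days.add(day)
    return {hour: totals[hour] / len(days) for hour in sorted(totals)}


# Made-up records, for illustration only.
records = [
    (datetime(2019, 12, 1, 8, 5), 166),
    (datetime(2019, 12, 1, 8, 40), 166),
    (datetime(2019, 12, 1, 18, 15), 166),
    (datetime(2019, 12, 2, 8, 30), 166),
    (datetime(2019, 12, 2, 18, 10), 25),
]
counts = hourly_counts(records)
profile = mean_by_hour_of_day(counts, 166)
```

Plotting such a profile per zone is one way to see whether a zone has a single evening peak or both a morning and an evening peak.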
Based on these differences, taxi zones can be grouped into 4 tiers. Tier 1 zones have more than 250 daily transactions on average, accounting for 10.8% (28/260) of all taxi zones in NYC. Tier 2 zones have more than 50 daily transactions on average, accounting for 13.5% (35/260) of all taxi zones in NYC. Tier 3 zones have more than 25 daily transactions on average, accounting for 12.3% (32/260) of all taxi zones in NYC. Zones with fewer than 25 daily transactions on average were categorized as Tier 4, accounting for 63.5% (165/260) of all taxi zones in NYC.
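The tier thresholds can be written as a small lookup. This is a sketch, not the project's code: the function name is hypothetical, and because the text does not say where a zone with exactly 25 daily transactions belongs, the sketch assumes it falls into Tier 4.

```python
def assign_tier(mean_daily_transactions):
    """Map a zone's average daily transaction count to a tier (1 to 4).

    Thresholds follow the text: >250 is Tier 1, >50 is Tier 2, >25 is Tier 3.
    Assumption: a value of exactly 25 is treated as Tier 4.
    """
    if mean_daily_transactions > 250:
        return 1
    if mean_daily_transactions > 50:
        return 2
    if mean_daily_transactions > 25:
        return 3
    return 4


# Made-up averages for four hypothetical zones.
example_tiers = [assign_tier(x) for x in (300.0, 120.5, 30.0, 4.2)]
```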
Framing The Task
The goal of this project is to predict the number of taxi transactions in the future with fine temporal and spatial resolution. Transaction records from December 2019 in 8 randomly selected taxi zones (2 from each tier) were used to test the different algorithms. Transaction records from before December 2019 were used to train the models that required training. Root Mean Squared Error (RMSE) was used to evaluate the prediction results; the lower the RMSE, the closer the predictions are to the actual numbers of taxi transactions.
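For reference, RMSE is the square root of the mean of the squared differences between actual and predicted counts. A minimal standard-library sketch follows; the function name and the sample numbers are illustrative only.

```python
import math


def rmse(actual, predicted):
    """Root mean squared error between two equal-length sequences."""
    if len(actual) != len(predicted) or len(actual) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Made-up hourly counts and predictions: errors of -2, 2 and -3.
example_rmse = rmse([10, 20, 30], [12, 18, 33])
```

Because the errors are squared before averaging, a few large misses in a busy zone weigh more heavily than many small misses in a quiet one.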
Models
1. Time Series Models: Using only the previous numbers of transactions, time series models can be applied to predict future transaction counts. In this project, exponential smoothing (baseline), the Autoregressive model (AR), the Moving Average model (MA) and the ARMA model were tested.
2. Machine Learning Models: Machine learning models can leverage more features for the prediction task. For example, since hourly, daily and monthly trends were found in taxi transactions, hour of day, day of the week and month can be used to predict the number of transactions at future time points. In this project, Random Forest Regression and Long Short-Term Memory (LSTM) models were tested.
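To make the baseline concrete, here is a minimal sketch of simple exponential smoothing that produces one-step-ahead forecasts. The exact variant and the smoothing factor used in the project are not stated, so both the simple (level-only) form and `alpha=0.5` are assumptions.

```python
def exponential_smoothing_forecasts(series, alpha=0.5):
    """Return one-step-ahead forecasts; forecasts[i] is the forecast for series[i].

    The level is initialised with the first observation, so forecasts[0]
    simply equals series[0]. `alpha` is an assumed smoothing factor.
    """
    if not 0 < alpha <= 1:
        raise ValueError("alpha must be in (0, 1]")
    if not series:
        return []
    level = series[0]
    forecasts = [level]
    for observed in series[:-1]:
        # Blend the newest observation with the previous level.
        level = alpha * observed + (1 - alpha) * level
        forecasts.append(level)
    return forecasts


# Made-up hourly counts.
example_forecasts = exponential_smoothing_forecasts([10, 20, 30], alpha=0.5)
```

Each forecast depends only on earlier observations, which is why this kind of model cannot use calendar features such as hour of day.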
Results and Interpretation
Random Forest Regression outperformed the other algorithms (time series models, Long Short-Term Memory) and reached an RMSE of 9.6, nearly 80% lower than that of the exponential smoothing baseline (RMSE 45.7). Hour of day, day of the week, month, and the numbers of taxi transactions in the taxi zone and its 2 neighboring zones at t-1 to t-6 were used as features.
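The feature layout described above can be sketched as follows, assuming scikit-learn is installed. The data here is synthetic, and the helper name, the hyperparameters (`n_estimators=50`, `random_state=0`) and the one-week hold-out are assumptions for illustration; the project's actual configuration is not given in the text.

```python
import math
from datetime import datetime, timedelta

from sklearn.ensemble import RandomForestRegressor

N_LAGS = 6  # t-1 ... t-6, as described in the text


def build_rows(timestamps, target, neighbors):
    """Build one feature row per hour that has N_LAGS hours of history.

    Row layout: hour, day of week, month, then lags t-1..t-6 for the
    target zone followed by the same lags for each neighboring zone.
    """
    X, y = [], []
    for t in range(N_LAGS, len(target)):
        ts = timestamps[t]
        row = [ts.hour, ts.weekday(), ts.month]
        for series in [target] + list(neighbors):
            row.extend(series[t - lag] for lag in range(1, N_LAGS + 1))
        X.append(row)
        y.append(target[t])
    return X, y


# Synthetic hourly counts with a daily cycle (made up, not TLC data).
start = datetime(2019, 11, 1)
timestamps = [start + timedelta(hours=h) for h in range(24 * 40)]


def synthetic(offset):
    return [round(50 + 30 * math.sin(2 * math.pi * (ts.hour - offset) / 24))
            for ts in timestamps]


target, neighbor_a, neighbor_b = synthetic(0), synthetic(1), synthetic(2)

X, y = build_rows(timestamps, target, [neighbor_a, neighbor_b])
split = len(X) - 24 * 7  # hold out the last week, keeping time order
model = RandomForestRegressor(n_estimators=50, random_state=0)
model.fit(X[:split], y[:split])
predictions = model.predict(X[split:])
test_rmse = math.sqrt(
    sum((a - p) ** 2 for a, p in zip(y[split:], predictions)) / len(predictions)
)
```

The split keeps chronological order so that the model is never trained on hours that come after the ones it is evaluated on.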
The improvement from Random Forest Regression may stem from both the algorithm itself and the chosen features. Monthly, daily and hourly trends affect the number of transactions, so they serve as useful features for estimating the number of transactions at a given hour. However, the same features did not work as well with the LSTM model, possibly because it was difficult for the LSTM to converge and find optimal parameters. The numbers of transactions at t-1 to t-6 in each taxi zone and its 2 neighbors were used as features, but the relationship between the current number of transactions and the previous numbers differs across time points. For example, at 5pm the number of transactions in a zone may be high while the numbers from 11am to 4pm were low, whereas at 11pm the reverse might be true. The poor performance could therefore be due to the nature of the LSTM model and to sparse data per hour and zone.
Next Steps
1. Include other features: Other features, such as weather and temperature, may be helpful for the prediction task.
2. Leverage other algorithms: Other models can be leveraged or combined for future prediction of taxi transaction numbers.
3. Scale up: The algorithms can be used to predict future transactions in every zone in New York City. Data points from longer periods of time can be used for model training.
4. Modify model on the go: Different patterns and features might emerge over time. Thus, the model and its weights should be updated regularly to keep the predictions accurate.