
LIU Yun-Chung
Data Science MS Student
Accurate prediction of the future number of taxi transactions enables taxi-hailing companies to allocate drivers, design incentives, and set prices accordingly. In this project, I showed that a Random Forest regressor achieved an 80% lower root mean squared error (RMSE) than a baseline time series model on the task of predicting the future number of taxi transactions. Exploratory analyses were first conducted on taxi transaction records in New York City from 2015 to 2019, revealing temporal regularities. Hourly transaction counts in December 2019 across 8 taxi zones were used to compare the performance of different algorithms. Using hour, day, month, and transaction counts at previous time points as features, Random Forest regression outperformed the other algorithms (time series models, Long Short-Term Memory) and reached an RMSE of 9.6, reducing the RMSE of the exponential smoothing baseline by 80%. Slides and code for this project can be found on GitHub.
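Below is a minimal sketch of the lag-feature setup with scikit-learn; the lag count, split ratio, and simulated counts are illustrative placeholders, not the project's actual configuration.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error

def make_features(counts: pd.Series, n_lags: int = 24) -> pd.DataFrame:
    """Build hour/day/month features plus lagged counts for one taxi zone."""
    df = pd.DataFrame({"y": counts})
    df["hour"] = df.index.hour
    df["day"] = df.index.day
    df["month"] = df.index.month
    for lag in range(1, n_lags + 1):
        df[f"lag_{lag}"] = df["y"].shift(lag)
    return df.dropna()

# Simulated hourly counts standing in for one zone's December 2019 data.
idx = pd.date_range("2019-12-01", "2019-12-31 23:00", freq="H")
counts = pd.Series(np.random.default_rng(0).poisson(50, len(idx)), index=idx)

df = make_features(counts)
split = int(len(df) * 0.8)              # chronological split, no shuffling
train, test = df.iloc[:split], df.iloc[split:]

model = RandomForestRegressor(n_estimators=200, random_state=0)
model.fit(train.drop(columns="y"), train["y"])
pred = model.predict(test.drop(columns="y"))
print("RMSE:", np.sqrt(mean_squared_error(test["y"], pred)))
```

The split is chronological on purpose: shuffling would leak future transaction counts into training through the lag features.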
(Read More)
One of the recent breakthroughs in machine translation is the application of the transformer. Compared with the sequence-to-sequence model based on recurrent neural networks (RNNs), the attention-based transformer makes use of the encoder output at every time step, which further improves translation performance. In this project, I trained both models and showed that the attention-based model gained 6.8% in BLEU score on an English-French translation dataset from Tatoeba.
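As a sketch of the mechanism behind that gain, here is scaled dot-product attention in NumPy: each decoder position forms a weighted sum over all encoder states rather than relying on a single fixed context vector. The shapes and random inputs are purely illustrative.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Q: (n_q, d), K and V: (n_k, d). Returns (n_q, d) context vectors."""
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)                    # query-key similarities
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # softmax over encoder steps
    return weights @ V                               # weighted sum of values

# Toy example: 3 decoder queries attending over 5 encoder states.
rng = np.random.default_rng(0)
Q, K, V = rng.normal(size=(3, 8)), rng.normal(size=(5, 8)), rng.normal(size=(5, 8))
print(scaled_dot_product_attention(Q, K, V).shape)   # (3, 8)
```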
(Read More)
Word embeddings are an effective way to represent words, carrying semantic and syntactic information in fixed-length numeric vectors. In a seminal paper, Mikolov et al. (2013) demonstrated the Word2vec technique for training word embeddings, including the continuous bag-of-words (CBOW) and skip-gram models. Words with similar neighbors (i.e., contexts) have similar embeddings. The technique has inspired much later work on training word embeddings, which can be applied to analogy and other downstream tasks. In this project, I used NumPy to train word embeddings with the CBOW model, demonstrating the model's structure and math, and applied the resulting embeddings to an analogy task.
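For a sense of the math, here is one CBOW training step in NumPy with a full softmax (the vocabulary size, dimensions, and word IDs are made up for illustration; a practical trainer would use negative sampling or hierarchical softmax to avoid scoring the whole vocabulary).

```python
import numpy as np

V, D = 5000, 100                       # vocabulary size, embedding dimension
rng = np.random.default_rng(0)
W_in = rng.normal(0, 0.01, (V, D))     # input (context) embedding matrix
W_out = rng.normal(0, 0.01, (D, V))    # output (center-word) weight matrix
lr = 0.05

# One training step for a single (context, center-word) pair.
context_ids, center_id = [10, 42, 87, 99], 7
h = W_in[context_ids].mean(axis=0)     # average the context embeddings, (D,)
logits = h @ W_out                     # score every word in the vocabulary
probs = np.exp(logits - logits.max())
probs /= probs.sum()                   # softmax over the vocabulary

grad_logits = probs.copy()
grad_logits[center_id] -= 1.0          # cross-entropy gradient: probs - one-hot
grad_h = W_out @ grad_logits           # gradient flowing back to h, (D,)
W_out -= lr * np.outer(h, grad_logits)
W_in[context_ids] -= lr * grad_h / len(context_ids)
print("loss:", -np.log(probs[center_id]))
```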
(Read More)
Named Entity Recognition (NER) is a natural language processing task that categorizes words into named entities, with numerous applications (e.g., search, text classification). The following is a demo of how I trained a bidirectional Long Short-Term Memory (BiLSTM) model to predict named entities with an accuracy of over 0.95 on the CoNLL-2003 dataset.
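A minimal Keras sketch of this tagger architecture might look like the following; the vocabulary size, sequence length, and layer widths are placeholders rather than the settings used in the demo.

```python
import tensorflow as tf
from tensorflow.keras import layers

VOCAB_SIZE, EMB_DIM, MAX_LEN = 20000, 100, 50  # placeholder sizes
N_TAGS = 9                                     # CoNLL-2003 uses 9 IOB tags

model = tf.keras.Sequential([
    layers.Input(shape=(MAX_LEN,)),
    layers.Embedding(VOCAB_SIZE, EMB_DIM, mask_zero=True),  # 0 = padding id
    layers.Bidirectional(layers.LSTM(128, return_sequences=True)),
    layers.Dense(N_TAGS, activation="softmax"),             # one tag per token
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.summary()
```

With return_sequences=True the Dense layer is applied at every time step, producing a tag distribution per token, and mask_zero=True keeps padded positions out of the loss.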
(Read More)
Machine learning (ML) can be applied for medical decision support in clinical settings. In this post, I demonstrated the workflow of training ML models for pediatric pneumonia patient risk evaluation using randomly generated electronic health records (EHRs). I started with data cleansing and exploratory analysis of physician-identified biological indices, such as pathogens, lab data, and vital signs. The goal was to train models that classify patients into two groups: high-risk patients in need of intensive care and low-risk patients. Various ML models, such as logistic regression, boosted trees, and feedforward neural networks, were applied to the classification task. The workflow was also applied to real-world EHRs at National Taiwan University Hospital and achieved an AUROC of 0.99; the detailed methodology and results were published in JMIR Medical Informatics (Liu et al., 2022).
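As a sketch of the modeling step, the snippet below fits two of the model families on synthetic stand-in data and scores them by AUROC; the feature set and class balance are invented for illustration, not drawn from the EHRs.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for cleaned EHR features (labs, vitals, pathogens);
# label 1 marks the minority high-risk (intensive-care) group.
X, y = make_classification(n_samples=2000, n_features=20,
                           weights=[0.9, 0.1], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

for name, clf in [("logistic regression", LogisticRegression(max_iter=1000)),
                  ("boosted trees", GradientBoostingClassifier())]:
    clf.fit(X_tr, y_tr)
    auroc = roc_auc_score(y_te, clf.predict_proba(X_te)[:, 1])
    print(f"{name}: AUROC = {auroc:.3f}")
```

Stratified splitting and a probability-based metric like AUROC matter here because the high-risk class is rare.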
(Read More)
In this project, I took a modified version of the famous aroma wheel and used word counts to represent the characteristics of French wines. Don’t know how to choose a bottle of wine? Check out your favorite aroma!
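A toy version of the counting step, with made-up tasting notes:

```python
from collections import Counter

# Hypothetical tasting notes; real notes would be mapped onto aroma-wheel terms.
notes = [
    "blackcurrant cherry oak vanilla",
    "cherry raspberry earthy oak",
    "citrus green-apple mineral",
]
counts = Counter(word for note in notes for word in note.split())
print(counts.most_common(3))   # [('cherry', 2), ('oak', 2), ('blackcurrant', 1)]
```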
(Read More)