Tools used: Python (pandas, scikit-learn, Matplotlib), R (dplyr)
Supervised a team of five to conduct a comprehensive analysis of 100M+ U.S. flight delay and cancellation records. Through weekly coordination and task tracking, we streamlined the data workflow and identified key predictors of flight delays.
We examined over 40 features related to flight schedules, airport conditions, and weather using visualizations in Python and R. Classification models including logistic regression, decision trees, and random forests achieved up to 72% accuracy with a 46% true positive rate.
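Accuracy and true positive rate measure different things, which is why the two figures above can sit far apart: accuracy counts every correct prediction, while the true positive rate only asks what share of the actually delayed flights was caught. The sketch below computes both from scratch. The ten labels are invented toy values for illustration (1 = delayed, 0 = on time), not the project's data, and the function name `confusion_counts` is hypothetical.

```python
def confusion_counts(y_true, y_pred):
    """Return (tp, fp, tn, fn) for binary labels where 1 is the positive class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp, fp, tn, fn


# Toy labels (assumption, for illustration only): 4 delayed and 6 on-time flights.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 0, 0, 0, 0, 0, 1]

tp, fp, tn, fn = confusion_counts(y_true, y_pred)

# Accuracy: share of all predictions that are correct.
accuracy = (tp + tn) / len(y_true)

# True positive rate (recall): share of actual delays that were flagged.
true_positive_rate = tp / (tp + fn)

print(f"accuracy={accuracy:.2f}, TPR={true_positive_rate:.2f}")
```

When on-time flights outnumber delayed ones, a model can score well on accuracy mostly by predicting the majority class, so the true positive rate is the more telling number for delay detection.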
To further improve performance, we applied Bayesian logistic regression with Markov chain Monte Carlo (MCMC) simulation and hierarchical modeling, reaching 78% accuracy on a refined dataset of 500 observations.
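As a rough illustration of the idea, Bayesian logistic regression places a prior on the coefficients and uses MCMC to draw samples from their posterior. The sketch below is a minimal random-walk Metropolis sampler in plain Python on simulated data. It is not the project's implementation: the single predictor, the N(0, 10^2) priors, the step size, the simulated coefficients, and the function names are all assumptions, and the hierarchical structure is left out.

```python
import math
import random


def log_posterior(beta, xs, ys, prior_sd=10.0):
    """Unnormalized log posterior: Bernoulli-logit likelihood plus N(0, prior_sd^2) priors."""
    b0, b1 = beta
    log_lik = 0.0
    for x, y in zip(xs, ys):
        z = b0 + b1 * x
        # log p(y | z) = y*z - log(1 + exp(z)), using a numerically stable softplus.
        log_lik += y * z - (max(z, 0.0) + math.log1p(math.exp(-abs(z))))
    log_prior = -(b0 ** 2 + b1 ** 2) / (2.0 * prior_sd ** 2)
    return log_lik + log_prior


def metropolis(xs, ys, n_iter=3000, step=0.2, seed=0):
    """Random-walk Metropolis over (intercept, slope); returns samples and acceptance rate."""
    rng = random.Random(seed)
    beta = (0.0, 0.0)
    current = log_posterior(beta, xs, ys)
    samples, accepted = [], 0
    for _ in range(n_iter):
        proposal = (beta[0] + rng.gauss(0, step), beta[1] + rng.gauss(0, step))
        candidate = log_posterior(proposal, xs, ys)
        # Accept with probability min(1, posterior ratio); 1 - random() avoids log(0).
        if math.log(1.0 - rng.random()) < candidate - current:
            beta, current = proposal, candidate
            accepted += 1
        samples.append(beta)
    return samples, accepted / n_iter


def simulate(n=200, b0=-0.5, b1=2.0, seed=1):
    """Simulated data (assumption): one standard-normal predictor and a binary outcome."""
    rng = random.Random(seed)
    xs = [rng.gauss(0, 1) for _ in range(n)]
    ys = [1 if rng.random() < 1.0 / (1.0 + math.exp(-(b0 + b1 * x))) else 0 for x in xs]
    return xs, ys


xs, ys = simulate()
samples, acceptance_rate = metropolis(xs, ys)

# Discard the first 1000 draws as burn-in, then summarize the rest.
kept = samples[1000:]
intercept_mean = sum(b[0] for b in kept) / len(kept)
slope_mean = sum(b[1] for b in kept) / len(kept)
print(f"intercept={intercept_mean:.2f}, slope={slope_mean:.2f}, accept={acceptance_rate:.2f}")
```

In practice a dedicated sampler from an established probabilistic programming library would replace this hand-written loop, but the accept/reject step shown here is the core of the method.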