Portfolio of Work

The Master's Program offers a comprehensive Portfolio of Work option as a completion exam for students who are not pursuing a thesis or a Capstone project. As part of this option, students are required to take the one-credit, graded course STA 583 - Communicating Statistics and Data Science. This course is specifically designed for portfolio students and is intended to be completed during the Spring semester of their second year, which should coincide with their final semester in the program.

Portfolio Topics

Portfolio topics can be drawn from various sources, such as internship projects, independent studies, datathons or hackathons, campus-wide applied projects like Data+, self-study on publicly available data, or other applied projects that students have participated in and successfully completed by the beginning of their last Spring semester. Please note that course projects alone do not automatically qualify as portfolio topics unless students go beyond the requirements of the course project to extend the analysis.

STA 583 - Communicating Statistics & Data Science

Communication is a critical, yet often overlooked, part of data science and statistics. This one-credit course aims to help second-year Master’s students develop and practice their statistical communication skills as they prepare for their Portfolio of Work. Through interactive sessions, students will learn how to communicate complex data issues effectively, gain experience in explaining and interpreting results clearly and concisely to a diverse range of stakeholders, and create professional-quality reports and presentations, all in a supportive and welcoming peer review environment. This course is mandatory for all portfolio students and may also be taken by thesis or Capstone students.

Course Requirements

At the start of the course, students are expected to have a draft of slides for a 15-minute presentation and a draft of a 5-page report ready. These drafts will serve as the foundation for further improvement and updates as relevant topics are covered in the course. More detailed instructions regarding the presentation and report will be provided throughout the course.

Course Duration and Assessment

The course spans eight to ten sessions, commencing at the beginning of the Spring semester and concluding at the end of March. The final few sessions are reserved for small group presentations, where students will present their work in front of a small group and a faculty member and receive feedback before their final Portfolio of Work presentations in mid-April.

Portfolio Presentations

During the presentation, students will be evaluated on the following:

  • Achievement in core areas of statistical modeling, applied statistics, and statistical computing;
  • Achievement in demonstrating the ability to address and solve real-world problems with relevant statistical and computational methods; and
  • Achievement in communicating in oral and written form with a professional audience.

Students completing the Portfolio of Work presentation must satisfy all of the above criteria at a Satisfactory or Excellent level. Otherwise, the student will receive written feedback on any aspects marked Unsatisfactory, including comments on recommended remedial paths.

MSS Portfolio Award

Each second-year MSS student completing an MSS portfolio will be eligible for the Master’s Portfolio Award. The purpose of the Portfolio Award is to encourage the development of data analysis skills, to enhance presentation skills, and to recognize outstanding work by Master's students. The Statistical Science Portfolio Committee selects the award recipient based on the submitted portfolios and presentations.

Examples of Work 

Portfolio Presentations


Download Example Poster 1 (pdf - 857.01 KB)
Download Example Poster 2 (pdf - 508.15 KB)


Portfolio Titles of Previous Graduates

  • Anomaly Detection Using Convolutional Autoencoders
  • XGBoost Prediction on Airbnb Availability Data
  • Sepsis Detection by Machine Learning Techniques
  • Facing the Heat: Predicting Durham County Heat Advisories Subject to High Class Imbalance and Multicollinearity (Part 1)
  • Facing the Heat: Predicting Durham County Heat Advisories Subject to High Class Imbalance and Multicollinearity (Part 2)
  • Data Pipelining using PySpark
  • Airbnb Availability Prediction Based On Machine Learning Algorithms
  • Testing Differences in Performance of Pricing Models
  • Ambulance Demand Predictions
  • Pricing Model in Mortgage Portfolio
  • Estimating Causal Effect in the Absence of Treatment Observability
  • Cloud Detection in R
  • Statistical Applications to Sports Wagering
  • Machine Learning Approaches to Sentiment Analysis
  • Click Through Rate (CTR) prediction for digital ads under high cardinality
  • Real-time eSport win rate prediction
  • Nearest Neighbor Protopool Model
  • Predicting potential account users by machine learning models
  • Predicting Claim Severity with Machine Learning and Natural Language Processing Techniques
  • Calibrated Bayes Estimates for Bipartite Record Linkage
  • Tackling Imbalanced Dataset for Individual Risk Prediction
  • Bioenergy with Carbon Capture and Storage (BECCS) in the Western United States
  • Exploring Difference in Difference Estimator Behavior Under Different Settings
  • Anomaly Detection on Time Series Data
  • Voice-of-Policygenius-Customer (presenting with MIDS capstone)
  • Hot Shot: A Statistical Analysis of Duke Women’s Basketball Offensive Field Goal Efficiency
  • A Comparison of Record Linkage Methods applied to Real and Synthetic Data
  • Classification Models applied to Airbnb Listings in Asheville, North Carolina
  • Hidden Markov Models for Part of Speech Tagging
  • GDP Forecasting Using MIDAS and LSTM models with Macroeconomic Indicators
  • Modern Classification Approaches applied to In-vehicle Coupon Recommendations
  • Airbnb Availability Prediction with Machine Learning
  • Estimating North Pacific Right Whale Population Density using Machine Learning Methods
  • Modern Statistical Machine Learning Methods Applied to Airbnb
  • Survival Dynamic Generalized Linear Model in Private Market Funding
  • Predicting NBA Draft Prospect Value Using LTR
  • Predicting Fantasy Football with Bayesian Hierarchical Model
  • Predicting Auto Loan Refinances using Machine Learning
  • Assessing Firm Success Based on Board Member Composition Via Hierarchical Modeling
  • Causal Analysis on the Right Heart Catheterization Data
  • Dynamic interventions for COVID-19
  • Classification on Coupon Recommendation Data
  • Estimation of Dynamic Treatment Regimes using Contextual Bandits with Hierarchical Surrogate Outcomes
  • Predicting Coupon Acceptance Using Machine Learning Algorithms
  • Classifying Email Text Data Using Natural Language Processing and Machine Learning Techniques
  • Analysis of street price data on diverted pharmaceutical substances provided by StreetRX
  • Part-of-speech Tagging
  • Stack height estimation from satellite imagery with statistical models
  • Sentiment Analysis with Naive Bayes and Neural Network Classifiers
  • Digitization of healthcare diagnosis: a validation tool for practitioners to assess heart disease diagnoses


  • Using Gradient Boosting Machines to Build an Unconstrained Pure Premium Model
  • Text Classification for Conduct Surveillance and Price Prediction with Gradient Boosting Machines
  • Machine Learning in Pharmacodynamic Modeling of Anti-HIV Microbicide
  • Bayesian Hierarchical Approaches to Topic Modeling and Text Classification
  • Mixed Models to Investigate Sex Difference in Effects of Environmental Interaction on Cognitive Resilience
  • Cost Reduction Analysis with Pharmaceutical Insurance Claim Data and Prediction of Annual Influenza Vaccination Status
  • Text Classification of Active Directory Data with Long Short-term Memory Networks
  • Detecting Medical Insurance fraud with Ensemble clustering
  • Hierarchical Dirichlet Processes for Topic Modeling
  • Forecasting Models in Business Field - Applications in Real Estate and Ecommerce Short Text Classification and Financial Machine Learning
  • Models in Adult Income Prediction and Futures Hedging Strategy
  • Hyperparameter Tuning and Model Selection for Classification Problem
  • Applications of Machine Learning Methods for Classification
  • Highly Multiclass Text Classification in a Business Setting and Airbnb Listing Price Prediction
  • Application of Time-Varying Multivariate Models on Energy Consumption and Economic Data
  • Applied Signal Processing in Medical Device Development
  • Drivers of Course Rating and Models to Predict Ecommerce Sales
  • Multilabel Text Classification and Image Steganalysis
  • Multilevel Models Analysis and Optimization on Product Financial Data
  • Co-occurrence Analysis on MIMIC Dataset
  • Clustering-Based Movie Recommendation System
  • Traffic Index Prediction and Word Embedding
  • Auto-Encoding Graph-Valued Data with Applications to Brain Connectomes and Recommender Systems
  • Applied Forecasting Models in Government Revenue Data
  • Identifying Significant Variables through Random Forest and Ridge Regression
  • Lorenz Interpolation: A Method for Estimating Income Statistics from Tabular Income Data
  • Identifying Musical Similarities Across Geographical Regions
  • Integrating Record Linkage and Propensity Score Matching
  • Spatio-Temporal Analysis of Gun Violence Victims and its Relation with Unemployment Rate in the USA
  • Nonlinear Regression and Network Inference for Neural Spike Count Data
  • Bayesian Item Response Modeling for Assessing State Interventions
  • Interpretable, Fair and Accurate Machine Learning for Criminal Recidivism
  • Developing a Clinical Decision Support Tool for Talaromycosis: A Case Study in Model Selection with Missing Data
  • Density Estimation with Mixture of Spherelets
  • Modified Leave-One-Out Cross-Validation for Linear Model Selection
  • Hierarchical Mixed Model for Influenza Outbreak Detection
  • Bayesian Hierarchical Model Evaluating Heart Surgical Program
  • Email Classification with Machine Learning
  • Hierarchical Modeling for Ranking Pediatric Heart Surgery Mortality
  • A Machine Learning Case Study from an Insurance Data Set
  • A Note of Hierarchical Incremental Gradient Descent on Riemannian Manifold
  • Web Attack Detection using Deep Learning
  • Generating Cartoon Characters with Style Generative Adversarial Network
  • A Statistical Model to Assess Hospitals' Net Income and Rankings
  • Study of Hierarchical Model Applications on Amphetamines
  • Multivariate Linear Regression with Sparsity Estimators
  • Quantification of Cross-Shopping in E-commerce
  • Bayesian Diagnosis Model on Fever in Moshi, Tanzania
  • Analysis and Implementation of K-Means++ with Parallel Initialization
  • Exploring Bayesian Time-Series Models with Financial Data
  • Effect of Democratic Campaign Spending on 2018 House Midterms
  • A Two-stage Labeling Framework for Effective Text Classification
  • Extensions of Predictive Models
  • Bayesian Applications in Time Series 
  • Applied Machine Learning: Classification and Regression Examples
  • Comparing the Performance of DID and LDV in Different Scenarios
  • An R-based Prediction Tool for Optimizing Forecast
  • Applications of Sampling and Clustering Methods
  • Phase Transitions in Linear Models and DID Causal Inference Analysis
  • Community Detection Thresholds in Heterogeneous Graphs
  • Using Biclustering Methods to Classify High Dimensional Data
  • The Application of TVAR Method on Financial Data
  • Approaches to Data Visualization and Prediction: Healthcare to Art
  • Application of Statistical Methods on Financial and Medical Data
  • Machine Learning Models in Health Care
  • Time Series Model in Inventory Optimization Management
  • Unsupervised Exploratory Analysis Tool for Biclustering
  • The Yelp Restaurant Recommendation System
  • Prediction of Default Risks with Statistical Models
  • Machine Learning Application in Video Game Outcome Prediction
  • Statistical Modeling and Insights in Financial Industry
  • Trends in Balloon Catheter Dilation of Paranasal Sinuses
  • Inferring Drug Innovation with Adverse Events 
  • Machine Learning Methods for Spatial and Financial Applications
  • Applied Bayesian Methods for Text Mining
  • Dynamic Factor Analysis in Internet Search Volume and Stock Volatility 
  • Comparing Model-based Ranking Methods to Evaluate Physicians and Hospitals
  • Prediction of Medication Non-adherence with Clinical Notes
  • Evaluating Performance of Hospitals and Physicians using a Binomial Generalized Linear Mixed Model 
  • Text Analysis and Other Exploration
  • Deep Learning for the Automatic Grading of Diabetic Retinopathy 
  • Modeling Economic and Political Dynamics in the Middle East
  • Python Implementation of Bayesian Hierarchical Clustering
  • Implementation and Applications of Bayesian Hierarchical Clustering
  • Multi-Scale Topological Data Analysis to Identify Brain Fiber Connectivity for Biological Systems Applications
  • Bayesian Approach on Correcting Model Performance given Biased Estimates of Feature Values
  • Predicting Patient Admissions in the Medicare Shared Savings Program
  • Comparison of Machine Learning Methods in the Estimation of Housing Prices
  • Evaluating the Performance of a Generalized Recommendation Engine for the Financial Services Industry
  • Predictive Analytics in Healthcare and Medical Data Exploration
  • Establishing a Realistic Prior Model for Complex Geometrical Objects
  • Graph-Coupled HMMs and Deep Neural Network for Modeling Infection and Medical Diagnosis
  • Empirical Study of Topic Modeling in Movie Recommendation
  • Statistical Modeling and Traffic Violation Analysis
  • News' Predictive Power on St. Louis Fed Financial Stress Index
  • Application of Neural Networks with Joint Embedding for Medical Document Classification
  • Analysis and Implementation of Classification Algorithms (K-means++, CONCOR)