(Programming language: Python )
(August 2020 - April 2021)
(joint work with Sanjay Pillay, Allison Roderick, Hao Wang and John Santerre)
Antimicrobial Resistance (AMR) is a growing concern in the medical field. Over-prescription of antibiotics, as well as bacterial mutations, has caused some once-lifesaving drugs to become ineffective against bacteria. However, the problem of AMR might be addressed using Machine Learning (ML), thanks to the increased availability of genomic data and large computing resources. The Pathosystems Resource Integration Center (PATRIC) has genomic data for various bacterial genera, with sample isolates that are either resistant or susceptible to certain antibiotics. Past research has applied ML algorithms to this database to model AMR, with successful results including accuracies over 80%. To better aid future biologists and healthcare workers who may need a predictive model without the benefit of thousands of bacteria samples, this paper explores quantifying the empirical quality of some machine learning models—that is, quantifying how well a model will perform without prior knowledge of how the model performed on a training dataset. WeightWatcher is a Python package that offers various algorithms to measure model quality. This research uses the empirical quality metrics that WeightWatcher introduces for Deep Neural Network (DNN) models to evaluate AMR models, even on datasets built from small samples of bacterial strains. The use of ML in AMR and pharmacogenetic research can help increase the efficacy of antibiotic treatments by predicting whether a strain of bacteria will be resistant or susceptible to an antibiotic.
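A minimal sketch of how such an analysis runs, using WeightWatcher's public API; the tiny network below is a stand-in for a trained AMR model, since the metrics are computed from the weight matrices alone, without training or test data:

```python
import torch.nn as nn
import weightwatcher as ww

# A tiny stand-in network; in the paper's setting this would be the trained AMR DNN.
model = nn.Sequential(nn.Linear(100, 64), nn.ReLU(), nn.Linear(64, 2))

watcher = ww.WeightWatcher(model=model)
details = watcher.analyze()             # per-layer spectral metrics (e.g., power-law alpha)
summary = watcher.get_summary(details)  # model-level averages of the quality metrics
print(summary)
```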
Paper: https://scholar.smu.edu/datasciencereview/vol5/iss1/10/
(Nguyen, Huy H.; Pillay, Sanjay; Roderick, Allison; Wang, Hao; and Santerre, John (2021) "Analyzing Empirical Quality Metrics of Deep Learning Models for Antimicrobial Resistance," SMU Data Science Review: Vol. 5, No. 1, Article 10.)
(January - April 2021)
(joint work with T. Abera, J. Coleman, G. Gonzales and S. McWhirter )
This research covers the Apache Spark ecosystem: where it fits into the cloud framework, how it works under the hood, and machine learning examples that illustrate how the ecosystem operates. Apache Spark is a distributed computing framework that is ideal for handling big data. Spark is worth exploring because handling big data is both very necessary and administratively burdensome. This research discusses where Spark fits into the framework of IaaS, PaaS, and SaaS, as well as its benefits and drawbacks compared to proprietary solutions. It also provides a brief demonstration of how computing on Spark actually takes place, to further illustrate its functions.
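For a flavor of that demonstration, here is a minimal PySpark sketch (the file path and column names are illustrative): transformations only build a lazy execution plan, and an action triggers the distributed computation.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("spark-demo").getOrCreate()

df = spark.read.csv("data.csv", header=True, inferSchema=True)

# These are transformations: nothing executes yet.
result = (df.filter(F.col("amount") > 100)
            .groupBy("category")
            .agg(F.avg("amount").alias("avg_amount")))

# 'show' is an action: Spark now builds the execution DAG and runs it on the cluster.
result.show()
spark.stop()
```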
Final report: https://hnguye01.github.io/spark.pdf
(June - August 2020)
In this leadership role, I serve in a scientific and technical advisory capacity for a multidisciplinary intern team. By supervising the team, we work out how to integrate research content into marketing and advertising materials to create engaging content.
(Programming language: Python )
(joint work with Sanjay Pillay, Hao Wang and Wallun Chung)
(May - August 2020)
Visualization and Data Preprocessing
The Kaggle dataset players_20.csv (https://www.kaggle.com/stefanoleone992/fifa-20-complete-player-dataset) is used in this lab. The dataset provides detailed statistics for the soccer players in various clubs of the major soccer leagues around the world. The data originally comes from the FIFA soccer game created by EA Sports, which estimates the abilities of the actual players and builds the game according to those estimates. The Kaggle dataset is scraped from www.sofifa.com, where the game data is collected. The data we used was last updated on September 19, 2019.
The dataset is valuable because the in-game abilities are estimated from the actual players. FIFA is a very popular game, EA Sports is one of the largest sports video game developers, and the ability estimates are quite accurate. Therefore, useful knowledge can be mined from the data for analyzing a variety of problems in the soccer industry; for example, wage analysis, player analysis, training strategy, budget analysis, and sports gambling strategy can all be studied with this dataset. Soccer is a big industry, with a market size estimated to be worth $488 billion in 2018 according to Business Wire.
Some of the analyses we are interested in performing are:
Run predictive models to predict wage from player abilities. We will select some features to run a regression model, and we will also try combining PCA with a regression model. To measure effectiveness, we will use RMSE, MAE, and R-squared (see the sketch after this list).
Run classification models to classify player positions from player abilities. To validate our model, we will use accuracy, precision, F1 score, and ROC.
We also intend to use the detailed game statistics from fbref.com. We can potentially run analyses using the win/loss results from the real match data.
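A minimal sketch of the planned wage regression, assuming the Kaggle FIFA 20 column names (wage_eur and the six summary skill ratings); the feature choice and PCA dimensionality are illustrative:

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score

df = pd.read_csv("players_20.csv")
features = ["pace", "shooting", "passing", "dribbling", "defending", "physic"]
data = df.dropna(subset=features + ["wage_eur"])
X_train, X_test, y_train, y_test = train_test_split(
    data[features], data["wage_eur"], test_size=0.2, random_state=42)

# Scale, reduce with PCA, then fit ordinary least squares.
model = make_pipeline(StandardScaler(), PCA(n_components=4), LinearRegression())
model.fit(X_train, y_train)
pred = model.predict(X_test)

print("RMSE:", mean_squared_error(y_test, pred) ** 0.5)
print("MAE: ", mean_absolute_error(y_test, pred))
print("R^2: ", r2_score(y_test, pred))
```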
Final report: https://hnguye01.github.io/DS7331/lab1_team_pms.html
Logistic Regression, Stochastic Gradient Descent and Support Vector Machine
In this study, we use several machine learning classification algorithms: Logistic Regression (LR), the Stochastic Gradient Descent classifier (SGD), and Support Vector Machines (SVM). We carry over the Business Understanding and Data Preparation from Lab 1 to this Minilab. More information related to Visualization and Data Preprocessing can be found in Lab 1 - Visualization and Data Preprocessing.
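A minimal sketch of how the three classifiers can be compared on the same features; the feature and target column names are assumed from the Kaggle FIFA 20 dataset, and the hyperparameters are illustrative:

```python
import pandas as pd
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression, SGDClassifier
from sklearn.svm import LinearSVC

# Reuse the Lab 1 preparation: summary skills as features, position as target.
df = pd.read_csv("players_20.csv")
features = ["pace", "shooting", "passing", "dribbling", "defending", "physic"]
df = df.dropna(subset=features + ["team_position"])
X, y = df[features], df["team_position"]

models = {
    "LR":  LogisticRegression(max_iter=1000),
    "SGD": SGDClassifier(loss="log_loss", random_state=42),  # 'log' in older scikit-learn
    "SVM": LinearSVC(),
}
for name, clf in models.items():
    pipe = make_pipeline(StandardScaler(), clf)
    scores = cross_val_score(pipe, X, y, cv=5, scoring="accuracy")
    print(f"{name}: mean CV accuracy {scores.mean():.3f}")
```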
Final report: https://hnguye01.github.io/DS7331/minilab.html
Classification and Regression
In this lab, we will use different machine learning algorithms for our two main tasks documented below.
Run predictive models to predict wage from player abilities (regression models). We will select some features to run regression models, and we will also try running LDA together with a regression model. To measure effectiveness, we will use RMSE, MAE, and R-squared.
Run classification models to classify player positions from player abilities (classification models). To validate our models, we will use accuracy, precision, F1 score, and ROC; a sketch of this task follows below.
We carry over the Business Understanding and Data Preparation from Lab 1. More information related to Visualization and Data Preprocessing can be found in Lab 1 - Visualization and Data Preprocessing.
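A minimal sketch of the position-classification task using Linear Discriminant Analysis, assuming the Kaggle FIFA 20 column names; the grouping of detailed positions into broad roles is a hypothetical simplification:

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.metrics import classification_report

df = pd.read_csv("players_20.csv")
features = ["pace", "shooting", "passing", "dribbling", "defending", "physic"]
df = df.dropna(subset=features + ["team_position"])

# Map detailed positions to three broad groups (hypothetical mapping).
groups = {"CB": "DEF", "LB": "DEF", "RB": "DEF",
          "CM": "MID", "CDM": "MID", "CAM": "MID",
          "ST": "FWD", "LW": "FWD", "RW": "FWD"}
df["pos_group"] = df["team_position"].map(groups)
df = df.dropna(subset=["pos_group"])

X_train, X_test, y_train, y_test = train_test_split(
    df[features], df["pos_group"], test_size=0.2, random_state=42)
clf = LinearDiscriminantAnalysis().fit(X_train, y_train)
print(classification_report(y_test, clf.predict(X_test)))
```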
Final report: https://hnguye01.github.io/DS7331/lab2.html
Clustering Analysis
In this lab, we will use different machine learning algorithms to implement clustering as a feature selection technique. We will compare a few of our existing models from Lab 2 to see if clustering provides any advantage for those models.
For the regression models that predicted player wages, we assume that different types of players (superstars, average players, new players, and so on) can have a big influence on wages. Therefore, there may be outliers that affect the accuracy of the regression models. A superstar usually gets an extremely high wage, and we believe the key features for identifying those players are the attacking skill features. We will implement clustering to find potential groups from the attacking features. Then, we will combine the clustering result with the rest of the features to run regression tests.
We will use MAE for the comparison. MAE measures the average magnitude of the errors in a set of predictions without considering their direction: it is the average, over the test sample, of the absolute differences between prediction and actual observation, with all individual differences weighted equally. A sketch of the approach follows.
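A minimal sketch of clustering as feature engineering: cluster players on attacking skills with k-means, then feed the cluster label into the wage regression. The attacking_* column names are assumed from the Kaggle FIFA 20 dataset, and the cluster count is illustrative:

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

df = pd.read_csv("players_20.csv")
attacking = ["attacking_crossing", "attacking_finishing",
             "attacking_heading_accuracy", "attacking_short_passing",
             "attacking_volleys"]
other = ["pace", "passing", "defending", "physic"]
df = df.dropna(subset=attacking + other + ["wage_eur"])

# Derive a cluster label from attacking skills and add it as a feature.
df["cluster"] = KMeans(n_clusters=4, random_state=42,
                       n_init=10).fit_predict(df[attacking])

X = pd.get_dummies(df[other + ["cluster"]], columns=["cluster"])
mae = -cross_val_score(LinearRegression(), X, df["wage_eur"],
                       cv=5, scoring="neg_mean_absolute_error")
print("Mean CV MAE:", mae.mean())
```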
Final report: https://hnguye01.github.io/DS7331/lab3.html
(joint work with Gautam Kapila and Hao Wang)
(April 2020)
We present an overview and tutorial of the Apache Spark ecosystem used for Big Data Analytics. Apache Spark is a unified computing engine and a set of libraries for parallel data processing on computer clusters, and it has emerged as a leading contender for big data processing. Big Data analytics has been one of the most active research areas in recent times, and database techniques for dealing with large data extend the traditional relational database. Apache Spark can handle different types of large-scale data processing, for example real-time data analysis or graph processing. It is a fast and general engine for large-scale data processing and an open-source option for developers or data scientists interested in Big Data Analytics.
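For a flavor of the parallel-processing model, here is a minimal sketch using Spark's low-level RDD API, the classic word count (the input path is illustrative):

```python
from pyspark import SparkContext

sc = SparkContext(appName="wordcount")

counts = (sc.textFile("input.txt")               # distributed read
            .flatMap(lambda line: line.split())  # one record per word
            .map(lambda w: (w.lower(), 1))       # key-value pairs
            .reduceByKey(lambda a, b: a + b))    # aggregate per key, in parallel

for word, n in counts.take(10):
    print(word, n)
sc.stop()
```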
Final report: https://hnguye01.github.io/DS7330/Apache-Spark.pdf
(Programming language: R )
(joint work with Dustin Bracy and Sabrina Purvis)
(April 2020)
The retail banking industry provides financial services to families and individuals. Banks’ main functions are threefold: they issue credit in the form of loans and credit lines, provide a secure location to deposit money, and offer a mechanism to manage finances in the form of checking and savings accounts. This analysis focuses specifically on the influential factors in direct marketing campaigns managed by a Portuguese banking institution attempting to secure commitments for term deposits. Understanding not only which marketing campaigns were most effective, but also the timing of the campaigns and the socioeconomic demographics involved, will allow the retail banking industry to further target and tune its approach to securing term deposits.
Bank Marketing data from this data set were used to address two project objectives:
Display the ability to perform EDA, perform a logistic regression analysis, and provide interpretation of the regression coefficients, including hypothesis testing and confidence intervals.
With a simple logistic regression model as a baseline, implement additional competing models to improve on the prediction performance metrics (a sketch of the baseline step follows this list).
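A minimal Python sketch of that baseline, assuming the UCI Bank Marketing file bank-additional-full.csv with its semicolon separator and binary y column; the predictors chosen here are illustrative, and the project itself was done in R:

```python
import pandas as pd
import statsmodels.formula.api as smf

df = pd.read_csv("bank-additional-full.csv", sep=";")
df["subscribed"] = (df["y"] == "yes").astype(int)

# Fit the baseline logistic regression; statsmodels reports the coefficient
# z-tests and confidence intervals needed for interpretation.
model = smf.logit("subscribed ~ age + campaign + C(contact) + C(month)",
                  data=df).fit()
print(model.summary())   # coefficients, z-statistics, p-values
print(model.conf_int())  # 95% confidence intervals (log-odds scale)
```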
Final report: https://hnguye01.github.io/DS6372/Stats2Pr2.pdf
(Programming languages: R and SAS)
(joint work with Andrew Leppla and Ikenna Nwaogu)
(February 2020)
The data set from the online forum BeerAdvocate.com has over 1.5 million individual beer reviews that cover 66,055 unique beers from 5,840 breweries. Most of these reviews focus on craft beers and are not representative of mass-market beers like Budweiser, Miller, Coors, etc. The reviews span about 15 years from 1996 to the end of 2011. In addition to brewery and beer names, reviews include 5 different ratings: overall, taste, appearance, aroma, and palate. This project focused on Overall Rating as the primary response. Additional data are provided on beer style, alcohol by volume (ABV), and review time. Beer review data from this data set were used to address two project objectives:
Build predictive regression models using cross validation, with metrics to compare multiple models. Provide interpretation of the regression model(s), including hypothesis testing, interpretation of regression coefficients, and confidence intervals, as well as practical vs. statistical significance.
Perform a secondary analysis using time series and address whether the assumption of independent errors is valid for the final regression model (a sketch of this check follows).
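A minimal Python sketch of the independence check, assuming a flat CSV of the review data with the rating and ABV column names used in the public BeerAdvocate dump; the project itself used R and SAS:

```python
import pandas as pd
import statsmodels.formula.api as smf
from statsmodels.stats.stattools import durbin_watson

# Order by review time so serial correlation in the residuals is meaningful.
df = pd.read_csv("beer_reviews.csv").sort_values("review_time")

model = smf.ols("review_overall ~ review_taste + review_aroma + "
                "review_appearance + review_palate + beer_abv", data=df).fit()

# Durbin-Watson statistic near 2 suggests the errors are independent.
print(durbin_watson(model.resid))
```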
Final report: https://hnguye01.github.io/DS6372/Stats2Pr1.pdf
(Programming language: R)
(December 2019)
DDSAnalytics is an analytics company that specializes in talent management solutions for Fortune 100 companies. Talent management is defined as the iterative process of developing and retaining employees. It may include workforce planning, employee training programs, identifying high-potential employees, and reducing/preventing voluntary employee turnover (attrition). To gain a competitive edge over its competition, DDSAnalytics is planning to leverage data science for talent management. The executive leadership has identified predicting employee turnover as its first application of data science for talent management. Before the business green-lights the project, they have tasked your data science team with conducting an analysis of existing employee data.
Here I conduct a data analysis of the given dataset CaseStudy2-data.csv to identify factors that lead to attrition. I identify the top three factors that contribute to turnover, backed by evidence from the analysis. There may or may not be a need to create derived attributes/variables/features. The business is also interested in learning about any job-role-specific trends that may exist in the data set (e.g., “Data Scientists have the highest job satisfaction”). I also provide other interesting trends and observations, backed by robust experimentation and appropriate visualization. Experiments and analysis are conducted in R. I also build a model to predict attrition.
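A minimal Python sketch of one way to surface the top factors (the project itself is in R; the Attrition column name follows the IBM HR dataset this case study is based on, and ranking by random-forest importance is an illustrative choice):

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

df = pd.read_csv("CaseStudy2-data.csv")
y = (df["Attrition"] == "Yes").astype(int)
X = pd.get_dummies(df.drop(columns=["Attrition"]))

# Rank candidate attrition factors by random-forest feature importance.
rf = RandomForestClassifier(n_estimators=500, random_state=42).fit(X, y)
importances = pd.Series(rf.feature_importances_, index=X.columns)
print(importances.sort_values(ascending=False).head(3))  # top three factors
```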
Github Link: https://github.com/hnguye01/6306two
Presentation slides: https://hnguye01.github.io/DS6306/html/Presentation.pdf
Data Visualization App: https://hnguye01.shinyapps.io/DDSAnalyticsApp/
(Programming language: SAS)
(joint work with Ikenna Nwaogu and Hao Wang)
(December 2019)
Kaggle is an online platform for the data science community; it is also a place where datasets are explored in order to build models that can serve the community.
Ask a home buyer to describe their dream house, and they probably will not begin with the height of the basement ceiling or the proximity to an east-west railroad. However, this playground competition's dataset proves that much more influences price negotiations than the number of bedrooms or a white-picket fence.
In this project, we use 79 explanatory variables describing (almost) every aspect of residential homes in Ames, Iowa, to predict the final price of each home, trying different methods in order to choose the best model. We practice feature engineering and regression algorithms to achieve the lowest prediction error.
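A minimal Python sketch of one such baseline, assuming the competition's train.csv; the competition scores on the RMSE of log SalePrice, hence the log transform (our report itself uses SAS, and the ridge penalty here is illustrative):

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

df = pd.read_csv("train.csv")
y = np.log1p(df["SalePrice"])  # Kaggle scores RMSE on the log of the price

# One-hot encode categoricals and impute missing values very simply.
X = pd.get_dummies(df.drop(columns=["Id", "SalePrice"]), dtype=float)
X = X.fillna(X.median())

rmse = -cross_val_score(Ridge(alpha=10.0), X, y, cv=5,
                        scoring="neg_root_mean_squared_error")
print("CV RMSE (log scale):", rmse.mean())
```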
Kaggle project Description: https://hnguye01.github.io/6371Description.html
Github link: https://github.com/hnguye01/6371Kaggle
Final Report: https://hnguye01.github.io/DS6371/Kaggle_ReportMSDS6371.pdf
(Programming language: R)
(November 2019)
Using R Shiny, I built my first app, based on Budweiser - Case Study 1 (Project 1).
Beer Shiny App link: https://hnguye01.shinyapps.io/beerapp/
(Programming language: R)
(joint work with Jaclyn Coate)
(October 2019)
The focus of our project was to use the given data sets to advise our company (Budweiser) on how to compete with the growing microbrewery industry by tailoring new beer releases to each region's most common styles, mean ABVs, and median IBUs (a sketch of the per-state summary follows).
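A minimal Python sketch of that summary (the project itself is in R; the Beers.csv/Breweries.csv file and column names are assumptions based on the case-study data):

```python
import pandas as pd

beers = pd.read_csv("Beers.csv")      # columns assumed: ABV, IBU, Brewery_id
brews = pd.read_csv("Breweries.csv")  # columns assumed: Brew_ID, State

# Join beers to their breweries, then summarize mean ABV and median IBU per state.
merged = beers.merge(brews, left_on="Brewery_id", right_on="Brew_ID")
summary = merged.groupby("State").agg(mean_abv=("ABV", "mean"),
                                      median_ibu=("IBU", "median"))
print(summary.sort_values("median_ibu", ascending=False).head())
```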
Github Link: https://github.com/hnguye01/6306one
Presentation slides: https://hnguye01.github.io/Pre01.html