• NLP learning of attractive questions for Stack Overflow (project repo)
    Jan 2019- Now



    This self-motivated project aims to build a language-assistance app that helps both newly registered users and coding gurus phrase their questions on Stack Overflow. According to the EDA, fewer than 10% of users who registered in 2017-18 posted a question within one year; moreover, only 3.6% of the questions they posted received accepted answers. This indicates that newcomers need more insightful suggestions and assistance to truly take advantage of the community. Detailed data queries can be found in this notebook.

  • Disease risk analysis with Optum-Labs
    Mar 2018- Now


    • Implemented a data ETL pipeline to explore over 200 million patient insurance claims and EMRs (SQL).
    • Designed hypothesis tests and identified a new feature that explained 95% of the variance in fracture (Python/SciPy).
    • Led proposal writing and secured a $300,000+ data mining grant from the National Institutes of Health.


  • Merchant Category Recommendation with Elo
    Jan 2019- Now



    This project aims to predict customer loyalty scores in order to target discounts to specific users.
    • Conducted feature engineering based on credit card transactions and merchant categories (Python/Pandas).
    • Applied ensemble modeling: the first level is a gradient-boosted tree regression model, and the second level is a logistic regression classifier that identifies outliers (scikit-learn/LightGBM).


  • Deal probability prediction with Avito
    May 2018- June 2018


    • Solved the regression problem using NLP features extracted from unstructured advertisement text (TF-IDF).
    • Implemented parallel processing to extract features from 200GB+ of customer-uploaded images.
    • Improved RMSE by 2.8% through model stacking (LightGBM, Linear Regression).


  • Clinical Diagnosis to Image Translator (NCBI NIH Hackathon)
    Apr 2018


    • Deployed a text-to-image translator package in a Docker container on an AWS EC2 instance.
    • Built a pipeline that parses symptom queries and plots the related functional brain regions over a mask.