I grew up in a working-class family in a dying former rail town at the intersection of Appalachia and America’s Rust Belt, on the borderlands of rural northern Pennsylvania and upstate New York. Despite a childhood of free school lunches and countless detentions, I became a first-generation college student. After a mighty struggle to adapt to life outside my hometown while working numerous jobs, I not only earned a B.S. in Physics from Pennsylvania State University but was also awarded a full scholarship to pursue a Ph.D. in Physics at Colorado State University.
I left graduate school early and moved to Shanghai, China, where I taught physics for two years before starting my own education consulting company. During my seven years in Shanghai, I ran some half-marathons, played on a cricket team, self-studied Mandarin Chinese (and, to a lesser extent, Italian as well as Cebuano – a major language of the Philippines), and married fellow science teacher Jenny Soriano.
We repatriated from Shanghai, China to Washington, DC, where I briefly taught high school mathematics before relocating to Los Angeles, California to pursue a career in data science while continuing to tutor math and physics.
While an undergraduate, I was drawn to the puzzles surrounding ultra-high-energy cosmic rays (including the so-called “Oh-My-God” particles). I helped design, and analyze data from, a lab experiment measuring the angular spread of scattered fluorescence light, similar to what would be observed when one of these cosmic rays crashes into the Earth’s atmosphere above the Pierre Auger Observatory in Argentina.
In graduate school, I continued to study scattered fluorescence light and sifted through large sets of calibration data from the real fluorescence detectors in Argentina to find previously unknown anomalies, including periodic atmospheric effects and potentially defective pixels.
At Shanghai Normal University’s affiliated Cambridge International Centre, I taught physics to final-year high school students using the UK-patterned CIE A-Level Physics curriculum. While there, I standardized the physics department’s summative assessments and developed a unified student tracking and reporting system. I also assisted the school’s nascent IT department in implementing their first student information system, SIMS, and trained other teachers to use the learning management system Moodle.
After returning to the United States, I taught math at the Landon School, just outside Washington, DC, where I redeveloped the existing 9th- and 11th-grade math syllabi to emphasize statistical thinking, practical real-world modeling, and project-based learning. Since then, I have tutored math and physics to dozens of high school and university students, frequently meeting new clients through the WyzAnt platform, where I’ve maintained a (nearly perfect!) 5-star rating.
In 2011, I secured the investment capital (which, prior to recent deregulation, was a fixed amount set by the State Administration of Industry and Commerce) needed to form the Newton American Education Studio, a licensed Wholly Foreign-Owned Enterprise, to help students and their families prepare for, apply to, and succeed at America’s top universities.
I hired, trained, and managed a mixed foreign and local team of up to ten employees: marketers, academic counselors, teachers, and parent-liaison account managers. Unlike many of our competitors in the industry, we positioned Newton as a boutique agency providing honest, long-term guidance to a self-selected clientele seeking legitimate college preparation. Our clients shared our philosophy of focusing not only on gaining admission to top schools but also on preparing students to succeed once they had enrolled.
My staff and I built strong cross-cultural competencies in order to develop relationships with key figures in the education community, government officials, leaders of influential parents’ groups, and school administrators, all of which were required to build our reputation and the critical word-of-mouth sales that came with it.
In addition to guiding our team, I built and maintained nearly all of our company’s technology infrastructure: our website, mail servers, Ministry of Industry and Information Technology compliance, VPNs to jump the Great Firewall, and company accounts on various Chinese social media platforms. I also customized the Salesforce platform to track our marketing efforts, students’ academic progress, and the status of our key community relationships, and I created a primitive “match rating” tool, derived from college admissions data scraped with bash scripts, to help clients select feasible schools, majors, and appropriate extracurricular activities.
I designed the initial course curricula for the SAT, ACT, AP Calculus, AP Physics, UK A-Level Physics, and IB Physics classes and personally worked with some students through the college application process to the most selective universities in the US. We also sponsored and fundraised for the inaugural and second annual Shanghai Battle of the Bands for the Heart-to-Heart charity, which raised funds to support two pediatric heart surgeries.
Data science journey
Ever since I put up my first fan page for my (still) favorite game on Angelfire as a kid, I have used HTML, CSS, and other web technologies extensively. Through middle and high school, I built a local community portal for kids in our hometown, an online gaming community site, and even, as a paid side job, websites for local businesses. My high school senior project was an HTML/CSS tutorial website on how to make websites.
I learned to use SPICE, LabVIEW, MATLAB, and Mathematica in college, and self-studied some C, though I never needed it much for anything. I also took a very useful course on Unix, where I learned to write shell scripts and use tools like grep and awk. While running my company, I used content management systems like Drupal and WordPress (which I’m using now). It still amazes me how much of what I learned about building websites as a kid is no longer practically needed these days to put together a basic site. I suppose this may become true of data science in the future, too.
Probability, statistics, and linear algebra
Before diving into any of the specific technologies used in data science, I wanted to make sure my math was up to the level needed to comprehend what the tools would soon be telling me. Surprisingly, I was never required to take a dedicated statistics course in high school or college – all the statistical understanding I had came via physics courses like quantum mechanics and statistical mechanics. I began thoroughly studying Probability Theory by E.T. Jaynes, but after a few chapters I found that book to be overkill for a quick crash course and moved on, comfortable enough with the basics already under my belt.
For statistics, I experimented with YouTube and was fortunate to discover Brandon Foltz’s incredibly encouraging and accessible statistics channel, which quickly caught me up on the basics of probability distributions, sampling distributions, hypothesis testing, ANOVA, linear and multiple regression, logistic regression, and more. Since then, I haven’t had much difficulty picking up statistics concepts directly from the relevant Wikipedia articles, as needed.
Luckily, as a former physics student, I already had a fairly solid understanding of linear algebra. I keep a copy of Linear Algebra and Its Applications by Gilbert Strang as a reference that I use, when needed.
Data science technologies
I picked up an introductory book on Python, but as with learning a human language (like Chinese, for instance), I found myself unmotivated and unable to form any brain-muscle memory when using the language – I needed to be doing something “real” with it. I discovered DataCamp and paid for a subscription, but I found their linear, fill-in-the-blanks lessons easy to breeze through without really thinking about what I was doing and without forming that brain-muscle memory.
After going through a few courses in this fashion, I stepped back from the lessons and started writing my first web scraper to pull the college admissions data I had previously retrieved with bash scripts while working at Newton – an exercise that took a tremendous amount of time at first but finally consolidated that brain-muscle memory. Only then did I move on to a few more easy DataCamp lessons before again stopping to implement their content in the nascent project. I continued in this fashion until I had completed all the courses in a few different DataCamp career tracks, including Python Programmer, Data Analyst, and Data Scientist, and had implemented the content of most of those lessons in my project.
Data structures and algorithms
With the exception of that one course on Unix, I have never formally studied anything related to computer science. I meticulously read through the first half of Introduction to Algorithms by Cormen, Leiserson, Rivest, and Stein, completing nearly every exercise in Python along the way. I worked with some friends on practice interview problems and started to get a sturdier handle on sorting algorithms, hash tables, data structures, dynamic programming, and more.
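The kind of practice problem we drilled can be sketched with a memoized coin-change solver – a standard dynamic-programming exercise (illustrative, not one of my actual notes):

```python
from functools import lru_cache

def min_coins(coins: tuple, amount: int) -> int:
    """Fewest coins from `coins` summing to `amount`, or -1 if impossible."""

    @lru_cache(maxsize=None)  # memoize subproblems: classic top-down DP
    def best(remaining: int) -> float:
        if remaining == 0:
            return 0
        if remaining < 0:
            return float("inf")  # dead end: overshoots the target
        # Try every coin as the "last" coin and keep the cheapest option.
        return 1 + min(best(remaining - c) for c in coins)

    result = best(amount)
    return -1 if result == float("inf") else int(result)

print(min_coins((1, 5, 10, 25), 63))  # → 6  (25 + 25 + 10 + 1 + 1 + 1)
```

The greedy approach happens to work for US coin denominations, but the memoized recursion above is correct for arbitrary coin sets, which is what makes it a good interview drill.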
Gender gap project
I had been exposed to college admissions data for years while running Newton, but only after building my primitive shell-script scraper to pull college data for the “match” tool did I realize there seemed to be a systemic gender gap in yield rates – the ratio of students accepting offers of admission to total offers extended by the school.
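The computation behind that observation is simple; with made-up numbers for a single hypothetical school (the real figures came from the scraped data), it looks like this:

```python
# Hypothetical offer and enrollment counts for one school, broken out by
# gender; the values are illustrative only.
offers = {"men": 1000, "women": 1100}    # offers of admission extended
enrolled = {"men": 350, "women": 300}    # students who accepted and enrolled

# Yield rate = accepted offers / total offers extended.
yield_rate = {g: enrolled[g] / offers[g] for g in offers}
gap = yield_rate["men"] - yield_rate["women"]

print(f"men: {yield_rate['men']:.1%}, "
      f"women: {yield_rate['women']:.1%}, gap: {gap:+.1%}")
```

Repeating that ratio across hundreds of schools is what surfaced the apparent pattern worth testing statistically.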
After brushing up on probability, statistics, and linear algebra, I started learning the fundamental tools of modern data science and felt that analyzing the old admissions data from Newton would be a perfect personal project to help cement my learning. This project has given me hundreds of hours of experience with the following technologies, and I now feel confident in my ability to:
- with Jupyter notebooks (soon upgrading to JupyterLab),
  - run Jupyter notebooks remotely (and securely) from a server,
  - work on notebooks from my iPad at the cafe!
- with BeautifulSoup,
  - along with requests, retrieve large numbers of pages without supervision,
  - isolate and extract targeted values in web pages.
- with PostgreSQL / SQLite,
  - install database software on a server,
  - use complex SQL queries to select specific data.
- with pandas,
  - perform advanced manipulation of data,
  - comfortably use MultiIndexing,
  - efficiently clean text data using string methods and regular expressions.
- with difflib / fuzzywuzzy,
  - join tables with close but not identical string keys.
- with matplotlib / seaborn / bokeh,
  - visualize distributions with boxplots, ECDFs, etc.,
  - finely control customization of figures,
  - create user-interactive visualization tools.
- with scipy.stats,
  - conduct statistical hypothesis testing,
  - normalize data with Box-Cox transforms.
- with scikit-learn,
  - split training and testing data and cross-validate,
  - build linear, lasso, ridge, and isotonic regression models,
  - use logistic regression models for classification.
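As an illustration of the fuzzy-key joins mentioned above, the standard library’s difflib can map one table’s keys to their closest spellings in another (the school names here are made up for the sketch):

```python
import difflib

# Two sources spell the same schools slightly differently; values are
# illustrative, not from the actual admissions data.
scraped = ["Penn State University", "Colorado St. University"]
canonical = ["Pennsylvania State University", "Colorado State University"]

def fuzzy_join(keys, reference, cutoff=0.6):
    """Map each key to its closest match in `reference` (or None)."""
    mapping = {}
    for key in keys:
        # get_close_matches ranks candidates by SequenceMatcher ratio.
        matches = difflib.get_close_matches(key, reference, n=1, cutoff=cutoff)
        mapping[key] = matches[0] if matches else None
    return mapping

print(fuzzy_join(scraped, canonical))
```

In practice the matched keys become the join column for a normal pandas merge; fuzzywuzzy works the same way but with Levenshtein-based scores.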
In addition, I have a solid grasp of how multiple imputation by chained equations (MICE) works and have used the iterative imputation approach to handle missing data in my project.
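The chained-equations idea is available in scikit-learn as IterativeImputer; a minimal sketch on toy data (not my project’s actual pipeline) looks like this:

```python
import numpy as np
# IterativeImputer is still experimental, so it must be enabled explicitly.
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

# Toy matrix with missing entries; here the third column is roughly the
# sum of the first two, so the imputer has a pattern to learn.
X = np.array([
    [1.0, 2.0, np.nan],
    [3.0, np.nan, 6.0],
    [5.0, 6.0, 11.0],
    [7.0, 8.0, 15.0],
])

# Each feature with missing values is modeled as a regression on the
# other features, cycling round-robin until the estimates stabilize.
imputer = IterativeImputer(max_iter=10, random_state=0)
print(imputer.fit_transform(X))
```

Full MICE draws multiple imputed datasets to propagate uncertainty; IterativeImputer by default returns a single completed dataset in that spirit.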
You can find the detailed write-ups of this project here.