Time Series Analysis with Facebook's Prophet

Not quite a year ago, Facebook released its Prophet forecasting tool for Python. Prophet is an interface built on PyStan, making it relatively easy to do time series analysis based on Bayesian methods, and it offers a lot of features to help along the way. Having recently learned of Prophet's existence, I wanted to take it for a test run and see what kind of results I could get with it. I decided to use what I thought would be a pretty unpredictable time series for my data: Bitcoin prices from January 1, 2012 through January 8, 2018.
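Prophet expects its input as a pandas DataFrame with two columns named exactly `ds` (timestamps) and `y` (values). A minimal sketch of getting price data into that shape — the `date` and `price` column names here are hypothetical stand-ins for whatever the raw CSV provides, and the Prophet calls themselves are shown as comments:

```python
import pandas as pd

# Hypothetical raw price data; real Bitcoin prices would be read from a CSV.
raw = pd.DataFrame({
    "date": pd.date_range("2012-01-01", periods=5, freq="D"),
    "price": [4.72, 4.80, 5.00, 5.29, 5.57],
})

# Prophet requires the columns to be named exactly "ds" and "y".
df = raw.rename(columns={"date": "ds", "price": "y"})

# With the fbprophet package installed, fitting and forecasting would be:
# from fbprophet import Prophet
# m = Prophet()
# m.fit(df)
# future = m.make_future_dataframe(periods=365)
# forecast = m.predict(future)
```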

Read More

Investigating the Return for Major League Baseball Free Agents

At the conclusion of the World Series each year, fans of all but one of the 30 major league teams are left disappointed with their team’s finish. The easiest way for teams to try to improve is by signing players on the free agent market. Free agents are players who have accrued at least six years of major league service time and are no longer under contract with any team and, thus, can negotiate terms with any major league franchise. For the teams, free agents represent a significant financial commitment, with some contracts being for tens or even hundreds of millions of dollars. However, there is significant risk involved in signing a free agent, as a player’s performance can vary greatly from that of his recent past. I was interested in looking deeper into how often major league franchises get what they paid for on the free agent market and which organizations are the best at evaluating which players to sign.

Read More

Classifying Songs by Genre with the Million Song Dataset

A few weeks into my Metis bootcamp, we were tasked with identifying a classification problem to attack with machine learning. My immediate thought was to turn to music. If you’re like me, you spent a significant amount of brainpower in your teens obsessing over the correct classification and sub-classification of music (in my case, this revolved mostly around heavy metal, which is blessed/cursed with an overwhelming number of sub-genres). The Million Song Dataset (MSD) provided me with plenty of data to turn my attention to for this problem.

Read More

Using MTA Turnstile Data (My First Foray into Data Science)

The Problem

The Metropolitan Transportation Authority provides its turnstile data freely online. At Metis, it’s the focus of our first data science project. Specifically, we were tasked with using the data to advise a non-profit organization called Women Tech Women Yes (WTWY) on how best to deploy street teams to canvass for people to join their email list at New York train stations. Their goal is to promote awareness of their organization and, hopefully, expand their donation base. Those who join the email list are offered tickets to a gala hosted by WTWY at the beginning of the summer.

My team and I further framed the problem thusly: the gala would be held on July 1, 2018 in the area just north of Madison Square Park, and WTWY has 4 street teams available, each of which would work two 4-hour canvassing shifts each week. We were comfortable defining the problem in this way as it would keep our conclusions and recommendations focused, and, in the end, our results would be easily adapted to whatever specific needs WTWY might have. Recommending 4-hour shifts for the street teams makes sense since the MTA data provides turnstile register counts in 4-hour intervals.

Assumptions

What stood out to me from the beginning to the end of my first data science project was the subtle role that assumptions played at every step of the way. Throughout the investigation into the MTA data, the team’s work was based on a set of assumptions. Without making assumptions about the way the world in general works, even a simple analysis would become impractically complicated. For instance, the work throughout is based on the assumption that any large group of randomly selected MTA riders would have roughly the same proportion of men and women. With that in mind, our process was guided by the idea that the easiest way to engage more women (who would, presumably, be more interested in receiving informational emails from a group promoting women in the tech industry - another assumption) would be to engage the largest number of people in general.

The Data

We decided to look at data from the months of May and June for the years 2017, 2016, and 2015, figuring that these would be most representative of the time of year that WTWY’s street teams would be canvassing. After importing the MTA data into a pandas dataframe, certain anomalies became evident:

  • Some register data was reported at irregular time intervals, as opposed to every four hours starting at midnight as most of the data was.
  • Some of the changes in register number seemed impossibly high, even when accounting for the possibility of a register counter resetting through 0000000000.
  • Some register counts decreased over time, rather than increasing as most did.

In order to deal with these anomalies, we made the following decisions:

  • Ignore station reports that were an extremely short or extremely long time interval apart. We felt comfortable doing so since stations that reported in this way were eventually “corrected” to report on the regular 4-hour schedule most others followed, which indicated that the data from these irregular reports was not reliable. By looking at 3 years’ worth of data, we were able to make reasonable projections about what each station’s expected ridership would be on any particular day of the week.
  • Ignore extremely large changes in register counts. We defined “extremely large” to mean indicating more than one person passing through an individual turnstile each second over a 4-hour interval. We also assumed that descending register counts were still counting accurately (just in reverse) and, thus, still provided usable data.
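The cleaning rules above can be sketched in pandas. The register values below are made up for illustration; real MTA data carries one cumulative counter per turnstile. Note that “more than one person per second over a 4-hour interval” works out to a threshold of 4 × 60 × 60 = 14,400 riders per interval:

```python
import pandas as pd

# One turnstile's cumulative entry register, reported every 4 hours.
# Hypothetical values: includes an impossibly large spike and a
# segment where the counter is descending instead of ascending.
counts = pd.Series([1000, 1800, 2600, 900_000_000, 2610, 2000, 1300])

# Ridership per interval is the difference between consecutive reads.
# abs() treats descending counters as accurate, just counting in reverse.
riders = counts.diff().abs()

# "Extremely large" = more than one person per second for 4 hours.
MAX_PER_INTERVAL = 4 * 60 * 60  # 14,400
riders = riders[riders <= MAX_PER_INTERVAL]
```

The filter also drops the leading `NaN` that `diff()` produces for the first report, since `NaN <= MAX_PER_INTERVAL` evaluates to `False`.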

Again, in order to make progress, the team made assumptions about how the world worked in order to determine which data was reliable. I am confident that establishing which data was reliable based on our reasoning and understanding of the world led to more accurate results than if we had included all of the data in our analysis. However, there was a trade-off: register differences reported at PATH train stations were much larger than at other MTA turnstiles. I hypothesize that the hardware at those stations differs significantly from that used at New York City Subway stations. Regardless, the team decided at this point to focus only on NYC subway lines, as we had a better understanding of how to interpret the data from those machines and could draw conclusions we were confident in.

We also decided at this point to base our recommendations on the total number of riders both entering and exiting a subway station, rather than one or the other. From my point of view, it is reasonable to assume that most people entering a subway station between 8 am and 12 noon are on their way to work, that those entering between 4 pm and 8 pm are leaving work, and so on. However, there was no good way of determining what percentage of people were using the subway to commute versus for some other reason. With that in mind, we decided not to make any assumption in that regard and simply determine when and where ridership was at its peak.
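Finding the peak station/interval pairs then reduces to a groupby over the cleaned data. A sketch with hypothetical per-interval counts (the station names and numbers below are invented for illustration):

```python
import pandas as pd

# Hypothetical per-interval ridership for two stations; real data has
# one row per turnstile report, with entry and exit register differences.
df = pd.DataFrame({
    "station":  ["34 ST-HERALD SQ", "34 ST-HERALD SQ", "23 ST", "23 ST"],
    "interval": ["08:00-12:00", "16:00-20:00", "08:00-12:00", "16:00-20:00"],
    "entries":  [5000, 7000, 2000, 2500],
    "exits":    [6000, 5000, 1800, 2200],
})

# Total traffic ignores direction: entries and exits count equally.
df["traffic"] = df["entries"] + df["exits"]

# Rank station/interval pairs by total traffic to pick canvassing spots.
peaks = (df.groupby(["station", "interval"])["traffic"]
           .sum()
           .sort_values(ascending=False))
```

The head of `peaks` gives the busiest station and 4-hour window, which maps directly onto a street team's shift.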

The Analysis

Read More