A data science workflow

in #datascience, 5 years ago

Some pitfalls of teaching yourself data science

While teaching myself data science, I have tended to follow this process for Kaggle competitions:

  1. Find some starter code
  2. Hack the crap out of the starter code
  3. Add my special sauce
  4. ???
  5. Profit!

While this has served me well enough to learn the basics, it is not a sustainable long-term strategy. It only teaches the bare-bones basics and does not lead to a deeper understanding of the topic. Case in point: using this strategy, I have managed to rank roughly between the top 60% and the top 82%.

Developing a new data science workflow

Realising that I needed a deeper understanding of the discipline, I decided to patch together a new strategy from various sources, so that I have a more structured, general approach at my fingertips for future data science projects.

My new and improved data science steps

  1. Clarify the mission statement. This step involves understanding the objective. What is the purpose of this competition or data science project? What has been done on the topic already? Which research papers are available to give me a solid background in the research problem? My coding skills are pretty good by now, but often my knowledge of statistics, or of the competition's actual domain (say, seismology), is not up to the task.
  2. Get the data. This step involves obtaining a framework that has the data packaged. It can be a Colab notebook, a Kaggle notebook, a standalone project, or perhaps something in a Docker container. Each of these approaches has its advantages and disadvantages, and the choice depends on my objective and the data. Remember that I am trying to avoid starter code, though. This means that I will write my own starter code, and provide that to the community for the right price.
  3. Exploratory Data Analysis (EDA). Once the data has been obtained and read into a framework, I need to pop the hood and see how bad the damage is. This step may be as simple as calling the describe method on a pandas DataFrame, or more involved, like having a look at audio spectrograms.
  4. Data wrangling. Unless I am very lucky, the data will need to be sanitised. Here I check for missing values and decide what to do about them. I usually export the sanitised data to a clean CSV file or some other suitable format, so I can just read it in during later iterations. This step and the following two will need to be repeated a few times, and later iterations may involve feature engineering, depending on what my models require.
  5. Do baseline modelling. For the preliminary baseline, I use a straightforward model like linear regression. This gives me an idea of what I'm aiming for, and shows whether my data wrangling has helped at all in terms of modelling. Note that I have decided to use top 71% as my baseline ranking in order to evaluate my new data science workflow. Meta, bro. Meta AF.
  6. Try to beat the baseline with more modelling. Now I introduce fancier models like CatBoost, do some parameter tuning, and see if the results improve on the baseline. If necessary, I go back to data wrangling and do more modelling to improve upon the baseline.
  7. Communicate the results. Since I am still teaching myself data science, I do not need to develop a full data science product or furnish a data report. I might still do this, in order to build my portfolio and get used to the rest of the tools. However, the most valuable communication for me is a debriefing by means of a blog post. This takes the form of a report on the actions that I took and any lessons learnt, so that I don't repeat the same mistakes in future and have a record of my successes.
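
Purely as an illustration, steps 3 to 6 could be sketched roughly as below. This is a minimal sketch under a pile of assumptions, not my actual competition code: the data is synthetic, the column names (feature_a, feature_b, target) and helper functions are made up for the example, missing values are simply filled with the column median, and scikit-learn's GradientBoostingRegressor stands in for a fancier model like CatBoost so the example only needs pandas, NumPy and scikit-learn.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split


def make_toy_data(n_rows=500, seed=0):
    """Synthetic stand-in for competition data (hypothetical columns)."""
    rng = np.random.default_rng(seed)
    df = pd.DataFrame({
        "feature_a": rng.normal(size=n_rows),
        "feature_b": rng.uniform(0, 10, size=n_rows),
    })
    # Non-linear target, so a fancier model has something to gain.
    noise = rng.normal(scale=0.5, size=n_rows)
    df["target"] = 3 * df["feature_a"] + 5 * np.sin(df["feature_b"]) + noise
    # Blank out 10% of one column to mimic messy real-world data.
    missing_rows = rng.choice(n_rows, size=n_rows // 10, replace=False)
    df.loc[missing_rows, "feature_b"] = np.nan
    return df


def explore(df):
    """Step 3 (EDA): summary statistics and missing-value counts per column."""
    return df.describe(), df.isna().sum()


def wrangle(df):
    """Step 4: fill missing numeric values with the column median.

    Assumption for brevity: in a real project the median should come from
    the training split only, to avoid leaking information from the test set.
    """
    return df.fillna(df.median(numeric_only=True))


def evaluate(model, X_train, X_test, y_train, y_test):
    """Fit the model and return its mean absolute error on held-out data."""
    model.fit(X_train, y_train)
    return mean_absolute_error(y_test, model.predict(X_test))


def run_workflow(seed=0):
    raw = make_toy_data(seed=seed)
    summary, missing_counts = explore(raw)
    clean = wrangle(raw)

    X = clean[["feature_a", "feature_b"]]
    y = clean["target"]
    split = train_test_split(X, y, test_size=0.25, random_state=seed)

    # Step 5: straightforward baseline.
    baseline_mae = evaluate(LinearRegression(), *split)
    # Step 6: a fancier model (stand-in for CatBoost) tries to beat it.
    boosted_mae = evaluate(GradientBoostingRegressor(random_state=seed), *split)

    return {
        "summary": summary,
        "missing_counts": missing_counts,
        "clean": clean,
        "baseline_mae": baseline_mae,
        "boosted_mae": boosted_mae,
    }


if __name__ == "__main__":
    results = run_workflow()
    print(results["summary"])
    print(results["missing_counts"])
    print("baseline MAE:", results["baseline_mae"])
    print("boosted MAE:", results["boosted_mae"])
```

Comparing the two error values is the whole point of having a baseline: if the fancier model does not beat it, that is my cue to go back to data wrangling rather than to keep tuning.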

Resources

I got most of this process from this great Data Science Workflow post. For my purposes, I wanted to keep code and workflow completely separate. Check out that post for an integrated approach, complete with some code in a repo.