The foundations for making data make sense

Photo by Robynne Hu on Unsplash

While Excel wizardry might cut it for many analytics tasks, data science work relies heavily on the nuance, reproducibility, and scalability of programming. From statistical tests only available in specialized R and Python libraries, to being able to show step-by-step how a model is formulated and generates predictions, to being able to go from processing one dataset to 1,000 with a few keystrokes, fluency in programming is essential for being an effective data scientist.

We’ll therefore focus on programming skills that are key to effectively manipulating and analyzing data. …


Getting Started

The foundations for analytical thinking

Photo by Artem Maltsev on Unsplash

Whether you’re spending all day in spreadsheets or TensorFlow, being an effective data scientist requires a solid understanding of statistics fundamentals. It’s hard not to write a textbook about all the stats that can be useful, but in this post I’ve narrowed down a set of foundational skills that will get you started for your role, no matter where you go. Let’s get started!

Hold on. Do I actually need to learn stats?

In the era of big data and machine learning, it’s tempting to shrug off learning any stats. When the average laptop is 2 million times more powerful than the computer that got us to the moon…


The pros, cons, and questions to ask before taking the plunge

Photo by Camerauthor Photosandstories on Unsplash

Data science in startups is notorious for being a memorable ride. From work that pivots on a dime from spreadsheets to customer interviews to CI/CD pipelines, to being handed more responsibility than you likely know what to do with, you’re guaranteed to learn nonstop in this role.

But what about being the first data scientist at a startup, the tip of the spear? Or what if you’re the second to join, the first extra hands for those more ambitious projects?

To answer these questions, I spoke with my great friend and colleague, Minkyung Kang (MK), who founded the data science…


Decorators to the rescue

Photo by Denise Jans on Unsplash.

When writing complex pipelines that run in production, it’s critical to understand clearly what each function does and how its outputs affect downstream functions. But despite our best efforts to write modular, well-tested functions, bugs love hiding in the handoffs between functions, and they can be hard to catch even with end-to-end tests.

This article will cover a Python decorator for input validation that we can use to “lock” the inputs to our functions and immediately notice when there’s an unexpected mismatch.
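
As a rough illustration of the idea (a minimal sketch, not necessarily the decorator this article goes on to build), a type-checking decorator might look like the following. The names `validate_inputs` and `clean_records` are hypothetical, and using `isinstance` checks is an assumption about what “locking” the inputs means here.

```python
import functools
import inspect


def validate_inputs(**expected_types):
    """Return a decorator that checks named arguments against expected types.

    Hypothetical helper for illustration only.
    """
    def decorator(func):
        signature = inspect.signature(func)

        # Fail at decoration time if we try to validate a parameter
        # that the function does not actually have.
        unknown = set(expected_types) - set(signature.parameters)
        if unknown:
            raise ValueError(f"Unknown parameters for {func.__name__}: {sorted(unknown)}")

        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            # Map positional and keyword arguments onto parameter names,
            # filling in defaults so every parameter has a value to check.
            bound = signature.bind(*args, **kwargs)
            bound.apply_defaults()

            for name, expected in expected_types.items():
                value = bound.arguments[name]
                if not isinstance(value, expected):
                    raise TypeError(
                        f"{func.__name__}() expected '{name}' to be {expected!r}, "
                        f"got {type(value).__name__}"
                    )
            return func(*args, **kwargs)

        return wrapper

    return decorator


# Hypothetical pipeline step: keep only records at or above a threshold.
@validate_inputs(records=list, min_value=float)
def clean_records(records, min_value=0.0):
    return [r for r in records if r >= min_value]
```

With this in place, `clean_records([1.0, -2.0, 3.5])` runs normally, while `clean_records("not a list")` raises a `TypeError` at the function boundary instead of failing somewhere downstream.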

Example Pipeline

Consider a simple pipeline where we query an API, clean the data…


How to find a job you won’t leave in six months

Photo by Adam Miller on Unsplash

The data science hype is real. Glassdoor labeled data scientist the best job in America four years in a row, and the role was nudged out of the top spot only last year. Data science is transforming medicine, healthcare, finance, business, nonprofits, and government. MIT is spending a billion dollars on a college dedicated solely to AI.


Exploring GitHub, Docker, and Heroku

Photo by Paweł Czerwiński on Unsplash

I’ve been programming now for eight years, and it wasn’t until just months ago that I was able to answer a question I’ve had this whole time: “How do I share my project with someone?”

When I say “project,” I’m not talking about a single R script or a handful of bash commands; even 22-year-old me could figure out copy and paste! I mean a project that has several files, perhaps in multiple languages, with external dependencies. Do I just throw it all into a zip folder? How do I deal with new versions of languages and packages…


Hands-on Tutorials

Impress your friends with SQLAlchemy and PyMongo

Artist’s interpretation of a MongoDB database. Photo by Joel Filipe on Unsplash

From ancient government, library, and medical records to present-day video and IoT streams, we have always needed ways to efficiently store and retrieve data. Yesterday’s filing cabinets have become today’s computer databases, with two major paradigms for how to best organize data: the relational (SQL) versus non-relational (NoSQL) approach.

Databases are essential for any organization, so it’s worth wrapping your head around where each type fits best. We’ll start with a brief primer on the history and theory behind SQL and NoSQL. But memorizing abstract facts can only get you so far, so we’ll then actually create each type…


How to get more likes when you share your project on Instagram

Photo by Leio McLaren (leiomclaren.com) on Unsplash

Welcome to the final post in our spam-catching saga! In the first post, we covered the theory for how to build a model to catch spam. In the previous post, we built out the backend for our app by creating the spam classifier and a small Flask app to serve the model. We ended by creating an API and enabling our model to be invoked from Python scripts anywhere on our computer.

In this post, we’ll take it a step further by designing a nice frontend so you can interact with the model outside of Python. …


Well-said! I like how you explored the "two C's" options.

Matt Sosna

Data scientist. PhD from Princeton. Passionate about sustainability. Addicted to learning how things work. www.mattsosna.com
