Scan a novel, calculate pi, and run regression on 50 million rows

Photo by bert brrr on Unsplash

Cloud services firm Domo estimates that for every minute in 2020, WhatsApp users sent 41.7 million messages, Netflix streamed 404,000 hours of video, $240,000 changed hands on Venmo, and 69,000 people applied for jobs on LinkedIn. In that firehose of data are patterns those companies use to understand the present, predict the future, and ultimately stay alive in a hyper-competitive market.

But how is it possible to extract insights from datasets so large they freeze your laptop when you try to load them into pandas? When a dataset has more rows than dollars the median household will earn in 50…

The accuracy is in the aggregate

Photo by Quang Nguyen Vinh from Pexels

Animal groups are greater than the sum of their parts. The individual termite wanders cluelessly while the colony builds a sturdy and well-ventilated mound. The lone stork loses its way while the flock successfully migrates. Across the spectrum of cognitive complexity, we regularly see the emergence of behaviors at the group level that the members alone aren’t capable of. How is this possible?

I spent my Ph.D. puzzling over how golden shiners (generally hopeless and not very intelligent fish) form schools capable of elegantly evading predators. …

Tips and Tricks

The code around your code

Photo by Nathan Waters on Unsplash

The programming concepts in the “Analytics Essentials for Data Science” post covered how to work with data once it’s sitting in front of you. These concepts are sufficient if your workflow looks something like downloading a CSV from Google Drive onto your laptop, analyzing the data, then attaching a PDF to a report.

Yet, what happens when you start a project that requires combining data from hundreds of CSVs? Clicking and dragging can only get you so far. Even if you have the patience, your manager likely doesn’t! …

The foundations for making data make sense

Photo by Robynne Hu on Unsplash

While Excel wizardry might cut it for many analytics tasks, data science work relies heavily on the nuance, reproducibility, and scalability of programming. From statistical tests only available in specialized R and Python libraries, to being able to show step-by-step how a model is formulated and generates predictions, to being able to go from processing one dataset to 1,000 with a few keystrokes, fluency in programming is essential for being an effective data scientist.

We’ll therefore focus on programming skills that are key to effectively manipulating and analyzing data. …

Getting Started

The foundations for analytical thinking

Photo by Artem Maltsev on Unsplash

Whether you’re spending all day in spreadsheets or TensorFlow, being an effective data scientist requires a solid understanding of statistics fundamentals. It’s hard not to write a textbook about all the stats that can be useful, but in this post I’ve narrowed down a set of foundational skills that will get you started for your role, no matter where you go. Let’s get started!

Hold on. Do I actually need to learn stats?

In the era of big data and machine learning, it’s tempting to shrug off learning any stats. When the average laptop is 2 million times more powerful than the computer that got us to the moon…

The pros, cons, and questions to ask before taking the plunge

Photo by Camerauthor Photosandstories on Unsplash

Data science in startups is notorious for being a memorable ride. From work that pivots on a dime from spreadsheets to customer interviews to CI/CD pipelines, to being handed more responsibility than you likely know what to do with, you’re guaranteed to learn nonstop in this role.

But what about being the first data scientist at a startup, the tip of the spear? Or what if you’re the second to join, the first extra hands for those more ambitious projects?

To answer these questions, I spoke with my great friend and colleague, Minkyung Kang (MK), who founded the data science…

Decorators to the rescue

Typing on old keyboard
Photo by Denise Jans on Unsplash.

When it comes to writing complex pipelines running in production, it’s critical to have a clear understanding of what each function does and how its outputs affect downstream functions. But despite our best efforts to write modular, well-tested functions, bugs love hiding in the handoffs between functions — and they can be hard to catch even with end-to-end tests.

This article will cover a Python decorator for input validation that we can use to “lock” the inputs to our functions and immediately notice when there’s an unexpected mismatch.
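To make the idea concrete, here is a minimal sketch of what such a decorator could look like. It is not the implementation from the article itself: the names `validate_inputs` and `clean_data` are hypothetical, and the sketch assumes that "locking" inputs means checking each argument against a plain class type annotation at call time.

```python
import functools
import inspect


def validate_inputs(func):
    """Raise TypeError when an argument doesn't match its type annotation.

    Hypothetical helper for illustration. Only plain class annotations
    (e.g. list, bool, dict) are checked; everything else is skipped.
    """
    sig = inspect.signature(func)

    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        # Map the passed values onto parameter names
        bound = sig.bind(*args, **kwargs)
        for name, value in bound.arguments.items():
            expected = sig.parameters[name].annotation
            # Skip parameters with no annotation
            if expected is inspect.Parameter.empty:
                continue
            # Skip annotations that aren't plain classes (e.g. strings)
            if not isinstance(expected, type):
                continue
            if not isinstance(value, expected):
                raise TypeError(
                    f"{func.__name__}: '{name}' expected "
                    f"{expected.__name__}, got {type(value).__name__}"
                )
        return func(*args, **kwargs)

    return wrapper


@validate_inputs
def clean_data(records: list, drop_nulls: bool = True) -> list:
    """Hypothetical pipeline step: optionally drop missing records."""
    if drop_nulls:
        return [r for r in records if r is not None]
    return records
```

With this in place, `clean_data([1, None, 2])` runs normally, while `clean_data("oops")` fails immediately with a `TypeError` that names the function and the offending argument, rather than letting the bad value travel further down the pipeline.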

Example Pipeline

Consider a simple pipeline where we query an API, clean the data…

How to find a job you won’t leave in six months

Photo by Adam Miller on Unsplash

The data science hype is real. Glassdoor ranked data scientist the best job in America four years in a row, and it was nudged out of the top spot only last year. Data science is transforming medicine, healthcare, finance, business, nonprofits, and government. MIT is spending a billion dollars on a college dedicated solely to AI.

Exploring GitHub, Docker, and Heroku

Photo by Paweł Czerwiński on Unsplash

I’ve been programming now for eight years, and it wasn’t until just months ago that I was able to answer a question I’ve had this whole time: “How do I share my project with someone?”

When I say “project,” I’m not talking about a single R script or a handful of bash commands; even 22-year-old me could figure out copy and paste! I mean a project that has several files, perhaps in multiple languages, with external dependencies. Do I just throw it all into a zip folder? How do I deal with new versions of languages and packages…

Matt Sosna

Data scientist. PhD from Princeton. Passionate about sustainability. Addicted to learning how things work.
