How I learned Dask

Published Mar 10, 2021

I am a 25-year-old Tech Lead from Brazil. Working with data since I was 19, I have accumulated an outstanding amount of professional experience for someone my age, taking on leadership roles in the companies and projects I have been involved with. Over the years I have had the opportunity to develop projects across diverse domains such as scientometrics, law, sustainability, health and, more recently, retail.

In my work I aim to combine my theoretical understanding of Artificial Intelligence and Machine Learning techniques with the practical implementation details required to build excellent products that add value for my stakeholders. Over the years I have worked with a great number of tools and techniques: from complex networks to deep learning, from PostgreSQL to Cassandra, from Docker to Databricks.

My goal is to further develop my leadership skills and hone my expertise in MLOps, solving business challenges with scalable and performant ML pipelines. It is a field I have already started exploring, both as a self-learner and as a consultant adapting to the demands of my data products. Progress is what makes me wake up every day and give my best!

Imagine that you are about to do some exploratory data analysis (EDA) on a super cool dataset you just found. You are prepared! You opened Jupyter Lab, created a notebook, imported the Open Data Stack libraries and, as you are loading the dataset into a pandas DataFrame... BAM! ... MemoryError!!

Now imagine a second scenario: you are in the middle of feature engineering and then, in the most complex data manipulation... ZzZzZzZzZzZ ... an extremely long run time...

I bet those scenarios have happened to you. And they are becoming more and more frequent as companies collect more and more data! The truth is that dealing with massive datasets is increasingly common. But fear not, because where there is a will, there is a way. Dask can help you with these problems! But what is Dask?

Dask is a Python library that parallelizes computational tasks natively. Because its parallel collections are built on familiar objects, such as pandas DataFrames and NumPy arrays, it is very easy to learn and to apply to your big data problems!

To get a grasp of this technology, I started by reading all the tutorials and best practices on the Dask website: dask.org

After that, I started applying those concepts to a dataset. In my case, I used the New York City Taxi Trip Duration dataset from Kaggle: kaggle.com/c/nyc-taxi-trip-duration

Finally, to consolidate my knowledge, I read a book by Jesse C. Daniel called "Data Science with Python and Dask".

The biggest challenge I faced was getting used to the lazy way Dask operates. Dask builds a Directed Acyclic Graph (DAG) of the workload first and only runs it when you explicitly ask for the result. That was the biggest difference from pandas or NumPy, which execute each operation immediately.
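A small sketch with `dask.delayed` shows this laziness in action. The functions `double` and `add` and the `calls` list are hypothetical helpers, used only to record when work actually happens.

```python
from dask import delayed

calls = []  # records every function call that really executes


@delayed
def double(x):
    calls.append(("double", x))
    return 2 * x


@delayed
def add(a, b):
    calls.append(("add", a, b))
    return a + b


# Calling delayed functions only builds the task graph: nothing has run yet.
total = add(double(1), double(2))
ran_before_compute = len(calls)

# compute() walks the graph and executes the three tasks.
result = total.compute()

print(ran_before_compute, result, len(calls))
```

Until `compute()` is called, `total` is just a description of the work to be done. If graphviz is installed, `total.visualize()` can draw that graph, which helps a lot when getting used to this way of working.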

What I enjoyed most about this experience was applying my skills to massive datasets! It is so much fun to process a 100 GB dataset with ease using a familiar tool!

So, my advice for learning Dask is: read the website, practice and, after that, solidify your knowledge with a more extensive book!

Now that I know Dask, I want to apply it in scenarios where I would otherwise have used Spark! Although Spark is a great tool, Dask is much more flexible and familiar for data scientists!

Gabriel Fritz Sluzala
