Basic Pandas

Published Mar 28, 2017. Last updated Mar 29, 2017.

Pandas is a data analysis library for Python. In this post, I will show how it can help you quickly get insights from different datasets.

Install pyenv

We will install pyenv first. pyenv is a convenient tool if you want to use multiple Python versions on your laptop.

$ brew install pyenv

Install Python 3 and pandas

We use Python 3 here.

$ pyenv install 3.5.0
$ pyenv global 3.5.0
$ pyenv rehash
$ pip install pandas
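
Before going further, it can help to confirm that the interpreter selected by pyenv and the freshly installed pandas are the ones actually being used. The snippet below is a minimal sketch of such a check; the variable names are only illustrative.

```python
import sys

import pandas as pd

# Confirm which interpreter and which pandas version are active.
python_version = sys.version_info[:2]
pandas_version = pd.__version__

print("Python:", python_version)
print("pandas:", pandas_version)
```

If the Python version printed here is not the one you installed with pyenv, your shell is probably still picking up a different interpreter.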

Start coding!

OK, let's start our pandas adventure! By the way, in my opinion Visual Studio Code is the best editor to work with pandas. Don't forget to install the Python extension.

First, we need to import the pandas library. Create a file and add the line below.

import pandas as pd

Download quiz.csv and users.json, which are used to demonstrate what pandas can do.

You can read a JSON file with pd.read_json. It stores the data in a DataFrame, which you can think of as an in-memory table.

# read data from JSON and store it in a DataFrame
user_df = pd.read_json('users.json')
# show the first 5 rows
print(user_df.head())
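
If you do not have users.json at hand, the same call can be tried on an in-memory file. The sketch below assumes a list of records with `email` and `name` fields; that layout is a guess for illustration, and the real file may contain different fields.

```python
import io

import pandas as pd

# Hypothetical stand-in for users.json; the real file's fields may differ.
users_json = io.StringIO(
    '[{"email": "amy@example.com", "name": "Amy"},'
    ' {"email": "bob@example.com", "name": "Bob"}]'
)

sample_users = pd.read_json(users_json)

print(sample_users.shape)          # (rows, columns)
print(list(sample_users.columns))  # column names
print(sample_users.head())         # first 5 rows by default
```

Each JSON object becomes one row, and each key becomes a column.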

Loading CSV data is basically the same operation as above, just with a different file format. pandas supports many file formats, such as JSON, CSV, and Excel.

quiz_df = pd.read_csv('quiz.csv')
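
To see what pd.read_csv produces without the real file, here is a small sketch that reads CSV text from memory. The column names are the ones used later in this post; the rows themselves are made up for illustration.

```python
import io

import pandas as pd

# Hypothetical stand-in for quiz.csv: column names come from this post, values are made up.
quiz_csv = io.StringIO(
    "email,years,familiar language,will you want to use code editor\n"
    "amy@example.com,3,python,T\n"
    "bob@example.com,7,javascript,F\n"
    "cat@example.com,7,python,T\n"
)

sample_quiz = pd.read_csv(quiz_csv)

print(sample_quiz.dtypes)  # pandas infers a type for each column
print(sample_quiz.head())
```

The first line of the CSV is used as the header, and numeric columns such as `years` are parsed as integers automatically.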

Now we can start looking for insights in the data. First, let's find the maximum value of years in the quiz data.

# find max year in quiz data
max_years = quiz_df['years'].max()

Next, get the rows that have the max years. pandas uses a boolean mask to filter data; boolean masks are a powerful tool when you want to query data with complicated conditions.

# build a boolean mask: True for rows whose years equals the max
quiz_df['years'] == max_years
# use the mask to keep only the matching rows
quiz_df[quiz_df['years'] == max_years]
# compute the average years in the quiz data
mean_years = quiz_df['years'].mean()
# count how often each familiar language appears (most frequent first)
result = quiz_df["familiar language"].value_counts()
# find users whose familiar language is the most popular one
popular_language = result.index[0]
quiz_user_with_popular_language = quiz_df[quiz_df['familiar language'] == popular_language]
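
Boolean masks become more useful once you combine several conditions. The sketch below uses made-up rows with the same column names as quiz.csv and shows how masks can be combined with `&` (and), `|` (or), and `~` (not). The thresholds and variable names are only for illustration.

```python
import pandas as pd

# Made-up rows that reuse the column names from quiz.csv.
sample_quiz = pd.DataFrame({
    "email": ["amy@example.com", "bob@example.com", "cat@example.com", "dan@example.com"],
    "years": [3, 7, 7, 1],
    "familiar language": ["python", "javascript", "python", "python"],
})

# Each comparison yields a boolean Series aligned with the rows.
is_experienced = sample_quiz["years"] >= 3
knows_python = sample_quiz["familiar language"] == "python"

# Combine masks with & (and), | (or), ~ (not).
experienced_python = sample_quiz[is_experienced & knows_python]
not_python = sample_quiz[~knows_python]

print(experienced_python["email"].tolist())
print(not_python["email"].tolist())
```

When you write the comparisons inline instead of naming them first, wrap each one in parentheses, for example `df[(df["years"] >= 3) & (df["years"] < 7)]`, because `&` binds more tightly than the comparison operators.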

# join quiz data with user data using a right join (keeps every quiz row)
quiz_with_user = pd.merge(user_df, quiz_df, how='right', on='email')
# drop rows with missing (NaN) values, such as quiz rows with no matching user
result = quiz_with_user.dropna()
# find users who want to use the code editor
result = result[result['will you want to use code editor'] == 'T']
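
To make the effect of the right join and dropna() easier to see, here is a small sketch with two made-up tables. The `name` column and all values are assumptions for illustration; only the `email` key and the `years` column come from this post.

```python
import pandas as pd

# Made-up user and quiz tables that share an "email" key.
sample_users = pd.DataFrame({
    "email": ["amy@example.com", "bob@example.com"],
    "name": ["Amy", "Bob"],
})
sample_quiz = pd.DataFrame({
    "email": ["amy@example.com", "cat@example.com"],
    "years": [3, 7],
})

# Right join: keep every quiz row, attach user columns where the email matches.
quiz_with_user = pd.merge(sample_users, sample_quiz, how="right", on="email")

# cat@example.com has no user record, so its "name" is NaN.
print(quiz_with_user)

# dropna() removes rows containing any missing value, leaving only matched users.
matched = quiz_with_user.dropna()
print(matched["email"].tolist())
```

Bob has a user record but no quiz row, so he does not appear in the result at all. A right join keeps only the keys found in the right-hand table.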