# Data Science with Python & R: Exploratory Data Analysis

Published Jul 27, 2015Last updated Feb 09, 2017

## Introduction and Getting Data refresh

Here we are again, with a new episode in our series about doing data science with the two most popular open-source platforms you can use for the job nowadays. In this case we will have a look at a crucial step of the data analytics process, that of the Exploratory Data Analysis.

Exploratory data analysis takes place after gathering and cleaning data, and before any modeling and visualisation/presentation of results. However, it is part of an iterative process. After doing some EDA we can try to build some models or present some visualisations. At the same time, based on the results of the later we can perform some more EDA and so on. It is all about quickly finding clues and not so much about details or aesthetics. Among the main purposes of this type of analysis are of course getting to know our data, its tendencies and its quality, and also to check or even start formulating our hypothesis.

And with that idea in mind we will explain how to use descriptive statistics and basic plotting, together with data frames, in order to answer some questions and guide our further data analysis.

All the source code for the different parts of this series of tutorials and applications can be checked at GitHub. Feel free to get involved and share your progress with us!

### Getting data

We will continue using the same datasets we already loaded in the part introducing data frames. So you can either continue where you left in that tutorial, or re-run the section that gets and prepares the data.

## Questions we want to answer

In any data analysis process, there is one or more questions we want to answer. That is the most basic and important step in the whole process, to define these questions. Since we are going to perform some Exploratory Data Analysis in our TB dataset, these are the questions we want to answer:

• Which are the countries with the highest and infectious TB incidence?
• What is the general world tendency in the period from 1990 to 2007?
• What countries don't follow that tendency?
• What other facts about the disease do we know that we can check with our data?

## Descriptive Statistics

### Python

The basic data descriptive statistics method for a pandas.DataFrame is describe(). It is the equivalent to R data.frame function summary().

df_summary = existing_df.describe()
df_summary

country Afghanistan Albania Algeria American Samoa Andorra Angola Anguilla Antigua and Barbuda Argentina Armenia ... Uruguay Uzbekistan Vanuatu Venezuela Viet Nam Wallis et Futuna West Bank and Gaza Yemen Zambia Zimbabwe
count 18.000000 18.000000 18.000000 18.000000 18.000000 18.000000 18.000000 18.000000 18.000000 18.000000 ... 18.000000 18.000000 18.000000 18.000000 18.000000 18.000000 18.000000 18.000000 18.000000 18.000000
mean 353.333333 36.944444 47.388889 12.277778 25.277778 413.444444 35.611111 10.833333 61.222222 74.944444 ... 28.055556 128.888889 186.000000 40.888889 282.666667 126.222222 43.388889 194.333333 535.277778 512.833333
std 64.708396 6.915220 4.487091 9.886447 7.274497 97.751318 1.243283 2.812786 20.232634 16.129885 ... 3.717561 15.911109 62.027508 2.422660 57.322616 86.784083 8.332353 52.158131 91.975576 113.411925
min 238.000000 22.000000 42.000000 0.000000 17.000000 281.000000 34.000000 7.000000 35.000000 49.000000 ... 23.000000 102.000000 102.000000 38.000000 220.000000 13.000000 31.000000 130.000000 387.000000 392.000000
25% 305.000000 32.000000 44.000000 6.000000 19.250000 321.250000 35.000000 9.000000 41.250000 62.000000 ... 25.000000 116.500000 128.750000 39.000000 234.250000 63.250000 36.250000 146.750000 459.000000 420.750000
50% 373.500000 40.500000 45.500000 9.000000 22.500000 399.000000 35.000000 10.000000 60.500000 77.000000 ... 27.500000 131.500000 185.000000 41.000000 257.000000 106.000000 43.000000 184.500000 521.500000 466.000000
75% 404.500000 42.000000 50.750000 16.250000 31.500000 512.000000 36.000000 12.750000 77.000000 85.750000 ... 30.750000 143.000000 240.000000 42.000000 349.000000 165.750000 51.500000 248.500000 620.000000 616.750000
max 436.000000 44.000000 56.000000 42.000000 39.000000 530.000000 38.000000 16.000000 96.000000 99.000000 ... 35.000000 152.000000 278.000000 46.000000 365.000000 352.000000 55.000000 265.000000 680.000000 714.000000
###### 8 rows × 207 columns

There is a lot of information there. We can access individual summaries as follows.

df_summary[['Spain','United Kingdom']]

country Spain United Kingdom
count 18.000000 18.000000
mean 30.666667 9.611111
std 6.677442 0.916444
min 23.000000 9.000000
25% 25.250000 9.000000
50% 29.000000 9.000000
75% 34.750000 10.000000
max 44.000000 12.000000

There is a plethora of descriptive statistics methods in Pandas (check the documentation). Some of them are already included in our summary object, but there are many more. In following tutorials we will make good use of them in order to better understand our data.

For example, we can obtain the percentage change over the years for the number of tuberculosis cases in Spain.

tb_pct_change_spain = existing_df.Spain.pct_change()
tb_pct_change_spain

    year
1990         NaN
1991   -0.045455
1992   -0.047619
1993   -0.075000
1994   -0.054054
1995   -0.028571
1996   -0.029412
1997   -0.090909
1998    0.000000
1999   -0.066667
2000   -0.035714
2001   -0.037037
2002    0.000000
2003   -0.038462
2004   -0.040000
2005    0.000000
2006    0.000000
2007   -0.041667
Name: Spain, dtype: float64


And from there get the maximum value.

tb_pct_change_spain.max()

    0.0


And do the same for the United Kingdom.

existing_df['United Kingdom'].pct_change().max()

    0.11111111111111116


If we want to know the index value (year) we use argmax (callex idmax in later versions of Pandas) as follows.

existing_df['Spain'].pct_change().argmax()

    '1998'

existing_df['United Kingdom'].pct_change().argmax()

    '1992'


That is, 1998 and 1992 were the worst years in Spain and the UK respectibely regarding the increase of infectious TB cases.

### R

The basic descriptive statistics method in R is, as we said, the function summary().

existing_summary <- summary(existing_df)
str(existing_summary)

##  'table' chr [1:6, 1:207] "Min.   :238.0  " "1st Qu.:305.0  " ...
##  - attr(*, "dimnames")=List of 2
##   ..$: chr [1:6] "" "" "" "" ... ## ..$ : chr [1:207] " Afghanistan" "   Albania" "   Algeria" "American Samoa" ...


It returns a table object where we have summary statistics for each of the columns in a data frame. A table object is good for visualising data, but not so good for accessing and indexing it as a data frame. Basically we use positional indexing to access it as a matrix. This way, if we want the first column, that corresponding to Afghanistan, we do:

existing_summary[,1]

##
## "Min.   :238.0  " "1st Qu.:305.0  " "Median :373.5  " "Mean   :353.3  "
##
## "3rd Qu.:404.5  " "Max.   :436.0  "


A trick we can use to access by column name is use the column names in the original data frame together with which(). We also can build a new data frame with the results.

data.frame(
Spain=existing_summary[,which(colnames(existing_df)=='Spain')],
UK=existing_summary[,which(colnames(existing_df)=='United Kingdom')])

##             Spain               UK
## 1 Min.   :23.00   Min.   : 9.000
## 2 1st Qu.:25.25   1st Qu.: 9.000
## 3 Median :29.00   Median : 9.000
## 4 Mean   :30.67   Mean   : 9.611
## 5 3rd Qu.:34.75   3rd Qu.:10.000
## 6 Max.   :44.00   Max.   :12.000


Being R a functional language, we can apply functions such as sum, mean, sd, etc. to vectors. Remember that a data frame is a list of vectors (i.e. each column is a vector of values), so we can easily use these functions with columns. We can finally combine these functions with lapply or sapply and apply them to multiple columns in a data frame.

However, there is a family of functions in R that can be applied to columns or rows in order to get means and sums directly. These are more efficient than using apply functions, and also allows us to apply them not just by columns but also by row. If you type ?colSums' for example, the help page describes all of them.

Let's say we wan to obtain the average number of existing cases per year. We need a single function call.

rowMeans(existing_df)

##    X1990    X1991    X1992    X1993    X1994    X1995    X1996    X1997
## 196.9662 196.4686 192.8116 191.1739 188.7246 187.9420 178.8986 180.9758
##    X1998    X1999    X2000    X2001    X2002    X2003    X2004    X2005
## 178.1208 180.4734 177.5217 177.7971 179.5169 176.4058 173.9227 171.1836
##    X2006    X2007
## 169.0193 167.2560


## Plotting

In this section we will take a look at the basic plotting functionality in Python/Pandas and R. However, there are more powerful alternatives like ggplot2 that, although originally created for R, has its own implementation for Python from the Yhat guys.

### Python

Pandas DataFrames implement up to three plotting methods out of the box (check the documentation). The first one is a basic line plot for each of the series we include in the indexing. The first line might be needed when plotting while using IPython notebook.

%matplotlib inline

existing_df[['United Kingdom', 'Spain', 'Colombia']].plot()


Or we can use box plots to obtain a summarised view of a given series as follows.

 existing_df[['United Kingdom', 'Spain', 'Colombia']].boxplot()


There is also a histogram() method, but we can't use it with this type of data right now.

### R

Base plotting in R is not very sophisticated when compared with ggplot2, but still is powerful and handy because many data types have implemented custom plot() methods that allow us to plot them with a single method call. However this is not always the case, and more often than not we will need to pass the right set of elements to our basic plotting functions.

Let's start with a basic line chart like we did with Python/Pandas.

uk_series <- existing_df[,c("United Kingdom")]
spain_series <- existing_df[,c("Spain")]
colombia_series <- existing_df[,c("Colombia")]

xrange <- 1990:2007
plot(xrange, uk_series,
type='l', xlab="Year",
ylab="Existing cases per 100K",
col = "blue",
ylim=c(0,100))
lines(xrange, spain_series,
col = "darkgreen")
lines(xrange, colombia_series,
col = "red")
legend(x=2003, y=100,
lty=1,
col=c("blue","darkgreen","red"),
legend=c("UK","Spain","Colombia"))


You can compare how easy it was to plot three series in Pandas, and how doing the
same thing with basic plotting in R gets more verbose. At least we need
three function calls, those for plot and line, and then we have the legend, etc. The base plotting in R is really intended to make quick and dirty charts.

Let's use now box plots.

boxplot(uk_series, spain_series, colombia_series,
names=c("UK","Spain","Colombia"),
xlab="Year",
ylab="Existing cases per 100K")


This one was way shorter, and we don't even need colours or a legend.

Let's now start with the real fun. Once we know our tools (from the previous tutorial about data frames and this one), let's use them to answer some questions about the incidence and prevalence of infectious tuberculosis in the world.

Question: We want to know, per year, what country has the highest number of existing and new TB cases.

### Python

If we want just the top ones we can make use of apply and argmax. Remember that, by default, apply works with columns (the countries in our case), and we want to apply it to each year. Therefore we need to transpose the data frame before using it, or we can pass the argument axis=1.

 existing_df.apply(pd.Series.argmax, axis=1)

    year
1990            Djibouti
1991            Djibouti
1992            Djibouti
1993            Djibouti
1994            Djibouti
1995            Djibouti
1996            Kiribati
1997            Kiribati
1998            Cambodia
1999    Korea, Dem. Rep.
2000            Djibouti
2001           Swaziland
2002            Djibouti
2003            Djibouti
2004            Djibouti
2005            Djibouti
2006            Djibouti
2007            Djibouti
dtype: object


But this is too simplistic. Instead, we want to get those countries that are in the fourth quartile. But first we need to find out the world general tendency.

In order to explore the world's general trend, we need to sum up every countries' values for the three datasets, per year.

deaths_total_per_year_df = deaths_df.sum(axis=1)
existing_total_per_year_df = existing_df.sum(axis=1)
new_total_per_year_df = new_df.sum(axis=1)


Now we will create a new DataFrame with each sum in a series that we will plot using the data frame plot() method.

world_trends_df = pd.DataFrame({
'Total deaths per 100K' : deaths_total_per_year_df,
'Total existing cases per 100K' : existing_total_per_year_df,
'Total new cases per 100K' : new_total_per_year_df},
index=deaths_total_per_year_df.index)

world_trends_df.plot(figsize=(12,6)).legend(
loc='center left',
bbox_to_anchor=(1, 0.5))


It seems that the general tendency is for a decrease in the total number of existing cases per 100K. However the number of new cases has been increasing, although it seems reverting from 2005. So how is possible that the total number of existing cases is decreasing if the total number of new cases has been growing? One of the reasons could be the observed increase in the number of deaths per 100K, but the main reason we have to consider is that people recover from tuberculosis thanks to treatment. The sum of the recovery rate plus the death rate is greater than the new cases rate. In any case, it seems that there are more new cases, but also that we cure them better. We need to improve prevention and epidemics control.

###### Countries out of tendency

So the previous was the general trend of the world as a whole. So what countries with a different trend (for worse)? In order to find this out, first we need to know the distribution of deaths by countries in an average year.

deaths_by_country_mean = deaths_df.mean()
deaths_by_country_mean_summary = deaths_by_country_mean.describe()
existing_by_country_mean = existing_df.mean()
existing_by_country_mean_summary = existing_by_country_mean.describe()
new_by_country_mean = new_df.mean()
new_by_country_mean_summary = new_by_country_mean.describe()


We can plot these distributions to have an idea of how the countries are distributed in an average year.

deaths_by_country_mean.order().plot(kind='bar', figsize=(24,6))


We want those countries beyond 1.5 times the inter quartile range (50%). We have these values in:

deaths_outlier = deaths_by_country_mean_summary['50%']*1.5
existing_outlier = existing_by_country_mean_summary['50%']*1.5
new_outlier = new_by_country_mean_summary['50%']*1.5


Now we can use these values to get those countries that, across the period 1990-2007, have exceeded beyond those levels.

# Now compare with the outlier threshold
outlier_countries_by_deaths_index =
deaths_by_country_mean > deaths_outlier
outlier_countries_by_existing_index =
existing_by_country_mean > existing_outlier
outlier_countries_by_new_index =
new_by_country_mean > new_outlier


What proportion of countries do we have out of trend? For deaths:

num_countries = len(deaths_df.T)
sum(outlier_countries_by_deaths_index)/num_countries

    0.39613526570048307


For existing cases (prevalence):

 sum(outlier_countries_by_existing_index)/num_countries

    0.39613526570048307


For new cases (incidence):

 sum(outlier_countries_by_new_index)/num_countries

    0.38647342995169082


Now we can use these indices to filter our original data frames.

outlier_deaths_df = deaths_df.T[ outlier_countries_by_deaths_index ].T
outlier_existing_df = existing_df.T[ outlier_countries_by_existing_index ].T
outlier_new_df = new_df.T[ outlier_countries_by_new_index ].T


This is serious stuff. We have more than one third of the world being outliers on the distribution of existings cases, new cases, and deaths by infectious tuberculosis. But what if we consider an outlier to be 5 times the IQR? Let's repeat the previous process.

deaths_super_outlier = deaths_by_country_mean_summary['50%']*5
existing_super_outlier = existing_by_country_mean_summary['50%']*5
new_super_outlier = new_by_country_mean_summary['50%']*5

super_outlier_countries_by_deaths_index =
deaths_by_country_mean > deaths_super_outlier
super_outlier_countries_by_existing_index =
existing_by_country_mean > existing_super_outlier
super_outlier_countries_by_new_index =
new_by_country_mean > new_super_outlier


What proportion do we have now?

sum(super_outlier_countries_by_deaths_index)/num_countries

    0.21739130434782608


Let's get the data frames.

super_outlier_deaths_df =
deaths_df.T[ super_outlier_countries_by_deaths_index ].T
super_outlier_existing_df =
existing_df.T[ super_outlier_countries_by_existing_index ].T
super_outlier_new_df =
new_df.T[ super_outlier_countries_by_new_index ].T


Let's concentrate on epidemics control and have a look at the new cases' data frame.

super_outlier_new_df

country Bhutan Botswana Cambodia Congo, Rep. Cote d'Ivoire Korea, Dem. Rep. Djibouti Kiribati Lesotho Malawi ... Philippines Rwanda Sierra Leone South Africa Swaziland Timor-Leste Togo Uganda Zambia Zimbabwe
year
1990 540 307 585 169 177 344 582 513 184 258 ... 393 167 207 301 267 322 308 163 297 329
1991 516 341 579 188 196 344 594 503 201 286 ... 386 185 220 301 266 322 314 250 349 364
1992 492 364 574 200 209 344 606 493 218 314 ... 380 197 233 302 260 322 320 272 411 389
1993 470 390 568 215 224 344 618 483 244 343 ... 373 212 248 305 267 322 326 296 460 417
1994 449 415 563 229 239 344 630 474 280 373 ... 366 225 263 309 293 322 333 306 501 444
1995 428 444 557 245 255 344 642 464 323 390 ... 360 241 279 317 337 322 339 319 536 474
1996 409 468 552 258 269 344 655 455 362 389 ... 353 254 297 332 398 322 346 314 554 501
1997 391 503 546 277 289 344 668 446 409 401 ... 347 273 315 360 474 322 353 320 576 538
1998 373 542 541 299 312 344 681 437 461 412 ... 341 294 334 406 558 322 360 326 583 580
1999 356 588 536 324 338 344 695 428 519 417 ... 335 319 355 479 691 322 367 324 603 628
2000 340 640 530 353 368 344 708 420 553 425 ... 329 348 377 576 801 322 374 340 602 685
2001 325 692 525 382 398 344 722 412 576 414 ... 323 376 400 683 916 322 382 360 627 740
2002 310 740 520 408 425 344 737 403 613 416 ... 317 402 425 780 994 322 389 386 632 791
2003 296 772 515 425 444 344 751 396 635 410 ... 312 419 451 852 1075 322 397 396 652 825
2004 283 780 510 430 448 344 766 388 643 405 ... 306 423 479 898 1127 322 405 385 623 834
2005 270 770 505 425 443 344 781 380 639 391 ... 301 418 509 925 1141 322 413 370 588 824
2006 258 751 500 414 432 344 797 372 638 368 ... 295 408 540 940 1169 322 421 350 547 803
2007 246 731 495 403 420 344 813 365 637 346 ... 290 397 574 948 1198 322 429 330 506 782
###### 18 rows × 22 columns

Let's make some plots to get a better impression.

super_outlier_new_df.plot(figsize=(12,4)).legend(loc='center left', bbox_to_anchor=(1, 0.5))


We have 22 countries where the number of new cases during an average year is greater than 5 times the median value of the distribution. Let's create a country that represents the average of these 22 countries.

average_super_outlier_country = super_outlier_new_df.mean(axis=1)
average_super_outlier_country

    year
1990    314.363636
1991    330.136364
1992    340.681818
1993    352.909091
1994    365.363636
1995    379.227273
1996    390.863636
1997    408.000000
1998    427.000000
1999    451.409091
2000    476.545455
2001    502.409091
2002    525.727273
2003    543.318182
2004    548.909091
2005    546.409091
2006    540.863636
2007    535.181818
dtype: float64


Now let's create a country that represents the rest of the world.

avearge_better_world_country =
new_df.T[ - super_outlier_countries_by_new_index ].T.mean(axis=1)
avearge_better_world_country

    year
1990    80.751351
1991    81.216216
1992    80.681081
1993    81.470270
1994    81.832432
1995    82.681081
1996    82.589189
1997    84.497297
1998    85.189189
1999    86.232432
2000    86.378378
2001    86.551351
2002    89.848649
2003    87.778378
2004    87.978378
2005    87.086022
2006    86.559140
2007    85.605405
dtype: float64


Now let's plot this country with the average world country.

two_world_df =
pd.DataFrame({
'Average Better World Country': avearge_better_world_country,
'Average Outlier Country' : average_super_outlier_country},
index = new_df.index)
two_world_df.plot(title="Estimated new TB cases per 100K",figsize=(12,8))


The increase in new cases' tendency is really stronger in the average super outlier country, so much stronger that is difficult to perceive that same tendency in the better world country. The 90's decade brought a terrible increase in the number of TB cases in those countries. But let's have a look at the exact numbers.

two_world_df.pct_change().plot(title="Percentage change in estimated new TB cases", figsize=(12,8))


Based on this plot, the decceleration and reversion of that tendency seem to happen at the same time in both average better and outlier countries, and something happened around 2002. We will try to find out what's going on in the next section.

### R

We already know that we can use max with a data frame column in R and get the maximum value. Additionally, we can use which.max in order to get its position (similarly to the use og argmax in Pandas). If we use the transposed data frame, we can use lapply or sapply to perform this operation in every year column, getting then either a list or a vector of indices (we will use sapply that returns a vector). We just need a little tweak and use a countries vector that we will index to get the country name instead of the index as a result.

country_names <- rownames(existing_df_t)
sapply(existing_df_t, function(x) {country_names[which.max(x)]})

##              X1990              X1991              X1992
##         "Djibouti"         "Djibouti"         "Djibouti"
##              X1993              X1994              X1995
##         "Djibouti"         "Djibouti"         "Djibouti"
##              X1996              X1997              X1998
##         "Kiribati"         "Kiribati"         "Cambodia"
##              X1999              X2000              X2001
## "Korea, Dem. Rep."         "Djibouti"        "Swaziland"
##              X2002              X2003              X2004
##         "Djibouti"         "Djibouti"         "Djibouti"
##              X2005              X2006              X2007
##         "Djibouti"         "Djibouti"         "Djibouti"


Again, in order to explore the world general tendency, we need to sum up every countries’ values for the three datasets, per year.

But first we need to load the other two datasets for number of deaths and number of new cases.

# Download files

text = deaths_file,
row.names=1,
stringsAsFactor=F)
text = new_cases_file,
row.names=1,
stringsAsFactor=F)

# Cast data to int (deaths doesn't need it)
new_df[1:18] <- lapply(
new_df[1:18],
function(x) { as.integer(gsub(',', '', x) )})

# Transpose
deaths_df_t <- deaths_df
deaths_df <- as.data.frame(t(deaths_df))
new_df_t <- new_df
new_df <- as.data.frame(t(new_df))


And now the sums by row. We need to convert this to a data frame since the function returns a numeric vector.

deaths_total_per_year_df <- data.frame(total=rowSums(deaths_df))
existing_total_per_year_df <- data.frame(total=rowSums(existing_df))
# We pass na.rm = TRUE in order to ignore missing values in the new
# cases data frame when summing (no missing values in other dataframes though)
new_total_per_year_df <- data.frame(total=rowSums(new_df, na.rm = TRUE))


Now we can plot each line using what we have learnt so far. In order to get a vector with the counts to pass to each plotting function, we use R data frame indexing by column name.

xrange <- 1990:2007
plot(xrange, deaths_total_per_year_df$total, type='l', xlab="Year", ylab="Count per 100K", col = "blue", ylim=c(0,50000)) lines(xrange, existing_total_per_year_df$total,
col = "darkgreen")
lines(xrange, new_total_per_year_df$total, col = "red") legend(x=1992, y=52000, lty=1, cex = .7, ncol = 3, col=c("blue","darkgreen","red"), legend=c("Deaths","Existing cases","New cases"))  The conclusions are obviously the same as when using Python. ###### Countries out of tendency So what countries are outliers of the trend (for the worse)? Again, in order to find this out, first we need to know the distribution of countries in an average year. We use colMeans for that purpose. deaths_by_country_mean <- data.frame(mean=colMeans(deaths_df)) existing_by_country_mean <- data.frame(mean=colMeans(existing_df)) new_by_country_mean <- data.frame(mean=colMeans(new_df, na.rm=TRUE))  We can plot these distributions to have an idea of how the countries are distributed in an average year. We are not so interested about the individual countries but about the distribution itself. barplot(sort(deaths_by_country_mean$mean))


Again we can see there are three trends in the plot, with a slowly decreasing part at the beginning, a second more step section, and a final peak that is clearly apart from the rest.

Let's skip this time the 1.5-outlier part and go diretcly to the 5.0-outliers. In R we will use a different approach. We will use the quantile() function in order to get the inter-quartile range and determine the outlier threshold.

Since we already know the results from our Python section, let's do it just for the new cases, so we generate also the plots we did before.

new_super_outlier <-
quantile(new_by_country_mean$mean, probs = c(.5)) * 5.0 super_outlier_countries_by_new_index <- new_by_country_mean > new_super_outlier  And the proportion is. sum(super_outlier_countries_by_new_index)/208  ## [1] 0.1057692  Let's obtain a data frame from this, with just those countries we consider to be outliers. super_outlier_new_df <- new_df[, super_outlier_countries_by_new_index ]  Now we are ready to plot them. xrange <- 1990:2007 plot(xrange, super_outlier_new_df[,1], type='l', xlab="Year", ylab="New cases per 100K", col = 1, ylim=c(0,1800)) for (i in seq(2:ncol(super_outlier_new_df))) { lines(xrange, super_outlier_new_df[,i], col = i) } legend(x=1990, y=1800, lty=1, cex = 0.5, ncol = 7, col=1:22, legend=colnames(super_outlier_new_df))  Definitely we can see here an advantage of using Pandas basic plotting versus R basic plotting! So far our results match. We have 22 countries where the number of new cases on an average year is greater than 5 times the median value of the distribution. Let’s create a country that represents on average these 22. We will use rowMeans() here. average_countries_df <- data.frame( averageOutlierMean=rowMeans(super_outlier_new_df, na.rm=T) ) average_countries_df  ## averageOutlierMean ## X1990 314.3636 ## X1991 330.1364 ## X1992 340.6818 ## X1993 352.9091 ## X1994 365.3636 ## X1995 379.2273 ## X1996 390.8636 ## X1997 408.0000 ## X1998 427.0000 ## X1999 451.4091 ## X2000 476.5455 ## X2001 502.4091 ## X2002 525.7273 ## X2003 543.3182 ## X2004 548.9091 ## X2005 546.4091 ## X2006 540.8636 ## X2007 535.1818  Now let’s create a country that represents the rest of the world. average_countries_df$averageBetterWorldMean <-
rowMeans(new_df[ ,- super_outlier_countries_by_new_index ], na.rm=T)
average_countries_df

##       averageOutlierMean averageBetterWorldMean
## X1990           314.3636               105.2767
## X1991           330.1364               107.3786
## X1992           340.6818               108.0243
## X1993           352.9091               110.0388
## X1994           365.3636               111.6942
## X1995           379.2273               113.9369
## X1996           390.8636               115.0971
## X1997           408.0000               118.6408
## X1998           427.0000               121.2913
## X1999           451.4091               124.8350
## X2000           476.5455               127.6505
## X2001           502.4091               130.5680
## X2002           525.7273               136.0194
## X2003           543.3182               136.0388
## X2004           548.9091               136.8155
## X2005           546.4091               135.5121
## X2006           540.8636               134.4493
## X2007           535.1818               133.2184


Now let’s plot the outlier country with the average world country.

xrange <- 1990:2007
plot(xrange, average_countries_df$averageOutlierMean, type='l', xlab="Year", ylab="New cases per 100K", col = "darkgreen", ylim=c(0,600)) lines(xrange, average_countries_df$averageBetterWorldMean, col = "blue")
legend(x=1990, y=600,
lty=1, cex = 0.7,
ncol = 2,
col=c("darkgreen","blue"),
legend=c("Average outlier country", "Average World Country"))


### Googling about events and dates in Tuberculosis

We will use just Python in this section. About googling, actually we just went straight to Wikipedia's entry about the disease. In the epidemics sections we found the following:

• The total number of tuberculosis cases has been decreasing since 2005, while new cases have decreased since 2002.

• This is confirmed by our previous analysis.

• China has achieved particularly dramatic progress, with about an 80% reduction in its TB mortality rate between 1990 and 2010. Let's check it:

existing_df.China.plot(title="Estimated existing TB cases in China")


• In 2007, the country with the highest estimated incidence rate of TB was Swaziland, with 1,200 cases per 100,000 people.
new_df.apply(pd.Series.argmax, axis=1)['2007']

    'Swaziland'
`

There are many more findings Wikipedia that we can confirm with these or other datasets from Gapminder world. For example, TB and HIV are frequently associated, together with poverty levels. It would be interesting to join datasets and explore tendencies in each of them. We challenge the reader to give them a try and share with us their findings.

### Other web pages to explore

Some interesting resources about tuberculosis apart from the Gapminder website:

• Gates foundation:
• http://www.gatesfoundation.org/What-We-Do/Global-Health/Tuberculosis
• http://www.gatesfoundation.org/Media-Center/Press-Releases/2007/09/New-Grants-to-Fight-Tuberculosis-Epidemic

## Conclusions

Exploratory data analysis is a key step in data analysis. It is during this stage when we start shaping any later work. It precedes any data visualisation or machine learning work, by showing us good or bad our data and our hypothesis are.

Traditionally, R has been the weapon of choice for most EDA work, although the use of a more expressive plotting library such as gglot2 is quite convenient. In fact, the base plotting functionality incorporated in Pandas makes the process cleaner and quicker when using Python. However, the questions we have answered here were very simple and didn't include multiple variables and encodings. In such cases an advanced library like ggplot2 will shine. Apart from providing nicer charts, it will saves us quite a lot of time due to its expressiveness and reusability.

But as simple as our analysis and charts are, we have been able to make the point about how serious the humanitarian crisis is regarding a disease like tuberculosis, specially when considering that the disease is relatively well controlled in more developed countries. We have seen how some coding skills and a good amount of curiosity allows us to create awareness in these and other world issues.

Remember that all the source code for the different parts of this series of tutorials and applications can be checked at GitHub. Feel free to get involved and share your progress with us!

get started
Enjoy this post?

Leave a like and comment for Jose

1
7
7Replies
Steve
2 years ago

The python or R, both became almost equal when choice is given. The data science applications are increasing day by day. So both are equally preferred.

Evan
2 years ago

Jose A Dianes
2 years ago

Hi Evanzhu2013. Many thanks for you’re comment. I’m glad to hear that our tutorials are being useful for you! Jose.

2 years ago

Sorry but I don’t see data science in this article (which is great btw). Is it more appropriate to rename the article : Data wrangling … instead ?
(just saying)

Jose A Dianes
2 years ago

Many thanks for your comment! You’re probably right about the outcomes of this particular article. However the article is part of a longer series about different parts of the data science process. The guys at codementor thought it was good to keep a common title as part of all of them. About the title you propose I disagree. Data wrangling is an important part of the data science process, but is related with transforming raw data into a format that can be used for data analysis. Here we do very mild transformations and get quickly into exploring data using charts.

2 years ago

Awesome sauce, it makes more sense when you mention its gonna be part of a long series. Great stuff !

Jose A Dianes
2 years ago

Glad you like it! I hope we will keep you interested :)