Jose A Dianes

Machine Learning & Data Analytics - Computer Science PhD - data.jadianes.com

Data Science with Python & R: Data Frames II

Published Jul 15, 2015. Last updated Feb 09, 2017.

We continue here our tutorial on data frames with Python and R. The first part introduced the concept of a data frame and explained how to create and index data frames in both languages. This part concentrates on data selection and function mapping.

All the source code for the different parts of this series of tutorials and applications can be checked at GitHub. Feel free to get involved and share your progress with us!

Data Selection

In this section we will show how to select data from data frames based on their values, by using logical expressions.

Python

With Pandas, we can use logical expressions to select just the data that satisfy certain conditions. So first, let's see what happens when we use comparison operators with data frames or series objects.

existing_df>10

country	Afghanistan	Albania	Algeria	American Samoa	Andorra	Angola	Anguilla	Antigua and Barbuda	Argentina	Armenia	...	Uruguay	Uzbekistan	Vanuatu	Venezuela	Viet Nam	Wallis et Futuna	West Bank and Gaza	Yemen	Zambia	Zimbabwe
year
1990	True	True	True	True	True	True	True	True	True	True	...	True	True	True	True	True	True	True	True	True	True
1991	True	True	True	True	True	True	True	True	True	True	...	True	True	True	True	True	True	True	True	True	True
1992	True	True	True	False	True	True	True	True	True	True	...	True	True	True	True	True	True	True	True	True	True
1993	True	True	True	True	True	True	True	True	True	True	...	True	True	True	True	True	True	True	True	True	True
1994	True	True	True	True	True	True	True	True	True	True	...	True	True	True	True	True	True	True	True	True	True
1995	True	True	True	True	True	True	True	True	True	True	...	True	True	True	True	True	True	True	True	True	True
1996	True	True	True	False	True	True	True	True	True	True	...	True	True	True	True	True	True	True	True	True	True
1997	True	True	True	True	True	True	True	True	True	True	...	True	True	True	True	True	True	True	True	True	True
1998	True	True	True	True	True	True	True	True	True	True	...	True	True	True	True	True	True	True	True	True	True
1999	True	True	True	False	True	True	True	False	True	True	...	True	True	True	True	True	True	True	True	True	True
2000	True	True	True	False	True	True	True	False	True	True	...	True	True	True	True	True	True	True	True	True	True
2001	True	True	True	False	True	True	True	False	True	True	...	True	True	True	True	True	True	True	True	True	True
2002	True	True	True	False	True	True	True	False	True	True	...	True	True	True	True	True	True	True	True	True	True
2003	True	True	True	False	True	True	True	False	True	True	...	True	True	True	True	True	True	True	True	True	True
2004	True	True	True	False	True	True	True	False	True	True	...	True	True	True	True	True	True	True	True	True	True
2005	True	True	True	True	True	True	True	False	True	True	...	True	True	True	True	True	True	True	True	True	True
2006	True	True	True	False	True	True	True	False	True	True	...	True	True	True	True	True	True	True	True	True	True
2007	True	True	True	False	True	True	True	False	True	True	...	True	True	True	True	True	True	True	True	True	True

18 rows × 207 columns

And this is the result when the expression is applied to an individual series.

existing_df['United Kingdom'] > 10

    year
    1990    False
    1991    False
    1992    False
    1993    False
    1994    False
    1995    False
    1996    False
    1997    False
    1998    False
    1999    False
    2000    False
    2001    False
    2002    False
    2003    False
    2004    False
    2005     True
    2006     True
    2007     True
    Name: United Kingdom, dtype: bool

The result of these expressions can be used as an indexing vector (with [] or .loc) as follows.

existing_df.Spain[existing_df['United Kingdom'] > 10]

    year
    2005    24
    2006    24
    2007    23
    Name: Spain, dtype: int64
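
To make the mechanics easier to follow, here is a minimal, self-contained sketch of the same selection. The toy_df frame and its numbers are made up for illustration (they are not the TB data set); only the comparison and the indexing calls mirror what we did above.

```python
import pandas as pd

# Toy stand-in for existing_df: made-up values, years as a string index.
toy_df = pd.DataFrame(
    {"United Kingdom": [9, 10, 11, 12], "Spain": [30, 27, 24, 23]},
    index=pd.Index(["2004", "2005", "2006", "2007"], name="year"),
)

# Comparing a column with a scalar yields a boolean Series on the same index.
uk_above_10 = toy_df["United Kingdom"] > 10

# The boolean Series selects rows, either through [] or through .loc.
spain_selected = toy_df["Spain"][uk_above_10]
spain_selected_loc = toy_df.loc[uk_above_10, "Spain"]

print(uk_above_10.tolist())     # [False, False, True, True]
print(spain_selected.tolist())  # [24, 23]
```

Both forms return the same Series; .loc has the advantage of selecting rows and columns in a single step.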

An interesting case arises when we index the whole data frame with a boolean data frame, so that at the same row some columns have False and others have True. For example:

existing_df[ existing_df > 10 ]

country	Afghanistan	Albania	Algeria	American Samoa	Andorra	Angola	Anguilla	Antigua and Barbuda	Argentina	Armenia	...	Uruguay	Uzbekistan	Vanuatu	Venezuela	Viet Nam	Wallis et Futuna	West Bank and Gaza	Yemen	Zambia	Zimbabwe
year
1990	436	42	45	42	39	514	38	16	96	52	...	35	114	278	46	365	126	55	265	436	409
1991	429	40	44	14	37	514	38	15	91	49	...	34	105	268	45	361	352	54	261	456	417
1992	422	41	44	NaN	35	513	37	15	86	51	...	33	102	259	44	358	64	54	263	494	415
1993	415	42	43	18	33	512	37	14	82	55	...	32	118	250	43	354	174	52	253	526	419
1994	407	42	43	17	32	510	36	13	78	60	...	31	116	242	42	350	172	52	250	556	426
1995	397	43	42	22	30	508	35	12	74	68	...	30	119	234	42	346	93	50	244	585	439
1996	397	42	43	NaN	28	512	35	12	71	74	...	28	111	226	41	312	123	49	233	602	453
1997	387	44	44	25	23	363	36	11	67	75	...	27	122	218	41	273	213	46	207	626	481
1998	374	43	45	12	24	414	36	11	63	74	...	28	129	211	40	261	107	44	194	634	392
1999	373	42	46	NaN	22	384	36	NaN	58	86	...	28	134	159	39	253	105	42	175	657	430
2000	346	40	48	NaN	20	530	35	NaN	52	94	...	27	139	143	39	248	103	40	164	658	479
2001	326	34	49	NaN	20	335	35	NaN	51	99	...	25	148	128	41	243	13	39	154	680	523
2002	304	32	50	NaN	21	307	35	NaN	42	97	...	27	144	149	41	235	275	37	149	517	571
2003	308	32	51	NaN	18	281	35	NaN	41	91	...	25	152	128	39	234	147	36	146	478	632
2004	283	29	52	NaN	19	318	35	NaN	39	85	...	23	149	118	38	226	63	35	138	468	652
2005	267	29	53	11	18	331	34	NaN	39	79	...	24	144	131	38	227	57	33	137	453	680
2006	251	26	55	NaN	17	302	34	NaN	37	79	...	25	134	104	38	222	60	32	135	422	699
2007	238	22	56	NaN	19	294	34	NaN	35	81	...	23	140	102	39	220	25	31	130	387	714

18 rows × 207 columns

Those cells where existing_df does not have more than 10 cases per 100K give False for indexing, and the resulting data frame has a NaN value in those cells. A way of solving that (if we need to) is the where() method which, apart from providing a more expressive way of reading data selection, accepts a second argument that we can use to impute the NaN values. For example, if we want to have 0 as a value, we can do the following.

existing_df.where(existing_df > 10, 0)

country	Afghanistan	Albania	Algeria	American Samoa	Andorra	Angola	Anguilla	Antigua and Barbuda	Argentina	Armenia	...	Uruguay	Uzbekistan	Vanuatu	Venezuela	Viet Nam	Wallis et Futuna	West Bank and Gaza	Yemen	Zambia	Zimbabwe
year
1990	436	42	45	42	39	514	38	16	96	52	...	35	114	278	46	365	126	55	265	436	409
1991	429	40	44	14	37	514	38	15	91	49	...	34	105	268	45	361	352	54	261	456	417
1992	422	41	44	0	35	513	37	15	86	51	...	33	102	259	44	358	64	54	263	494	415
1993	415	42	43	18	33	512	37	14	82	55	...	32	118	250	43	354	174	52	253	526	419
1994	407	42	43	17	32	510	36	13	78	60	...	31	116	242	42	350	172	52	250	556	426
1995	397	43	42	22	30	508	35	12	74	68	...	30	119	234	42	346	93	50	244	585	439
1996	397	42	43	0	28	512	35	12	71	74	...	28	111	226	41	312	123	49	233	602	453
1997	387	44	44	25	23	363	36	11	67	75	...	27	122	218	41	273	213	46	207	626	481
1998	374	43	45	12	24	414	36	11	63	74	...	28	129	211	40	261	107	44	194	634	392
1999	373	42	46	0	22	384	36	0	58	86	...	28	134	159	39	253	105	42	175	657	430
2000	346	40	48	0	20	530	35	0	52	94	...	27	139	143	39	248	103	40	164	658	479
2001	326	34	49	0	20	335	35	0	51	99	...	25	148	128	41	243	13	39	154	680	523
2002	304	32	50	0	21	307	35	0	42	97	...	27	144	149	41	235	275	37	149	517	571
2003	308	32	51	0	18	281	35	0	41	91	...	25	152	128	39	234	147	36	146	478	632
2004	283	29	52	0	19	318	35	0	39	85	...	23	149	118	38	226	63	35	138	468	652
2005	267	29	53	11	18	331	34	0	39	79	...	24	144	131	38	227	57	33	137	453	680
2006	251	26	55	0	17	302	34	0	37	79	...	25	134	104	38	222	60	32	135	422	699
2007	238	22	56	0	19	294	34	0	35	81	...	23	140	102	39	220	25	31	130	387	714

18 rows × 207 columns
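
The difference between the two forms can be checked on a small frame. This is only a sketch with made-up numbers (toy_df and its columns A and B are hypothetical), showing that boolean indexing leaves NaN behind while where() with a second argument fills those cells.

```python
import pandas as pd

# Made-up values; not the TB data set.
toy_df = pd.DataFrame(
    {"A": [42, 4, 18], "B": [16, 9, 8]},
    index=pd.Index(["1990", "1991", "1992"], name="year"),
)

# Boolean indexing keeps the shape and puts NaN where the condition is False.
masked = toy_df[toy_df > 10]

# where() does the same, and its second argument replaces the NaN cells.
filled = toy_df.where(toy_df > 10, 0)

print(masked)
print(filled)
```

Note that the NaN cells force the masked columns to a floating point type, which is one practical reason to supply a replacement value when the data should stay integer.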

R

As we did with Pandas, let's check the result of using a data.frame in a logical
or boolean expression.

existing_df_gt10 <- existing_df>10
head(existing_df_gt10,2) # check just a couple of rows

##       Afghanistan Albania Algeria American Samoa Andorra Angola Anguilla
## X1990        TRUE    TRUE    TRUE           TRUE    TRUE   TRUE     TRUE
## X1991        TRUE    TRUE    TRUE           TRUE    TRUE   TRUE     TRUE
##       Antigua and Barbuda Argentina Armenia Australia Austria Azerbaijan
## X1990                TRUE      TRUE    TRUE     FALSE    TRUE       TRUE
## X1991                TRUE      TRUE    TRUE     FALSE    TRUE       TRUE
##       Bahamas Bahrain Bangladesh Barbados Belarus Belgium Belize Benin
## X1990    TRUE    TRUE       TRUE    FALSE    TRUE    TRUE   TRUE  TRUE
## X1991    TRUE    TRUE       TRUE    FALSE    TRUE    TRUE   TRUE  TRUE
##       Bermuda Bhutan Bolivia Bosnia and Herzegovina Botswana Brazil
## X1990   FALSE   TRUE    TRUE                   TRUE     TRUE   TRUE
## X1991   FALSE   TRUE    TRUE                   TRUE     TRUE   TRUE
##       British Virgin Islands Brunei Darussalam Bulgaria Burkina Faso
## X1990                   TRUE              TRUE     TRUE         TRUE
## X1991                   TRUE              TRUE     TRUE         TRUE
##       Burundi Cambodia Cameroon Canada Cape Verde Cayman Islands
## X1990    TRUE     TRUE     TRUE  FALSE       TRUE          FALSE
## X1991    TRUE     TRUE     TRUE  FALSE       TRUE          FALSE
##       Central African Republic Chad Chile China Colombia Comoros
## X1990                     TRUE TRUE  TRUE  TRUE     TRUE    TRUE
## X1991                     TRUE TRUE  TRUE  TRUE     TRUE    TRUE
##       Congo, Rep. Cook Islands Costa Rica Croatia Cuba Cyprus
## X1990        TRUE        FALSE       TRUE    TRUE TRUE   TRUE
## X1991        TRUE        FALSE       TRUE    TRUE TRUE   TRUE
##       Czech Republic Cote d'Ivoire Korea, Dem. Rep. Congo, Dem. Rep.
## X1990           TRUE          TRUE             TRUE             TRUE
## X1991           TRUE          TRUE             TRUE             TRUE
##       Denmark Djibouti Dominica Dominican Republic Ecuador Egypt
## X1990    TRUE     TRUE     TRUE               TRUE    TRUE  TRUE
## X1991    TRUE     TRUE     TRUE               TRUE    TRUE  TRUE
##       El Salvador Equatorial Guinea Eritrea Estonia Ethiopia Fiji Finland
## X1990        TRUE              TRUE    TRUE    TRUE     TRUE TRUE    TRUE
## X1991        TRUE              TRUE    TRUE    TRUE     TRUE TRUE    TRUE
##       France French Polynesia Gabon Gambia Georgia Germany Ghana Greece
## X1990   TRUE             TRUE  TRUE   TRUE    TRUE    TRUE  TRUE   TRUE
## X1991   TRUE             TRUE  TRUE   TRUE    TRUE    TRUE  TRUE   TRUE
##       Grenada Guam Guatemala Guinea Guinea-Bissau Guyana Haiti Honduras
## X1990   FALSE TRUE      TRUE   TRUE          TRUE   TRUE  TRUE     TRUE
## X1991   FALSE TRUE      TRUE   TRUE          TRUE   TRUE  TRUE     TRUE
##       Hungary Iceland India Indonesia Iran Iraq Ireland Israel Italy
## X1990    TRUE   FALSE  TRUE      TRUE TRUE TRUE    TRUE   TRUE  TRUE
## X1991    TRUE   FALSE  TRUE      TRUE TRUE TRUE    TRUE  FALSE FALSE
##       Jamaica Japan Jordan Kazakhstan Kenya Kiribati Kuwait Kyrgyzstan
## X1990   FALSE  TRUE   TRUE       TRUE  TRUE     TRUE   TRUE       TRUE
## X1991   FALSE  TRUE   TRUE       TRUE  TRUE     TRUE   TRUE       TRUE
##       Laos Latvia Lebanon Lesotho Liberia Libyan Arab Jamahiriya Lithuania
## X1990 TRUE   TRUE    TRUE    TRUE    TRUE                   TRUE      TRUE
## X1991 TRUE   TRUE    TRUE    TRUE    TRUE                   TRUE      TRUE
##       Luxembourg Madagascar Malawi Malaysia Maldives Mali Malta Mauritania
## X1990       TRUE       TRUE   TRUE     TRUE     TRUE TRUE FALSE       TRUE
## X1991       TRUE       TRUE   TRUE     TRUE     TRUE TRUE FALSE       TRUE
##       Mauritius Mexico Micronesia, Fed. Sts. Monaco Mongolia Montserrat
## X1990      TRUE   TRUE                  TRUE  FALSE     TRUE       TRUE
## X1991      TRUE   TRUE                  TRUE  FALSE     TRUE       TRUE
##       Morocco Mozambique Myanmar Namibia Nauru Nepal Netherlands
## X1990    TRUE       TRUE    TRUE    TRUE  TRUE  TRUE        TRUE
## X1991    TRUE       TRUE    TRUE    TRUE  TRUE  TRUE       FALSE
##       Netherlands Antilles New Caledonia New Zealand Nicaragua Niger
## X1990                 TRUE          TRUE       FALSE      TRUE  TRUE
## X1991                 TRUE          TRUE       FALSE      TRUE  TRUE
##       Nigeria Niue Northern Mariana Islands Norway Oman Pakistan Palau
## X1990    TRUE TRUE                     TRUE  FALSE TRUE     TRUE  TRUE
## X1991    TRUE TRUE                     TRUE  FALSE TRUE     TRUE  TRUE
##       Panama Papua New Guinea Paraguay Peru Philippines Poland Portugal
## X1990   TRUE             TRUE     TRUE TRUE        TRUE   TRUE     TRUE
## X1991   TRUE             TRUE     TRUE TRUE        TRUE   TRUE     TRUE
##       Puerto Rico Qatar Korea, Rep. Moldova Romania Russian Federation
## X1990        TRUE  TRUE        TRUE    TRUE    TRUE               TRUE
## X1991        TRUE  TRUE        TRUE    TRUE    TRUE               TRUE
##       Rwanda Saint Kitts and Nevis Saint Lucia
## X1990   TRUE                  TRUE        TRUE
## X1991   TRUE                  TRUE        TRUE
##       Saint Vincent and the Grenadines Samoa San Marino
## X1990                             TRUE  TRUE      FALSE
## X1991                             TRUE  TRUE      FALSE
##       Sao Tome and Principe Saudi Arabia Senegal Seychelles Sierra Leone
## X1990                  TRUE         TRUE    TRUE       TRUE         TRUE
## X1991                  TRUE         TRUE    TRUE       TRUE         TRUE
##       Singapore Slovakia Slovenia Solomon Islands Somalia South Africa
## X1990      TRUE     TRUE     TRUE            TRUE    TRUE         TRUE
## X1991      TRUE     TRUE     TRUE            TRUE    TRUE         TRUE
##       Spain Sri Lanka Sudan Suriname Swaziland Sweden Switzerland
## X1990  TRUE      TRUE  TRUE     TRUE      TRUE  FALSE        TRUE
## X1991  TRUE      TRUE  TRUE     TRUE      TRUE  FALSE        TRUE
##       Syrian Arab Republic Tajikistan Thailand Macedonia, FYR Timor-Leste
## X1990                 TRUE       TRUE     TRUE           TRUE        TRUE
## X1991                 TRUE       TRUE     TRUE           TRUE        TRUE
##       Togo Tokelau Tonga Trinidad and Tobago Tunisia Turkey Turkmenistan
## X1990 TRUE    TRUE  TRUE                TRUE    TRUE   TRUE         TRUE
## X1991 TRUE    TRUE  TRUE                TRUE    TRUE   TRUE         TRUE
##       Turks and Caicos Islands Tuvalu Uganda Ukraine United Arab Emirates
## X1990                     TRUE   TRUE   TRUE    TRUE                 TRUE
## X1991                     TRUE   TRUE   TRUE    TRUE                 TRUE
##       United Kingdom Tanzania Virgin Islands (U.S.)
## X1990          FALSE     TRUE                  TRUE
## X1991          FALSE     TRUE                  TRUE
##       United States of America Uruguay Uzbekistan Vanuatu Venezuela
## X1990                    FALSE    TRUE       TRUE    TRUE      TRUE
## X1991                    FALSE    TRUE       TRUE    TRUE      TRUE
##       Viet Nam Wallis et Futuna West Bank and Gaza Yemen Zambia Zimbabwe
## X1990     TRUE             TRUE               TRUE  TRUE   TRUE     TRUE
## X1991     TRUE             TRUE               TRUE  TRUE   TRUE     TRUE

In this case we get a matrix variable with boolean values. The same happens when the
expression is applied to an individual column.

existing_df['United Kingdom'] > 10

##       United Kingdom
## X1990          FALSE
## X1991          FALSE
## X1992          FALSE
## X1993          FALSE
## X1994          FALSE
## X1995          FALSE
## X1996          FALSE
## X1997          FALSE
## X1998          FALSE
## X1999          FALSE
## X2000          FALSE
## X2001          FALSE
## X2002          FALSE
## X2003          FALSE
## X2004          FALSE
## X2005           TRUE
## X2006           TRUE
## X2007           TRUE

The result (and the syntax) is equivalent to that of Pandas, and can be used for
indexing as follows.

existing_df$Spain[existing_df['United Kingdom'] > 10]

## [1] 24 24 23

As we did in Python/Pandas, let's use the whole boolean matrix we got before.

head(existing_df[ existing_df_gt10 ]) # check first few elements

## [1] 436 429 422 415 407 397

But hey, the results are quite different from what we would expect coming from
using Pandas. We got a long vector of values, not a data frame. The problem is
that the [ ] operator, when passed a matrix, first coerces the data frame to a
matrix. Basically we cannot seamlessly work with R data.frames and boolean matrices
as we did with Pandas. We should instead index in both dimensions, columns and rows,
separately.
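
For comparison, the flat vector that R returns can be reproduced on the Python side by dropping down to NumPy arrays. The sketch below uses a made-up toy_df (hypothetical values) and assumes we want the column-by-column order that R uses for matrices.

```python
import pandas as pd

# Made-up values; not the TB data set.
toy_df = pd.DataFrame(
    {"A": [42, 4, 18], "B": [16, 9, 12]},
    index=["1990", "1991", "1992"],
)

values = toy_df.to_numpy()
mask = (toy_df > 10).to_numpy()

# NumPy boolean indexing walks the array row by row. Transposing both
# arrays first gives the column-by-column order of an R matrix.
flat_by_column = values.T[mask.T]

print(flat_by_column.tolist())  # [42, 18, 16, 12]
```

The labels are lost in both languages once the data is flattened, which is why indexing rows and columns separately is usually the better choice.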

But still, we can use matrix indexing with a data frame to replace elements.

existing_df_2 <- existing_df
existing_df_2[ existing_df_gt10 ] <- -1
head(existing_df_2,2) # check just a couple of rows

##       Afghanistan Albania Algeria American Samoa Andorra Angola Anguilla
## X1990          -1      -1      -1             -1      -1     -1       -1
## X1991          -1      -1      -1             -1      -1     -1       -1
##       Antigua and Barbuda Argentina Armenia Australia Austria Azerbaijan
## X1990                  -1        -1      -1         7      -1         -1
## X1991                  -1        -1      -1         7      -1         -1
##       Bahamas Bahrain Bangladesh Barbados Belarus Belgium Belize Benin
## X1990      -1      -1         -1        8      -1      -1     -1    -1
## X1991      -1      -1         -1        8      -1      -1     -1    -1
##       Bermuda Bhutan Bolivia Bosnia and Herzegovina Botswana Brazil
## X1990      10     -1      -1                     -1       -1     -1
## X1991      10     -1      -1                     -1       -1     -1
##       British Virgin Islands Brunei Darussalam Bulgaria Burkina Faso
## X1990                     -1                -1       -1           -1
## X1991                     -1                -1       -1           -1
##       Burundi Cambodia Cameroon Canada Cape Verde Cayman Islands
## X1990      -1       -1       -1      7         -1             10
## X1991      -1       -1       -1      7         -1             10
##       Central African Republic Chad Chile China Colombia Comoros
## X1990                       -1   -1    -1    -1       -1      -1
## X1991                       -1   -1    -1    -1       -1      -1
##       Congo, Rep. Cook Islands Costa Rica Croatia Cuba Cyprus
## X1990          -1            0         -1      -1   -1     -1
## X1991          -1           10         -1      -1   -1     -1
##       Czech Republic Cote d'Ivoire Korea, Dem. Rep. Congo, Dem. Rep.
## X1990             -1            -1               -1               -1
## X1991             -1            -1               -1               -1
##       Denmark Djibouti Dominica Dominican Republic Ecuador Egypt
## X1990      -1       -1       -1                 -1      -1    -1
## X1991      -1       -1       -1                 -1      -1    -1
##       El Salvador Equatorial Guinea Eritrea Estonia Ethiopia Fiji Finland
## X1990          -1                -1      -1      -1       -1   -1      -1
## X1991          -1                -1      -1      -1       -1   -1      -1
##       France French Polynesia Gabon Gambia Georgia Germany Ghana Greece
## X1990     -1               -1    -1     -1      -1      -1    -1     -1
## X1991     -1               -1    -1     -1      -1      -1    -1     -1
##       Grenada Guam Guatemala Guinea Guinea-Bissau Guyana Haiti Honduras
## X1990       7   -1        -1     -1            -1     -1    -1       -1
## X1991       7   -1        -1     -1            -1     -1    -1       -1
##       Hungary Iceland India Indonesia Iran Iraq Ireland Israel Italy
## X1990      -1       5    -1        -1   -1   -1      -1     -1    -1
## X1991      -1       4    -1        -1   -1   -1      -1     10    10
##       Jamaica Japan Jordan Kazakhstan Kenya Kiribati Kuwait Kyrgyzstan
## X1990      10    -1     -1         -1    -1       -1     -1         -1
## X1991      10    -1     -1         -1    -1       -1     -1         -1
##       Laos Latvia Lebanon Lesotho Liberia Libyan Arab Jamahiriya Lithuania
## X1990   -1     -1      -1      -1      -1                     -1        -1
## X1991   -1     -1      -1      -1      -1                     -1        -1
##       Luxembourg Madagascar Malawi Malaysia Maldives Mali Malta Mauritania
## X1990         -1         -1     -1       -1       -1   -1    10         -1
## X1991         -1         -1     -1       -1       -1   -1     9         -1
##       Mauritius Mexico Micronesia, Fed. Sts. Monaco Mongolia Montserrat
## X1990        -1     -1                    -1      3       -1         -1
## X1991        -1     -1                    -1      3       -1         -1
##       Morocco Mozambique Myanmar Namibia Nauru Nepal Netherlands
## X1990      -1         -1      -1      -1    -1    -1          -1
## X1991      -1         -1      -1      -1    -1    -1          10
##       Netherlands Antilles New Caledonia New Zealand Nicaragua Niger
## X1990                   -1            -1          10        -1    -1
## X1991                   -1            -1          10        -1    -1
##       Nigeria Niue Northern Mariana Islands Norway Oman Pakistan Palau
## X1990      -1   -1                       -1      8   -1       -1    -1
## X1991      -1   -1                       -1      8   -1       -1    -1
##       Panama Papua New Guinea Paraguay Peru Philippines Poland Portugal
## X1990     -1               -1       -1   -1          -1     -1       -1
## X1991     -1               -1       -1   -1          -1     -1       -1
##       Puerto Rico Qatar Korea, Rep. Moldova Romania Russian Federation
## X1990          -1    -1          -1      -1      -1                 -1
## X1991          -1    -1          -1      -1      -1                 -1
##       Rwanda Saint Kitts and Nevis Saint Lucia
## X1990     -1                    -1          -1
## X1991     -1                    -1          -1
##       Saint Vincent and the Grenadines Samoa San Marino
## X1990                               -1    -1          9
## X1991                               -1    -1          9
##       Sao Tome and Principe Saudi Arabia Senegal Seychelles Sierra Leone
## X1990                    -1           -1      -1         -1           -1
## X1991                    -1           -1      -1         -1           -1
##       Singapore Slovakia Slovenia Solomon Islands Somalia South Africa
## X1990        -1       -1       -1              -1      -1           -1
## X1991        -1       -1       -1              -1      -1           -1
##       Spain Sri Lanka Sudan Suriname Swaziland Sweden Switzerland
## X1990    -1        -1    -1       -1        -1      5          -1
## X1991    -1        -1    -1       -1        -1      5          -1
##       Syrian Arab Republic Tajikistan Thailand Macedonia, FYR Timor-Leste
## X1990                   -1         -1       -1             -1          -1
## X1991                   -1         -1       -1             -1          -1
##       Togo Tokelau Tonga Trinidad and Tobago Tunisia Turkey Turkmenistan
## X1990   -1      -1    -1                  -1      -1     -1           -1
## X1991   -1      -1    -1                  -1      -1     -1           -1
##       Turks and Caicos Islands Tuvalu Uganda Ukraine United Arab Emirates
## X1990                       -1     -1     -1      -1                   -1
## X1991                       -1     -1     -1      -1                   -1
##       United Kingdom Tanzania Virgin Islands (U.S.)
## X1990              9       -1                    -1
## X1991              9       -1                    -1
##       United States of America Uruguay Uzbekistan Vanuatu Venezuela
## X1990                        7      -1         -1      -1        -1
## X1991                        7      -1         -1      -1        -1
##       Viet Nam Wallis et Futuna West Bank and Gaza Yemen Zambia Zimbabwe
## X1990       -1               -1                 -1    -1     -1       -1
## X1991       -1               -1                 -1    -1     -1       -1

We can see that many of the elements, those where we had more than 10 cases, were
assigned a -1 value.
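
The same replacement can be written in Pandas. The following sketch uses a made-up toy_df (hypothetical values) and shows two options: mask(), which returns a modified copy, and assignment through a boolean data frame, which changes the frame in place.

```python
import pandas as pd

# Made-up values; not the TB data set.
toy_df = pd.DataFrame(
    {"A": [42, 4, 18], "B": [16, 9, 12]},
    index=["1990", "1991", "1992"],
)

# mask() replaces the cells where the condition is True and returns a copy.
replaced = toy_df.mask(toy_df > 10, -1)

# Assignment with a boolean data frame modifies an explicit copy in place.
toy_df_2 = toy_df.copy()
toy_df_2[toy_df_2 > 10] = -1

print(replaced)
```

mask() is the mirror image of where(): where() keeps the cells that satisfy the condition, while mask() replaces them.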

The most expressive way of selecting from a data.frame in R is by using the
subset function (type ?subset in your R console to
read about this function). The function is applied by row in the data frame.
The second argument can include any condition using column names. The third argument
can include a list of columns. The resulting data frame will contain those rows
that satisfy the second argument conditions, including just those columns listed
in the third argument (all of them by default). For example, if we want to select
those years when the United Kingdom had more than 10 cases, and list the resulting
rows for three countries (UK, Spain, and Colombia) we will use:

# If a column name contains blanks, we have to quote it with backticks ` `
subset(existing_df,  `United Kingdom`>10, c('United Kingdom', 'Spain','Colombia'))

##       United Kingdom Spain Colombia
## X2005             11    24       53
## X2006             11    24       44
## X2007             12    23       43

We can do the same thing using [ ] as follows.

existing_df[existing_df["United Kingdom"]>10, c('United Kingdom', 'Spain','Colombia')]

##       United Kingdom Spain Colombia
## X2005             11    24       53
## X2006             11    24       44
## X2007             12    23       43
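
For readers coming from the Pandas side, a rough counterpart of subset can be sketched with .loc or with query(). The toy_df below holds made-up numbers, and the backtick syntax inside query() assumes a reasonably recent pandas release (0.25 or later, as far as we know).

```python
import pandas as pd

# Made-up values; not the TB data set.
toy_df = pd.DataFrame(
    {
        "United Kingdom": [10, 11, 12],
        "Spain": [25, 24, 23],
        "Colombia": [60, 53, 44],
        "Chile": [20, 19, 18],
    },
    index=pd.Index(["2004", "2005", "2006"], name="year"),
)
columns = ["United Kingdom", "Spain", "Colombia"]

# Row condition and column list in one step, like subset(df, cond, cols).
selected = toy_df.loc[toy_df["United Kingdom"] > 10, columns]

# query() takes the condition as a string; backticks quote names with blanks.
selected_query = toy_df.query("`United Kingdom` > 10")[columns]

print(selected)
```

Both calls return the rows for 2005 and 2006 restricted to the three listed columns.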

Function mapping and data grouping

Python

The pandas.DataFrame class defines several ways of applying functions, both index-wise and element-wise. Some of them are already predefined, and are part of the descriptive statistics methods we will talk about when performing exploratory data analysis.

existing_df.sum()

    country
    Afghanistan            6360
    Albania                 665
    Algeria                 853
    American Samoa          221
    Andorra                 455
    Angola                 7442
    Anguilla                641
    Antigua and Barbuda     195
    Argentina              1102
    Armenia                1349
    Australia               116
    Austria                 228
    Azerbaijan             1541
    Bahamas                 920
    Bahrain                1375
    ...
    United Arab Emirates         577
    United Kingdom               173
    Tanzania                    5713
    Virgin Islands (U.S.)        367
    United States of America      88
    Uruguay                      505
    Uzbekistan                  2320
    Vanuatu                     3348
    Venezuela                    736
    Viet Nam                    5088
    Wallis et Futuna            2272
    West Bank and Gaza           781
    Yemen                       3498
    Zambia                      9635
    Zimbabwe                    9231
    Length: 207, dtype: int64

We have just summed, for each country, the yearly number of existing TB cases per 100K from 1990 to 2007. We can do the same by year if we pass axis=1, which sums across the columns (countries) instead of down the index (years).

existing_df.sum(axis=1)

    year
    1990    40772
    1991    40669
    1992    39912
    1993    39573
    1994    39066
    1995    38904
    1996    37032
    1997    37462
    1998    36871
    1999    37358
    2000    36747
    2001    36804
    2002    37160
    2003    36516
    2004    36002
    2005    35435
    2006    34987
    2007    34622
    dtype: int64

It looks like there is a decline in the existing number of TB cases per 100K across the world.
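
The effect of the axis argument is easy to verify on a tiny frame. This sketch uses made-up numbers (toy_df is hypothetical) chosen so that the sums can be checked by hand.

```python
import pandas as pd

# Made-up values; not the TB data set.
toy_df = pd.DataFrame(
    {"A": [1, 2, 3], "B": [10, 20, 30]},
    index=pd.Index(["1990", "1991", "1992"], name="year"),
)

# Default axis=0: collapse the index, giving one total per column (country).
per_country = toy_df.sum()

# axis=1: collapse the columns, giving one total per row (year).
per_year = toy_df.sum(axis=1)

print(per_country.to_dict())  # {'A': 6, 'B': 60}
print(per_year.tolist())      # [11, 22, 33]
```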

Pandas also provides methods to apply other functions to data frames. There are three: apply, applymap, and groupby.

apply and applymap

By using apply() we can apply a function along an input axis of a DataFrame. The objects passed to the function are Series objects whose index is either the DataFrame’s index (axis=0) or its columns (axis=1). The return type depends on whether the passed function aggregates. For example, if we want to obtain the number of existing cases per 10K (instead of per 100K) we can use the following.

from __future__ import division # only needed in Python 2, where / on integers is floor division
existing_df.apply(lambda x: x/10)

country	Afghanistan	Albania	Algeria	American Samoa	Andorra	Angola	Anguilla	Antigua and Barbuda	Argentina	Armenia	...	Uruguay	Uzbekistan	Vanuatu	Venezuela	Viet Nam	Wallis et Futuna	West Bank and Gaza	Yemen	Zambia	Zimbabwe
year
1990	43.6	4.2	4.5	4.2	3.9	51.4	3.8	1.6	9.6	5.2	...	3.5	11.4	27.8	4.6	36.5	12.6	5.5	26.5	43.6	40.9
1991	42.9	4.0	4.4	1.4	3.7	51.4	3.8	1.5	9.1	4.9	...	3.4	10.5	26.8	4.5	36.1	35.2	5.4	26.1	45.6	41.7
1992	42.2	4.1	4.4	0.4	3.5	51.3	3.7	1.5	8.6	5.1	...	3.3	10.2	25.9	4.4	35.8	6.4	5.4	26.3	49.4	41.5
1993	41.5	4.2	4.3	1.8	3.3	51.2	3.7	1.4	8.2	5.5	...	3.2	11.8	25.0	4.3	35.4	17.4	5.2	25.3	52.6	41.9
1994	40.7	4.2	4.3	1.7	3.2	51.0	3.6	1.3	7.8	6.0	...	3.1	11.6	24.2	4.2	35.0	17.2	5.2	25.0	55.6	42.6
1995	39.7	4.3	4.2	2.2	3.0	50.8	3.5	1.2	7.4	6.8	...	3.0	11.9	23.4	4.2	34.6	9.3	5.0	24.4	58.5	43.9
1996	39.7	4.2	4.3	0.0	2.8	51.2	3.5	1.2	7.1	7.4	...	2.8	11.1	22.6	4.1	31.2	12.3	4.9	23.3	60.2	45.3
1997	38.7	4.4	4.4	2.5	2.3	36.3	3.6	1.1	6.7	7.5	...	2.7	12.2	21.8	4.1	27.3	21.3	4.6	20.7	62.6	48.1
1998	37.4	4.3	4.5	1.2	2.4	41.4	3.6	1.1	6.3	7.4	...	2.8	12.9	21.1	4.0	26.1	10.7	4.4	19.4	63.4	39.2
1999	37.3	4.2	4.6	0.8	2.2	38.4	3.6	0.9	5.8	8.6	...	2.8	13.4	15.9	3.9	25.3	10.5	4.2	17.5	65.7	43.0
2000	34.6	4.0	4.8	0.8	2.0	53.0	3.5	0.8	5.2	9.4	...	2.7	13.9	14.3	3.9	24.8	10.3	4.0	16.4	65.8	47.9
2001	32.6	3.4	4.9	0.6	2.0	33.5	3.5	0.9	5.1	9.9	...	2.5	14.8	12.8	4.1	24.3	1.3	3.9	15.4	68.0	52.3
2002	30.4	3.2	5.0	0.5	2.1	30.7	3.5	0.7	4.2	9.7	...	2.7	14.4	14.9	4.1	23.5	27.5	3.7	14.9	51.7	57.1
2003	30.8	3.2	5.1	0.6	1.8	28.1	3.5	0.9	4.1	9.1	...	2.5	15.2	12.8	3.9	23.4	14.7	3.6	14.6	47.8	63.2
2004	28.3	2.9	5.2	0.9	1.9	31.8	3.5	0.8	3.9	8.5	...	2.3	14.9	11.8	3.8	22.6	6.3	3.5	13.8	46.8	65.2
2005	26.7	2.9	5.3	1.1	1.8	33.1	3.4	0.8	3.9	7.9	...	2.4	14.4	13.1	3.8	22.7	5.7	3.3	13.7	45.3	68.0
2006	25.1	2.6	5.5	0.9	1.7	30.2	3.4	0.9	3.7	7.9	...	2.5	13.4	10.4	3.8	22.2	6.0	3.2	13.5	42.2	69.9
2007	23.8	2.2	5.6	0.5	1.9	29.4	3.4	0.9	3.5	8.1	...	2.3	14.0	10.2	3.9	22.0	2.5	3.1	13.0	38.7	71.4

18 rows × 207 columns

Note that apply does not call our function on each element: it passes each column to it as a Series. Because division is an operation that pandas broadcasts over every element of a Series, each call returns a Series with the function applied to each element and hence, in our case, we get a data frame as a result. However, the method intended for element-wise maps is applymap.
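
The distinction can be made visible with a small experiment. The sketch below uses a made-up toy_df and a hypothetical helper, per_10k, that records the type of the object it receives. It also assumes that recent pandas releases (2.1 and later, as far as we know) prefer DataFrame.map over applymap, so it picks whichever method is available.

```python
import pandas as pd

# Made-up values; not the TB data set.
toy_df = pd.DataFrame({"A": [436, 429], "B": [42, 40]}, index=["1990", "1991"])

seen_types = []

def per_10k(column):
    # Record what apply() hands to the function: a whole column.
    seen_types.append(type(column).__name__)
    return column / 10

via_apply = toy_df.apply(per_10k)

# Element-wise mapping: use DataFrame.map when it exists, else applymap.
elementwise = getattr(toy_df, "map", None) or toy_df.applymap
via_elementwise = elementwise(lambda value: value / 10)

# For plain arithmetic, the vectorized expression is the simplest option.
vectorized = toy_df / 10

print(set(seen_types))  # {'Series'}
print(via_apply)
```

All three results hold the same values; apply and the element-wise methods become useful when the function cannot be expressed as a vectorized operation.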

groupby

Grouping is a powerful and important data frame operation in Exploratory Data Analysis. In Pandas we can do this easily. For example, imagine we want the mean number of existing cases per year in two different periods, before and after the year 2000. We can do the following.

mean_cases_by_period = existing_df.groupby(lambda x: int(x)>1999).mean()
mean_cases_by_period.index = ['1990-1999', '2000-2007']
mean_cases_by_period

country	Afghanistan	Albania	Algeria	American Samoa	Andorra	Angola	Anguilla	Antigua and Barbuda	Argentina	Armenia	...	Uruguay	Uzbekistan	Vanuatu	Venezuela	Viet Nam	Wallis et Futuna	West Bank and Gaza	Yemen	Zambia	Zimbabwe
1990-1999	403.700	42.1	43.90	16.200	30.3	474.40	36.400	12.800	76.6	64.400	...	30.600	117.00	234.500	42.300	323.300	152.900	49.800	234.500	557.200	428.10
2000-2007	290.375	30.5	51.75	7.375	19.0	337.25	34.625	8.375	42.0	88.125	...	24.875	143.75	125.375	39.125	231.875	92.875	35.375	144.125	507.875	618.75

2 rows × 207 columns

The groupby method accepts different types of grouping keys, including a mapping function like the one we passed, a dictionary, a Series, or a tuple / list of column names. The mapping function, for example, will be called on each element of the object's index (the year string in our case) to determine the groups. If a dict or Series is passed, its values are used to determine the groups (e.g. we can pass a column that contains categorical values).

We can index the resulting data frame as usual.

mean_cases_by_period[['United Kingdom', 'Spain', 'Colombia']]

country	United Kingdom	Spain	Colombia
1990-1999	9.200	35.300	75.10
2000-2007	10.125	24.875	53.25

R

lapply

R has a large family of apply functions that can be used to apply functions to
elements within vectors, matrices, lists, and data frames. The one we will introduce here
is lapply (type ?lapply in your R console). It is the one we use with lists and,
since a data frame is a list of column vectors, it works with data frames as well.

For example, we can repeat the per-country sum over all years that we did with Pandas as follows.

existing_df_sum_years <- lapply(existing_df, function(x) { sum(x) })
existing_df_sum_years <- as.data.frame(existing_df_sum_years)
existing_df_sum_years

##   Afghanistan Albania Algeria American.Samoa Andorra Angola Anguilla
## 1        6360     665     853            221     455   7442      641
##   Antigua.and.Barbuda Argentina Armenia Australia Austria Azerbaijan
## 1                 195      1102    1349       116     228       1541
##   Bahamas Bahrain Bangladesh Barbados Belarus Belgium Belize Benin Bermuda
## 1     920    1375       9278       95    1446     229    864  2384     133
##   Bhutan Bolivia Bosnia.and.Herzegovina Botswana Brazil
## 1  10579    4806                   1817     8067   1585
##   British.Virgin.Islands Brunei.Darussalam Bulgaria Burkina.Faso Burundi
## 1                    383              1492      960         5583    8097
##   Cambodia Cameroon Canada Cape.Verde Cayman.Islands
## 1    14015     3787     92       6712            129
##   Central.African.Republic Chad Chile China Colombia Comoros Congo..Rep.
## 1                     7557 7316   452  4854     1177    2310        6755
##   Cook.Islands Costa.Rica Croatia Cuba Cyprus Czech.Republic Cote.d.Ivoire
## 1          357        349    1637  295    163            304          7900
##   Korea..Dem..Rep. Congo..Dem..Rep. Denmark Djibouti Dominica
## 1            12359             9343     151    19155      375
##   Dominican.Republic Ecuador Egypt El.Salvador Equatorial.Guinea Eritrea
## 1               2252    3676   700        1483              5303    3181
##   Estonia Ethiopia Fiji Finland France French.Polynesia Gabon Gambia
## 1    1214     8432  811     153    263              974  5949   6700
##   Georgia Germany Ghana Greece Grenada Guam Guatemala Guinea Guinea.Bissau
## 1    1406     180  7368    380     125 1340      1716   5853          6207
##   Guyana Haiti Honduras Hungary Iceland India Indonesia Iran Iraq Ireland
## 1   1621  7428     1756     930      58  8107      6131  789 1433     233
##   Israel Italy Jamaica Japan Jordan Kazakhstan Kenya Kiribati Kuwait
## 1    138   139     142   822    236       2249  5117    12652    928
##   Kyrgyzstan Laos Latvia Lebanon Lesotho Liberia Libyan.Arab.Jamahiriya
## 1       2354 6460   1351     783    6059    7707                    559
##   Lithuania Luxembourg Madagascar Malawi Malaysia Maldives  Mali Malta
## 1      1579        233       6691   6290     2615     1638 10611   120
##   Mauritania Mauritius Mexico Micronesia..Fed..Sts. Monaco Mongolia
## 1      10698       817    978                  3570     44     6127
##   Montserrat Morocco Mozambique Myanmar Namibia Nauru Nepal Netherlands
## 1        227    1873       7992    5061    9990  2860  7398         138
##   Netherlands.Antilles New.Caledonia New.Zealand Nicaragua Niger Nigeria
## 1                  355          1095         176      1708  5360    7968
##   Niue Northern.Mariana.Islands Norway Oman Pakistan Palau Panama
## 1 1494                     3033    103  337     6889  2258   1073
##   Papua.New.Guinea Paraguay Peru Philippines Poland Portugal Puerto.Rico
## 1             8652     1559 4352       11604   1064      677         206
##   Qatar Korea..Rep. Moldova Romania Russian.Federation Rwanda
## 1  1380        2353    2781    2891               2170   7216
##   Saint.Kitts.and.Nevis Saint.Lucia Saint.Vincent.and.the.Grenadines Samoa
## 1                   259         371                              709   568
##   San.Marino Sao.Tome.and.Principe Saudi.Arabia Senegal Seychelles
## 1        118                  5129         1171    7423       1347
##   Sierra.Leone Singapore Slovakia Slovenia Solomon.Islands Somalia
## 1        11756       751      700      639            6623    8128
##   South.Africa Spain Sri.Lanka Sudan Suriname Swaziland Sweden Switzerland
## 1        10788   552      1695  7062     1975     11460     82         149
##   Syrian.Arab.Republic Tajikistan Thailand Macedonia..FYR Timor.Leste
## 1                  986       3438     4442           1108       10118
##    Togo Tokelau Tonga Trinidad.and.Tobago Tunisia Turkey Turkmenistan
## 1 12111    1283   679                 282     685   1023         1866
##   Turks.and.Caicos.Islands Tuvalu Uganda Ukraine United.Arab.Emirates
## 1                      485   7795   7069    1778                  577
##   United.Kingdom Tanzania Virgin.Islands..U.S.. United.States.of.America
## 1            173     5713                   367                       88
##   Uruguay Uzbekistan Vanuatu Venezuela Viet.Nam Wallis.et.Futuna
## 1     505       2320    3348       736     5088             2272
##   West.Bank.and.Gaza Yemen Zambia Zimbabwe
## 1                781  3498   9635     9231

What did we do there? Very simple. The lapply function takes a list and a function
that will be applied to each element, and returns the result as a list. The function
is defined in-line (like a lambda in Python). For a given x, it sums its elements.
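For readers coming from the Python side, the behaviour of lapply on a data frame can be mimicked roughly as below. This is only a sketch: lapply_like and toy_df are hypothetical names introduced for illustration, and a dict plays the role of R's named list.

```python
import pandas as pd

# Hypothetical toy data: years as the index, countries as columns
toy_df = pd.DataFrame(
    {"Spain": [40, 30], "Colombia": [80, 70]},
    index=["1990", "1991"],
)

def lapply_like(frame, func):
    """Apply func to each column and return the results keyed by column name."""
    return {name: func(column) for name, column in frame.items()}

# One total per country (columns are countries)
sum_per_country = lapply_like(toy_df, sum)

# One total per year (transposing turns the years into columns)
sum_per_year = lapply_like(toy_df.T, sum)

print(sum_per_country)
print(sum_per_year)
```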

If we instead want the sum for each year across all countries, we can use the
transposed data frame we stored before.

existing_df_sum_countries <- lapply(existing_df_t, function(x) { sum(x) })
existing_df_sum_countries <- as.data.frame(existing_df_sum_countries)
existing_df_sum_countries

##   X1990 X1991 X1992 X1993 X1994 X1995 X1996 X1997 X1998 X1999 X2000 X2001
## 1 40772 40669 39912 39573 39066 38904 37032 37462 36871 37358 36747 36804
##   X2002 X2003 X2004 X2005 X2006 X2007
## 1 37160 36516 36002 35435 34987 34622

aggregate

R provides basic grouping functionality through aggregate. Another option is
to have a look at the powerful dplyr library, which I highly recommend.

But aggregate is quite powerful as well. It accepts a data frame, a list of
grouping elements, and a function to apply to each group. First we need to define
a grouping vector.

# Label the 18 yearly rows: 10 for 1990-1999, then 8 for 2000-2007
before_2000 <- rep(c('1990-99', '2000-07'), times = c(10, 8))
before_2000

##  [1] "1990-99" "1990-99" "1990-99" "1990-99" "1990-99" "1990-99" "1990-99"
##  [8] "1990-99" "1990-99" "1990-99" "2000-07" "2000-07" "2000-07" "2000-07"
## [15] "2000-07" "2000-07" "2000-07" "2000-07"

Then we can use that vector as the grouping element and apply the function mean.

mean_cases_by_period <- aggregate(existing_df, list(Period = before_2000), mean)
mean_cases_by_period

##    Period Afghanistan Albania Algeria American Samoa Andorra Angola
## 1 1990-99     403.700    42.1   43.90         16.200    30.3 474.40
## 2 2000-07     290.375    30.5   51.75          7.375    19.0 337.25
##   Anguilla Antigua and Barbuda Argentina Armenia Australia Austria
## 1   36.400              12.800      76.6  64.400       6.8  14.500
## 2   34.625               8.375      42.0  88.125       6.0  10.375
##   Azerbaijan Bahamas Bahrain Bangladesh Barbados Belarus Belgium Belize
## 1     75.600  52.700  95.600     571.20    6.400  80.500  14.000  54.60
## 2     98.125  49.125  52.375     445.75    3.875  80.125  11.125  39.75
##     Benin Bermuda  Bhutan Bolivia Bosnia and Herzegovina Botswana  Brazil
## 1 131.300   8.400 699.600   308.2                  132.9  356.400 103.400
## 2 133.875   6.125 447.875   215.5                   61.0  562.875  68.875
##   British Virgin Islands Brunei Darussalam Bulgaria Burkina Faso Burundi
## 1                 24.600             90.60   57.700        239.9  332.30
## 2                 17.125             73.25   47.875        398.0  596.75
##   Cambodia Cameroon Canada Cape Verde Cayman Islands
## 1    835.9  201.400  5.900    409.500          8.400
## 2    707.0  221.625  4.125    327.125          5.625
##   Central African Republic    Chad Chile  China Colombia Comoros
## 1                  360.000 330.300  32.0 300.00    75.10 152.500
## 2                  494.625 501.625  16.5 231.75    53.25  98.125
##   Congo, Rep. Cook Islands Costa Rica Croatia  Cuba Cyprus Czech Republic
## 1     322.200       23.400       24.5 110.000 21.70  10.90           20.8
## 2     441.625       15.375       13.0  67.125  9.75   6.75           12.0
##   Cote d'Ivoire Korea, Dem. Rep. Congo, Dem. Rep. Denmark Djibouti
## 1        331.00          794.400           393.30    9.70 1145.000
## 2        573.75          551.875           676.25    6.75  963.125
##   Dominica Dominican Republic Ecuador Egypt El Salvador Equatorial Guinea
## 1   22.000             148.20 236.700  45.6       101.9            206.50
## 2   19.375              96.25 163.625  30.5        58.0            404.75
##   Eritrea Estonia Ethiopia  Fiji Finland France French Polynesia   Gabon
## 1 221.200  77.700  382.900 54.50  10.400  16.90           70.900 330.800
## 2 121.125  54.625  575.375 33.25   6.125  11.75           33.125 330.125
##   Gambia Georgia Germany   Ghana Greece Grenada   Guam Guatemala  Guinea
## 1 352.20    68.2    12.8 450.100 24.300   7.000 100.20   101.500 274.200
## 2 397.25    90.5     6.5 358.375 17.125   6.875  42.25    87.625 388.875
##   Guinea-Bissau  Guyana   Haiti Honduras Hungary Iceland   India Indonesia
## 1        394.10  61.800 438.100  118.900  68.300   3.700 533.200    387.70
## 2        283.25 125.375 380.875   70.875  30.875   2.625 346.875    281.75
##     Iran   Iraq Ireland Israel Italy Jamaica  Japan Jordan Kazakhstan
## 1 52.000 85.800    14.9   8.80 8.800     8.6 53.700 16.300      107.3
## 2 33.625 71.875    10.5   6.25 6.375     7.0 35.625  9.125      147.0
##   Kenya Kiribati Kuwait Kyrgyzstan   Laos Latvia Lebanon Lesotho Liberia
## 1 208.9  874.900  69.40    118.700 393.40 75.400    57.9   271.5   444.7
## 2 378.5  487.875  29.25    145.875 315.75 74.625    25.5   418.0   407.5
##   Libyan Arab Jamahiriya Lithuania Luxembourg Madagascar Malawi Malaysia
## 1                 40.200     94.10      15.10      359.5  355.0   158.90
## 2                 19.625     79.75      10.25      387.0  342.5   128.25
##   Maldives    Mali Malta Mauritania Mauritius Mexico Micronesia, Fed. Sts.
## 1  105.500 595.200  7.80    600.700    50.200  72.40                246.80
## 2   72.875 582.375  5.25    586.375    39.375  31.75                137.75
##   Monaco Mongolia Montserrat Morocco Mozambique Myanmar Namibia   Nauru
## 1    2.8   412.50       13.5 116.600    368.300  352.70 566.900 216.500
## 2    2.0   250.25       11.5  88.375    538.625  191.75 540.125  86.875
##     Nepal Netherlands Netherlands Antilles New Caledonia New Zealand
## 1 523.300        8.80                 22.7          83.1      10.100
## 2 270.625        6.25                 16.0          33.0       9.375
##   Nicaragua  Niger Nigeria  Niue Northern Mariana Islands Norway   Oman
## 1    113.40 308.60 361.500 98.80                  228.200    6.7 23.200
## 2     71.75 284.25 544.125 63.25                   93.875    4.5 13.125
##   Pakistan   Palau Panama Papua New Guinea Paraguay   Peru Philippines
## 1  423.400 164.100 68.800          494.900   89.400 297.40       726.4
## 2  331.875  77.125 48.125          462.875   83.125 172.25       542.5
##   Poland Portugal Puerto Rico Qatar Korea, Rep. Moldova Romania
## 1 77.100    43.90      15.300    78     141.600 140.000   153.1
## 2 36.625    29.75       6.625    75     117.125 172.625   170.0
##   Russian Federation Rwanda Saint Kitts and Nevis Saint Lucia
## 1             107.20 274.20                  15.1       22.50
## 2             137.25 559.25                  13.5       18.25
##   Saint Vincent and the Grenadines Samoa San Marino Sao Tome and Principe
## 1                            42.30 35.00      7.500                 306.1
## 2                            35.75 27.25      5.375                 258.5
##   Saudi Arabia Senegal Seychelles Sierra Leone Singapore Slovakia Slovenia
## 1       67.000 385.000     91.400      531.900     49.70   49.700   47.800
## 2       62.625 446.625     54.125      804.625     31.75   25.375   20.125
##   Solomon Islands Somalia South Africa  Spain Sri Lanka   Sudan Suriname
## 1         469.600 521.100        569.2 35.300      99.1 401.100     95.1
## 2         240.875 364.625        637.0 24.875      88.0 381.375    128.0
##   Swaziland Sweden Switzerland Syrian Arab Republic Tajikistan Thailand
## 1   527.900  4.900       10.30               72.300     134.00    288.6
## 2   772.625  4.125        5.75               32.875     262.25    194.5
##   Macedonia, FYR Timor-Leste   Togo Tokelau Tonga Trinidad and Tobago
## 1         80.100       662.6 650.10   105.9  39.9              16.100
## 2         38.375       436.5 701.25    28.0  35.0              15.125
##   Tunisia Turkey Turkmenistan Turks and Caicos Islands Tuvalu Uganda
## 1  46.400 68.800      105.900                   32.200 511.30 352.70
## 2  27.625 41.875      100.875                   20.375 335.25 442.75
##   Ukraine United Arab Emirates United Kingdom Tanzania
## 1   81.60               37.400          9.200  279.200
## 2  120.25               25.375         10.125  365.125
##   Virgin Islands (U.S.) United States of America Uruguay Uzbekistan
## 1                23.000                      6.0  30.600     117.00
## 2                17.125                      3.5  24.875     143.75
##   Vanuatu Venezuela Viet Nam Wallis et Futuna West Bank and Gaza   Yemen
## 1 234.500    42.300  323.300          152.900             49.800 234.500
## 2 125.375    39.125  231.875           92.875             35.375 144.125
##    Zambia Zimbabwe
## 1 557.200   428.10
## 2 507.875   618.75

The aggregate function also lets us subset the data frame we pass as the first
parameter, pass multiple grouping elements, and define our own functions
(either anonymous or predefined). And again, the result is a data frame
that we can index as usual.

mean_cases_by_period[,c('United Kingdom','Spain','Colombia')]

##   United Kingdom  Spain Colombia
## 1          9.200 35.300    75.10
## 2         10.125 24.875    53.25
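The same pattern, an external label vector plus a function applied to each group, has a close counterpart in Pandas. The sketch below is an illustration on hypothetical toy data (toy_df, period, and value_range are made-up names), not the tutorial's dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: years as the index, countries as columns
toy_df = pd.DataFrame(
    {"Spain": [40.0, 30.0, 20.0, 10.0], "Colombia": [80.0, 70.0, 60.0, 50.0]},
    index=["1998", "1999", "2000", "2001"],
)

# One label per row, built outside the data frame (like before_2000 in R)
period = np.repeat(["1990-99", "2000-07"], [2, 2])

# Counterpart of aggregate(df, list(Period = labels), mean)
mean_by_period = toy_df.groupby(period).mean()

# A user-defined function applied to a column subset
def value_range(column):
    """Difference between the largest and smallest value in a group."""
    return column.max() - column.min()

range_by_period = toy_df[["Spain"]].groupby(period).agg(value_range)

print(mean_by_period)
print(range_by_period)
```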

Conclusions

This two-part tutorial has introduced the concept of the data frame, together with how to use it in the two most popular Data Science ecosystems today, R and Python. We have seen how Pandas is inspired by R, and how Python/Pandas offers constructs very similar to those present in the R language. Python is also widely used by software developers of all kinds. All this means that Pandas offers a more consistent programming interface, one that is more efficient in many situations. It is also widely held in the community that, if you come from a software development background, you will feel more comfortable with a language like Python and with how DataFrame is defined as an object-oriented concept. If you come instead from a maths and statistics background, you will appreciate a language like R: very interactive, function-based, and with libraries made by statisticians for statisticians. It is not a language meant to be used on its own in complex software architectures, but one meant for a powerful dialogue with data.

Additionally, we have introduced a few datasets from Gapminder World related to infectious tuberculosis, a very serious epidemic disease that is sometimes forgotten in developed countries but is nowadays the second leading cause of death of its kind, just after HIV (and is often associated with HIV). In the next tutorial in the series, we will use these datasets to perform some Exploratory Analysis in both Python and R, to better understand the world situation regarding the disease.

Remember that all the source code for the different parts of this series of tutorials and applications can be checked at GitHub. Feel free to get involved and share your progress with us!
