GET IT FREE

# Data Science with Python & R: Data Frames I

– {{showDate(postTime)}}

## Motivation

These series of tutorials on Data Science will try to compare how different concepts
in the discipline can be implemented in the two dominant ecosystems nowadays: R and Python.
We will do this from a neutral point of view. Our opinion is that each environment has
good and bad things, and any data scientist should know how to use both in order to be as
prepared as possible for the job market or to start a personal project.

To get a feeling of what is going on regarding this hot topic, we refer the reader to
DataCamp’s Data Science War
infographic. Their infographic explores what the strengths of R are over Python
and vice versa, and aims to provide a basic comparison between these two programming
languages from a data science and statistics perspective.

Far from being a repetition from the previous, our series of tutorials will go hands-on
into how to actually perform different data science tasks such as working with data frames,
doing aggregations, or creating different statistical models such in the areas of supervised
and unsupervised learning.

As usual, we will use real-world datasets. This will help us to quickly transfer what we
learn here to actual data analysis situations.

The first tutorial in our series will deal the an important abstraction, that of a Data Frame.
In the very next tutorial, we will introduce one of the first tasks we face when we have our data loaded, that of the Exploratory Data Analysis. This task can be performed using data frames and basic plots as we will show here for both, Python and R.

All the source code for the different parts of this series of tutorials and applications can be checked at GitHub. Feel free to get involved and share your progress with us!

## What is a DataFrame?

A data frame is used for storing tabular data. It has labeled axes (rows and columns) that we can use to perform arithmetic operations at on levels.

The concept was introduced in R before it was in Python Pandas so the later repeats many of the ideas from the former. In R, a data.frame is a list of vector variables of the same number of elements (rows) with unique row names. That is, each column is a vector with an associated name, and each row is a series of vector elements that correspond to the same position in each of the column-vectors.

In Pandas, a DataFrame can be thought of as a dict-like container for Series objects, where a Series is a one-dimensional NumPy ndarray with axis labels (including time series). By default, each Series correspond with a column in the resulting DataFrame.

But let’s see both data types in practice. First of all we will introduce a data set that will be used in order to explain the data frame creation process and what data analysis tasks can be done with a data frame. Then we will have a separate section for each platform repeating every task for you to be able to move from one to the other easily in the future.

## Introducing Gapminder World datasets

The Gapminder website presents itself as a fact-based worldview. It is a comprehensive resource for data regarding different countries and territories indicators. Its Data section contains a list of datasets that can be accessed as Google Spreadsheet pages (add &output=csv to download as CSV). Each indicator dataset is tagged with a Data provider, a Category, and a Subcategory.

For this tutorial, we will use different datasets related to Infectious Tuberculosis:

First thing we need to do is download the files for later use within our R and Python environments. There is a description of each dataset if we click in its title in the list of datasets. When performing any data analysis task, it is essential to understand our data as much as possible, so go there and have a read. Basically, each cell in the dataset contains the data related to the number of tuberculosis cases per 100K people during the given year (column) for each country or region (row).

We will use these datasets to better understand the TB incidence in different regions in time.

### Python

import urllib

local_tb_deaths_file = 'tb_deaths_100.csv'
local_tb_existing_file = 'tb_existing_100.csv'
local_tb_new_file = 'tb_new_100.csv'

deaths_f = urllib.urlretrieve(tb_deaths_url_csv, local_tb_deaths_file)
existing_f = urllib.urlretrieve(tb_existing_url_csv, local_tb_existing_file)
new_f = urllib.urlretrieve(tb_new_url_csv, local_tb_new_file)

import pandas as pd

deaths_df = pd.read_csv(local_tb_deaths_file, index_col = 0, thousands  = ',').T
existing_df = pd.read_csv(local_tb_existing_file, index_col = 0, thousands  = ',').T
new_df = pd.read_csv(local_tb_new_file, index_col = 0, thousands  = ',').T

We have specified index_col to be 0 since we want the country names to be the row labels. We also specified the thousands separator to be ‘,’ so Pandas automatically parses cells as numbers. Then, we transpose the table to make the time series for each country correspond to each column.

We will concentrate on the existing cases for a while. We can use head() to check the first few lines.

TB prevalence, all forms (per 100 000 population per year) Afghanistan Albania Algeria American Samoa Andorra Angola Anguilla Antigua and Barbuda Argentina Armenia Uruguay Uzbekistan Vanuatu Venezuela Viet Nam Wallis et Futuna West Bank and Gaza Yemen Zambia Zimbabwe
1990 436 42 45 42 39 514 38 16 96 52 35 114 278 46 365 126 55 265 436 409
1991 429 40 44 14 37 514 38 15 91 49 34 105 268 45 361 352 54 261 456 417
1992 422 41 44 4 35 513 37 15 86 51 33 102 259 44 358 64 54 263 494 415
1993 415 42 43 18 33 512 37 14 82 55 32 118 250 43 354 174 52 253 526 419
1994 407 42 43 17 32 510 36 13 78 60 31 116 242 42 350 172 52 250 556 426
###### 5 rows × 207 columns

By using the attribute columns we can read and write column names.

existing_df.columns
Index([u'Afghanistan', u'Albania', u'Algeria', u'American Samoa', u'Andorra', u'Angola', u'Anguilla', u'Antigua and Barbuda', u'Argentina', u'Armenia', u'Australia', u'Austria', u'Azerbaijan', u'Bahamas', u'Bahrain', u'Bangladesh', u'Barbados', u'Belarus', u'Belgium', u'Belize', u'Benin', u'Bermuda', u'Bhutan', u'Bolivia', u'Bosnia and Herzegovina', u'Botswana', u'Brazil', u'British Virgin Islands', u'Brunei Darussalam', u'Bulgaria', u'Burkina Faso', u'Burundi', u'Cambodia', u'Cameroon', u'Canada', u'Cape Verde', u'Cayman Islands', u'Central African Republic', u'Chad', u'Chile', u'China', u'Colombia', u'Comoros', u'Congo, Rep.', u'Cook Islands', u'Costa Rica', u'Croatia', u'Cuba', u'Cyprus', u'Czech Republic', u'Cote dIvoire', u'Korea, Dem. Rep.', u'Congo, Dem. Rep.', u'Denmark', u'Djibouti', u'Dominica', u'Dominican Republic', u'Ecuador', u'Egypt', u'El Salvador', u'Equatorial Guinea', u'Eritrea', u'Estonia', u'Ethiopia', u'Fiji', u'Finland', u'France', u'French Polynesia', u'Gabon', u'Gambia', u'Georgia', u'Germany', u'Ghana', u'Greece', u'Grenada', u'Guam', u'Guatemala', u'Guinea', u'Guinea-Bissau', u'Guyana', u'Haiti', u'Honduras', u'Hungary', u'Iceland', u'India', u'Indonesia', u'Iran', u'Iraq', u'Ireland', u'Israel', u'Italy', u'Jamaica', u'Japan', u'Jordan', u'Kazakhstan', u'Kenya', u'Kiribati', u'Kuwait', u'Kyrgyzstan', u'Laos', ...], dtype='object')

Similarly, we can access row names by using index.

existing_df.index
Index([u'1990', u'1991', u'1992', u'1993', u'1994', u'1995', u'1996', u'1997', u'1998', u'1999', u'2000', u'2001', u'2002', u'2003', u'2004', u'2005', u'2006', u'2007'], dtype='object')

We will use them to assign proper names to our column and index names.

deaths_df.index.names = ['year']
deaths_df.columns.names = ['country']
existing_df.index.names = ['year']
existing_df.columns.names = ['country']
new_df.index.names = ['year']
new_df.columns.names = ['country']
existing_df
country Afghanistan Albania Algeria American Samoa Andorra Angola Anguilla Antigua and Barbuda Argentina Armenia Uruguay Uzbekistan Vanuatu Venezuela Viet Nam Wallis et Futuna West Bank and Gaza Yemen Zambia Zimbabwe
year
1990 436 42 45 42 39 514 38 16 96 52 35 114 278 46 365 126 55 265 436 409
1991 429 40 44 14 37 514 38 15 91 49 34 105 268 45 361 352 54 261 456 417
1992 422 41 44 4 35 513 37 15 86 51 33 102 259 44 358 64 54 263 494 415
1993 415 42 43 18 33 512 37 14 82 55 32 118 250 43 354 174 52 253 526 419
1994 407 42 43 17 32 510 36 13 78 60 31 116 242 42 350 172 52 250 556 426
1995 397 43 42 22 30 508 35 12 74 68 30 119 234 42 346 93 50 244 585 439
1996 397 42 43 0 28 512 35 12 71 74 28 111 226 41 312 123 49 233 602 453
1997 387 44 44 25 23 363 36 11 67 75 27 122 218 41 273 213 46 207 626 481
1998 374 43 45 12 24 414 36 11 63 74 28 129 211 40 261 107 44 194 634 392
1999 373 42 46 8 22 384 36 9 58 86 28 134 159 39 253 105 42 175 657 430
2000 346 40 48 8 20 530 35 8 52 94 27 139 143 39 248 103 40 164 658 479
2001 326 34 49 6 20 335 35 9 51 99 25 148 128 41 243 13 39 154 680 523
2002 304 32 50 5 21 307 35 7 42 97 27 144 149 41 235 275 37 149 517 571
2003 308 32 51 6 18 281 35 9 41 91 25 152 128 39 234 147 36 146 478 632
2004 283 29 52 9 19 318 35 8 39 85 23 149 118 38 226 63 35 138 468 652
2005 267 29 53 11 18 331 34 8 39 79 24 144 131 38 227 57 33 137 453 680
2006 251 26 55 9 17 302 34 9 37 79 25 134 104 38 222 60 32 135 422 699
2007 238 22 56 5 19 294 34 9 35 81 23 140 102 39 220 25 31 130 387 714

### R

In R we use read.csv to read CSV files into data.frame variables. Although the R function read.csv can work with URLs, https is a problem for R in many cases, so you need to use a package like RCurl to get around it.

library(RCurl)

existing_df <- read.csv(text = existing_cases_file, row.names=1, stringsAsFactor=F)
str(existing_df)

## 'data.frame':	207 obs. of  18 variables:

##  \$ X1990: chr  "436" "42" "45" "42" ...

##  \$ X1991: chr  "429" "40" "44" "14" ...

##  \$ X1992: chr  "422" "41" "44" "4" ...

##  \$ X1993: chr  "415" "42" "43" "18" ...

##  \$ X1994: chr  "407" "42" "43" "17" ...

##  \$ X1995: chr  "397" "43" "42" "22" ...

##  \$ X1996: int  397 42 43 0 28 512 35 12 71 74 ...

##  \$ X1997: int  387 44 44 25 23 363 36 11 67 75 ...

##  \$ X1998: int  374 43 45 12 24 414 36 11 63 74 ...

##  \$ X1999: int  373 42 46 8 22 384 36 9 58 86 ...

##  \$ X2000: int  346 40 48 8 20 530 35 8 52 94 ...

##  \$ X2001: int  326 34 49 6 20 335 35 9 51 99 ...

##  \$ X2002: int  304 32 50 5 21 307 35 7 42 97 ...

##  \$ X2003: int  308 32 51 6 18 281 35 9 41 91 ...

##  \$ X2004: chr  "283" "29" "52" "9" ...

##  \$ X2005: chr  "267" "29" "53" "11" ...

##  \$ X2006: chr  "251" "26" "55" "9" ...

##  \$ X2007: chr  "238" "22" "56" "5" ...

The str() function in R gives us information about a variable type. In this case
we can see that, due to the , thousands separator,
some of the columns hasn’t been parsed as numbers but as character.
If we want to properly work with our dataset we need to convert them to numbers.
Once we know a bit more about indexing and mapping functions, I promise you will be
able to understand the following piece of code. By know let’s say that we convert
a column and assign it again to its reference in the data frame.

existing_df[c(1,2,3,4,5,6,15,16,17,18)] <-
lapply( existing_df[c(1,2,3,4,5,6,15,16,17,18)],
function(x) { as.integer(gsub(',', '', x) )})
str(existing_df)

## 'data.frame':	207 obs. of  18 variables:

##  \$ X1990: int  436 42 45 42 39 514 38 16 96 52 ...

##  \$ X1991: int  429 40 44 14 37 514 38 15 91 49 ...

##  \$ X1992: int  422 41 44 4 35 513 37 15 86 51 ...

##  \$ X1993: int  415 42 43 18 33 512 37 14 82 55 ...

##  \$ X1994: int  407 42 43 17 32 510 36 13 78 60 ...

##  \$ X1995: int  397 43 42 22 30 508 35 12 74 68 ...

##  \$ X1996: int  397 42 43 0 28 512 35 12 71 74 ...

##  \$ X1997: int  387 44 44 25 23 363 36 11 67 75 ...

##  \$ X1998: int  374 43 45 12 24 414 36 11 63 74 ...

##  \$ X1999: int  373 42 46 8 22 384 36 9 58 86 ...

##  \$ X2000: int  346 40 48 8 20 530 35 8 52 94 ...

##  \$ X2001: int  326 34 49 6 20 335 35 9 51 99 ...

##  \$ X2002: int  304 32 50 5 21 307 35 7 42 97 ...

##  \$ X2003: int  308 32 51 6 18 281 35 9 41 91 ...

##  \$ X2004: int  283 29 52 9 19 318 35 8 39 85 ...

##  \$ X2005: int  267 29 53 11 18 331 34 8 39 79 ...

##  \$ X2006: int  251 26 55 9 17 302 34 9 37 79 ...

##  \$ X2007: int  238 22 56 5 19 294 34 9 35 81 ...

Everything looks fine now. But still our dataset is a bit tricky. If we have a
look at what we got into the data frame with head

##             X1990 X1991 X1992 X1993 X1994 X1995 X1996 X1997 X1998 X1999

## Afghanistan   436   429   422   415   407   397   397   387   374   373

## Albania        42    40    41    42    42    43    42    44    43    42

## Algeria        45    44    44    43    43    42    43    44    45    46

##             X2000 X2001 X2002 X2003 X2004 X2005 X2006 X2007

## Afghanistan   346   326   304   308   283   267   251   238

## Albania        40    34    32    32    29    29    26    22

## Algeria        48    49    50    51    52    53    55    56

and nrow and ncol

nrow(existing_df)

## [1] 207
ncol(existing_df)

## [1] 18

we see that we have a data frame with 207 observations, one for each country, and 19 variables or features, one for each year. This doesn’t seem the most natural shape for this dataset. It is very unlikely that we will add new countries (observations or rows in this case) to the dataset, while is quite possible to add additional years (variables or columns in this case). If we keep it like it is, we will end up with a dataset that grows in features and not in observations, and that seems counterintuitive (and unpractical depending of the analysis we will want to do).

We won’t need to do this preprocessing all the time, but there we go. Thankfully, R as a function t() similar to the method T in Pandas, that allows us to transpose a data.frame variable. The result is given as a matrix, so we need to convert it to a data frame again by using as.data.frame.

# we will save the "trasposed" original verison for later use if needed
existing_df_t <- existing_df
existing_df <- as.data.frame(t(existing_df))

##       Afghanistan Albania Algeria American Samoa Andorra Angola Anguilla

## X1990         436      42      45             42      39    514       38

## X1991         429      40      44             14      37    514       38

## X1992         422      41      44              4      35    513       37

##       Antigua and Barbuda Argentina Armenia Australia Austria Azerbaijan

## X1990                  16        96      52         7      18         58

## X1991                  15        91      49         7      17         55

## X1992                  15        86      51         7      16         57

## X1990      54     120        639        8      62      16     65   140

## X1991      53     113        623        8      54      15     64   138

## X1992      52     108        608        7      59      15     62   135

##       Bermuda Bhutan Bolivia Bosnia and Herzegovina Botswana Brazil

## X1990      10    924     377                    160      344    124

## X1991      10    862     362                    156      355    119

## X1992       9    804     347                    154      351    114

##       British Virgin Islands Brunei Darussalam Bulgaria Burkina Faso

## X1990                     32                91       43          179

## X1991                     30                91       48          196

## X1992                     28                91       54          208

##       Burundi Cambodia Cameroon Canada Cape Verde Cayman Islands

## X1990     288      928      188      7        449             10

## X1991     302      905      199      7        438             10

## X1992     292      881      200      7        428              9

##       Central African Republic Chad Chile China Colombia Comoros

## X1990                      318  251    45   327       88     188

## X1991                      336  272    41   321       85     177

## X1992                      342  282    38   315       82     167

##       Congo, Rep. Cook Islands Costa Rica Croatia Cuba Cyprus

## X1990         209            0         30     126   32     14

## X1991         222           10         28     123   29     13

## X1992         231           57         27     121   26     13

##       Czech Republic Cote d'Ivoire Korea, Dem. Rep. Congo, Dem. Rep.

## X1990             22           292              841              275

## X1991             22           304              828              306

## X1992             22           306              815              327

##       Denmark Djibouti Dominica Dominican Republic Ecuador Egypt

## X1990      12    1,485       24                183     282    48

## X1991      12    1,477       24                173     271    47

## X1992      11    1,463       24                164     259    47

##       El Salvador Equatorial Guinea Eritrea Estonia Ethiopia Fiji Finland

## X1990         133               169     245      50      312   68      14

## X1991         126               181     245      50      337   65      12

## X1992         119               187     242      56      351   62      11

##       France French Polynesia Gabon Gambia Georgia Germany Ghana Greece

## X1990     21               67   359    350      51      15   533     30

## X1991     20               55   340    350      48      15   519     29

## X1992     19               91   325    349      50      14   502     27

##       Grenada Guam Guatemala Guinea Guinea-Bissau Guyana Haiti Honduras

## X1990       7  103       113    241           404     39   479      141

## X1991       7  101       111    248           403     43   464      133

## X1992       7   96       108    255           402     34   453      128

##       Hungary Iceland India Indonesia Iran Iraq Ireland Israel Italy

## X1990      67       5   586       443   50   88      19     11    11

## X1991      68       4   577       430   51   88      18     10    10

## X1992      70       4   566       417   56   88      18     10    10

##       Jamaica Japan Jordan Kazakhstan Kenya Kiribati Kuwait Kyrgyzstan

## X1990      10    62     19         95   125    1,026     89         90

## X1991      10    60     18         87   120    1,006     84         93

## X1992      10    58     17         85   134      986     80         93

##       Laos Latvia Lebanon Lesotho Liberia Libyan Arab Jamahiriya Lithuania

## X1990  428     56      64     225     476                     46        64

## X1991  424     57      64     231     473                     45        66

## X1992  420     59      63     229     469                     45        71

##       Luxembourg Madagascar Malawi Malaysia Maldives Mali Malta Mauritania

## X1990         19        367    380      159      143  640    10        585

## X1991         18        368    376      158      130  631     9        587

## X1992         17        369    365      156      118  621     9        590

##       Mauritius Mexico Micronesia, Fed. Sts. Monaco Mongolia Montserrat

## X1990        53    101                   263      3      477         14

## X1991        51     93                   253      3      477         14

## X1992        50     86                   244      3      477         14

##       Morocco Mozambique Myanmar Namibia Nauru Nepal Netherlands

## X1990     134        287     411     650   170   629          11

## X1991     130        313     400     685   285   607          10

## X1992     127        328     389     687   280   585          10

##       Netherlands Antilles New Caledonia New Zealand Nicaragua Niger

## X1990                   28           112          10       145   317

## X1991                   27           107          10       137   318

## X1992                   25           104           9       129   319

##       Nigeria Niue Northern Mariana Islands Norway Oman Pakistan Palau

## X1990     282  118                      142      8   40      430    96

## X1991     307  115                      201      8   36      428    66

## X1992     321  113                      301      8   29      427    43

##       Panama Papua New Guinea Paraguay Peru Philippines Poland Portugal

## X1990     74              498       95  394         799     88       51

## X1991     73              498       93  368         783     87       49

## X1992     71              497       92  343         766     86       47

##       Puerto Rico Qatar Korea, Rep. Moldova Romania Russian Federation

## X1990          17    71         223     105     118                 69

## X1991          15    69         196      99     125                 64

## X1992          17    69         174     103     134                 70

##       Rwanda Saint Kitts and Nevis Saint Lucia

## X1990    190                    17          26

## X1991    211                    17          26

## X1992    226                    16          25

##       Saint Vincent and the Grenadines Samoa San Marino

## X1990                               45    36          9

## X1991                               45    35          9

## X1992                               44    34          8

##       Sao Tome and Principe Saudi Arabia Senegal Seychelles Sierra Leone

## X1990                   346           68     380        113          465

## X1991                   335           60     379        110          479

## X1992                   325           59     379        106          492

##       Singapore Slovakia Slovenia Solomon Islands Somalia South Africa

## X1990        52       55       66             625     597          769

## X1991        52       56       62             593     587          726

## X1992        53       59       59             563     577          676

##       Spain Sri Lanka Sudan Suriname Swaziland Sweden Switzerland

## X1990    44       109   409      109       629      5          14

## X1991    42       106   404      100       590      5          13

## X1992    40       104   402       79       527      6          12

##       Syrian Arab Republic Tajikistan Thailand Macedonia, FYR Timor-Leste

## X1990                   94        193      336             92         706

## X1991                   89        162      319             90         694

## X1992                   84        112      307             89         681

##       Togo Tokelau Tonga Trinidad and Tobago Tunisia Turkey Turkmenistan

## X1990  702     139    45                  17      49     83          105

## X1991  687     140    44                  17      46     79           99

## X1992  668     143    43                  17      49     77          101

##       Turks and Caicos Islands Tuvalu Uganda Ukraine United Arab Emirates

## X1990                       42    593    206      67                   47

## X1991                       40    573    313      64                   44

## X1992                       37    554    342      67                   42

##       United Kingdom Tanzania Virgin Islands (U.S.)

## X1990              9      215                    30

## X1991              9      228                    28

## X1992             10      240                    27

##       United States of America Uruguay Uzbekistan Vanuatu Venezuela

## X1990                        7      35        114     278        46

## X1991                        7      34        105     268        45

## X1992                        7      33        102     259        44

##       Viet Nam Wallis et Futuna West Bank and Gaza Yemen Zambia Zimbabwe

## X1990      365              126                 55   265    436      409

## X1991      361              352                 54   261    456      417

## X1992      358               64                 54   263    494      415

Row names are sort of what in Pandas we get when we use the attribute .index in a data frame.

rownames(existing_df)

##  [1] "X1990" "X1991" "X1992" "X1993" "X1994" "X1995" "X1996" "X1997"

##  [9] "X1998" "X1999" "X2000" "X2001" "X2002" "X2003" "X2004" "X2005"

## [17] "X2006" "X2007"

In our data frame we see we have weird names for them. Every year is prefixed with an X. This is so because they started as column names. From the definition of a data.frame in R, we know that each column is a vector with a variable name. A name in R cannot start with a digit, so R automatically prefixes numbers with the letter X. Right know we will leave it like it is since it doesn’t really stop us from doing our analysis.

In the case of column names, they pretty much correspond to Pandas .columns attribute in a data frame.

colnames(existing_df)

##   [1] "Afghanistan"                      "Albania"

##   [3] "Algeria"                          "American Samoa"

##   [5] "Andorra"                          "Angola"

##   [7] "Anguilla"                         "Antigua and Barbuda"

##   [9] "Argentina"                        "Armenia"

##  [11] "Australia"                        "Austria"

##  [13] "Azerbaijan"                       "Bahamas"

##  [19] "Belgium"                          "Belize"

##  [21] "Benin"                            "Bermuda"

##  [23] "Bhutan"                           "Bolivia"

##  [25] "Bosnia and Herzegovina"           "Botswana"

##  [27] "Brazil"                           "British Virgin Islands"

##  [29] "Brunei Darussalam"                "Bulgaria"

##  [31] "Burkina Faso"                     "Burundi"

##  [33] "Cambodia"                         "Cameroon"

##  [37] "Cayman Islands"                   "Central African Republic"

##  [41] "China"                            "Colombia"

##  [43] "Comoros"                          "Congo, Rep."

##  [45] "Cook Islands"                     "Costa Rica"

##  [47] "Croatia"                          "Cuba"

##  [49] "Cyprus"                           "Czech Republic"

##  [51] "Cote d'Ivoire"                    "Korea, Dem. Rep."

##  [53] "Congo, Dem. Rep."                 "Denmark"

##  [55] "Djibouti"                         "Dominica"

##  [61] "Equatorial Guinea"                "Eritrea"

##  [63] "Estonia"                          "Ethiopia"

##  [65] "Fiji"                             "Finland"

##  [67] "France"                           "French Polynesia"

##  [69] "Gabon"                            "Gambia"

##  [71] "Georgia"                          "Germany"

##  [73] "Ghana"                            "Greece"

##  [77] "Guatemala"                        "Guinea"

##  [79] "Guinea-Bissau"                    "Guyana"

##  [81] "Haiti"                            "Honduras"

##  [83] "Hungary"                          "Iceland"

##  [85] "India"                            "Indonesia"

##  [87] "Iran"                             "Iraq"

##  [89] "Ireland"                          "Israel"

##  [91] "Italy"                            "Jamaica"

##  [93] "Japan"                            "Jordan"

##  [95] "Kazakhstan"                       "Kenya"

##  [97] "Kiribati"                         "Kuwait"

##  [99] "Kyrgyzstan"                       "Laos"

## [101] "Latvia"                           "Lebanon"

## [103] "Lesotho"                          "Liberia"

## [105] "Libyan Arab Jamahiriya"           "Lithuania"

## [109] "Malawi"                           "Malaysia"

## [111] "Maldives"                         "Mali"

## [113] "Malta"                            "Mauritania"

## [115] "Mauritius"                        "Mexico"

## [117] "Micronesia, Fed. Sts."            "Monaco"

## [119] "Mongolia"                         "Montserrat"

## [121] "Morocco"                          "Mozambique"

## [123] "Myanmar"                          "Namibia"

## [125] "Nauru"                            "Nepal"

## [127] "Netherlands"                      "Netherlands Antilles"

## [129] "New Caledonia"                    "New Zealand"

## [131] "Nicaragua"                        "Niger"

## [133] "Nigeria"                          "Niue"

## [135] "Northern Mariana Islands"         "Norway"

## [137] "Oman"                             "Pakistan"

## [139] "Palau"                            "Panama"

## [141] "Papua New Guinea"                 "Paraguay"

## [143] "Peru"                             "Philippines"

## [145] "Poland"                           "Portugal"

## [147] "Puerto Rico"                      "Qatar"

## [149] "Korea, Rep."                      "Moldova"

## [151] "Romania"                          "Russian Federation"

## [153] "Rwanda"                           "Saint Kitts and Nevis"

## [155] "Saint Lucia"                      "Saint Vincent and the Grenadines"

## [157] "Samoa"                            "San Marino"

## [159] "Sao Tome and Principe"            "Saudi Arabia"

## [161] "Senegal"                          "Seychelles"

## [163] "Sierra Leone"                     "Singapore"

## [165] "Slovakia"                         "Slovenia"

## [167] "Solomon Islands"                  "Somalia"

## [169] "South Africa"                     "Spain"

## [171] "Sri Lanka"                        "Sudan"

## [173] "Suriname"                         "Swaziland"

## [175] "Sweden"                           "Switzerland"

## [177] "Syrian Arab Republic"             "Tajikistan"

## [179] "Thailand"                         "Macedonia, FYR"

## [181] "Timor-Leste"                      "Togo"

## [183] "Tokelau"                          "Tonga"

## [185] "Trinidad and Tobago"              "Tunisia"

## [187] "Turkey"                           "Turkmenistan"

## [189] "Turks and Caicos Islands"         "Tuvalu"

## [191] "Uganda"                           "Ukraine"

## [193] "United Arab Emirates"             "United Kingdom"

## [195] "Tanzania"                         "Virgin Islands (U.S.)"

## [197] "United States of America"         "Uruguay"

## [199] "Uzbekistan"                       "Vanuatu"

## [201] "Venezuela"                        "Viet Nam"

## [203] "Wallis et Futuna"                 "West Bank and Gaza"

## [205] "Yemen"                            "Zambia"

## [207] "Zimbabwe"

These two functions show a common idiom in R, where we use the same function to get a value and to assign it. For example, if we want to change row names we will do something like:

colnames(existing_df) <- new_col_names

But as we said we will leave them as they are by now.

## Data Indexing

### Python

There is a whole section devoted to indexing and selecting data in DataFrames in the official documentation. Let’s apply them to our Tuberculosis cases dataframe.

We can access each data frame Series object by using its column name, as with a Python dictionary. In our case we can access each country series by its name.

existing_df['United Kingdom']
year
1990     9
1991     9
1992    10
1993    10
1994     9
1995     9
1996     9
1997     9
1998     9
1999     9
2000     9
2001     9
2002     9
2003    10
2004    10
2005    11
2006    11
2007    12
Name: United Kingdom, dtype: int64

Or just using the key value as an attribute.

existing_df.Spain
year
1990    44
1991    42
1992    40
1993    37
1994    35
1995    34
1996    33
1997    30
1998    30
1999    28
2000    27
2001    26
2002    26
2003    25
2004    24
2005    24
2006    24
2007    23
Name: Spain, dtype: int64

Or we can access multiple series passing their column names as a Python list.

existing_df[['Spain', 'United Kingdom']]
ountry Spain United Kingdom
year
1990 44 9
1991 42 9
1992 40 10
1993 37 10
1994 35 9
1995 34 9
1996 33 9
1997 30 9
1998 30 9
1999 28 9
2000 27 9
2001 26 9
2002 26 9
2003 25 10
2004 24 10
2005 24 11
2006 24 11
2007 23 12

We can also access individual cells as follows.

existing_df.Spain['1990']
44

Or using any Python list indexing for slicing the series.

existing_df[['Spain', 'United Kingdom']][0:5]
country Spain United Kingdom
year
1990 44 9
1991 42 9
1992 40 10
1993 37 10
1994 35 9

With the whole DataFrame, slicing inside of [] slices the rows. This is provided largely as a convenience since it is such a common operation.

existing_df[0:5]
country Afghanistan Albania Algeria American Samoa Andorra Angola Anguilla Antigua and Barbuda Argentina Armenia Uruguay Uzbekistan Vanuatu Venezuela Viet Nam Wallis et Futuna West Bank and Gaza Yemen Zambia Zimbabwe
year
1990 436 42 45 42 39 514 38 16 96 52 35 114 278 46 365 126 55 265 436 409
1991 429 40 44 14 37 514 38 15 91 49 34 105 268 45 361 352 54 261 456 417
1992 422 41 44 4 35 513 37 15 86 51 33 102 259 44 358 64 54 263 494 415
1993 415 42 43 18 33 512 37 14 82 55 32 118 250 43 354 174 52 253 526 419
1994 407 42 43 17 32 510 36 13 78 60 31 116 242 42 350 172 52 250 556 426

######5 rows × 207 columns

### Indexing in production Python code

As stated in the official documentation, the Python and NumPy indexing operators [] and attribute operator . provide quick and easy access to pandas data structures across a wide range of use cases. This makes interactive work intuitive, as there’s little new to learn if you already know how to deal with Python dictionaries and NumPy arrays. However, since the type of the data to be accessed isn’t known in advance, directly using standard operators has some optimization limits. For production code, it is recommended that you take advantage of the optimized pandas data access methods exposed in this section.

For example, the .iloc method can be used for positional index access.

existing_df.iloc[0:2]
country Afghanistan Albania Algeria American Samoa Andorra Angola Anguilla Antigua and Barbuda Argentina Armenia Uruguay Uzbekistan Vanuatu Venezuela Viet Nam Wallis et Futuna West Bank and Gaza Yemen Zambia Zimbabwe
year
1990 436 42 45 42 39 514 38 16 96 52 35 114 278 46 365 126 55 265 436 409
1991 429 40 44 14 37 514 38 15 91 49 34 105 268 45 361 352 54 261 456 417

######2 rows × 207 columns

While .loc is used for label access.

existing_df.loc['1992':'2005']
country Afghanistan Albania Algeria American Samoa Andorra Angola Anguilla Antigua and Barbuda Argentina Armenia Uruguay Uzbekistan Vanuatu Venezuela Viet Nam Wallis et Futuna West Bank and Gaza Yemen Zambia Zimbabwe
year
1992 422 41 44 4 35 513 37 15 86 51 33 102 259 44 358 64 54 263 494 415
1993 415 42 43 18 33 512 37 14 82 55 32 118 250 43 354 174 52 253 526 419
1994 407 42 43 17 32 510 36 13 78 60 31 116 242 42 350 172 52 250 556 426
1995 397 43 42 22 30 508 35 12 74 68 30 119 234 42 346 93 50 244 585 439
1996 397 42 43 0 28 512 35 12 71 74 28 111 226 41 312 123 49 233 602 453
1997 387 44 44 25 23 363 36 11 67 75 27 122 218 41 273 213 46 207 626 481
1998 374 43 45 12 24 414 36 11 63 74 28 129 211 40 261 107 44 194 634 392
1999 373 42 46 8 22 384 36 9 58 86 28 134 159 39 253 105 42 175 657 430
2000 346 40 48 8 20 530 35 8 52 94 27 139 143 39 248 103 40 164 658 479
2001 326 34 49 6 20 335 35 9 51 99 25 148 128 41 243 13 39 154 680 523
2002 304 32 50 5 21 307 35 7 42 97 27 144 149 41 235 275 37 149 517 571
2003 308 32 51 6 18 281 35 9 41 91 25 152 128 39 234 147 36 146 478 632
2004 283 29 52 9 19 318 35 8 39 85 23 149 118 38 226 63 35 138 468 652
2005 267 29 53 11 18 331 34 8 39 79 24 144 131 38 227 57 33 137 453 680

######14 rows × 207 columns

And we can combine that with series indexing by column.

existing_df.loc[['1992','1998','2005'],['Spain','United Kingdom']]
country Spain United Kingdom
1992 40 10
1998 30 9
2005 24 11

This last approach is the recommended when using Pandas data frames, specially when doing assignments (something we are not doing here). Otherwise, we might have assignment problems as described here.

### R

Similarly to what we do in Pandas (actually Pandas is inspired in R), we can
access a data.frame column by its position.

existing_df[,1]

## X1990 X1991 X1992 X1993 X1994 X1995 X1996 X1997 X1998 X1999 X2000 X2001

##   436   429   422   415   407   397   397   387   374   373   346   326

## X2002 X2003 X2004 X2005 X2006 X2007

##   304   308   283   267   251   238

## 17 Levels: 238 251 267 283 304 308 326 346 373 374 387 397 407 415 ... 436

The position-based indexing in R uses the first element for the row number and
the second one for the column one. If left blank, we are telling R to get all
the row/columns. In the previous example we retrieved all the rows for the first
column (Afghanistan) in the data.frame. And yes, R has a 1-based indexing
schema.

Like in Pandas, we can use column names to access columns (series in Pandas).
However R data.frame variables aren’t exactly object and we don’t use the .
operator but the \$ that allows accessing labels within a list.

existing_df\$Afghanistan

## X1990 X1991 X1992 X1993 X1994 X1995 X1996 X1997 X1998 X1999 X2000 X2001

##   436   429   422   415   407   397   397   387   374   373   346   326

## X2002 X2003 X2004 X2005 X2006 X2007

##   304   308   283   267   251   238

## 17 Levels: 238 251 267 283 304 308 326 346 373 374 387 397 407 415 ... 436

An finally, since a data.frame is a list of elements (its columns), we can access
columns as list elements using the list indexing operator [[]].

existing_df[[1]]

## X1990 X1991 X1992 X1993 X1994 X1995 X1996 X1997 X1998 X1999 X2000 X2001

##   436   429   422   415   407   397   397   387   374   373   346   326

## X2002 X2003 X2004 X2005 X2006 X2007

##   304   308   283   267   251   238

## 17 Levels: 238 251 267 283 304 308 326 346 373 374 387 397 407 415 ... 436

At this point you should have realised that in R there are multiple ways of doing
the same thing, and that this seems to happen more because of the language itself
than because somebody wanted to provide different ways of doing things. This strongly
contrasts with Python’s philosophy of having one clear way of doing things (the
Pythonic way).

For row indexing we have the positional approach.

existing_df[1,]

##       Afghanistan Albania Algeria American Samoa Andorra Angola Anguilla

## X1990         436      42      45             42      39    514       38

##       Antigua and Barbuda Argentina Armenia Australia Austria Azerbaijan

## X1990                  16        96      52         7      18         58

## X1990      54     120        639        8      62      16     65   140

##       Bermuda Bhutan Bolivia Bosnia and Herzegovina Botswana Brazil

## X1990      10    924     377                    160      344    124

##       British Virgin Islands Brunei Darussalam Bulgaria Burkina Faso

## X1990                     32                91       43          179

##       Burundi Cambodia Cameroon Canada Cape Verde Cayman Islands

## X1990     288      928      188      7        449             10

##       Central African Republic Chad Chile China Colombia Comoros

## X1990                      318  251    45   327       88     188

##       Congo, Rep. Cook Islands Costa Rica Croatia Cuba Cyprus

## X1990         209            0         30     126   32     14

##       Czech Republic Cote d'Ivoire Korea, Dem. Rep. Congo, Dem. Rep.

## X1990             22           292              841              275

##       Denmark Djibouti Dominica Dominican Republic Ecuador Egypt

## X1990      12    1,485       24                183     282    48

##       El Salvador Equatorial Guinea Eritrea Estonia Ethiopia Fiji Finland

## X1990         133               169     245      50      312   68      14

##       France French Polynesia Gabon Gambia Georgia Germany Ghana Greece

## X1990     21               67   359    350      51      15   533     30

##       Grenada Guam Guatemala Guinea Guinea-Bissau Guyana Haiti Honduras

## X1990       7  103       113    241           404     39   479      141

##       Hungary Iceland India Indonesia Iran Iraq Ireland Israel Italy

## X1990      67       5   586       443   50   88      19     11    11

##       Jamaica Japan Jordan Kazakhstan Kenya Kiribati Kuwait Kyrgyzstan

## X1990      10    62     19         95   125    1,026     89         90

##       Laos Latvia Lebanon Lesotho Liberia Libyan Arab Jamahiriya Lithuania

## X1990  428     56      64     225     476                     46        64

##       Luxembourg Madagascar Malawi Malaysia Maldives Mali Malta Mauritania

## X1990         19        367    380      159      143  640    10        585

##       Mauritius Mexico Micronesia, Fed. Sts. Monaco Mongolia Montserrat

## X1990        53    101                   263      3      477         14

##       Morocco Mozambique Myanmar Namibia Nauru Nepal Netherlands

## X1990     134        287     411     650   170   629          11

##       Netherlands Antilles New Caledonia New Zealand Nicaragua Niger

## X1990                   28           112          10       145   317

##       Nigeria Niue Northern Mariana Islands Norway Oman Pakistan Palau

## X1990     282  118                      142      8   40      430    96

##       Panama Papua New Guinea Paraguay Peru Philippines Poland Portugal

## X1990     74              498       95  394         799     88       51

##       Puerto Rico Qatar Korea, Rep. Moldova Romania Russian Federation

## X1990          17    71         223     105     118                 69

##       Rwanda Saint Kitts and Nevis Saint Lucia

## X1990    190                    17          26

##       Saint Vincent and the Grenadines Samoa San Marino

## X1990                               45    36          9

##       Sao Tome and Principe Saudi Arabia Senegal Seychelles Sierra Leone

## X1990                   346           68     380        113          465

##       Singapore Slovakia Slovenia Solomon Islands Somalia South Africa

## X1990        52       55       66             625     597          769

##       Spain Sri Lanka Sudan Suriname Swaziland Sweden Switzerland

## X1990    44       109   409      109       629      5          14

##       Syrian Arab Republic Tajikistan Thailand Macedonia, FYR Timor-Leste

## X1990                   94        193      336             92         706

##       Togo Tokelau Tonga Trinidad and Tobago Tunisia Turkey Turkmenistan

## X1990  702     139    45                  17      49     83          105

##       Turks and Caicos Islands Tuvalu Uganda Ukraine United Arab Emirates

## X1990                       42    593    206      67                   47

##       United Kingdom Tanzania Virgin Islands (U.S.)

## X1990              9      215                    30

##       United States of America Uruguay Uzbekistan Vanuatu Venezuela

## X1990                        7      35        114     278        46

##       Viet Nam Wallis et Futuna West Bank and Gaza Yemen Zambia Zimbabwe

## X1990      365              126                 55   265    436      409

There we retrieved data for every country in 1990. We can combine this with a
column number.

existing_df[1,1]

## X1990

##   436

## 17 Levels: 238 251 267 283 304 308 326 346 373 374 387 397 407 415 ... 436

Or its name.

existing_df\$Afghanistan[1]

## X1990

##   436

## 17 Levels: 238 251 267 283 304 308 326 346 373 374 387 397 407 415 ... 436

What did just do before? Basically we retrieved a column, that is a vector, and
accessed that vector first element. That way we got the value for Afghanistan for
the year 1990. We can do the same thing using the [[]] operator instead of the
list element label.

existing_df[[1]][1]

## X1990

##   436

## 17 Levels: 238 251 267 283 304 308 326 346 373 374 387 397 407 415 ... 436

We can also select multiple columns and/or rows by passing R vectors.

existing_df[c(3,9,16),c(170,194)]

##       Spain United Kingdom

## X1992    40             10

## X1998    30              9

## X2005    24             11

Finally, using names is also possible when using positional indexing.

existing_df["X1992","Spain"]

## X1992

##    40

## Levels:  25  26  27  28  30  33 23 24 34 35 37 40 42 44

That we can combine with vectors.

existing_df[c("X1992", "X1998", "X2005"), c("Spain", "United Kingdom")]

##       Spain United Kingdom

## X1992    40             10

## X1998    30              9

## X2005    24             11

##Next Steps

So enough about indexing. In the next part of the tutorial on data frames we will see how to perform more complex data accessing using selection. Additionally, we will explain how to apply functions to a data frame elements, and how to group them.

Remember that all the source code for the different parts of this series of tutorials and applications can be checked at GitHub. Feel free to get involved and share your progress with us!

Author
Jose A Dianes
5.0
Data Engineer & Computer Science PhD
With more than a decade of experience in software engineering, I have been involved in different aspects of distributed and enterprise systems applied to domains such as Bioinformatics, Ambient...
Hire the Author

Jessamyn Smith
5.0
Experienced Full-Stack Web Developer
I am an experienced full-stack software developer with over a decade in the field, primarily in back-end development for web applications. My...
Hire this Expert
Ray Phan
5.0
PhD, former university instructor and software engineer with 20 years of software development experience in MATLAB, Python, Java, C, and C++. Image Processing and Machine Learning are part of my all-balanced diet. Students get a 33% discount!
*** Please note that if your language doesn't appear in my tags in my headline above, I probably don't know it! *** - Ph.D. (2013) - Electrical &...
Hire this Expert

Or Become a Codementor!

Codementor is your live 1:1 expert mentor helping you in real time.