PACKT

Stay Relevant!

Using R Data Structures

Published Dec 18, 2018

Learn how to use R data structures in this article by Dejan Sarka, an MCT and Microsoft Data Platform MVP, and an independent trainer and consultant who focuses on the development of database and business intelligence applications.

This article introduces the most important data structures in R. When you analyze the data, you analyze a dataset. A dataset looks like a SQL Server table: you can observe rows and columns.

However, this is not a table in the relational sense, as defined in the Relational Model, which SQL Server follows. The order of rows and columns is not defined in a table that conforms to the Relational Model. However, in R, positions of cells as crossings of rows and columns are known. This is more like a matrix in mathematics.

In the R dataset, rows are also called cases or observations. You analyze the cases using the values in their columns, also called variables or attributes of the cases.

The following data structures will be introduced in this article:
• Matrices and arrays
• Factors
• Lists
• Data frames

A matrix is a two-dimensional array. All values of a matrix must have the same mode – you can have integers only, or strings only, and so on. Use the matrix() function to generate a matrix from a vector. You can assign labels to columns and rows. When you create a matrix from a vector, you define whether you generate it by rows or by columns (default). You will quickly understand the difference if you execute the following demo code:

x = c(1, 2, 3, 4, 5, 6); x
Y = array(x, dim = c(2, 3)); Y
Z = matrix(x, 2, 3, byrow = F); Z
U = matrix(x, 2, 3, byrow = T); U

Please note that there are two commands in each line, separated by a semicolon. Also note that the array() function is used in the second line to generate the same matrix that the third line does, generating the matrix by columns. The last command generates a matrix by rows. Here is the output of the previous code:

[1] 1 2 3 4 5 6
     [,1] [,2] [,3]
[1,]    1    3    5
[2,]    2    4    6
     [,1] [,2] [,3]
[1,]    1    3    5
[2,]    2    4    6
     [,1] [,2] [,3]
[1,]    1    2    3
[2,]    4    5    6
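To illustrate the rule that all values of a matrix share one mode, here is a minimal sketch. The names M_num and M_mix are made up for this illustration and do not appear in the original demo code. When you mix numbers and strings, R silently converts everything to character:

```r
# Illustrative names only; not part of the original demo code
M_num <- matrix(c(1, 2, 3, 4), 2, 2)
M_mix <- matrix(c(1, "two", 3, "four"), 2, 2)

mode(M_num)   # "numeric"
mode(M_mix)   # "character": the numbers were coerced to strings
M_mix[1, 1]   # "1" (a string, not the number 1)
```

If a column of numbers unexpectedly behaves like text, checking mode() in this way is a quick first diagnostic.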

The following code shows how to define explicit names for rows and columns:

rnames = c("rrr1", "rrr2")
cnames = c("ccc1", "ccc2", "ccc3")
V = matrix(x, 2, 3, byrow = T,
dimnames = list(rnames, cnames))
V

Here is the result:

     ccc1 ccc2 ccc3
rrr1    1    2    3
rrr2    4    5    6

You can refer to the elements of a matrix by index positions or by names if you defined the names. The following code shows you some options. Note that you always refer to the row and column indexes; however, when you skip an explicit index value, this means all of the elements for that index. For example, the first line of code selects the first row, all columns. The third line selects all rows, but the second and third columns only:

U[1,]
U[1, c(2, 3)]
U[, c(2, 3)]
V[, c("ccc2", "ccc3")]

The results are as follows:

[1] 1 2 3
[1] 2 3
     [,1] [,2]
[1,]    2    3
[2,]    5    6
     ccc2 ccc3
rrr1    2    3
rrr2    5    6

You have already seen that a matrix can be generated with the array() function as well. An array is a generalized, multi-dimensional matrix. The array() function, similar to the matrix() function, accepts a vector of values as the first input parameter. The second parameter is a vector that defines the dimensions: the number of elements in this vector is the number of dimensions, and each value is the number of elements in that dimension. You can also pass a list of vectors for the names of the dimensions' elements. An array is filled by columns, then by rows, then by the third dimension (pages), and so on; in other words, the first index changes fastest. The following code shows how to create an array:

rnames = c("rrr1", "rrr2")
cnames = c("ccc1", "ccc2", "ccc3")
pnames = c("ppp1", "ppp2", "ppp3")
Y = array(1:18, dim = c(2, 3, 3),
dimnames = list(rnames, cnames, pnames))
Y
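The fill order and the two ways of addressing array elements can be checked with a short sketch. It rebuilds the same array as the code above and then reads single cells by position and by name; the specific cells picked here are only illustrative:

```r
rnames <- c("rrr1", "rrr2")
cnames <- c("ccc1", "ccc2", "ccc3")
pnames <- c("ppp1", "ppp2", "ppp3")
Y <- array(1:18, dim = c(2, 3, 3),
           dimnames = list(rnames, cnames, pnames))

dim(Y)                      # 2 3 3
Y[2, 1, 1]                  # 2: the row index changes fastest
Y[1, 2, 1]                  # 3: then the column index
Y[1, 1, 2]                  # 7: a new page starts after 2 * 3 = 6 values
Y["rrr2", "ccc3", "ppp3"]   # 18: the last cell, addressed by names
Y[, , "ppp2"]               # the whole second page, a 2 x 3 matrix
```

As with matrices, skipping an index means all of the elements for that dimension.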

A variable is discrete when every value is taken from a limited pool of possible values. A variable is continuous if the pool of possible values is not limited. Discrete values can be nominal, or categorical, where they represent labels only, without any specific order, or ordinal, where a logical ordering of the values makes sense.

In R, discrete variables are called factors. Levels of a factor are the distinct values that make the pool of possible values. You define factors from vectors of values with the factor() function. Many statistics, data mining, and machine learning algorithms treat discrete and continuous variables differently. Therefore, you need to define the factors properly in advance, before analyzing them. Here are some examples of defining the factors:

x = c("good", "moderate", "good", "bad", "bad", "good")
y = factor(x); y
z = factor(x, ordered = TRUE); z
w = factor(x, ordered = TRUE,
           levels = c("bad", "moderate", "good")); w

The previous code produces the following result:

[1] good     moderate good     bad      bad      good   
Levels: bad good moderate
[1] good     moderate good     bad      bad      good  
Levels: bad < good < moderate
[1] good     moderate good     bad      bad      good   
Levels: bad < moderate < good

You can see how R recognized distinct levels when you defined that the vector represents a factor. You can also see how R sorted the levels alphabetically by default when, in the third line of the previous code, you defined the factor as an ordinal variable. In the last command, you finally defined the correct order for the factor.
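Why the level order matters becomes visible once you compare or aggregate the values. The following is a minimal sketch that reuses the factor w from the previous code; the particular checks are only illustrative:

```r
x <- c("good", "moderate", "good", "bad", "bad", "good")
w <- factor(x, ordered = TRUE,
            levels = c("bad", "moderate", "good"))

levels(w)         # "bad" "moderate" "good"
as.integer(w)     # 3 2 3 1 1 3: the position of each value in the levels
w >= "moderate"   # TRUE TRUE TRUE FALSE FALSE TRUE
max(w)            # good: max() is meaningful only for ordered factors
table(w)          # counts per level, listed in the defined order
```

With the default alphabetical order, the same comparison would treat "good" as lower than "moderate", which is why you should set the levels explicitly for ordinal variables.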

Lists are complex data structures. Lists can include any other data structure, including another list. Typically, you do not analyze a list. However, you need to know about them because some functions return complex results packed in a list, and some functions accept lists as parameters.

You create lists with the list() function. As lists are also ordered, you can refer to the objects in a list by position. You need to enclose the index number in double brackets. If an element is a vector or a matrix, you can additionally use the index position(s) of the elements of this vector or matrix, enclosed in single brackets. Here is an example of using a list:

L = list(name1 = "ABC", name2 = "DEF",
no.children = 2, children.ages = c(3, 6))
L
L[[1]]
L[[4]]
L[[4]][2]

This example code for working with lists returns the following result:

$name1
[1] "ABC"
$name2
[1] "DEF"
$no.children
[1] 2
$children.ages
[1] 3 6
[1] "ABC"
[1] 3 6
[1] 6
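Because the elements of this list are named, you can also refer to them by name. The following sketch shows this, along with one base R function, strsplit(), that returns its result packed in a list. The choice of strsplit() and the variable name parts are only for illustration:

```r
L <- list(name1 = "ABC", name2 = "DEF",
          no.children = 2, children.ages = c(3, 6))

L$name1                   # "ABC"
L[["children.ages"]][2]   # 6
class(L["name1"])         # "list": single brackets return a sub-list
class(L[["name1"]])       # "character": double brackets return the element

# strsplit() returns a list with one character vector per input string
parts <- strsplit("3,6", ",")
is.list(parts)            # TRUE
parts[[1]][2]             # "6"
```

The difference between single and double brackets is a common source of confusion: single brackets keep the list wrapper, while double brackets extract the element itself.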

The most important data structure for any data science analysis is a data frame. Data frames are generalized two-dimensional matrices where each variable can have a different mode or a different data type. Of course, all the values of a variable must be of the same data type.

This is very similar to SQL Server tables. However, data frames behave like matrices as well, so you can use positional access to refer to the values of a data frame. You can create a data frame from multiple vectors of different modes with the data.frame() function. All of the vectors must have the same length, that is, the same number of elements. Here is an example:

CategoryId = c(1, 2, 3, 4)
CategoryName = c("Bikes", "Components", "Clothing", "Accessories")
ProductCategories = data.frame(CategoryId, CategoryName)
ProductCategories

The result is as follows:

  CategoryId CategoryName
1          1        Bikes
2          2   Components
3          3     Clothing
4          4  Accessories
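The matrix-like positional access mentioned above can be sketched on this small data frame. Note one assumption: in R versions before 4.0, data.frame() converts strings to factors by default, so the sketch wraps the value in as.character() to behave the same way in both cases:

```r
CategoryId <- c(1, 2, 3, 4)
CategoryName <- c("Bikes", "Components", "Clothing", "Accessories")
ProductCategories <- data.frame(CategoryId, CategoryName)

dim(ProductCategories)                     # 4 2: rows, then columns
as.character(ProductCategories[2, 2])      # "Components", by position
ProductCategories[, "CategoryName"]        # a whole column, by name
ProductCategories[ProductCategories$CategoryId > 2, ]   # rows 3 and 4
sapply(ProductCategories, class)           # one data type per column
```

The last line shows the point made earlier: each column has a single data type, but different columns can have different types.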

Most of the time, what you want is to store the data you read in a data frame. The dataset you analyze is your data frame. SQL Server is definitely not the only source of data you can use. You can read data from many other sources, including text files and Excel.
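As a minimal sketch of reading from a text file, the following code writes a tiny, made-up CSV file to a temporary location and reads it back with the base R read.csv() function. The file content and the names csv_file and PC are assumptions for this illustration only:

```r
# Create a small, made-up CSV file in a temporary location
csv_file <- tempfile(fileext = ".csv")
writeLines(c("CategoryId,CategoryName",
             "1,Bikes",
             "2,Components"), csv_file)

# read.csv() returns a data frame; the first line supplies the column names
PC <- read.csv(csv_file)
class(PC)    # "data.frame"
names(PC)    # "CategoryId" "CategoryName"
nrow(PC)     # 2

unlink(csv_file)   # remove the temporary file
```

Whatever the source is, the result ends up in the same structure, so the rest of your analysis code does not need to change.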

The following code retrieves the data from the dbo.vTargetMail view in the AdventureWorksDW2017 demo database into a data frame and then displays the first five columns for the first five rows of the data frame:

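library(RODBC)  # provides odbcConnect() and sqlQuery()
# "AWDW" is an ODBC data source name that must already exist on your machine
con <- odbcConnect("AWDW", uid = "RUser", pwd = "Pa$$w0rd")
TM <- sqlQuery(con,
               "SELECT CustomerKey,
                       EnglishEducation AS Education,
                       Age, NumberCarsOwned, BikeBuyer
                FROM dbo.vTargetMail;")
close(con)      # release the connection
TM[1:5, 1:5]    # first five rows, first five columns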

Here is the content of these first five rows of the data frame:

  CustomerKey Education Age NumberCarsOwned BikeBuyer
1       11000 Bachelors  31               0         1
2       11001 Bachelors  27               1         1
3       11002 Bachelors  32               1         1
4       11003 Bachelors  29               1         1
5       11004 Bachelors  23               4         1

You can also see the complete data frame in a separate window that opens after you execute the following command:

View(TM)

As you have just seen, you can retrieve the data from a data frame using positional indices, as in matrices. You can also use column names. However, the most common notation is using the data frame name and column name, separated by the dollar ($) sign, such as TM$Education. The following code uses the R table() function to produce a cross-tabulation of NumberCarsOwned and BikeBuyer:

table(TM$NumberCarsOwned, TM$BikeBuyer)

Here is the last result:

       0    1
  0 1551 2687
  1 2187 2696
  2 3868 2589
  3  951  694
  4  795  466
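If you do not have the demo database at hand, you can try the same notation on a small data frame. The following sketch uses made-up values, not the AdventureWorksDW2017 data, and the names TM_demo and ct are hypothetical:

```r
# Made-up data for illustration; not the values from dbo.vTargetMail
TM_demo <- data.frame(
  NumberCarsOwned = c(0, 1, 1, 2, 2, 2),
  BikeBuyer       = c(1, 1, 0, 0, 0, 1)
)

# Rows are values of NumberCarsOwned, columns are values of BikeBuyer
ct <- table(TM_demo$NumberCarsOwned, TM_demo$BikeBuyer)
ct
ct["2", "0"]         # 2: two cases with two cars and no bike purchase
prop.table(ct, 1)    # the same table as row proportions
```

The result of table() is itself a matrix-like object, so you can index it by position or by the row and column labels, as shown with ct["2", "0"].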

This data frame concludes this short introduction to R data structures.

If you found this article interesting, you can explore Data Science with SQL Server Quick Start Guide to get unique insights from your data by combining the power of SQL Server, R, and Python. Data Science with SQL Server Quick Start Guide is the ideal introduction to data science with Microsoft SQL Server and In-Database ML Services and covers all stages of a data science project.
