Spark & R: Downloading data and Starting with SparkR using Jupyter notebooks
In this tutorial we will use the 2013 American Community Survey dataset and start up a SparkR cluster using IPython/Jupyter notebooks. Both steps are prerequisites for any further work with Spark and R using notebooks. After downloading the files we will have them locally and we won't need to download them again. However, we will need to initialise the cluster in each notebook in order to use it.
In the next tutorial, we will use our local files to load them into SparkSQL data frames. This will open the door to exploratory data analysis and linear methods in future tutorials.
All the code for this series of Spark and R tutorials can be found in its own GitHub repository. Go there and make it yours.
A good way of using these notebooks is by first cloning the repo, and then starting Jupyter in pyspark mode. For example, if we have a standalone Spark installation running on our localhost with a maximum of 6 GB per node assigned to IPython:
MASTER="spark://127.0.0.1:7077" SPARK_EXECUTOR_MEMORY="6G" IPYTHON_OPTS="notebook --pylab inline" ~/spark-1.5.0-bin-hadoop2.6/bin/pyspark
Notice that the path to the pyspark command will depend on your specific installation. As a requirement, you need to have Spark installed on the same machine where you are going to start the IPython notebook server. For more Spark options, see the official Spark documentation. In general, the rule is to pass options as VARIABLE="value" pairs placed before the pyspark command, as in the line above.
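To make that pattern easier to read, the same launch command can be split across lines with backslash continuations (a sketch; the master URL, memory setting, and Spark path are the example values used above and will differ on your machine):

```shell
# General pattern: ENV_VAR="value" ... path/to/pyspark
# MASTER selects the cluster to connect to,
# SPARK_EXECUTOR_MEMORY caps memory per executor node,
# IPYTHON_OPTS configures the notebook front end (Spark 1.x style).
MASTER="spark://127.0.0.1:7077" \
SPARK_EXECUTOR_MEMORY="6G" \
IPYTHON_OPTS="notebook --pylab inline" \
~/spark-1.5.0-bin-hadoop2.6/bin/pyspark
```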
Every year, the US Census Bureau runs the American Community Survey. In this survey, approximately 3.5 million
households are asked detailed questions about who they are and how they live. Many topics are covered, including
ancestry, education, work, transportation, internet use, and residency. You can go directly to the American Community Survey site
in order to know more about the data and get files for different years, longer periods, individual states, etc.
In any case, the starting up notebook
will download the 2013 data locally for later use with the rest of the notebooks.
The idea of using this dataset came from its recent announcement on Kaggle
as part of their Kaggle scripts datasets. There you will be able to analyse the dataset on site, while sharing your results with other Kaggle
users. Highly recommended!
Getting and Reading Data
Let's first download the data files using R as follows.
population_data_files_url <- 'http://www2.census.gov/acs2013_1yr/pums/csv_pus.zip'
housing_data_files_url <- 'http://www2.census.gov/acs2013_1yr/pums/csv_hus.zip'

library(RCurl)
population_data_file <- getBinaryURL(population_data_files_url)
Loading required package: bitops
housing_data_file <- getBinaryURL(housing_data_files_url)
Now we want to persist the files, so we don't need to download them again in further notebooks.
population_data_file_path <- '/nfs/data/2013-acs/csv_pus.zip'
population_data_file_local <- file(population_data_file_path, open = "wb")
writeBin(population_data_file, population_data_file_local)
close(population_data_file_local)

housing_data_file_path <- '/nfs/data/2013-acs/csv_hus.zip'
housing_data_file_local <- file(housing_data_file_path, open = "wb")
writeBin(housing_data_file, housing_data_file_local)
close(housing_data_file_local)
From the previous step we got two zip files, csv_pus.zip and csv_hus.zip. We can now unzip them.
data_file_path <- '/nfs/data/2013-acs'
unzip(population_data_file_path, exdir=data_file_path)
unzip(housing_data_file_path, exdir=data_file_path)
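As a quick sanity check (a sketch, assuming the same /nfs/data/2013-acs destination folder used above), we can list what was extracted:

```r
# List the contents of the extraction folder; we expect the two PDFs
# and four CSV files described below
list.files('/nfs/data/2013-acs')
```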
Once you unzip the contents of both files you will see up to six files. Each zip contains three files: a PDF explanatory document and two data files in
CSV format. Each housing/population dataset is divided into two pieces, "a" and "b" (where "a" contains states 1 to 25 and "b" contains states 26 to 50). Therefore:
ss13husa.csv: housing data for states from 1 to 25.
ss13husb.csv: housing data for states from 26 to 50.
ss13pusa.csv: population data for states from 1 to 25.
ss13pusb.csv: population data for states from 26 to 50.
We will work with these four files in our notebooks.
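Before moving to Spark, we can peek at one of these files with plain R to get a feel for the columns (a sketch; the path follows the folder used above, and we read only a handful of rows because the full files are too large to load comfortably into base R):

```r
# Read just the first few rows of the housing "a" file to inspect its structure
housing_a_sample <- read.csv('/nfs/data/2013-acs/ss13husa.csv', nrows = 5)

dim(housing_a_sample)            # rows read and number of columns
names(housing_a_sample)[1:10]    # first few column names
```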
Starting up a SparkR cluster
In further notebooks, we will explore our data by loading them into SparkSQL data frames. But first we need to initialise a SparkR cluster and use it to initialise a SparkSQL context.
The first thing we need to do is to set up some environment variables and library paths as follows. Remember to replace the value assigned to
SPARK_HOME with your Spark home folder.
Sys.setenv(SPARK_HOME='/home/cluster/spark-1.5.0-bin-hadoop2.6')
.libPaths(c(file.path(Sys.getenv('SPARK_HOME'), 'R', 'lib'), .libPaths()))
Now we can load the SparkR library as follows.

library(SparkR)

Attaching package: 'SparkR'

The following object is masked from 'package:RCurl':

    base64

The following objects are masked from 'package:stats':

    filter, na.omit

The following objects are masked from 'package:base':

    intersect, rbind, sample, subset, summary, table, transform
And now we can initialise the Spark context as in the official documentation. In our case we are using a standalone Spark cluster with one master and seven workers. If you are running Spark in local mode, use just sparkR.init(master='local') instead.
sc <- sparkR.init(master='spark://169.254.206.2:7077')
Launching java with spark-submit command /home/cluster/spark-1.5.0-bin-hadoop2.6/bin/spark-submit sparkr-shell /tmp/RtmpPm0py4/backend_port29c24c141b34
And finally we can start the SparkSQL context as follows.
sqlContext <- sparkRSQL.init(sc)
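As a quick smoke test that the context works (a sketch using R's built-in faithful dataset, as in the SparkR documentation examples), we can create a small SparkSQL DataFrame:

```r
# Turn a local R data frame into a distributed SparkSQL DataFrame
df <- createDataFrame(sqlContext, faithful)

# Bring the first rows back to the driver to confirm everything is wired up
head(df)
```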
And that's it. Once we get this going, we are ready to load data into SparkSQL data frames. We will do this in the next notebook.
By using R on Spark, we will get the power of Spark clusters into our regular R workflow. In fact, as we will see in the following tutorials, the SparkR implementation tries to use the same function names we normally use with regular R data frames.
And finally, remember that all the code for this series of Spark and R tutorials can be found in its own GitHub repository. Go there and make it yours.