Mauhcs

Quirky Data Scientist with a Ph.D. in CS and 8 years in the industry

Train Neural Networks Faster with Google’s TPU from your Laptop.

Published Aug 18, 2020. Last updated Feb 13, 2021.

You know the drill: you have that sweet deep neural network architecture, but it is taking forever to train. At this point you start browsing Amazon for the cheapest GPU you can find, even though you are not sure how to plug it into your machine, because anything is better than the hours your model takes to train on a single set of hyper-parameters. Well, before you spend hundreds of dollars on hardware, you should consider Google’s Tensor Processing Units (TPUs). TPUs are hardware components meant to speed up the training and prediction of machine learning models, so researchers and engineers can focus on building solutions for their favorite humans instead of going crazy over life-long epochs.

If you are new to TPUs:

If you are looking to play around with TPUs before getting serious, have a look at this tutorial from Google (https://www.tensorflow.org/guide/tpu). It will teach you how to run on TPUs from Colab (Jupyter notebooks on Google Cloud). That is a neat feature and pretty easy to run, so it is a good start. But what if you want to develop your model locally, maybe do a small sample run on your machine, and only then send data to the cloud TPUs for training? It turns out you need to jump through some hoops for that. But I can show you a straight line through the wild land of Google Cloud documentation to achieve just that.

If you are ready to rock with TPUs:

Grab your favorite machine and buckle up. Here is a short summary of what is coming up:

Learn how to instantiate your own Cloud TPU in Google Cloud.
Create your own Cloud Server and Google Storage bucket to store training data.
Get service account keys to programmatically connect your training script to Google.
SSH Tunneling. Connect your local machine to the cloud TPU in one line.
Dive into a simple git repo example for training your first model on the TPUs.

0. Setup

Open a terminal window and create a working directory for this tutorial. Also, leave the terminal open as we will be setting up the environment as we go through the tutorial.

# Make a folder for this tutorial
mkdir -p ~/tpu-tutorial

1. Get your own Cloud TPU

To get your own TPU, go to the Google Cloud Platform (GCP) console at https://console.cloud.google.com/compute/tpus and click “Create TPU Node” at the top of the page (or in the middle of the page if it is your first TPU).

Then choose a name for your node. The name is not important, so any name is fine, or simply leave it as node-1.

The following are important steps:

Select us-central1-c as the zone for your TPU node. At the time of writing, only a few data centers have TPUs: Iowa (us-central1-{a,b,c}), Netherlands (europe-west4-a) and Taiwan (asia-east1-c). Iowa TPUs are the cheapest (USD$1.35), so for this tutorial let’s stick to them and choose us-central1-c.
Check the Preemptibility option. This allows Google to shut down your TPU if their system needs it. That sounds bad, and it is in a production environment, but this option makes the TPU node much cheaper, and the added risk does not really matter for this tutorial. I was just fine with this option checked.

Finally, choose your TPU type. The type v2-8 works just fine for us here. Click “Create.”

The figure below shows what the setup looks like:

TPU node setup. Get yours in the us-central1 as they are cheaper.

IMPORTANT: TURN OFF YOUR TPU when you are not using it to avoid spending money unnecessarily. Google makes it very easy to get a new TPU up and running, so do not hesitate to destroy it if you take a break from this tutorial. Your pocket thanks you.

After clicking create you will be redirected to the dashboard with your TPU nodes. Once the node is initialized, a green checkmark will appear to the left of the node’s name and an IP address will appear under the column “Internal IP” (if the column “Internal IP” does not show up, click on “columns” and select “Internal IP”). Copy the IP and run the following command in the terminal:

# Replace TPU.INTERNAL.IP with the TPU node's IP
export TPU="TPU.INTERNAL.IP"

Note that the TPU’s IP is internal only (as the name says). This means that you cannot connect to it from your machine directly. We need a jump server between your computer and the TPU for that, but that is very easy to set up.
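One quick way to see why the internal IP is unreachable from your machine is to check that it falls in a private address range. The sketch below uses Python’s standard `ipaddress` module. The function name and both addresses are made up for illustration and are not values from this tutorial.

```python
import ipaddress


def is_internal_ip(address: str) -> bool:
    """Return True if the address belongs to a private (non-routable) range."""
    return ipaddress.ip_address(address).is_private


# Hypothetical addresses, for illustration only.
example_tpu_ip = "10.240.1.2"    # shaped like a typical internal address
example_public_ip = "8.8.8.8"    # a well-known public address

print(is_internal_ip(example_tpu_ip))
print(is_internal_ip(example_public_ip))
```

If the check returns True for your TPU’s IP, packets sent to it from your home network have nowhere to go, which is the reason for the jump server.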

2. Create a Virtual Machine Jump Server

Go to GCP instances and create an instance. There are three important steps to be able to connect your local machine to the TPU properly.

Create the instance in the same region as in the previous step (us-central1-c if you are following the default in this tutorial).
In the machine configuration you can choose the cheapest server (f1-micro). That is perfect for this tutorial.

See the image for reference:

Choose the same region as in the previous section. Here, the lowest server configuration is enough. In my case, I was able to get the instance for free from GCP.

Add your SSH public key to the server in order to connect to it from your local machine. To do that, at the bottom of the page, click where it reads “Management, security, disks, networking, sole tenancy” to expand the form. Choose “Security” and paste your SSH public key into the “Enter public SSH key” text box.

To see your ssh public-key run the following command:

# Display ssh public-key
cat ~/.ssh/id_rsa.pub

The output should look something like this:

# ssh public-key example
ssh-rsa AAAAB3NzaC1yc2EAAAABIwAAAQEAklOUpkDHrfHY17SbrmTIpNLTGK9Tjom/BWDSU
GPl+nafzlHDTYW7hdI4yZ5ew18JH4JW9jbhUFrviQzM7xlELEVf4h9lFX5QVkbPppSwg0cda3
Pbv7kOdJ/MTyBlWXFCR+HAo3FXRitBqxiX1nKhXpHAZsMciLq8V6RjsNAQwdsdMFvSlVK/7XA
t3FaoJoAsncM1Q9x5+3V0Ww68/eIFmb1zuUFljQJKprrX88XypNDvjYNby6vw/Pb0rwert/En
mZ+AW4OZPnTPI89ZPmVMLuayrD2cE86Z/il8b+gw3r3+1nKatmIkjn2so1d01QraTlMqVSsbx
NrRFi9wrf+M7Q== USERNAME@email.com

Remember your USERNAME as it will be important to connect your local machine with the instance.
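The USERNAME can also be read out of the key programmatically. This is a small sketch, assuming the key comment has the form USERNAME@something as in the example above. The function name and the sample key string are hypothetical.

```python
def username_from_public_key(key_text: str) -> str:
    """Extract the login name from the comment field of an OpenSSH public key.

    An OpenSSH public key has the shape 'type base64-data comment'.
    Assumption: the comment looks like USERNAME@something.
    """
    parts = key_text.split()
    if len(parts) < 3:
        raise ValueError("public key has no comment field")
    comment = parts[-1]
    # Keep only the part before the first '@'.
    return comment.split("@", 1)[0]


# Hypothetical, shortened key used only to show the parsing.
sample_key = "ssh-rsa AAAAB3NzaC1yc2EAAAABIwAAAQEAklOUpkDHrfHY17 alice@email.com"
print(username_from_public_key(sample_key))
```

To run it on your real key, pass in the contents of `~/.ssh/id_rsa.pub`.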

If you do not have an SSH public key, the previous command will have printed an error. In that case, run the following command to create your own SSH key:

# Generate an SSH key pair
ssh-keygen -t rsa -b 4096 -C "USERNAME@email.com"

Then go ahead and copy and paste your SSH public key into the form and press “Create.” See image below for reference.

Paste your ssh public-key in the “Enter public SSH key” field, then click “Create”

After clicking “Create” you will be redirected to the dashboard with your instances. When your new instance is running, you will see a green check mark to the left of its name. It is time to get its IP. See the image for reference:

Get your instance’s “External IP”

As with the TPU, you will need the instance’s IP, but this time the “External IP”. Also, recall the USERNAME from your SSH public key, and run the following command in the previously opened terminal:

# Save the jump server's IP and your username in it:
export JUMP="INSTANCE.EXTERNAL.IP"; export USER=USERNAME

With the TPU initialized and the Cloud Server instantiated, we are ALMOST ready to start training our models. The missing part is data, and it MUST be stored in Google Cloud Storage. So let’s create our own bucket in the next section.

3. Create Google Storage and Get Permissions

Go to Google Storage (gs) and click on “Create Bucket.” Follow these steps to make sure your TPU can see your bucket.

Name it “dnn-bucket”.
Choose Region as the bucket’s Location type.
Choose its location as us-central1 (Iowa), or make sure it is the same region as your TPU.
Choose Storage size to be 1GB.
Leave all the other options and fields as they are.
Click Create.

The next step is to create a service account key so your script can connect to the bucket. To do that, open the link below and follow the instructions:
Google Service Account Key creation: https://cloud.google.com/iam/docs/creating-managing-service-account-keys#creating_service_account_keys

After creating your key, save the file as “~/tpu-tutorial/tensorflow-tutorial-cred.json”.
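Before moving on, it can help to confirm that the downloaded file really is a service account key. The sketch below checks a few fields that such JSON key files contain and shows the standard `GOOGLE_APPLICATION_CREDENTIALS` environment variable that Google client libraries read. The function names are hypothetical, and whether the scripts used later in this tutorial rely on that variable is an assumption, not something taken from the repo.

```python
import json
import os
from pathlib import Path

# Fields a service account key file is expected to contain.
REQUIRED_FIELDS = ("type", "project_id", "private_key", "client_email")


def validate_service_account_key(key: dict) -> None:
    """Raise ValueError if the parsed key does not look like a service account key."""
    missing = [field for field in REQUIRED_FIELDS if field not in key]
    if missing:
        raise ValueError(f"key file is missing fields: {missing}")
    if key["type"] != "service_account":
        raise ValueError(f"unexpected key type: {key['type']!r}")


def load_service_account_key(path: str) -> dict:
    """Read the JSON key file at `path` and validate it."""
    with Path(path).expanduser().open() as handle:
        key = json.load(handle)
    validate_service_account_key(key)
    return key


def use_key_as_default_credentials(path: str) -> None:
    """Point Google client libraries at the key file for this process."""
    os.environ["GOOGLE_APPLICATION_CREDENTIALS"] = str(Path(path).expanduser())
```

Calling `load_service_account_key("~/tpu-tutorial/tensorflow-tutorial-cred.json")` should return without raising if the file was saved correctly.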

Now, with the TPU and the Jump server instantiated, the Google Storage bucket created and your local machine having access to it, the only missing part is connecting everything. We will do that through SSH tunneling in the next section.

4. SSH Tunnel from Local Computer to Cloud TPU — Cloud Heisting Google

SSH tunneling lets the jump server in the cloud connect your local machine to the cloud TPU, so our deep neural net does not need to worry about network connectivity. From the training script’s point of view, the connection is made to the local machine.

To create the tunnel run the following commands in a new terminal:

# Export the TPU's internal IP if you have not done so yet
export TPU="TPU.INTERNAL.IP"

# Export the jump server's IP and username if you have not done so yet
export JUMP="JUMP.EXTERNAL.IP"; export USER=JUMP_USERNAME

# Make the SSH tunnel between your local port 2000 and the TPU:
ssh "$USER@$JUMP" -L "2000:$TPU:8470"

This command logs your terminal in to the jump server, and the tunnel stays open as long as you are connected to the server. To disconnect and close the tunnel, press Ctrl-D in the connected terminal.

Now we are ready for our last step: training our deep neural network on Google’s Cloud TPU. How much faster will it be than running on your local computer?
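To check from another terminal that the tunnel is listening, a plain TCP connection attempt to the local port is enough. This is a minimal sketch using only the standard library. The function name is hypothetical, and the default port 2000 matches the ssh command above. Note that it only confirms that something accepts connections on the local port, not that the TPU at the far end is healthy.

```python
import socket


def tunnel_is_open(host: str = "localhost", port: int = 2000, timeout: float = 2.0) -> bool:
    """Return True if something accepts TCP connections on host:port."""
    try:
        # create_connection raises OSError (e.g. connection refused) on failure.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

With the tunnel running, `tunnel_is_open()` should return True, and it should return False again after you press Ctrl-D in the tunnel terminal.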

5. Training in a TPU — Clone Git Repo and Installation

As a first try at training with the cloud TPU, I made an example git repo. To install it you need Python 3 installed on your computer. Then, clone the git repo as follows:

# Create the working directory if not created already and navigate to it:
mkdir -p ~/tpu-tutorial
cd ~/tpu-tutorial
git clone https://github.com/mauhcs/tpu-from-home.git
cd tpu-from-home

Then create a virtual environment, upgrade pip and install the Python requirements with:

python -m venv tpu-env
source tpu-env/bin/activate
pip install --upgrade pip
pip install -r requirements.txt

That is it, you are ready to train your own model.
If you want, you can use the convenience scripts I wrote. For example, here is how to train locally:

# Train locally:
time bash local_run.sh

Here is how to train in your TPU (remember that we got the file tensorflow-tutorial-cred.json from a previous step):

# Train in the TPU
# The file tensorflow-tutorial-cred.json is from a previous step
time bash tpu_run.sh ~/tpu-tutorial/tensorflow-tutorial-cred.json

If you want to dive into the details of how the model works, have a look at main.py. Let me know if you would like an article that dives deep into distributed training focused on TPUs.
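For orientation, this is roughly what connecting to a TPU through the tunnel looks like in TensorFlow 2. It is a generic sketch and not the code from main.py. It assumes TensorFlow 2.3 or newer and the tunnel on local port 2000 from section 4. The helper names `tpu_address` and `connect_to_tpu`, and `build_model` in the usage comment, are hypothetical.

```python
def tpu_address(host: str = "localhost", port: int = 2000) -> str:
    """Build the gRPC address of the TPU as seen through the SSH tunnel."""
    return f"grpc://{host}:{port}"


def connect_to_tpu(address: str):
    """Connect to the TPU at `address` and return a distribution strategy.

    Assumption: TensorFlow >= 2.3 is installed and the tunnel is open.
    """
    import tensorflow as tf  # imported lazily so tpu_address() works without TF

    resolver = tf.distribute.cluster_resolver.TPUClusterResolver(tpu=address)
    tf.config.experimental_connect_to_cluster(resolver)
    tf.tpu.experimental.initialize_tpu_system(resolver)
    return tf.distribute.TPUStrategy(resolver)


# Usage (needs a running TPU and an open tunnel):
#   strategy = connect_to_tpu(tpu_address())
#   with strategy.scope():
#       model = build_model()  # hypothetical function that creates a Keras model
```

The point of the sketch is that the script only ever talks to localhost, and the tunnel takes care of the rest.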

My Benchmark Results

These are my results running on my MacBook Pro as the local machine and on a TPU in Iowa (from Tokyo).
The model is a multi-layer convolutional network for MNIST visual recognition, trained for 50 epochs. See the model here.

# Local run
{'accuracy_top_1': 0.9893662929534912,
'eval_loss': 0.032137673969070114,
'loss': 0.033212198394125904,
'training_accuracy_top_1': 0.989628255367279}
real 34m12.642s
user 101m32.471s
sys 11m41.765s

# TPU training
{'accuracy_top_1': 0.9892578125,
'eval_loss': 0.031606151825851865,
'loss': 0.032124145309729825,
'training_accuracy_top_1': 0.989863932132721}
real 16m54.201s
user 0m20.993s
sys 0m4.911s
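The speedup can be worked out directly from the two `real` times above. This short sketch converts the values printed by `time` into seconds and divides them. The function name is made up for illustration.

```python
import re


def parse_real_time(text: str) -> float:
    """Convert a `time` value such as '34m12.642s' into seconds."""
    match = re.fullmatch(r"(\d+)m([\d.]+)s", text)
    if match is None:
        raise ValueError(f"unrecognized time format: {text!r}")
    minutes, seconds = match.groups()
    return int(minutes) * 60 + float(seconds)


local_seconds = parse_real_time("34m12.642s")  # local run
tpu_seconds = parse_real_time("16m54.201s")    # TPU run
speedup = local_seconds / tpu_seconds
print(f"speedup: {speedup:.2f}x")  # prints "speedup: 2.02x"
```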

6. Conclusion

Training a simple convolutional neural network on the MNIST data set in the Cloud TPU was about twice as fast as training locally (MacBook Pro 3.5 GHz Intel Core i7). This can be a good way to help your models do hyper-parameter search, or simply to train your more complex models. I had expected better performance from the TPUs, so I suspect that more tweaking of the model and data pipeline would improve it.

Cloud TPUs are also not cheap, so unless you are sure you can take advantage of their training speed, you might be better off training on your local machine.
The TPUs require a lot of integration with Google Cloud (storage buckets, cloud server instances, …). If any of those components lived on your local machine, the added network latency would slow down training and make the TPUs hard to justify. To see this, remove the option --no_callback from the file tpu_run.sh and watch how much latency it adds.

Ultimately, because of network latency, running cloud TPUs from your local machine should be considered a test run before deploying your training scripts to your cloud server.

Are you going to start using TPUs? Do you want a tutorial about how to write your own models for training? Let me know in the comments below what you think.

Also, TURN OFF YOUR TPU when you are not using it. You are welcome.

