How I Built a Reverse Image Search with Machine Learning and TensorFlow: Part 1

Published May 26, 2017. Last updated Nov 22, 2017.

Getting Started

I’ve been making some TensorFlow examples for my website, fomoro.com, and one of the ones I created was a lightweight reverse image search. While it’s fresh in my head, I wanted to write up an end-to-end description of what it’s like to build a machine learning app, and more specifically, how to make your own reverse image search. For this demo, the work is ⅓ data munging/setup, ⅓ model development and ⅓ app development.

At a high level, I use TensorFlow to create an autoencoder, train it on a bunch of images, use the trained model to find related images, and display them with a Flask app.

In this first post, I’m going to go over my environment and project setup and do a little bit of scaffolding. Ready? Let’s get started.

Environment Setup

The first thing to do is set up the environment. I’m using Python 2.7 and TensorFlow installed with virtualenvwrapper.

# new project directory
$ mkdir imagesearch
$ cd imagesearch

# create a new virtual env
$ mkvirtualenv --system-site-packages imagesearch
$ pip install --upgrade tensorflow

If you’ve never used TensorFlow before, it’s a good idea to verify that it works. You may see some warnings when you run the session line in the code below. These are normal and just alert you that building TF from source may make it faster on your setup. Since building from source can take an hour or more depending on your hardware, I generally don't recommend it if you’re just getting started.

# verify tensorflow works by starting a python shell
$ python
>>> import tensorflow as tf
>>> hello = tf.constant('Hello, TensorFlow!')
>>> sess = tf.Session()
>>> print(sess.run(hello))
Hello, TensorFlow!

Dataset Prep

The next thing I needed was a dataset. I really love architecture and bridges, so I decided to use LSUN bridges. LSUN stands for “large-scale scene understanding,” and the whole dataset is pretty large. For the lightweight demo I’m making, using just the bridges was plenty of data, and I didn’t want to spend all day downloading stuff.

Generally, the more data and the more varied data you have, the more accurate your final model will be, but that also comes with the trade-off of taking longer to train. I wanted to get to the sweet spot of data that allowed me to train quickly while still getting pretty decent accuracy.

# clone the lsun repo
$ mkdir data
$ cd data
$ git clone https://github.com/fyu/lsun.git repo

# start the download 
$ mkdir src
$ python repo/download.py -o src -c bridge

This is a really good point to take a break or go to lunch. The download is around 15 GB and even on fast wi-fi it takes a while.

Unfortunately, the LSUN dataset, like most machine learning datasets, requires a couple of steps to get into a usable format. I had to clone the repo, and then use their tools to download a zipped database of the images I wanted. Once that was done, they still needed to be extracted. So I had to add another couple packages to my setup.

# add dependencies
$ pip install opencv-python lmdb numpy

# extract training set
$ mkdir training
$ python repo/data.py export src/bridge_train_lmdb --out_dir training

# extract validation set
$ mkdir validation
$ python repo/data.py export src/bridge_val_lmdb --out_dir validation

# go back to the project root and we're ready to start our model development
$ cd ../
$ mkdir imagesearch
$ cd imagesearch

The extracted images are jpgs with an unusual extension. You can verify that by installing pillow and getting the image type. Since I like my extensions and types to match, I changed the extension. While I was at it, I also flattened the directory structure. These last steps are totally optional, but they allowed me to simplify my eventual TensorFlow input code, and I already had a nice helper script. You can find it in prep.py in the repo.
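If you don't want to dig through prep.py, here is a minimal sketch of the same idea. It is not the author's script: the function names `is_jpeg` and `flatten_and_rename` are made up for illustration, and instead of Pillow it checks the JPEG signature bytes with the standard library only. It assumes the destination directory lives outside the source tree and that base filenames are unique across subdirectories.

```python
import os
import shutil

JPEG_MAGIC = b"\xff\xd8\xff"  # JPEG files start with these three bytes


def is_jpeg(path):
    """Return True if the file starts with the JPEG signature."""
    with open(path, "rb") as f:
        return f.read(3) == JPEG_MAGIC


def flatten_and_rename(src_dir, dst_dir):
    """Copy every JPEG under src_dir into dst_dir as <name>.jpg.

    Hypothetical helper: name collisions between subdirectories are
    not checked, so a later file would overwrite an earlier one.
    """
    if not os.path.isdir(dst_dir):
        os.makedirs(dst_dir)
    copied = 0
    for root, _dirs, files in os.walk(src_dir):
        for name in files:
            path = os.path.join(root, name)
            if not is_jpeg(path):
                continue  # skip anything that is not really a JPEG
            stem = os.path.splitext(name)[0]
            shutil.copyfile(path, os.path.join(dst_dir, stem + ".jpg"))
            copied += 1
    return copied
```

Copying rather than moving keeps the extracted originals around, which is handy if you want to rerun the step with different rules.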

Project Setup

Now we’re finally ready to set up the TensorFlow bits of the project. Before we go any further, let's talk about file organization. When working on a TensorFlow project, there’s a lot of back and forth between the code and the command line, so I like to structure my code as modules and err on the side of more files versus more functions. I also like to use the project name when running the module in the command line, so the main module directory is also called imagesearch, which leads to this directory structure:

/imagesearch
----/data
--------/training
--------/validation
--------/other folders…
----/imagesearch
--------main.py
--------other files…

Main.py

Main.py is the entry point into the model. It sets up the TensorFlow experiment and kicks off the training job. I like to use command line arguments, because they give me flexibility during development. It’s much better to update your batch size via an argument than to hardcode it, since it’s something that changes frequently.

From personal experience, it’s also important to keep track of when these arguments change in a notebook or a spreadsheet or something, so the results can be reproduced. Scrolling back through a bash history isn’t fun, but it’s also better than having zero record at all if you forget to keep track. The full file is at the link below, but the important bits are the last couple lines where I set up the experiment and kick off the runner.

experiment_fn = generate_experiment_fn(
    train_files=args.train_files,
    eval_files=args.eval_files,
    batch_size=args.batch_size,
    num_epochs=args.num_epochs)

learn_runner.run(experiment_fn, args.job_dir)

main.py line:47
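The `args` object used in that snippet comes from command line parsing. The following is only a sketch of how that could look with argparse, not the contents of the real main.py: the flag spellings, help strings, and the default batch size are assumptions, chosen so the parsed attributes match the `args.*` names above. The hypothetical `args_to_json` helper shows one cheap way to record the settings of each run.

```python
import argparse
import json


def parse_args(argv=None):
    """Parse training options; attribute names mirror the args.* used above."""
    parser = argparse.ArgumentParser(description="Train the image search model")
    parser.add_argument("--train-files", required=True,
                        help="glob pattern for training images")
    parser.add_argument("--eval-files", required=True,
                        help="glob pattern for validation images")
    parser.add_argument("--job-dir", required=True,
                        help="directory for checkpoints and summaries")
    # The default of 32 is an assumption, not the author's value.
    parser.add_argument("--batch-size", type=int, default=32)
    # None leaves the number of epochs unset.
    parser.add_argument("--num-epochs", type=int, default=None)
    return parser.parse_args(argv)


def args_to_json(args):
    """Serialize the arguments so each run's settings can be logged."""
    return json.dumps(vars(args), sort_keys=True)


# Example with an explicit argument list (paths are illustrative).
args = parse_args([
    "--train-files", "data/training/*.jpg",
    "--eval-files", "data/validation/*.jpg",
    "--job-dir", "jobs/run1",
    "--batch-size", "64",
])
print(args_to_json(args))
```

Appending that JSON line to a log file next to the job directory is less effort than a spreadsheet and still beats scrolling through bash history.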

If you’ve done some other TensorFlow tutorials, you’ll notice that those lines look a bit different than the typical sess = tf.Session() and sess.run() code that you normally see. Experiments are a higher level API and part of tf.contrib.learn. They are much more convenient to work with and simplify the code you have to write quite a bit. I use them extensively in projects.

The downside is that anything in the contrib section of TensorFlow might change in future versions, but since the overall field is rapidly evolving, I wouldn’t worry about it too much.

Experiment.py

So what is an experiment really? I like to think of it as the glue layer between your input data and your model. It controls how often and where checkpoints are saved, what evaluation metrics you’re using, and sets up some basic configuration. I used to put this functionality just into main.py, but now I find it cleaner to break it out into its own file.

Most of this file is boilerplate hookup code, but run_config has some arguments that are worth mentioning, specifically the GPU fraction. By default, TensorFlow tries to use 100% of the available GPU memory, which can be problematic if you’re running locally and need your GPU for other things.

In this case, I’m restricting the GPU memory usage to 80% to avoid that issue. When I’m in active development, I run everything locally, so I can see if it’s more-or-less working. It’s only when I’m confident that I have everything figured out that I’ll put the model on the cloud for extensive training.

run_config = tf.contrib.learn.RunConfig(
    save_summary_steps=1000,
    save_checkpoints_steps=1000,
    save_checkpoints_secs=None,
    gpu_memory_fraction=0.8)

experiment.py line:31

The other interesting bit of code here is the evaluation metrics. I’ll talk more about this when we get to the autoencoder model, but you should know that I’m using root mean squared error (RMSE) and comparing the original image to the output of our model.

eval_metrics = {
    'rmse': tf.contrib.learn.MetricSpec(
        metric_fn=tf.metrics.root_mean_squared_error,
        prediction_key='decoded_image',
        label_key='image')
}

experiment.py line:42
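To make the metric concrete, here is a small worked example of RMSE in plain Python. It is a sketch for intuition only: the TensorFlow metric accumulates its value across batches, while this computes it once over a short list of made-up pixel values already scaled to the -1 to 1 range.

```python
import math


def rmse(original, reconstructed):
    """Root mean squared error between two equal-length pixel sequences."""
    if len(original) != len(reconstructed):
        raise ValueError("inputs must have the same number of values")
    squared = [(o - r) ** 2 for o, r in zip(original, reconstructed)]
    return math.sqrt(sum(squared) / float(len(squared)))


original = [-1.0, 0.0, 0.5, 1.0]
perfect = list(original)       # an ideal reconstruction
noisy = [-0.8, 0.2, 0.3, 0.8]  # every value is off by 0.2

print(rmse(original, perfect))  # exactly 0.0
print(rmse(original, noisy))    # roughly 0.2
```

A perfect reconstruction scores 0, and the score grows as the decoded image drifts away from the original, so lower is better.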

Inputs.py

The last thing to do before I can move on to creating my model is to define my input pipeline. In this case, I’m grabbing all of the filenames, so I can load them into an input producer. This is the point where your filesystem (or straight Python code) interacts with TensorFlow, so there are some gotchas to be aware of.

I’m using match_filenames_once to load in my data. It’s a really useful function that takes either a pattern or a list of filenames (straight from Python) and returns a tensor of filenames that can be consumed as needed. It’s also why I flattened the input directory structures in the setup stage, since it doesn’t like /**/ globstrings.

The alternatives are to write your own filename loop or expand out the glob string with something like */*/*/*/*/. Since I wanted to change the filename extension anyways (and had some code around), I decided to flatten the structure.

filenames_tensor = tf.train.match_filenames_once(file_pattern)

inputs.py line:16
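For comparison, the "write your own filename loop" alternative could look something like the sketch below. This is not code from the repo: `list_images` is a hypothetical helper that walks a nested directory in plain Python and returns a list of paths, which would then stand in for the pattern match.

```python
import fnmatch
import os


def list_images(root, pattern="*.jpg"):
    """Return a sorted list of files under root whose names match pattern."""
    matches = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in fnmatch.filter(filenames, pattern):
            matches.append(os.path.join(dirpath, name))
    return sorted(matches)
```

Sorting keeps the order stable between runs, which matters for evaluation where nothing gets shuffled.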

Using a built-in TF input producer is also a handy way to take care of the shuffling and queuing that’s required for creating robust models without having to write the shuffle code yourself. All I have to do is load in my filenames_tensor, and it takes care of the hard work for me.

I’m passing in shuffle as a parameter instead of just defaulting to true because I want to be able to use the same input functions for both training and evaluation, and there's no need to shuffle for evaluation.

filename_queue = tf.train.string_input_producer(
    filenames_tensor,
    num_epochs=num_epochs,
    shuffle=shuffle)

inputs.py line:17

From there, I’m able to read in the contents of the file (the image data) and get it ready for processing (cropping to 256px by 256px, rescaling the values to be between -1 and 1, and batching them all together). That way it’s really easy to convert all the images to a tensor of a specific size whose values are very roughly “centered” at 0. Since we’re using RGB images, the number of channels is 3 (one for each color: red, green, and blue).

height, width, channels = [256, 256, 3]

reader = tf.WholeFileReader()
filename, contents = reader.read(filename_queue)

image = tf.image.decode_jpeg(contents, channels=channels)
image = tf.image.resize_image_with_crop_or_pad(image, height, width)
image_batch, filname_batch = tf.train.batch(
    [image, filename],
    batch_size,
    num_threads=4,
    capacity=50000)

image_batch = tf.to_float(image_batch) / 255
image_batch = (image_batch * 2) - 1

inputs.py line:22
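The last two lines of that snippet are just arithmetic, so here is the same mapping on single values in plain Python. The function names are made up for this sketch, and the inverse is not part of inputs.py; it is included because you need it to turn model output back into displayable pixels.

```python
def to_model_range(pixel):
    """Map an 8-bit value in [0, 255] to [-1, 1]: divide by 255, double, subtract 1."""
    return (pixel / 255.0) * 2.0 - 1.0


def to_pixel_range(value):
    """Inverse mapping from [-1, 1] back to an 8-bit value (hypothetical helper)."""
    return int(round((value + 1.0) / 2.0 * 255.0))


for pixel in (0, 64, 128, 255):
    scaled = to_model_range(pixel)
    print("%3d -> %+.4f -> %3d" % (pixel, scaled, to_pixel_range(scaled)))
```

Black (0) lands on -1, white (255) lands on 1, and mid-gray sits just above 0, which is what "very roughly centered at 0" means here.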

Now that I’m done with all the pre-processing and batching, I just return the features and labels that the model needs to have access to, and I’m finally ready to start the model development. I’m passing along the filename in addition to the image because I’ll eventually need a way to associate our representation with the original, but we’ll cover why in part 3. For now, it’s just something to note.

features = {
    "image": image_batch,
    "filename": filname_batch
}

labels = {
    "image": image_batch
}

return features, labels

inputs.py line:38

Wrap Up, For Now...

Phew… That was a lot of work. Getting TensorFlow set up, downloading and preparing our data, and doing some project scaffolding takes time. It’s also stuff I do for every machine learning project.

In the next post, I'll talk about how (and why) to build an autoencoder and how to use it, which is the part that changes from project to project.

Read Part 2: Model Development
Read Part 3: App Development

Questions? Comments? Let me know in the comments, or hit me up on Twitter: @jimmfleming
