Audio classification using Image classification techniques

Published Mar 26, 2018Last updated Mar 27, 2018
Audio classification using Image classification techniques

Both image classification and audio classification were challenging tasks for a machine to do until AI and neural networks technology came to the scene. Research on both problems were started decades before, and something fruitful started coming out after the inception of artificial intelligence and neural networks.

Classification is always an easier task for humans. Hope you all agree. But I can't 100% agree with that statement. Why am I contradicting my own statement?

I am not mad but there are certain cases like it's really hard to identify who is Jiswin and who is Jeswin (my friends) because both are identical twins and we can't make a judgement correctly with only our vision.

Leave the identical twins case and suppose we ordered a veg burger and a chicken burger. If they don't tell us which is chicken and which is veg, will we be able to recognise it correctly? So there are certain cases where human intelligence fails.

Theses are really simple examples from our daily life. So the takeaway from these examples is that sometimes human intelligence can also fail.

Inception of TensorFlow Inception v3

Now a days artificial intelligence can out perform almost any tasks than a human. A classification model trained with a lot of images of veg and chicken burger will be able to precisely recognize a chicken burger and a veg burger. If you can't agree with that, please go and try this out, here is the link.

Google's inception model, trained on imagenet dataset, can classify 1,000 classes of objects and is open sourced by Google. Anyone can use the trained model or retrain its last layer for new classes or to build your own classification model.

They have a really nice tutorial to start with inception If you haven't gone through it, please click on the above link and do a quick reading on it and come back. This is the simplest tutorial and good model which I found on google to do image classification. We will be talking a lot about it in the upcoming sections.

Inception is the only model which I found giving accurate predictions in less time and it is very easy to use, that means well documented. They uses deep convolution neural networks in inception. Inception model have already shown some excellent performance than humans in some visual tasks. I hope everyone read about inception and  all understood how we can retrain it.

Audio classification using TensorFlow Inception model

Here by seeing this heading you might be confused. How we can train a model with audio files for classification in inception? How can we do that? Actually it's not possible using inception. Your understanding 100% right till now. Inception can only be trained with images. It can only do image classification. Now, how we are going to solve this problem?

What if we can convert audio to images? Is there any pictorial representation for audio? During the research I came across two pictorial representation for audio files, spectrograms and chromagrams. A spectrogram is a visual representation of spectrum of frequencies of sound or other signal as they vary with time or some other variable — Wikipedia. The chromagram is a chroma-time representation similar to spectrogram.

| Image source:  Internet |

So after generating the spectrograms or chromograms, can we train an inception model with those images? Yes, we can. For us, human eyes, it might be difficult to find out a pattern from a set of spectrogram images but inception the way it sees an image is really different and it gives astonishing results.

I have done a couple of experiments with this technique. I will explain how to build a speaker recognition system with inception. I have started a project in GitHub named AudioNet. It contains scripts to process audio files and convert them to spectrograms.

Experiment 1 : Speaker Recognition

GitHub :
Note : This project is only tested on Ubuntu 16.04

Before we begin, we need data to train the model. Just for our experiment, we can download any speech of great people from Youtube as MP3. I have written a script to convert MP3s to wav files and then to process the wav file to make spectrogram out of it. To continue with this experiment, ensure this file is downloaded and extracted and kept in same folder where scripts folder you have downloaded.

Steps in our experiment:

  • Data preparation
  • Training the model
  • Testing the model

Data preparation

In this step, first thing you have to do is to make seperate folders for each speaker and name the folder with speaker's name. For example, if you have voice clips of Barrack Obama and APJ Abdul Kalam, then you have to make seperate folder for each person, one for Obama and one for Kalam. Then you have to put voice clips of each person in respective folders. And the voice clips should be in MP3. It will be better if the total time duration of all voice clips in a folder is same with all other folders. Once you have different folders for speakers, put that folder inside data_audio folder in tf_files folder.

Now we are good to run the script in scripts folder. Open up a terminal in scripts folder and enter

$ cd scripts
$ python

For running this script successfully, you should have below packages installed in your machine.

  • sox
  • libsox-fmt-mp3
  • ffmpeg
  • python-tk

After running the script successfully, you just go to each folders of speakers inside tf_files/data_audio/, you might be able to observe the voice clips which where in MP3 have been converted to wav files and the wav files have been divided into 20 seconds chunks and for each chunks there is a spectrogram JPG image. This is our training data. If you want, you can go to the data_maker script and change the time duration of chunks.


As I already mentioned, we are using Google's Inception model.
Run below commands to start training.

$ cd scripts
$ ARCHITECTURE="inception_v3"
$ python   --bottleneck_dir=../tf_files/bottlenecks  \
        --how_many_training_steps=500  \
        --model_dir=../tf_files/models/   \
        --summaries_dir=../tf_files/training_summaries/"${ARCHITECTURE}"   \
        --output_graph=../tf_files/retrained_graph.pb   \
        --output_labels=../tf_files/retrained_labels.txt   \
        --architecture="${ARCHITECTURE}"   \

You can increase the number of training steps if you want.


Get a voice clip of the speaker and generate spectrogram of his voice using script or Audio2Spectrogram. Then try testing the model by running below commands

$ cd scripts
$ python \
    --graph=../tf_files/retrained_graph.pb  \
    --labels=../tf_files/retrained_labels.txt  \


For any queries shoot a mail at

Your turn!!

Try to solve urban sound classification problem using inception and get amazed by seeing the results.

| Image source:  Internet |

Happy coding.. 😉



Discover and read more posts from Vishnu Ks
get started