Mikhail Sidyakov

Senior Data Scientist with 3+ years of experience

Jaccard similarity and Jaccard distance in Python

Published Dec 20, 2021. Last updated Jun 17, 2022.

In this tutorial we will explore how to calculate the Jaccard similarity (index) and Jaccard distance in Python.

Table of contents

Introduction
What is Jaccard similarity
Calculate Jaccard similarity
What is Jaccard distance
Calculate Jaccard distance
Similarity and distance of asymmetric binary attributes
Calculate Jaccard similarity in Python
Calculate Jaccard distance in Python
Calculate similarity and distance of asymmetric binary attributes in Python
Conclusion

Introduction

Jaccard similarity (Jaccard index) and Jaccard distance are widely used statistics for measuring similarity and dissimilarity. Their applications range from comparing simple sets all the way up to comparing complex text files.

To follow this tutorial we will need the following Python libraries: scipy, scikit-learn (imported as sklearn) and numpy.

If you don’t have them installed, please open “Command Prompt” (on Windows) and install them using the following code:

pip install scipy
pip install scikit-learn
pip install numpy

What is Jaccard similarity

The Jaccard similarity (also known as Jaccard similarity coefficient, or Jaccard index) is a statistic used to measure similarities between two sets.

Its use is further extended to measure similarities between two objects, for example two text files. In Python programming, Jaccard similarity is mainly used to measure similarities between two sets or between two asymmetric binary vectors.

Mathematically, the Jaccard similarity is the ratio of the size of the set intersection to the size of the set union.

Consider two sets A and B :

Jaccard Similarity - Set Defined

Then their Jaccard similarity (or Jaccard index) is given by:

$J(A, B) = \frac{|A \cap B|}{|A \cup B|}$

Let’s break down this formula into two components:

1. Numerator

The numerator is effectively the set intersection of A and B, shown by the yellow area in the infographic below:

Jaccard Similarity - Set Intersection

2. Denominator

The denominator is effectively the set union of A and B , shown by the yellow area in the infographic below:

Jaccard Similarity - Set Union

Using the formula of Jaccard similarity, we can see that the similarity statistic is simply the ratio of the above two visualizations, where:

If both sets are identical, for example $A = \{1, 2, 3\}$ and $B = \{1, 2, 3\}$, then their Jaccard similarity = 1.
If sets A and B don’t have common elements, for example $A = \{1, 2, 3\}$ and $B = \{4, 5, 6\}$, then their Jaccard similarity = 0.
If sets A and B have some, but not all, elements in common, for example $A = \{1, 2, 3\}$ and $B = \{3, 4, 5\}$, then their Jaccard similarity is some value strictly between 0 and 1: $0 < J(A, B) < 1$.
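
The three cases above can be checked with a few lines of Python. This is only a minimal sketch: the helper name `jaccard_index` is made up for this illustration, and it assumes that at least one of the two sets is non-empty.

```python
def jaccard_index(set_a, set_b):
    # Size of the intersection divided by size of the union
    return len(set_a & set_b) / len(set_a | set_b)

identical = jaccard_index({1, 2, 3}, {1, 2, 3})   # every element is shared
disjoint = jaccard_index({1, 2, 3}, {4, 5, 6})    # no element is shared
partial = jaccard_index({1, 2, 3}, {3, 4, 5})     # only the element 3 is shared

print(identical, disjoint, partial)
```

In the third case one element out of five distinct elements is shared, so the value is 1/5 = 0.2.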

Calculate Jaccard similarity

Consider two sets:

A = {1, 2, 3, 5, 7}
B = {1, 2, 4, 8, 9}

Or visually:

Set Defined Example

Step 1:

As the first step, we will need to find the set intersection between A and B :

Set Intersection in Python

In this case:

$A \cap B = \{1, 2\}$

Step 2:

The second step is to find the set union of A and B :

Set Union in Python

In this case:

$A \cup B = \{1, 2, 3, 5, 7, 4, 8, 9\}$

Step 3:

And the final step is to take the ratio of sizes of intersection and union:

$J = \frac{|A \cap B|}{|A \cup B|} = \frac{2}{8} = 0.25$
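
The same three steps can be reproduced with Python's built-in set operators. The sketch below keeps the ratio as an exact fraction (using the standard `fractions` module) before converting it to a decimal; the variable names are chosen only for this illustration.

```python
from fractions import Fraction

A = {1, 2, 3, 5, 7}
B = {1, 2, 4, 8, 9}

# Step 1: elements present in both sets
intersection = A & B

# Step 2: elements present in at least one of the sets
union = A | B

# Step 3: ratio of the two sizes, kept as an exact fraction
ratio = Fraction(len(intersection), len(union))

print(sorted(intersection), sorted(union), ratio, float(ratio))
```

The fraction 2/8 reduces to 1/4, which matches the 0.25 computed by hand.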

What is Jaccard distance

Unlike the Jaccard similarity (Jaccard index), the Jaccard distance is a measure of dissimilarity between two sets.

Mathematically, the Jaccard distance is the difference between the sizes of the set union and the set intersection, divided by the size of the set union.

Consider two sets A and B :

Jaccard Similarity - Set Defined

Then their Jaccard distance is given by:

$d_J(A, B) = \frac{|A \cup B| - |A \cap B|}{|A \cup B|} = 1 - J(A, B)$

Let’s break down this formula into two components:

1. Numerator

The numerator can also be written as:

$|A \cup B| - |A \cap B| = |A \triangle B|$

which is effectively the size of the set symmetric difference between A and B, shown by the yellow area in the infographic below:

Jaccard Similarity - Set Symmetric Difference
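
As a quick sanity check of this rewriting of the numerator, the sketch below compares the size of the union minus the size of the intersection with the size of the symmetric difference on randomly drawn sets. The sampling setup (subsets of the integers 0 to 19, 1000 trials, fixed seed) is an arbitrary choice made for this illustration.

```python
import random

random.seed(0)  # fixed seed so the run is repeatable

for _ in range(1000):
    # Draw two random subsets of the integers 0..19
    set_a = set(random.sample(range(20), random.randint(0, 10)))
    set_b = set(random.sample(range(20), random.randint(0, 10)))

    # Size of union minus size of intersection ...
    lhs = len(set_a | set_b) - len(set_a & set_b)
    # ... should equal the size of the symmetric difference
    rhs = len(set_a ^ set_b)
    assert lhs == rhs

identity_holds = True
print(identity_holds)
```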

2. Denominator

The denominator is effectively the set union of A and B , shown by the yellow area in the infographic below:

Jaccard Similarity - Set Union

Using the formula of Jaccard distance, we can see that the dissimilarity statistic is simply the ratio of the above two visualizations, where:

If both sets are identical, for example $A = \{1, 2, 3\}$ and $B = \{1, 2, 3\}$, then their Jaccard distance = 0.
If sets A and B don’t have common elements, for example $A = \{1, 2, 3\}$ and $B = \{4, 5, 6\}$, then their Jaccard distance = 1.
If sets A and B have some, but not all, elements in common, for example $A = \{1, 2, 3\}$ and $B = \{3, 4, 5\}$, then their Jaccard distance is some value strictly between 0 and 1: $0 < d_J(A, B) < 1$.

Calculate Jaccard distance

Consider two sets:

A = {1, 2, 3, 5, 7}
B = {1, 2, 4, 8, 9}

Or visually:

Set Defined Example

Step 1:

As the first step, we will need to find the set symmetric difference between A and B :

Python Set Symmetric Difference

In this case:

$A \triangle B = \{3, 4, 5, 7, 8, 9\}$

Step 2:

The second step is to find the set union of A and B :

Set Union in Python

In this case:

$A \cup B = \{1, 2, 3, 5, 7, 4, 8, 9\}$

Step 3:

And the final step is to take the ratio of sizes of symmetric difference and union:

$d_J = \frac{|A \triangle B|}{|A \cup B|} = \frac{6}{8} = 0.75$
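
The manual steps for the distance can be followed in Python with the `^` (symmetric difference) and `|` (union) set operators. This is a minimal sketch with variable names chosen for this illustration.

```python
A = {1, 2, 3, 5, 7}
B = {1, 2, 4, 8, 9}

# Step 1: elements that belong to exactly one of the two sets
symmetric_difference = A ^ B

# Step 2: elements that belong to at least one of the sets
union = A | B

# Step 3: ratio of the two sizes
jaccard_dist = len(symmetric_difference) / len(union)

print(sorted(symmetric_difference), jaccard_dist)
```

Six of the eight distinct elements appear in only one set, which gives 6/8 = 0.75.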

Similarity and distance of asymmetric binary attributes

In this section we will look into a more specific application of Jaccard similarity and Jaccard distance: asymmetric binary attributes.

From the naming of it, we can already guess what a binary attribute is. It’s an attribute that has only two states, and those two states are:

0, meaning an attribute is not present
1, meaning an attribute is present

The asymmetry comes from the fact that a match where both attributes are present (both equal to 1) is considered more important than a match where both attributes are absent (both equal to 0).

Suppose we have two vectors, A and B, each with $n$ binary attributes.

In this case, the Jaccard similarity (index) can be calculated as:

$J = \frac{M_{11}}{M_{01} + M_{10} + M_{11}}$

and Jaccard distance can be calculated as:

$d_J = \frac{M_{01} + M_{10}}{M_{01} + M_{10} + M_{11}}$

where:

$M_{11}$ is the total number of attributes for which both A and B have 1
$M_{01}$ is the total number of attributes for which A has 0 and B has 1
$M_{10}$ is the total number of attributes for which A has 1 and B has 0
$M_{00}$ is the total number of attributes for which both A and B have 0

and:

$M_{11} + M_{01} + M_{10} + M_{00} = n$
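
To make the four counts concrete, here is one possible way to compute them with NumPy boolean masks. The function name `binary_counts` and the two short vectors `x` and `y` are made up for this sketch; the similarity and distance are then computed from the counts exactly as in the formulas above.

```python
import numpy as np

def binary_counts(a, b):
    # Convert to boolean arrays so that 1 -> True and 0 -> False
    a = np.asarray(a, dtype=bool)
    b = np.asarray(b, dtype=bool)

    m11 = int(np.sum(a & b))      # both have 1
    m01 = int(np.sum(~a & b))     # first has 0, second has 1
    m10 = int(np.sum(a & ~b))     # first has 1, second has 0
    m00 = int(np.sum(~a & ~b))    # both have 0
    return m11, m01, m10, m00

# Made-up vectors with n = 5 attributes
x = [1, 1, 0, 0, 1]
y = [1, 0, 1, 0, 1]

m11, m01, m10, m00 = binary_counts(x, y)

similarity = m11 / (m01 + m10 + m11)
distance = (m01 + m10) / (m01 + m10 + m11)

print(m11, m01, m10, m00, similarity, distance)
```

Note that $M_{00}$ is counted but never enters either formula, which is exactly the asymmetry described above.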

Example

To explain this in simpler terms, consider an example that can be used for market basket analysis.

You operate a store that has 6 products (attributes) and 2 customers (objects), and also keep track of which customer bought which item. You know that:

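Customer A bought: apple, milk, coffee, and one further product (row A of the purchase matrix contains four 1s)
Customer B bought: eggs, milk, coffee

As you can already imagine, we can construct the following matrix, with one column per product:

             P1  P2  P3  P4  P5  P6
Customer A    1   0   0   1   1   1
Customer B    0   0   1   1   1   0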

Here each binary attribute indicates whether the customer purchased (1) or didn’t purchase (0) a particular product.

The question is to find the Jaccard similarity and Jaccard distance for these two customers.

Step 1:

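We will first need to find the total number of attributes for each $M$, counting column by column in the purchase matrix (A = [1, 0, 0, 1, 1, 1], B = [0, 0, 1, 1, 1, 0]):

$M_{11} = 2$, $M_{01} = 1$, $M_{10} = 2$, $M_{00} = 1$

We can validate the groups by summing up the counts. The sum should be equal to 6, which is the number of attributes (products) $n$:

$M_{11} + M_{01} + M_{10} + M_{00} = 2 + 1 + 2 + 1 = 6 = n$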

Step 2:

Since we have all the required inputs, we can now calculate the Jaccard similarity:

$J = \frac{M_{11}}{M_{01} + M_{10} + M_{11}} = \frac{2}{1 + 2 + 2} = \frac{2}{5} = 0.4$

And Jaccard distance:

$d_J = \frac{M_{01} + M_{10}}{M_{01} + M_{10} + M_{11}} = \frac{1 + 2}{1 + 2 + 2} = \frac{3}{5} = 0.6$
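
The binary-attribute formulas are the set formulas in another notation: if each customer is represented by the set of product positions they bought, the intersection size plays the role of $M_{11}$ and the union size plays the role of $M_{01} + M_{10} + M_{11}$. The sketch below illustrates this with the two customer vectors that are defined in the Python section of this article; the names `items_a` and `items_b` are chosen only for this example.

```python
# Purchase vectors for the two customers (1 = bought, 0 = not bought)
A = [1, 0, 0, 1, 1, 1]
B = [0, 0, 1, 1, 1, 0]

# Represent each customer as the set of product positions they bought
items_a = {i for i, bought in enumerate(A) if bought == 1}
items_b = {i for i, bought in enumerate(B) if bought == 1}

# Intersection size corresponds to M11, union size to M01 + M10 + M11
similarity = len(items_a & items_b) / len(items_a | items_b)
distance = len(items_a ^ items_b) / len(items_a | items_b)

print(similarity, distance)
```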

Calculate Jaccard similarity in Python

In this section we will use the same sets as we defined in one of the first sections:

A = {1, 2, 3, 5, 7}
B = {1, 2, 4, 8, 9}

We begin by defining them in Python:

A = {1, 2, 3, 5, 7}
B = {1, 2, 4, 8, 9}

As the next step we will construct a function that takes set A and set B as parameters and then calculates the Jaccard similarity using set operations and returns it:

def jaccard_similarity(A, B):
    # Find intersection of two sets
    numerator = A.intersection(B)

    # Find union of two sets
    denominator = A.union(B)

    # Take the ratio of sizes
    similarity = len(numerator) / len(denominator)

    return similarity

Then test our function:

similarity = jaccard_similarity(A, B)

print(similarity)

And you should get:

0.25

which is exactly the same as the statistic we calculated manually.
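
One edge case worth keeping in mind is that the function above raises a ZeroDivisionError when both sets are empty, because the union then has size 0. The variant below is a sketch of one way to guard against that. It also accepts any iterable, which makes it easy to apply to word lists from two texts. The name `jaccard_similarity_safe`, the sample sentences, and the choice to return 1.0 for two empty inputs are all assumptions made for this illustration; other conventions for the empty case exist.

```python
def jaccard_similarity_safe(items_a, items_b):
    # Accept any iterables (lists, tuples, sets) by converting to sets first
    set_a, set_b = set(items_a), set(items_b)
    union = set_a | set_b

    # Two empty sets have an empty union; return 1.0 by convention here
    if not union:
        return 1.0

    return len(set_a & set_b) / len(union)

# Word lists built with a deliberately simple whitespace tokenizer
words_1 = "the quick brown fox".split()
words_2 = "the lazy brown dog".split()

text_similarity = jaccard_similarity_safe(words_1, words_2)
empty_similarity = jaccard_similarity_safe([], [])

print(text_similarity)
print(empty_similarity)
```

The two sentences share 2 words ("the" and "brown") out of 6 distinct words, so the first value is 2/6, roughly 0.33.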

Calculate Jaccard distance in Python

In this section we continue working with the same sets ( A and B ) as in the previous section:

A = {1, 2, 3, 5, 7}
B = {1, 2, 4, 8, 9}

We begin by defining them in Python:

A = {1, 2, 3, 5, 7}
B = {1, 2, 4, 8, 9}

As the next step we will construct a function that takes set A and set B as parameters and then calculates the Jaccard distance using set operations and returns it:

def jaccard_distance(A, B):
    # Find symmetric difference of two sets
    numerator = A.symmetric_difference(B)

    # Find union of two sets
    denominator = A.union(B)

    # Take the ratio of sizes
    distance = len(numerator) / len(denominator)

    return distance

Then test our function:

distance = jaccard_distance(A, B)

print(distance)

And you should get:

0.75

which is exactly the same as the statistic we calculated manually.
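
Because the similarity and the distance share the same denominator, and the intersection and the symmetric difference together make up the union, the two statistics should add up to 1. A short check on the same sets, written as a self-contained sketch with operator syntax instead of the functions above:

```python
A = {1, 2, 3, 5, 7}
B = {1, 2, 4, 8, 9}

union_size = len(A | B)

similarity = len(A & B) / union_size
distance = len(A ^ B) / union_size

# The two statistics share a denominator, so they should add up to 1
total = similarity + distance

print(total)
```

In practice this means that only one of the two needs to be computed; the other is its complement.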

Calculate similarity and distance of asymmetric binary attributes in Python

We begin by importing the required dependencies:

import numpy as np
from scipy.spatial.distance import jaccard
from sklearn.metrics import jaccard_score

Using the table we used in the theory section:

             P1  P2  P3  P4  P5  P6
Customer A    1   0   0   1   1   1
Customer B    0   0   1   1   1   0

we can create the required binary vectors:

A = np.array([1,0,0,1,1,1])
B = np.array([0,0,1,1,1,0])

and then use the library functions jaccard_score (from scikit-learn) and jaccard (from SciPy) to calculate the Jaccard similarity and Jaccard distance:

similarity = jaccard_score(A, B)
distance = jaccard(A, B)

print(f'Jaccard similarity is equal to: {similarity}')
print(f'Jaccard distance is equal to: {distance}')

And you should get:

Jaccard similarity is equal to: 0.4
Jaccard distance is equal to: 0.6

which are exactly the same as the statistics we calculated manually.
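
When there are more than two customers, computing every pair by hand becomes tedious. One option, sketched below, is SciPy's `pdist` with `metric='jaccard'`, expanded into a square matrix with `squareform`. The first two rows are the vectors A and B from above; the third row is a hypothetical customer added only for this illustration.

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

# Rows are customers, columns are products
purchases = np.array([
    [1, 0, 0, 1, 1, 1],   # customer A
    [0, 0, 1, 1, 1, 0],   # customer B
    [1, 0, 0, 1, 0, 1],   # hypothetical customer C
], dtype=bool)

# Condensed pairwise Jaccard distances, expanded into a square matrix
distance_matrix = squareform(pdist(purchases, metric='jaccard'))

# Similarity is the complement of the distance
similarity_matrix = 1 - distance_matrix

print(np.round(distance_matrix, 2))
```

The entry in row 0, column 1 is the distance between customers A and B and should agree with the 0.6 obtained above.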

Conclusion

In this article we explored Jaccard similarity (index) and Jaccard distance as well as how to calculate them in Python.

Feel free to leave comments below if you have any questions or have suggestions for some edits and check out more of my Statistics articles.


