Mathias Gatti

Software Developer specialized in Data Science

Python implementation of Normalized Google Distance (Simple web scraping example)

Published Jun 09, 2019. Last updated Dec 05, 2019

Introduction

The number of Google results for a word gives a rough measure of its popularity. Likewise, how often two words appear together, compared with how often each appears on its own, is a useful measure of how closely the two words are related.


The Normalized Google Distance is defined from these ideas. In this post I show how to implement it in Python using basic web scraping tools. The final code can be found here.

The Code

Importing libraries

import requests
from bs4 import BeautifulSoup
import math
import sys

Doing the search and getting the count

This function sends a GET request to Google with headers that identify the client as a desktop browser (not a phone). I also set the gl parameter so that the search runs as if I were in the USA.

def number_of_results(text):
  headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36'}
  r = requests.get("https://www.google.com/search?q="+text.replace(" ","+"),params={"gl":"us"},headers=headers)
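
As a side note on how the query string is built, here is a minimal sketch using only the standard library. It assumes the search term is passed as a parameter instead of being concatenated into the URL; `build_search_url` is a hypothetical helper, not part of the original script. Encoding the term this way also escapes characters such as `+`, which would otherwise be read as spaces.

```python
from urllib.parse import urlencode

def build_search_url(text, country="us"):
    # Hypothetical helper: encode the search term and the gl (country)
    # parameter together so that special characters are escaped.
    query = urlencode({"q": text, "gl": country})
    return "https://www.google.com/search?" + query

print(build_search_url("shakespeare macbeth"))
# https://www.google.com/search?q=shakespeare+macbeth&gl=us
print(build_search_url("c++"))
# https://www.google.com/search?q=c%2B%2B&gl=us
```

With requests, the same encoding should be obtainable by passing both values through params, for example params={"q": text, "gl": "us"}.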

Then BeautifulSoup is used to extract the part of the HTML where the result count appears.

  soup = BeautifulSoup(r.text, "lxml")
  res = soup.find("div", {"id": "resultStats"})

Finally, it prints and returns the parsed number.

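  # Google may omit the stats element (e.g. zero results or a blocked request)
  if res is None:
    raise Exception("Couldn't find a valid number of results on Google")

  print(res.text)
  for t in res.text.split():
    try:
      # The first token that parses as a number is the result count
      number = float(t.replace(",",""))
      print("{} results for {}".format(number,text))
      return number
    except ValueError:
      pass

  raise Exception("Couldn't find a valid number of results on Google")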
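
The parsing step can also be sketched as a standalone function that is testable without a network call. This is a minimal sketch, assuming the stats text looks like "About 25,270,000,000 results (0.61 seconds)"; `parse_result_count` is a hypothetical name, and it returns None where the function above raises an exception.

```python
import re

def parse_result_count(stats_text):
    """Return the first number found in a result-stats string, or None."""
    if not stats_text:
        # Covers both None (element not found) and an empty string.
        return None
    # Digits with optional comma thousands separators, e.g. "25,270,000,000".
    match = re.search(r"\d[\d,]*", stats_text)
    if match is None:
        return None
    return float(match.group(0).replace(",", ""))

print(parse_result_count("About 25,270,000,000 results (0.61 seconds)"))
# 25270000000.0
print(parse_result_count(None))
# None
```

A caller could pass `res.text if res is not None else None`, since `soup.find` returns `None` when no matching element exists.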

Compute the formula

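Here we implement the formula given on Wikipedia:

NGD(x, y) = (max(log f(x), log f(y)) - log f(x, y)) / (log N - min(log f(x), log f(y)))

where f(x) and f(y) are the number of results for each term, f(x, y) is the number of results for both terms together, and N is a normalizing count, approximated here by the number of results for "the".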

# N = number_of_results("the")
N = 25270000000.0
N = math.log(N,2)

def normalized_google_distance(w1, w2):
  f_w1 = math.log(number_of_results(w1),2)
  f_w2 = math.log(number_of_results(w2),2)
  f_w1_w2 = math.log(number_of_results(w1+" "+w2),2)

  return (max(f_w1,f_w2) - f_w1_w2) / (N - min(f_w1,f_w2))
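
To check the arithmetic without querying Google, the same formula can be written as a pure function of the counts. This is a minimal sketch; `ngd_from_counts` and the counts in the example are hypothetical, chosen as powers of two so the base-2 logarithms are whole numbers.

```python
import math

def ngd_from_counts(f_x, f_y, f_xy, n):
    """Normalized Google Distance computed from raw result counts."""
    if min(f_x, f_y, f_xy) <= 0:
        # log(0) is undefined, so a term with no results has no distance.
        raise ValueError("all counts must be positive")
    log_x, log_y, log_xy = math.log2(f_x), math.log2(f_y), math.log2(f_xy)
    return (max(log_x, log_y) - log_xy) / (math.log2(n) - min(log_x, log_y))

# Hypothetical counts: f(x) = 2^10, f(y) = 2^8, f(x, y) = 2^6, N = 2^20,
# so the distance is (10 - 6) / (20 - 8) = 1/3.
print(round(ngd_from_counts(1024, 256, 64, 2**20), 2))
# 0.33
```

When the two terms always appear together, so that f(x), f(y) and f(x, y) are all equal, the numerator is zero and the distance is 0.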

Main Function

All the code is executed from the main function

def main(argv):
  w1 = argv[1]
  w2 = argv[2]
  score = normalized_google_distance(w1,w2)

  print("Score is",round(score,2))
  print("W1='"+ w1+ "' W2='"+ w2+ "'")

# Usage example
# python normalized_google_distance.py shakespeare macbeth
# python normalized_google_distance.py "shakespeare " "macbeth"

if __name__ == "__main__":
  main(sys.argv)


I am a software developer specialized in data science. I have a computer science degree and several years of experience as a programmer and math teacher. In my spare time I contribute to open source projects.


Ishika Singh

5 years ago

Hey Mathias, I am getting the same error as Zaman: after a few queries the code stops working and gives exactly the same error.
It might be caused by a Google restriction on the number of queries, but in that case the limit is very low, since I made only around 20 queries.
Let me know if there’s already a solution to this.

zaman

5 years ago

Hi Mathias. Your code was working fine for me. However, after a few runs, I am getting the following error:

File "normalizedgoogledistance.py", line 12, in number_of_results
  print(res.text)
AttributeError: 'NoneType' object has no attribute 'text'

I hope you can suggest a solution for the error above.

Mathias Gatti

5 years ago

Hi Zaman, how are you running it? One possible problem is that one of the words (or the combination of both) doesn’t show up on Google (it has 0 results); in that case the program raises an error.

zaman

5 years ago

It’s working again now! So sometimes the same code works and sometimes it does not!

zaman

5 years ago

When I get the error, I even get the error for similar test scripts such as the one here: https://gist.github.com/yxlao/ad429b65ec1b3836da8f06fbd9fa8c54

Mathias Gatti

5 years ago

I couldn’t reproduce your error. I would check the value of "r.text" when you get the error. That’s the full HTML response from Google, and it seems that for some reason it’s not returning a valid search result.

Try adding something like:
with open("response.html", "w") as f:
  f.write(r.text)

Before the:
soup = BeautifulSoup(r.text, "lxml")

If you open the file with a web browser you should see a Google search results page; if there is a problem, you may see something else.

zaman

5 years ago

Thanks Mathias. At present I am trying an alternate Java implementation

Les Carbonaro

5 years ago

Does BeautifulSoup work for dynamically loaded DOM elements, i.e. via ajax calls?

Mathias Gatti

5 years ago

It depends on the website, but it’s usually possible to access a URL that has the desired DOM elements already loaded and extract those values as ordinary HTML.

Here is an example: https://stackoverflow.com/questions/5913280/beautifulsoup-and-ajax-table-problem

Les Carbonaro

5 years ago

Thanks, Mathias. I’ll give it a try. Appreciate the prompt response.
