Dan

Software Developer

Web scraping using Python and BeautifulSoup

Published Oct 07, 2018. Last updated Oct 09, 2018.

Intro
In the era of data science, it is common to collect data from websites for analytics purposes.
Python is one of the most commonly used programming languages for data science projects. Using Python with BeautifulSoup makes web scraping easier. Knowing how to scrape web pages will save you time and money.

Prerequisites

Basics of Python programming (Python 3.x).
Basics of HTML tags.

Installing required modules
First things first: assuming Python 3.x is already installed on your system, you need to install the requests HTTP library and the beautifulsoup4 module.

Install requests and beautifulsoup4

$ pip install requests
$ pip install beautifulsoup4

Collecting web page data

Now we are ready to go. In this tutorial, our goal is to get the list of presidents of the United States from this Wikipedia page.
Go to this link, right-click on the table containing all the information about the United States presidents, and then click Inspect to inspect the page (I am using Chrome; other browsers have a similar option).

Screen Shot 2018-10-07 at 11.38.39 PM.png

The table content is within a table tag with the class wikitable (see the image below). We will need this information to extract the data of interest.
Screen Shot 2018-10-07 at 9.25.01 PM.png

Import the installed modules

import requests
from bs4 import BeautifulSoup

To get the data from the web page, we will use the get() method of the requests API

url = "https://en.wikipedia.org/wiki/List_of_Presidents_of_the_United_States"
page = requests.get(url)

It is always good to check the HTTP response status code

print(page.status_code)   # This should print 200
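
Printing the status code works for a quick check, but in a script it is usually safer to stop when the request fails. The following is a minimal sketch of that idea; the helper name fetch_html and the 10-second timeout are assumptions for illustration and are not part of the original tutorial.

```python
import requests


def fetch_html(url, timeout=10):
    """Return the page body as text, or raise if the request failed.

    fetch_html is a hypothetical helper; the timeout value is an assumption.
    """
    response = requests.get(url, timeout=timeout)
    # raise_for_status() raises requests.HTTPError for 4xx and 5xx responses
    response.raise_for_status()
    return response.text
```

Calling fetch_html(url) with the url defined above would then either return the HTML or fail loudly instead of passing an error page on to the parser.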

Now that we have collected the data from the web page, let's see what we got

print(page.content)

The above code will display the HTTP response body.
The data can be viewed in a more readable format by using BeautifulSoup's prettify() method. For this, we will create a BeautifulSoup object and call its prettify() method

soup = BeautifulSoup(page.content, 'html.parser')
print(soup.prettify())

This will print the data in a format like the one we saw when we inspected the web page.

<table class="wikitable" style="text-align:center;">
      <tbody>
       <tr>
        <th colspan="9">
         <span style="margin:0; font-size:90%; white-space:nowrap;">
          <span class="legend-text" style="border:1px solid #AAAAAA; padding:1px .6em; background-color:#DDDDDD; color:black; font-size:95%; line-height:1.25; text-align:center;">
          </span>
          <a href="/wiki/Independent_politician" title="Independent politician">
           Unaffiliated
          </a>
          (2)
         </span>
         <span style="margin:0; font-size:90%; white-space:nowrap;">
         ...
         ...
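
To see what prettify() does without fetching the whole page, it can help to try it on a tiny piece of HTML. The snippet below is a sketch using a made-up one-row table (not the real Wikipedia markup); prettify() returns a string with nested tags placed on their own indented lines.

```python
from bs4 import BeautifulSoup

# Made-up sample markup, written on a single line on purpose
html = '<table class="wikitable"><tr><th>Name</th></tr></table>'

soup = BeautifulSoup(html, "html.parser")
pretty = soup.prettify()  # returns a str; it does not modify soup

print(pretty)
```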

We now know that our table is in a table tag with the class wikitable. So, first we will extract the table tag using the find method of the BeautifulSoup object. This method returns the first matching tag as a bs4 Tag object (or None if nothing matches)

tb = soup.find('table', class_='wikitable')
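
Because find returns None when nothing matches, it can be worth checking the result before using it. The sketch below uses a small made-up HTML string with two tables to show that class_='wikitable' also matches a tag that carries additional classes; the sample markup and the error message are assumptions for illustration.

```python
from bs4 import BeautifulSoup

# Made-up sample markup with two tables; only the second has class "wikitable"
html = """
<table class="plain"><tr><td>ignored</td></tr></table>
<table class="wikitable sortable"><tr><td>wanted</td></tr></table>
"""

soup = BeautifulSoup(html, "html.parser")

# class_ matches if "wikitable" is any one of the tag's classes
table = soup.find("table", class_="wikitable")
if table is None:
    raise ValueError("no table with class 'wikitable' found")

print(table["class"])       # the class attribute is returned as a list
print(table.get_text(strip=True))

# A search with no match returns None rather than raising
missing = soup.find("table", class_="does-not-exist")
print(missing)
```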

This tag has many nested tags, but we only need the president names, which appear as the text (and as the title attribute) of the a tag inside each b tag under the table. For that, we need to find all b tags under the table tag and then find the a tag under each b tag. For this, we will use the find_all method and iterate over each b tag to get its a tag

for link in tb.find_all('b'):
    name = link.find('a')
    print(name)

This will print the a tag found under each b tag

<a href="/wiki/George_Washington" title="George Washington">George Washington</a>
<a href="/wiki/John_Adams" title="John Adams">John Adams</a>
<a href="/wiki/Thomas_Jefferson" title="Thomas Jefferson">Thomas Jefferson</a>
<a href="/wiki/James_Madison" title="James Madison">James Madison</a>
<a href="/wiki/James_Monroe" title="James Monroe">James Monroe</a>
...
...
<a href="/wiki/Barack_Obama" title="Barack Obama">Barack Obama</a>
<a href="/wiki/Donald_Trump" title="Donald Trump">Donald Trump</a>

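The name can be extracted from each a tag using the get_text() method, which returns the text between the opening and closing tags (the same value is also stored in the title attribute, available as name['title']). A string passed to get_text() is treated as a separator, not as an attribute name, so we call it without arguments. Some b tags may not contain an a tag, in which case find returns None, so we skip those. Modifying the above code snippet

for link in tb.find_all('b'):
    name = link.find('a')
    if name is not None:  # skip b tags that contain no a tag
        print(name.get_text())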

and here is the desired result

George Washington
John Adams
Thomas Jefferson
James Madison
James Monroe
...
...
Barack Obama
Donald Trump
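
The difference between the link text and the title attribute, and the case where a b tag has no link, can be tried on a small made-up table. This is only a sketch: the three sample rows (including the "Vacant" row) are invented for illustration and are not taken from the Wikipedia page.

```python
from bs4 import BeautifulSoup

# Made-up sample rows; the middle <b> tag deliberately has no <a> inside it
html = """
<table class="wikitable">
  <tr><td><b><a href="/wiki/George_Washington" title="George Washington">George Washington</a></b></td></tr>
  <tr><td><b>Vacant</b></td></tr>
  <tr><td><b><a href="/wiki/John_Adams" title="John Adams">John Adams</a></b></td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")
table = soup.find("table", class_="wikitable")

names = []
for bold in table.find_all("b"):
    link = bold.find("a")
    if link is None:
        # find() returned None: this <b> has no link, so skip it
        continue
    # Prefer the title attribute; fall back to the visible link text
    names.append(link.get("title", link.get_text(strip=True)))

print(names)
```

Skipping the None case avoids the "'NoneType' object has no attribute 'get_text'" error that appears when a b tag contains no a tag.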

Putting it all together

import requests
from bs4 import BeautifulSoup

url = "https://en.wikipedia.org/wiki/List_of_Presidents_of_the_United_States"
page = requests.get(url)
soup = BeautifulSoup(page.content, 'html.parser')
tb = soup.find('table', class_='wikitable')

for link in tb.find_all('b'):
    name = link.find('a')
    if name is not None:  # skip b tags that contain no a tag
        print(name.get_text())

We have successfully scraped a web page in about 10 lines of Python code! Bingo!

Leave your feedback in the comment box. Let me know if you have any questions or run into any difficulty with this tutorial.



13 Replies

Alyssa Mesaros

2 years ago

I am getting the desired output, but it is followed by an attribute error: 'NoneType' object has no attribute 'get_text'.

Any ideas?

Jason Schvach

2 years ago

It means that at some point in the code link.find('a') returns None (meaning there was no <a> tag in that link object). So you can’t call .get_text() on something that doesn’t exist. I’d be more than happy to help you out with this and any web scraping questions. Also, by the way, it would be far easier to use pandas in the example given above, as it specifically parses <table> tags (using BeautifulSoup under the hood).

Alex Dias Camargo

5 years ago

Tks, dear!

Jeff

5 years ago

Awesome!
