
Scraping an e-commerce website with BeautifulSoup

Published May 31, 2018 · Last updated Nov 27, 2018

Case study

This guide walks you through how to scrape an e-commerce website with the BeautifulSoup Python library.

What you’ll need

For the sake of this tutorial, you'll need a complete sample e-commerce website. I bundled an e-commerce website sample with the complete source code of the tutorial. Clone the repository, open the shop-cart folder, and inside it, run the following command to serve the content of the folder:

  python -m http.server 8000

Open your web browser at this location: http://localhost:8000/products.html
(Screenshot: the Twitter Bootstrap shopping cart sample served at http://localhost:8000/products.html)

How to complete this tutorial

1. Install the requests and beautifulsoup4 libraries:

  pip install requests
  pip install beautifulsoup4
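
If you want to make sure both packages installed correctly (a quick sanity check, not part of the original tutorial), you can print their versions from the command line:

  python -c "import requests, bs4; print(requests.__version__, bs4.__version__)"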

2. Your first parse with BeautifulSoup

  from bs4 import BeautifulSoup
  import requests
  page = requests.get("http://localhost:8000/products.html")
  soup = BeautifulSoup(page.content, 'html.parser')
  print(soup.prettify())

Here we make an HTTP request to retrieve the URL that hosts our e-commerce web page. Then we parse the content of the page we get with 'html.parser', which is included in Python's standard library. And finally we display the HTML code!
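
As a small variation (a sketch, not part of the original script), you can add a basic status check before parsing, so a 4xx/5xx response from the server fails loudly instead of being parsed silently; it also assumes the sample page has a title tag:

  from bs4 import BeautifulSoup
  import requests

  page = requests.get("http://localhost:8000/products.html")
  page.raise_for_status()  # raises requests.exceptions.HTTPError on a 4xx/5xx response
  soup = BeautifulSoup(page.content, 'html.parser')
  print(soup.title.get_text())  # print just the page title instead of the whole document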

3. Retrieve all products

If you take a look at the HTML code behind the product list webpage, you might notice that all products are wrapped inside a div tag with the class well well-small, like this:

  <div class="well well-small">
      <h3>Our Products </h3>
      <!-- Products goes here -->
  </div>

And each product is built like this:

  <li class="span4">
        <div class="thumbnail">
        <a href="product_details.html" class="overlay"></a>
        <a class="zoomTool" href="product_details.html" title="add to cart"><span class="icon-search"></span> QUICK VIEW</a>
        <a href="product_details.html"><img src="assets/img/a.jpg" alt=""></a>
        <div class="caption cntr">
          <p>Manicure & Pedicure</p>
          <p><strong> $22.00</strong></p>
          <h4><a class="shopBtn" href="#" title="add to cart"> Add to cart </a></h4>
          <div class="actionList">
            <a class="pull-left" href="#">Add to Wish List </a> 
            <a class="pull-left" href="#"> Add to Compare </a>
          </div> 
          <br class="clr">
        </div>
        </div>
      </li>

So, the key here is that you cannot scrape a website if you don't know how it is built. You have to figure out how things are structured. A tip here is to right-click on any page and select the view page source option. There you go.
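
As a quick illustration of working with that structure (a small sketch based on the markup shown above), you can first locate the well well-small container and then look at what it holds:

  from bs4 import BeautifulSoup
  import requests

  page = requests.get("http://localhost:8000/products.html")
  soup = BeautifulSoup(page.content, 'html.parser')

  # Locate the div that wraps the whole product list
  container = soup.select_one('div.well.well-small')
  if container is not None:
      print(container.h3.get_text().strip())                 # "Our Products"
      print(len(container.find_all('li', class_='span4')))   # number of products inside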

Now, we can use the find_all method to search for elements by class or by id. In our case, we are looking for all li elements with the span4 class.

  from bs4 import BeautifulSoup
  import requests
  page = requests.get("http://localhost:8000/products.html")
  soup = BeautifulSoup(page.content, 'html.parser')

  def retrieve_all_products():
      print(soup.find_all('li', class_='span4'))

  if __name__ == '__main__':
      retrieve_all_products()

If you run it, it should print a list of elements.
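
If you'd rather see how many products were matched than dump the whole list, a tiny hypothetical helper (building on the same soup object) could count them:

  def count_products():
      # find_all returns a plain list, so len() gives the number of matched products
      print(len(soup.find_all('li', class_='span4')))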

4. Get product price

Now, let's get one product's price:

  from bs4 import BeautifulSoup
  import requests
  page = requests.get("http://localhost:8000/products.html")
  soup = BeautifulSoup(page.content, 'html.parser')

  def retrieve_first_product_price():
      all_products = soup.find_all('li', class_='span4')
      product_one = all_products[0]
      # The price lives inside a <strong> tag in the product markup
      product_one_price = product_one.find("strong")
      print(product_one_price.get_text())
      print(product_one_price.get_text().strip().strip('$'))

  if __name__ == '__main__':
      retrieve_first_product_price()

First, we get all the products. Then we take the first result and, on it, we look for the price, which sits inside a strong tag. After finding the price, we display it. We can also remove the $ character. As you see, you can search for elements within the results of a previous search. Unlike the find_all method, which returns a list of elements or an empty list, the find method returns a single element or None.
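
Because the stripped price is still a string, you'll usually want to turn it into a number before comparing or doing arithmetic; here is a minimal sketch (assuming each product's strong tag holds a price such as $22.00):

  def first_product_price_as_float():
      all_products = soup.find_all('li', class_='span4')
      strong = all_products[0].find("strong")
      if strong is None:  # find returns None when the tag is missing
          return None
      return float(strong.get_text().strip().strip('$'))  # "$22.00" -> 22.0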

5. Build a fake price comparator

Let's suppose we want to compare our products using their price as the criterion. Here is a very simple way to do it.

  from bs4 import BeautifulSoup
  import requests
  page = requests.get("http://localhost:8000/products.html")
  soup = BeautifulSoup(page.content, 'html.parser')
  
  def lazy_comparator():
      all_products = soup.find_all('li', class_='span4')
      products = {}
      for product in all_products:
          # Map each product name to its price, converted to a float so the sort is numeric
          products[product.find("p").get_text().strip()] = float(product.find("strong").get_text().strip().strip('$'))
      print(sorted([(v, k) for k, v in products.items()]))

  if __name__ == '__main__':
      lazy_comparator()

A few notes here. After getting all the products, we put each one into a dictionary, mapping the product name to its price (converted to a float), and then we sort the entries by price.
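
If you only need the cheapest product instead of the full sorted list, you can reuse the same name-to-price dictionary with min() (a small hypothetical helper built on the lazy_comparator example above):

  def cheapest_product(products):
      # products maps product name -> price (as a float), as built in lazy_comparator
      name, price = min(products.items(), key=lambda item: item[1])
      return name, price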

That's it

Get the complete source code on GitHub. Also take a look at the official BeautifulSoup documentation.
