Scraping an e-commerce website with BeautifulSoup
Case study
This guide walks you through scraping an e-commerce website with the BeautifulSoup Python library.
What you’ll need
For this tutorial you'll need a complete sample e-commerce website. I bundled one with the complete source code of the tutorial. Clone the repository, open the shop-cart folder, and run this command inside it. It serves the content of the folder.
python -m http.server 8000
Open your web browser at this location: http://localhost:8000/products.html
How to complete this tutorial
1. Install the requests and beautifulsoup4 libraries:
pip install requests
pip install beautifulsoup4
2. Your first parse with BeautifulSoup
from bs4 import BeautifulSoup
import requests
page = requests.get("http://localhost:8000/products.html")
soup = BeautifulSoup(page.content, 'html.parser')
print(soup.prettify())
Here we make an HTTP request to retrieve the URL that hosts our e-commerce web page. Then we parse the content of the response with 'html.parser', which is included in Python's standard library. Finally, we display the HTML code.
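It can help to confirm that the request succeeded before parsing, so that an error page is not mistaken for the product list. The following is a minimal sketch under that assumption; fetch_soup and parse_html are hypothetical helper names, and the 5-second timeout is an arbitrary choice. The last lines parse a small inline document, so the example runs without the local server.

```python
import requests
from bs4 import BeautifulSoup


def parse_html(markup):
    """Parse raw HTML (str or bytes) with the standard-library parser."""
    return BeautifulSoup(markup, "html.parser")


def fetch_soup(url, timeout=5):
    """Download a page and parse it, raising on HTTP errors."""
    response = requests.get(url, timeout=timeout)
    # raise_for_status() raises requests.HTTPError for 4xx/5xx responses,
    # so an error page is never parsed as if it were the product list.
    response.raise_for_status()
    return parse_html(response.content)


# Offline check on a small inline document (no server needed).
demo = parse_html("<html><head><title>Shop</title></head><body></body></html>")
print(demo.title.get_text())
```

With the sample site running, fetch_soup("http://localhost:8000/products.html") would return the same kind of soup object used in the rest of this tutorial.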
3. Retrieve all products
If you take a look at the HTML code behind the product list page, you will notice that all products are wrapped inside a div tag with the classes well and well-small, like this:
<div class="well well-small">
<h3>Our Products </h3>
<!-- Products goes here -->
</div>
And each product is built like this:
<li class="span4">
<div class="thumbnail">
<a href="product_details.html" class="overlay"></a>
<a class="zoomTool" href="product_details.html" title="add to cart"><span class="icon-search"></span> QUICK VIEW</a>
<a href="product_details.html"><img src="assets/img/a.jpg" alt=""></a>
<div class="caption cntr">
<p>Manicure & Pedicure</p>
<p><strong> $22.00</strong></p>
<h4><a class="shopBtn" href="#" title="add to cart"> Add to cart </a></h4>
<div class="actionList">
<a class="pull-left" href="#">Add to Wish List </a>
<a class="pull-left" href="#"> Add to Compare </a>
</div>
<br class="clr">
</div>
</div>
</li>
So, the key here is that you cannot scrape a website if you don't know how it is built. You have to figure out how the page is structured. A tip: right-click on any page and select the View Page Source option.
Now we can use the find_all method to search for elements by tag name, by class, or by id. In our case, we are looking for all li elements with the span4 class.
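When the page source is long, a rough survey of which tag and class combinations occur can help you spot the repeating unit. The sketch below is one possible way to do that and is not part of the tutorial's own code. SAMPLE_HTML is a shortened, partly made-up stand-in for the real page (the second product is a placeholder).

```python
from collections import Counter

from bs4 import BeautifulSoup

# Shortened stand-in for the product page; the second item is a placeholder.
SAMPLE_HTML = """
<div class="well well-small">
  <h3>Our Products </h3>
  <ul>
    <li class="span4"><div class="thumbnail"><p>Manicure &amp; Pedicure</p></div></li>
    <li class="span4"><div class="thumbnail"><p>Placeholder product</p></div></li>
  </ul>
</div>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

counts = Counter()
# find_all(True) matches every tag in the document.
for tag in soup.find_all(True):
    # "class" is a multi-valued attribute, so BeautifulSoup returns a list.
    for css_class in tag.get("class", []):
        counts[(tag.name, css_class)] += 1

for (name, css_class), n in counts.most_common():
    print(f"{name}.{css_class}: {n}")
```

A pair that appears once per product, such as li.span4 here, is a good candidate to pass to find_all.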
from bs4 import BeautifulSoup
import requests
page = requests.get("http://localhost:8000/products.html")
soup = BeautifulSoup(page.content, 'html.parser')
def retrieve_all_products():
    # find_all returns every <li> whose class list contains "span4"
    print(soup.find_all('li', class_='span4'))

if __name__ == '__main__':
    retrieve_all_products()
If you run it, it prints the list of matching elements.
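To make the difference between searching by class, by id, and by CSS selector concrete, here is a small sketch on inline markup. The id "products" and the second product are assumptions added for illustration; the sample site's markup shown above does not include them.

```python
from bs4 import BeautifulSoup

# Hypothetical markup modelled on the product list; the id "products" and
# the second item are assumptions, not taken from the sample site.
SAMPLE_HTML = """
<ul id="products">
  <li class="span4"><p>Manicure &amp; Pedicure</p><p><strong> $22.00</strong></p></li>
  <li class="span4"><p>Placeholder product</p><p><strong> $15.50</strong></p></li>
  <li class="other">Not a product</li>
</ul>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

by_class = soup.find_all("li", class_="span4")  # list of matching tags
by_id = soup.find(id="products")                # single tag or None
by_css = soup.select("ul#products li.span4")    # CSS selector, returns a list

print(len(by_class), by_id.name, len(by_css))
```

Note the trailing underscore in class_: class is a reserved word in Python, so BeautifulSoup uses class_ as the keyword argument.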
4. Get product price
Now, let's get one product's price.
from bs4 import BeautifulSoup
import requests
page = requests.get("http://localhost:8000/products.html")
soup = BeautifulSoup(page.content, 'html.parser')
def retrieve_first_product_price():
    all_products = soup.find_all('li', class_='span4')
    product_one = all_products[0]
    # The price sits inside the product's <strong> tag
    product_one_price = product_one.find("strong")
    print(product_one_price.get_text())
    # Trim whitespace, then drop the leading dollar sign
    print(product_one_price.get_text().strip().strip('$'))

if __name__ == '__main__':
    retrieve_first_product_price()
First, we get all products. Then we take the first one and look for its price, which is inside a strong tag. After finding the price, we display it. We can also remove the $ character. As you can see, you can search within the result of a previous search. Unlike the find_all method, which returns a list of elements or an empty list, the find method returns a single element or None.
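Because find can return None, calling get_text on its result fails with an AttributeError when the tag is missing. A defensive version might look like the sketch below. extract_price is a hypothetical helper, and the second product (with no price) is made up to exercise the None case.

```python
from bs4 import BeautifulSoup


def extract_price(product):
    """Return the product's price as a float, or None if it has no <strong> tag."""
    price_tag = product.find("strong")
    if price_tag is None:
        return None
    # " $22.00" -> "$22.00" -> "22.00" -> 22.0
    return float(price_tag.get_text().strip().lstrip("$"))


# Inline stand-in for the page; the second item is a made-up edge case.
SAMPLE_HTML = """
<ul>
  <li class="span4"><p>Manicure &amp; Pedicure</p><p><strong> $22.00</strong></p></li>
  <li class="span4"><p>Item without a price</p></li>
</ul>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
products = soup.find_all("li", class_="span4")
prices = [extract_price(product) for product in products]

print(prices)                         # [22.0, None]
print(len(soup.find_all("table")))    # 0: find_all gives an empty list, not None
```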
5. Build a fake price comparator
Let's suppose we want to compare our products using price as the criterion. Here is a very simple way to do it.
from bs4 import BeautifulSoup
import requests
page = requests.get("http://localhost:8000/products.html")
soup = BeautifulSoup(page.content, 'html.parser')
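def lazy_comparator():
    all_products = soup.find_all('li', class_='span4')
    products = {}
    for product in all_products:
        # The first <p> holds the name, the <strong> tag holds the price
        name = product.find("p").get_text().strip()
        price = product.find("strong").get_text().strip().strip('$')
        # Convert to float so prices sort numerically, not as text
        products[name] = float(price)
    # Sort (price, name) pairs from cheapest to most expensive
    print(sorted([(v, k) for k, v in products.items()]))

if __name__ == '__main__':
    lazy_comparator()
Some notes here. After getting all the products, we store each one in a dictionary that maps its name to its price, and then we sort the entries by price.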
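A detail worth watching in any price comparison: scraped prices are text, and text sorts character by character, so "100.00" comes before "22.00". The short sketch below illustrates the difference with made-up data (only the first product name comes from the sample page).

```python
# Prices as they come out of get_text(), after stripping the "$" sign.
# "Hypothetical item A" and "Hypothetical item B" are invented for the demo.
scraped = {
    "Manicure & Pedicure": "22.00",
    "Hypothetical item A": "100.00",
    "Hypothetical item B": "9.50",
}

# Sorting the raw strings compares them character by character.
as_text = sorted((price, name) for name, price in scraped.items())

# Converting to float first gives a true cheapest-to-dearest order.
as_numbers = sorted((float(price), name) for name, price in scraped.items())

print([name for _, name in as_text])
print([name for _, name in as_numbers])
```

For real money handling, decimal.Decimal from the standard library avoids floating-point rounding, but float is enough for a simple ordering like this one.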
That's it
Get the complete source code on GitHub. Also take a look at the official BeautifulSoup documentation.