30-minute Python Web Scraper

Published Nov 14, 2017. Last updated Nov 21, 2017.

I’ve been meaning to create a web scraper using Python and Selenium for a while now, but never got around to it. A few nights ago, I decided to give it a spin. Daunting as it seemed, it turned out to be extremely easy to write the code to grab some beautiful images from Unsplash.

Ingredients for a simple Image Scraper

  • Python (3.6.3 or newer)
  • PyCharm (Community edition is just fine)
  • pip install requests Pillow selenium
  • geckodriver (read below for instructions)
  • Mozilla Firefox (as if you didn’t have it installed)
  • Working internet connection (obviously)
  • 30 minutes of your time (possibly less)

Recipe for a simple Image Scraper

Got everything installed and ready? Good! I’ll explain what each of these ingredients does, as we move forward with our code.

The first thing we’ll use is the Selenium webdriver combined with geckodriver to open a browser window that does our job for us. To get started, create a project in PyCharm, download the latest version of geckodriver for your operating system, open the compressed file and drag and drop the geckodriver file into your project’s folder. Geckodriver is what lets Selenium take control of Firefox, so we need it in our project folder to be able to drive the browser.

Next, we import the webdriver from Selenium into our code and connect to the URL we want. So let’s do just that:

from selenium import webdriver
# The URL we want to browse to
url = "https://unsplash.com"
# Using Selenium's webdriver to open the page
driver = webdriver.Firefox(executable_path=r'geckodriver.exe')
driver.get(url)


A remote-controlled Firefox window

Pretty easy, huh? If you’ve done everything correctly, you are over the hard part already and you should see a browser window similar to the one shown in the above image.

Next up, we should scroll down so that more images can be loaded before we download them. We also want to wait a few seconds, just in case the connection is slow and the images have not fully loaded. As Unsplash is built with React, waiting for about 5 seconds seems like a generous timeframe, so we should do just that, using the time package. We also want to use some JavaScript code to scroll the page; we will be using window.scrollTo() to accomplish this. Putting it all together, you should end up with something like this:

import time
from selenium import webdriver

url = "https://unsplash.com"

driver = webdriver.Firefox(executable_path=r'geckodriver.exe')
driver.get(url)
# Scroll page and wait 5 seconds
driver.execute_script("window.scrollTo(0,1000);")
time.sleep(5)
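A fixed 5-second sleep is simple, but it either wastes time on a fast connection or falls short on a slow one. As a rough alternative, here is a minimal polling helper in plain Python. The name wait_until and its parameters are my own for illustration and are not part of Selenium; Selenium also ships its own WebDriverWait class in selenium.webdriver.support.ui for the same purpose.

```python
import time


def wait_until(condition, timeout=5.0, poll_interval=0.25):
    """Call condition() repeatedly until it returns a truthy value.

    Returns that value, or None if `timeout` seconds pass first.
    """
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            return None
        time.sleep(poll_interval)


# Hypothetical usage with the driver from the snippet above:
# wait_until(lambda: driver.execute_script("return document.images.length") > 0)
```

This way the script moves on as soon as the page reports some images, and still gives up after 5 seconds if nothing shows up.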

After testing the above code, you should see the browser scroll down the page a little bit. The next thing we need to do is find the images we want to download from the website. After digging around in the code React generates, I figured out that we can use a CSS selector to specifically target the images in the gallery of the page. The specific layout and code of the page might change in the future, but at the time of writing I could use a #gridMulti img selector to get all the <img> elements that were appearing on my screen.

We can get a list of these elements using find_elements_by_css_selector(), but what we want is the src attribute of each element. So, we can iterate over the list and grab those:

import time
from selenium import webdriver

url = "https://unsplash.com"

driver = webdriver.Firefox(executable_path=r'geckodriver.exe')
driver.get(url)

driver.execute_script("window.scrollTo(0,1000);")
time.sleep(5)
# Select image elements and print their URLs
image_elements = driver.find_elements_by_css_selector("#gridMulti img")
for image_element in image_elements:
    image_url = image_element.get_attribute("src")
    print(image_url)
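In practice, the list of src values can contain blanks (get_attribute can return None), inline data: URIs or repeated URLs. Here is a small sketch of a cleanup step before downloading anything. The helper name unique_image_urls is made up for this example, and whether Unsplash actually returns such entries is an assumption on my part.

```python
def unique_image_urls(srcs):
    """Return the http(s) URLs from `srcs` in first-seen order, without duplicates."""
    seen = set()
    urls = []
    for src in srcs:
        # Skip None, empty strings and anything that is not a web URL
        if not src or not src.startswith(("http://", "https://")):
            continue
        if src not in seen:
            seen.add(src)
            urls.append(src)
    return urls


# Hypothetical usage with the elements selected above:
# urls = unique_image_urls(e.get_attribute("src") for e in image_elements)
```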

Now, to actually get the images we found. For this, we will use requests and part of the PIL package, namely Image. We also want to use BytesIO from io to write the images to a ./images/ folder that we will create inside our project folder. So, to put this all together, we need to send an HTTP GET request to the URL of each image and then, using Image and BytesIO, we will store the image that we get in the response. Here’s one way to do this:

import requests
import time
from selenium import webdriver
from PIL import Image
from io import BytesIO

url = "https://unsplash.com"

driver = webdriver.Firefox(executable_path=r'geckodriver.exe')
driver.get(url)

driver.execute_script("window.scrollTo(0,1000);")
time.sleep(5)
image_elements = driver.find_elements_by_css_selector("#gridMulti img")
i = 0

for image_element in image_elements:
    image_url = image_element.get_attribute("src")
    # Send an HTTP GET request, get and save the image from the response
    image_object = requests.get(image_url)
    image = Image.open(BytesIO(image_object.content))
    image.save("./images/image" + str(i) + "." + image.format, image.format)
    i += 1
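Note that image.save() fails if the ./images/ folder does not exist yet. If you would rather not create it by hand, something along these lines should work. The helper image_path is hypothetical, and lowercasing the extension is my own choice, not something the script above does.

```python
import os


def image_path(folder, index, image_format):
    """Create `folder` if needed and return a path such as images/image0.jpeg."""
    os.makedirs(folder, exist_ok=True)
    file_name = "image{}.{}".format(index, image_format.lower())
    return os.path.join(folder, file_name)


# Hypothetical usage inside the download loop:
# for i, image_element in enumerate(image_elements):
#     ...
#     image.save(image_path("./images", i, image.format), image.format)
```

Using enumerate() also removes the need to keep a separate counter in sync with the loop.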

That’s pretty much all you need to get a bunch of free images downloaded. Obviously, unless you want to prototype a design and you just need random images, this little scraper isn’t of much use. So, I took some time to improve it, by adding a few more features:

  • Command line arguments that allow the user to specify a search query, as well as a numeric value for scrolling, which allows the page to display more images for downloading.
  • Customizable CSS selector.
  • Customized result folders, based on search queries.
  • Full HD images by cropping the URL of the thumbnails, as necessary.
  • Named images, based on their URLs.
  • Closing the browser window at the end of the process.
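To give an idea of how the URL-based features might look, here is a sketch using only the standard library. It assumes that a thumbnail URL differs from the full-size one only by its query string, which may or may not hold for the current Unsplash markup. The function names strip_query and name_from_url are mine, not taken from the full-featured version.

```python
import posixpath
from urllib.parse import urlsplit, urlunsplit


def strip_query(url):
    """Return `url` without its query string and fragment."""
    parts = urlsplit(url)
    return urlunsplit((parts.scheme, parts.netloc, parts.path, "", ""))


def name_from_url(url, default="image"):
    """Return the last path segment of `url`, or `default` if the path is empty."""
    name = posixpath.basename(urlsplit(url).path.rstrip("/"))
    return name or default
```

For example, strip_query("https://images.example.com/photo-123?w=400") gives "https://images.example.com/photo-123", and name_from_url on the same URL gives "photo-123".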

You can (and probably should) try implementing some of these features on your own. The full-featured version of the web scraper is available here. Remember to download geckodriver separately and connect it to your project, as instructed at the start of the article.
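If you want a starting point for the command line part, here is one possible sketch with argparse. The argument names, the default values and the result_folder helper are my own picks for illustration and do not necessarily match the full-featured version.

```python
import argparse


def parse_args(argv=None):
    """Parse the scraper's command line options (argv=None reads sys.argv)."""
    parser = argparse.ArgumentParser(description="Download images from a gallery page.")
    parser.add_argument("query", help="search term, also used to name the result folder")
    parser.add_argument("--scroll", type=int, default=1000,
                        help="vertical scroll offset in pixels")
    parser.add_argument("--selector", default="#gridMulti img",
                        help="CSS selector for the image elements")
    return parser.parse_args(argv)


def result_folder(query):
    """Turn a search query into a folder name such as ./images/blue_sky."""
    return "./images/" + "_".join(query.lower().split())
```

Calling parse_args(["blue sky", "--scroll", "3000"]) returns an object with query, scroll and selector attributes that the rest of the script can pass to driver.execute_script() and the selector lookup.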


Limitations, Considerations and Future Improvements

This whole project was a very simple proof-of-concept to see how web scraping is done, meaning there are a lot of things one can do to improve upon this little tool:

  • Not crediting the original uploaders of the images is a pretty bad idea. Selenium is definitely capable of working around this, so that each image comes with the name of the author.
  • Geckodriver shouldn’t be placed in the project folder, but rather installed globally and be part of the PATH system variable.
  • The search functionality could be easily extended to include multiple queries, so that the process of downloading lots of images can be simplified.
  • The default browser could be changed from Firefox to Chrome or even PhantomJS, which would be a lot better for this kind of project.
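On the geckodriver point, a quick way to check whether the driver is reachable through PATH is shutil.which() from the standard library. The two helper functions below are just a sketch with names I made up for this example.

```python
import shutil


def find_driver(name="geckodriver", search_path=None):
    """Return the full path to the driver executable, or None if it is not found.

    `search_path` defaults to the PATH environment variable.
    """
    return shutil.which(name, path=search_path)


def require_driver(name="geckodriver"):
    """Return the driver's path or raise a readable error if it is missing."""
    path = find_driver(name)
    if path is None:
        raise FileNotFoundError(
            name + " was not found on PATH; install it or add its folder to PATH"
        )
    return path
```

With the driver on PATH, webdriver.Firefox() can typically be called without the executable_path argument.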
Discover and read more posts from Angelos Chalaris (Chalarangelo)