
Web Scraping In PHP using Goutte

Published Sep 29, 2017

Web scraping is one of the grey areas of the web. Many people do not care about it, but plenty of others frown upon it. Scraping itself is not necessarily a bad thing, but the way it is implemented can cause problems for the website's server. Another issue is how the information is used (for example, when scraping copyrighted material). You should always follow the terms and conditions of the website you are scraping to avoid any legal backlash. Even when you stay within legal boundaries and the website's policies, you may still get sued.

What is web scraping

Simply put, web scraping (or screen scraping) is a mechanism for having a computer read a website.

Apart from the ethical issues discussed above, another major issue that will affect your scraper is that it might stop working whenever the website is updated, whether the change is a full redesign or merely a div changed to a p tag.
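
To make the div-to-p scenario concrete, here is a minimal sketch. It uses PHP's built-in DOMDocument and DOMXPath rather than Goutte so that it runs without any dependencies. The two HTML snippets, the "price" class, and the extractFirst() helper are all made up for illustration.

```php
<?php
// Two versions of the same hypothetical page: the redesign swaps <div> for <p>.
$before = '<html><body><div class="price">19.99</div></body></html>';
$after  = '<html><body><p class="price">19.99</p></body></html>';

/**
 * Return the trimmed text of the first node matching an XPath query, or null.
 * (Hypothetical helper, written only for this illustration.)
 */
function extractFirst(string $html, string $xpath): ?string
{
    $doc = new DOMDocument();
    // Suppress the warnings libxml raises for imperfect real-world markup.
    @$doc->loadHTML($html);
    $nodes = (new DOMXPath($doc))->query($xpath);

    return ($nodes !== false && $nodes->length > 0)
        ? trim($nodes->item(0)->textContent)
        : null;
}

// Tied to the tag name: matches before the redesign, finds nothing after it.
$byTag = '//div[@class="price"]';
// Tied only to the class: still matches after the div -> p change.
$byClass = '//*[@class="price"]';

var_dump(extractFirst($before, $byTag));  // string(5) "19.99"
var_dump(extractFirst($after, $byTag));   // NULL
var_dump(extractFirst($after, $byClass)); // string(5) "19.99"
```

Selecting on something less likely to change than the tag name (a class, an id, a data attribute) will not protect against a full redesign, but it can absorb small markup changes like this one.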

Real life uses of web scraping

Some business use cases for web scraping include, but are not limited to:

  • scrape products from retailer or manufacturer websites to show in their own website or provide specs/price comparison
  • scrape business profiles and reviews to track online presence and reputation
  • scrape people profiles from social networks for tracking online reputation
  • scrape search engine results for SEO tracking
  • monitor specific company pages from social networks to gather what people are saying about certain companies and their products
  • scrape physicians from their clinic websites to provide a catalog of available doctors per specialisation and region
  • scrape product reviews from retailers to detect fraudulent reviews
  • scrape news websites to apply custom analysis and curation often with the goal of providing better targeted news to their audience
  • scrape product details from e-commerce sites for price comparison

Enough said. Now let's get into the technicalities of how to actually scrape a website. Create a new project and add composer.json and index.php files. If you do not know how to use Composer, there is a post on it here.

We are going to use a PHP library called Goutte, which is based on Symfony's DomCrawler component and uses Guzzle to handle web requests.

Your composer.json file should look similar to this:

{
    "require": {
        "fabpot/goutte": "^3.1"
    }
}

Now, in your terminal, navigate to the project directory and run the following command to install Goutte with Composer:

composer install

Put the following lines into the index.php file:

[Image: web scraping in php using goutte]
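
Since the listing itself is not reproduced in the text, the following is a minimal sketch of what such an index.php could look like, assuming Goutte 3.x installed through the composer.json above. It is not the original listing. The target URL and the "h1" selector are placeholders chosen for illustration.

```php
<?php
// index.php: a minimal Goutte 3.x sketch (an assumption, not the original listing).

// Load the classes that `composer install` placed under vendor/.
require __DIR__ . '/vendor/autoload.php';

use Goutte\Client;

$client = new Client();

// Placeholder URL: replace it with a page you are allowed to scrape.
$crawler = $client->request('GET', 'https://example.com/');

// filter() takes a CSS selector; each() runs the callback on every match.
// The "h1" selector is an assumption about the target page's markup.
$crawler->filter('h1')->each(function ($node) {
    echo trim($node->text()), PHP_EOL;
});
```

Running php index.php from the project directory should print the text of each matching element, one per line. The request() call returns a DomCrawler Crawler object, so anything described in the DomCrawler documentation can be applied to $crawler.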
And that is all. You can read the DomCrawler documentation (link shared above) to see what else you can do with Goutte.

Happy Coding!

Discover and read more posts from Usman Irale