How to Write a Web Scraper in Node.js

Published Apr 29, 2016 · Last updated Jan 18, 2017

Sometimes we need to collect information from different web pages automagically. Obviously, a human is not needed for that. A smart script can do the job pretty well, especially if the task is repetitive.

When there is no web-based API to share the data with our app, and we still want to extract some data from that website, we have to fall back to scraping.

This means:

  1. We load the page (a GET request is often enough)
  2. Parse the HTML result
  3. Extract the needed data

In Node.js, all three of these steps are quite easy, because the functionality has already been built for us in different modules, by different developers.

Because I often scrape random websites, I created yet another scraper: scrape-it, a Node.js scraper for humans. It's designed to be really simple to use, while still being quite minimalist.

Here is how I did it:

1. Load the page

To load the web page, we need to use a library that makes HTTP(S) requests. There are a lot of modules that do that. As always, I recommend choosing simple/small modules, so I wrote a tiny package that does it: tinyreq.

Using this module, you can easily get the HTML rendered by the server from a web page:

const request = require("tinyreq");

request("http://ionicabizau.net/", function (err, body) {
    console.log(err || body); // Print out the HTML
});

tinyreq is actually a friendlier wrapper around the built-in http.request function.
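
For the curious, here is a minimal sketch of what such a wrapper saves us from writing, using only the built-in http module (tinyreq itself does more, such as accepting an options object and supporting HTTPS):

const http = require("http");

// Make a plain GET request and collect the response body chunks
http.get("http://ionicabizau.net/", res => {
    let body = "";
    res.on("data", chunk => { body += chunk; });
    res.on("end", () => {
        console.log(body); // Print out the HTML
    });
}).on("error", err => {
    console.error(err);
});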

Once we have a piece of HTML, we need to parse it. How to do that? 💭

2. Parsing the HTML

Well, that's a complex thing, but other people have already solved it for us. I like the cheerio module very much. It provides a jQuery-like interface for interacting with a piece of HTML you already have.

const cheerio = require("cheerio");

// Parse the HTML 
let $ = cheerio.load('<h2 class="title">Hello world</h2>');

// Take the h2.title element and show the text
console.log($("h2.title").text());
// => Hello world
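
The same jQuery-like interface also lets us read attributes and iterate over matched elements. Here is a small sketch (the HTML snippet is one I made up for the example):

const cheerio = require("cheerio");

// Parse a small HTML list
let $ = cheerio.load(`
    <ul>
        <li><a href="/foo">Foo</a></li>
        <li><a href="/bar">Bar</a></li>
    </ul>
`);

// Iterate over the matched links and read their href attributes
$("li a").each((i, el) => {
    console.log($(el).attr("href"), $(el).text());
});
// => /foo Foo
// => /bar Bar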

Because I like to modularize all the things, I created cheerio-req, which is basically tinyreq combined with cheerio (the previous two steps put together):

const cheerioReq = require("cheerio-req");

cheerioReq("http://ionicabizau.net", (err, $) => {
    console.log($(".header h1").text());
    // => Ionică Bizău
});

Since we already know how to parse the HTML, the next step is to build a nice public interface we can export into a module.

3. Extract the needed data

Putting the previous steps together, we have this (follow the inline comments):

"use strict"

// Import the dependencies
const cheerio = require("cheerio")
    , req = require("tinyreq")
    ;

// Define the scrape function
function scrape(url, data, cb) {
    // 1. Create the request
    req(url, (err, body) => {
        if (err) { return cb(err); }

        // 2. Parse the HTML
        let $ = cheerio.load(body)
          , pageData = {}
          ;

        // 3. Extract the data
        Object.keys(data).forEach(k => {
            pageData[k] = $(data[k]).text();
        });

        // Send the data in the callback
        cb(null, pageData);
    });
}

// Extract some data from my website
scrape("http://ionicabizau.net", {
    // Get the website title (from the top header)
    title: ".header h1"
    // ...and the description
  , description: ".header h2"
}, (err, data) => {
    console.log(err || data);
});

When running this code, we get the following output in the console:

{ title: 'Ionică Bizău',
  description: 'Web Developer,  Linux geek and  Musician' }

Hey! That's my website's information, so it's working. We now have a small function that can get the text from any element on the page.
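
That is also the public interface I mentioned earlier: to ship it as a module, we just export the function. Here is a minimal sketch, assuming the code above lives in a file called scrape.js (a name I picked for illustration):

// At the end of scrape.js
module.exports = scrape;

// Then, from any other file:
// const scrape = require("./scrape");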


In the scrape-it module, I made it possible to scrape lists of things (e.g. articles, pages, etc.).

So, to get the latest three articles from my blog, you can do:

const scrapeIt = require("scrape-it");

// Fetch the articles on the page (list)
scrapeIt("http://ionicabizau.net", {
    listItem: ".article"
  , name: "articles"
  , data: {
        createdAt: {
            selector: ".date"
          , convert: x => new Date(x)
        }
      , title: "a.article-title"
      , tags: {
            selector: ".tags"
          , convert: x => x.split("|").map(c => c.trim()).slice(1)
        }
      , content: {
            selector: ".article-content"
          , how: "html"
        }
    }
}, (err, page) => {
    console.log(err || page);
});
// { articles:
//    [ { createdAt: Mon Mar 14 2016 00:00:00 GMT+0200 (EET),
//        title: 'Pi Day, Raspberry Pi and Command Line',
//        tags: [Object],
//        content: '<p>Everyone knows (or should know)...a" alt=""></p>\n' },
//      { createdAt: Thu Feb 18 2016 00:00:00 GMT+0200 (EET),
//        title: 'How I ported Memory Blocks to modern web',
//        tags: [Object],
//        content: '<p>Playing computer games is a lot of fun. ...' },
//      { createdAt: Mon Nov 02 2015 00:00:00 GMT+0200 (EET),
//        title: 'How to convert JSON to Markdown using json2md',
//        tags: [Object],
//        content: '<p>I love and ...' } ] }

Happy scraping!

