API Scraping in the Real World

I’ve done a few projects that involve API scraping of some sort, whether it's Twitter, AWS, Google, Medium, or JIRA; it’s a fairly common task when you’re a freelance developer. Throughout these implementations I’ve used a few libraries, including bottleneck and promise-queue, and sometimes just rolled my own. However, none of the existing solutions covered every aspect of scraping.

That's why I created my own solution, api-toolkit, as a basis for API scraping. I also created another project, twitter-toolkit, built on top of it. api-toolkit solves 90% of the challenges you will encounter when scraping APIs, including:

  • Key/Secret Management
  • Building a simple queue that can transition between 4 states: Queued, Pending, Complete, Failed
  • Logging
  • Wait time between requests
  • Concurrency
  • Multiple Queues
  • Rate Limiting
  • Error Handling
  • Progress Bars
  • Debugging with Chrome Inspector
  • Pagination
  • Pausing/Resuming

If at any point you get stuck as to how the code works, you can look in those two repos for a working example. api-toolkit is the base set of utilities that you will share across all your APIs, and twitter-toolkit is an example of how you would use this base set for scraping the Twitter API.

Since API scraping has many challenges, we will first focus on the major ones. Then we'll walk through the fundamental concepts and set up a Twitter API scraper as a working example.

Challenges of API Scraping

Rate Limiting

One of the major challenges for API scraping is rate limiting. For just about any API (public or private), you will probably be hitting one of these two types of rate limiting:

DDOS protection

Almost every production API will block your IP address if you start hitting the API with 1,000 requests per second. This means your API scraper tool will be prohibited from accessing the API, potentially indefinitely. This is meant to prevent DDOS (distributed denial of service) attacks, which can disrupt service of the API for other API consumers. Unfortunately, it’s quite easy to inadvertently trigger these protections if you’re not careful, especially if you are running multiple clustered scraping servers.

Standard Rate Limiting and Throttling

Most APIs will limit your requests, based on either your IP address or your API key, to a certain number per timeframe (e.g. 180 requests per 15 minutes). These throttling limits may vary between endpoints of a single service.
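
Many APIs also advertise these limits in response headers. As a rough sketch (not part of the toolkit; header names vary by provider, though Twitter, for example, uses x-rate-limit-remaining and x-rate-limit-reset), you can inspect the headers after each request and back off once you have run out:

    // sketch: `instance` is an axios instance like the one we create later on
    async function throttledGet(instance, url) {
      const response = await instance.get(url);

      const remaining = Number(response.headers['x-rate-limit-remaining']);
      const resetAt = Number(response.headers['x-rate-limit-reset']); // unix seconds

      if (remaining === 0) {
        // sleep until the limit window resets before allowing more requests
        const waitMs = Math.max(resetAt * 1000 - Date.now(), 0);
        await new Promise(resolve => setTimeout(resolve, waitMs));
      }

      return response.data;
    }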

Error Handling

Errors happen. A lot. Common errors include:

  • Rate limiting: Even if you are careful, sometimes a rate limiting error (as described above) still occurs. You'll need a strategy to retry an API request at a later time when the rate limiting has subsided.
  • Not Found: Different APIs return "not found" error responses in different ways. Some throw a 404 while others may return a 200 but contain an error message in the API response. Your application might not care if something is not found, but it’s still important to consider that this type of error may happen.
  • Other errors: You may want to report every error that happens without crashing your app.

Pagination

Pagination is a common challenge with very large sets of results. Most APIs paginate once results run into the hundreds of records/items. Generally, there are two methods of pagination that an API will use (both are sketched below):

  1. Cursor: A pointer to the next record, returned along with the last record or page of results. The pointer is usually the ID of a record or item.
  2. Page number: Standard pagination. You keep passing page numbers sequentially until there are no more results.
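
As a rough illustration (the api client and parameter names here are hypothetical; every API picks its own), the two styles look something like this:

    // hypothetical client: api.get(url, params) returns parsed JSON

    // cursor-based: each response points at where the next page starts
    async function fetchAllByCursor(api) {
      const items = [];
      let cursor = -1; // many APIs use a sentinel value for the first page
      while (cursor) {
        const { data, nextCursor } = await api.get('/items', { cursor });
        items.push(...data);
        cursor = nextCursor; // falsy once there are no more pages
      }
      return items;
    }

    // page-number-based: keep incrementing the page until one comes back empty
    async function fetchAllByPage(api) {
      const items = [];
      let page = 1;
      for (;;) {
        const { data } = await api.get('/items', { page });
        if (!data.length) break;
        items.push(...data);
        page += 1;
      }
      return items;
    }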

Concurrency

Especially if the results are quite large (i.e. images, files, etc.), you probably want some sort of concurrency and parallel processing. That is, you will want multiple API requests to happen at the same time, while capping how many run at once so you don't trigger rate limiting or DDOS protection.

Logging and Debugging

There is so much that can go wrong when scraping an API. For this you will need an effective logging and debugging strategy. The api-toolkit includes a progress bar to indicate what’s going on at any point during API scraping.

API Scraping: The Basics

Key/Secret Management

Almost every private API will have some sort of private key system (essentially a password that’s easily revocable). The implementations vary greatly, but require you to store one or more pieces of “secret” text somewhere.

Never put secrets in your repository. Even if the repo is private, it is far too easy for your secrets to get leaked accidentally. If that happens, your API account can be hijacked and you will be responsible for anything that happens on it: posts made on behalf of your company, stolen user information, and any billing that results from use of the API.

The two main options for managing secrets and keys are a .env file (with dotenv) and environment variables.

.env file and Dotenv

An .env file (aka dot env) is a simple file that is excluded from the git repository and manually copied to your computer and servers. This approach works well for your own machine or for servers that you configure manually.

To do this, create a file called .env in the root of your project.

Then, add your secrets:

    API_KEY=asdfhsd834hsd
    API_SECRET=ioshfa94widoj2ws

Next, install dotenv:

    yarn add dotenv

At the top of your Node file, add:

    require('dotenv').config();

    const {
      API_KEY,
      API_SECRET,
    } = process.env;

dotenv loads the .env file into the environment variables, and the destructuring then pulls your keys and secrets out of process.env.

Finally, make sure that you exclude the file from your git repository (and manually copy it to anywhere that needs to use the secrets) by adding this line to your .gitignore file:

    # .gitignore
      
    .env

Environment Variables

Environment variables can be managed in most environments that run Docker containers (e.g. Amazon Web Services, Google Cloud Platform, and Microsoft Azure), and they are one of the simplest and most secure ways to go.

Essentially, all you have to do is set the variables manually in your shell environment, or in your container provider’s config (such as AWS ECS for Amazon EC2 instances) so that the variables persist across restarts.

To get this working on your local machine, type the following in your bash environment:

    export API_KEY=sn89ds2ju93sdnljos

Alternatively, you can add that line to your ~/.bashrc or ~/.bash_profile file.

Then in your Node file, you can access the variable with the following:

    const {
        API_KEY
    } = process.env;

Using API Libraries vs. Making Your Own

Generally, when you are going to scrape a well-known API, such as Twitter, there will be multiple Node packages to choose from. I strongly suggest using those packages, at least in the beginning, because they take care of:

  • authentication
  • the HTTP request headers and payload/body formatting
  • API-specific peculiarities

Once you start using the more advanced parts of the API, you may want to fork the code to fit your needs, or even write your own from scratch.

If you are using a less popular and less supported API, you may be forced to make your own API scraping implementation.

In that case, you can use the fetch function (native in browsers, or available via a package such as node-fetch in Node) or an existing library such as axios. I like axios because it supports promises and has clean syntax.

Once you have decided on your HTTP request library, you will want to create a wrapper for creating requests so that you do not have to type the same information over and over. Axios has the ability to create an “instance” that serves this purpose.

For example, we can create an axios instance that is used for multiple API requests like this:

    const instance = axios.create({
      baseURL: 'https://twitter.com/api/',
      headers: {
        'X-Bearer-Token': API_TOKEN,
      },
    });

    // later, using the instance
    instance.get('/users');
    instance.post('/users', { name: 'Toli' });
    instance.delete('/users/1');

How to set up a Twitter API scraper

For this tutorial, we are going to be using the Twitter API with the twit client. Set up your app as follows:

  1. Set up a Twitter developer account and get your account secrets on the Twitter website
  2. Add your Twitter credentials to your .env file or environmental variables
  3. Run the following:
    yarn add twit
  4. Add an object where we will keep our code:
    // use a method (not an arrow function) so that `this` refers to twitterScraper
    const twitterScraper = {
      async init() {

      },
    };
    twitterScraper.init();
  5. Configure the Twit client, referencing your .env variables:
    // at the top of the file: const Twit = require('twit');
    // CONSUMER_KEY, etc. come from process.env, as shown earlier
    async init() {
      this.client = new Twit({
        consumer_key: CONSUMER_KEY,
        consumer_secret: CONSUMER_SECRET,
        access_token: ACCESS_TOKEN,
        access_token_secret: ACCESS_TOKEN_SECRET,
      });
    }

Simple! Now you have the basics for API scraping the Twitter API.
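
As a quick sanity check (a sketch, not part of the toolkit), you could add a request inside init once the client exists; twit's get returns a promise that resolves to an object with a data property:

    // sketch: inside init, after this.client has been created
    const { data } = await this.client.get('statuses/user_timeline', {
      screen_name: 'nodejs', // any public account will do
      count: 5,
    });

    console.log(data.map(tweet => tweet.text));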

API Scraping Concepts

Now we are going to start building our scraper one concept at a time. We already have the API connection set up with our credentials/secrets. Next we will build a queue to make API requests, add some logging, add a wait time between requests, make concurrent API requests, and handle errors such as those caused by rate limiting.

Building a simple queue

We are going to create a simple queue with 4 states:

  • queued: waiting to execute
  • pending: currently executing
  • complete: successfully executed
  • failed: failed to execute

We are going to use promises in this queue. Since we cannot “pause” a promise, we wrap the function to be executed in a second, outer promise. The outer promise resolves when the internal function's promise resolves, but it can sit in a pending state before we ever start executing the internal function.

So:

  1. External promise is created
  2. External promise is pending
  3. Queue starts processing and internal function is queued up
  4. Internal function is executed
  5. Internal promise resolves
  6. External promise resolves

Here is what happens in the add function:

    class Queue {
      constructor() {
        Object.assign(this, {
          // lists of promises
          queued: [],
          pending: [],
          complete: [],
          failed: [],

          // unlike queued, this is a list of functions, not promises
          queuedFuncs: [],

          stopped: true,
          // set to true if you want the queue to start processing
          // as soon as items are added
          autoStart: false,
        });
      }

      // A promise is added to `queued` immediately after calling add,
      // but func will only start executing when `process` is called.
      // The wrapper promise is the one that gets passed around everywhere.
      // The queue will also start itself if it's currently stopped and
      // autoStart is enabled.
      add(func) {
        let resolve;
        let reject;

        const wrapperPromise = new Promise((res, rej) => {
          resolve = res;
          reject = rej;
        });

        this.queued.push(wrapperPromise);

        this.queuedFuncs.push(() => {
          func().then(resolve).catch(reject);
          return wrapperPromise;
        });

        setTimeout(() => {
          if (this.stopped && this.autoStart) {
            this.process();
          }
        });

        return wrapperPromise;
      }
    }

Now we need to write the process function that kicks things off. Essentially we have an asynchronous while loop that processes one item at a time. For now, there is nothing asynchronous happening, but later we will add some blocks to stop processing when there are too many requests in flight.

What our function is doing is:

  1. Executing the next function in the queuedFuncs list (this.queuedFuncs.shift())
  2. Moving the wrapper promise from the queued list to the pending list.
  3. After all the requests have been added to the pending list, the function will wait for all the requests in the pending list to resolve and return their results.

Notice that we wrap the execution in a try/catch block. For now we aren’t doing anything in the catch block, which lets any errors pass through silently.

    moveLists(item, from, to) {
      this[to].push(item);
      this[from].splice(this[from].indexOf(item), 1);
    }

    async processNextItem() {
      if (!this.queued.length) { return false; }

      this.moveLists(this.queued[0], 'queued', 'pending');

      let promise;

      try {
        promise = this.queuedFuncs.shift()();
        await promise;
        this.moveLists(promise, 'pending', 'complete');
      } catch (e) {}
    }

    async process() {
      while (this.queuedFuncs.length) {
        await this.processNextItem();
      }

      return Promise.all(this.pending);
    }
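
To see how the pieces fit together, here is a small usage sketch (fetchUser stands in for any function that returns a promise, such as an HTTP request):

    const queue = new Queue();

    async function main() {
      // add() returns the wrapper promises, so we can await the results later;
      // fetchUser(name) is a placeholder that returns a promise
      const users = ['alice', 'bob', 'carol'].map(
        name => queue.add(() => fetchUser(name)),
      );

      await queue.process();

      console.log(await Promise.all(users));
    }

    main();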

Adding logging to our request queue

Logging is extremely important. After all, we need to see what’s going on in long-running queues: how many items are done processing and how many are left. We will set up a more advanced logger later, but for now, let’s just add the ability to see each action as it happens.

For this, we set up a simple event listener/trigger system. We will trigger events as they happen and use .on to listen to them. Then we can log the output using console.log or make more advanced UIs.

    const ALL_EVENTS = [
      'queued',
      'complete',
    ];

    class Queue {
      constructor() {
        // ... other code
        this.initializeEvents();
      }

      initializeEvents() {
        this.eventListeners = ALL_EVENTS.reduce((listeners, event) => {
          listeners[event] = [];
          return listeners;
        }, {});
      }

      triggerEvent(event, promise) {
        if (!this.eventListeners[event]) return;
        this.eventListeners[event].forEach((cb) => {
          cb(promise);
        });
      }

      on(event, cb) {
        if (event === 'all') {
          ALL_EVENTS.forEach((e) => {
            this.eventListeners[e].push((...args) => cb(e, ...args));
          });
        } else {
          this.eventListeners[event].push(cb);
        }
        return this;
      }

      add() {
        // ... other code
        this.queued.push(wrapperPromise);

        // ADD THIS LINE
        this.triggerEvent('queued', wrapperPromise);
      }

      processNextItem() {
        // ... other code
        this.moveLists(promise, 'pending', 'complete');

        // ADD THIS LINE
        this.triggerEvent('complete', promise);
      }
    }

    const queue = new Queue();

    queue.on('all', (event, promise) => {
      console.log(event, promise);
    });

Adding wait time between requests

So we have our queue…but at this rate, if we have 1,000 requests queued up, it will try to hit the server with 1,000 requests at once! Not good, so let’s wait at least 1 second in between requests:

    const WAIT_BETWEEN_REQUESTS = 1000;
    
    wait(ms) {
      return new Promise(resolve => setTimeout(() => resolve(), ms));
    }
    
    async processNextItem() {
      await this.wait(WAIT_BETWEEN_REQUESTS);
      // ...other code
    }

Awesome. Now we will at least wait a second between each request to the server.

Adding concurrency and making concurrent requests

Now let’s add some concurrency support. Perhaps we only want 3 requests going on at one time.

For this, we can wrap all the code in processNextItem below our initial await in an if statement:

    const MAX_CONCURRENT = 3;

    async processNextItem() {
      await this.wait(WAIT_BETWEEN_REQUESTS);

      if (this.pending.length < MAX_CONCURRENT) {
        if (!this.queued.length) { return false; }
        // ... OTHER CODE
        // this.moveLists(this.queued[0], 'queued', 'pending');
        return true;
      }

      // too many requests in flight: wait for one of them to settle
      return Promise.race(this.pending);
    }

Creating multiple API request queues at the same time

Now that we have a queue for one endpoint, we can replicate this for multiple endpoints. We’ll create a hash/map to store all of our queues:

    const queues = {};
    function createQueue(url) {
      queues[url] = new Queue();
    }
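
For example, a small helper can lazily create a queue per endpoint and push work onto it (a sketch; getUser and getTimeline are placeholders for real request functions):

    // lazily create one queue per endpoint
    function getQueue(url) {
      if (!queues[url]) {
        createQueue(url);
      }
      return queues[url];
    }

    // each endpoint now gets its own queue, with its own limits
    getQueue('users/show').add(() => getUser('toli'));
    getQueue('statuses/user_timeline').add(() => getTimeline('toli'));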

Adding rate limiting support

This is one of the most important parts of API scraping. Most APIs will give you an endpoint for checking how many requests you have left.

We will set up a function, getRateLimits, to run every few seconds and find out the remaining limits on each endpoint. Then we use blockQueue to pause any queue whose limit has been exhausted until its reset time.

    const moment = require('moment');

    // fetch every 10 seconds
    const RATE_LIMIT_AUTO_FETCH_INTERVAL = 10000;

    let rateFetchTimeout;
    let rateLimits;

    // `queues` is the map of queues from the previous step;
    // `RATE_LIMIT_ENDPOINT` is the API's endpoint for checking remaining limits

    const setRateLimitOnQueue = (url, reset) => {
      const queue = queues[url];
      if (!queue) return;

      const unblockIn = moment.unix(reset).diff(moment());
      queue.blockQueue(unblockIn);
    };

    async function getRateLimits() {
      const result = await fetch(RATE_LIMIT_ENDPOINT);

      // format `result` so that `limits` is in the shape:
      // { endpoint_name: UNIX_TIMESTAMP_WHEN_RESET, ... }
      const limits = {
        endpoint_name: UNIX_TIMESTAMP_WHEN_RESET,
        endpoint_name2: UNIX_TIMESTAMP_WHEN_RESET,
      };

      Object.entries(limits).forEach(([ep, reset]) => {
        setRateLimitOnQueue(ep, reset);
      });

      return limits;
    }

    function initRateLimitsAutoFetch() {
      rateFetchTimeout = setInterval(async () => {
        rateLimits = await getRateLimits();
      }, RATE_LIMIT_AUTO_FETCH_INTERVAL);
    }

Adding error handling

There are some situations that APIs report as errors but that actually need to be handled in other ways. We handle them by wrapping our API calls in a try/catch.

Two common scenarios we want to handle are a Not Found error, in which case we want to return null, and being rate limited, in which case we call our rate limit function from the previous step to wait a few seconds.

    async request(method, url, params = {}) {
      return this.queues[url].add(async () => {
        try {
          return (await this.client[method](url, params)).data;
        } catch (e) {
          // not found
          if (e.code === 34) return null;

          // rate limited
          if (e.code === 88) {
            // block the queue for 20 seconds so that the rate limit
            // auto-fetch can find the real reset time
            this.setRateLimitOnQueue(url, moment().add(20 * 1000, 'milliseconds').unix());
            return;
          }

          throw e;
        }
      });
    }

Retrying

A lot of the time, even when we handle rate limiting correctly, the server may still fail some of our requests. We want to have an effective retry strategy.

For this we have to slightly edit our process function. It will still run a while loop, and it will be in charge of calling processNextItem, which queues up the next function and passes it off to runFunction.

runFunction waits for blocks to be cleared on the queue, then tries to run the function. If the function succeeds, then the item is moved to a completed state. Otherwise we move on to our retry logic.

If the function fails, we call runFunction again (recursion), while incrementing the tryNumber. If the tryNumber reaches the maximum tries (3), we move the item to the failed state.

    async runFunction(func, promise, tryNumber = 0, { retry = true } = {}) {
      if (this.waitBetweenRequests) {
        await this.wait(this.waitBetweenRequests);
      }

      if (this.blocked) {
        await this.block;
      }

      try {
        await func();
        this.moveLists(promise, 'pending', 'complete');

        this.triggerEvent('complete', promise);
      } catch (e) {
        if (retry && tryNumber < 3) {
          const res = await this.runFunction(func, promise, tryNumber + 1);

          if (res) return true;
        }

        this.moveLists(promise, 'pending', 'failed');

        this.triggerEvent('failed', promise);
      }

      return true;
    }

    async processNextItem() {
      if (this.pending.length < this.maxConcurrent) {
        if (!this.queued.length) { return false; }

        const promise = this.queued[0];

        this.moveLists(promise, 'queued', 'pending');

        await this.runFunction(this.queuedFuncs.shift(), promise);

        // process next
        return true;
      }

      // wait for something to succeed or fail
      return Promise.race(this.pending);
    }

    async process() {
      while (this.queuedFuncs.length) {
        await this.processNextItem();
      }

      return Promise.all(this.pending);
    }

Improving our logging: making it easy to debug

It can be quite difficult to debug what’s going on without logging, especially if the scraper is running for hours at a time. I’ve found that a progress bar is the most effective way to check on what’s happening.

Progress Bar

I wrote a progress bar that handles multiple queues with the structure we created. Instead of copying and pasting it here, you can look at the GitHub file.

The key is hooking into the on function of each queue. Then we use terminal-kit to draw the screen according to the current progress and status of every queue.
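
The idea looks roughly like this (a simplified sketch that uses terminal-kit's built-in progress bar rather than the custom drawing in the repo):

    const term = require('terminal-kit').terminal;

    function attachProgressBar(queue, title) {
      const bar = term.progressBar({ title, width: 80, percent: true, eta: true });

      const update = () => {
        const total = queue.queued.length + queue.pending.length
          + queue.complete.length + queue.failed.length;
        const done = queue.complete.length + queue.failed.length;
        bar.update(total ? done / total : 0);
      };

      // redraw the bar every time the queue emits an event
      queue.on('all', update);
    }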

Using Chrome DevTools to debug

I strongly recommend using the debugger in Chrome. Instead of regularly running our application with

    node .

we add an extra parameter

    node --inspect .

Then we navigate to chrome://inspect in the Chrome browser, and are able to see our script in the list.

This is my preferred method of debugging, because not only can we see a much more user friendly output, but we can execute commands in the node environment as needed.

Adding pagination support

Most APIs have some sort of pagination support for long lists. Generally this is done by passing a “cursor” for the next result, which is provided by the previous result.

We can create a simple recursive function that checks we haven't hit the maximum number of pages we want to fetch and that there are further results (a cursor is present). We extract the next cursor (nextCursor) from the response of the function, pass it back into the recursive function, and concatenate each page's data onto the output array results.

    const paginate = async ({
      results = [],
      cursor,
      func,
      params,
      page = 0,
      maxPages = 1,
    }) => {
      // note: the first call needs to pass a starting cursor
      // (for the Twitter API this is -1)
      if (page < maxPages && cursor) {
        const {
          data,
          nextCursor,
        } = await func({
          params,
          cursor,
        });

        // accumulate this page's data and recurse with the next cursor
        return paginate({
          results: [
            ...results,
            ...data,
          ],
          cursor: nextCursor,
          func,
          params,
          page: page + 1,
          maxPages,
        });
      }

      return results;
    };

Pausing/Resuming

All we have to do to pause/resume our queues is call .block() to pause, and then .unblock() to unpause. Our current structure will support the rest.
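
For instance (a sketch using the queues map from earlier):

    // pause every queue, e.g. before a deploy or when the API is flaky
    Object.values(queues).forEach(queue => queue.block());

    // ...later, pick up right where we left off
    Object.values(queues).forEach(queue => queue.unblock());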

Conclusion

We are now able to create a fairly powerful API scraper with just a few hundred lines of code. You can easily extend the api-toolkit and what you learned in this article to scrape just about any RESTful API on the web. Good luck, and happy scraping!

Last updated on Dec 26, 2019