
How to write a full-fledged API to track commits on the fly? (GitHub)

Published Apr 26, 2018 · Last updated Oct 23, 2018

Step-by-step, on Node.js, without using GitHub’s API.


What are we building?

We are going to use the power of Apify to track commits to a GitHub repository that you own or work on, using an Act (please ask your boss for permission if required). This could be used to build a GitHub dashboard for your company, where every commit gets displayed in real time, like LiveTweetApp or Walls.io, only as a LiveGitHubCommitApp 😛. You could count the number of commits per developer or the number of lines changed. Whatever you want!

It would be a great tool for Hackathons!

What about GitHub’s API?

The following code to track GitHub commits requires minimal setup and is not subject to GitHub’s API rate limits (60 requests per hour for unauthenticated calls and 5,000 for authenticated ones), because we are not using that API at all. It can be used by anyone with access to a GitHub repo.

Besides, we get the committed code line by line in a consolidated JSON format, with just enough information to use it. We also execute code in real time, since we add a webhook that fires on every push the repository receives.

Pros:

  • Minimal setup
  • Unlimited calls
  • Consolidated JSON State Output
  • Real-time code execution through webhooks

Keep in mind that this does not replace GitHub’s powerful API; it is another way to automate the task, with a tiny setup, using Apify, a powerful web automation tool.

What is Apify?

Apify is a cloud-based web crawler that extracts structured data from any website using a few simple lines of JavaScript. It is a web scraping and automation platform that can turn any website into an API in a few minutes!
The first thing you will encounter on the site is a crawler. Crawlers are hosted on Apify for developers, and each one is set up to perform a certain scraping or automation task: it looks at a page and gets back certain information from it. Technically, each crawler is a web browser hosted on Apify’s servers that enables you to scrape data from any website using JavaScript. A crawler needs two things: the URL to run on and the JavaScript code to execute.

What is an Act?

An act is a single performance of an actor. Actor is a serverless computing platform that enables execution of arbitrary pieces of code in the Apify cloud. This means you can run a job such as filling in a form, sending an email, or even crawling a whole page. A single, isolated actor job is called an act; it has its own settings and source code. Unlike traditional serverless platforms, an act is not limited to the lifetime of a single HTTP transaction. It can run for as long as necessary, even forever. In short, an act is a single cloud app or service.

Getting started

On the left panel you can go to Actor and start creating an act.

You can use the API, install the library, or host it online. You will see multiple tabs when you choose to create a new act.

Go to the Source tab and you will find the following code:

const Apify = require('apify');
const request = require('request-promise');

Apify.main(async () => {
    // Get input of your act
    const input = await Apify.getValue('INPUT');
    console.log('My input:');
    console.dir(input);

    // Do something useful here
    const html = await request('http://www.example.com');

    // And then save output
    const output = {
        html,
        crawledAt: new Date(),
    };
    console.log('My output:');
    console.dir(output);
    await Apify.setValue('OUTPUT', output);
});

This is the basic code to get you started. Click the “Quick Run” button and it will build and run your newly created Act. Try it out!

Your act is up and running!
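
If the run succeeds, the OUTPUT record saved by this boilerplate should look roughly like the following (the values are illustrative and the html field is truncated here):

{
  "html": "<!doctype html>\n<html>\n<head>\n    <title>Example Domain</title> ...",
  "crawledAt": "2018-04-26T12:00:00.000Z"
}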

For complicated acts, it’s more convenient to host source code on Git. You can clone the boilerplate code from the Quick Start repository to get you started. We’ll do this for our GitHub Tracker.

When you’re on the Source tab, choose Git repository as the Source type.


Enter the URL to your git repository and you’re good to build. In our case, we would use this repository: https://github.com/juansgaitan/act-git-tracker.


If the build succeeds, you’ll see the run status reported as succeeded.


On GitHub, to track any git repository you own, go to Settings > Webhooks and then Add webhook.


The “Payload URL” will be the “Run act” URL, which can be found in the “API” tab on the Apify platform.


Now just publish the app and run it from the Actor tab. From then on, every time you push a commit to your GitHub repo, the act will save the commit in a JSON file, which could be used to serve a frontend app (as mentioned before) or to do other useful things 😃.
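
If you want to check the wiring before pushing a real commit, you can simulate the webhook yourself. The sketch below is only an illustration: the Run act URL is a placeholder that you replace with your own from the API tab, and the payload is a minimal, made-up push body with empty file lists, so the act stores a STATE record without fetching any blob pages.

const requestPromise = require('request-promise');

// Placeholder: paste the "Run act" URL from your act's API tab here
const RUN_ACT_URL = 'https://api.apify.com/...your-run-act-url...';

// Minimal, made-up push payload containing only the fields the act reads
const fakePushPayload = {
  repository: { id: 123456789 },
  head_commit: {
    id: 'a1b2c3d4e5f678900000000000000000000000ff',
    timestamp: '2018-04-26T12:00:00-05:00',
    message: 'Test the tracker act',
    committer: { name: 'Jane Doe', email: 'jane@example.com', username: 'janedoe' },
    url: 'https://github.com/your-user/your-repo/commit/a1b2c3d4e5f6789',
    added: [],    // empty lists mean no blob pages are requested
    removed: [],
    modified: [],
  },
};

requestPromise({
  method: 'POST',
  uri: RUN_ACT_URL,
  body: fakePushPayload,
  json: true, // serialize the body and parse the response as JSON
}).then(response => console.log(response));

Once this works, a real push from GitHub goes through exactly the same path, only with the full payload.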

The code

Let’s break the code down step-by-step:

  • Start out by requiring the apify, cheerio, and request-promise packages. These are the only packages we’ll use for this app (all npm packages are supported). (Lines 1 — 3)
  • Optional utility functions to improve readability. (Lines 6 — 7)
  • Start the execution by calling Apify.main(); using this function is optional. (Line 20)
  • Pass an async function to the main function in order to take advantage of the await keyword when handling asynchronous executions. (Lines 20 — end)
  • Apify.getValue() brings in the INPUT we pass to our code for the execution. An act can work without this INPUT, but it is recommended, since it passes external parameters to the act, creating flexibility and the capacity to extend its functionality to more use cases. In this case, the INPUT is the payload delivered by the GitHub webhook; a sketch of the fields we rely on follows this list. (Lines 22 — 23)
  • Lines 28 through 38 are explained in their own section below: Key-value stores.
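
For reference, GitHub’s push webhook delivers a large JSON payload; the handful of fields this act actually reads look roughly like this (all values below are made up):

{
  "repository": { "id": 123456789 },
  "head_commit": {
    "id": "a1b2c3d4e5f678900000000000000000000000ff",
    "timestamp": "2018-04-26T12:00:00-05:00",
    "message": "Add tracker act",
    "committer": { "name": "Jane Doe", "email": "jane@example.com", "username": "janedoe" },
    "url": "https://github.com/your-user/your-repo/commit/a1b2c3d4e5f6789",
    "added": ["index.js"],
    "removed": [],
    "modified": ["README.md"]
  }
}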

“Magic here”

  • Lines 56 — 87, “Magic here”, are where we request the blob URL of each file in the commit and extract all the required information. Since we are not using GitHub’s API, we fetch each code blob with requestPromise and inject cheerio to handle the page’s HTML content. Then we reduce the changes into a stateChanges object and merge it with our previousState.

Here is the full source of the act:
const Apify = require('apify');
const cheerio = require('cheerio');
const requestPromise = require('request-promise');

// Utils functions to simplify the code
const { log } = console;
const pretty = object => JSON.stringify(object, null, 2);

// Get code line-by-line with cheerio into an Object
const arrayToObjectInContext = $ => (array, object = {}) => (
  array.reduce((acc, tr) => {
    const $tr = $(tr);
    const lineNumber = $tr.find('[data-line-number]').data('line-number');
    const code = $tr.find('.blob-code-inner').text();
    return Object.assign({}, acc, { [lineNumber]: code });
  }, object)
);

// Apify's main function to encapsule execution - optional
Apify.main(async () => {
  // payload from the WebHook
  const payload = await Apify.getValue('INPUT');
  const { repository, head_commit: headCommit } = payload;

  const { id: repositoryId } = repository;
  log('Repository ID:', repositoryId);

  // Create a repoStore to handle 'STATE' for each repo
  const repoStore = await Apify.openKeyValueStore(`Repository-${repositoryId}`);

  // Getting record 'STATE', if any
  let previousState = {};
  try {
    previousState = (await repoStore.getValue('STATE')) || {}; // fall back to {} if no record exists yet
  } catch (err) {
    // ignore this error
  }
  log('Previous STATE:', pretty(previousState));

  // Get Commit SHA, 7 characters are enough
  const commitId = headCommit.id.slice(0, 7);
  log('Current Commit ID:', commitId);

  const { added, removed, modified } = headCommit;
  const headCommitFiles = [].concat(added, removed, modified);
  log('Added/Removed/Modified Files:', headCommitFiles);

  // Add blob URL to each file
  const commitBlobUrl = headCommit.url.replace('commit', 'blob');
  const commitBlobUrls = headCommitFiles.map(file => ({
    uri: `${commitBlobUrl}/${file}`,
    file: file.toLowerCase(),
  }));

  // Magic here
  const stateChanges = await commitBlobUrls.reduce(async (acc, { uri, file }) => {
    log('Blob URL:', uri);
    const options = {
      uri,
      transform: body => cheerio.load(body),
    };
    const $ = await requestPromise(options);
    const getDiffIn = arrayToObjectInContext($);
    
    const $tableRows = $('table tr').toArray();
    log(`Lines found in '${file}': ${$tableRows.length}`);

    const fileState = previousState[file] || [];
    const currentState = fileState.filter(commitObject => !commitObject[commitId]);
    const commit = {
      [commitId]: {
        timestamp: headCommit.timestamp,
        committer: headCommit.committer,
        message: headCommit.message,
        url: headCommit.url,
      },
    };
    
    // Filtering non .js files
    if (file.includes('.js')) {
      Object.assign(commit[commitId], { code: getDiffIn($tableRows) });
    }
    return Object.assign({}, await acc, { [file]: [commit, ...currentState] }); // await acc: this async reduce passes a Promise as the accumulator
  }, {});

  const nextState = Object.assign({}, previousState, stateChanges);
  log('Next STATE:', pretty(nextState));

  log('Saving into repoStore:', repositoryId);
  await repoStore.setValue('STATE', nextState);

  // Save 'OUTPUT' of the current Act run 
  // (optional - necessary only if called from within an act to get its 'OUTPUT')
  await Apify.setValue('OUTPUT', nextState);
  return log('Done.');
});
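
After a few pushes, the STATE record for a repository ends up keyed by (lowercased) file path, with each file holding a list of commit entries, newest first, keyed by the short commit SHA. An illustrative, made-up example:

{
  "index.js": [
    {
      "a1b2c3d": {
        "timestamp": "2018-04-26T12:00:00-05:00",
        "committer": { "name": "Jane Doe", "email": "jane@example.com", "username": "janedoe" },
        "message": "Add tracker act",
        "url": "https://github.com/your-user/your-repo/commit/a1b2c3d4e5f6789",
        "code": {
          "1": "const Apify = require('apify');",
          "2": "const cheerio = require('cheerio');"
        }
      }
    }
  ],
  "readme.md": [
    {
      "a1b2c3d": {
        "timestamp": "2018-04-26T12:00:00-05:00",
        "committer": { "name": "Jane Doe", "email": "jane@example.com", "username": "janedoe" },
        "message": "Add tracker act",
        "url": "https://github.com/your-user/your-repo/commit/a1b2c3d4e5f6789"
      }
    }
  ]
}

Note that only .js files carry the line-by-line code object; other files keep just the commit metadata.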

Key-value stores

The Apify package comes with several methods, including an integrated instance of the ApifyClient package (Apify.client). We will use the openKeyValueStore method to open a store for each of our repos.

const repoStore = await Apify.openKeyValueStore(
  `Repository-${repositoryId}`
);

The store name we pass (Repository-${repositoryId}) has to be the same every time we call openKeyValueStore, since we want to append each new commit to the same repoStore.

// Getting record 'STATE', if any
let previousState = {};
try {
  previousState = await repoStore.getValue('STATE');
} catch (err) {
  // ignore this error
}

The state of our commits is kept in a single record under the ‘STATE’ key, so we call the getValue method of the repoStore we’ve created. For error-handling purposes, we wrap the call in a try/catch and, if the response has a body, assign it to the previousState variable. We ignore any errors this time.

We are now ready to handle the data as we want. Then, to setValue back to the store, we pass the name of the record, ‘STATE’, and the nextState as the second parameter.

await repoStore.setValue('STATE', nextState);
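
Any other act, or a later run of this one, can read that state back by opening the store under the same name. A minimal sketch, assuming a hypothetical repository ID of 123456789:

const Apify = require('apify');

Apify.main(async () => {
  // Open the same named store the tracker writes to (hypothetical repository ID)
  const repoStore = await Apify.openKeyValueStore('Repository-123456789');

  // Read the latest state; this is the JSON a dashboard or frontend would consume
  const state = await repoStore.getValue('STATE');
  console.log(JSON.stringify(state, null, 2));
});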

That’s it?

Yes! We’ve created a full-fledged API that returns a JSON file every time a commit gets pushed to a repository. We are storing the data in the Storage tab under the name “Repository-repository-id-here”, inside the STATE key. (If you have nothing there, it’s because you haven’t pushed a commit yet; it’s not retroactive, yet!)

OUTPUT

Check it out here.

This is the first post in a series on creating full-fledged APIs with Apify. Don’t forget to subscribe to be notified about the next post.
