Which metrics matter to you?

Published Dec 19, 2019

I've seen a lot of different measures of availability for online services and APIs. Often, developers and managers will stretch the meaning of "measurement" to try to find a user-oriented way of looking at data. But putting all of your services into one formula to produce a neat number you can monitor doesn't always give you the full picture of the health of your services.

I once worked on a service that served articles to guest (not logged-in) visitors to a social network. The microservice behaved just as a logged-in one would, taking the id from the URL and retrieving the article and its associated interactions (likes, reactions, comments), but the services that provided the associated content were all behind the in-house firewall. To protect user privacy, scrapers were normally blocked on the network. We were making changes to this service to ensure that search engine scrapers could reach this endpoint, so that public articles would get better SEO. A C plugin sat at the outer proxy layer to decide what traffic was allowed for different requestors, and we had to whitelist this service under scraper traffic. Running this service was also very different from running our normal article loaders. Its users were almost entirely scrapers, so 1 second of latency (or more!) wasn't worth alerting on as long as scrapers could surface the results fast enough for SEO.

However, measuring availability for this service became tricky. The de facto internal method at the time was to set up a synthetic test on the GET endpoints (which was limiting, but here, guest traffic wasn't going to be able to POST anything), poll the service roughly every 5 minutes, and record a yes/no datapoint for whether data was returned under the given threshold. That threshold was meant for pages real users wait on, while this service's traffic was almost entirely scrapers. We were comparing apples and oranges.
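To make the yes/no check concrete, here is a minimal sketch of how polled results might be turned into an availability number. It is not the internal tool described above: the `ProbeResult` shape, the function names, and the sample values are all assumptions made for illustration. The point it shows is that the same set of probes yields a different availability figure depending on the latency threshold you pick.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class ProbeResult:
    """One synthetic GET check (hypothetical shape, for illustration only)."""
    status_code: Optional[int]  # None if the request failed outright
    latency_seconds: float


def is_up(probe: ProbeResult, threshold_seconds: float) -> bool:
    """Yes/no datapoint: a 2xx response that came back under the threshold."""
    if probe.status_code is None:
        return False
    return 200 <= probe.status_code < 300 and probe.latency_seconds < threshold_seconds


def availability(probes: Iterable[ProbeResult], threshold_seconds: float) -> float:
    """Fraction of probes that counted as 'up' (0.0 when there are no probes)."""
    results = [is_up(p, threshold_seconds) for p in probes]
    return sum(results) / len(results) if results else 0.0


# Made-up sample: four polls, one slow response and one outright failure.
probes = [
    ProbeResult(200, 0.4),
    ProbeResult(200, 1.3),
    ProbeResult(200, 0.8),
    ProbeResult(None, 5.0),
]

strict = availability(probes, threshold_seconds=1.0)   # threshold tuned for human visitors
relaxed = availability(probes, threshold_seconds=2.0)  # threshold a scraper might tolerate
```

With these sample values, the strict threshold counts two of the four probes as up and the relaxed one counts three, so the "availability" of the same service moves with a threshold that may have nothing to do with its actual consumers.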

There are several industry standards for measuring the reliability of your app, but at the end of the day, you have to know what you want to measure. If you're trying to measure something the way a real user would experience it, ideally you would capture real user data. If real users aren't hitting that endpoint, is an availability measurement based on "user frustration" going to be useful?
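One way to ground the measurement in who actually calls the endpoint is to compute success rates from real request records, split by requester type. The sketch below assumes each record is a `(client_type, status_code)` pair and treats any non-5xx response as a success. Both the record shape and that success rule are assumptions for illustration, not something taken from the service described here.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def success_rate_by_client(requests: Iterable[Tuple[str, int]]) -> Dict[str, float]:
    """Return the fraction of non-5xx responses for each client type."""
    totals: Dict[str, int] = defaultdict(int)
    ok: Dict[str, int] = defaultdict(int)
    for client_type, status_code in requests:
        totals[client_type] += 1
        if status_code < 500:  # assumed success rule: anything but a server error
            ok[client_type] += 1
    return {client: ok[client] / totals[client] for client in totals}


# Made-up sample: traffic dominated by scrapers, with a single guest request.
sample = [
    ("scraper", 200),
    ("scraper", 200),
    ("scraper", 503),
    ("scraper", 200),
    ("guest", 200),
]
rates = success_rate_by_client(sample)
```

Splitting by requester type keeps a handful of guest requests from hiding how the service behaves for the scrapers that make up most of its traffic.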

High availability systems


Ankeet Presswala

Developer, security tester, ops with an emphasis on measurement.

I've worked with all sorts of companies, from big and hierarchical companies to small and spry ones to everything in between. I've worked as a pentester, security architecture reviewer, full-stack developer and have been on severa...
