
Monitoring the Behavior of Large Client-Side JavaScript Apps


This article covers how to monitor the behavior and performance of a client-side JavaScript app in production, along with some common types of errors you may encounter and how to track them. This post is based on the Codementor Office Hours hosted by Rebecca Murphey from Bazaarvoice.


Download the Slides


Introduction

When you are developing a traditional server-side application and you own the hardware where that application is running, it’s very easy to just log in to the box where the code is running and view the logs. With a client-side JavaScript application, however, it is a different scenario.

In our case at Bazaarvoice, our application serves hundreds of millions of page views every month, and on every one of those page views it runs on somebody else’s computer. We cannot log in to their computer and see whether the application had any errors.

Bazaarvoice’s Firebird

The application that I will be talking about today is the display for the ratings and reviews software that we developed at Bazaarvoice. There are a few different display applications, but I’ll be talking about one app called Firebird.

Firebird is a third-party JavaScript application: customers do not do any server-side integration to make it work; they just need to put a script tag and a little bit of JavaScript on their page to get it up and running. The review interface is created entirely client-side by Firebird. Given a product ID, Firebird asks the API for the data associated with that product ID and then uses that data to generate all the HTML and interactivity on the page (all done client-side).
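To make that concrete, here is a rough sketch of what a third-party integration like this could look like. The URL, global queue name, and option names below are hypothetical, not Bazaarvoice’s actual integration code.

// Hypothetical integration snippet -- not the real Bazaarvoice code.
// Load the scout file asynchronously so it never blocks the host page.
(function () {
    var s = document.createElement('script');
    s.async = true;
    s.src = 'https://static.example-cdn.com/bvapi.js'; // hypothetical URL
    document.getElementsByTagName('head')[0].appendChild(s);
}());

// Tell the app which product to render reviews for and where to put them.
window.exampleReviewQueue = window.exampleReviewQueue || [];
window.exampleReviewQueue.push({ productId : 'ABC123', container : '#reviews' });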

If you have seen stars and reviews on the internet and it wasn’t on Amazon or eBay, there’s a pretty decent chance those are Bazaarvoice stars, as we’re on some very major retailers and brands. What we do has a direct impact on our customers’ ability to make money.

The content that you see online about a product is extremely influential. Nowadays, ratings/reviews are not just for people who are buying things online, but also for people who are researching online before they go buy in a store.

I’m bringing this up just to explain that this thing I work on impacts the bottom line of our customers. It is not just some vanity software that they are running; it affects their business, and they get it on their page by putting just a few lines of JavaScript onto their site.

Since what we do is so important, this means that no matter what, our code has to work.

However, sometimes we’ve made a mistake in our code, and now it does not work well with a particular customer’s site. Sometimes a customer might get that little bit of integration code wrong even though it’s only five lines. They might not provide the right product ID, and we cannot show stars if we do not know which product to show stars for. Sometimes customers do things that we just completely didn’t expect.

More than twelve hundred retailers and brands use this app, and in December we saw hundreds of millions of page views across tens of thousands of unique sites. I throw out all these big numbers to make the point that we cannot check every page that our app is going to be on. This app runs anywhere and everywhere, and the environment in which it runs can change without us doing anything. So we cannot test it for all the possible scenarios it is going to execute in.

Apps that Talk Back: the Concept

As we can’t know ahead of time that our app will always work in all scenarios, we needed a way to know quickly that our app is not working in some scenario. We needed to get more visibility into how our app was doing out in the world in terms of errors/performance, and we needed to be able to quickly figure out what was going wrong. We also needed to put out some safety nets or some guardrails so we would be less likely to make things go wrong as we were doing development.

This was my hit list last year, and it has been interesting to see the progress that we’ve made and the things that are still on my to-do list:

  • Service monitoring
  • Error tracking
  • Performance metrics
  • Production debugging
  • Safety nets

Service Monitoring

The first and most simple thing that you can do to keep track of your JavaScript app is to put basic monitoring in place. For example, you should check whether your app is there or not.

We serve our application from a few different domains: there’s an API that provides the data for us, and a few different hosts that serve static resources (both via HTTP and HTTPS). We also maintain a server where customers can configure their application and see what it will look like as they make changes. So they might change their stars from yellow to red, and we have a server that shows, in pretty close to real time, what those changes will look like.

Pingdom

We use Pingdom just to check if the app is up or not. With Pingdom, you just set up some URLs, and Pingdom goes and visits those URLs on a super regular basis and makes a note of their response time. This way, you will know if your system goes down, which is nice to know but not super useful, as we are serving static resources from S3. If one of these hosts stops responding, there’s probably very little we could do about it. Moreover, it is difficult to do too much with response time because variability is expected, and even if you get a crazy spike, it is not meaningful unless it stays high.

We tried Pingdom’s response time alerts with PagerDuty before and got some false alerts at two in the morning. Everybody was angry and so we decided to turn those off.

DataDog

DataDog is a great service: if you have a metric and want to know when it goes out of bounds, DataDog is super helpful for that. Since it was difficult to act on response time directly, we settled on feeding our Pingdom data into DataDog and letting it alert us when a metric stays above a certain threshold for a period of time.

However, even then, we were not getting much value out of it. Whether the whole internet is having a problem or the API is down, neither case is within our team’s control. We can alert other people that it is happening, but there’s not a lot we can do about it ourselves.

Error Tracking

What about when there just aren’t stars on the page? If the stars are not on the page because the API is down, we’ve already got that case covered. However, what if the stars are not on the page because the customer did not define a product ID? They are asking for stars, but they are not saying what they want stars for. Can we get an alert when that happens, and how?

BVTracker

BVTracker.error('No productId is defined');

From the pretty early days of the Firebird app, we had been sending error reports to our internal analytics tool via BVTracker, which is just a class that we wrote to talk to that tool. You could do the same kind of thing with Google Analytics by sending events.
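As a rough sketch of the idea (not Bazaarvoice’s actual BVTracker), a minimal error channel that forwards events to the classic analytics.js API could look like this:

// A minimal sketch of a BVTracker-style error channel, assuming the classic
// analytics.js ga('send', 'event', ...) API is available on the page.
var BVTracker = {
    error : function (message) {
        if (typeof window.ga === 'function') {
            // Category "JS Error", the message as the action, and the host
            // page's domain as the label so errors can be grouped per site.
            window.ga('send', 'event', 'JS Error', String(message), window.location.hostname);
        }
    }
};

BVTracker.error('No productId is defined');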

When there was an error, we would send an error event. However, when you are getting a million errors, a plain string does not tell you a whole lot unless you already know that a particular string is important. It is hard to separate the important errors from the unimportant ones if everything is just an error. So we started sending structured error events instead:

BVTracker.error({
    name : errorMessages.IMPLEMENTATION,
    detail1 : 'No productId is defined',
    detail2 : component.name
});

This was a high-value change, and not a hard one to make.

Using Kibana for Alerts with Context

We wanted to have context-appropriate alerting for different error types, and we wanted the alerts to carry information, not just a string. Certain errors are PagerDuty-worthy, but lots of errors are not. We still want to know about them, but we do not need to wake anyone up.

define({
    error : {
        IMPLEMENTATION : 'Implementation Error',
        API : 'API Error',
        UNCAUGHT : 'Uncaught Exception',
        CONFIG : 'Configuration Error',
        THIRD_PARTY : 'Third Party Service Error',
        TIMEOUT : 'Request Timeout',
        UI : 'User interface Error',
        FRAMEWORK : 'Framework Error'
    }
});

As you can see from the code above, we diminished the importance of the message itself, “No productId is defined,” and instead classified the error: if no product ID is defined, this is an implementation error. It is a mistake made by the customer, not an error in our code. If this particular error condition has been hit, it is because a customer got the implementation wrong, not because our code is broken and we need to wake people up.

We came up with a set of categories for the kinds of errors that we were seeing. We went through our code and found all the places we were sending errors, did a little bit of assessment of what those errors were and said, “All these errors fall under one of these buckets and based on these buckets, I can now make a rational decision about how to alert on these. I am not going to alert on a request timeout unless there’s a lot of them. I am not going to wake up a developer over an implementation error. However, if there are a lot of them, I am going to make sure that customer support can go work with that customer to fix them. If there’s a third party service error – good to know about, not a whole lot we can do about it. So again, let’s not wake anyone up.”
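As a hedged sketch of the kind of routing this classification enables (the thresholds and notification helpers here are hypothetical, not Bazaarvoice’s actual alerting configuration):

// Map error categories to hypothetical alerting rules. The count thresholds
// and the notification stubs below are illustrative only.
function pageOnCall(name, count)    { console.log('PAGE:', name, count); }
function notifySupport(name, count) { console.log('SUPPORT:', name, count); }
function logOnly(name, count)       { console.log('LOG:', name, count); }

var alertRules = {
    'Uncaught Exception'         : { perHour : 10,   notify : pageOnCall },
    'Framework Error'            : { perHour : 10,   notify : pageOnCall },
    'API Error'                  : { perHour : 100,  notify : pageOnCall },
    'Request Timeout'            : { perHour : 1000, notify : logOnly },
    'Implementation Error'       : { perHour : 50,   notify : notifySupport },
    'Third Party Service Error'  : { perHour : 500,  notify : logOnly }
};

// Given hourly counts per error name, fire the matching notification.
function checkAlerts(hourlyCounts) {
    Object.keys(hourlyCounts).forEach(function (name) {
        var rule = alertRules[name];
        if (rule && hourlyCounts[name] >= rule.perHour) {
            rule.notify(name, hourlyCounts[name]);
        }
    });
}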

We worked with our analytics team to get access to all the data we were sending them, and we were able to use Kibana, which is just a UI on top of Elasticsearch. We were able to dump all that data into Elasticsearch and, with very little setup, get a Kibana interface that let us see those errors in nearly real time. Kibana lets us drill down into these errors so we can see what was actually happening. Then we were able to create dashboards isolated to just the errors we considered critical, and other dashboards for implementation errors that were important but not to the point where we have to wake people up about them.

This is a screenshot of a bad thing that happened: a customer had implemented the application in a way that used far more API requests than it should have, so their API requests were being denied; they were essentially being rate-limited.

Because we had set up and done all the error-reporting and we had worked with the analytics team to get the data into Elastic, we were able to see this incident happen in nearly real time, get an alert about it because the number of critical errors exceeded a certain threshold, and then we were actually able to tell the customer that they had a problem.

The customer was not the first to find out they had a problem. So it wasn’t the customer calling us and saying, “There are no stars”; it was us calling the customer and saying, “There are no stars, and here’s how to fix it, because it’s an error you have made on your end.”

We developed a web app that lets people outside of our team get visibility into those errors on a per-customer basis. This is a report that you can access for a customer that gives you a summary of the errors for that customer in the last twenty-four hours. If a customer is experiencing issues, you can go look at this page and quickly see what’s happening. This is a cool thing that was able to shift some of that initial investigation work off of the development team and over to support.

Performance Tracking

The way our application works is that any customer can request a new build of their app at any time. If they turn a feature on or off, change the color of their stars, or otherwise make a configuration change, we generate a new set of static resources that reflect that change. It is hard to test all possible cases because customers can toggle many features and have a lot of configuration options.

So, what we did is send data to DataDog whenever we do a new build, and then we can see what our build sizes and build times are and how many builds are waiting in the queue. Our builds are run by a Grunt task, and we just put some instrumentation into that Grunt task at key moments to send an event to DataDog. Using this, we’re able to see whether anything is happening that we don’t expect to be happening.
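As a sketch of that idea (the metric names, file path, and helper below are hypothetical, and the real instrumentation differs), a Grunt task can time the build and report the resulting bundle size to DataDog’s HTTP metrics API:

// Sketch of instrumenting a Grunt build; sendMetric posts a custom metric to
// DataDog's v1 series endpoint. Metric names and paths here are hypothetical.
var https = require('https');
var fs = require('fs');

function sendMetric(name, value) {
    var payload = JSON.stringify({
        series : [{ metric : name, points : [[Math.floor(Date.now() / 1000), value]] }]
    });
    var req = https.request({
        hostname : 'api.datadoghq.com',
        path     : '/api/v1/series?api_key=' + process.env.DATADOG_API_KEY,
        method   : 'POST',
        headers  : { 'Content-Type' : 'application/json' }
    });
    req.end(payload);
}

module.exports = function (grunt) {
    grunt.registerTask('timed-build', function () {
        var start = Date.now();
        grunt.task.run('build');                         // the real build task (hypothetical name)
        grunt.task.run('report-build-metrics:' + start); // report once it finishes
    });

    grunt.registerTask('report-build-metrics', function (start) {
        sendMetric('firebird.build.duration_ms', Date.now() - Number(start));
        sendMetric('firebird.build.size_bytes', fs.statSync('dist/bv-primary.js').size);
    });
};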

BV-Primary is our core application, and it is the first yellow box in this screenshot. We are uncomfortable with the size of that build and want it to be smaller, so we’ve set up an alert in DataDog to let us know if the build size exceeds a certain threshold and we need to do something about it. Ditto for the other files. Bvapi.js is the very first file that we load; it’s called the scout file, and it needs to stay under 14.5k so that it can be downloaded in a single round trip (roughly what a server can send in the initial TCP congestion window) rather than several. We have an alert set up so that if we start to get close to that 14.5k limit, PagerDuty will yell at us and we will need to go figure out what to do about it.

It’d be cool to get this sort of instrumentation going in our development process. For example, as you push a new branch, have it do a build and see what size it is. However, build sizes can vary wildly based on the configuration options that people have turned on. Even on this graph, you can see there’s pretty big movement. We certainly should get that instrumentation in place for individual pull requests. However, in the meantime, and even after we do that, this is still valuable because it lets us see what’s happening in reality, not just in our development environment. Those sorts of concerns are unique to third-party JavaScript.

If your JavaScript application doesn’t run on a bunch of different sites with a bunch of different configurations, a much smarter place to do the monitoring is at the pull request level, not at the actual build level. That’s because you may only do one build per deployment and, in that case, a build is a build, so the numbers are always going to be the same. We, on the other hand, build thousands of these on a busy day, and those numbers can be all over the map, so we need to be able to see what’s happening in the real world.

Instrumenting Key Moments

Another thing that we’ve done to monitor performance is to instrument key moments in the application itself. When are all of the resources for the application loaded? When do we actually show up on the page? When are we finally visible on the page? What’s the time between when all the resources are loaded and when we show up on the page?

My teammate Lauren Ingram did a bunch of work around this using the window.performance API (the performance timing API) and was able to instrument a bunch of interesting moments in the application. Just as we were sending error data to our analytics tool, we started sending diagnostic data as well. The analytics team was then able to get that data back to us in Elasticsearch. We put Kibana on top of Elasticsearch, and now we’re able to get great visualizations of the real-world experience of our users. This thing above is not running on a WebPageTest instance or anything; this is what we see in the world.
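A simplified sketch of that kind of instrumentation, assuming a hypothetical BVTracker.diagnostic channel alongside the error channel shown earlier:

// Mark interesting moments with the browser's performance API, then report
// them through a (hypothetical) diagnostics channel.
function markMoment(name) {
    if (window.performance && performance.mark) {
        performance.mark(name);
    }
}

function reportMoments() {
    if (!window.performance || !performance.getEntriesByType) { return; }
    var diagnostics = {};
    performance.getEntriesByType('mark').forEach(function (mark) {
        // startTime is milliseconds since navigation start.
        diagnostics[mark.name] = Math.round(mark.startTime);
    });
    BVTracker.diagnostic(diagnostics); // hypothetical, mirrors BVTracker.error
}

// e.g. markMoment('firebird:resources-loaded') once all modules are in,
// markMoment('firebird:visible') right after the reviews render, then
// reportMoments() once the app settles.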

We can filter this by user region, by operating system, and all sorts of other things. If we want to drill down to see what the mobile experience is, we can see that. We can separate the mobile experience from the desktop experience, or the IE8 experience from the Chrome experience. There are two challenges here. One is figuring out where those key moments are in our application. The other, the hard problem for us, is that it’s just an ocean of data. You can see seven lines in the picture above, which means we’re sending at least seven diagnostic events for every page view. For hundreds of millions of page views every month, we need to get those seven events back out. At the scale we’re operating at, one of the hard problems is simply the sheer magnitude of the data. If you’re operating at a smaller scale, this is a much easier problem.

It’s possible you don’t have to solve the problem of that quantity of data. You could handle this in a couple of different ways. One is to have a dial you can turn to say what percentage of page views will send these events at all; or it may be that your audience size is more reasonable and you can just send them for all page views. You can send these events to Google Analytics or something similar and get your data out of there.
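A sketch of the “dial” approach, with a hypothetical sampling rate and the same hypothetical diagnostics channel as above:

// Only a configurable fraction of page views send diagnostic events at all.
var DIAGNOSTIC_SAMPLE_RATE = 0.05; // hypothetical: ~5% of page views

var sendDiagnosticsForThisPageView = Math.random() < DIAGNOSTIC_SAMPLE_RATE;

function maybeSendDiagnostic(event) {
    if (sendDiagnosticsForThisPageView) {
        BVTracker.diagnostic(event); // hypothetical diagnostics channel
    }
}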

Debugging Common Mistakes

Stars are money for customers, and we want to minimize the amount of time that there are not stars. We put some tooling in place that lets us do this pretty quickly.

This is a tool that lets us rapidly recreate a scenario.

If we know the customer and we know they’re having a problem in production, we can go to this page and see that for this customer in production, on the site called The Danger Zone, Firebird isn’t actually installed. No wonder they’re having a problem, and we can go investigate that. This tool gives us visibility into some common mistakes so we can quickly identify them.

Also, the prod/local links are development implementations of whatever the customer has in production, so we can experience the problem they’re having on our own machine with code we can actually manipulate, rather than having to try to manipulate production code. These are just tools that automate common debugging processes and gather the common pieces of information you want when you are debugging.

Production Debugging

Another issue one of our customers had was that the feedback buttons above weren’t working in production. These are feedback buttons in the user interface, and when you click on them, they should increment. However, they weren’t working for that client, and we couldn’t figure out why.

We opened up our console, but there’s nothing there and that’s on purpose. We don’t want our application cluttering the console with noise. So we wrap all of our errors in try/catch, and we don’t let them go to the console because customers get grumpy when we’re in the console. We’re intentionally silent here, and this is a bummer when you wish you knew what was going on.

However, we added a couple of lines of code to our application so that if a certain cookie is set, we suddenly make all kinds of noise in the console: we no longer catch errors, and we also spit out a lot of information about the lifecycle of the application. You can very quickly see the error happening when we were rendering the top view for the BVQAContainer. It was easy to see what was going on and to see errors without having to turn on break-on-caught-exceptions, which can be painful because jQuery, for example, has all kinds of caught exceptions in it. Catching exceptions is a good development pattern, but it can be a pain when you’re trying to debug.
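A simplified sketch of that cookie-gated debug mode; the cookie name and logger are hypothetical, but the shape of the idea is the same:

// If a debug cookie is set, stop swallowing errors and start narrating the
// application lifecycle; otherwise stay silent so the host page's console
// stays clean.
var DEBUG = /(^|;\s*)bvDebug=true/.test(document.cookie); // hypothetical cookie

function logLifecycle(message, data) {
    if (DEBUG && window.console) {
        console.log('[firebird]', message, data);
    }
}

function runSafely(fn, context) {
    if (DEBUG) {
        return fn.call(context); // let errors surface in the console
    }
    try {
        return fn.call(context);
    } catch (e) {
        BVTracker.error({ name : 'Uncaught Exception', detail1 : e.message });
    }
}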

Source maps

The other thing, of course (and I’m embarrassed to say we haven’t done it yet), that enables production debugging is source maps. Source maps let you see the un-minified, fully commented version of your code at the click of a button and debug using that rather than the minified code. We also use Charles Proxy a lot to proxy the production files to something on our local machine, but source maps would be a great tool for us as well.
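For reference, turning on source maps in a Grunt + UglifyJS setup looks roughly like this; option names vary across grunt-contrib-uglify versions, and the paths here are hypothetical, so treat it as illustrative rather than a drop-in config:

// Gruntfile.js -- emit a .map file next to the minified bundle so devtools
// can show the original source.
module.exports = function (grunt) {
    grunt.initConfig({
        uglify : {
            firebird : {
                options : { sourceMap : true },
                files   : { 'dist/bv-primary.js' : ['src/**/*.js'] } // hypothetical paths
            }
        }
    });
    grunt.loadNpmTasks('grunt-contrib-uglify');
};

// The minified file then ends with a comment like
//   //# sourceMappingURL=bv-primary.js.map
// which browsers use to map minified stack frames back to the original code.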

Safety Nets

Leaking a global variable is especially bad for third-party JavaScript apps because you might be stomping on someone else’s global. One thing you can do to get your app to tell you it’s in a good state, on your computer and not even out in the wild, is to set up Git hooks. The important thing is to make sure that whatever hooks you set up are low-friction. You don’t want a five-minute process that runs every time you commit; maybe you save that process for pre-push, and maybe you figure out why it takes five minutes and get it down to something less. Git hooks are a great way to test code on your computer. We mostly use them for running the unit tests, as long as they’re fast, and for running code-quality tools like JSHint and JSCS.
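A low-friction pre-commit hook can be as small as the Node script below (saved as .git/hooks/pre-commit and made executable); the exact commands are illustrative, not the hooks Bazaarvoice actually uses:

#!/usr/bin/env node
// Fast checks only at commit time -- lint and code style; the slower test
// suite is left for pre-push or continuous integration.
var execSync = require('child_process').execSync;

try {
    execSync('jshint src/', { stdio : 'inherit' });
    execSync('jscs src/',   { stdio : 'inherit' });
} catch (e) {
    console.error('Pre-commit checks failed; aborting the commit.');
    process.exit(1);
}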

We have continuous integration set up so that tests that take longer run only on pull requests. Once a pull request exists, the continuous integration server kicks off, starts running our tests, and the pull request turns red or green.

You’ve probably seen cryptic, generic errors from your code and thought, “Now I have to go figure out what that even is.” Assertions in your code are lifesavers.

We’ve put assertions in our code around key moments where something might go wrong, and now we get error messages that actually tell us what was undefined rather than a useless generic message.

BVReporter.assert(
    this.componentId,
    'View ' + this.name + ' must have componentId'
);

That’s an easy thing to add to your application any time you have an argument that is required. In JavaScript, you can use an assertion to say, “If I don’t have this.componentId right now, let’s just throw an error; nothing is going to go well from here on out.” This is especially valuable in async code: if you don’t check up front that you have what you need, you might not see an error until some later turn of the event loop, and that can be really hard to debug. Putting assertions in your code is another great way for your app to talk back to you during development, and to do it in a useful way.
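For illustration, an assert helper along those lines could be implemented like this; it is a sketch, not the actual BVReporter code:

// Throw (and report) immediately when a required value is missing, so the
// failure points at the real cause instead of a later symptom.
var BVReporter = {
    assert : function (condition, message) {
        if (!condition) {
            BVTracker.error({ name : 'Framework Error', detail1 : message });
            throw new Error(message);
        }
    }
};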



Author
Rebecca Murphey
Rebecca Murphey is a staff software engineer at Bazaarvoice, where she leads a team that shepherds third-party JavaScript application development for products with a reach of hundreds of millions...