Building a logging system using the ELK stack (Elasticsearch, Logstash, Kibana)

Published Feb 05, 2018. Last updated Feb 19, 2018.

In recent months, the engineering team here at Codementor started building our own logging system. We put the popular ELK (Elasticsearch, Logstash, Kibana) stack to the test and learned how to build a good logging system through this process. Here’s what we learned.

Why we need a logging system

The first thing to consider is whether you really need a logging system. We have a few specific issues we feel a logging system can address:

  • We want to trace a user's path through the system given a user_id, which makes debugging easier for developers
  • We want to be able to see our entire system's health status, request throughput, and response time in one place
  • We want to know a specific system's resource usage (e.g. memory, load)

Some of these issues can be solved by using Datadog, and some with Google Analytics, but they have their limits. For example, we use Heroku to host our applications, and it requires additional setup for Datadog. Because Heroku doesn’t provide detailed metrics, the information we can see in Datadog is limited. Based on our usage, we also couldn’t justify Datadog’s cost. Google Analytics, on the other hand, is awesome when you want to see overall website usage, but it can't help when you try to see a specific user's request history.

How to build your ELK stack (Elasticsearch, Logstash, Kibana)

Ultimately, we decided to host an ELK stack as our logging system. We laid out a few requirements to help us decide which hosting solution to use.

  • Easy to scale
  • We don't want to spend too much time maintaining the ELK stack
  • It can't be too expensive

We evaluated quite a few solutions:

Logit

https://logit.io/
It provides a whole ELK stack and you can start your own ELK system with little setup.

Logz.io

https://logz.io/
Elasticsearch only, no Logstash. It provides a custom log shipper for Heroku.

Qbox

https://qbox.io/
Elasticsearch only, no Logstash.

Elastic Cloud

https://www.elastic.co/cloud
Elasticsearch only, no Logstash.

Logit provides Logstash as well as Elasticsearch, so you don't need to host Logstash yourself, which made it the most convenient option. It’s fairly expensive, however, so we decided to find an alternative that also has a quick setup and low maintenance costs.

Since we were already familiar with AWS, we decided to try AWS Elasticsearch. We used ECS to host the Logstash container. This series of posts was a great resource and saved us a ton of time setting up Logstash.

The entire AWS ELK solution will take you a lot of time to set up. Since this post is focused on ELK itself, I’ll write another post discussing how to set up ELK in AWS.

After you’ve completed the setup for your ELK stack, regardless of which solution you chose, there are a few issues you need to address.

Your log

You have to decide what information you want to collect. This will influence many things, including how you write your Logstash config, how you estimate your daily disk usage, and how many requests your Logstash server will receive per minute or second. Your instinct might be to collect everything, but you’ll find it difficult to manage messy and meaningless logs in your Kibana dashboard.

That means it’s a good idea to think about this from day one. What kind of logs do you want to ship to your Elasticsearch? What kind of logs can you just drop?

Before releasing your awesome new logging system to production, I find it’s a good idea to ingest all of your data into Elasticsearch in a trial run, try the queries you’d like this logging system to answer, and then prune all unnecessary data. You’ll see what happens when nothing is filtered, and you’ll get a better sense of what data you really need.

I also find it really hard to get every log into the correct format. There are a bunch of edge cases you can't possibly imagine when you start to write your Logstash config. You should adjust your Logstash config as you go until you’re happy with the result.

Authentication

Authentication is a serious issue that you should spend time on. You need to make sure your Logstash server can't ever be accessed publicly. At the very least, you should add HTTP auth to it. You should also check your Elasticsearch and Kibana servers, especially your Kibana website, since you will put dashboards on it. Kibana will be accessed mostly by people in your organization, so you need to think of a good way to manage access control. In our case, we added an HTTP proxy server in front of Kibana, and used the same JWT authentication that is used on our admin site. This means only people who can access the admin site can access our Kibana server.

Performance and storage

Test your Logstash server using production load and decide how many Logstash instances you need. There are also some useful configurations that can help you increase throughput. For example, you can set the queue type to persisted (queue.type: persisted in logstash.yml) so that Logstash buffers incoming events on disk and processes them later, which helps it absorb bursts of requests. You can find details about this here. Official documents like Performance Troubleshooting Guide and Deploying and Scaling Logstash are also worth a read!

There is also another critical issue you need to think about:

How many days worth of logs do you want to retain?

The answer to this question will largely depend on the use case for your logging system. In general, storing logs for at least 14 days should cover most common use cases. After deciding the log retention, you should estimate how much disk space you will consume in one day, and multiply that by the number of days you want to retain. This will ensure your Elasticsearch disk is large enough. You should also decide how many shards you want to keep in one index. This is an important setting for Elasticsearch that greatly affects performance.
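The storage estimate is simple arithmetic, and a small helper makes it easy to try different retention periods. This is a minimal sketch: the function name is hypothetical, and the replicas and headroom parameters are assumptions added here (replica copies and some free space for growth both increase the real footprint), not figures from our setup.

```python
def required_disk_gb(daily_log_gb, retention_days, replicas=0, headroom=0.2):
    """Estimate the Elasticsearch storage needed for a retention window.

    daily_log_gb   -- indexed size of one day of logs (measure this, don't guess)
    retention_days -- number of days of logs to keep
    replicas       -- replica copies per primary shard (assumption: none)
    headroom       -- extra fraction kept free for growth (assumption: 20%)
    """
    if daily_log_gb < 0 or retention_days < 0 or replicas < 0 or headroom < 0:
        raise ValueError("all inputs must be non-negative")
    # Daily usage multiplied by the retention period, once per stored copy.
    raw_gb = daily_log_gb * retention_days * (1 + replicas)
    return raw_gb * (1 + headroom)


if __name__ == "__main__":
    # Example: 3 GB of logs per day, kept for 14 days, no replicas.
    print(round(required_disk_gb(3, 14), 1), "GB")
```

Setting headroom to 0 gives the plain "daily usage times retention days" figure described above.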

A shard is the core component in Elasticsearch for writing and querying documents (your logs). The more shards you have, the more write and read performance you can gain, but the more system resources (CPU, memory, etc.) will be consumed. The default setting for shards is five, which is a bit too many for common logging cases, unless you ingest a large volume of logs each day.

There are some awesome posts about shards, like Capacity Planning and Optimizing Elasticsearch: How Many Shards per Index?

You can change the number of shards for future indices with an index template, like this:

// PUT https://es.your-awesome-domain.com/_template/logstash1

{
    "order": 1,
    "template": "logstash-*",
    "settings": {
        "number_of_shards" : 2
    }
}

Remember, you can't change the number of primary shards after the index is created, so be sure to choose the correct number!

After your logging system is in production

Check your Elasticsearch free storage space regularly! It's very important to make sure you have enough space to write new logs. You can delete old indices to free up space when needed.
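Deciding which indices are old enough to delete is easy to automate. The sketch below assumes daily indices named in the Logstash default style (logstash-YYYY.MM.DD) and only computes the list; the function name is hypothetical, and the actual deletion (an HTTP DELETE request for each index) is left to whatever client you already use.

```python
from datetime import date, datetime, timedelta


def indices_to_delete(index_names, today, retention_days, prefix="logstash-"):
    """Return daily indices dated before today - retention_days, oldest first."""
    cutoff = today - timedelta(days=retention_days)
    expired = []
    for name in index_names:
        if not name.startswith(prefix):
            continue  # not a log index (e.g. .kibana), leave it alone
        try:
            day = datetime.strptime(name[len(prefix):], "%Y.%m.%d").date()
        except ValueError:
            continue  # no parseable date suffix, skip rather than guess
        if day < cutoff:
            expired.append((day, name))
    return [name for _, name in sorted(expired)]


if __name__ == "__main__":
    names = [".kibana", "logstash-2018.02.01", "logstash-2018.02.19"]
    for name in indices_to_delete(names, today=date(2018, 2, 19), retention_days=14):
        print("would delete", name)
```

Running something like this on a schedule, after the daily backup has finished, keeps disk usage predictable.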

Don’t forget to also back up your Elasticsearch. Remember the GitLab disaster? We all make mistakes! A good backup solution can mitigate any damage caused by your fat finger or something you can't control. For our solution, we perform a backup once a day. Backups won’t be something you’re aware of in everyday use, but they may just save you when disaster strikes.

Keep an eye on your Logstash server

Use Datadog, CloudWatch, or other similar tools to monitor your Logstash server. If your Logstash is down, incoming logs will be discarded, so you should host at least two Logstash servers to achieve high availability.

A wrong configuration will cause incorrect data formats in Elasticsearch, which will reduce the quality of query results. When you change your Logstash configuration, run some tests before deploying it to production. Check that the parsed results are correctly formatted. The rubydebug codec is a good friend when you are adjusting the configuration. Blue-green deployment and rolling deployment are good patterns for deploying your Logstash server. They can help you avoid server downtime.
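Eyeballing rubydebug output works for a handful of events, but a scripted check scales better. The sketch below assumes a test run whose output writes one JSON object per line (for example with the json_lines codec instead of rubydebug). The REQUIRED_FIELDS schema and the function name are hypothetical, so substitute the fields your own dashboards depend on.

```python
import json

# Hypothetical schema: field name -> accepted Python type(s) after JSON decoding.
REQUIRED_FIELDS = {"@timestamp": str, "message": str, "user_id": (int, str)}


def find_bad_events(lines, required=REQUIRED_FIELDS):
    """Return (line_number, reason) for every event that fails the checks."""
    problems = []
    for lineno, line in enumerate(lines, start=1):
        line = line.strip()
        if not line:
            continue  # ignore blank lines between events
        try:
            event = json.loads(line)
        except ValueError:
            problems.append((lineno, "not valid JSON"))
            continue
        if not isinstance(event, dict):
            problems.append((lineno, "event is not a JSON object"))
            continue
        # The grok filter adds this tag when no pattern matched the message.
        if "_grokparsefailure" in (event.get("tags") or []):
            problems.append((lineno, "grok failed to parse the message"))
            continue
        for field, expected_type in required.items():
            if field not in event:
                problems.append((lineno, "missing field " + field))
            elif not isinstance(event[field], expected_type):
                problems.append((lineno, "unexpected type for field " + field))
    return problems


if __name__ == "__main__":
    sample = ['{"@timestamp": "2018-02-05T10:00:00.000Z", "message": "GET /", "user_id": 42}']
    print(find_bad_events(sample) or "all events look fine")
```

Feeding a sample of real production log lines through the new configuration and failing the deploy when this list is non-empty catches most formatting regressions early.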

Conclusion

Building a logging system is like building a project — there’s always room for improvement. Focus on what your needs are when building the logging system, but don’t be afraid to adjust as you go. Since deploying our logging system, our engineering and product teams have been able to gather more insights into our users to help with debugging and even product roadmapping.

Have you ever built a logging system before? What do you think of the ELK stack compared to other solutions you’ve tried in the past? I’d love to hear your comments!

Discover and read more posts from Ben at Codementor