
What is Infrastructure?

Published Apr 12, 2019

This post came out of an in-chat discussion with a junior engineer who asked what the Go language is useful for. He initially asked whether it works well for web apps, and while I said it can be and sometimes is used for that, that's not the primary area people use it in. The primary area is infrastructure code. That led to the question: what is infrastructure code?

The best way to illustrate infrastructure is with an example. Let's imagine we're building a photo-sharing service, call it Pinstablr. We'll start off in standard startup mode with an MVP: a basic JS client, a REST API, and a database (MySQL in this case). Looks something like this:

[Figure (infra1.png): the MVP architecture, with a JS client, a REST API, and a MySQL database]

Here, we already have some infrastructure in place. The database is the infrastructure for storing data that needs to be easily queryable, like user and image metadata, and the file system is the infrastructure for storing the photos themselves. We'll also need some sort of hosting provider like AWS to host this app on the Internet, and that provider will typically run some sort of virtualization infrastructure on top of its hardware so that your code runs independently of the other people running code on the same machines.
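To make the split concrete, here's a minimal sketch of what one endpoint of that REST API might look like if we wrote it in Go. Everything specific here (the `photos` table, the storage path, the credentials) is invented for illustration:

```go
package main

import (
	"database/sql"
	"fmt"
	"io"
	"log"
	"net/http"
	"os"
	"path/filepath"

	_ "github.com/go-sql-driver/mysql" // MySQL driver registers itself with database/sql
)

var db *sql.DB

// uploadHandler stores the photo bytes on the file system and the
// queryable metadata (owner, path) in MySQL.
func uploadHandler(w http.ResponseWriter, r *http.Request) {
	file, header, err := r.FormFile("photo")
	if err != nil {
		http.Error(w, "missing photo", http.StatusBadRequest)
		return
	}
	defer file.Close()

	// The photo itself goes to the file system...
	path := filepath.Join("/var/pinstablr/photos", header.Filename)
	dst, err := os.Create(path)
	if err != nil {
		http.Error(w, "storage error", http.StatusInternalServerError)
		return
	}
	defer dst.Close()
	if _, err := io.Copy(dst, file); err != nil {
		http.Error(w, "storage error", http.StatusInternalServerError)
		return
	}

	// ...while the easily-queryable metadata goes to MySQL.
	if _, err := db.Exec("INSERT INTO photos (user_id, path) VALUES (?, ?)",
		r.FormValue("user_id"), path); err != nil {
		http.Error(w, "db error", http.StatusInternalServerError)
		return
	}
	fmt.Fprintln(w, "ok")
}

func main() {
	var err error
	db, err = sql.Open("mysql", "pinstablr:secret@tcp(localhost:3306)/pinstablr")
	if err != nil {
		log.Fatal(err)
	}
	http.HandleFunc("/photos", uploadHandler)
	log.Fatal(http.ListenAndServe(":8080", nil))
}
```

Note how the handler leans on two pieces of infrastructure without containing any of their internals: the file system holds the bytes, MySQL holds the queryable metadata.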

But this little app alone won't do very well when we release it to the public. There is just a single API server, which will probably fall over once we get any real publicity for our app. The same goes for the database.

Fortunately this type of scaling is a solved problem in the tech world. Do a bit of research and you can find a few tricks that improve scalability without requiring a whole lot of work:

  • Horizontally scale the API server by having multiple instances of it, stick a load balancer in front to distribute the requests.
  • Since we're likely to get far more GET requests than anything else, we can use a master-slave DB architecture to horizontally scale reads.
  • Some content will be accessed far more than the rest (hot content), while a long tail of content is rarely viewed. The hot content can be served from an in-memory caching system like Redis or memcached (see the cache-aside sketch after this list).
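Here's what the caching trick might look like in Go using the go-redis client. This is a cache-aside sketch; the key naming and the ten-minute TTL are arbitrary choices for illustration:

```go
package main

import (
	"context"
	"database/sql"
	"time"

	"github.com/go-redis/redis/v8"
)

var (
	ctx = context.Background()
	rdb = redis.NewClient(&redis.Options{Addr: "localhost:6379"})
)

// getPhotoPath tries the cache first and only falls through to the
// database on a miss (cache-aside), so hot content stays in memory.
func getPhotoPath(db *sql.DB, photoID string) (string, error) {
	key := "photo:" + photoID

	// 1. Check Redis.
	path, err := rdb.Get(ctx, key).Result()
	if err == nil {
		return path, nil // cache hit
	}
	if err != redis.Nil {
		return "", err // a real Redis error, not just a miss
	}

	// 2. Miss: read from the database.
	if err := db.QueryRow(
		"SELECT path FROM photos WHERE id = ?", photoID).Scan(&path); err != nil {
		return "", err
	}

	// 3. Populate the cache with a TTL so stale entries eventually expire.
	rdb.Set(ctx, key, path, 10*time.Minute)
	return path, nil
}
```

The same function also illustrates the read-scaling trick: the `db` handle it receives would point at one of the read slaves, not the master.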

Now we look something like this:

[Figure (infra2.png): the scaled architecture, with a load balancer in front of multiple API servers, an in-memory cache, and a master-slave database setup]

Now we have more software that doesn't affect the business logic or the user interface but is necessary to scale the product. This is infrastructure.

Let's take this example a bit further, since it can help us understand some of the newer developments in infrastructure like containerization (aka Docker) and orchestration (aka Kubernetes).

After running Pinstablr for a while and scaling it up (the current architecture should get us to millions of users) we start noticing the costs of our system are going up. Our current approach is simple, but not cost-effective: when things get slow, add more servers. If we are getting the exponential growth that many companies get, our costs will also increase exponentially. Furthermore, we can end up hitting some barriers that require us to do something other than just add more servers.

Let's say that the following things come up:

  • Doing a little bit of profiling and investigation, we discover that the API servers are spending 98% of their CPU time resizing and reformatting images (since users upload images in all different sizes and formats).
  • Our newly established marketing department wants to generate reports, so they're hammering our DBs with complex and unoptimized SQL queries. This chews up resources on our read DBs.
  • Our newly established UX team wants to run experiments to try to optimize the user experience on the site and mobile apps. They also need to run SQL queries and do analysis to see the impact of different experiments, and they'll need to manage new tables that organize those experiments.

We can solve these problems fairly easily:

  • Offload image processing to servers dedicated to image processing, using a worker pool fed by a task queue system like RabbitMQ (see the sketch after this list).
  • Split off a separate read DB for report generation, and have a process that periodically generates reports for the marketing team. Splitting off the DB is better than letting them hit the read DBs directly because it isolates the report generation from the production environment.
  • Split off yet another read DB for experiments, plus a new read-write DB that can be used for the experiment schema. Have a UI for the experimenters to work with their data.
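The first bullet is the classic producer/consumer setup. Below is a rough sketch using the streadway/amqp client: the API servers would call something like `enqueueResize`, while the dedicated image servers run the worker pool. The queue name and the `resizeImage` helper are hypothetical:

```go
package main

import (
	"log"

	"github.com/streadway/amqp"
)

// enqueueResize publishes an image-processing task; the API servers
// call this instead of resizing images inline.
func enqueueResize(ch *amqp.Channel, photoPath string) error {
	return ch.Publish(
		"",             // default exchange
		"image_resize", // routing key = queue name
		false, false,
		amqp.Publishing{ContentType: "text/plain", Body: []byte(photoPath)},
	)
}

// resizeImage is a stand-in for the actual CPU-heavy work.
func resizeImage(path string) {
	log.Printf("resizing %s ...", path)
}

func main() {
	conn, err := amqp.Dial("amqp://guest:guest@localhost:5672/")
	if err != nil {
		log.Fatal(err)
	}
	defer conn.Close()

	ch, err := conn.Channel()
	if err != nil {
		log.Fatal(err)
	}
	// Durable queue so tasks survive a broker restart.
	if _, err := ch.QueueDeclare("image_resize", true, false, false, false, nil); err != nil {
		log.Fatal(err)
	}

	// Producer side: enqueue one task.
	if err := enqueueResize(ch, "/var/pinstablr/photos/cat.jpg"); err != nil {
		log.Fatal(err)
	}

	// Consumer side: manual acks so a crashed worker's task gets redelivered.
	msgs, err := ch.Consume("image_resize", "", false, false, false, false, nil)
	if err != nil {
		log.Fatal(err)
	}

	// A small worker pool: each goroutine pulls tasks off the queue
	// and does the resizing.
	for i := 0; i < 4; i++ {
		go func() {
			for msg := range msgs {
				resizeImage(string(msg.Body))
				msg.Ack(false)
			}
		}()
	}
	select {} // block forever
}
```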

Now we look like this:

[Figure (infra3.png): the expanded architecture, adding dedicated image-processing workers behind a task queue plus separate reporting and experiment databases]

Well. This is complicated. And we haven't even added in anything to do with money or advertising, nor have we added staging environments. We still have some single points of failure like the master database that can become overloaded. We haven't addressed how images are being stored.

We'd also need to do something to manage permissions: the team building the UI and DB for the experiments is probably different from the reporting team, which is different from the team that manages production, and so on. Giving this many people global access to everything is a great recipe for disaster.

If we think this is complicated, imagine what the engineers on our newly hired devops team are doing right now! Other than looking for a new job of course, because this is a nightmare.

Docker helps make this process easier by packaging each of these little boxes into a container. Containers can be built and run locally on a developer's machine, even on Mac or Windows, to ensure that they work, and then pushed to a container registry from which they can be deployed to production. Because they are little sandboxes that carry their environment with them (OS libraries and all), they limit the risk that things break when moving from dev to production, in ways that other deployment tools like Chef or Ansible struggle to match.
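As a concrete (and entirely hypothetical) example, a Dockerfile for one of our API servers might look like this, assuming it's a Go service:

```dockerfile
# Build stage: compile the API server with the Go toolchain.
# (Assumes the repo uses Go modules.)
FROM golang:1.12 AS build
WORKDIR /src
COPY . .
RUN CGO_ENABLED=0 go build -o /api-server .

# Runtime stage: a minimal image containing only the binary,
# so the container carries its whole environment with it.
FROM alpine:3.9
COPY --from=build /api-server /usr/local/bin/api-server
EXPOSE 8080
CMD ["api-server"]
```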

So that's one big reason why Docker is popular. Another big reason is versioning: you can tag images with a version to ensure that the versions of things in production or staging line up properly. There are plenty of other reasons; feel free to ask Google if you want to know more.

Unfortunately Docker doesn't know much about the bigger picture and how these different components are connected. Here are some issues:

  • How does the API server know the address of the components it talks to, like the DB or Redis? Docker can provide a virtual network, but it still relies on static values like IP addresses to connect things together.
  • What happens when something crashes? When you have lots of machines running, it's a near-guarantee that something will crash somewhere, if only due to hardware failure.

This is where Kubernetes comes in. It's orchestration software that makes it easier for large numbers of containers to interact with one another. To handle containers talking to one another, it provides an internal DNS so that servers can refer to each other by name; names stay fixed and can be self-explanatory, whereas IP addresses can change, only point to a single machine, and are difficult to keep track of at scale. To handle the issue of machines dying, it is self-healing: if it sees that something has crashed, it restarts the affected containers, rescheduling them onto other nodes if necessary.
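To make that concrete, here's a hypothetical Kubernetes manifest for our API servers. The image name and tag are made up; the point is the stable `api-server` DNS name and the self-healing `replicas: 3`:

```yaml
# A Deployment keeps three API-server containers running, restarting
# or rescheduling them if a container or node dies (self-healing).
apiVersion: apps/v1
kind: Deployment
metadata:
  name: api-server
spec:
  replicas: 3
  selector:
    matchLabels:
      app: api-server
  template:
    metadata:
      labels:
        app: api-server
    spec:
      containers:
        - name: api-server
          image: registry.example.com/pinstablr/api-server:1.4.2
          ports:
            - containerPort: 8080
---
# A Service gives the pods a stable DNS name: other containers can
# simply dial "api-server" instead of tracking pod IP addresses.
apiVersion: v1
kind: Service
metadata:
  name: api-server
spec:
  selector:
    app: api-server
  ports:
    - port: 80
      targetPort: 8080
```

With this in place, the load balancer box from our earlier diagram largely comes for free: the Service spreads requests across whichever pods are currently healthy.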

Again, if you want to know more about Kubernetes, there is plenty of information out there for you on Google, so I don't really need to write much more about it.

Hopefully this clears up what infrastructure is, at least a little. This post focused mainly on a web app, so the infrastructure here is all oriented towards that problem, but you can have infrastructure that manages other things too. In my current role we train large-scale machine learning models against petabytes of data; the infrastructure we build manages that data, feeds it into training pipelines, handles things like data transformations and augmentations, and so on.

In short, infrastructure is code that helps manage the scale of your system.
