When I first heard the term “cattle not pets,” it struck me as the perfect metaphor for a concept I had always been aware of when developing for the cloud but never had the words for: remove individuality from your cloud infrastructure and treat your resources as nameless and dynamic, like cattle. Resources come and go all the time, so there is no time to name, feed, and care for them as if they were pets.
I’m sure many of us have been somewhere that has a fleet of servers named after superheroes, Disney characters, or something exceedingly nerdy like Doctor Who villains. When we start talking about scalability, though, characters can’t be imagined fast enough, to say nothing of the hand-feeding required to spin up new instances of an application over and over again. As we were developing our cloud infrastructure to scale for Muserk, our first goal was to never connect directly to an instance again. This felt like a great starting point for answering the question of how we deploy applications, manage state, and debug issues that arise. This is mostly a qualitative look at how we began to scale our operations in the early days of Muserk… so for you super nerds out there, we won’t go into detail about things like load balancing, caching, or infrastructure as code.
DEPLOYING APPLICATIONS
Probably the most important aspect of scaling is being able to deploy an application programmatically. Once we can do that, everything else is just a convenience. The obvious answer here is Docker. The more advanced answer involves Kubernetes or Terraform, but that’s a topic for another day. With a containerized application we can control dependencies, versions, the operating system, and any configuration that needs to be done ahead of time. So all we need is a platform to run our container. The advantage is that this platform can be anything! The container will run exactly what we need, the exact same way, anywhere that can support Docker. Once the process of starting one of these containers is automated, we are free to start up as many as we would like, allowing a load balancer to route traffic appropriately.
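As a rough illustration of what “starting a container programmatically” can look like, here is a minimal sketch using Docker’s Python SDK (docker-py). The image name, port mapping, and environment variables are hypothetical placeholders, not our actual setup.

```python
# A minimal sketch of starting a container programmatically with the Docker
# Python SDK (docker-py). Image name, ports, and environment are hypothetical.
import docker

client = docker.from_env()  # connects to the local Docker daemon

container = client.containers.run(
    "example-registry/my-api:1.2.3",      # versioned, pre-built image (hypothetical)
    detach=True,                           # run in the background
    ports={"8080/tcp": 8080},              # expose the app so a load balancer can reach it
    environment={"APP_ENV": "production"}, # configuration injected at start time
    restart_policy={"Name": "on-failure"}, # let the platform restart it, not a human
)

print(f"started {container.short_id}")
```

Once a call like this is wrapped in automation, spinning up the tenth instance is no different from spinning up the first.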
MANAGING STATE
Next there is the problem of how to manage state on a server instance that is essentially disposable. Writing to local disk is out of the question because all of that information would be lost from instance to instance. Well, what about NFS? It could be a plausible solution, but it is too slow without provisioned IOPS (which are expensive in the cloud). Besides, we should do better!
In fact, this was the starting point for really honing our data model, and it forced us to come up with a first pass at some sort of ETL process. As we ingest data, how do we store it so that our applications can access it in a consistent way? Once all of our data is in one place, we can use it as our Single Source of Truth. Using a database as an SSOT brings its own complexity. The real lesson for managing state across a scalable infrastructure is to AVOID state when you can.
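To make that concrete, here is a hedged sketch of the difference: instead of writing results to the local disk of an instance that may disappear, a stateless worker pushes them to shared storage that every instance can reach. The bucket and key names below are made up for illustration; the point is only that nothing important lives on the instance itself.

```python
# A sketch of keeping application state off the instance: results go to shared
# object storage (S3 here via boto3) rather than the local filesystem.
# Bucket and key names are hypothetical.
import json
import boto3

s3 = boto3.client("s3")

def process_record(record_id: str, payload: dict) -> None:
    """Do some work, then persist the result somewhere durable and shared."""
    result = {"record_id": record_id, "status": "processed", **payload}

    # Anti-pattern on disposable instances: open(f"/tmp/{record_id}.json", "w")
    # Instead, write to storage that survives the instance:
    s3.put_object(
        Bucket="example-etl-results",           # hypothetical bucket
        Key=f"results/{record_id}.json",
        Body=json.dumps(result).encode("utf-8"),
    )
```

Whether the shared store is an object bucket or the database acting as the SSOT, the instance itself stays disposable.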
DEBUGGING ISSUES
The most common reason for needing to log into an individual instance is to figure out what went wrong. As resources start to scale, this gets increasingly difficult anyway, because an error could have occurred on any one of 4, 10, or, theoretically, n instances. So how do we figure out where problems are happening and how to fix them? There are all sorts of things to monitor across our applications: user experience, resource trends, and load times are a few examples. Most important, in my opinion, are the error logs.
When an error occurs, we want to be made aware of it. As a first pass, you should be using a logger. A logger lets us standardize how we create new logs by assigning a category to each type of log and ordering the categories by severity. Some common categories include DEBUG, INFO, and ERROR. DEBUG-level logs may contain information that is helpful when figuring out what happened but does not need to be looked through all the time. INFO-level logs are a step up in severity: messages we may always want to see so that we can follow usage in real time. ERROR logs, being the most severe, can be alerted on. We can configure our system to report when an ERROR has been logged so that we can take immediate action, then use the INFO and DEBUG logs to determine what happened. If we’ve done it correctly, these logs will include information about the specific machine the application is running on, so we can handle hardware-specific problems. Once we are collecting logs from all machines across all applications, we can begin to build dashboards around each application. Combined with usage and hardware metrics, that gives us a central location to view all relevant information.
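As a small illustration, here is roughly what that looks like with Python’s standard logging module: severity levels, plus the hostname baked into every record so a log line can always be traced back to the machine that produced it. The application name and messages are placeholders, and shipping these logs to a central collector would be configured separately.

```python
# A sketch of severity-based logging that tags every record with the machine
# it came from. Forwarding to a central collector (CloudWatch, ELK, etc.)
# would be added as a separate handler.
import logging
import socket

HOSTNAME = socket.gethostname()

logging.basicConfig(
    level=logging.INFO,  # DEBUG is filtered out here; lower to DEBUG when investigating
    format=f"%(asctime)s {HOSTNAME} %(name)s %(levelname)s %(message)s",
)
log = logging.getLogger("my-api")  # hypothetical application name

log.debug("raw payload: %s", {"id": 42})      # detail for digging in later
log.info("processed record %s", 42)           # normal, always-on visibility
log.error("failed to process record %s", 42)  # severe enough to alert on
```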
I hope this was in some way helpful for thinking about your own cloud infrastructure. As we continue to improve our architecture, we hope to have more to share. We are evolving our technology every day and are working hard to improve our ETL workflows and their integration into the substantial amount of processing we do with the data we generate. In the meantime, we will continue to backfill posts with what we have learned and implemented along the way on this journey into the final frontier.