Cattle Not Pets

When I first heard the term “cattle not pets,” it struck me as the perfect metaphor for a concept I had always been aware of when developing for the cloud but never had the words for: the idea that you should remove individuality from your cloud infrastructure and treat your resources as nameless and dynamic, like cattle. Resources come and go all the time, so there is no time to name, feed, and care for them as if they were pets.

I’m sure many of us have been somewhere that has a fleet of servers named after superheroes, Disney characters, or something exceedingly nerdy like Doctor Who villains. When we start talking about scalability, though, characters can’t be imagined fast enough, to say nothing of the hand-feeding required to spin up new instances of an application over and over again. As we were developing our cloud infrastructure to scale at Muserk, our first goal was to never connect directly to an instance again. This felt like a great starting point for answering the questions of how we deploy applications, manage state, and debug issues that arise. This is mostly a qualitative look at how we began to scale our operations in the early days of Muserk, so for you super nerds out there we won’t go into detail about things like load balancing, caching, or infrastructure as code.

DEPLOYING APPLICATIONS

Probably the most important aspect of scaling is being able to deploy an application programmatically. Once we can do that, everything else is just a convenience. The obvious answer here is Docker. The more advanced answer involves Kubernetes or Terraform, but that’s a topic for another day. With a containerized application we can control dependencies, versions, the operating system, and any configuration that needs to be done ahead of time. So all we need is a platform to run our container, and the advantage is that this platform can be anything! The container will run exactly what we need, the exact same way, anywhere that supports Docker. Once the process of starting one of these containers is automated, we are free to start up as many as we would like, allowing a load balancer to route traffic appropriately.
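In practice, that "dependencies, versions, OS, and configuration baked in ahead of time" story can be as small as a single Dockerfile. The sketch below is purely hypothetical (the Python base image, requirements.txt, and main.py entry point are assumptions for illustration, not our actual setup):

```dockerfile
# Hypothetical Dockerfile: pin the OS, language runtime, dependencies,
# and configuration so every instance starts identically.
FROM python:3.11-slim
WORKDIR /app

# Install pinned dependencies first so this layer caches between builds.
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

# Copy the application code and bake configuration in ahead of time.
COPY . .
ENV APP_ENV=production

# Assumed entry point; replace with your application's start command.
CMD ["python", "main.py"]
```

An image built from this file (`docker build -t my-app .`) runs the same way on any host that supports Docker, which is exactly what makes starting up n copies behind a load balancer safe.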

MANAGING STATE

Next there is the problem of how to manage state on a server instance that is essentially disposable. Writing to local disk is out of the question because all of that information would be lost from instance to instance. Well, what about NFS? That could be a plausible solution, but it is too slow without provisioned IOPS (which are expensive in the cloud). Besides, we should do better!

In fact, this was the starting point for really honing our data model, and it forced us to come up with a first pass at some sort of ETL. As we ingest data, how do we store it so that our applications can access it in a consistent way? Once all of our data is in one place, we can use it as our Single Source of Truth (SSOT). Using a database as an SSOT brings its own complexity, though. The real lesson for managing state across a scalable infrastructure is to AVOID state when you can.
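The "avoid state on the instance, keep what you must in one shared place" lesson can be sketched in a few lines of Python. `SharedStore` here is just an in-memory stand-in for whatever database serves as the SSOT; the point is that the handler keeps nothing on the instance itself:

```python
class SharedStore:
    """Stand-in for the shared database acting as the Single Source of Truth.
    In a real deployment this would be a network call, not a local dict."""

    def __init__(self):
        self._data = {}

    def put(self, key, value):
        self._data[key] = value

    def get(self, key):
        return self._data.get(key)


def handle_request(store, user_id, payload):
    # Stateless handler: nothing survives on this instance between calls,
    # so any instance behind the load balancer can serve the next request.
    store.put(user_id, payload)
    return store.get(user_id)
```

Because every instance reads and writes through the same store, an instance can be recycled at any moment without losing anything.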

DEBUGGING ISSUES

The most common reason for needing to log into an individual instance is to figure out what went wrong. As resources start to scale this gets increasingly difficult anyway, because an error could have occurred on any one of 4, 10, or, theoretically, n instances. So how do we figure out where problems are happening and how to fix them? There are all sorts of things to monitor across our applications; user experience, resource trends, and load times are a few examples. Most important, in my opinion, are the error logs.

When an error occurs, we want to be made aware of it. As a first pass, you should be using a logger. A logger lets us standardize how we create new logs by assigning a category to each type of log and ordering the categories by severity. Some common categories include DEBUG, INFO, and ERROR. In this scheme, DEBUG-level logs may be information that would be helpful when figuring out what happened, but not something that needs to be looked through all the time. INFO-level logs carry a bit more severity; these are messages we may always want to see so that we can follow usage in real time. ERROR logs, being the most severe, can be alerted on: we can configure our system to report when an ERROR has been logged so that we can take immediate action, then use the INFO and DEBUG logs to determine what happened.

If we’ve done it correctly, these logs will include information on the unique machine the application is running on, so we can handle hardware-specific problems. Once we are collecting logs from all machines across all applications, we can begin to build dashboards around each application. Combined with usage and hardware metrics, we have a central location to view all relevant information.

I hope this was in some way helpful for thinking about your own cloud infrastructure. As we continue to improve our architecture, we hope to have more to share. We are evolving our technology every day and are working hard to improve our ETL workflows and their integrations into the substantial amount of processing we are doing with the data we generate. In the meantime, we will continue to backfill posts with what we have learned and implemented along the way of this journey into the final frontier.

Working Remotely

How Our Office Prepared Us To Be A Remote Team

As many startups do, Muserk began as a fully remote team. Once our business solidified, workflows grew and collaboration became increasingly important. The logical next step was to get as many people as possible into one place. With team members all over the world, however, we couldn’t expect the entire company to move to Nashville.

In the throes of COVID-19 we were forced to take the team remote again. We had gotten used to office life, and the team had grown significantly. Were those distant memories of a fully remote team lost? We couldn’t be sure how big of a challenge this would be to overcome. Luckily, we still had members of the team outside of Nashville, and all along we had accidentally been preparing for this.

The office has always served as a hub for us. Once a quarter we assemble in Nashville for a week to share everything the company has been doing, and the division between each group’s efforts is obvious: each team is working on its own thing, and some of that knowledge doesn’t come across day to day, because it doesn’t need to. Communication is key to facilitating a remote team, and, like any good team, we should even strive to over-communicate. That over-communication can quickly become a distraction, however, if it isn’t effective.

Working in an office, we figured out which information streams matter to us, how to separate them, and how to tap into the ones we care about. To subscribe to all of the conversations happening at Muserk, and mute them when they get in the way, we make our chat channels as granular as possible. Discussions often happen in chat rather than in person, and we are in the habit of posting the results for those who were not there. We send casual meeting invites to those who may or may not care about the topic, in case they want to be involved. Scheduled meetings auto-create video links within our calendars, and we join with tablets to use as whiteboards. When COVID-19 kept us from getting to the office, we worried about how it would hinder collaboration, but in hindsight we had been preparing all along. Without even knowing it, we had fostered a remote-first culture that even our new hires, with no remote experience, were able to adapt to seamlessly.

So has the office become unnecessary? Hardly. Without it we wouldn’t have figured out who we are as a team, and the lessons about communication might have required more effort, or taken longer to develop. We inadvertently learned a valuable lesson about disaster preparedness that we can carry into the future. When we go back to the office, this moment will remain in the back of our minds. We may never be in this situation again, but if we are, the transition will be just as seamless.