Aside from performance metrics, one important thing to keep an eye on in your production environments is the number of errors they throw and what is going on in the logs.
It's very impractical to have services running while you are blind and deaf to the issues they encounter. It will cost you users, and it will cost your product a lot.
Before even talking about production incidents, we can look at how software exceptions (bugs) are handled in production and how anything odd that gets logged can be seen and checked.
Software exceptions: failing code
There are several proprietary and open source tools out there that can be used with most stacks to catch and keep track of software exceptions.
These are very important tools, as they give your team a quick alert when code doesn't work in the intended way. Sentry, New Relic and AppSignal (to name a few) are great for monitoring those errors, reporting on them and displaying the context in which they happened.
This should be one of the first lines of alert that something is not working properly in a service. Your users might actually see that something is not working properly, but you should be alerted at the same time and start figuring out the problem and its solution even before your users start reaching out to you.
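To make the idea concrete, here is a minimal sketch of what an exception catcher does under the hood: it intercepts the error, records the context, and lets the error continue on its way. Everything in it is hypothetical and for illustration only: `REPORTS`, `report_exception`, `monitored` and `apply_discount` are names I made up, and the in-memory list stands in for a real tracker such as the ones named above.

```python
import functools
import traceback
from datetime import datetime, timezone

# Hypothetical in-memory sink; a real setup would send reports to a tracker.
REPORTS = []


def report_exception(exc, context):
    """Record what failed, when, and in which context."""
    REPORTS.append({
        "type": type(exc).__name__,
        "message": str(exc),
        "context": context,
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "traceback": "".join(
            traceback.format_exception(type(exc), exc, exc.__traceback__)
        ),
    })


def monitored(func):
    """Report any exception raised by func, then re-raise it unchanged."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        try:
            return func(*args, **kwargs)
        except Exception as exc:
            report_exception(exc, {
                "function": func.__name__,
                "args": repr(args),
                "kwargs": repr(kwargs),
            })
            raise  # the caller still sees the failure
    return wrapper


@monitored
def apply_discount(price, percent):
    """Illustrative business function that can fail on bad input."""
    if not 0 <= percent <= 100:
        raise ValueError(f"invalid discount: {percent}")
    return price * (100 - percent) / 100
```

The important design choice is the bare `raise`: reporting must never swallow the error, otherwise the monitoring itself would change how the service behaves.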
See what's happening
Exception catchers will alert you when something is going on, and they will also tell you what the problem looks like. But often you will want to look at the broader picture to understand how the software walked into this issue.
Logs are often a good way to figure that part out. In the cloud world we live in, it's often impractical to have to log into a remote server to see the logs of the services. It's better and easier to have all those logs sent to a log service (either third-party or self-hosted) so you can dig into them with one tool, in one place.
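One common way to make logs easy to ship and search is to emit them as one JSON object per line and let a collector forward them. The following is only a sketch using Python's standard `logging` module. The field names, the `JsonFormatter` class, the `build_logger` helper and the `checkout` logger name are my own assumptions, not the format any particular log service requires.

```python
import json
import logging
import sys


class JsonFormatter(logging.Formatter):
    """Render each log record as a single JSON line."""

    def format(self, record):
        entry = {
            "time": self.formatTime(record),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        }
        if record.exc_info:
            # Keep the stack trace with the log line it belongs to.
            entry["exception"] = self.formatException(record.exc_info)
        return json.dumps(entry)


def build_logger(name, stream=sys.stdout):
    """Return a logger that writes JSON lines to the given stream."""
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    logger.handlers.clear()   # avoid duplicate handlers on repeated calls
    logger.propagate = False  # keep output limited to this handler
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    logger.addHandler(handler)
    return logger


log = build_logger("checkout")
log.info("payment accepted for order %s", 1234)
```

Writing to standard output keeps the service unaware of where the logs end up; whatever collects the output can forward it to the log service of your choice.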
Handle the issues
Knowing there is an issue and looking into it is the day-to-day work of most software and DevOps engineers. But when the issue goes beyond a simple little front-end quirk and your product is partly or completely unresponsive, you will have to consider it an incident rather than a bug.
Incident management is not an easy thing, but it can be learned and there are plenty of resources out there to help you do so.
The good thing is that the earlier your team learns how to handle incidents, the easier it will be to handle the big ones that come later as the product grows. Again, it's something the team can build up from the early days, while the risks are low. So, once the team counts 3 or 4 people, it's time to start organizing how incidents are handled.
The first thing to do is to not panic. Panic is never a good solution. Cooler heads have to prevail in such situations, and they will steer the boat back to calm waters.
The goal is always to bring the impacted service, or services, back online as fast as possible. Whoever is looking into the incident should always keep an eye on how that can happen. The cause of the incident matters; it might not be obvious at first, of course, but some quick checks can help. I usually go through this kind of checklist:
- what are the symptoms?
- is the service totally or partially unavailable?
- was there a deployment done recently?
- was there a change in the infrastructure done recently?
- was there a change in the database schema?
- is there an issue with any of the third-party services involved?
Depending on the answers to those questions, I should know more or less whether:
- the problem can be solved by rolling back to a previous release
- the problem can be solved by restarting a service or a piece of the infrastructure
- the problem can be solved by reverting a change in the database schema
- the problem can be solved by disabling the third-party service or isolating the service from it
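The two lists above can be sketched as a small lookup from checklist answers to candidate remedies. This is only an illustration: the `Triage` class and `candidate_actions` function are hypothetical names, and the one-to-one pairing between a question and a remedy (in particular, infrastructure change to restart) is my simplifying assumption. In a real incident the answers only point you in a direction.

```python
from dataclasses import dataclass


@dataclass
class Triage:
    """Answers to the quick checklist (hypothetical structure)."""
    recent_deployment: bool = False
    recent_infra_change: bool = False
    recent_schema_change: bool = False
    third_party_issue: bool = False


def candidate_actions(triage):
    """Return the remedies worth considering, in checklist order."""
    actions = []
    if triage.recent_deployment:
        actions.append("roll back to the previous release")
    if triage.recent_infra_change:
        # Assumed pairing: an infrastructure change suggests a restart.
        actions.append("restart the service or the affected infrastructure")
    if triage.recent_schema_change:
        actions.append("revert the database schema change")
    if triage.third_party_issue:
        actions.append("disable or isolate the third-party service")
    return actions


for action in candidate_actions(Triage(recent_deployment=True,
                                       third_party_issue=True)):
    print("-", action)
```

An empty result simply means none of the usual suspects applies and the investigation has to continue.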
If the service is totally down, the first order of business is to find a way to put the lights back on, even in a degraded mode. It's often better if the customers can see the service running rather than a blank page or errors.
If the service is partially down, you still need to assess whether what is working is the most important part of the product or not. This will also help decide what to do.
It's common to see software engineers run straight into the fire to try to extinguish it. It's honorable, but it's often preferable to go through something similar to the above checklist to decide what to do and in which order.
In short:
- assess the damage
- figure out the probable cause
- make a rough estimate of the time needed to bring the service back, even partially, without fixing the issue
- make a rough, worst-case estimate of the time needed to fix the issue for good
- go with the shorter of the two above
The third item should be estimated as soon as possible so that, if the permanent fix in the fourth starts to look like a very distant possibility, the team can restore the service first and keep the end users from suffering for too long.
There is a lot more to write about how to manage incidents beyond this. The other important parts are about communication: with the rest of the technical team, the rest of the company and the users. All three should be handled by one person who is involved in the incident management and has a clear head about it.
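The decision in the last item of that list is simple arithmetic, which is exactly why it helps to write it down during an incident. A minimal sketch, assuming both estimates are expressed in minutes; the `choose_path` name and the tie-breaking rule (prefer restoring service) are my own assumptions.

```python
def choose_path(restore_minutes, fix_minutes):
    """Pick what to do first: 'restore' the service or 'fix' the root cause.

    restore_minutes: rough estimate to bring the service back, even partially.
    fix_minutes: rough worst-case estimate to fix the issue for good.
    """
    if restore_minutes <= fix_minutes:
        # Assumed tie-break: when in doubt, get the users back online first.
        return "restore"
    return "fix"


# The estimates change as the team learns more, so re-run the decision.
print(choose_path(restore_minutes=10, fix_minutes=120))
print(choose_path(restore_minutes=45, fix_minutes=15))
```

The point is less the function than the habit: revisit both numbers as the investigation goes on, and switch paths when the comparison flips.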
Using incidents for growth
Incidents happen. Learning how to manage them will help reduce their impact. Learning from them will help reduce their number. Post-mortems, write-ups explaining the causes, components and solutions of each incident, are important for learning how to avoid similar issues in the future.
Incidents and their post-mortems should be another part of the feedback loop the technical team uses to organize its work and to adjust its sprints and tickets.
Incidents should not be feared; they should be prepared for. They happen, and the team should feel ready for them. The team should know exactly how to organize itself whenever there is an incident, with the objective of bringing everything back online and working in the shortest amount of time possible. To do this, practice is key.
Early incidents (in the life of a product) are often low in impact and provide good opportunities to learn and organize. But as the team and product grow, it actually becomes practical for the team to create incidents deliberately in order to learn how to prepare better for real ones.
Fancy talking more about this topic? Contact me to discuss how I can help your team establish or refresh exception and log handling in your product, or an incident management strategy.