The state of site reliability engineering (SRE) documentation continues to solidify as more large enterprises move their mission-critical applications to the cloud, making the site reliability engineer role mainstream.
SRE documentation versus traditional IT documentation
Traditional documentation efforts, such as IT documentation, fall under the support function. Therefore, they are easy to deprioritize when resources are short. SRE documentation, on the other hand, is part of the site reliability engineer’s job description. It cannot be ignored, because it is an integral part of SRE best practices.
DevOps and IT documentation can have various authors, such as developers, QA administrators, or technical writers. A great SRE must also be a talented communicator. Infrastructure Architect SREs are the primary authors of runbooks, internal tools documentation, and how-tos. Think of this type of SRE as the communicator of the SRE team.
Basic SRE Documentation
Some common types of documentation are fundamental to the SRE role and operational best practices.
Production Readiness Review Documentation
When SREs integrate new systems, they perform a Production Readiness Review (PRR) to ensure that the new systems meet their organization’s readiness standards. Take the time to create a PRR template that captures those standards rather than reusing previous PRR documentation, which invites human error.
Writing a PRR requires IT teams to be as descriptive as possible when documenting system readiness. If you lack information to complete a section, document why you do not have it.
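The rule above (every section is either filled in or carries a documented reason for the gap) can be checked mechanically. The following is a minimal sketch, not a standard PRR format: the `PRRSection` class, its field names, and the sample section titles are all hypothetical and only illustrate the idea.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class PRRSection:
    """One section of a PRR document (hypothetical shape, for illustration)."""
    title: str
    content: str = ""
    missing_reason: Optional[str] = None  # why the content is unavailable


def incomplete_sections(sections: List[PRRSection]) -> List[str]:
    """Return titles of sections with neither content nor a documented reason."""
    problems = []
    for section in sections:
        has_content = bool(section.content.strip())
        has_reason = bool((section.missing_reason or "").strip())
        if not has_content and not has_reason:
            problems.append(section.title)
    return problems


# Sample data is invented for the example.
prr = [
    PRRSection("Monitoring", content="Dashboards and alerts cover all endpoints."),
    PRRSection("Capacity plan", missing_reason="Load test scheduled; no data yet."),
    PRRSection("Rollback procedure"),
]
print(incomplete_sections(prr))  # ['Rollback procedure']
```

A reviewer could run a check like this before reading the PRR, so that review time goes to judging the answers rather than hunting for blank sections.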
Choose a reviewer for your team’s PRR documentation based on who has the expertise to ask the right questions about system readiness. Work with the creator of the PRR to determine if the information in the PRR documentation is complete enough to demonstrate production readiness.
A service overview document guides SRE troubleshooting. SREs on all shifts should understand the system architecture, components, dependencies, and service contracts for each service they support. Service overviews are therefore high-priority documents. Common elements of a service overview include the following:
- a description of the service;
- links to other sources of information, such as monitoring dashboards and operations documentation; and
- reference architecture.
Creating a service overview should be a collaborative effort between development teams and SRE. This allows both teams to design an overview that prioritizes how the SRE team approaches troubleshooting. Choose an internal platform, such as a wiki, to publish your service overview to ensure it remains accessible to SRE teams.
Service overviews are not a one-time effort. Teams should invest time in updating service overviews as services change and new dependencies emerge. SRE teams often use the PRR process to generate service overviews.
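One way to keep overviews from going stale is to store a review date with each one and flag those that are overdue. The sketch below assumes a hypothetical `ServiceOverview` record built from the elements listed earlier; the 90-day threshold, the service name, and the example.com links are placeholders, not recommendations.

```python
from dataclasses import dataclass, field
from datetime import date, timedelta
from typing import Dict, List


@dataclass
class ServiceOverview:
    """Hypothetical record mirroring the common elements of a service overview."""
    name: str
    description: str
    reference_architecture: str
    last_reviewed: date
    links: Dict[str, str] = field(default_factory=dict)   # label -> URL
    dependencies: List[str] = field(default_factory=list)


def is_stale(overview: ServiceOverview, today: date, max_age_days: int = 90) -> bool:
    """True when the overview has not been reviewed within max_age_days."""
    return today - overview.last_reviewed > timedelta(days=max_age_days)


overview = ServiceOverview(
    name="checkout",
    description="Handles cart checkout and payment hand-off.",
    reference_architecture="https://wiki.example.com/checkout/architecture",
    last_reviewed=date(2024, 1, 10),
    links={"dashboard": "https://monitoring.example.com/checkout"},
    dependencies=["payments", "inventory"],
)
print(is_stale(overview, today=date(2024, 6, 1)))  # True
```

The threshold is a team decision; a service with frequent dependency changes may need a much shorter review interval.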
Playbooks, sometimes called runbooks, are basic operating documents that enable on-call engineers to respond to service monitoring alerts. A well-designed and maintained playbook reduces the time needed to mitigate an incident because it contains troubleshooting procedures and links to operations and monitoring consoles.
SRE teams are increasingly turning to automation to create playbooks. Popular tools include Siemplify (now part of Google Cloud), Swimlane, and Jupyter.
Common elements of an SRE playbook include the following:
- a definition of what counts as an incident for your organization;
- designation of incident response roles and responsibilities;
- standardized incident response procedures and workflows reviewed and tested by an SRE; and
- cheat sheets and checklists for SRE incident response.
Playbook creation best practices include the following:
- Start each playbook with a trigger, such as a monitoring alert.
- Structure playbook entries based on severity, impact, metric, history, mitigation, and discovery.
- Automate every action, including the simple steps, to remove as much human error as possible from the incident response process.
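The practices above suggest a simple data shape: one entry per trigger, with the remaining fields in a fixed order. The following is a minimal sketch under that assumption. The `PlaybookEntry` class, the meaning given to each field in the comments, and the `HighErrorRate` alert are hypothetical and should be adapted to your own alerting setup.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class PlaybookEntry:
    """One playbook entry, opened by a trigger (field meanings are assumptions)."""
    trigger: str                    # name of the alert that opens this entry
    severity: str                   # e.g. "page" or "ticket"
    impact: str                     # who or what is affected
    metric: str                     # signal the alert is based on
    history: str                    # notes from earlier occurrences
    mitigation: List[str] = field(default_factory=list)  # ordered steps
    discovery: str = ""             # how to confirm and scope the problem


def build_index(entries: List[PlaybookEntry]) -> Dict[str, PlaybookEntry]:
    """Index entries by trigger so an alert name maps straight to its entry."""
    return {entry.trigger: entry for entry in entries}


def lookup(index: Dict[str, PlaybookEntry], alert_name: str) -> Optional[PlaybookEntry]:
    """Return the entry for an alert, or None if no playbook covers it."""
    return index.get(alert_name)


index = build_index([
    PlaybookEntry(
        trigger="HighErrorRate",
        severity="page",
        impact="Users see failed requests",
        metric="HTTP 5xx ratio over 5 minutes",
        history="Previously caused by a bad deploy",
        mitigation=["Check the latest deploy", "Roll back if it correlates"],
        discovery="Compare error rate per release on the dashboard",
    ),
])
entry = lookup(index, "HighErrorRate")
print(entry.mitigation[0])  # Check the latest deploy
```

Keeping entries as structured data rather than free text also makes it easier to automate individual mitigation steps later, and a `None` result from `lookup` highlights alerts that have no playbook yet.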
Playbooks tend to be culture-dependent, so consider any template you find online as a starting point, not a roadmap. Invest time to get input from SREs and the operations team on their playbook requirements, and build your template to meet your team’s needs.
The larger the cloud, the harder it can fall. In an era of significant cloud outages, your organization should define its post-mortem criteria before a triggering incident occurs. A typical post-mortem document includes the following:
- a management summary capturing the effects of the incident and the root cause;
- a technical summary of the effects of the incident on the business, users, teams, and systems, including approximate response times, detection method, and the solution applied by the SRE team to resolve the incident;
- incident history with additional detection details and screenshots of monitoring charts, timeline, root cause and resolution information; and
- lessons learned, including details on what went right and what went wrong in resolving the incident.
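The sections listed above map naturally onto a template that can be rendered to text for a wiki page. This is a hedged sketch, not a prescribed format: the `PostMortem` class, its field names, and the sample incident are invented for illustration.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class PostMortem:
    """Hypothetical post-mortem record following the sections listed above."""
    title: str
    management_summary: str
    technical_summary: str
    incident_history: List[str] = field(default_factory=list)  # timeline items
    went_right: List[str] = field(default_factory=list)
    went_wrong: List[str] = field(default_factory=list)


def render(pm: PostMortem) -> str:
    """Render the post-mortem as plain text with one heading per section."""
    lines = [pm.title, ""]
    lines += ["Management summary", pm.management_summary, ""]
    lines += ["Technical summary", pm.technical_summary, ""]
    lines += ["Incident history"] + [f"- {item}" for item in pm.incident_history] + [""]
    lines += ["Lessons learned: what went right"] + [f"- {item}" for item in pm.went_right] + [""]
    lines += ["Lessons learned: what went wrong"] + [f"- {item}" for item in pm.went_wrong]
    return "\n".join(lines)


pm = PostMortem(
    title="Post-mortem: checkout outage",
    management_summary="Checkout was unavailable; root cause was a bad config push.",
    technical_summary="Detected by alert; resolved by rolling back the config.",
    incident_history=["Alert fired", "Rollback started", "Service recovered"],
    went_right=["Alert fired quickly"],
    went_wrong=["Rollback steps were not documented"],
)
print(render(pm))
```

Because every post-mortem produced this way has the same headings in the same order, the resulting pages are easier to search and compare across incidents.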
Establish clear and comprehensive post-mortem templates to set organization-wide standards. Good models can be found online or kept in a wiki, along with writing tips and advice for SREs to follow, so that SREs do not start from a blank page when writing SRE documentation. Post-mortems should be accessible and searchable through your organization’s internal collaboration platform.
Culture drives reviews, not the mere existence of documentation. Google SRE guidelines, for example, dictate that an organization must establish a post-mortem culture beyond documentation to avoid the anxiety of a team’s walk of shame after an incident. Post-mortems are not meant to sink to the bottom of email inboxes – these documents warrant stakeholder scrutiny.
Operating large-scale complex systems requires both technical and non-technical production policies. Policy documentation details mandates for production tasks, such as logging changes, retaining logs, naming internal services, and accessing and using emergency credentials.
Developing policy documentation involves creating and maintaining standard documentation templates. Some organizations train SREs to write policy documents as part of their onboarding process.
Because SRE documentation best practices are an integral part of the SRE job role, SRE documentation often escapes many of the challenges that traditional IT and DevOps documentation faces when competing for engineers’ resources and attention. As with other SRE and DevOps practices, organizations should allow SRE teams the time to continuously refine their strategy, processes, and documentation tools to ensure that SRE documentation becomes an asset to overall business operations.