Open-source News

How SREs can achieve effective incident response

opensource.com - Tue, 06/21/2022 - 15:00
Robert Kimani - Tue, 06/21/2022 - 03:00

Incident response includes monitoring, detecting, and reacting to unplanned events such as security breaches or other service interruptions. The goal is to get back to business, satisfy service level agreements (SLAs), and provide services to employees and customers. Incident response is the planned reaction to a breach or interruption. One goal is to avoid unmanaged incidents.

Establish an on-call system

One way of responding is to establish an on-call system. These are the steps to consider when you're setting up an on-call system:

  1. Design an effective on-call system 
  2. Understand managed vs. unmanaged incidents
  3. Build and implement an effective postmortem process
  4. Learn the tools and templates for postmortems

Understand managed and unmanaged incidents

An unmanaged incident is an issue that an on-call engineer handles, often with whatever team member happens to be available to help. More often than not, unmanaged incidents become serious issues because they are not handled correctly. Common problems include:

  • No clear roles.
  • No incident command.
  • Random team members involved (freelancing), the primary killer of the management process.
  • Poor (or lack of) communication.
  • No central body running troubleshooting.

A managed incident is one handled with clearly defined procedures and roles. Even when an incident isn't anticipated, it's still met with a team that's prepared. A managed incident is ideal. It includes:

  • Clearly defined roles.
  • Designated incident command that leads the effort.
  • Only the operations team designated by the Incident Command makes changes to systems.
  • A dedicated communications role exists. Until a communications person is identified, the Incident Command can fill this role.
  • A recognized command post such as a "war room." Some organizations have a defined "war room bridge number" where all the incidents are handled.

Incident management takes place in a war room. The Incident Command is the role that leads the war room. This role is also responsible for organizing people around the operations team, planning, and communication.

The Operations Team is the only team that can touch the production systems. Hint: Next time you join an incident management team, the first question to ask is, "Who is running the Incident Command?"

Deep dive into incident management roles

Incident management roles clearly define who is responsible for what activities. These roles should be established ahead of time and well-understood by all participants.

Incident Command: Runs the war room and assigns responsibilities to others.

Operations Team: Only role allowed to make changes to the production system.

Communication Team: Provides periodic updates to stakeholders such as the business partners or senior executives.

Planning Team: Supports operations by handling long-term items such as providing bug fixes, postmortems, and anything that requires a planning perspective.

As an SRE, you'll probably find yourself in the Operations Team role, but you may also have to fill other roles.
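The separation of duties described above can be sketched in a few lines of Python. This is only an illustration, not a tool the article describes: the `Role` and `Incident` names and the `can_change_production` helper are hypothetical, chosen to mirror the four roles and the rule that only the Operations Team changes production.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Dict, Optional


class Role(Enum):
    """The four incident management roles named in the article."""
    INCIDENT_COMMAND = "incident command"
    OPERATIONS = "operations"
    COMMUNICATION = "communication"
    PLANNING = "planning"


@dataclass
class Incident:
    """Hypothetical record of who holds which role during one incident."""
    title: str
    # Maps a person's name to the role assigned by the Incident Command.
    assignments: Dict[str, Role] = field(default_factory=dict)

    def assign(self, person: str, role: Role) -> None:
        self.assignments[person] = role

    def incident_commander(self) -> Optional[str]:
        # Answers the first question to ask: who is running Incident Command?
        for person, role in self.assignments.items():
            if role is Role.INCIDENT_COMMAND:
                return person
        return None

    def can_change_production(self, person: str) -> bool:
        # Only the Operations Team may touch production systems.
        return self.assignments.get(person) is Role.OPERATIONS
```

A change-control script could call `can_change_production()` before applying a fix, so that anyone without an assigned Operations role (a "freelancer" in the article's terms) is turned away.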

Build and implement an effective postmortem process

A postmortem is a critical part of incident management that takes place once the incident is resolved.

Why postmortem?

  • Fully understand and document the incident. You can ask questions such as "What could have been done differently?"
  • Conduct a deep dive "root cause" analysis, producing valuable insights.
  • Learn from the incident. This is the primary benefit of doing postmortems.
  • Identify opportunities for prevention as part of postmortem analysis, e.g., identify a monitoring enhancement to catch an issue sooner in the future.
  • Plan and follow through with assigned activities as part of the postmortem.

Blameless postmortem: a fundamental tenet of SRE

No finger-pointing. People are often scared of postmortems because one person or team may be held responsible for the outage. Avoid finger-pointing at all costs; instead, focus solely on systems and processes and not on individuals. Singling out individuals or teams can create an unhealthy culture. For instance, the next time someone makes a mistake, they will not come forward and admit it. They may hide the activity for fear of being blamed.

Though there is no room for finger-pointing, the postmortem must call out improvement opportunities. This approach helps avoid further similar incidents.

When is a postmortem needed?

Is a postmortem necessary for all incidents or only for certain situations? Here are some suggestions for when a postmortem is useful:

  • End-user experience is impacted beyond a threshold (SLO), for example when the SLO in place is breached due to:
    • Unavailable services
    • Unacceptable performance
    • Erratic functionality
  • Data loss.
  • Organization/group-specific requirements with different policies and protocols to follow.
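The criteria above can be expressed as a small decision helper. This is a minimal sketch under stated assumptions: the `IncidentImpact` fields, the availability-based SLO check, and the `needs_postmortem` function are hypothetical names for illustration, and a real organization would substitute its own SLO measurements and policies.

```python
from dataclasses import dataclass


@dataclass
class IncidentImpact:
    """Hypothetical summary of an incident's impact."""
    availability: float       # measured fraction of successful requests, 0.0 to 1.0
    availability_slo: float   # assumed SLO target, for example 0.999
    data_loss: bool           # True if any data was lost
    required_by_policy: bool = False  # organization- or group-specific requirement


def needs_postmortem(impact: IncidentImpact) -> bool:
    """Return True if any of the suggested postmortem triggers applies."""
    slo_breached = impact.availability < impact.availability_slo
    return slo_breached or impact.data_loss or impact.required_by_policy
```

Availability is only one possible SLO signal; performance or correctness thresholds could be added to the same check in the same way.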

Six minimum items required in a postmortem

The postmortem should include the following six components:

  1. Summary: Provide a succinct incident summary.
  2. Impact (must include any financial impact): Executives will look for impact and financial information.
  3. Root cause(s): Identify the root cause, if possible.
  4. Resolution: What the team actually did to fix the issue.
  5. Monitoring (issue detection): Specify how the incident was identified. Hopefully, this was a monitoring system rather than an end-user complaint.
  6. Action items with due dates and owners: This is important. Do not simply conduct a postmortem and forget the incident. Establish action items, assign owners, and follow through on these.

Some organizations may also include a detailed timeline of occurrences in the postmortem, which can be useful to walk through the sequence of events.

Before the postmortem is published, a supervisor or senior team members must review the document to avoid any errors or misrepresentation of facts.
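As a rough illustration of the six components, the following Python sketch models a postmortem record and reports which sections are still empty before review. The `Postmortem` and `ActionItem` classes and the `missing_sections` method are assumptions made for this example, not part of any tool mentioned in the article.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import List


@dataclass
class ActionItem:
    """A follow-up task; each one needs an owner and a due date."""
    description: str
    owner: str
    due: date


@dataclass
class Postmortem:
    """Hypothetical record holding the six minimum postmortem components."""
    summary: str = ""
    impact: str = ""          # should include any financial impact
    root_causes: List[str] = field(default_factory=list)
    resolution: str = ""
    detection: str = ""       # how the incident was identified (monitoring)
    action_items: List[ActionItem] = field(default_factory=list)

    def missing_sections(self) -> List[str]:
        """List the components that are still empty, in the article's order."""
        sections = {
            "summary": self.summary,
            "impact": self.impact,
            "root_causes": self.root_causes,
            "resolution": self.resolution,
            "detection": self.detection,
            "action_items": self.action_items,
        }
        # Empty strings and empty lists are both falsy.
        return [name for name, value in sections.items() if not value]
```

A reviewer could refuse to publish a document until `missing_sections()` returns an empty list.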

Find postmortem tools and templates

If you haven't done postmortems before, you may be wondering how to get started. You've learned a lot about postmortems thus far, but how do you actually implement one?

That's where tools and templates come into play. There are many tools available. Consider the following:

  1. Existing ITSM tools in your organization. Popular examples include ServiceNow, Remedy, Atlassian ITSM, etc. Existing tools likely provide postmortem tracking capabilities.
  2. Open source tools are also available, the most popular being Morgue, released by Etsy. PagerDuty, a commercial service, is another popular choice.
  3. Develop your own. Remember, SREs are also software engineers! It doesn't have to be fancy, but it must have an easy-to-use interface and a way to store the data reliably.
  4. Templates. These are documents that you can readily use to track your postmortems. There are many templates available, but the most popular ones are:

Wrap up

Here are the key points for the above incident response discussion:

  • An effective on-call system is necessary to ensure service availability and health.
  • Balance workload for on-call engineers.
    • Allocate resources.
    • Use multi-region support.
    • Promote a safe and positive environment.
  • Incident management must facilitate a clear separation of duties.
    • Incident command, operations, planning, and communication.
  • Blameless postmortems help prevent repeated incidents.

Incident management is only one side of the equation. For an SRE organization to be effective, it must also have a change management system in place. After all, changes cause many incidents.

The next article looks at ways to apply effective change management.

Get back to business and continue services in a timely manner by implementing a thorough incident response strategy.

This work is licensed under a Creative Commons Attribution-Share Alike 4.0 International License.

7 summer book recommendations from open source enthusiasts

opensource.com - Tue, 06/21/2022 - 15:00
Joshua Allen Holm - Tue, 06/21/2022 - 03:00

It is my great pleasure to introduce Opensource.com's 2022 summer reading list. This year's list contains seven wonderful reading recommendations from members of the Opensource.com community. You will find a nice mix of books covering everything from a fun cozy mystery to non-fiction works that explore thought-provoking topics. I hope you find something on this list that interests you.

Enjoy!

Image by: O'Reilly Press

97 Things Every Java Programmer Should Know: Collective Wisdom from the Experts, edited by Kevlin Henney and Trisha Gee

Recommendation written by Seth Kenlon

This book was written by 73 different authors working in all aspects of the software industry, and the secret to its greatness is that it applies to much more than just Java programming. Of course, some chapters lean into Java, but topics like "Be aware of your container surroundings," "Deliver better software, faster," and "Don't hIDE your tools" apply to development regardless of language.

Better still, some chapters apply to life in general. "Break problems and tasks into small chunks" is good advice on how to tackle any problem, "Build diverse teams" is important for every group of collaborators, and "From puzzles to products" is a fascinating look at how the mind of a puzzle-solver can apply to many different job roles.

Each chapter is just a few pages, and with 97 to choose from, it's easy to skip over the ones that don't apply to you. Whether you write Java code all day, just dabble, or if you haven't yet started, this is a great book for geeks interested in code and the process of software development.

Image by: Princeton University Press

A City Is Not a Computer: Other Urban Intelligences, by Shannon Mattern

Recommendation written by Scott Nesbitt

These days, it's become fashionable (if not inevitable) to make everything smart: Our phones, our household appliances, our watches, our cars, and, especially, our cities.

With the latter, that means putting sensors everywhere, collecting data as we go about our business, and pushing information (whether useful or not) to us based on that data.

This raises the question: Does embedding all that technology in a city make it smart? In A City Is Not a Computer, Shannon Mattern argues that it doesn't.

A goal of making cities smart is to provide better engagement with and services to citizens. Mattern points out that smart cities often "aim to merge the ideologies of technocratic managerialism and public service, to reprogram citizens as 'consumers' and 'users'." That happens instead of encouraging citizens to be active participants in their cities' wider life and governance.

Then there's the data that smart systems collect. We don't know what and how much is being gathered. We don't know how it's being used and by whom. There's so much data being collected that it overwhelms the municipal workers who deal with it. They can't process it all, so they focus on low-hanging fruit while ignoring deeper and more pressing problems. That definitely wasn't what cities were promised when they were sold smart systems as a balm for their urban woes.

A City Is Not a Computer is a short, dense, well-researched polemic against embracing smart cities just because technologists believe we should. The book makes us think about the purpose of a smart city and who really benefits from making a city smart, and it makes us question whether we need to or even should do that.

Image by: Tilted Windmill Press

git sync murder, by Michael Warren Lucas

Recommendation written by Joshua Allen Holm

Dale Whitehead would rather stay at home and connect to the world through his computer's terminal, especially after what happened at the last conference he attended. During that conference, Dale found himself in the role of an amateur detective solving a murder. You can read about that case in the first book in this series, git commit murder.

Now, back home and attending another conference, Dale again finds himself in the role of detective. git sync murder finds Dale attending a local tech conference/sci-fi convention where a dead body is found. Was it murder or just an accident? Dale, now the "expert" on these matters, finds himself dragged into the situation and takes it upon himself to figure out what happened. To say much more than that would spoil things, so I will just say git sync murder is engaging and enjoyable to read. Reading git commit murder first is not necessary to enjoy git sync murder, but I highly recommend both books in the series.

Michael Warren Lucas's git murder series is perfect for techies who also love cozy mysteries. Lucas has literally written the book on many complex technical topics, and it carries over to his fiction writing. The characters in git sync murder talk tech at conference booths and conference social events. If you have not been to a conference recently because of COVID and miss the experience, Lucas will transport you to a tech conference with the added twist of a murder mystery to solve. Dale Whitehead is an interesting, if somewhat unorthodox, cozy mystery protagonist, and I think most Opensource.com readers would enjoy attending a tech conference with him as he finds himself thrust into the role of amateur sleuth.

Image by: Inner Wings Foundation

Kick Like a Girl, by Melissa Di Donato Roos

Recommendation written by Joshua Allen Holm

Nobody likes to be excluded, but that is what happens to Francesca when she wants to play football at the local park. The boys won't play with her because she's a girl, so she goes home upset. Her mother consoles her by relating stories about various famous women who have made an impact in some significant way. The historical figures detailed in Kick Like a Girl include women from throughout history and from many different fields. Readers will learn about Frida Kahlo, Madeleine Albright, Ada Lovelace, Rosa Parks, Amelia Earhart, Marie Curie, Valentina Tereshkova, Florence Nightingale, and Malala Yousafzai. After hearing the stories of these inspiring figures, Francesca goes back to the park and challenges the boys to a football match.

Kick Like a Girl features engaging writing by Melissa Di Donato Roos (SUSE's CEO) and excellent illustrations by Ange Allen. This book is perfect for young readers, who will enjoy the rhyming text and colorful illustrations. Di Donato Roos has also written two other books for children, How Do Mermaids Poo? and The Magic Box, both of which are also worth checking out.

Image by: Doubleday

Mine!: How the Hidden Rules of Ownership Control Our Lives, by Michael Heller and James Salzman

Recommendation written by Bryan Behrenshausen

"A lot of what you know about ownership is wrong," authors Michael Heller and James Salzman write in Mine! It's the kind of confrontational invitation people drawn to open source can't help but accept. And this book is certainly one for open source aficionados, whose views on ownership—of code, of ideas, of intellectual property of all kinds—tend to differ from mainstream opinions and received wisdom. In this book, Heller and Salzman lay out the "hidden rules of ownership" that govern who controls access to what. These rules are subtle, powerful, deeply historical conventions that have become so commonplace they just seem incontrovertible. We know this because they've become platitudes: "First come, first served" or "You reap what you sow." Yet we see them play out everywhere: On airplanes in fights over precious legroom, in the streets as neighbors scuffle over freshly shoveled parking spaces, and in courts as juries decide who controls your inheritance and your DNA. Could alternate theories of ownership create space for rethinking some essential rights in the digital age? The authors certainly think so. And if they're correct, we might respond: Can open source software serve as a model for how ownership works—or doesn't—in the future?

Image by: Lulu.com

Not All Fairy Tales Have Happy Endings: The Rise and Fall of Sierra On-Line, by Ken Williams

Recommendation written by Joshua Allen Holm

During the 1980s and 1990s, Sierra On-Line was a juggernaut in the computer software industry. From humble beginnings, this company, founded by Ken and Roberta Williams, published many iconic computer games. King's Quest, Space Quest, Quest for Glory, Leisure Suit Larry, and Gabriel Knight are just a few of the company's biggest franchises.

Not All Fairy Tales Have Happy Endings covers everything from the creation of Sierra's first game, Mystery House, to the company's disastrous acquisition by CUC International and the aftermath. The Sierra brand would live on for a while after the acquisition, but the Sierra founded by the Williamses was no more. Ken Williams recounts the entire history of Sierra in a way that only he could. His chronological narrative is interspersed with chapters providing advice about management and computer programming. Ken Williams had been out of the industry for many years by the time he wrote this book, but his advice is still extremely relevant.

Sierra On-Line is no more, but the company made a lasting impact on the computer gaming industry. Not All Fairy Tales Have Happy Endings is a worthwhile read for anyone interested in the history of computer software. Sierra On-Line was at the forefront of game development during its heyday, and there are many valuable lessons to learn from the man who led the company during those exciting times.

Image by: Back Bay Books

The Soul of a New Machine, by Tracy Kidder

Recommendation written by Gaurav Kamathe

I am an avid reader of the history of computing. It's fascinating to know how these intelligent machines that we have become so dependent on (and often take for granted) came into being. I first heard of The Soul of a New Machine via Bryan Cantrill's blog post. This is a non-fiction book written by Tracy Kidder and published in 1981, for which he won a Pulitzer Prize.

Imagine it's the 1970s, and you are part of the engineering team tasked with designing the next-generation computer. The story is set at Data General Corporation, then a minicomputer vendor that was racing against time to compete with the 32-bit VAX computers from Digital Equipment Corporation (DEC). The book outlines how two competing teams within Data General, both wanting to take a shot at designing the new machine, end up in a feud. What follows is a fascinating look at the events that unfold. The book provides insights into the minds of the engineers involved, the management, their work environment, the technical challenges they faced along the way and how they overcame them, how stress affected their personal lives, and much more. Anybody who wants to know what goes into making a computer should read this book.

That is the 2022 suggested reading list. It offers a variety of great options that I believe will give Opensource.com readers many hours of thought-provoking entertainment. Be sure to check out our previous reading lists for even more book recommendations.

Members of the Opensource.com community recommend this mix of books covering everything from a fun cozy mystery to non-fiction works that explore thought-provoking topics.

How to Install MySQL 8 in Fedora 36 Linux

Tecmint - Tue, 06/21/2022 - 14:18
The post How to Install MySQL 8 in Fedora 36 Linux first appeared on Tecmint: Linux Howtos, Tutorials & Guides.

MySQL is one of the oldest and most reliable open-source relational database management systems, which is trusted and used by millions of users on a daily basis. Since Fedora has recently announced their new version

Meta's Transparent Memory Offloading Saves Them 20~32% Of Memory Per Linux Server

Phoronix - Tue, 06/21/2022 - 06:15
Meta's engineering team today published an interesting blog post about Transparent Memory Offloading (TMO) as a new Linux kernel feature they developed that is already used in production on Facebook/Meta servers. Within Meta's data centers this TMO functionality is saving 20~32% memory per server across their millions of servers...

RHEL-Based AlmaLinux Announces "ALBS" Access For Its Public Build System

Phoronix - Tue, 06/21/2022 - 02:20
AlmaLinux today made public ALBS, the AlmaLinux Build System used to construct the recent releases of AlmaLinux 8.6 and AlmaLinux 9.0 across all supported architectures...

AMD PRO 5000 WX Series Coming To More System Integrators, DIY Market Later This Year

Phoronix - Mon, 06/20/2022 - 23:52
After announcing the Threadripper PRO 5000 WX series back in March and with Lenovo being their launch partner for these Zen 3 Ryzen Threadripper CPUs, AMD today shared an update on availability...
