siliconindia | November 2019

CHAOS ENGINEERING TO IMPROVE SYSTEM RESILIENCY

In today's technology-driven world, random glitches in systems have become harder to predict and nearly impossible for companies to afford. These random failures hit a company's bottom line, making downtime a key performance indicator for engineers. A glitch may be a networking fault in one of the data centers, a misconfigured server, a node shutting down, or any other kind of failure that propagates across systems. Such outages can bring catastrophic results and severe downtime in the regular functioning of a system.

A single hour of outage can cost a company millions of dollars. According to Gartner, the average cost of IT downtime is USD 5,600 per minute. Because each business operates differently, the cost of downtime can range from USD 140,000 per hour to USD 540,000 per hour. Since organizations cannot afford to wait for an outage to happen, they should proactively identify system weaknesses and apply chaos engineering practices to mitigate the risks.

Chaos engineering studies how large-scale systems respond to random events. It is a disciplined approach to identifying failures before they become outages. By testing how a system responds under stress, engineers can quickly identify and fix faults. The ultimate purpose of chaos engineering is to limit the chaos of outages caused by random events by carefully investigating ways to make a system more robust. In practice, planned experiments are run on a system to observe how it responds when such a failure occurs.

Chaos engineering originated at Netflix, which needed to be resilient against random host failures while migrating to AWS (Amazon Web Services). This resulted in Netflix releasing Chaos Monkey in 2011.
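The idea of a planned experiment can be sketched in a few lines of Python. This is only a toy simulation under stated assumptions, not how Chaos Monkey or any real tool is implemented: the `Cluster` class, the host names, the `run_experiment` function, and the 99 percent success threshold are all hypothetical and chosen purely for illustration. The sketch measures a baseline, takes one randomly chosen host offline, and checks whether the service still meets its steady-state expectation.

```python
import random


class Cluster:
    """Toy model of a service running on several hosts (hypothetical)."""

    def __init__(self, hosts):
        self.healthy = set(hosts)

    def terminate_random_host(self, rng):
        """Inject the failure: take one randomly chosen healthy host offline."""
        victim = rng.choice(sorted(self.healthy))
        self.healthy.discard(victim)
        return victim

    def handle_request(self):
        """In this toy model a request succeeds while any host is healthy."""
        return len(self.healthy) > 0


def success_rate(cluster, requests=100):
    """Fraction of simulated requests that the cluster serves successfully."""
    ok = sum(cluster.handle_request() for _ in range(requests))
    return ok / requests


def run_experiment(hosts, seed=0, threshold=0.99):
    """Measure a baseline, inject one host failure, then re-measure.

    The threshold is an assumed steady-state expectation, not a standard value.
    """
    rng = random.Random(seed)
    cluster = Cluster(hosts)
    baseline = success_rate(cluster)
    victim = cluster.terminate_random_host(rng)
    after = success_rate(cluster)
    return {
        "victim": victim,
        "baseline": baseline,
        "after_failure": after,
        "hypothesis_holds": after >= threshold,
    }


if __name__ == "__main__":
    print(run_experiment(["host-a", "host-b", "host-c"]))
    print(run_experiment(["only-host"]))
```

With three hosts the simulated service survives the loss of one, while the single-host case fails the check. Exposing that kind of weakness before a real outage does is the purpose of such an experiment.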
Additional failure injections were then added on top of Chaos Monkey, allowing more failure states to be tested and resilience to be built against them. In 2014, Netflix introduced a new role, the Chaos Engineer. Netflix then announced the Failure Injection Testing (FIT) tool, built on the concepts of the Simian Army, to build resilience in its systems against random events; one of its creators later co-founded Gremlin. With many organizations moving to cloud and microservice architectures, the need for chaos engineering has increased in recent years. Large technology companies such as Amazon, Netflix, LinkedIn, Facebook, Microsoft, and Google practice chaos engineering to improve the reliability of their systems.

IN MY OPINION
By George Ukkuru, Head - Quality Engineering, UST Global

An Agile Scrum Master, George has close to two decades of experience, during which he was associated with well-known tech companies such as Caravel, Sunquest Information Systems and SAP Labs India, prior to joining UST Global in 2008.