siliconindia | November 2019 | 9

Chaos Engineering works on the principle of running thoughtful experiments within the system, which bring out insights into how the system responds to failures. There are three steps involved:

Step-1: Identify a fault that can be injected and create a hypothesis about the expected outcome by mapping it to IT or business metrics.

Step-2: Execute a test to measure parameters around the availability and resilience of the system. The tests focus on creating a failure, for example by increasing CPU utilization or inducing a DNS outage.

Step-3: Determine the success of the tests. The tests are halted if there is an impact on the metrics, and the failures are analyzed. The chaos experiment is considered successful only if a failure occurs. If the system is found to be resilient, the tests are repeated with an increased blast radius, that is, with the fault applied to a larger part of the system.

After the completion of the experiment, the insights obtained provide information on the real-world behavior of the system during random failures. This helps engineering teams fix issues or define rollback plans.

Introducing Chaos Engineering in the organization brings both business and technical benefits. For the business, Chaos Engineering helps prevent significant revenue losses, improves incident management response, improves on-call training for engineering teams, and improves the resiliency of systems. From the technical point of view, data obtained from chaos experiments results in an increased understanding of system failure modes, improved system design, and a reduction in repeated incidents and on-call burden.

There are many tools available in the market that let companies practice Chaos Engineering. Chaos Monkey, Gremlin, and Simian Army are a few that can be easily implemented in the organization.
Organizations can also build their own Chaos Engineering tools using code from open-source tools. The process may be time-consuming and expensive, but it gives complete control over the tool and options to customize it, and can be more secure.

Predicting system failures has become difficult due to complex application architectures. As the cost of downtime is high, organizations should take a proactive approach to preventing crashes by applying chaos engineering practices.
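
For teams curious about what a home-grown tool might look like, the three-step loop described earlier can be sketched in a few lines of Python. This is only an illustrative sketch, not how Chaos Monkey, Gremlin, or any other product works: the names `Hypothesis`, `RunResult`, `run_experiment`, and `simulated_cpu_fault` are hypothetical, the 1% error-rate threshold is an assumed example, and the fault injector is a simulation rather than a real CPU stress test.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Hypothesis:
    """Step 1: the fault to inject and the metric bound expected to hold."""
    fault_name: str
    metric_name: str
    max_allowed: float  # the metric must stay at or below this value


@dataclass
class RunResult:
    blast_radius: float  # fraction of hosts the fault was applied to
    metric_value: float
    hypothesis_held: bool


def run_experiment(
    hypothesis: Hypothesis,
    inject_and_measure: Callable[[float], float],
    blast_radii: List[float],
) -> List[RunResult]:
    """Steps 2 and 3: inject the fault at growing blast radii and stop at
    the first radius where the metric breaches the hypothesis."""
    results: List[RunResult] = []
    for radius in blast_radii:
        value = inject_and_measure(radius)  # Step 2: execute and measure
        held = value <= hypothesis.max_allowed
        results.append(RunResult(radius, value, held))
        if not held:
            break  # Step 3: halt on impact so the failure can be analyzed
    return results


def simulated_cpu_fault(radius: float) -> float:
    """Hypothetical stand-in for a real fault injector: pretends the error
    rate grows linearly with the share of hosts under CPU stress."""
    return round(0.002 + 0.05 * radius, 4)


if __name__ == "__main__":
    hypothesis = Hypothesis("cpu_stress", "error_rate", max_allowed=0.01)
    radii = [0.05, 0.10, 0.25, 0.50]
    for r in run_experiment(hypothesis, simulated_cpu_fault, radii):
        status = "held" if r.hypothesis_held else "breached, halting"
        print(f"blast radius {r.blast_radius:.0%}: "
              f"{hypothesis.metric_name}={r.metric_value} ({status})")
```

In a real setup, `inject_and_measure` would be replaced by a call that applies the fault to the chosen share of hosts and reads the metric from the monitoring system. The early `break` mirrors the rule that tests are halted as soon as the metrics are impacted.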