Abhishek Jain
6 min read · Jun 26, 2022


Resiliency Testing Automation Approach for Containerized Distributed/Decentralized/Blockchain Applications

Distributed/DLT/Blockchain Application — Failure Types

Background: Enterprise-level distributed/decentralized applications are designed and developed to be fault tolerant. Despite building for fault tolerance, no one can be 100% sure that the application will come back gracefully in the event of failures. The nature of a failure can differ each time, so developers have to design for every kind of anticipated failure. From a broader perspective, those failures can be:

Failure Type1: Network Level Failures

Failure Type2: Infrastructure (System or Hardware Level) Failures

Failure Type3: Application Level Failures

Failure Type4: Component Level Failures

Define a 3-step process for resiliency testing: Developers try to build a robust application that comes back gracefully from all probable failures. Due to the complex nature of these applications, unseen failures still keep coming up in production. It has become extremely important for testers to keep verifying the developed logic and to establish how resilient the system is against such real-world failures. What means do testers have for mimicking real-world failures to prove how resilient the application is? Resiliency testing is the methodology that helps mimic every kind of failure defined above. Let's define a generic process for each failure before defining the resiliency testing strategy for distributed as well as decentralized applications. Based on our experience across multiple customer engagements for resiliency testing, the following 3-step process should be completed before defining the resiliency strategy:

1. Step-1: Identify all components/services and any third-party libraries, tools, or utilities.

2. Step-2: Document the intended functionality of each component/service/library/tool/utility.

3. Step-3: Map the upstream & downstream interfaces, along with the expected results required for the integration to function as per the acceptance criteria (see the sketch below).
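
To make this mapping concrete, here is a minimal Python sketch of how such a component inventory could be recorded; the component names, functionality descriptions, and interface fields are hypothetical placeholders rather than details of any specific product:

```python
# Hypothetical component inventory built from the 3-step process:
# Step-1 identifies each component, Step-2 records its intended
# functionality, Step-3 records its interfaces and acceptance criteria.
component_inventory = {
    "peer-gateway": {
        "functionality": "Maintains P2P connections to other nodes",
        "upstream": ["client-api"],
        "downstream": ["consensus-engine"],
        "acceptance": "Reconnects to all peers within 30s of restart",
    },
    "consensus-engine": {
        "functionality": "Orders and commits transactions",
        "upstream": ["peer-gateway"],
        "downstream": ["state-db"],
        "acceptance": "No committed transaction is lost across a restart",
    },
}

def downstream_of(component: str) -> list:
    """Return the components directly impacted when `component` fails."""
    return component_inventory.get(component, {}).get("downstream", [])
```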

3-Step Process to Analyze Failure Impact on Functionalities

Following this process, the tester has to collect all functional/non-functional requirements, along with acceptance criteria, for all 4 failure types mentioned above. Once all the information is collected, it should be mapped onto the 3-step process to lay down what has to be verified for each component/service. Once each failure is mapped to the 3-step process, we are ready to define the testing strategy and automate it, improving correctness while reducing execution time. We have defined the following 5 ways to set up a distributed/decentralized network in a testing environment:

Distributed/DLT/Blockchain Application — Infrastructure Types

Infrastructure Type1: Setup physical machines with LAN

Infrastructure Type2: Setup virtual machines

Infrastructure Type3: Container orchestration (Kubernetes) or Docker containers on the same machine, where each container can be treated as an isolated machine

Infrastructure Type4: Cloud Platform

Infrastructure Type5: Hybrid infrastructure (a combination of any 2 or more of the above infrastructure types)

Each approach has its own advantages and disadvantages for setting up the application in a test environment. Our preference is to go with the 3rd & 5th approaches, as those appear to be the infrastructures customers use most in the real world.
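
For Infrastructure Type3, here is a minimal sketch using the Docker SDK for Python (the `docker` package) to stand up a few containers on a user-defined bridge network so that each container plays the role of an isolated node; the image name, network name, and node count are illustrative assumptions:

```python
import docker

client = docker.from_env()

# Hypothetical test network; each container attached to it acts as a node.
client.networks.create("resiliency-testnet", driver="bridge")

nodes = []
for i in range(4):
    node = client.containers.run(
        "my-dlt-node:latest",          # assumed application image
        name=f"node-{i}",
        network="resiliency-testnet",
        detach=True,
    )
    nodes.append(node)
```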

Based on our experience defining resiliency for distributed applications, we divide resiliency testing into the following 3 modes (each mode is executed with both controlled & uncontrolled wait times):

1. Mode1: Controlled execution for forcefully restarting components/services — the component restarts are executed in a defined sequence with defined expected outcomes. Generally, we flow a successful & a failed transaction, then ensure each transaction is reflected in the overall system behaviour. If possible, we also assert the individual component/service responses to the flowed transactions, based on the intended functionality of the restarted component/service. This kind of execution can be done with:

a. A defined, fixed wait-time duration for the restart

b. A randomly selected wait-time interval

2. Mode2: Uncontrolled execution (randomized choice of component/service) for forcefully restarting components/services — the component to restart is selected randomly, with defined expected outcomes. As in Mode1, we flow a successful & a failed transaction, ensure their reflection in the overall system behaviour, and, where possible, assert the individual component/service responses. This kind of execution can be done with:

a. A defined, fixed wait-time duration for the restart

b. A randomly selected wait-time interval

3. Mode3: Uncontrolled execution (randomized choice of multiple components/services) for forcefully restarting components/services — though this kind of test is the most realistic to perform, it carries a lot of complexity depending upon how the components/services are designed. If there are too many components/services, the number of test-scenario combinations grows exponentially. The tester should therefore design tests with the help of the system/application architecture, grouping components/services so that each group represents an entity within the system; Mode1 & Mode2 can then be executed for those groups. A sketch of all three modes follows this list.
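
Here is a minimal sketch of the three modes using the Docker SDK for Python; the service names are hypothetical, the restart is implemented as a plain container stop/start, and the transaction-flow assertions are left as a placeholder hook:

```python
import random
import time
import docker

client = docker.from_env()

# Hypothetical restartable application containers.
SERVICES = ["peer-gateway", "consensus-engine", "state-db", "client-api"]

def restart_and_verify(name: str, wait_seconds: float) -> None:
    """Stop a container, wait, start it again, then run assertions."""
    container = client.containers.get(name)
    container.stop()
    time.sleep(wait_seconds)   # fixed or randomized down time
    container.start()
    # Placeholder: flow a successful and a failed transaction here and
    # assert their reflection in the overall system behaviour.

def mode1_controlled(fixed_wait: float = 10.0) -> None:
    """Mode1: restart every service in a defined sequence."""
    for name in SERVICES:
        restart_and_verify(name, fixed_wait)

def mode2_random_single() -> None:
    """Mode2: restart one randomly chosen service with a random wait."""
    restart_and_verify(random.choice(SERVICES), random.uniform(1, 30))

def mode3_random_group(group_size: int = 2) -> None:
    """Mode3: stop a random group of services together, then restart them."""
    group = random.sample(SERVICES, group_size)
    for name in group:
        client.containers.get(name).stop()
    time.sleep(random.uniform(1, 30))
    for name in group:
        client.containers.get(name).start()
```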

Failure Type1 — Network Level Failures: Since distributed/decentralized applications use peer-to-peer networking to establish connections among the nodes, we need the specifics of the communication component/service: how it can be restarted, and how to verify behaviour while it is down and while it is restarting. Assuming the system has 1 container within each node responsible for setting up communication with the other available nodes, the following verifications can be performed (a sketch for inducing the failure follows the list):

1. During the downtime, other nodes are unable to communicate with the down node

2. The down node has no cascading effect on the rest of the nodes in the network

3. After the restart & initialization of the restarted component/service, other nodes are not only able to re-establish communication with the node but can also have transactions processed by it

4. The recovered node can likewise interact with the other nodes in the system & route transactions as expected

5. Data consistency can be verified

6. The latency of the system can also be captured before/after the restart to ensure no performance degradation has been introduced into the system.
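
One way to induce this failure without killing the node's process is to detach the communication container from the shared network and later reattach it; a sketch assuming the Docker SDK for Python, the hypothetical `resiliency-testnet` network from earlier, and a node container named `node-1`:

```python
import time
import docker

client = docker.from_env()
network = client.networks.get("resiliency-testnet")  # assumed network name
comm = client.containers.get("node-1")               # assumed comm container

# Simulate the partition: node-1 disappears from the P2P network.
network.disconnect(comm)
# ...verify other nodes cannot reach node-1 and show no cascading failures...
time.sleep(30)

# Heal the partition and verify node-1 rejoins and processes transactions.
network.connect(comm)
```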

Failure Type2: Infrastructure (System or Hardware Level) Failures: Since the entire network runs through containerized techniques, we can mimic infrastructure failures with multiple strategies, such as:

1. Bringing the container runtime down, e.g., if Docker is being used, bringing the Docker daemon process down

2. Setting container-level resource limits (memory, CPUs, etc.) so low that they are easily exhausted under mild load on the system

3. Overloading the system with a high number of transactions, with varying sizes of transaction-generated data

Under each of the failures described above, we can verify whether all functional & non-functional requirements are still being met by the system as a whole.
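
For strategy 2 above, the Docker SDK for Python can shrink a running container's resource limits so that even mild load exhausts them; a minimal sketch with an assumed container name and illustrative limit values:

```python
import docker

client = docker.from_env()
container = client.containers.get("consensus-engine")  # assumed name

# Tighten memory and CPU ceilings so mild load exhausts them.
container.update(
    mem_limit="64m",        # very low memory ceiling
    memswap_limit="64m",    # no extra swap headroom
    cpu_period=100_000,
    cpu_quota=10_000,       # roughly 10% of one CPU
)
```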

Failure Type3: Application Level Failures: Since a distributed application runs across many containers, here we target stopping and starting only the specific containers holding the application logic. The important aspect of restarting application containers is the timing of the stop and start, so that transaction processing can be tracked. There are 3 time-dependent stages for stopping & starting an application container:

1. Stage1: Stop the container before sending the transaction

2. Stage2: Stop the container after sending the transaction, at different time intervals, e.g., stopping the container immediately, after 1000 milliseconds, after 10 seconds, etc.

3. Stage3: Stop the container while the transaction is in the processing stage

For all 3 stages above, the system behaviour can be captured and asserted against the functional and non-functional acceptance criteria.
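
The three stages can be driven from the test harness by controlling when the stop is issued relative to the transaction send; a minimal sketch in which `send_transaction` is a hypothetical hook into the application under test:

```python
import threading
import time
import docker

client = docker.from_env()

def send_transaction() -> None:
    """Placeholder: submit a transaction to the application under test."""

def stop_at_stage(name: str, stage: int, delay: float = 0.0) -> None:
    container = client.containers.get(name)
    if stage == 1:
        # Stage1: stop the container before the transaction is sent.
        container.stop()
        send_transaction()
    elif stage == 2:
        # Stage2: stop the container a fixed delay after sending,
        # e.g. delay = 0.0, 1.0 or 10.0 seconds.
        send_transaction()
        time.sleep(delay)
        container.stop()
    else:
        # Stage3: stop while the transaction is still being processed,
        # by issuing the stop concurrently with the send.
        t = threading.Thread(target=send_transaction)
        t.start()
        container.stop()
        t.join()
    container.start()  # restart, then assert the acceptance criteria
```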

Failure Type4: Component Level Failures: The tester should verify the remaining containers across all 3 modes and all 3 time-dependent stages. We can create as many scenarios for these containers as needed, depending upon the following factors:

1. The dependency of the remaining containers on other critical containers

2. The intended functionality of each container & how frequently it is used in the most common transactions

3. Stop & start over various time intervals (include all 3 stages to generate more scenarios and target any fragile situation)

4. The most fragile, unstable, or most frequently error-prone areas within the remaining containers

After following the resiliency strategy defined above, the tester should always reconcile against the application under test to check whether any areas are still left uncovered. If any component/service/third-party module, tool, or utility remains untouched, we can design scenarios by combining the following factors (a scenario-generation sketch follows the list):

1. Testing modes

2. Time interval stages

3. Execution order, e.g., sequential vs. randomized restarts

4. Grouping of containers for stopping and restarting
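
Where gaps remain, scenarios can be generated mechanically as the cross product of these factors; a minimal sketch using `itertools.product`, with illustrative factor values and hypothetical container groups:

```python
from itertools import product

MODES = ["controlled", "random-single", "random-group"]
STAGES = [1, 2, 3]            # stop before / after / during processing
EXECUTION = ["sequential", "randomized"]
GROUPS = [("peer-gateway",), ("consensus-engine", "state-db")]  # assumed

# Every combination becomes one resiliency scenario to schedule.
for mode, stage, execution, group in product(MODES, STAGES, EXECUTION, GROUPS):
    print(f"mode={mode} stage={stage} execution={execution} group={group}")
```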

Based on our defined approach, followed by its implementation for multiple customers, we have prevented roughly 60–70% of real-world issues related to resiliency. We are still striving to increase that prevention percentage with a more advanced strategy, though we have not achieved it yet. However, by continually revising our approach based on new experiences with new types of complicated distributed or decentralized applications & new failures, we can definitely keep increasing the prevention of real-world issues.

In our next blog, we will generalize our automation approach for this resiliency strategy. Please stay tuned for our next blog.


Abhishek Jain

Blockchain Evangelist & Enthusiast with 13 years of experience as a Software Test Automation Architect - https://www.linkedin.com/in/abhishek-jain-31a72133/