Leveraging Queue Traffic Shadowing for Refactoring Purposes

August 13, 2024

Refactoring a software system can be daunting, especially when it involves microservices and complex workflows. Queue traffic shadowing lets you refactor such a system while minimizing downtime and keeping new changes from disrupting existing processes. In this guide, we cover the technical requirements, implementation steps, and benefits of traffic shadowing, using a real-world case study to illustrate the process.

Understanding Traffic Shadowing

Traffic shadowing is a technique for testing new features, or entire applications, against production traffic before they are released. Production traffic is duplicated and the copy is sent to the new application or feature, so you can observe how the changes behave under real-world conditions while the production path keeps serving requests unaffected.

Why Traffic Shadowing?

Minimal Downtime: Ensures that your system remains operational while you test and implement new features.
Error Detection: Identifies potential bugs and issues in the new feature before it goes live.
Performance Monitoring: Observes how the new feature behaves under actual traffic, helping in performance tuning.
Customer Satisfaction: Reduces the risk of releasing buggy or unstable features to end-users.

The Case Study: A Real-World Example

In our scenario, a customer needed significant changes to one of the major flows in the system. They wanted parts of the flow to be toggleable and parameterizable, aiming for:

No bugs at all.
Minimal deviation in results between old and new versions.
Minimum system downtime.
Uninterrupted development pace for other teams.

Technical Requirements and Challenges

The refactoring involved extensive changes across various parts of the code, including:

Data retrieval for the process.
Initial selection and preparation of the data.
Actual processing of the data.

This seemingly simple task turned into a massive refactoring operation, requiring careful planning to avoid downtime and maintain other teams' productivity.

Implementing Traffic Shadowing

Implementing traffic shadowing involves several key steps. Let's break them down:

1. Getting Traffic to Test Clusters

The first step is to copy part of the production traffic to the test clusters without impacting the critical path. The traffic is duplicated, and the copy is sent to the new application or feature while the original continues through production unchanged.
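As a minimal sketch of this step, the snippet below uses Python's in-process `queue.Queue` as a stand-in for a real message broker. The function name `publish_with_shadow`, the `sample_rate` parameter, and the bounded shadow queue are assumptions made for illustration, not details of the system described in this article.

```python
import copy
import queue
import random


def publish_with_shadow(message, primary, shadow, sample_rate=0.1, rng=random.random):
    """Enqueue on the primary queue, then best-effort copy to the shadow queue.

    Returns True if a shadow copy was enqueued, False otherwise.
    """
    primary.put(message)  # critical path: always happens first
    if rng() < sample_rate:
        try:
            # Deep copy so a shadow consumer cannot mutate the production message.
            shadow.put_nowait(copy.deepcopy(message))
            return True
        except queue.Full:
            # A full shadow queue drops the copy instead of blocking production.
            return False
    return False


primary_queue = queue.Queue()
shadow_queue = queue.Queue(maxsize=2)  # bounded so a slow test cluster cannot pile up work

shadowed = [
    publish_with_shadow({"id": i}, primary_queue, shadow_queue, sample_rate=1.0)
    for i in range(5)
]
print(primary_queue.qsize(), shadow_queue.qsize(), shadowed)
```

The design choice worth noting is that the shadow write is allowed to fail silently: losing a shadow copy only costs test coverage, while blocking on it would put the test cluster on the critical path.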

2. Annotating Traffic

Once the traffic reaches the test cluster, it's essential to mark it as shadowed traffic to differentiate it from actual production traffic. This helps in monitoring and analysis.
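One possible way to mark shadowed traffic is a header on the message itself. The sketch below assumes messages are dictionaries with an optional `headers` field; the header names `x-shadow` and `x-shadow-source` are hypothetical and chosen only for this example.

```python
SHADOW_HEADER = "x-shadow"  # hypothetical header name, not a standard


def annotate_as_shadow(message, source="production"):
    """Return a tagged copy of the message; the original is left untouched.

    The message and its headers are copied shallowly, so the payload object
    is shared between the original and the tagged copy.
    """
    tagged = dict(message)
    headers = dict(message.get("headers", {}))
    headers[SHADOW_HEADER] = "true"
    headers["x-shadow-source"] = source
    tagged["headers"] = headers
    return tagged


def is_shadow(message):
    """True if the message carries the shadow marker."""
    return message.get("headers", {}).get(SHADOW_HEADER) == "true"


original = {"id": "m-1", "payload": {"amount": 10}, "headers": {"trace": "abc"}}
shadow_copy = annotate_as_shadow(original)
print(is_shadow(original), is_shadow(shadow_copy))
```

Downstream code can then call `is_shadow` to keep shadow results out of production metrics and side effects.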

3. Comparing Live and Test Traffic

After shadowing, compare the results the live service produced with those the test cluster produced for the same traffic. This helps identify discrepancies and confirm that the new feature performs as expected.
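A comparison like this can be sketched as a field-by-field diff of two result records. The record layout, the `processed_at` field, and the tolerance value below are illustrative assumptions; in practice the fields to ignore and the acceptable deviation depend on the flow being refactored.

```python
import math

_MISSING = object()  # distinguishes "field absent" from "field is None"


def compare_results(live, shadow, rel_tol=1e-6, ignore=("processed_at",)):
    """Return {field: (live_value, shadow_value)} for every field that differs."""
    diffs = {}
    for key in sorted(set(live) | set(shadow)):
        if key in ignore:
            continue  # e.g. timestamps that legitimately differ between runs
        a = live.get(key, _MISSING)
        b = shadow.get(key, _MISSING)
        if isinstance(a, float) and isinstance(b, float):
            # Floats are compared with a tolerance rather than exactly.
            same = math.isclose(a, b, rel_tol=rel_tol)
        else:
            same = a == b
        if not same:
            diffs[key] = (a, b)
    return diffs


live_row = {"id": 7, "total": 100.0, "status": "done", "processed_at": 1}
shadow_row = {"id": 7, "total": 100.00001, "status": "pending", "processed_at": 2}
row_diffs = compare_results(live_row, shadow_row)
print(row_diffs)
```

Here only `status` is reported: the totals fall within the tolerance and `processed_at` is ignored.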

4. Stubbing Collaborating Services

For certain test profiles, stub out collaborating services to isolate the functionality being tested. This helps in focused testing and analysis.
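To illustrate stubbing, the sketch below replaces a collaborating service with a stand-in that returns canned values and records how it was called. `StubRateService`, `get_rate`, and `convert` are hypothetical names invented for this example; they do not come from the system described here.

```python
class StubRateService:
    """Stand-in for a collaborating service: canned answers, recorded calls."""

    def __init__(self, rates):
        self._rates = dict(rates)
        self.calls = []

    def get_rate(self, currency):
        self.calls.append(currency)
        return self._rates[currency]


def convert(message, rate_service):
    """Flow under test; it depends on the collaborator only through get_rate()."""
    rate = rate_service.get_rate(message["currency"])
    return {"id": message["id"], "amount_eur": round(message["amount"] * rate, 2)}


stub = StubRateService({"USD": 0.5})
converted = convert({"id": "m-1", "amount": 10.0, "currency": "USD"}, stub)
print(converted, stub.calls)
```

Because the flow receives its collaborator as an argument, the same code can run against the real service in production and against the stub in a test profile.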

5. Database Virtualization and Materialization

Virtualizing the test cluster’s database and materializing it keeps the test environment close to the production environment, which makes testing and analysis more accurate.
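One simple reading of this step is giving the test cluster its own materialized copy of a database, so that writes from the shadowed flow never reach the source. The sketch below shows that idea with SQLite from Python's standard library; it is an assumption-laden illustration, and the table name `results` and the helper `materialize_snapshot` are hypothetical. A real setup would use whatever snapshot or virtualization tooling the production database offers.

```python
import sqlite3


def materialize_snapshot(source):
    """Copy the whole source database into a private in-memory database."""
    snapshot = sqlite3.connect(":memory:")
    source.backup(snapshot)  # sqlite3.Connection.backup, available since Python 3.7
    return snapshot


# Stand-in for a production-like database.
prod_like = sqlite3.connect(":memory:")
prod_like.execute("CREATE TABLE results (id INTEGER PRIMARY KEY, value TEXT)")
prod_like.execute("INSERT INTO results (value) VALUES ('baseline')")
prod_like.commit()

# The shadowed flow writes only to its own copy.
test_db = materialize_snapshot(prod_like)
test_db.execute("INSERT INTO results (value) VALUES ('written by shadow flow')")
test_db.commit()

prod_count = prod_like.execute("SELECT COUNT(*) FROM results").fetchone()[0]
test_count = test_db.execute("SELECT COUNT(*) FROM results").fetchone()[0]
print(prod_count, test_count)
```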

Our System Architecture

The system we developed consisted of several microservices, with one microservice triggered by queue messages being the focal point of our refactoring. These messages passed through several stages:

Aggregation in a single queue.
Distribution across priority queues according to message priority.
Final processing into rows in an SQL database.
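The distribution stage can be sketched as a small router that drains the aggregation queue into per-priority queues. The priority names and the `priority` message field are assumptions made for this example; the article does not specify how priorities are encoded.

```python
import queue

PRIORITIES = ("high", "normal", "low")  # hypothetical priority levels


def distribute(aggregate, priority_queues, default="normal"):
    """Drain the aggregation queue into per-priority queues; return messages moved."""
    moved = 0
    while True:
        try:
            message = aggregate.get_nowait()
        except queue.Empty:
            return moved
        level = message.get("priority", default)
        # Unknown priority values fall back to the default queue.
        target = priority_queues.get(level, priority_queues[default])
        target.put(message)
        moved += 1


aggregate_queue = queue.Queue()
for msg in (
    {"id": 1, "priority": "high"},
    {"id": 2},
    {"id": 3, "priority": "low"},
    {"id": 4, "priority": "urgent"},  # not a known level
):
    aggregate_queue.put(msg)

by_priority = {name: queue.Queue() for name in PRIORITIES}
moved_count = distribute(aggregate_queue, by_priority)
print(moved_count, {name: q.qsize() for name, q in by_priority.items()})
```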

We created a separate queue to redirect traffic for the refactored flow, added consumers to this queue, and controlled the traffic redirection with a configurable switch. This setup allowed us to manage the load and seamlessly switch between the old and new flows without affecting the overall system.
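A configurable switch of this kind might look like the sketch below. The `RedirectSwitch` class, its `enabled` and `percent` settings, and the hash-based bucketing are assumptions for illustration; the article does not describe how the actual switch was implemented.

```python
import queue
import zlib
from dataclasses import dataclass


@dataclass
class RedirectSwitch:
    """Hypothetical configuration controlling how much traffic is redirected."""

    enabled: bool = False
    percent: int = 0  # share of messages (0 to 100) sent to the refactored flow

    def use_new_flow(self, message_id: str) -> bool:
        if not self.enabled:
            return False
        # A stable hash keeps each message id in the same bucket on every call,
        # so raising the percentage only adds messages to the new flow.
        bucket = zlib.crc32(message_id.encode("utf-8")) % 100
        return bucket < self.percent


def dispatch(message, switch, old_queue, new_queue):
    """Put the message on the queue selected by the switch and return that queue."""
    target = new_queue if switch.use_new_flow(message["id"]) else old_queue
    target.put(message)
    return target


old_flow_queue, new_flow_queue = queue.Queue(), queue.Queue()
messages = [{"id": f"msg-{i}"} for i in range(10)]

for m in messages:
    dispatch(m, RedirectSwitch(enabled=False, percent=100), old_flow_queue, new_flow_queue)
for m in messages:
    dispatch(m, RedirectSwitch(enabled=True, percent=100), old_flow_queue, new_flow_queue)

print(old_flow_queue.qsize(), new_flow_queue.qsize())
```

Keeping the switch in configuration rather than in code means the load on the new flow can be raised, lowered, or cut off entirely without a deployment.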

Deployment and Testing

We deployed our system across three environments:

Test Environment: Used for regression and integration tests with anonymized data and real resources.
Development Environment: A sandbox for the customer to verify feature functionality with anonymized production data.
Production Environment: The live environment where actual user traffic is handled.

These environments were isolated from each other, ensuring that real data was only present in the production environment.

Execution and Iterative Improvements

We implemented the necessary changes and tested them extensively. During the testing phase, we identified and fixed several bugs. Each iteration involved:

Deploying the refactored flow to production.
Redirecting traffic to the alternative flow.
Monitoring results and fixing discrepancies.
Repeating the process until satisfactory results were achieved.

This approach ensured no downtime and allowed us to continuously refine the refactored flow based on real-world feedback.
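The monitoring part of each iteration can be reduced to a small summary of how many messages produced different results in the old and new flows. The sketch below assumes the discrepancies have already been collected per message; the function names and the `max_rate` threshold are hypothetical and would be set by whatever deviation the customer accepts.

```python
from collections import Counter


def summarize(diffs_per_message):
    """Summarize discrepancies.

    diffs_per_message maps a message id to the list of field names that
    differed between the old and new flow (an empty list means a match).
    """
    total = len(diffs_per_message)
    mismatched = sum(1 for fields in diffs_per_message.values() if fields)
    by_field = Counter(f for fields in diffs_per_message.values() for f in fields)
    rate = mismatched / total if total else 0.0
    return {"total": total, "mismatched": mismatched, "rate": rate, "by_field": by_field}


def needs_another_iteration(summary, max_rate=0.0):
    """True while the mismatch rate is above the accepted threshold."""
    return summary["rate"] > max_rate


iteration_summary = summarize(
    {"m1": [], "m2": ["total"], "m3": ["total", "status"], "m4": []}
)
print(iteration_summary, needs_another_iteration(iteration_summary))
```

The per-field counts point at which part of the refactored flow to look at first in the next iteration.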

Final Deployment and Clean-Up

Once the customer approved the changes, we removed the old flow and cleaned up the temporary structures used for traffic shadowing. We also updated our regression and integration tests to align with the new flow.

Conclusion

Refactoring using traffic shadowing proved to be highly effective. It allowed us to meet our customer's stringent requirements related to downtime and feature stability. Additionally, we could iteratively improve the new flow based on real-world feedback without disrupting other development activities.

From a developer's perspective, traffic shadowing offers a robust method to test and deploy new features in a live environment safely. While it may pose implementation challenges, the benefits it brings in terms of error detection, performance monitoring, and overall system stability make it a worthwhile endeavor.