Load testing for real-time data processing
Overview
Load testing is an important part of system development: it confirms whether the system can handle the expected volume of requests. In this post, I want to share what I learned from previous projects, along with some useful tools.
Key building blocks for load testing
According to this document, load testing has three key building blocks.
- Prepare production-like test environment
- Simulate user activity on scalable agent
- Monitor and/or log to measure the impact
In a real-world project, I also needed to do this cost-effectively. That covers two kinds of cost:
- Infrastructure cost: The IT department asked the team to keep the cost of test infrastructure as low as possible.
- Operation cost: Preparing the environment takes longer if I provision it manually. That slows the team down, both in running tests and in deciding next actions from the results.
In short, our team needed to provision, test, and clean up the environment automatically, so that we could decide next actions from the test results quickly and at low cost.
Prepare production-like test environment
To iterate on load testing and performance tuning, I needed a clean, production-like test environment for each run, to avoid side effects from old data. For this, I recommend Infrastructure as Code (IaC) with Terraform.
For a more concrete picture, let’s assume you need to implement a smart factory solution with this architecture and need to test the colored area. If you want to provision it, please read this article and run the sample code on GitHub.
In the project, this automated provisioning helped me a lot. For example:
- Clean up Event Hub: Event Hub has no method to clear old messages, so I needed to delete and re-provision it.
- Try several partition counts: Event Hub doesn’t allow us to modify the partition count, even though I needed to try several different values.
- Update multiple environment variables, such as batch size.
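The delete-and-reprovision cycle described above can be scripted so that every test run starts from a clean environment and is torn down afterwards. The following is only a minimal sketch, not the sample code referenced in this post: it assumes the Terraform CLI is installed and run from the directory containing the configuration, and the variable names `partition_count` and `batch_size`, as well as the `load_test` callback, are hypothetical.

```python
import subprocess
from typing import Any, Callable, Dict, List, Sequence

Runner = Callable[[Sequence[str]], None]


def run_cli(cmd: Sequence[str]) -> None:
    """Run a CLI command and raise if it exits with a non-zero status."""
    subprocess.run(cmd, check=True)


def terraform_cmd(action: str, variables: Dict[str, Any]) -> List[str]:
    """Build 'terraform apply|destroy -auto-approve -var name=value ...'."""
    cmd = ["terraform", action, "-auto-approve"]
    for name, value in sorted(variables.items()):
        cmd += ["-var", f"{name}={value}"]
    return cmd


def run_iteration(
    variables: Dict[str, Any],
    load_test: Callable[[Dict[str, Any]], Any],
    runner: Runner = run_cli,
) -> Any:
    """Provision, run one load test, and always destroy the environment."""
    try:
        runner(terraform_cmd("apply", variables))
        return load_test(variables)
    finally:
        # Destroy even if provisioning or the test failed, so that old
        # messages and idle resources never leak into the next run.
        runner(terraform_cmd("destroy", variables))
```

Calling `run_iteration({"partition_count": n, "batch_size": 64}, my_load_test)` in a loop over several values of `n` would then try each partition count against a fresh Event Hub. The `runner` parameter exists only so the flow can be exercised without touching real infrastructure.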
Simulate user activity on scalable agent
This is one of the most challenging areas for a real-time processing system, because popular load testing frameworks such as JMeter and Locust assume a web-based application (please comment if I missed something), not a real-time processing application.
Therefore, I implemented this sample code. It provisions multiple Azure Container Instances and generates load from them as scalable agents. If this doesn’t fit your goal, you may want to use the following sample code.
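Independent of the linked samples, the basic idea behind scalable agents can be sketched in a few lines of Python: split the target rate across agents, then have each agent send its share once per second. This is an illustration under assumptions, not the implementation from the sample code. The names `split_rate`, `run_agent`, and `send_batch` are hypothetical, and `send_batch` stands in for whatever client actually publishes the messages.

```python
import time
from typing import Callable, List


def split_rate(total_per_sec: int, agents: int) -> List[int]:
    """Divide a target rate across agents, spreading any remainder."""
    base, extra = divmod(total_per_sec, agents)
    return [base + (1 if i < extra else 0) for i in range(agents)]


def run_agent(
    rate_per_sec: int,
    duration_sec: int,
    send_batch: Callable[[int], None],
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Send rate_per_sec messages each second for duration_sec seconds."""
    sent = 0
    start = clock()
    for second in range(duration_sec):
        send_batch(rate_per_sec)  # hypothetical: publish this many messages
        sent += rate_per_sec
        # Wait for the next one-second boundary so the rate stays steady
        # even when sending itself takes a variable amount of time.
        remaining = start + second + 1 - clock()
        if remaining > 0:
            sleep(remaining)
    return sent


print(split_rate(12_000, 5))  # [2400, 2400, 2400, 2400, 2400]
```

Each container would run `run_agent` with its own share of the rate, so increasing the load means adding agents rather than pushing a single process harder.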
Monitor and/or log to measure the impact
Before starting a load test, let’s plan it based on this best practice. In a real project the product owner (PO) decides the scenario, but in this article let’s use the following scenario and success criteria.
[Step 1 and 2: Identify key scenarios and determine load]
- The 1st Event Hub needs to handle 12,000 messages/sec. Each message is 0.2 KB.
- 1% of messages have an exceptional value (e.g. -300 degrees Celsius) caused by a temporary hardware error. The 1st Function filters them out.
- The 2nd Function labels each message based on its payload.
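As a rough sanity check on the numbers above, it helps to estimate how many throughput units (TU) the 1st Event Hub needs and how many messages reach the 2nd Event Hub after filtering. The sketch below assumes the Standard tier ingress limits of 1 MB/sec or 1,000 events/sec per TU, whichever is reached first; please verify these against the current Azure quotas before relying on the result. The function names are hypothetical.

```python
import math

# Assumed Standard-tier ingress limits per throughput unit (verify in Azure docs).
TU_EVENTS_PER_SEC = 1000
TU_KB_PER_SEC = 1024


def required_throughput_units(msgs_per_sec: int, msg_size_kb: float) -> int:
    """Return the TU count needed to satisfy both the event and size limits."""
    by_count = math.ceil(msgs_per_sec / TU_EVENTS_PER_SEC)
    by_size = math.ceil(msgs_per_sec * msg_size_kb / TU_KB_PER_SEC)
    return max(by_count, by_size)


def forwarded_rate(msgs_per_sec: int, anomaly_percent: int) -> int:
    """Messages/sec left after the 1st Function drops anomalous messages."""
    return msgs_per_sec * (100 - anomaly_percent) // 100


print(required_throughput_units(12_000, 0.2))  # 12
print(forwarded_rate(12_000, 1))               # 11880
```

With small 0.2 KB messages, the event count rather than the byte volume is the limiting factor: the payload is only about 2.3 MB/sec, but 12,000 events/sec already calls for 12 TU under these assumptions.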
[Step 3: Identify success criteria]
- The system can handle the expected messages/sec
- The system should not throw errors
- Average end-to-end latency should be under 5 sec
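The three criteria above can be turned into an automated check, so that each test run produces a clear pass or fail instead of a manual judgement. This is a minimal sketch: the `LoadTestResult` fields are hypothetical and would have to be filled from whatever metrics and logs your test actually collects.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Dict, List


@dataclass
class LoadTestResult:
    observed_msgs_per_sec: float    # measured throughput during the test
    error_count: int                # errors raised by the system under test
    e2e_latencies_sec: List[float]  # sampled end-to-end times per message


def evaluate(
    result: LoadTestResult,
    expected_msgs_per_sec: float = 12_000,
    max_avg_latency_sec: float = 5.0,
) -> Dict[str, bool]:
    """Check one test run against the three success criteria."""
    latencies = result.e2e_latencies_sec
    return {
        "handles_expected_load": result.observed_msgs_per_sec >= expected_msgs_per_sec,
        "no_errors": result.error_count == 0,
        # An empty sample cannot prove the criterion, so it counts as a failure.
        "avg_latency_ok": bool(latencies) and mean(latencies) < max_avg_latency_sec,
    }


run = LoadTestResult(12_050.0, 0, [1.2, 2.8, 3.5])
print(all(evaluate(run).values()))  # True
```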
[Step 4: Select the right tool]
Shervyna’s blog post covers the 3rd success criterion, so let me focus on criteria 1 and 2 in this post. For these, I recommend Azure Metrics for infrastructure observation and Application Insights for application observation.
With the above architecture, you may find the following bottlenecks:
- Event Hub can’t handle the expected messages with the assigned throughput units (TU)
- Azure Functions code doesn’t finish within the expected duration, or throws errors
Let’s observe these with the right tools, as a team. Your team members, the IT department, and stakeholders all want to see the test results. To avoid creating PowerPoint and Excel files again and again, I recommend pinning the metrics to an Azure dashboard and sharing it with the team!
This is a sample dashboard I made.
- Orange area: Pinned Application map from Application Insights. You can easily see latency and how Azure Functions is scaling. You can also see the number of errors.
- Green area: Pinned Event Hub incoming/outgoing messages. You can see whether the expected number of messages flows through each Event Hub.
- Blue area: Pinned Event Hub Throttled requests. If it is over 1, your team needs to consider scaling up the Event Hub resource.
- Yellow area: Pinned Application Insights Failures. You can filter at the Function level, so this shows errors from the 1st Function, which sits between the 1st Event Hub and the 2nd Event Hub.
This kind of dashboard really helped me share and report information and decide next actions as a team. I hope these learnings help you.