Performancing Metrics

Vamsi Tokala's blog

Wednesday, October 28, 2015

Interesting work

I am doing some exciting work on performance testing and engineering for a product that processes claims and applies business logic according to the rules defined in it. I cannot provide more details about the product given its sensitive nature.

This product is a very complex system with a multi-tier architecture and several COTS products integrated into it. One of our prospective customers asked us to get the claims layer (one of the subsystems) to process 200,000 claims in 1 hour with a claim history of 780 million claims, which is 10 times the volume we had envisaged for the product.
The previous benchmark for this subsystem was 9,000 claims in 1 hour with an empty database history.

First Things First

We first needed to determine whether this volume was feasible. Volumes like these have traditionally been run on mainframes; could they be moved to our hardware and software? We started by reviewing our current test results for any indication that the application had hit its limits. What we discovered was promising: we didn’t see servers being overexerted or the application bottlenecking. Our current results also indicated that, with some scaling, we would be able to reach volumes that would let us process 200,000 transactions.
With this information in hand, we went back to our leadership team and started discussing the scope of the effort. As we talked, we discovered there was a time constraint of only about 6 weeks, which led us to narrow our focus. We decided that to best meet the goals we would work in a “quiet environment”, focusing only on claims processing.
We also needed to add some parameters to our setup. We needed 24,000,000 users, 1,000,000 supporting data records, and 780,000,000 claims, all of which had to be loaded into a new environment before we could begin testing. This added its own set of challenges and forced us to “think outside the box” and find ways to load data quickly. We began planning how we could load data incrementally and still be able to test. We used a combination of SQL scripts that inserted data directly into the DB and existing processes within the application that load data. We spread the load across the multiple server instances we had and scheduled it so that it would not conflict with test execution.
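A minimal sketch of how such an incremental load could be planned, assuming the claims are split into fixed-size batches and handed out round-robin to the available server instances. The function name `plan_batches`, the batch size, and the number of loaders are all hypothetical and chosen only for illustration; the post does not describe the actual scripts.

```python
def plan_batches(total_rows, batch_size, loaders):
    """Yield (loader_index, first_row, last_row) for each batch.

    Batches are assigned round-robin, so every loader receives a
    similar share and the load can be paused between batches.
    """
    start = 0
    batch_no = 0
    while start < total_rows:
        end = min(start + batch_size, total_rows)
        yield batch_no % loaders, start, end - 1
        start = end
        batch_no += 1


if __name__ == "__main__":
    # 780,000,000 claims comes from the post; the batch size and the
    # loader count below are assumptions.
    plan = list(plan_batches(780_000_000, 1_000_000, 4))
    print(len(plan), "batches; first:", plan[0], "last:", plan[-1])
```

Because each batch is an independent row range, a batch can be skipped or rerun while a test cycle is in progress without affecting the others.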
With the requirements in place, as well as a plan to load data, we could try to answer the main question: “Can we scale out our claims engine to process 200,000 claims in an hour?” Our plan to reach this goal was to add processing servers; based on our current metrics, we would need about 20. This seems like a straightforward approach; however, we didn’t want to just add 20 servers, since that wouldn’t help us prove that the servers are able to scale over time. So we decided to take a more incremental approach: start with four additional servers and then add more as they became saturated.
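The sizing arithmetic can be sketched roughly as follows. This assumes throughput scales linearly with server count and, purely for illustration, treats the earlier 9,000 claims per hour benchmark as the rate of a single processing server. The post does not state the per-server figure behind the estimate of about 20, so the result here is only a ballpark.

```python
import math

TARGET_CLAIMS_PER_HOUR = 200_000   # customer target, from the post
BASELINE_CLAIMS_PER_HOUR = 9_000   # earlier benchmark, from the post


def claims_per_second(claims_per_hour):
    """Convert an hourly volume into a per-second rate."""
    return claims_per_hour / 3600


def servers_needed(target_per_hour, per_server_per_hour):
    """Servers required if throughput scales linearly (an assumption)."""
    return math.ceil(target_per_hour / per_server_per_hour)


if __name__ == "__main__":
    print("target rate (claims/s):",
          round(claims_per_second(TARGET_CLAIMS_PER_HOUR), 1))
    print("servers under linear scaling:",
          servers_needed(TARGET_CLAIMS_PER_HOUR, BASELINE_CLAIMS_PER_HOUR))
```

Linear scaling rarely holds exactly, which is one reason to add servers in steps and measure after each step rather than trust a single estimate.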
We had our plan and our goals; now all we needed to do was test!

T’s & C’s of Testing

To create valid tests, we needed 200,000 users that we could use to process claims, and they had to be unique so that our duplicate rules would not flag a claim and prevent it from being processed. We also needed to be able to test the various levels of claims history a user might have, which could affect processing speed. For example, does having no other claims in the system make processing faster than having eight claims in history?
We created new claims to ensure that our rulesets were executed against the claims and that the testing was valid. Because of the time constraint, we wanted to avoid negative testing scenarios and focus on the positive ones. We also bypassed systems and added the claims directly to our queues to isolate our testing from external factors. We made this a repeatable process so that we could always test with clean data. The data was composed of three types of claims: professional, institutional, and pharmacy. We did this to replicate not only the volume of claims but also the mix that we would expect to see in a Production setting.
Once we had our test data prepared, we moved on to formalizing our hardware setup and monitoring. We set up Dynatrace to monitor performance in the environments, and used Windows Profiler (WinDbg) and SQL Profiler to gather metrics. We used these metrics to determine where we needed to address potential bottlenecks or where our setup needed to be reconfigured. The reconfiguring could mean adding memory or CPUs, adding more hard drive space, or even adding an entire server. We set up four processing servers and one DB server and prepared to run our first cycle of testing.
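A minimal sketch of how a repeatable batch of test claims with a fixed mix could be generated, assuming one claim per unique user so that duplicate rules are not triggered. The field names, the `build_claims` helper, and the mix weights are hypothetical; the post does not give the actual proportions.

```python
import random

CLAIM_TYPES = ("professional", "institutional", "pharmacy")


def build_claims(user_ids, weights, seed=0):
    """Return one claim per user, with the type drawn from the given mix.

    A fixed seed makes the batch repeatable, so every test cycle can
    start from the same clean data.
    """
    rng = random.Random(seed)
    return [
        {
            "user_id": uid,
            "claim_type": rng.choices(CLAIM_TYPES, weights=weights, k=1)[0],
        }
        for uid in user_ids
    ]


if __name__ == "__main__":
    # 200,000 unique users comes from the post; the weights are made up.
    claims = build_claims(range(200_000), weights=(5, 3, 2))
    counts = {t: 0 for t in CLAIM_TYPES}
    for claim in claims:
        counts[claim["claim_type"]] += 1
    print(counts)
```

Calling `build_claims` again with the same seed returns an identical batch, which matches the goal of always testing with clean, comparable data.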

Claim Processing Layer Architecture

Our services are hosted in NServiceBus across two layers: the Web Services Layer (the interface between the client or other interfaces and the core NServiceBus) and the Business Services Layer (the service business logic).

Coming up next time: Test Execution and Learnings

@2015, copyright Vamsidhar Tokala