I am doing some exciting work on performance testing and
engineering for a product that processes claims and applies business logic according
to the rules defined in it. I cannot provide more details about the product given its sensitive nature.
This product is a very complex system with a multi-tier
architecture and several COTS products integrated into it. One of our
prospective customers asked for the claims layer (one of the
subsystems) to process 200,000 claims in 1 hour against a claim history of 780
million claims, which is 10 times the volume we had envisaged for the
product.
The previous benchmark for this subsystem was 9,000 claims in 1
hour with an empty database history.
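To put the two figures side by side, here is a rough back-of-the-envelope sketch in Python. It uses only the numbers quoted above; the variable and function names are just for illustration.

```python
# Figures quoted in the text above.
TARGET_CLAIMS_PER_HOUR = 200_000
BASELINE_CLAIMS_PER_HOUR = 9_000
SECONDS_PER_HOUR = 3_600


def per_second(claims_per_hour: int) -> float:
    """Convert an hourly claim count into a per-second rate."""
    return claims_per_hour / SECONDS_PER_HOUR


target_rate = per_second(TARGET_CLAIMS_PER_HOUR)
baseline_rate = per_second(BASELINE_CLAIMS_PER_HOUR)
scale_factor = TARGET_CLAIMS_PER_HOUR / BASELINE_CLAIMS_PER_HOUR

print(f"target:   {target_rate:.1f} claims/s")
print(f"baseline: {baseline_rate:.1f} claims/s")
print(f"required speed-up: {scale_factor:.1f}x")
```

In other words, the request was for roughly 22 times the previous benchmark throughput, and that is before accounting for the claim history.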
First Things First
We first needed to determine whether this volume was feasible. These
types of volumes have traditionally been executed on mainframes; could they be
transitioned to our hardware and software? We started by investigating our
current test results to see if there were any indicators that we had hit the
limits of our application. What we discovered was promising: we didn’t see
servers being overexerted or the application bottlenecking. We also found that
our current results indicated that, with some scaling, we would be able to hit
volumes that would allow us to process 200,000 transactions.
With this information in hand, we went back to our
leadership team and started discussing what the scope of the effort would be.
As we discussed, we discovered there was a time constraint of only
about 6 weeks, which led us to narrow our focus. We decided that to best meet
the goals we would work in a “quiet environment”, focusing only on claims processing.
We also needed to add some additional parameters to our
setup. We needed 24,000,000 users, 1,000,000 supporting data records, and 780,000,000 claims, all of which had
to be loaded into a new environment before we could begin testing. This added
its own set of challenges and forced us to “think outside the box” and create
unique ways to load data quickly. We began planning how we could
load data incrementally and still be able to test. We used a combination of SQL
scripts that inserted data directly into the DB and existing processes within the application
to load data. We spread the execution across the multiple server instances we
had and varied the timing of the loads so that they did not conflict with test execution.
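As a minimal sketch of what spreading an incremental load across server instances can look like, the Python below splits a row count into fixed-size batches and assigns them round-robin. The batch size and the server names are hypothetical placeholders, not the values we actually used, and the real loads ran through SQL scripts and application processes rather than through code like this.

```python
from itertools import cycle
from typing import Iterator, NamedTuple


class LoadBatch(NamedTuple):
    server: str     # instance that runs this batch (hypothetical names)
    start_row: int  # inclusive
    end_row: int    # exclusive


def plan_batches(total_rows: int, batch_size: int,
                 servers: list[str]) -> Iterator[LoadBatch]:
    """Split total_rows into batches and assign them round-robin."""
    if batch_size <= 0 or not servers:
        raise ValueError("batch_size must be positive and servers non-empty")
    server_cycle = cycle(servers)
    for start in range(0, total_rows, batch_size):
        end = min(start + batch_size, total_rows)
        yield LoadBatch(next(server_cycle), start, end)


# 780,000,000 claims is from the text; 5,000,000 rows per batch and the
# four server names are assumptions made for this illustration only.
batches = list(plan_batches(780_000_000, 5_000_000,
                            ["load-01", "load-02", "load-03", "load-04"]))
print(len(batches), batches[0], batches[-1])
```

Keeping the batches small and independent is what makes it possible to pause loading while a test cycle runs and pick it up again afterwards.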
With the requirements in place, as well as a plan to load
data, we set out to answer the main question: “Can we scale out our claims engine
to process 200,000 claims in an hour?” Our plan to reach this goal was to add
additional processing servers; based on our current metrics, we would need
about 20. This seems like a straightforward and simple approach. However, we
didn’t want to just add 20 servers, since that wouldn’t help us prove that the servers are
able to scale over time, so we decided to take a more incremental approach. Our
approach was to start with four additional servers and then add more as they
became saturated.
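The incremental plan can be sketched roughly as follows. The per-server throughput of 10,000 claims per hour is a made-up placeholder chosen only so the arithmetic lands near the "about 20" servers mentioned above; it is not a measured figure, and the sketch assumes throughput scales linearly with server count, which is exactly what the incremental tests were meant to check.

```python
import math


def servers_needed(target_per_hour: int, per_server_per_hour: int) -> int:
    """Servers required if throughput scaled perfectly linearly."""
    return math.ceil(target_per_hour / per_server_per_hour)


def scale_out_steps(start: int, step: int, target_per_hour: int,
                    per_server_per_hour: int) -> list[int]:
    """Server counts to test: begin at `start`, add `step` until the target fits."""
    servers = start
    steps = [servers]
    while servers * per_server_per_hour < target_per_hour:
        servers += step
        steps.append(servers)
    return steps


# 200,000 claims/hour and the starting point of four servers come from the
# text; 10,000 claims/hour per server and the step of four are assumptions.
print(servers_needed(200_000, 10_000))
print(scale_out_steps(4, 4, 200_000, 10_000))
```

Each entry in the list is a point where we could stop, measure, and confirm the servers were actually saturated before adding more.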
We had our plan and our goals; now all we needed to do was
test!
T’s & C’s of Testing
In order to create valid tests, we needed 200,000 users that
we could use to process claims, and we needed them to be unique to prevent
our duplicate rules from flagging a claim and preventing it from being
processed. We also needed to be able to test the various levels of claims history
a user might have, which could impact the processing speed. For example, does
having no other claims in the system make processing faster than having
eight claims in history?
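A small Python sketch of the idea: build 200,000 users with guaranteed-unique identifiers and cycle them through a few history depths. The ID format and the intermediate history levels (1 and 4) are hypothetical; only the 0 and 8 come from the example above.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class PerfUser:
    member_id: str       # unique per user so duplicate rules never fire
    history_claims: int  # number of prior claims to seed for this user


def build_test_users(count: int, history_levels: tuple[int, ...]) -> list[PerfUser]:
    """Create `count` unique users, cycling through the given history depths."""
    return [
        PerfUser(f"PERF-{i:06d}", history_levels[i % len(history_levels)])
        for i in range(count)
    ]


users = build_test_users(200_000, (0, 1, 4, 8))
by_history = Counter(u.history_claims for u in users)
print(by_history)
```

Grouping the results by history depth afterwards is what lets you compare processing time for a user with no history against one with eight prior claims.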
We created new claims in order to ensure that our rulesets
were executed against the claims and that the testing was valid. We wanted to
avoid negative testing scenarios, due to the time constraint, and focus on the
positive scenarios. We also bypassed systems and added the claims
directly to our queues to isolate our testing from external factors.
We made this a repeatable process so that we could always test with clean data.
The data was composed of three types of claims: professional, institutional,
and pharmacy. We did this to properly replicate not only the volumes of claims
but also the mix that we would expect to see in a production setting.
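To illustrate how a claim mix translates into counts per type, here is a rough sketch. The percentages are invented for the example and are not our actual production mix; integer percentages are used so the totals always add up exactly.

```python
def allocate_mix(total: int, percent_by_type: dict[str, int]) -> dict[str, int]:
    """Split `total` claims across claim types by whole-number percentages."""
    if sum(percent_by_type.values()) != 100:
        raise ValueError("percentages must sum to 100")
    counts = {t: total * p // 100 for t, p in percent_by_type.items()}
    # Integer division can leave a few claims unassigned; give them to the
    # type with the largest share so the counts sum to `total`.
    leftover = total - sum(counts.values())
    largest = max(percent_by_type, key=percent_by_type.get)
    counts[largest] += leftover
    return counts


# Hypothetical mix, for illustration only.
MIX = {"professional": 60, "institutional": 25, "pharmacy": 15}
print(allocate_mix(200_000, MIX))
```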
Once we had our test data prepared, we moved on to
formalizing our hardware setup and monitoring. We set up Dynatrace to
monitor performance in the environments, and also used the Windows debugger (WinDbg)
and SQL Profiler to gather metrics. We used these metrics to determine where we
needed to address potential bottlenecks or where our setup needed to be
reconfigured. The reconfiguration could mean adding memory or CPUs, adding more
hard drive space, or even adding an entire server. We set up four processing
servers and one DB server and prepared to run our first cycle of testing.
Claim Processing Layer Architecture
Our services are hosted in NServiceBus across two layers: the Web Services Layer
(the interface between clients, i.e. other interfaces, and core NServiceBus) and the
Business Services Layer (the service business logic).
Coming up next time:
Test Execution and Learnings
@2015, copyright Vamsidhar Tokala