Thursday, September 25, 2014

The Cinnamon Twist Alert - Handling Complex Boundary Conditions

The Old Ways Are Not Always Best

A new bug report came in. After reading through the report, the problem was clear. Our system did not completely support single-day batches of over 999 transactions.

The problem boiled down to a single counter used to track transactions throughout the day. The field for the counter has a fixed width of three digits. The sequence counter field begins with 001 and increments by one for each transaction all the way up to 999. This value is used to help identify and correct communication errors. When a terminal attempts to send a request but encounters an error, we resubmit the transaction using the same sequence counter. If the other side receives two similar (duplicate) transaction records with matching sequence counters (more on this later), the earlier of the two submissions is reversed and is not funded.

The problem with the sequence counter becomes obvious when you think about what happens when the terminal needs to go beyond 999 transactions in a day. The sequence counter will spill over the three digits available, reusing a value from earlier in the same batch or the forbidden value 000. Our software was not so silly as to ignore this problem. The solution, as it was implemented, was to increment the batch number by one and reset the sequence counter for the newly-created batch to 001.

Unfortunately for us, one of our clients began frequently exceeding the magic 999 transactions per day and was experiencing problems with this approach. Processing multiple batches on the same day led to reconciliation and accounting issues, while causing delays in the deposits of funds to the client's bank account. Obviously, these were problems we wanted to deal with swiftly, once and for all.

Edit for clarity: Some have asked why I didn't simply increase the width of the sequence counter field to more than three digits. This width was defined in a third party specification and was not under my control. My software had to deal with this somehow.

A New Approach

After consulting the documentation and our technical contacts and running through a couple of false starts, we formulated a new approach to the three-digit sequence counter problem. Instead of creating a new batch to deal with the overflow, we would simply roll the sequence counter from 999 back to 001 and continue processing everything normally.

Our biggest concern with this new approach was related to the special duplicate transaction checking mentioned previously. The duplicate checking logic considers the following criteria when determining whether a subsequent transaction request matches an earlier one:

  • Sequence Counter
  • Card Account Number
  • Total Dollar Amount

If all three of these values match, the earlier of the two transaction requests is silently reversed, leaving the transaction totals out of balance and the client short of money. For some silly reason, people get very upset when their money goes missing unexpectedly.

Mr. Cinnamon Twist

To describe a plausible scenario where I thought this might actually happen, I decided to write a brief
story about a man I dubbed Mr. Cinnamon Twist.

Cinnamon Bun
Cinnamon Twist
Mr. Cinnamon Twist is a businessman with a sweet tooth. Knowing he has a long day of meetings ahead of him, he stops in at the busy corner coffee shop looking for a morning treat. He spies a delicious, gooey cinnamon twist (with double frosting). His growling, empty stomach simply cannot resist. Leaving the shop with cinnamon twist in hand, he takes two bites and wraps up the rest as he hurries to catch a train heading downtown. Through the early morning, Mr. Twist savors his treat as he goes about his work. He prepares his materials for the big afternoon presentation for a prospective new client. After a long and stressful day of work, Mr. Cinnamon Twist boards the train heading towards home. Worn out and feeling exhausted, his mind wanders back to his early morning treat. He decides that he will treat himself to another (just this once) before dragging his tired body home,

In this scenario, it's plausible that our sweet-toothed protagonist used the same credit card to pay the same amount for both a very low and very high sequence number. If all the stars aligned and these transactions happened to reuse the exact same sequence counter, this would mean that Mr. Cinnamon Twist magically received his first treat of the day without being charged for it. Great news for Mr. Twist, not so good for the coffee shop who would be out the cost of a scrumptious cinnamon-flavored treat.

Back Of The Envelope

The first problem with the above scenario is simply noticing it. Detecting this type of situation on the fly is hard enough. With the requirement to be fast, high volume, and redundant between multiple data centers, this becomes complicated very quickly. The second problem is how to correct the situation once an error has been identified. I could think of a few tricks that I might consider, but I saw no obvious trivial approach for this problem.

While trying to avoid tackling this complex condition, I paused to look at some data. How likely is the above scenario? I looked at some rough numbers to try to get an idea. I looked at the number of clients exceeding the magic 999 transaction limit. I looked at the number of transactions using the same card at the same merchant on the same day. Using the classic back-of-the-envelope approach, I calculated that we would likely only see this situation a handful of times in a year.

It seemed to me that we were looking at a lot of complicated and error-prone work to save the cost of a tray of delicious cinnamon treats each year.

The Compromise

As it turns out, there is a manual process available to correct these types of transactions. By picking up the phone and talking to a real live human being, we are able to manually single out a transaction request and force it through.

Knowing that this manual correction process was available and fearing the work required to fully automate every possibility, I proposed a compromise. We would create a scheduled script to run daily and search the database for requests matching the duplicate transaction scenario above. If any duplicates were found, we would fire off an email alert message (subject line: "Cinnamon Twist") identifying the key transaction details and describing the process for manual corrections. The worst case, I thought, was that the alert would fire too often and I would have to implement the complex solution later anyway. The best case, on the other hand, was that the alert would basically never fire, saving a great deal of time and effort.

Sounding The Alarm

The first week after installing my script, there were still no email alerts. I was beginning to feel optimistic that we may never see the alert fire in practice. They say that trouble shows up when you least expect it. The day after sharing my optimism with my coworkers, we received our first Cinnamon Twist alert.

No problem, I thought. We followed the manual procedure only to discover that both transaction requests were good and no corrective action was required. This contradicted the documentation and our general understanding of how the system should work, but who am I to look a gift horse in the mouth?

Another week or two went by before the next alert fired. It seems that my back-of-the envelope calculations were a bit off. We were receiving more alerts than I had expected. This alert, too, turned out to be a false positive when we followed up manually.

We asked our technical contacts for clarification. After our messages got passed around a few times, our contacts eventually got back to us saying that this behavior was by design. It seems that we had worried ourselves over a problem that didn't actually exist.

We disabled the Cinnamon Twist Alert script a short time later. My "lazy" approach had saved me from implementing a lot of complicated logic for no reason.

An Ounce Of Cure

What is my point? What can we learn from these events? Perhaps it's time to spin the wheel of morality to tell us the lesson we should learn.

Maybe I was lucky. My calculations turned out to be somewhat (but not excessively) optimistic. There was a risk that we would need to handle these manual corrections frequently, leaving me scrambling to implement a complex change to relieve pressure from the rest of my team as quickly as possible. My approach was a calculated gamble, but it paid dividends even larger than I had anticipated.

To me, this is a turnabout on the old adage saying, "an ounce of prevention is worth a pound of cure." In this case, an ounce of cure (the alert and manual correction) was quicker, safer, and simpler to implement than a pound of prevention (a fully automated solution). In rare cases, the easiest way to deal with complex boundary conditions is not to. Instead, find a way to look for the errors and tidy up after they happen. Don't forget to calculate the risk and the cost, but you may just discover that you were about to make much ado about nothing.


Joshua Ganes

No comments:

Post a Comment