On the Life Cycle of a Leap Year Bug

Back in 2016, I wrote about leap year bugs. Since another leap year is almost here, I figure it's time to revisit the subject.

I've gathered quite a bit of knowledge in this area since then. Hunting down leap year bugs has actually been part of my full-time job for the last few months, believe it or not. When you work at a company like Microsoft, the sheer quantity of code written over the course of four years makes this a daunting task, but also a necessity.

Before we dive into leap years, let's consider the life cycle of most bugs in software. It probably goes something like this:

  1. A software developer writes code for some product or service.
  2. That code might go through review, testing, analysis, build, etc. Often, critical bugs are found during these phases and fixed before release. Sometimes not.
  3. Eventually the code is released into production, or shipped to a customer.
  4. The product or service is used. Critical bugs not found earlier are often exposed.
  5. The user complains, files a support ticket, opens an issue, etc.
  6. Sometimes the code is rolled back to a prior state, sometimes not.
  7. The developer fixes the bug, and an updated release is deployed.

Sure, that's glossing over many areas, but it's roughly what happens. The interesting part is that the entire life cycle is usually on the order of days, weeks, or maybe months at most. Also, normal bugs tend to happen one at a time, or a few with each release.

Leap year bugs are not like this. They have a very different life cycle.

Life cycle of a leap year bug:

  1. A software developer writes code for some product or service.
  2. Again, the code is tested, analyzed, etc. But unless it's done on Feb 29th, there's a non-zero chance that a leap year bug goes unnoticed.
  3. Eventually the code is released into production, or shipped to a customer.
  4. The product or service is used. And everything works fine.
  5. A long period of time passes. Maybe years.
  6. Multiple other products and services are written and deployed over this time. Maybe taking a dependency on the first one, or importing its code. These services all work fine too.
  7. Sometimes the original developer decides to move on to a new project, or a new company. Certainly nobody thinks there's anything wrong, because the products and services have been working quite well for years.
  8. Feb 29th comes around. Stuff breaks all at once. Multiple service failures. Pagers go off. People panic. Nobody can find anyone who knows why. Eventually, someone figures it out and patches the code, gets things back up and running, but the damage has been done.
    OR
    Maybe nothing goes down at all. But somehow the numbers for this month don't look quite right. They're all off by a day and nobody knows why.
    OR
    Maybe nothing happens at all. The part of the product or service with the leap year bug wasn't exercised this time. So it sits for another four years, building even more confidence that everything is fine...
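
To make the first two outcomes in step 8 a bit more concrete, here's a minimal sketch in Python. The function names are made up for illustration and aren't taken from any real product. The first pair shows code that runs fine every day except Feb 29th, when it throws. The second pair shows code that never throws but quietly miscounts in a leap year. Clamping to Feb 28th in the "safe" version is just one possible policy, and an assumption on my part; the right answer depends on the business rule.

```python
from datetime import date


def next_anniversary_naive(d: date) -> date:
    # Hypothetical buggy helper: bump the year, keep month and day.
    # Fine on ordinary dates, but raises ValueError when d is Feb 29,
    # because the following year has no such day.
    return d.replace(year=d.year + 1)


def next_anniversary_safe(d: date) -> date:
    # One possible fix (an assumed policy): fall back to Feb 28
    # when the target year has no Feb 29.
    try:
        return d.replace(year=d.year + 1)
    except ValueError:
        return d.replace(year=d.year + 1, day=28)


def days_in_year_naive(year: int) -> int:
    # Hypothetical buggy helper: never crashes, but is off by one
    # day in every leap year.
    return 365


def days_in_year(year: int) -> int:
    # Let the date library do the calendar math instead of
    # hard-coding a constant.
    return (date(year + 1, 1, 1) - date(year, 1, 1)).days
```

Notice that a test suite exercising `next_anniversary_naive` with today's date passes on every day but one, which is exactly why pinning a test input to Feb 29th is worth doing on purpose.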

It's interesting to me that, despite so many documented cases of leap year bugs, this is not better understood by our industry. In an effort to fix that, I've started tracking types of leap year bugs in a Stack Overflow posting. If you don't know what leap year bugs look like, or if you have any to contribute, please take a look there. I've seeded it with two common cases, and I'll add more over time.

Let me know in the comments here what you think! I can always blog more too. :)