Saturday, July 02, 2005

Pragmatic Solutions For An Imperfect World

Some time ago, with only one week to go until the scheduled delivery deadline of our .NET client/server product, our customer reported increasing application instability. My developers had not observed that issue so far, but the application seemed to crash sporadically (but unrepeatably) on our customer's test machines. As we were coding in managed C# only, my first guess was a .NET runtime failure, or a native third party library bug. Luckily, we had installed some post-mortem tracer, which turned out the following exception stacktrace:

Unhandled Exception: System.NullReferenceException: Object reference not set to an instance of an object.

at System.Windows.Forms.UnsafeNativeMethods.PeekMessage(MSG& msg, HandleRef hwnd, Int32 msgMin, Int32 msgMax, Int32 remove)

at System.Windows.Forms.ComponentManager.
(Int32 dwComponentID, Int32 reason, Int32 pvLoopData)

at System.Windows.Forms.ThreadContext.RunMessageLoopInner(Int32 reason, ApplicationContext context)

at System.Windows.Forms.ThreadContext.RunMessageLoop(Int32 reason, ApplicationContext context)

at System.Windows.Forms.Application.Run(Form mainForm)

Great. Win32 API's PeekMessage() failing, and the failure being mapped to a .NET NullReferenceException. I was starting to get nervous.

I told our customer that our code was not involved (it's always good to be able to blame someone else). But as expected this answer was less-than-satisfying. The end-user couldn't care less about who was guilty, and so did our customer (and they are right to do so). They wanted a solution, and they wanted it fast.

Now I have some faith in Microsoft's implementation of PeekMessage(). It seems to work quite well in general (let's say in all Windows applications that run a message queue) - so something must have been messed up before, with PeekMessage() failing as a result of that. Something running in native code. Something like our report engine (no, it's not Crystal Reports).

We had not invoked our reports too frequently during our normal test runs, as the report layouts and SQL-statements are being done by our customer. So after some report stress testing, those crashes also occurred on our machines. Rarely, but they did. And they occurred asynchronously, within seconds after the last report invocation. Here was the next hint. This was a timing problem, most likely provoked during garbage collection.

So how to prove or disprove this theory? I simply threw all reporting instances into one large ArrayList, so that those would never be picked up by the garbage collector (SIDENOTE: NEVER DO THIS IN A REAL-LIFE PROJECT), and voila: no more crashes, even after hours of stress testing. Obviously keeping all reporting instances in memory introduces a veritable memory leak (still better than a crashing application someone might argue, but this is something I never ever want to see in any implementation I am responsible for). But I had a point of attack: the reporting instances (or one of the objects being referenced by those instances) failed when their Finalizers were invoked.

First of all I noticed that the reporting class (a thin managed .NET wrapper around the reporting engine's native implementation) implemented IDisposable - so I checked all calling code for correct usage (means invocation of Dispose(), most comfortably by applying C#'s "using" construct). When implemented properly, this should prevent a second call to Dispose() during finalization, which might be the root of evil. But our code seemed to be OK.

Next I hard-coded GC.SuppressFinalize() for all reporting instances that had been disposed already, in order to prevent the call to its Destructor (Finalizer in .NET terms) as well, but still no cure - obviously it was not the reporting instance itself that crashed during finalization, but another object referenced by the reporting instance. I ran Lutz Roeder's Reflector, and had a look at all the reporting .NET classes, resp. their Dispose()- and Finalizer-Methods: they only wrapped away native code.

If I could only postpone those Finalizer-calls until as late as necessary (e.g. until free'ing their memory was an absolute must - native resources would not be the problem, as they would be cleaned up long before during the synchronous call to Dispose()). The moment the application would run out of memory even after conventional garbage collection (which might never happen), it would collect the reporting instances. I needed SoftReferences. The garbage collector delays the collection of objects referenced by SoftReferences as long as possible. Unfortunately, .NET does not provide the concept of SoftReferences (Java does, though). .NET has WeakReferences, which will be picked up much earlier than SoftReferences. So I simply started my own background thread, which would remove reporting instances from a static Collection after some minutes of reporting inactivity, hence make them available for garbage collection.

Sometimes luck just strikes me, and that's what happened here: this approach extinguished all sudden crashes. The reporting instances got picked up by the garbage collector (I debugged that), but just a little bit later. Late enough for their finalization to run smoothly. So up to now I don't know the exact cause (as mentioned before, it must have been a timing problem - if reporting instances survive some seconds, they will not crash during finalization). We are still investigating it, and we will find a more suitable fix. But much more important: we shipped on time, and we made our customer happy. All end-user installations are working fine so far.

Do I like the current solution? No, it's a hack. But still it's 10.000 times better than shipping an unstable product. From a certain level of complexity on, all software products contain hacks. Just have a look at the Windows 2000 source (esp. the code comments) that were spread on P2P-networks some time ago. In an imperfect world, you sometimes have to settle with imperfect (but working) solutions for your problems.