I would like to discuss some ideas about testing and maintaining systems that I feel strongly about. This might seem like an odd subject to feel strongly about – after all, documentation and testing are the dull bits of any project. Clearly I am biased by the nature of my job, since I am often called upon to handle a crisis where an enterprise-level application has gone seriously wrong. The situation is always an emergency. It almost invariably involves serious financial loss. In rare cases, it has put people in danger. The first priority is to get the system running again, or, as is more commonly the case, limping along.
There are two things that generally cause these emergencies and they are very often found in the same project. The first is poor design. The second is poor testing. Generally poor design is found during testing. If your testing is bad enough then it may be discovered the day that your system goes live. If you find yourself holding the baby when an enterprise solution goes down, your best hope is that you have a good design that has been implemented badly. That may result in some late nights chasing bugs but it is fixable. A botched design may require an almost complete rethink, salvaging what code can be saved from the old system.
Let us consider how much testing an application needs. The answer is that it very much depends on the application. At one extreme, you have a simple application which will be used rarely, by few people, and where the results of failure are trivial. A typical example of this is a tool used by a single developer, or a group of his peers, to perform a limited task where it is possible to get the same results in a different way. Let us imagine that you have a little in-house command line tool that tells you what the copyright strings embedded in a DLL are. This tool clearly needs minimal testing, and fault reporting is built in since you can pick up the phone and talk to the guy who developed it. At the other end of the spectrum, we have military applications where an error can cost thousands or millions of lives. You really, really don’t want a false positive or negative in an application designed to detect a first launch of nuclear weapons. That sort of software cannot be over-tested.
For most of us, the applications that we develop fall somewhere in between these extremes. Most of us write applications that are used in the commercial world. The impact of these systems failing is normally a financial loss. For a line of business application, the loss can be significant.
How much testing needs to be done depends on both the complexity of the application and the seriousness of any resulting failure. However, projects generally run over time and budget. When the code is written, it is tempting to cut testing to the bone to get at least close to schedule – and after all, the code is good. It was written by smart people. How likely is it that there are hidden bugs? I sometimes think that smart people are the most dangerous kind, since they can find new ways to mess up that we stupid folk can’t imagine.
It is accepted wisdom that applications should be rigorous in their error checking and should never assume that the parameters offered to them from an external source are valid. This is simple good sense. However, test data is almost always carefully selected to only contain valid data that will pass the tests. The real world is seldom so cut and dried as that. Real data will contain errors. So should test data. If nothing else, test data should contain errors for reasons of code coverage. If you test with only good data then you are ignoring many, many code paths.
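To make the point concrete, here is a minimal sketch in Python. The `parse_quantity` function is a hypothetical stand-in for any input-validating routine: the good value exercises the happy path, while the bad values exercise the error-handling paths that only deliberately dirty test data will ever reach.

```python
def parse_quantity(text):
    """Parse an order quantity; reject anything that is not a positive integer."""
    try:
        value = int(text)
    except (TypeError, ValueError):
        raise ValueError(f"not a number: {text!r}")
    if value <= 0:
        raise ValueError(f"quantity must be positive: {value}")
    return value

# Good data exercises the happy path...
assert parse_quantity("3") == 3

# ...but bad data exercises the error-handling code paths that real users
# will hit. Test data containing only good values leaves these paths unrun.
for bad in ["", "abc", "-2", "0", None]:
    try:
        parse_quantity(bad)
        assert False, f"accepted invalid input {bad!r}"
    except ValueError:
        pass  # the rejection we wanted
```

If the bad-data loop ever falls through without raising, the test fails loudly – exactly the kind of coverage that good-data-only testing never provides.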
Testing broadly falls into two categories – functional and scalability/performance. Let us consider functional testing first.
Functional testing is testing to ensure that the expected results are obtained for sample inputs. That is to say, that the program does what it was intended to do. Virtually no-one tests that the program does not also do what it was not intended to do. If an application slowly leaks memory then it may be months before anyone notices. If the leak is slow and small then it might be a minor bug that can be ignored. Not all bugs are that benign. I recall one very nasty bug that took 6 months of my life to find and fix. The application did what it was supposed to do. It also overwrote some system memory (this was not a Windows application). After a particular sequence of operations was carried out 6 times, the program became unable to access the filing system on the machine. The reason was that a side effect of some faulty logic overwrote successive chunks of operating system structures until the OS filing system failed. So, whenever possible, test that an application doesn’t do anything that it shouldn’t as well as that it does what it should. There are a number of tools to help you do this under Windows and hopefully also under Linux. For Windows, the cheapest is the Application Verifier from the Application Compatibility Toolkit – a free download from http://msdn.microsoft.com/library/default.asp?url=/library/en-us/dnanchor/html/appcompat.asp . It won’t catch logic errors but it will catch a lot of bad system calls that would cause you grief later – including a lot of the ever-popular heap corruption errors. I don’t use Linux so I can’t recommend any tools for it, but I am sure that there must be some. Whatever your operating system, the time for such errors to show up is in the test lab rather than on the live server.
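The same idea can be sketched in miniature in Python, using the standard library's `tracemalloc` module to check that repeated calls do not quietly accumulate memory. The `process_order` function here is a hypothetical stand-in; this is not a substitute for a tool like the Application Verifier, just an illustration of the principle of testing what the code should *not* do.

```python
import tracemalloc

def process_order(orders):
    # Hypothetical operation under test. A buggy version might append to a
    # module-level list or cache and so leak a little memory on every call.
    return sum(orders)

tracemalloc.start()

# Warm up, then snapshot, so one-off allocations don't look like a leak.
for _ in range(1000):
    process_order([1, 2, 3])
before, _peak = tracemalloc.get_traced_memory()

for _ in range(100_000):
    process_order([1, 2, 3])
after, _peak = tracemalloc.get_traced_memory()
tracemalloc.stop()

# If memory grows roughly in step with the call count, something is leaking.
growth = after - before
assert growth < 1_000_000, f"possible leak: grew {growth} bytes over 100k calls"
```

A slow leak that would take months to notice on a live server fails this check in seconds in the test lab.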
I think that now would be a good time to talk about system testing and unit testing. A unit is a component of a system. Let us imagine for a second that your system contains a component which processes a customer order. It works with the billing module. It works with the presentation layer to build a confirmation record for the customer. Neither of those other pieces is written yet. The temptation is to wait until they are done and then test them all together. After all, they have to work together, so that makes sense, no? Well, actually, no. Let us imagine that we have these three systems all written 4 months down the line. You kind of remember how the order processing component worked. You hook them up after fixing a problem where Bob was working on a different version of the interface specification, but never mind, you are ready to test. Great – we are system testing. You test and something doesn’t work. You have no idea which component is at fault! You have 3 completely unproven components and you are using them to test each other. You could step through in a debugger and follow what is happening, but that will take a long time. Let us assume that you pull a couple of 90-hour weeks and get it working pretty well with the test data. Your boss seems happy again and all is well. At that point, how much of the code have you run in that subsystem? 40%? If you were careful with your test data, maybe you have run 70%. Do you feel happy releasing a product where at least 30% of the code has never been run? Would you like to bet your business on it? If you would, then may I recommend spread betting? For the rest of us, I recommend unit testing. You build a harness and test the component in isolation. This is often rejected as being too time consuming and I can see why. It does take time to build the harness and the tests, but it saves time later. If the software is critical, you may need to simulate errors to find out just what happens if that memory allocation fails.
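As a sketch of such a harness in Python, using the standard library's `unittest.mock` to stand in for a billing module that may not even be written yet. `OrderProcessor` and its methods are hypothetical examples invented for illustration, not actual components from any real system.

```python
from unittest.mock import Mock

class OrderProcessor:
    """Hypothetical order-processing component that depends on a billing module."""

    def __init__(self, billing):
        self.billing = billing

    def place_order(self, customer, amount):
        if amount <= 0:
            return "rejected"
        try:
            self.billing.charge(customer, amount)
        except RuntimeError:
            return "billing-failed"
        return "confirmed"

# Test the component in isolation: the real billing module need not exist.
billing = Mock()
assert OrderProcessor(billing).place_order("alice", 10) == "confirmed"
billing.charge.assert_called_once_with("alice", 10)

# Simulate the failure path that system testing would almost never reach.
failing = Mock()
failing.charge.side_effect = RuntimeError("payment gateway down")
assert OrderProcessor(failing).place_order("bob", 10) == "billing-failed"
```

When the real billing module does arrive, any disagreement at the interface shows up as one failing unit test against one proven component, not a three-way mystery.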
Think of it like shopping. You are going to have to pay for what you get. If you pay cash, it is immediately visible and you feel the pain. Alternatively, you can put it on your credit card and your bank balance looks fine. However, you always pay in the end and the longer you delay paying, the more it costs you. System testing components that are well unit tested is always cheaper and easier. It is a classic case of a stitch in time saving nine.
Now, let us talk about system testing. One thing to recognise is that systems rarely have one release and then no changes ever. It is an old joke, but there is some truth in it: the user does know what he wants, and he will tell you the moment you deliver what he asked for.
Ad hoc testing is good. Automated testing is even better. The great thing about automated testing is that it can be reproduced consistently. If you fix a bug that involved no UI changes, then the test script that ran last time should run just as well now. That doesn’t mean that the application is fully tested, but it does mean that you haven’t broken anything fundamental. Some parts of the industry call this a smoke test, a term which originally comes from hardware testing: you turn it on and see if it starts to smoke. Automated tests can take longer to create than ad hoc tests but they are a gift that keeps on giving. There are several test tools out there and it would be unfair of me to recommend one, but perhaps a little *rational* thought would single out one that *rose* in your mind when you needed to *test*.
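An automated smoke test need not be sophisticated to be useful. The sketch below, in Python, runs a fixed battery of checks the same way every time; the individual checks are placeholders (assumptions) for whatever "turn it on and see if it smokes" means for your system.

```python
# Each check is a placeholder for a real end-to-end probe of the system.
def check_startup():
    return True   # e.g. the service process launches and answers a ping

def check_login():
    return True   # e.g. a known test account can authenticate

def check_order():
    return True   # e.g. a trivial order round-trips through the system

def run_smoke_tests():
    """Run every check in a fixed order; a crash counts as a failure."""
    results = {}
    for check in (check_startup, check_login, check_order):
        try:
            results[check.__name__] = bool(check())
        except Exception:
            results[check.__name__] = False
    return results

print(run_smoke_tests())
```

Because the script, the order of the checks and the expectations never vary, a failure after a "harmless" bug fix is unambiguous evidence that something fundamental broke.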
Scalability and performance testing are normally the last types to be done. Please note that I say “are”, not “is”. These are very different things, as some developers have learned at great cost in late nights and missed deadlines. Performance testing shows how fast the code runs for one user. Scalability testing shows how well the application copes with many users and/or much data. Performance can be important when designing a game, a compiler or a system that is essentially single user. Scalability can be critical in a modern three-tier application. Consider the fastest growing area, namely online thin client solutions. eBay, Amazon, your bank and a hundred other companies do more and more of their business online. If you are launching an electronic banking solution, it is not going to be a happy day when you find that your system cannot be split over multiple servers because of an error in its architecture. Functional testing can start early on with unit testing; scalability testing should start as early as possible, even if you have to simulate large sections of the system. Scalability issues are often fundamental to the design of a system and accordingly fundamentally difficult to fix if you make a bad decision. To make matters worse, you normally discover scalability issues when your application is about ready for release. In the case of some developers, you discover them shortly after your application has gone live, when it almost immediately goes dead again because more than 100 people have tried to use it.
So, what is the relationship between scalability and performance for a web application? Let us imagine a web based application which displays a catalogue and accepts orders. That should be easy to imagine as there are hundreds out there and everyone re-invents the wheel. Performance is when it takes 0.1 of a second to place an order. Scalability is when the response time is less than 0.8 of a second while 500 users are placing orders. The two goals are often in conflict. This is something that application developers often discover when they start writing server applications. In a thick client application, it is a great idea to cache information and preload it for the user. When you have 200 users, that stops being a great idea. It may well be better to go to the database each time and let the database designer worry about how to handle the requests efficiently. That is normally a safe thing to do as the big database manufacturers have spent millions of dollars tuning those database engines.
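The distinction can be sketched in a few lines of Python: time one simulated request on its own for performance, then time 500 of them issued concurrently and look at a high percentile for scalability. Here `place_order` is a hypothetical stand-in for a real request to the ordering page.

```python
import time
from concurrent.futures import ThreadPoolExecutor

def place_order():
    # Stand-in for one request to the hypothetical ordering page.
    time.sleep(0.01)

# Performance: how long does one user's request take in isolation?
start = time.perf_counter()
place_order()
single = time.perf_counter() - start

# Scalability: what is the response time when many requests arrive at once?
def timed_order():
    t0 = time.perf_counter()
    place_order()
    return time.perf_counter() - t0

with ThreadPoolExecutor(max_workers=50) as pool:
    latencies = sorted(pool.map(lambda _: timed_order(), range(500)))

# Judge the loaded system by a high percentile, not the average: the slowest
# users are the ones who give up and shop elsewhere.
p95 = latencies[int(0.95 * len(latencies))]
print(f"single user: {single:.3f}s, 95th percentile under load: {p95:.3f}s")
```

A design change that shaves the single-user time but bloats the 95th percentile under load is exactly the performance-versus-scalability conflict described above.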
There are various tools which are good at testing these systems. Application Center Test, which comes with the Enterprise editions of Visual Studio, is one, and there are multiple third party solutions. Ideally, these should be used with multiple client machines attacking a single server, since several instances of a test tool running on one machine are not at all the same as real distributed clients in terms of timing. I would advise aggressive testing as well. Test for expected load. Test for wildly optimistic load just in case your solution turns out to be what the world was waiting for. Test until failure – push the box, or better yet the boxes, until they fall over. If at all possible, spread the server load over more systems than you initially expect to run on, and repeat. Learn where the bottlenecks are and open them up - and repeat the tests. Ideally, everything should fail at levels you will never reach, with all components giving up at about the same time. Oh, and it is a good idea to test with realistic data if you can, so that you don’t find that you are looking at better performance because you always hit the cache, or worse performance because you have an artificial hotspot in the database. Remember that real data contains errors and users who disappear without warning. So should your test data.
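The "test until failure" idea can be sketched as a stepped load test: raise the simulated user count until a latency budget is breached, and the last step that passes tells you roughly where the knee is. Everything here is an illustrative assumption – the budget, the step sizes, and the crude degradation model inside `place_order`.

```python
import time
from concurrent.futures import ThreadPoolExecutor

LATENCY_BUDGET = 0.1  # seconds; the acceptable response time (an assumption)

def place_order(active_users):
    # Stand-in for one request. Service time here degrades linearly with
    # load, crudely modelling a server approaching saturation.
    time.sleep(0.002 * active_users)

def p95_latency(users):
    """Fire `users` concurrent requests and return the 95th percentile latency."""
    def timed(_):
        t0 = time.perf_counter()
        place_order(users)
        return time.perf_counter() - t0
    with ThreadPoolExecutor(max_workers=users) as pool:
        latencies = sorted(pool.map(timed, range(users)))
    return latencies[int(0.95 * len(latencies))]

# Step the load up until the latency budget is breached: that is the knee.
for users in (20, 40, 80):
    p95 = p95_latency(users)
    status = "OK" if p95 <= LATENCY_BUDGET else "FAIL"
    print(f"{users:3d} users: p95 = {p95:.3f}s  {status}")
```

A real run would use a proper load tool and distributed clients, but the shape of the experiment – step, measure, find the knee, then fix the bottleneck and repeat – is the same.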
Finally, a few words about keeping these systems running after they are first developed. Systems generally fall into two types. The first is systems that will be live for a long time and will evolve as the business changes. This is a very common scenario. The second is one-off systems written for a special event such as an election or the Olympics (which I ignored earlier). These don’t live long but they are very critical while they last. In both cases, I recommend that there should be a test server (or cluster, as the case may be). This is also sometimes called a staging server. This server should be identical to the real one. It should have copies of all the supporting servers. It should be in a position where it could be switched live in a few minutes if needed. Except when it is being used for specific testing, it should be identical to the live server, and you must be ready to revert it to the state of the live server or send it live. You need this. The alternative is to test any changes in the software on the live business with no fallback plan. You also need it to act as a backup for a worst case scenario failure, or even to share load if the system has a surge in demand.
I have often explained this to wise and intelligent people who have been shocked by the idea. They always ask me the same question in an appalled voice - “Have you any idea how much that would cost?” As it happens, yes, I do have a pretty good idea of what it would cost. How much not doing this would cost a business in the worst case varies a great deal. I have known companies who were losing over a third of a million dollars a day because they didn’t have such a setup. The debate about what to do took 3 days. Delivery and set-up of the hardware took 3 more. The system had already been down for a few days. All in all, the lack of the test server cost them about $3 million. The test system cost them about $6000 plus time from their systems admin staff. Even if the losses had been 1% of what they were, the test server would have been cheaper.
So, testing may not seem very exciting. I would agree that it is not. However, not testing properly is very exciting indeed. If you live for the thrill and don’t care about how difficult it is to get another job, then feel free to go for the excitement. For the rest of us with a mortgage and 2.4 children, I cannot recommend good testing highly enough.