Unit Testing: Defining the Right “Unit”
I am a strong proponent of automated testing and test-driven development. But if asked whether I do unit testing versus integration or some higher level of testing, I will usually ask the questioner to define “unit.”
To some, this may seem like questioning the definition of “is,” but I don’t think so. Consider a set of unit tests that test a single class, the stereotypical case. Let us assume the class was somewhat complex and there were fifty tests to exercise all of its code. Later, the class gets refactored into a facade that preserves the interface of the original class, with a small set of simpler classes behind it that work together to do the original class’s work.
Should we now write unit tests for each of these new classes? Why? What is the ROI? If the original suite sufficiently tested the original class, and the refactoring was just that, a change in code structure that did not modify behavior, does that suite not now sufficiently test the classes as a group?
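For illustration, here is a minimal sketch of the idea; the domain and names (a hypothetical PriceCalculator) are invented, not taken from any real codebase. The test exercises only the public interface, so it is equally valid before and after the class is split into a facade with helpers behind it.

```python
import unittest

# Hypothetical "before" implementation: one somewhat complex class.
# After refactoring, PriceCalculator could become a thin facade that
# delegates to smaller collaborators (say, a DiscountPolicy and a
# TaxTable), but its public interface, and therefore this test, is
# unchanged.
class PriceCalculator:
    def total(self, unit_price, quantity, discount_rate=0.0):
        subtotal = unit_price * quantity
        return round(subtotal * (1.0 - discount_rate), 2)

class PriceCalculatorTest(unittest.TestCase):
    def test_discount_is_applied_to_subtotal(self):
        calc = PriceCalculator()
        # The assertion describes required behavior, not which classes
        # cooperate to produce it, so refactoring does not break it.
        self.assertEqual(calc.total(10.00, 3, discount_rate=0.10), 27.00)

if __name__ == "__main__":
    unittest.main()
```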
Indeed, in my experience, even if we started out with a set of classes that work together to perform a business function†, the best value in testing still comes from tests that verify the business function is performed correctly. Such tests remain valid as long as the business function, i.e., the functional specification of the code, does not change. Tests below this level, in my experience, are fragile and break upon refactoring because they are too closely tied to the implementation of the code under test. If the value of testing is to enable refactoring with confidence, then the tests must survive the refactoring.
How does this fit in with TDD? In TDD we should never write code unless we have a failing test. If each test expresses a detail of the functional requirements of the software (as opposed to a detail of its implementation), then each time a test passes, a functional requirement is met; we refactor and move on. It should not matter whether we wrote one line, one class, or a dozen classes to make the test pass.
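As a sketch of what “one test per functional requirement” might look like, consider the following; the requirement and the shipping_cost function are hypothetical, chosen only to show a test that names a behavior rather than an implementation. Written first, each test fails until some implementation, whether one line or a dozen classes, makes it pass.

```python
import unittest

# Hypothetical requirement: orders over 100.00 ship for free;
# everything else pays a flat rate.
def shipping_cost(order_total):
    return 0.0 if order_total > 100.00 else 7.50

class FreeShippingTest(unittest.TestCase):
    def test_orders_over_threshold_ship_free(self):
        self.assertEqual(shipping_cost(150.00), 0.0)

    def test_orders_at_or_below_threshold_pay_flat_rate(self):
        self.assertEqual(shipping_cost(100.00), 7.50)

if __name__ == "__main__":
    unittest.main()
```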
Some may argue that writing comprehensive tests at this higher level of abstraction is too difficult. Rather, one should write only general tests at that level. One might, for example, assert that a value is returned, but write lower level tests to assert that the correct value is returned for every edge case.
This can sometimes be true. Sometimes tests for edge cases at higher levels of abstraction are harder to set up than the effort is worth, and a lower level test, even if fragile, gives better ROI. However, in my experience, what generally makes testing the edge cases difficult at the higher level is the same thing that makes any testing difficult: bad design, inexperience with testing, bad tooling, or a combination thereof. And even granting that higher level tests are objectively harder to write, practicing the harder tests eventually makes them easy, and makes one a better tester than one otherwise would have been.
In the end, for each business function there needs to be an API whose implementation is responsible for performing that function, and that implementation needs to be testable within one or more contexts (a given system state that can be mocked or otherwise simulated). Whether the implementation is a single function, a class, or an entire module is not the relevant concern. The concern is what the inputs and outputs are, and testing that the anticipated inputs all lead to their correct outputs.
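One way to picture “testable within a context” is a test that simulates the surrounding system state with a stub or mock. The sketch below is purely illustrative; the RentalService, its repository, and the three-rental rule are all assumptions made up for the example.

```python
import unittest
from unittest.mock import Mock

# Hypothetical business function: a customer may have at most three
# open rentals at a time.
class RentalService:
    def __init__(self, rental_repository):
        self._repo = rental_repository

    def can_rent(self, customer_id):
        return self._repo.count_open_rentals(customer_id) < 3

class RentalLimitTest(unittest.TestCase):
    def test_customer_at_limit_cannot_rent(self):
        # The mocked repository stands in for the system state (the
        # context); the test cares only about the inputs and outputs
        # of the business rule, not how it is implemented.
        repo = Mock()
        repo.count_open_rentals.return_value = 3
        self.assertFalse(RentalService(repo).can_rent(customer_id=42))

if __name__ == "__main__":
    unittest.main()
```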
Yes, we want to write our tests at the lowest level possible within this context. But we do not want to go below this level. We do not want to be testing components that are simply implementation details of the software’s functionality. Such tests break under refactoring, lead to higher maintenance costs, rarely add value and hence have poor ROI.
There is an exception. For teams or developers new to TDD, and to writing well designed code in general, lower level tests can provide value. Writing lower level tests is easier. More importantly, being forced to make the lower level components testable helps one learn good design. It enforces loose coupling, proper abstractions, and the like. However, once these skills are internalized, they can be exercised without needing tests to enforce them. These tests are a learning tool that can and should be discarded.
There is a corollary to this. If a developer doesn’t stop writing these low level tests once he no longer needs them, if he doesn’t instead start writing tests at the business function level, it is entirely possible to develop a system that is fully “tested” but fails to do the right thing. Every low level unit can work as intended but in aggregate fail to work together as intended. One needs tests that assert that the system as a whole, or meaningful segments of it, perform as intended.
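A contrived sketch of that failure mode, with invented functions, might look like this: each piece is “correct” in isolation and would pass its own unit tests, yet the composition does the wrong thing, and only a test of the composed behavior would catch it.

```python
def subtotal_in_cents(unit_price_cents, quantity):
    # Unit-tested and correct on its own terms: works in integer cents.
    return unit_price_cents * quantity

def add_sales_tax(subtotal_dollars, rate):
    # Also unit-tested and correct on its own terms: expects dollars.
    return subtotal_dollars * (1.0 + rate)

# The aggregate bug: cents are passed where dollars are expected.
# A business-level test (three 2.50 items taxed at 10% should come
# to 8.25) would fail immediately; the unit tests never will.
total = add_sales_tax(subtotal_in_cents(250, 3), 0.10)  # roughly 825, not 8.25
```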
I will conclude by admitting that I have not truly answered the question I started out with: how do we correctly define a unit? I have asserted that the best definition of a unit is “the code that implements the lowest level business function.” In short, we need to be able to discern the boundary between business function and implementation detail. Pointers on how to do this shall perhaps be the topic of another post. For now I will only say that finding the level of test abstraction that will maximize ROI is as much an art, learned from experience, as it is anything else. But one will never develop the art unless one first realizes it is to be sought after. And challenging those who have not already done so to start looking is the real point of this post.
† Throughout this discussion, I am using the term “business functionality” loosely to refer to what the software is supposed to do conceptually, the details of its functional specification as distinct from details of the implementation of that specification. The term “business” itself may not be properly applicable to all real world cases.