https://luzkan.github.io/smells/what-comment
I don't think the given code example is that bad or that the suggested replacement is an improvement. This will mostly be about the "factor out methods" part and less so about the "what comments" part.
- The original code could be read from top to bottom, in order. The replacement code is spread out across the screen in non-linear order, which harms legibility.
- A comment can be a grammatical English sentence that explains something. A function/method name must be fairly short and has a limited character set. By turning comments into function names, we are losing our power to express ourselves in natural language.
- It may be argued that a comment may go out of sync with the code it describes. But why does this argument not also apply to the name of the function wrt its body?
- In the original function, all the report generating/tweaking etc was an implementation detail of the
run
method, invisible to the outside world. In the replacement, create_report
and send_report
are now public. This means other people may use those methods independently of run
. So, in the first case, we could refactor run
with impunity, but in the latter case, we have to maintain backwards compatibility for create_report
. Moving internal functionality out into separate methods is not without cost, as it increases the "API surface area". Even if you name them with an underscore prefix _create_report
, that's no guarantee someone won't use them anyway, since Python has no real notion of public vs private methods. See also Hyrum's Law.
- The example obscures likely complexity behind the
...
in the parameter lists. What often happens in real code is that these intermediate stages require a large amount of state to do their job. They can depend on a lot of variables that are defined and mutated at various points in the run
method. When you factor out these stages into separate functions, you somehow need to make that state accessible to the new functions. There are three ways of doing this, all of them bad:
- Global variables. These are obviously bad so I won't spend more time on them
- Really long parameter lists:
def create_report(self, spam, eggs, ham, apple, banana, orange, grapefruit, ...)
. These things are unwieldy, and a sign that the function really shouldn't be a function.
- You can paper it over by bundling up all the variables into a dictionary//dataclass and passing in that "parameter object" as a single argument. But a) you still need to somehow document what keys/fields are on the parameter object, and b) now you need to construct the parameter object at the callsite and destructure it again in the pulled-out method. All this added bureaucracy is incidental complexity, which reduces the signal-to-noise ratio of our code.
- Turn the variables into attributes on
self
and share the state via the class instance, i.e. self.spam
, self.eggs
, and so on. This is only slightly better than global variables -- it's still shared mutable state and all the headaches that implies. And again, people might end up relying on those instance attributes in downstream code, or elsewhere in your class. State that was once localized in both space and time, risks becoming entangled across the entire class or wider codebase.
I've worked on codebases where people did this pervasively. Perfectly readable-from-top-to-bottom code was chopped up and scrambled into an indecipherable labyrinth of methods, methods which otherwise served no standalone purpose. The motivation was only that the original method body exceeded some arbitrary length. I came to think of it as an antipattern.
Fundamentally, the underlying fallacy is in not understanding what a function is for. In Python, at least, the syntax:
def some_function(a, b):
... # blah blah blah
return result
Does three things:
- it gives a name to a block of code
- it lets us reuse that block of code
- it introduces a new scope for the duration of that block (and pass in arguments and return a value from it)
However, we do not always need all three things. The Principle of Least Power suggests that we should use the least powerful language construct for our purposes. For example, often we want 2 and 3, but not 1, i.e. an anonymous function. The lambda
syntax lets us define anonymous functions in that case, e.g. for making short inline callbacks passed into another function.
In code like the # Creating report
section of the original run
method, it's very common that we don't have any particular desire to reuse it, because it's some highly specific non-generalizable thing from the middle of some complicated piece of work. In fact it can be harmful to allow reuse, for the reasons given earlier. (Think of it as a kind of information hiding / implementation hiding). Furthermore, it's not clear that naming the code via the function name gives any benefit over describing it in English in a comment, given that we're not using it in more than one place. So, one reason for using a function is ruled out, and the other is equivocal.
It is the third reason, introducing a new scope, that is most attractive. Scopes are a powerful tool for managing state. Any new variables defined in a function body are local to that function, and are destroyed and garbage collected when the function exits, unless explicitly returned to the caller. They also reduce cognitive load -- nobody has to worry that the banana
variable defined in function A has anything to do with the banana
in function B.
This reasoning suggests the following course of action:
Instead of factoring out the sections of run
into public methods, make them into nested functions of run
:
class Foo:
def run(self, ...):
...
def make_report():
vanilla_report = get_vanilla_report(...)
tweaked_report = tweaking_report(vanilla_report)
final_report = format_report(tweaked_report)
return final_report
report = make_report()
def send_report():
send_report_to_headquarters_via_email(final_report)
send_report_to_developers_via_chat(final_report)
send_report()
Note the following features of this code:
- You can read it in order, from top to bottom
- The nested functions do not need to take any parameters, because their bodies see everything of the enclosing scope (i.e. if
run
defined a variable a
, then make_report
can access a
). This means there is no trouble with enormous parameter lists or implicit state on self
...
- But the
vanilla_report
and tweaked_report
variables are constrained to only exist within the body of make_report
. There's no possibility that tweaked_report
gets randomly used again 200 lines later on in run
. This means, as soon as our eye leaves the body of the nested function, we can safely forget about those variables. This frees up mental capacity to focus on the next thing. Remember, we can only keep 7+/-2 things in mind at once! We reduce cognitive load by limiting the amount of stuff existing in the run
namespace.
- Nobody will improperly depend on
make_report
or send_report
because they can't be accessed from outside run
...
- But we have not lost any flexibility, because we still have the option to factor out
make_report
, send_report
into their own methods if that does turn out to be desirable in the future.
But the proposal is still deficient, because of the redundancy:
def make_report():
...
report = make_report()
We are using a function for something that is only ever used once (lexically), which gives rise to the awkward pattern where we define a function then immediately invoke it to assign the result to a variable, both of which have slightly different names: def make_banana(): ...; banana = make_banana()
. There is still a tension from using a construct which can potentially be reused, but in fact only using it once. This suggests we use a weaker construct, but what?
We could instead name the function the same as the eventual variable that takes on its return value:
def report():
...
report = report()
And this turns out to be a special case of the decorator pattern! i.e. whenever we have
def some_function(...):
...
some_function = something(some_function)
We can use the @
syntactic sugar to shorten it:
@something
def some_function(...):
...
In our case, we can use this seemingly trivial decorator:
def assign(f):
return f()
And use it like this:
@assign
def report():
vanilla_report = get_vanilla_report(...)
tweaked_report = tweaking_report(vanilla_report)
final_report = format_report(tweaked_report)
return final_report
What this does is define the report
function, immediately calls it, then assigns the result to a variable also named report
. This means the original function is no longer accessible, i.e. it's a function that can only be called once! Now there is no redundancy.
I called it assign
because you can think of this decorator in another way -- instead of a weakening of a function, it's an expansion of the assignment statement =
. It's as if we could write:
report =
vanilla_report = get_vanilla_report(...)
tweaked_report = tweaking_report(vanilla_report)
final_report = format_report(tweaked_report)
return final_report
i.e. assignment over multiple lines, with its own scope, that we can early-return from if we want.
We can do something similar for the sending report part, although in that case there's no meaningful return value, so assign
is a misnomer. We can make another, even weaker decorator:
@do
def send_report():
send_report_to_headquarters_via_email(report)
send_report_to_developers_via_chat(report)
This makes it clear that we don't care about the return value (the variable send_report
will end up with the value None
). If send_report
needed to do some intermediate computations, it can use local variables to keep track of those and they won't persist afterwards.
The final code is:
class Foo:
def run(self, ...):
...
# Detailed "what" comment here
@assign
def report():
vanilla_report = get_vanilla_report(...)
tweaked_report = tweaking_report(vanilla_report)
final_report = format_report(tweaked_report)
return final_report
# Detailed "what" comment here
@do
def send_report():
send_report_to_headquarters_via_email(report)
send_report_to_developers_via_chat(report)
And it's perfectly fine to include a long "what" comment here, if it is necessary. Comments and function/variable names are both for human beings, not the computer. Or if you still find long "what" comments offensive, you can format it as a docstring for the nested function instead 馃檪
Obviously, if we really do need to use a piece of code more than once, and/or we want it to be accessible from the outside (e.g. in tests), then by all means we should factor these parts out into public methods/functions. My argument is that we should think carefully before doing so, and deliberately opt-in to doing it, because there are costs as well as benefits.
To conclude: the advice to "factor out long functions into separate methods" misdiagnoses the problem as one of function length, when really the problem is how much local state the function encompasses. They are correlated but they're not the same. State is the thing that's hard to reason about. We manage state by partitioning it into smaller scopes. In that sense both approaches are the same. But in the approach I've given, the order of execution remains the order things appear on the screen, we don't introduce more publicly-visible methods than we really need. For the kinds of "we need to do a long laundry list of stuff" type functions (there are more of these than the "Clean Code" people want to admit), this way of organizing code is much better in my experience.
I didn't come up with all this myself, I got a lot of the ideas from the video essay Object Oriented Programming Is Bad. The relevant part starts at about 37:14.