Tuesday, August 26, 2014

Why Your Code Has So Much Duplication

Here is a quite nice example code block for C#'s Dataflow blocks, specifically showing usage of the WriteOnceBlock:

ActionBlock writeToConsole1 = new ActionBlock( integer => Console.WriteLine( "Console 1: " + integer ) );
 
// true if the source should unlink from the target after successfully propagating a single message
// otherwise, false to remain connected even after a single message has been propagated
bool unlinkAfterOne = false;
 
WriteOnceBlock writeOnceBlock1 = new WriteOnceBlock( integer => integer );
writeOnceBlock1.LinkTo( writeToConsole1, unlinkAfterOne );
 
// prints 12 via Console 1:
// Console 1: 12
writeOnceBlock1.Post( 12 );
 
// create 4 additional Targets
ActionBlock writeToConsole2 = new ActionBlock( integer => Console.WriteLine( "Console 2: " + integer ) );
ActionBlock writeToConsole3 = new ActionBlock( integer => Console.WriteLine( "Console 3: " + integer ) );
ActionBlock writeToConsole4 = new ActionBlock( integer => Console.WriteLine( "Console 4: " + integer ) );
ActionBlock writeToConsole5 = new ActionBlock( integer => Console.WriteLine( "Console 5: " + integer ) );
 
// link those Targets to WriteOnceBlock which already holds a value
writeOnceBlock1.LinkTo( writeToConsole2, unlinkAfterOne );
writeOnceBlock1.LinkTo( writeToConsole3, unlinkAfterOne );
writeOnceBlock1.LinkTo( writeToConsole4, unlinkAfterOne );
writeOnceBlock1.LinkTo( writeToConsole5, unlinkAfterOne );
 

This is example code, and so there was no reason to make it particularly clean and beautiful and obvious via method and variable extraction (though look at how the author aliased the non-obvious false value as the unlinkAfterOne variable). This code illustrates how a lot of real world code is written.

I think the example code is just fine. I wouldn't change it. Examples have different requirements and goals than real code. They're meant to take up little space on a page and to be complete enough that you can follow them without investing much, and so that the function can be illustrated. I'm not picking on this author. He did a good job.

It is when real-world code follows this pattern that we need to rethink organization. Check out the signal to noise ratio in the code. Some lines have very little unique content, mostly duplicating the line above or below them.

For example code, it's really great. For REAL code, however, it's got a real problem.

What if we rearranged the code to be by subject rather than by step? It would look like this:

const bool unlinkAfterOne = false;

ActionBlock writeToConsole1 = new ActionBlock( integer => Console.WriteLine( "Console 1: " + integer ) );
 
WriteOnceBlock writeOnceBlock = new WriteOnceBlock( integer => integer );
writeOnceBlock.LinkTo( writeToConsole1, unlinkAfterOne );
writeOnceBlock.Post( 12 );
 
// create and link additional Target 2
ActionBlock writeToConsole2 = new ActionBlock( integer => Console.WriteLine( "Console 2: " + integer ) );
writeOnceBlock.LinkTo( writeToConsole2, unlinkAfterOne );

// create and link additional Target 3
ActionBlock writeToConsole3 = new ActionBlock( integer => Console.WriteLine( "Console 3: " + integer ) );
writeOnceBlock.LinkTo( writeToConsole3, unlinkAfterOne );

// create and link additional Target 4
ActionBlock writeToConsole4 = new ActionBlock( integer => Console.WriteLine( "Console 4: " + integer ) );
writeOnceBlock.LinkTo( writeToConsole4, unlinkAfterOne );

// create and link additional Target 5
ActionBlock writeToConsole5 = new ActionBlock( integer => Console.WriteLine( "Console 5: " + integer ) );
writeOnceBlock.LinkTo( writeToConsole5, unlinkAfterOne );

This is doing the same thing, but in a slightly different order.

Organizing by subject lets us see that entire swaths of code are duplicated, where organizing by step made it look like duplication was slight and very local -- one line at a time had a small change that would complicate duplication removal.

Now we can see that the code in Target2 differs from target3 in only two ways: the variable name, and the text.  The difference in the two paragraphs is one digit, repeated three times.  There is a lot of code space eaten up for such a tiny bit of unique content.

Some will note that in the new example, we don't create all the variables at the top of the block, and that bothers them because it's harder to find where variables are declared.  Bear with me, because that problem will start to disappear in a moment.

Now can you see the signal:noise issue?

We can simplify this by extracting one function. This makes us extract the unlinkAfterOne to a class field, but that's not a big deal (I think).

public void WriteToConsole(string id, WriteOnceBlock parent) 
{
ActionBlock block = new ActionBlock( integer => Console.WriteLine( "Console " + id + ": " + integer ) );
parent.LinkTo( block, unlinkAfterOne );
}

Now the naming is weak, but let's ignore that for a second...

ActionBlock writeToConsole1 = new ActionBlock( integer => Console.WriteLine( "Console 1: " + integer ) );

 
WriteOnceBlock writeOnceBlock1 = new WriteOnceBlock( integer => integer );
writeOnceBlock1.LinkTo( writeToConsole1, unlinkAfterOne );
writeOnceBlock1.Post( 12 );
 
var writeToConsole2 = WriteToConsole("2",writeOnceBlock);
var writeToConsole3 = WriteToConsole("3",writeOnceBlock);
var writeToConsole4 = WriteToConsole("4",writeOnceBlock);
var writeToConsole5 = WriteToConsole("5",writeOnceBlock);

The code is smaller already, and has a better signal:noise ratio. But if we look we see that console 1 is really the same kind of code as the others listed below. It's the same kind of thing, though it wasn't inline with the duplication we saw earlier. Let's fix that.

 
var writeOnceBlock = new WriteOnceBlock( integer => integer );
var writeToConsole1 = WriteToConsole("1",writeOnceBlock);
writeOnceBlock1.Post( 12 );
 
var writeToConsole2 = WriteToConsole("2",writeOnceBlock);
var writeToConsole3 = WriteToConsole("3",writeOnceBlock);
var writeToConsole4 = WriteToConsole("4",writeOnceBlock);
var writeToConsole5 = WriteToConsole("5",writeOnceBlock);

Now, we have used var and an extracted method to reduce the duplication, increasing the signal:noise ratio considerably.  Locally, we have a lot less code to deal with.

Overall, we need fewer comments now and we have less "fiddly bits." The code has fewer ways to be broken. There is still a lot of duplication in names. We might create a builder . But right now, it's so simple that we don't need comments. All we need is to see the helper function once and we grok this code.

Provided I need all of these consoles for some purpose other than demonstration, we could compress this further this way:

 
var consoles = new List>();

var writeOnceBlock = new WriteOnceBlock( integer => integer );
writeOnceBlock1.Post( 12 ); 

foreach(var id in new[]{"1","2","3","4","5"}) {
consoles.Add(WriteToConsole(id,writeOnceBlock));
}
forfo

This code can be extended more simply than before, and is down to about 5 statements (depending on how you squint. Now that it's so compressed, we could inline the WriteToConsole and not lose readability, but I actually find it more confusing when the loop body is complicated by creating the node and linking it inside the loop.

The real question, though, is not merely code size. If you were to modify the first loop to add a 6th, how would you do it? By copy-paste duplication? What if you were to modify the latest code -- would you use duplication?

The problem with having duplication is that it increases the urge to create duplication. Code that is small and tight might sometimes require some rework, but it is more likely to be managed without copy-paste.

The reason code has so much duplication is that duplicate breeds duplication. Clean code has advantages because the easiest way to maintain it is not the worst possible way to maintain it.

You may not agree that this is better.  If so, join me in the comments.


No comments:

Post a Comment