11 January 2018

Abusing multiple-dispatch creatively

As part of creating a new POD tree for the Perl 6 utilities in hopes of letting others create their own subclasses, I came across this interesting use of multiple-dispatch. One of the Pod types that the Perl 6 compiler generates is for inline attributes like making text bold or italic with formatting codes like 'B<text bold>'. Bold, italic, and underline formatting codes all generate the same Pod::FormattingCode object, with a different .type value, so they look something like

class Pod::FormattingCode {
  has Str $.type; # This is 'B', 'I', etcetera as the need arises.
}
class Pod::FormattingCode::Bold { }

I have a bunch of .to-node() methods that are specialized on turning raw Pod objects into something a little more useful. One of these, to convert a Pod::FormattingCode into my internal Node::FormattingCode::Bold object, looks like this:

multi method to-node( Pod::FormattingCode $pod ) {
  given $pod.type {
    when 'B' {
      Node::FormattingCode::Bold.new( :type( $pod.type ) )
    }
  }
}

my $bold = Pod::FormattingCode.new( :type( 'B' ) );
self.to-node( $bold ); # Calls the multi method above.

All of the methods that convert $something to a node are named .to-node(), and I can rely on Perl 6's multiple dispatch to keep them separate for me. This is important to me mainly because of locality of reference. If you're debugging my code, and you want to know where something gets converted to a Node:: object, just look through the .to-node() methods. Now, looking at the given-when block, that's going to grow, and by quite a bit. At least three lines for every formatting code that I find in the documentation.

And it gets a bit worse. Say I want to do the right thing, and factor out the '.new(...)' lines into their own method, because I'm pretty sure they'll grow, as I find neat little corner cases for each of the Pod types. I'd have to name them something ugly, which breaks up my idea.

Since the method still converts a Pod object to a Node object, it'd be nice to be able to reuse the .to-node() method, but to fit it into the existing scheme of things I"d have to create a new object like a Pod::FormattingCode::Bold, create a new instance of that, and then I'd be able to do multiple-dispatch on that type. But that means creating not one but two new classes for every Pod::FormattingCode - one for the "shim" that I use to dispatch on, and another one for the actual object I'm going to return to the user. And it's even worse than that, because it's possible, though very unlikely, that the Perl 6 team will one day create a Pod::FormattingCode::Bold class and trounce on my own name-space, undoing my hard work.

Well, as you might have guessed, there is a solution, and it doesn't involve trouncing on anyone's namespaces. And it still lets us stick to the principle of reusing .to-node() for all of our node-conversion needs. Here's how you write it:

multi method to-node( Pod::FormattingCode $pod ) {
  self.to-node( $pod, $pod.type );
}
multi method to-node( Pod::FormattingCode $pod, 'B' ) {
  Node::FormattingCode::Bold.new( :type( 'B' ) );
}

my $bold = Pod::FormattingCode.new( :type( 'B' ) );
self.to-node( $bold ); # Returns a Node::FormattingCode::Bold object.

What this does may not be obvious at first glance, so I'll walk through it. We create a Pod::FormattingCode object of type 'B' for Bold, and magic happens. The first magical bit is that multiple-dispatch kicks in, so that when .to-node() gets called, Perl 6 does its best to find the closest signature. And in this case it's a perfect match. .to-node( Pod::FormattingCode ) is perfectly matched by the first method. Inside that method, we do something a little sneaky. We break out the type, and rely again on multiple dispatch. This time we're calling .to-node( Pod::FormattingCode $pod, 'B' ), and again we have a perfect match, but this time it's the second method, down below. That creates the new Node::FormattingCode::Bold object and returns it.

How did that work, you might ask? Well, multiple dispatch in languages like C++ or Java work based on types. You can sort of simulate what we just did in C++ with templates, and Java's generic sort of do what we did, but not quite, and with much more work. TL;DR Perl 6's multiple dispatch works on argument values as well as types, so you can dispatch both on a generic Str class as well as a specific instance "foo" of Str. This means you can write Haskell-like code in Perl 6.

multi sub fib( $n where * < 0 ) { 0 }
multi sub fib( 0 ) { 0 }
multi sub fib( 1 ) { 1 }
multi sub fib( $n ) { fib( $n - 1 ) + fib( $n - 2 ) }

The where declaration there lets us cleanly handle negative values as well, as a bonus. No matter what (integer) value the client passes to the fib subroutine, Perl 6 will dispatch it cleanly, so that fib(2) will call sub fib(1) and return 1, rather than calling sub fib($n) and going into an infinite regress. I was just working along on Pod::To::Tree, did that particular bit of refactoring and thought you might like to hear about it. This is your friendly neighborhood Perl Fisher, signing off.

07 January 2018

Tree Surgery

HTML and I tend to get along like a house on fire. I can work with it when the need arises, but I still prefer to do semantic markup and let CSS do with it what the end-user wants to see, which these days is shiny sidebars, blogroll posts and code blocks that neatly let the user just drag-and-select the text they want and copy it into their editor of choice.

As you can see by this posting, I haven't quite gotten there yet. Equally, you can see that I haven't given up altogether, because I'm still here, posting and tweaking things. A couple of months ago (okay, it just feels that way) I figured out that it'd be much simpler for me to take a Perl 6 POD document and add the markup that I need to that.

After some pondering I realized "Wait, isn't there already a Perl 6 POD converter out there? Called perl6 --doc? And can't I re-purpose what it does to generate Blogspot-ready HTML?"

Pod::To::HTML already is out there, but the way the code is written made it hard if not impossible to subclass, because most of the important stuff was inside sub blocks, not out in methods where they could easily be sub-classed.

So I started rewriting it. A few days later, after grumbling and wondering why this particular bit had been written the way it was, I went onto the usual suspect Perl IRC channel where the author likely hung out, so I could get a useful answer. He wasn't around, but someone pointed me to another module, Pod::TreeWalker, and told me that he too was interested in fixing this.

So, as things happen, conversation started. I rewrote the guts of what I had with Pod::TreeWalker, and found that it did some things I really didn't want it to. All I wanted... hoo boy. I know that phrase all too well, it usually means I'm going to end up rewriting some module only to run into the problems they encountered. Well, once more into the breach, and all that.

So, I decided that while I liked Pod::TreeWalker's style - it's an event-driven system, it did place some limitations on how you could work with it. For example, when a pod-table-start event came along, it was pretty easy to figure out "Hey, this is the start of a POD table, so I can generate a `<table>' string and add it to my running HTML." And this worked fine, for all of about 20 minutes. Because it also generated other events in places where I didn't think it should, such as paragraphs inside a list item, which made the generated HTML look odd.

For instance,

=item foo

generates this sequence, when marshalled back out to HTML:

<li><p>foo</p></li>

What's happening here is that the library sends out:
  • An 'item-start' event, so I tack on '<li>' to the HTML.
    • A 'paragraph-start' event, so I tack on '<p>' to the HTML.
    • A 'text' event, with 'foo' as the txt, so I tack on 'foo'.
    • A 'paragraph-end' event, so I tack on '</p>' to the HTML.
  • An 'item-end' event, so I tack on '</li>' to the HTML.
Now, as I've mentioned earlier, I didn't particularly like the fact that the sequencer created a paragraph event, when there's no particular need to do so. But I'm stuck with it. I still have possibilities, though. I can...
  1. Post-process the HTML and do a quick regex to remove the <p>..</p> tags, but, and repeat after me, friends don't let friends run regex on HTML.
  2. When a 'paragraph-start' event is encountered, see if the HTML has <li> already there, and ignore it if so. But see #1.
  3. Hack the module to pass along the "parent" event when firing off an event, so I could look at the "parent" event and if that's a paragraph, ignore it.
  4. Wait a minute, parent... <digs_through_source/> it's already got a tree of POD it's walking, if I pull out just the tree, then it's actually less code to walk the tree, and when I encounter the <p> node I can tell it to look at its parent... right.
Armed with this realization, I sally forth. Lucky for me, Perl 6 POD is already laid out as a tree, so it's pretty simple to start walking it. Now, there are a bunch of straightforward ways to write a walker like this, but I rather prefer to use the multiple-dispatch features of Perl 6 for this purpose.

method walk( $node ) {
  self.visit( $node );
  if $node.^can('contents') {
    for @( $node.contents ) -> $child {
      self.walk( $child );
    }
  }
}

POD nodes are laid out in a pretty simple fashion. They're either "leaves" like text blocks (our infamous paragraph contents) or they're shells, like a list of items. An easy way to tell whether a node is a leaf or not is whether it has 'contents', and we do that by the "^can('contents')" test. Just calling the method directly would work as well, but every time we called it on a leaf node, we'd get a runtime error. Not good.

Once you know that bit, the code sort of falls into place.
  • Visit $node (AKA "do something with it", like print its text)
  • If it's got children:
    • For each child (that's the "@( $node.contents ) -> $child" bit)
    • Walk over that child.
So your user-declared visit() method will get called once on every node in the entire tree, in a depth-first search, so it's in the perfect order to return HTML to you. Well, almost, but the exceptions aren't worth talking about.

Great, we can walk the tree in depth-first order, and we've got a handy visit() method that'll do something. We can even add a $.html attribute that we can accumulate our HTML into as we go along, problem solved!

has Str $.html;
method visit( $node ) {
  if $node ~~ Pod::Table {
    $.html ~ '<table>'; # .. hey, wait a minute...
  }
}

Hold the phone, this just tells me when we've encountered, say, a Table node. I wanted to be able to write something when a table starts, and when it ends. And I wanted to know what the table's parent was, like we talked about lo those many paragraphs ago.

No worries, we're really close, honest. I'll change the walk() method just to show you.

method walk( $node, $parent = Nil ) {
  self.start-node( $node, $parent );
  if $node.^can('contents') {
    for @( $node.contents ) -> $child {
      self.walk( $child, $node );
    }
  }
  self.end-node( $node, $parent );
}

The '= Nil' is a handy shortcut so that you can call the walk() without having to specify a Nil parent. In your code you can just call walk($pod) without anything special, Perl 6 will just fill in the missing argument for you.

Also, you'll see that the generic visit() call is gone, there's now in its place a start-node($node,$parent) and end-node($node,$parent) call. We can easily use them like this:

has $.html;
method start-node( $node, $parent ) {
  given $node {
    when Pod::Table { $!html ~= '<table>' }
  }
}
method end-node( $node, $parent ) {
  given $node {
    when Pod::Table { $!html ~= '</table>' }
  }
}

And voilá! start-node() gets called when a table starts, and its companion end-node() gets called after all of the table contents are displayed, so we can write in the '</table>' tag at the right time. And we can even check out the table's parent node at $parent. If there isn't one, then we're at the top of the tree!

There are a few minor downsides to this, though. For one, every time we learn about a new Pod node, we're going to have to update both the start-node() and end-node() method. But we can fix that simply. Perl 6 lets us dispatch methods by type, using the multi keyword. So, let's try that.

has $.html;
method start-node( Pod::Table $node, $parent ) { $!html ~= '<table>' }
method end-node( Pod::Table $node, $parent ) { $!html ~= '</table>' }

Much less noise, and Perl 6 will know exactly how to dispatch our types. But when the code out in the wild encounters a new Pod node that we didn't know about, it'll break with a horrible stacktrace, so let's fix that right now.

has $.html;
method start-node( $node, $parent ) { die "Unknown node " ~ $node.^WHAT.perl }
method end-node( $node, $parent ) { die "Unknown node " ~ $node.^WHAT.perl }

There, now our code will gracefully die when it's encountered a node that it's never seen before, and report exactly what the node is so that when someone makes a bug report on GitHub we'll know what to do.

Now, I should reveal that my upcoming Pod::To::HTMLBody module doesn't quite work like this. I do use some of these techniques behind the scenes, and ultimately I walk the tree almost exactly in the same way, but I've done things differently for several different reasons. I guess you'll have to wait for the next part of this article to learn what's going on, and what new challenges I faced making this particular module.

Until then, this is your humble author signing off.