Wherein the toys we play with reveal the changes that have morphed us over time, and sometimes the parts can be viewed in reverse to reveal an alternate meaning.
THE WEEKLY CHALLENGE – PERL & RAKU #96
episode one:
“Dancing Backwards in High Heels”
Task 1
Reverse Words
Submitted by: Mohammad S Anwar
You are given a string $S.
Write a script to reverse the order of words in the given string. The string may contain leading/trailing spaces. The string may have more than one space between words in the string. Print the result without leading/trailing spaces and there should be only one space between words.
Example 1:
Input: $S = "The Weekly Challenge"
Output: "Challenge Weekly The"
Example 2:
Input: $S = " Perl and Raku are part of the same family "
Output: "family same the of part are Raku and Perl"
Method
Well there’s nothing to this, right? We use split on whitespace, reverse the array and then join it up again. Done. Next!
my $str = " A B C ";
## note the added vertical bars to show the problem
say join '| |', reverse split /\s+/, $str;
Oh, wait…
~/Code/PWC/splittest.pl
--------------------------------------------------------------------------------
C| |B| |A| |
Is that a… trailing space? (Yes, I added the vertical bars to make it more visible.) Hmmm. It seems split is finding the delimiter at the beginning of the string, producing an extra empty element at the front of the list, which ends up at the end of the list once reversed, which gets involved in the join, which leaves us with a space.
This wouldn’t be a problem if we trimmed the whitespace around the string. We could do this by substitution, matching on space at the front, and hey, why not the back while we’re at it, and replacing it with nothing. Snip, snip and we’re back in business.
say join ' ', reverse split /\s+/, $_[0] =~ s/^\s*|\s*$//gr;
Now we’re stripping stuff off the front and back of the string and all is well again in the world. It is, however, a bit… messy.
Now “messy” isn’t really a quantifiable term, strictly speaking. It’s just a statement of being. One that left unchecked leads to hairiness, and hirsute code leads to confusion, bugs and premature heat death of the universe. So that’s to be avoided.
You know, now that we’ve brought the big guns of the regex engine to bear, it seems a little repetitive to both do the substitution and the split. We should be able to combine those ideas, and indeed we can:
my @out;
$_[0] =~ s/(\w+)/ unshift @out, $1 /ge;
return join ' ', @out;
Here we incorporate the /e switch to eval the matched capture, only in this case harnessing a side effect, as it is known, loading each match into an array from the front and so reversing the order of occurrence to read last to first. Essentially we’ve turned our regex into a loop structure. It’s also a bit longer.
I know, I know, lines of code do not equate to quality, shorter is not by definition superior, and a better goal is to keep things to one action per line for clarity. That said, harnessing a side effect, while legal and valid, seems to me akin to using map in void context: map is for applying a function over an array of data, not just another cool way to make a loop. Using it that way seems unnecessarily indirect when a simple for is waiting around for expressly that purpose. Likewise turning a substitution into a loop works, but here we aren’t even substituting anything. It seems too clever and could probably be phrased better with a while and a simple match, broken up into several lines, as in the sketch below. I think we can do better.
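Something like this rough sketch, say, sticking with the \w+ match we already have (a sketch, mind, not the final solution):

my @out;
while ( $_[0] =~ m/(\w+)/g ) {    ## each scalar-context /g match resumes where the last stopped
    unshift @out, $1;             ## load from the front, reversing the order
}
return join ' ', @out;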
And better we can. One thing we’ve finally done is bring the focus of the matching around to the things we do want, versus the things we want to exclude. We’re matching a list of words and selecting these out, rather than finding whitespace to remove. With that in mind, why be picky at all about what we want, when we know exactly what we don’t? We don’t know every bit of punctuation or such that might find its way into a word. But space is, like, space, man.
We can snatch what we want by defining it as not being the thing we don’t want. Instead of word characters, let’s indiscriminately match everything that’s not whitespace. And if we do that match globally, we will produce a list, which we can then reverse and join:
join ' ', reverse $_[0] =~ m/(\S+)/g;
And there we have it. Compact, clear and direct.
PERL 5 SOLUTION
Jan 21, 2021 at 10:30:47 PM
~/PWC/96_1_backwards_sentence.pl
--------------------------------------------------------------------------------
family same the of part are Raku and Perl
use warnings;
use strict;
use feature ":5.26";
sub rev_sentence {
    return join ' ', reverse $_[0] =~ m/(\S+)/g;
}

say rev_sentence( " Perl and Raku are part of the same family " );
Raku Solution
In Raku the process works the same, but we get to chain the functions in a clear graphical order. I like the airiness of well-formatted Raku.
unit sub MAIN (Str $str = " The Weekly Challenge ") ;
($str ~~ m:g/ (\S+) /).reverse
                      .join(' ')
                      .say;
episode two:
“Toying Around Until We Find an Answer”
Task 2
Edit Distance
Submitted by: Mohammad S Anwar
You are given two strings $S1 and $S2.
Write a script to find out the minimum operations required to convert $S1 into $S2. The operations can be insert, remove or replace a character. Please check out the Wikipedia page for more information.
Example 1:
Input: $S1 = "kitten"; $S2 = "sitting"
Output: 3
Operation 1: replace 'k' with 's'
Operation 2: replace 'e' with 'i'
Operation 3: insert 'g' at the end
Example 2:
Input: $S1 = "sunday"; $S2 = "monday"
Output: 2
Operation 1: replace 's' with 'm'
Operation 2: replace 'u' with 'o'
Method
I first became acquainted with the Levenshtein distance when I was doing some Natural Language Processing work, trying to find fuzzy matches between named entities pulled out of real-world, uncontrolled documents, these documents referring to related events but written by different, basically anonymous authors. Sometimes there would be small but significant differences between names pulled from different sources, and an edit distance could provide a metric to help decide whether different documents were in fact naming the same person or place.
Of course this didn’t work perfectly, as often a single middle initial is added to distinguish different people with the same sur- and given names. But the corpus itself consisted of related information, within both files and file names, so in fact the misses were fewer than one might expect.
So the Levenshtein distance ended up being useful to this task without quite solving it completely, as more of an aid to guessing than a universal panacea. Such is par for the course in NLP, where there is generally no one tool available to solve your problem, but there are several techniques that will gladly help you solve part of your problem. This idea, of breaking a problem into smaller parts and solving those, curiously factors into our solution today.
About the “edit distance” specified in the task description — there are differing ideas about this too. Broadly speaking, it is the number of changes required to turn one word into another. When we allow the three actions
- insertion of a single character
- deletion of a single character, or
- substitution of one character for another
this matches the operations for calculating the Levenshtein distance, so that’s what’s gotten me talking about it now. In his original work, Levenshtein considered both insertion and deletion as one operation each, and the cost of a substitution as 2, a deletion followed by an insertion. Nowadays the operation cost of substitution is usually considered to be only 1, which is, somewhat confusingly, still considered the Levenshtein distance. That is to say, without clarification there are two common parameter sets for making the calculation that yield different results. What’s more peculiar is that no one seems to mention the disparity, defining the metric either one way or the other and not acknowledging alternatives.
Without being able to access Levenshtein’s 1965 paper (in Russian and apparently not available in translation) I can’t take this oddity any further, though I’d like to get to the bottom of it once and for all. In any case, doing the given example using Levenshtein’s original costs we get the result
Levenshtein("kitten", "sitting") = 5, ## not 3: two substitutions at cost 2 each, plus one insertion
so we won’t be using that today. The allowance of the 3 valid operations (insertion, deletion and substitution), each at cost 1, is what defines this particular metric.
Other ways to describe an “edit distance” have been devised with different operation sets, such as additionally allowing transposition (the Damerau–Levenshtein distance) or only allowing substitution (the Hamming distance). Because it is the most commonly used method for calculating an edit distance, the Levenshtein distance is often used interchangeably with the more general term, but be aware the cost parameters can vary under the same name, so these cannot be assumed in cross-comparisons between applications. Just a little word to the wise.
Now where were we? Ahh, right, complexity and things blowing up in our faces.
A naive solution that manipulates the strings at every position in every way possible quickly explodes exponentially as the three options multiply for each character in each string, yielding O(3^(n+m)) time complexity, which is ridiculous. There is however another way, using a concept known as dynamic programming. To find the Levenshtein distance between the two strings, we’re going to implement the Wagner–Fischer algorithm.
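Before we do, here’s a quick sketch of the naive recursion we’re avoiding, just to make the explosion concrete (my own illustration, not part of the solution):

use List::Util qw( min );

sub naive_dist {
    my ($s, $t) = @_;
    return length($t) if $s eq '';   ## nothing left but to insert the rest of $t
    return length($s) if $t eq '';   ## or to delete the rest of $s
    my $cost = substr($s,0,1) eq substr($t,0,1) ? 0 : 1;
    return min( naive_dist(substr($s,1), $t)           + 1,        ## deletion
                naive_dist($s,           substr($t,1)) + 1,        ## insertion
                naive_dist(substr($s,1), substr($t,1)) + $cost );  ## substitution
}

Three branches per character considered, with nothing remembered between them: that’s the exponential blowup.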
As an aside before we begin, if the term “dynamic programming” seems evocative without being particularly forthcoming about its meaning, this is no accident, because it doesn’t, in fact, mean anything in particular. It can be viewed as an early example of what is now known as corporate nothingspeak — coined in 1952 or so by its architect, a man named Richard Bellman, it was devised to be unassailable rather than to define. In other words to sound good and reveal nothing, allowing him to pursue his research unfettered by bureaucracy.
“…it’s impossible to use the word dynamic in a pejorative sense. Try thinking of some combination that will possibly give it a pejorative meaning. It’s impossible. Thus, I thought dynamic programming was a good name. It was something not even a Congressman could object to” — Richard Bellman, Eye of the Hurricane: An Autobiography (1984, page 159)
Dynamic programming in practice is an idea I find analogous to recursion, where in solving a large complex problem we don’t know exactly how to get there, but we do know both how to take a step forward and how to know when we’ve arrived. In fact recursion is an essential component of the idea, combined with maintaining a map of partial solutions as they are computed to be fed back into the recursion. By dividing up a problem into manageable sub-problems, and carefully remembering the partial solutions we come up with for reference in taking the next step, we can home in on a solution from several directions until we have spanned the entirety of the problem space in manageable, incremental bites.
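To make that concrete, here is the same naive recursion from the sketch above, now remembering its partial solutions in a hash (again my own illustration, not the solution below):

use List::Util qw( min );

my %partial;    ## the map of partial solutions, keyed on the two remaining tails
sub memo_dist {
    my ($s, $t) = @_;
    return length($t) if $s eq '';
    return length($s) if $t eq '';
    return $partial{"$s\0$t"} //= min(
        memo_dist(substr($s,1), $t)           + 1,
        memo_dist($s,           substr($t,1)) + 1,
        memo_dist(substr($s,1), substr($t,1))
            + ( substr($s,0,1) eq substr($t,0,1) ? 0 : 1 ) );
}

Each distinct pair of string tails is now computed once and merely looked up thereafter; the matrix we build below is the same bookkeeping made explicit and iterative.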
PERL 5 SOLUTION
In the Wagner–Fischer algorithm, instead of directly trying to morph one string into the other, the solution is built up from small transformations of one string, adding additional characters one at a time and recomputing. A running tally of partial results is kept in a working array, and as new characters are considered an ideal solution is constructed to incrementally perform the transformation from one string into the other.
To accomplish this a transformation matrix is set up, which itself maintains the count of operations taken to get to a given point. Each array row of minimum transformations is itself transformed as one string is mutated, in a series of steps, by each character in the other string. By the time the last character in the second string has been considered against the transformation of every character position in the first, in our running array of partial results, the final minimum value will be found in the lower right corner of the matrix we’ve built. The last position in the last line, at the end of the line, waiting for us to get there.
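For a concrete picture, here is the finished matrix for "kitten" → "sitting", worked out by hand from the rules just described; our answer, 3, waits in that lower right corner:

        s  i  t  t  i  n  g
     0  1  2  3  4  5  6  7
  k  1  1  2  3  4  5  6  7
  i  2  2  1  2  3  4  5  6
  t  3  3  2  1  2  3  4  5
  t  4  4  3  2  1  2  3  4
  e  5  5  4  3  2  2  3  4
  n  6  6  5  4  3  3  2  3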
use warnings;
use strict;
use feature ":5.26";
use List::Util qw( min );
sub levenshtein {
    my ($s, $t) = @_;
    my ($m, $n) = map { length($_) } ($s, $t);
    my @mat;

    ## base cases: the distance to or from an empty string is just its length
    $mat[$_][0] = $_ for ( 0..$m );
    $mat[0] = [ 0..$n ];

    for my $j ( 1..$n ) {
        for my $i ( 1..$m ) {
            my $cost = ( substr($s,$i-1,1) eq substr($t,$j-1,1) ? 0
                                                                : 1 );   ## this could be 2
            $mat[$i][$j] = min( $mat[$i-1][$j]   + 1,          ## deletion
                                $mat[$i][$j-1]   + 1,          ## insertion
                                $mat[$i-1][$j-1] + $cost );    ## substitution
        }
    }
    return $mat[$m][$n];
}
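Trying it out on the two task examples (my own quick check, not part of the original script):

say levenshtein( 'kitten', 'sitting' );   ## 3
say levenshtein( 'sunday', 'monday' );    ## 2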
Raku Solution
unit sub MAIN (Str $s = 'Sunday', Str $t = 'Saturday') ;
my ($m, $n) = ($s, $t).map: {.chars};
my @mat;
@mat[$_; 0] = $_ for 0..$m;
@mat[0] = 0..$n;

for (1..$n) -> $j {
    for (1..$m) -> $i {
        @mat[$i; $j] = ( @mat[$i-1; $j] + 1,
                         @mat[$i; $j-1] + 1,
                         @mat[$i-1; $j-1] + ($s.substr($i-1,1) ne $t.substr($j-1,1)).Int ).min;
    }
}

say @mat[$m; $n];
The Perl Weekly Challenge, that idyllic glade wherein we stumble upon the holes for these sweet descents, is now known as
The Weekly Challenge – Perl and Raku
It is the creation of the lovely Mohammad Sajid Anwar and a veritable swarm of contributors from all over the world, who gather, as might be expected, weekly online to solve puzzles. Everyone is encouraged to visit, learn and contribute at