sorting – Programming Excursions in Perl and Raku

Wherein we decide who is in, and who is out, and whether everything else in in order lest we go without.

THE WEEKLY CHALLENGE – PERL & RAKU #111

episode one:
“In or Out, Cat. Make Up Your Mind.”

Task 1

Search Matrix

You are given 5×5 matrix filled with integers such that each row is sorted from left to right and the first integer of each row is greater than the last integer of the previous row.

Write a script to find a given integer in the matrix using an efficient search algorithm.

Example

    Matrix: [  1,  2,  3,  5,  7 ]
            [  9, 11, 15, 19, 20 ]
            [ 23, 24, 25, 29, 31 ]
            [ 32, 33, 39, 40, 42 ]
            [ 45, 47, 48, 49, 50 ]

    Input: 35
    Output: 0 since it is missing in the matrix

    Input: 39
    Output: 1 as it exists in the matrix

Method

We have, in our multidimensional matrix, two nested, sorted datasets. We can of course randomly access any particular cell and check its value, but traversing the whole range, from the first element of the first row to the last of the last, is a tad more complicated than simply iterating over a single long list. In any case, starting from the lower bound and working upward is just about the least optimal way of searching without purposely trying to do a bad job of it. And we’ve been expressly tasked to be efficient.

So how do we go about this? To be efficient we need to leverage all of the information available to us in the system and use that to hone our search. In the most basic manner, we need to first locate the correct row that the value may be located in, then search the selected row to look for a matching element. “Efficiency” becomes a somewhat difficult to interpret term when sorting a through 25 values, however, as literally any method is likely to be lightning fast, and relatively efficient already. In a small and tight system like this the cost of complicating the search will be directly competing with the gains made, much more so than with a large data set. With a 1000×1000 matrix a 10% improvement will save 100,000 searches; with our version it will only save 2-3, yet will in general require the same amount of code to implement.

It’s my working theory, going in to this, that the overhead added for the various conditionals in implementing a binary search would probably add more complexity, and hence operations, than cycles saved by the process over the limited time it will have before terminating.

As it’s unclear exactly where the line of simplicity versus optimized searching is going to fall, we’re going to set up a variety of methods and see what works best. As the task was quite clear that we will be working with a set 5×5 matrix we will tune specifically for that, rather than a more general-purpose solution. It’s not hard to imagine a subroutine that needs to quickly judge some data; one that’s essential and run often enough so that through scaling microsecond differences in execution will matter and add up.

With those parameters set I came up with four plans:

solution types

Flatten the matrix and look inside. The flattening is not complicated, and the traversal can be done by a fast Perl function or XS module. The fundamental idea here is to have the least amount of code possible. The average number of searches should be 13.5, with a maximum of 27. I’m counting any value below the lowest or above the highest bound as dispatched in one search.

The first version used List::Util::any() to look through the array because it sounded fast and the docs promnised it would short-circuit as soon as it found a match. The docs went on to say that because of this behavior it was a good candidate to replace grep in scanning through arrays. As it worked out the built-in Perl grep turned out to be significantly faster. Perl is a well-tuned machine.
Move listwise through the rows until our value is below the last element, starting from row 0. This method definitively identifies the value as belonging within a specific row, which can then be searched. This takes 2-6 tries to get the row and 3 to get the element. Adding up the possibilities yields an average of 6.6 searches, with 11 as the worst-case.

As with the previous example, I started with any, then sped things up significantly using grep. As there are only 5 values, I then tried unwinding the grep into a cascading conditional, and got another 10% speed boost.
I tried a very simple divide-and-conquer algorithm, starting at the middle row and moving either up or down from there. I hit on using the spaceship operator from sort, figuring it would be a strong target for optimization in the backend. By summing the operations of comparing the target value to the first and last elements of the middle array, one of the 5 values {-2,-1,0,1,2} will be produced. The 2 and -2 results mean the selected row is either completely above or below the target value, and a 0 means the value is above the lower bound and below the upper — the value is within the row. Conveniently, the only way to produce a 1 or -1 value is to have a match with either the upper or lower bound element, allowing us to return true immediately as we don’t need to know which. The row-search conditionals are arranged to fall through, and should they arrive at 0 there are only 3 possible remaining positions that may match, and these are enumerated with their own conditionals. With this method we are down to 3.4 searches on average with a maximum of 6. However the searches are getting more complex, with multiple operations in every validation.

One problem encountered was that should a value lie above one row but below the next an oscillating state is set up, alternating between two rows as the position of the value lies between them. We need an additional conditional to check for this.

Despite stripping the control structures and computation to a minimum I could not get this technique to overtake the simple row-wise search.
In a last effort to unseat the throne I came up with the simplest way to add a binary division step to the row search that I could, with the smallest amount of extra code. This method starts at the third row, then moves the seleted row either upwards or downwards from there. The search within the row remains identical to trial number 2. With this we average 5.8 tries with a maximum of 9.

Although this change will locate the correct row in an average of 2.2 row-checks, an improvement on the 3 tries of the original, it still ends up running about 10% slower, as detailed below.

PERL 5 SOLUTION

I spent no small amount of time changing out and honing the four techniques until I felt I had each running about as fast as I could have them, setting up a battery of tests to make sure I didn’t break anything. I ran a benchmark against them as I went to see what worked, and ended up with the changes detailed above. First the code:

use warnings;
use strict;
use feature ":5.26";
use feature qw(signatures);
no warnings 'experimental::signatures';

our $mat = [  [  1,  2,  3,  5,  7 ],
              [  9, 11, 15, 19, 20 ],
              [ 23, 24, 25, 29, 31 ],
              [ 32, 33, 39, 40, 42 ],
              [ 45, 47, 48, 49, 50 ]  ];



sub search_flattened ( $val) {                              ## [1]
    return (grep {$_ == $val} map {$_->@*} $mat->@*) || '0';
}


sub search_rows ( $val, $row = 0) {                         ## [2]            
    return 0 if $val > $mat->[-1]->[-1];
    
    $row++ while ($val > $mat->[$row]->[-1]);
    
    return 1 if (    $val == $mat->[$row]->[0] 
                  || $val == $mat->[$row]->[1] 
                  || $val == $mat->[$row]->[2] 
                  || $val == $mat->[$row]->[3] 
                  || $val == $mat->[$row]->[4] );
    return 0;
}


sub search_divide ( $val, $row = 2) {                       ## [3]
    while (1) {
        $_ = ($val <=> $mat->[$row]->[0]) + ($val <=> $mat->[$row]->[-1]);
        if ($_ == -2) {
            return 0 if ($row == 0) or ($val > $mat->[$row-1]->[-1]);
            $row--;
        }
        elsif ($_ == 2) {
            return 0 if ($row == $mat->$#*) or ($val < $mat->[$row+1]->[0]);
            $row++;
        }
        elsif ($_) {
            return 1;
        }
        else {
            return 1 if (   $val == $mat->[$row]->[1] 
                         || $val == $mat->[$row]->[2] 
                         || $val == $mat->[$row]->[3] );
            return 0;
        }
    }
}

sub search_rows_mid ( $val, $row = 2) {                     ## [4]
    return 0 if $val > $mat->[-1]->[-1] or $val < $mat->[0]->[0] ;
    
    $row-- while ($val < $mat->[$row]->[0]);
    $row++ while ($val > $mat->[$row]->[-1]);
    
    return 1 if (    $val == $mat->[$row]->[0] 
                  || $val == $mat->[$row]->[1] 
                  || $val == $mat->[$row]->[2] 
                  || $val == $mat->[$row]->[3] 
                  || $val == $mat->[$row]->[4] );
    return 0;
}

The benchmarking process:

use Benchmark qw( cmpthese );

cmpthese(-3, {
'row'     => sub { search_rows($_) foreach 0..51; },
'row-mid' => sub { search_rows_mid($_) foreach 0..51; },
'flat'    => sub { search_flattened($_) foreach 0..51; },
'div'     => sub { search_divide($_) foreach 0..51; },

});

done_testing();

And here are the results:

           Rate    flat     div row-mid     row
flat    13657/s      --    -59%    -68%    -71%
div     33533/s    146%      --    -21%    -29%
row-mid 42575/s    212%     27%      --    -10%
row     47312/s    246%     41%     11%      --

Flattening the array, although just a single line, proved very inefficient. The clear winner was the simple row-wise search and then iterating across. This technique examines a few more candidates than the other methods, but also deals with them and moves along very quickly, and in the end that approach won the day. It seems the best way to efficiency in this task is to not execute code.

raku solution

In Raku I abstained from further analysis and essentially ported the winning technique over. Most of the code disappeared but it remains the same algorithm. Here, because it seemed consistent I had it return a boolean True or False. It’s remarkable how little code is left.

unit sub MAIN (Int $val = 23) ;

our @mat =    [  1,  2,  3,  5,  7 ],
              [  9, 11, 15, 19, 20 ],
              [ 23, 24, 25, 29, 31 ],
              [ 32, 33, 39, 40, 42 ],
              [ 45, 47, 48, 49, 50 ]  ;
              

say @mat.first({ $val == any $_.list  })
        .Bool;

Python Solution

In Python, on the other hand, we can clearly see how the components compare.

import sys

def inArray(mat, val):
    if (val > mat[-1][-1]) | (val < mat[0][0]):
        return False 
    row = 0
    while val > mat[row][-1]:
        row += 1  
    return val in mat[row]
   
   

mat = [[  1,  2,  3,  5,  7 ],
       [  9, 11, 15, 19, 20 ],
       [ 23, 24, 25, 29, 31 ],
       [ 32, 33, 39, 40, 42 ],
       [ 45, 47, 48, 49, 50 ]] ;

default = 22

if len(sys.argv) == 1:
    val = default
else:
    val = sys.argv[1]
    
res = inArray(mat, val)

print(res)

episode two:
“Pithy Title”

task 2

Ordered Letters

Submitted by: E. Choroba

Given a word, you can sort its letters alphabetically (case insensitive). For example, “beekeeper” becomes “beeeeekpr” and “dictionary” becomes “acdiinorty”.

Write a script to find the longest English words that don’t change when their letters are sorted.

Method

I have collected a number of dictionaries on my system, with between about 180,000 to 270,000 words in them. One thing that struck me immediately was that no matter which I chose, sorting all of those words was not going to be a good way to go about this.

So the trick here is not to find an efficient sorting method for the words, comparing pre and post, but rather to find an efficient manner of determining whether a word is already sorted, a subtle difference. This is much less intensive as any word requiring any rearrangement at all immediately fails the test.

Working across the words letter-by-letter, starting with the first and comparing adjacent pairs using the sort comparison operators, we can avoid sorting the words entirely. If in any pair the right letter is lexicographically smaller than the left, the word fails and we move to the next word immediately. The only words that end up “sorted”, which is to say that all of their letters are examined and processed, and the words that were already sorted to begin with. Which, it turns out, includes only 729 word and word-like things out of 235,886 entries in usr/share/dict.

The longest alphaliteral word in English (“alphaliteral” being a great word in itself I might add) I could find anywhere on the web was “aegilops” at 8 letters, being an variant spelling of ‘egilops’, which we have here. The longest in in any dictionary I had were the 7 letter examples shown.

max length 7

adelops
alloquy
beefily
begorry
billowy
egilops

PERL 5 SOLUTION

In Perl I made a general-purpose chomp to get rid of some errant carriage returns in one file. The meat of the matter is in the subroutine, where substr examines the letters without requiring us to break the string up into an array of characters.

use warnings;
use strict;
use feature ":5.26";
use feature qw(signatures);
no warnings 'experimental::signatures';
use List::Util qw( max );

my $dict = '/usr/share/dict/words';

open(my $fh, "<", $dict) or die "can't open dict $dict: $!";

my %bag;

for (<$fh>) {
    s/[\n\r]//g;                    ## general-purpose chomp
    $_ = lc;                        ## lowercase the word (if in ALLCAPS, for instance)
    $bag{$_} = length $_ if is_sorted($_);
}

my $max = max (values %bag);
say "max length $max\n";

my @longest = sort grep { $bag{$_} == $max } keys %bag;

say $_ for @longest;



sub is_sorted ($word) {
    return 0 if length($word) < 3;  ## reject short words
    for (1 .. length($word)-1) {
        return 0 if (substr($word, $_-1, 1) cmp substr($word, $_, 1)) == 1 ;
    }
    return 1;
}

Raku Solution

In Raku we can form a pipeline to load the hash array directly from the dictionary filename, which I find remarkable. Further cleverness in the subroutine allows us to break up the word and examine it as a sequence of tuples. The rotor method is just such a fun function to use. The version we use here reads 2 elements forward and then steps back 1 as it ratchets across the word comparing pairs of letters.

unit sub MAIN () ;

our $dict = '/usr/share/dict/words';

my %sorted_words = 
    $dict.IO.lines
            .map({.lc})
            .grep({is_sorted($_)}) 
            .map({ $_ => $_.chars });
            
my $longest = %sorted_words.values.max;

say "longest word length $longest\n";
.say for %sorted_words.keys.grep({%sorted_words{$_} == $longest});


sub is_sorted ($word) {
    return False if $word.chars < 3;
    for ($word.comb).rotor(2 => -1) -> ($a, $b) {
        return False if $a cmp $b == 1
    }
    return True
}

Python Solution

One of my favorite things about Python is list comprehensions, and so I used a couple to process the sorted-words dictionary.

dictFile = '/usr/share/dict/words'

def isSorted(word):
    if len(word) < 3:
        return False
    for i in range(1, len(word)):
        if word[i-1] > word[i]:
            return False
    return True

bag = {}

f = open(dictFile, "r")
for line in f:
    line = line.rstrip('\r\n').lower()
    if isSorted(line):
        bag[line] = len(line)
f.close

maxLen    = max( v for v in bag.values() )
longWords = [ w for w in bag if len(w) == maxLen ]

print("longest word length", maxLen, "letters", "\n")
for word in longWords:
    print(word)

The Perl Weekly Challenge, that idyllic glade wherein we stumble upon the holes for these sweet descents, is now known as

The Weekly Challenge – Perl and Raku

It is the creation of the lovely Mohammad Sajid Anwar and a veritable swarm of contributors from all over the world, who gather, as might be expected, weekly online to solve puzzles. Everyone is encouraged to visit, learn and contribute at

https://perlweeklychallenge.org

Tag: sorting

Are We in the Matrix? Get in Line, Friend, Get in Line.

THE WEEKLY CHALLENGE – PERL & RAKU #111

Task 1

Search Matrix

Example

Method

solution types

PERL 5 SOLUTION

raku solution

Python Solution

task 2

Ordered Letters

Method

PERL 5 SOLUTION

Raku Solution

Python Solution