binary search – Programming Excursions in Perl and Raku

Wherein we let alliteration get the best of us, as we wend and wind our way back from where we’ve been to where we need to be.
(quiet musical accompaniment, cicadas and bird sounds)

THE WEEKLY CHALLENGE – PERL & RAKU #98

episode one:
“Readin’, ‘Ritin’ n’ ‘Rithmatic”

Task 1

Read N-characters

Submitted by: Mohammad S Anwar

You are given file $FILE.

Create subroutine readN($FILE, $number) returns the first n-characters and moves the pointer to the (n+1)th character.

Example:

Input: Suppose the file (input.txt) contains "1234567890"
Output:
    print readN("input.txt", 4); # returns "1234"
    print readN("input.txt", 4); # returns "5678"
    print readN("input.txt", 4); # returns "90"

Method

I know some will deride this task as too simple, but I genuinely enjoy these command-focused little challenges. Why? Well, much like in past challenges where we delved into the “flip-flop” operator <>, or the special sed-like case for split, this task can be accomplished by Perl immediately using a built-in function. But not, by any means, a commonly used one. And it’s good to remember these less-used paths are there waiting to be walked down. It likely won’t come up very much at all, but a tool is a tool, and when you need it it can be very helpful indeed to have something made for that express purpose. Also, we’re going to provide some of our own entertainment along the way.

The relevant function here is

read FILEHANDLE,SCALAR,LENGTH

which, as you might think, reads from a filehandle. It’s the way it does that that is of particular interest to us today.

The read function reads the number of bytes specified by LENGTH from FILEHANDLE and deposits them into the variable indicated in SCALAR. Those paying attention will notice that when a character is one byte, as, say, in ascii text, this is exactly what is being asked for. A filehandle, once read from, maintains a pointer to the next unread character, so it will automatically resume reading from where it left off without additional effort. That default behavior then takes care of the second criterium. In fact, to read from any other place in the filehandle, say to rewind back to the beginning of the file, we need to use a different function,

    seek FILEHANDLE,POSITION,WHENCE

to change the pointer and pick a new spot. But we won’t get into the weirdnesses of that here today. Life is going to become complicated enough.

No, today we’re going to address a different set of weirdnesses, because in this modern world a character can no longer be assumed to be just one byte long. Unicode UTF-8 characters, for instance, compose a variable length encoding that can be between one to four bytes each. So given unicode input, how can we be sure to read in, say, 20 characters?

In fact it’s not too hard, with everything required in core packages. You only need to tell Perl what you want to do. This being Perl, though, There Is More Than One Way To Do It®.

Perl input and output is handled through the PerlIO system, which provides the filehandles that we read from and write to. There is default behavior on these filehandles, and in most cases that serves just fine — in fact that’s kind of the idea behind default behavior, isn’t it? People expect Do What I Want behavior from Perl and for the most part they get it. But when dealing with I/O operations, we’re interacting with the outside world. And as anyone who lived through 2020 will know, the outside world can be a strange and terrible place.

Consequently, we may need to explicitly tell Perl things about a filehandle, unusual things, say that its data will be encoded in UTF-8. We can either tell individual filehandles what to expect, after opening them, or use a pragma for the script to change the defaults for all.

To tell an individual filehandle to use a particular encoding layer, we can call

binmode $fh, ':utf8';

after opening it, which will properly define its behavior. But once read in, what if you have to write out the data again? Unless we also notify the output filehandle of what to expect we will end up with “wide character” warnings all over the place.

Doing things this way gives the finest control, but also must be done for every filehandle. This isn’t likely to be too big a deal, but what about simply changing the default expectation? How about a pragma directive?

There are several unicode related pragmas available, so it can be a little confusing. Don’t, for instance, be drawn in by the lure of

use utf8;

because that probably isn’t what you want. That tells Perl that your script contains unicode characters, say in a string literal, that might get parsed wrong. Although potentially quite useful, that’s not what we need right now. What we do need to indicate is that the data is UTF-8, specifically a specific bytestream from an incoming filehandle.

To do this, we can use binmode, as above, but I think it’s easier to change the defaults for all I/O:

use open qw( :encoding(utf-8) :std );

This is the open pragma by the way, quite distinct from the the open function. I won’t say the one has nothing to do with the other, because the pragma directly influences the the filehandles created by the function, but they are two different things and should not be confused.

This directive tells the interpreter that all filehandles, coming and going, are to be understood to transmit UTF-8 octets. Furthermore, the :std subpragma indicates that the standard filehandles, STDIN, STDOUT and STDERR are included. For UTF-8 I/O, it’s pretty likely this behavior will cover what we want.

By specifying that we’re using UTF-8 we can correctly grab however many bytes of data are required to yield the requested number of characters, no matter how many bytes compose each. Of course this is Unicode, and things can get messier, specifically with combining character sequences. I’m reasonably confident this approach will still work out in the hairy fringes, but intend to look further. Unicode is the challenge that keeps on giving. With the entire corpus of written language as a goal there are an awful lot of special cases to deal with.

PERL 5 SOLUTION

The input file specified looks like this. There’s a pattern to it to make it easier to count the characters.

A☀3♠5和7A101☀3♠5和7A201☀3♠5和7A301☀3♠5和7A401☀3♠5和7A50
B☀3♠5和7B101☀3♠5和7B201☀3♠5和7B301☀3♠5和7B401☀3♠5和7B50
C☀3♠5和7C101☀3♠5和7C201☀3♠5和7C301☀3♠5和7C401☀3♠5和7C50
D☀3♠5和7D101☀3♠5和7D201☀3♠5和7D301☀3♠5和7D401☀3♠5和7D50
E☀3♠5和7E101☀3♠5和7E201☀3♠5和7E301☀3♠5和7E401☀3♠5和7E50

the output from the script as written, reading 3 groups of 20 characters and calling say in each one:

A☀3♠5和7A101☀3♠5和7A20
1☀3♠5和7A301☀3♠5和7A40
1☀3♠5和7A50
B☀3♠5和7B1

As you can see, 20 graphemes are delivered, even if they are composed of multi-byte unicode codepoints. Also note the newline character is counted and printed, so the third line has 10 visible characters, but the fourth only has 9. Both lines are output from the third say, the newline is just another character in the middle of the string..

use warnings;
use strict;
use feature ":5.26";

use open qw( :encoding(utf-8) :std );

my $len = 20;
my $file = './unicode-text-test.txt';

open( my $fh, '<', $file ) or die "no such file $file : $!";

say readN($fh, 20);
say readN($fh, 20);
say readN($fh, 20);

sub readN {
    no warnings qw( uninitialized );
    my ($fh, $length, $offset) = @_;
    my $buffer;
    read( $fh, $buffer, $length, $offset );
    return $buffer;
}

raku solution

Raku, being fresh off the presses, ink still drying, boldly looking into the future, always assumes everything is UTF-8 right out of the box. What it doesn’t have, however, is a drop in version of read(). Instead it has getc, for inputting a single character, not a byte. So that works for us. Placing it in a loop allows us to read as many characters as we need.

unit sub MAIN (Str $file = 'unicode-text-test.txt', Int $length = 20 ) ;

my $fh = open $file, :r;

## read first 20 chars
say readN($fh, $length);

## read next 40 chars starting at postion 21
say readN($fh, $length*2);

$fh.close;

sub readN( $fh, $length ) {
    my $out ~= $fh.getc for ^$length;
    return $out;
}

episode two:
“Circling the Drain”

task 2

Search Insert Position

Submitted by: Mohammad S Anwar

You are given a sorted array of distinct integers @N and a target $N.

Write a script to return the index of the given target if found otherwise place the target in the sorted array and return the index.

Example 1:

Input: @N = (1, 2, 3, 4) and $N = 3
Output: 2 since the target 3 is in the array at the index 2.

Example 2:

Input: @N = (1, 3, 5, 7) and $N = 6
Output: 3 since the target 6 is missing and should be placed at the index 3.

Example 3:

Input: @N = (12, 14, 16, 18) and $N = 10
Output: 0 since the target 10 is missing and should be placed at the index 0.

Example 4:

Input: @N = (11, 13, 15, 17) and $N = 19
Output: 4 since the target 19 is missing and should be placed at the index 4.

Method

We could take a naive approach — a brute force assault on the citadel, boldly running the gauntlet through the front door up to the insert point, but we can do better. We can be sneaky and go around back, improving our idea of where where owe are and where we’re going with every step taken. If we start in the middle of our list rather than the beginning, successively subdividing the remaining range, we can home in on the correct placement in a much more efficient manner.

This method is known as a binary search: we start off knowing that the correct location, whatever that may be, lies within the bounds of the array after the new element has been added. I mean, it might be tautologically obvious that after the element has been added, it will be held at some position within the array, but you have to start somewhere. We know, thus, before we start, that the lower bound for the correct placement is 0, and the upper bound is the length of the starting array plus one, for the new element. The known range is quite broad at this point, but through a series of actions we can refine it until there is only one position left, which is the correct place to insert the new element.

We start by looking to add the element at the half-way point. At every trial, first we see whether the index we’re examining already has the value we’re inserting. If it is, we’ve found the placement and we’re done.

In the more-likely chance it’s not equal, the value will either greater than or less than that at the position. Again, stands to reason. And with that calculation we’ve learned some new information: for example, if the value is already greater, than the correct insert location cannot be less than that index. We can now adjust our boundaries; the lower limit can be moved upwards to our mark. We can also reset it to be one greater than the checked postion, as we know it doesn’t lie there either. Likewise, if the value is less, we move the pointer for the upper bound to be the index one less than the one tried.

We’ve now constricted the known range for the correct insert index by one-half. Not bad, but in the immortal words of Karen Carpenter, we’ve only just begun (alway remember, she considered herself a drummer first who also sang).

But that aside — we continue with this process repeatedly, at each pass redefining the available range for the result, until one of two things happen: either we land on an existing element with the value, or determine that if the new element were placed there, the preceding element would be less, the next greater. Eventually one of these cases will hold true¹, and at that point we have located the correctly sorted location for the new element.

¹ For the sake of completeness, this isn’t exactly true. Sometimes the value on the new element will be less than the first element or more than the last, placing it outside the bounds of the existing array. We intercept those cases first, as we only need to append the new value to either the lower or upper end of the array and don’t need to search around inside at all. But once we begin our search, our method will always eventually settle in on the insert point.

PERL 5 SOLUTION

Because the directives say to insert the element into the list, we’ll take the list in as an array reference, then apply the splice to the referenced list once we’ve found the insert point. If the element is already there we’ll of course leave the array unchanged. In any case the list is altered, or not, in-place and the position of the new element is returned as requested.

use warnings;
use strict;
use feature ":5.32";

sub insert {
    my ($num, $arr) = @_;
    $num > $arr->[-1] and do { push $arr->@*, $num;    return $#$arr };
    $num < $arr->[0]  and do { unshift $arr->@*, $num; return 0 };

    my $lower = 0;
    my $upper = $#$arr;
    while ( $lower <= $upper ) {
        my $pos = int( ($lower+$upper)/2 );                 ## midpoint

        return $pos if $arr->[$pos] == $num;
        if ($arr->[$pos-1] < $num < $arr->[$pos]) {
            splice( $arr->@*, $pos, 0, $num );  
            return $pos;
        }

        $arr->[$pos] > $num ? ($upper = $pos-1)             ## restrict the range
                            : ($lower = $pos+1);
    }
    
}

use Test::More;

is insert( 3,  [1, 2, 3, 4]     ), 2, 'ex-1, exists already';
is insert( 6,  [1, 3, 5, 7]     ), 3, 'ex-2, insert into middle';
is insert( 10, [12, 14, 16, 18] ), 0, 'ex-3, less than first';
is insert( 19, [11, 13, 15, 17] ), 4, 'ex-4, more than last';

for my $n (1..13) {
    $n = 500 - 37*$n;
    is insert( $n, [1..500] ), $n-1, "long list: target -> $n";
}
 
is insert( 1,  [2, 4, 6, 8]     ), 0, 'insert into idx 0';
is insert( 3,  [2, 4, 6, 8]     ), 1, 'insert into idx 1';
is insert( 5,  [2, 4, 6, 8]     ), 2, 'insert into idx 2';
is insert( 7,  [2, 4, 6, 8]     ), 3, 'insert into idx 3';
is insert( 9,  [2, 4, 6, 8]     ), 4, 'insert into idx 4';

is insert( 2,  [2, 4, 6, 8]     ), 0, 'match idx 0';
is insert( 4,  [2, 4, 6, 8]     ), 1, 'match idx 1';
is insert( 6,  [2, 4, 6, 8]     ), 2, 'match idx 2';
is insert( 8,  [2, 4, 6, 8]     ), 3, 'match idx 3';

done_testing();

Raku Solution

unit sub MAIN () ;


use Test;
plan 17;

is insert( 3,  (1, 2, 3, 4)     ), 2, 'ex-1';
is insert( 6,  (1, 3, 5, 7)     ), 3, 'ex-2';
is insert( 10, (12, 14, 16, 18) ), 0, 'ex-3';
is insert( 19, (11, 13, 15, 17) ), 4, 'ex-4';
for 1..13 -> $n is copy {
    $n = 500 - 37*$n;
    is insert( $n, (1..500) ), $n-1, "long list: target -> $n";
}




sub insert ($num, @arr) {
    $num > @arr[*-1] and return @arr.elems;
    $num < @arr[0]   and return 0;

    my $lower = 0;
    my $upper = @arr.end;

    while ( $lower <= $upper ) {
        my $pos = (($lower+$upper)/2).floor;
        return $pos if @arr[$pos] == $num or @arr[$pos-1] < $num < @arr[$pos];

        @arr[$pos] > $num ?? ($upper = $pos-1)
                          !! ($lower = $pos+1); 
    }
    
}

The Perl Weekly Challenge, that idyllic glade wherein we stumble upon the holes for these sweet descents, is now known as

The Weekly Challenge – Perl and Raku

It is the creation of the lovely Mohammad Sajid Anwar and a veritable swarm of contributors from all over the world, who gather, as might be expected, weekly online to solve puzzles. Everyone is encouraged to visit, learn and contribute at

https://perlweeklychallenge.org

Tag: binary search

Practicin’ the Three ‘R’s, Hummin’ an’ Homin’ in on Home

THE WEEKLY CHALLENGE – PERL & RAKU #98

Task 1

Read N-characters

Submitted by: Mohammad S Anwar

Example:

Method

PERL 5 SOLUTION

raku solution

task 2

Search Insert Position

Submitted by: Mohammad S Anwar

Example 1:

Example 2:

Example 3:

Example 4:

Method

PERL 5 SOLUTION

Raku Solution