DSLs – Programming Excursions in Perl and Raku

Wherein we party like it’s 1999 all over again. Postmodernism and poststructuralism rule the day, allowing us to mix and match our content however we see it — something started here can finished up over there, and, as long as someone says it’s alright, insert whatever we want, whenever we want.

THE WEEKLY CHALLENGE – PERL & RAKU #99

episode one:
“Oh My God! It’s Full Of Stars!”

Task 1

Pattern Match

Submitted by: Mohammad S Anwar

You are given a string $S and a pattern $P.

Write a script to check if given pattern validate the entire string. Print 1 if pass otherwise 0.

The patterns can also have the following characters:

? – Match any single character.

* – Match any sequence of characters.

Example 1:

Input: $S = "abcde" $P = "a*e"
Output: 1

Example 2:

Input: $S = "abcde" $P = "a*d"
Output: 0

Example 3:

Input: $S = "abcde" $P = "?b*d"
Output: 0

Example 4:

Input: $S = "abcde" $P = "a*c?e"
Output: 1

Method

We have here a tiny wildcard language, with only two placeholders: one to represent any single character, the other to match any sequence of characters. Given two strings, one without wildcards, the other with, we are asked to determine whether the second can be said to represent the first; which is to say they match.

Seeing as we already have a better-than-good, state of the art pattern match engine available to us in the Perl language, it seems like the path of least resistance is to transform our toy language into a full blown regular expression and see if it matches.

The wildcard characters in our given language differ slightly from their more familiar counterparts in regular expressions, in that they match a positive number of characters only, disallowing the absence of a match. That is to say, as defined an asterisk mark matches any sequence of characters rather than any number; a sequence is an entity that either exists or does not, whereas a number represents any quantity, which may include 0. The presumption is that for a sequence to exist in a meaningful way, it needs to be a sequence of things that exist, but that a numerical quantity still exists even if that quantity is none.

Likewise the question mark matches any character, and the absence of a character is no longer a character. For the purposes of completeness, we’ll define a character as anything with an ASCII code.

Fortunately we are not left in a bind about what to do should we wish to include the actual characters ? and * in the string to be matched, because these characters, even if considered as wildcards, will continue to match themselves as they fit the definition of ‘any’ character. Thus we need not worry about having to somehow ‘escape’ these characters with a signifying character such as a backslash, which leads to a recursive conundrum of what to do should we wish to include a non-escaping backslash. All these questions can of course be resolved, but that problem is considerably more complicated. I consider it a bullet dodged.

And so in the absence of further guidance, that’s what we’re going to go with today.

PERL 5 SOLUTION

To translate our wildcards into proper regular expressions, we only need to substitute one set of character classes and modifiers for the other. The ? is equivalent to a single dot, ., which of course matches any one character. The wildcard * in turn is the same as the modified expression .+, which represents any sequence of one or more characters of any type.

Further, as we are asked to validate the entire string, we will need to anchor the ends of our expression to the front and back of the input. For this our regular expression will need to enclose our translated terms within brackets, so to speak, with a carat ^ preceding and a dollar sign $ following. We can then insert our newly modified term within a match operator and interpolate it before evaluation and seeing how it goes.

To do the wildcard transmutation we’ll use substitution operator, which seems quite fitting.

In other news, there’s been quite a lot of talk in the air lately about removing the signatures feature from the experimental list, as apparently it’s been stable for quite some time, at least in its basic functionality. Being quite used to accessing @_ all these years I haven’t felt the pressing need for this very normal feature but if it’s destined to remain I suppose the time has come to see how I like it in Perl. I have to say I’ve found the recent introduction of both post-dereferencing syntax and say very agreeable, although I do still find myself still writing

say "@$_" for @list

on occasion to print out lists of lists, because it’s so, well, clean is a good word. But that in turn means I’m using both forms of dereferencing in a single script, which I’m not quite good with. Maybe I am, maybe not. I don’t quite know how I feel on the subject. My stylebook is tailored for clarity and readability over all else, and I don’t know where to make this call.

So on that rambling note, I’ve decided to start incorporating signatures. It should work out ok. We’ll see how it goes.

use warnings;
use strict;
use feature ":5.26";
use feature qw(say signatures);
no warnings 'experimental::signatures';


my ($str, $exp) = @_;
if (defined $str and defined $exp) {
    say validate($str, $exp);
}

sub validate ($str, $exp) {
    $exp =~ s/\?/./g;
    $exp =~ s/\*/.+/g;

    return $str =~ m/^$exp$/ ?  1
                             :  0;
}
                       
use Test::More;

is validate("abcde", "a*e"),    1, 'ex-1';
is validate("abcde", "a*d"),    0, 'ex-2';
is validate("abcde", "?b*d"),   0, 'ex-3';
is validate("abcde", "a*c?e"),  1, 'ex-4';

is validate("abcde", "bc?e"),   0, 'no head';
is validate("abcde", "ab*c?e"), 0, '* cannot be empty';
is validate("abcde", "a*c???"), 0, '? cannot be empty';
is validate("abcde", "a*c?"),   0, 'no tail';

is validate("a??cde", "a*c*"),  1, '? in input';
is validate("a***c*", "a*c?"),  1, '* in input';

is validate("?*??*??***?*", "????????????"),   1, 'line noise';
is validate("************", "*"),              1, 'password';


done_testing();

raku solution

For Raku we need to change the format slightly to include the angle brackets but otherwise the process remains unchanged.

sub validate( Str $str, Str $exp is copy ) {
    $exp ~~ s:g/\?/./;
    $exp ~~ s:g/\*/.+/;
        
    return $str ~~ m/^<$exp>$/ ?? 1 
                               !! 0
}

multi sub MAIN ( Str $str, Str $exp ) {
    say validate( $str, $exp );
}

episode two:
“Pick One From Column A and Two From Column B”

task 2

Unique Subsequence

Submitted by: Mohammad S Anwar

You are given two strings $S and $T.

Write a script to find out count of different unique subsequences matching $T without changing the position of characters.

UPDATE: 2021-02-08 09:00AM (UK TIME) suggested by Jonas Berlin, missing entry [5].

Example 1:

Input: $S = "littleit', $T = 'lit'
Output: 5

    1: [lit] tleit
    2: [li] t [t] leit
    3: [li] ttlei [t]
    4: litt [l] e [it]
    5: [l] ittle [it]

Example 2:

Input: $S = "london', $T = 'lon'
Output: 3

    1: [lon] don
    2: [lo] ndo [n]
    3: [l] ond [on]

Method

At first look I thought this was going to be very complex. After all, each substring can be divided into 2^length-1 subset groupings of the letters, and those groupings can be arranged themselves over the searched string in some combinatorially large manner. Shooting from the hip, it looks pretty exponential to me. As the string and the subsequence length get longer it looks like it will get out of hand rather quickly.

But in reality, not so much. Sure, there are pathological cases where things do blow up, but for the most part much of the string length can be filtered away.

Let’s break down what we really need to do to find a potentially picked-apart, embedded substring. Starting from the left side, we need to find the first letter of the target subset. If it’s not found, then our job is easy — there are no instances of the subset to be found. So let’s assume we find it. Then what? Well from that point forward we start looking for the next character in the target sequence. Proceeding this way, repeating the process, we look thorough each letter until we find the last. Should we be able to do that, then we’ve successfully found a pattern. But how do we find them all ?

Well one thing is that given multiple instances of the first character, should we successfully find a complete sequence starting at each of those instances then each would be a separate unique pattern. So, when we find the first instance, we could perform another action, in addition to looking for the second character. We could continue to look for another occurrence of the first character, and attempt to build a subsequence from there instead.

This series of actions defines a recursive procedure:

at each stage we are looking forward from a given position for the next letter in the sequence. On finding it, we either:
1. start looking for the next character in the sequence and recurse from that point forward
2. continue looking for the same letter and recurse from that point forward
3. if we’re looking for the last letter and we find it, we’ve found a complete pattern
4. if we don’t find it, or otherwise run out of string to look in, that particular search has failed

This radically prunes our tree of possibilities. We are only ever looking for a single letter at a time in any one search, and the length of the string searched is constantly shortened as each character is found. An additional search is only spawned when a character is found.

In an absolute worst case scenario the number of paths followed is somewhat less than 2ⁿ, where n is the length of the searched string. That case is searching a string composed of only one letter, for a subsequence composed only of the same letter and one-half the length;

$string = 'aaaaaaaaaaaaaaaaaaaa' $target = 'aaaaaaaaaa' 
result   = 184756  instances
searches = 801422  both successful and unsuccessful 
2^20     = 1048576

That’s absolutely the worst case scenario as well. Alternately, looking for the word ‘the’ in Act 1 Scene 1 from Hamlet — 1400 words, 7800 characters — took a little while but was only 20,000,000 occurrences in 40,000,000 searches or so. ‘zoo’, because there are relatively few zs, is only 62000 occurrences and nearly instantaneous.

PERL 5 SOLUTION

To quickly process the strings, we can avoid breaking them into arrays and just deal with them whole, using substr(). We can speed through the search text using index, and then when an instance is found recurse with the substring following the position. In the case of searching for the next target character only the tail of the target is passed along, minus the found character; in the case of looking for a second occurrence the target is passed unchanged. The base case is running out of either string, and a cumulative count is set up to ratchet when a new instance of the last target character is found. Once we cover our options there really isn’t very much to it.

use warnings;
use strict;
use feature ":5.26";
use feature qw(say signatures);
no warnings 'experimental::signatures';


sub descend ($str, $target) {  
        my $count = 0;
        return 0 unless length $str > 0 and length $target > 0;
    
        my $t   = substr( $target, 0, 1 );
        my $idx = index $str, $t;
        
        if ($idx > -1) {
            $count++ if length $target == 1;
            $count += descend( substr($str, $idx+1), $target);
            $count += descend( substr($str, $idx+1), substr( $target, 1 ));
        }
        
        return $count;
}



use Test::More;

is descend( 'london', 'lon'), 3, 'ex-2';
is descend( 'littleit', 'lit'), 5, 'ex-1';
is descend( 'abcabc', 'abc'), 4, 'abcs';
is descend( 'aabbaa', 'aba'), 8, 'repeated letters';
is descend( 'aaaa', 'aa'), 6, 'only one letter';

done_testing();

Raku Solution

unit sub MAIN () ;

use Test;

plan 6;

is descend( 'london', 'lon'), 3, 'ex-2';
is descend( 'littleit', 'lit'), 5, 'ex-1';
is descend( 'abcabc', 'abc'), 4, 'abcs';
is descend( 'aabbaa', 'aba'), 8, 'repeated letters';
is descend( 'aaaa', 'aa'), 6, 'only one letter';
is descend( 'bookkeeping', 'boke'), 8, 'bookkeeping';

sub descend (Str $str, Str $target) {
    my $count = 0;
    return 0 unless all($str.chars, $target.chars) > 0;
    
    my $t   = $target.substr(0,1);
    my $idx = $str.index($t);
    
    if $idx.defined {
        $count++ if $target.chars == 1;
        $count += descend( $str.substr($idx+1), $target);
        $count += descend( $str.substr($idx+1), $target.substr(1) );    
    }

    return $count;
}

The Perl Weekly Challenge, that idyllic glade wherein we stumble upon the holes for these sweet descents, is now known as

The Weekly Challenge – Perl and Raku

It is the creation of the lovely Mohammad Sajid Anwar and a veritable swarm of contributors from all over the world, who gather, as might be expected, weekly online to solve puzzles. Everyone is encouraged to visit, learn and contribute at

https://perlweeklychallenge.org

Tag: DSLs

Diffraction Gratings Producing a Wildcard Wonderland

THE WEEKLY CHALLENGE – PERL & RAKU #99

Task 1

Pattern Match

Example 1:

Example 2:

Example 3:

Example 4:

Method

PERL 5 SOLUTION

raku solution

task 2

Unique Subsequence

UPDATE: 2021-02-08 09:00AM (UK TIME) suggested by Jonas Berlin, missing entry [5].

Example 1:

Example 2:

Method

PERL 5 SOLUTION

Raku Solution