# Finding Common Basis with a Bag of Sharks

Wherein we discover the biggest things are built from small beginnings, and then analyze the components that ultimately lead to a street fight.

episode one
“Common Bonds”

### TASK #1 › Common Base String

You are given two strings, `$A` and `$B`.
Write a script to find the common base strings of `$A` and `$B`.
A substring of a string `$S` is called a base string if repeated concatenation of the substring results in the string.

##### Example 1:
```
Input:  $A = "abcdabcd"
        $B = "abcdabcdabcdabcd"
Output: ("abcd", "abcdabcd")
```
##### Example 2:
```
Input:  $A = "aaa"
        $B = "aa"
Output: ("a")
```

#### Method

This really isn’t as complicated as it may sound. If you take a string, or even a character, and repeat it unchanged any number of times to make a new string, then that short, original part is called a “base string”.

Ok, not *any* number. Not negative numbers, no, that doesn’t make any sense. And the idea of repeating a string 0 times has interesting philosophical implications, but that’s a little outside of our scope here.1 Let’s say instead any natural number of times excluding zero, which may or may not be considered a natural number depending on who you ask. Fine. Is that well-defined enough yet?

But back to base strings. As one of these odd, unusually repetitive target strings is a whole-number multiple of a base, it follows conversely that the base must by definition have a length that is a unit fraction of the original’s: 1/2, 1/3, 1/4 etc.

That is to say a string of length 8 can only be composed of bases of length 4 or 2 or 1. And a string of length 9 can only be composed of strings of length 3 or 1. We can save some trouble by looking at only those substrings that fit this basic constraint. Oh, and let’s not forget 1/1; any string could reasonably be considered a base string to itself, repeated once. Why not?

• Do we need this?
• No, it’s not defined and could go either way.

But I see no reason not to allow it. So a string of length 9 can be composed of bases of length 1, 3 or now 9.

Any base string will begin the source string at index 0 and extend for n characters, with n drawn from the set of fractional lengths outlined above: 1/2 the source length, or 1/3, 1/4 etc. So what we need to do is start at the beginning of the string, take some divisor-length prefix, and see whether that substring can be repeated to compose the original string. If so, it’s a base string.
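
To make that scan concrete, here’s a minimal sketch (in Python, purely illustrative; the actual Perl and Raku solutions follow later):

```python
def find_base_strings(s):
    """Return every prefix of s that, repeated whole, reconstructs s."""
    n = len(s)
    bases = []
    for k in range(1, n + 1):
        if n % k == 0:                  # only divisor-length prefixes qualify
            prefix = s[:k]
            if prefix * (n // k) == s:  # repeat the prefix and compare
                bases.append(prefix)
    return bases
```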

Once we have a list of base strings for one of the inputs, we can hash the results and obtain another list from the second input. By iterating through the second list, we can check each for existence in the lookup hash, and if found add it to an output list of common base strings.
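
The lookup step can be sketched the same way (Python again, purely illustrative; `base_strings` here is just a compact restatement of the divisor-prefix scan):

```python
def base_strings(s):
    # divisor-length prefixes that tile the whole string
    return [s[:k] for k in range(1, len(s) + 1)
            if len(s) % k == 0 and s[:k] * (len(s) // k) == s]

def common_base_strings(a, b):
    # hash one side's bases, then keep the other side's matches
    lookup = set(base_strings(a))
    return [s for s in base_strings(b) if s in lookup]
```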

1 Ok, I’ll give it a shot: It stands to reason that any string, even an infinite string, can be considered a base string to the null string, as that is the result when anything is repeated 0 times. “But what of the null string itself?” you might well ask. I’m not sure what repeating the null string no times would mean exactly, but perhaps we could consult the great Billy Preston: “Nothin’ from nothin’ leaves nothin’,” suggesting that result, too, is the null string. Whether or not we accept the opinion of a musician on the matter, we are saved by the identity, as repeating the null string 1 time would produce the null string again unchanged. As the null string is the absence of anything, this state of emptiness can be achieved in multiple ways, by either removing everything present or adding nothing in the first place. Both achieve the same result.

A little confusion erupts if we consider the operation of repetition analogous not to multiplication but to exponentiation. I can see the argument for this assumption, and it matters because 0⁰ can be considered either 1 or NaN, depending on context, and mapping either of these results to our hypothetical sends our ship into the rocks.

We should avoid that.

In any case, if we include the option of allowing 0 repetitions, then the only string affected is the null string, which could be considered to have an infinite number of base strings, arguably even the null string itself. If and only if, of course, we allow meaningful existence to the null string in the first place. I say we exclude null strings and zeros both and be done with it. It’s a lot of work for quite questionable gain.

#### Perl Solution

We’re going to follow the algorithm outlined above, breaking the base-string reduction off into its own subroutine, `find_base_strings()`. In it we carve out substrings of candidate lengths from the front of the input, then match a regular expression to see whether we can construct the entire original string from beginning to end. Once we have the list for one input we hash the strings as keys; then we can compare the list for the other input using the `exists` function on the hash keys.

```
use warnings;
use strict;
use feature ":5.26";

## ## ## ## ## MAIN:

@ARGV == 0 and @ARGV = qw(agtcagtcagtcagtc agtcagtcagtcagtcagtcagtcagtcagtc);

my ($A, $B) = @ARGV;
my @out;
my %subs_a = map { $_ => 1 } find_base_strings($A);
for (find_base_strings($B)) {
    push @out, $_ if exists $subs_a{$_};
}
say $_ for @out;

## ## ## ## ## SUBS:

sub find_base_strings {
    my $str = shift;
    my $len = length $str;
    my @out;

    for (1..$len) {
        next unless $len % $_ == 0;
        my $sub = substr $str, 0, $len/$_;

        ## \Q..\E guards against any regex metacharacters in the substring
        my $res = $str =~ /^(?:\Q$sub\E)+$/;
        push @out, $sub if $res;
    }
    return @out;
}
```

#### Raku Solution

In Raku, we have an improved hash-table solution available, the Bag type, and with it we are given the `:exists` adverb to check whether or not an item is in the bag. We retain the separate `find_base_strings()` function, but streamline the loop iterator to only those substring lengths that are already known to be able to evenly compose the original string, using a chain of `grep` and `map`.

```
unit sub MAIN ($A = "aaaaaaaaaaaaaaaaaaaaaaaa", $B = "aaaaaaaaaaaa") ;

my @out;
my $bag-A = find_base_strings($A).Bag;
@out.push: $_ if $bag-A{$_}:exists for find_base_strings($B);
.say for @out;

sub find_base_strings ($str) {
    my @bases;
    for (1..$str.chars).grep($str.chars %% *)
                       .map($str.chars / *) {
        my $sub = $str.substr(0, $_);
        @bases.push: $sub if so $str ~~ /^ $sub+ $/;
    }
    @bases;
}
```

episode two:
“A Bag of Sharks”

### TASK #2 › Frequency Sort

You are given a file named `input`.
Write a script to find the frequency of all the words.
It should print the result such that the first column of each line is the frequency of the word, followed by all the words of that frequency arranged in lexicographical order. Also sort the lines in ascending order of frequency.

##### INPUT file
```
West Side Story

The award-winning adaptation of the classic romantic tragedy "Romeo
and Juliet". The feuding families become two warring New York City
gangs, the white Jets led by Riff and the Latino Sharks, led by
Bernardo. Their hatred escalates to a point where neither can coexist
with any form of understanding. But when Riff's best friend (and
former Jet) Tony and Bernardo's younger sister Maria meet at a dance,
no one can do anything to stop their love. Maria and Tony begin
meeting in secret, planning to run away. Then the Sharks and Jets plan
a rumble under the highway--whoever wins gains control of the streets.
Maria sends Tony to stop it, hoping it can end the violence. It goes
terribly wrong, and before the lovers know what's happened, tragedy
strikes and doesn't stop until the climactic and heartbreaking ending.
```

#### NOTE

For the sake of this task, please ignore the following in the input file:

```
. " ( ) , 's --
```

#### OUTPUT

```
1 But City It Jet Juliet Latino New Romeo Side Story Their Then West York adaptation any anything at award-winning away become before begin best classic climactic coexist control dance do doesn't end ending escalates families feuding form former friend gains gangs goes happened hatred heartbreaking highway hoping in know love lovers meet meeting neither no one plan planning point romantic rumble run secret sends sister streets strikes terribly their two under understanding until violence warring what when where white whoever wins with wrong younger
2 Bernardo Jets Riff Sharks The by it led tragedy
3 Maria Tony a can of stop
4 to
9 and the
```

#### Method

Here’s a tiny introduction to Natural Language Processing (NLP) for you all.

What’s requested is a data model known as a naive bag of words, with the output viewed by frequency. The idea is to catalog the words in the text, counting each instance of every word, and then to list them, categorizing the output by those counts. We end up with first a list of words occurring only once, then those appearing twice, then three times, etc. The text is generally preprocessed to standardize the data, making the output more useful, and we will do a bit of that here as well. In the bag-of-words model, words that show up more frequently can lead to inferences about the content of the text.
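
As a minimal, language-neutral illustration of the counting and inversion (Python here, with a toy sentence of my own rather than the task input):

```python
from collections import Counter

text = "the sharks and the jets and the rumble"  # toy stand-in text
bag = Counter(text.split())                      # word -> count

# invert: count -> list of words occurring that many times
freq = {}
for word, count in bag.items():
    freq.setdefault(count, []).append(word)

# report ascending by frequency, words in lexicographic order
for count in sorted(freq):
    print(count, *sorted(freq[count]))
```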

We’ll start by preprocessing the data: in this simple data model we will scrub certain defined punctuation and the possessive ’s into spaces, making sure to keep single hyphens to preserve compound words. We will also go beyond the stated directives and lowercase the whole text. As we won’t be doing any name recognition in this simple model, we aren’t going to worry about losing capitalization for those entities, instead concerning ourselves with making sure “their” and “Their” get counted as the same word. This is of course a judgement call and not specified behavior, but it is normal for this kind of basic word analysis, so we’re going to go ahead and do it.

Consequently the output is slightly different from the example. For instance, “their” moves to the second category, and the output is in true lexicographic order as requested, rather than the example’s ASCII sort with capital letters first.

```
1   adaptation
    any
    anything
    ... snip  ...
    wrong
    york
    younger

2   bernardo
    by
    jets
    led
    riff
    sharks
    their
    tragedy

3   a
    can
    it
    maria
    of
    stop
    tony

4   to

9   and

11  the
```

As you can see, the names of the principals, along with the word “tragedy”, occur more often within the paragraph. So it would be reasonable to infer that this is some sort of tragedy involving Maria and Tony, which in fact it is.

Next-level improvements on this method might be to isolate the names of people and places in the text, in the form of Named Entity Recognition. This could be done with dictionaries of known names, or conversely by looking for words that aren’t in a dictionary. Even without any meta-awareness of the content at all, we could begin to identify proper nouns by selectively removing the capitalization of letters only at the beginnings of sentences, that is to say after a period or other terminal punctuation, or at the beginning of a paragraph or quote. Then unusually capitalized words could be identified as proper nouns on the basis of their position within a sentence. As many sentences begin with names, this wouldn’t be sufficient with only this short summary to work with, but over the entire script of West Side Story we should be able to successfully pull out all of the character and place names in this manner. By associating words with metadata, or “chunking”, we can draw on these identifications later to improve the conclusions from our frequency analysis.

Another, more advanced version of the bag-of-words is known as “tf-idf”, for Term Frequency – Inverse Document Frequency analysis. That closely related model examines multiple texts within a corpus, and for each text the base word frequency as we’ve computed here is then adjusted by the rate of occurrence of the word across all of the documents viewed together. In this way very common words, such as “a” or “the”, will be deemphasized as popular in the corpus as a whole, and frequent words specific to only one text, like “sharks”, “jets” and “tragedy”, will stand out more. This is the way the bag-of-words model is usually used, with raw frequency data adjusted into weights, and tf-idf makes those weights more useful.
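
A toy sketch of that weighting (Python; plain relative counts for tf and an unsmoothed log idf, which is only one of several common variants):

```python
import math

def tf_idf(docs):
    """Per-document tf-idf weights over a tiny corpus of whitespace-split texts."""
    n = len(docs)
    tokenized = [doc.split() for doc in docs]

    # document frequency: in how many docs does each word appear?
    df = {}
    for words in tokenized:
        for w in set(words):
            df[w] = df.get(w, 0) + 1

    # weight = relative term frequency * log(N / df); common words score 0
    out = []
    for words in tokenized:
        tf = {w: words.count(w) / len(words) for w in set(words)}
        out.append({w: tf[w] * math.log(n / df[w]) for w in tf})
    return out
```

A word like “the” that appears in every document gets idf log(1) = 0 and drops out entirely, while a word unique to one document keeps its full weight.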

#### Perl Solution

Preprocessing the input is easily done with a regex. For more complex treatment it would presumably make sense to break the munging down into discrete sections, to facilitate fine-tuning, but for our minimal example we’re just converting a variety of irrelevant characters to spaces, where they will be absorbed into the `split /\s+/`.

After pretreatment, the words are separated out and hashed, with each instance incrementing a frequency count for that word. To invert that hash we need to associatively map frequency values to lists of all the words that occurred with that frequency. Then by sorting the keys of the inverted hash we can produce our output lists.

```
use warnings;
use strict;
use feature ":5.26";

## ## ## ## ## MAIN:

local $/ = undef;
my $input = <DATA>;

## preproc
$input =~ s/[. " ( ) ,]|'s|--/ /xg;
$input = lc($input);

my %bag;
my %freq;

## proc
my @words = split /\s+/, $input;
$bag{$_}++ for @words;

while (my ($key, $value) = each %bag) {
    push $freq{$value}->@*, $key;
}

## output phase
for (sort {$a <=> $b} keys %freq) {
    say +(sprintf "%-4s", $_) . join "\n    ", sort $freq{$_}->@*;
    say '';
}

__DATA__
West Side Story

The award-winning adaptation of the classic romantic tragedy "Romeo
and Juliet". The feuding families become two warring New York City
gangs, the white Jets led by Riff and the Latino Sharks, led by
Bernardo. Their hatred escalates to a point where neither can coexist
with any form of understanding. But when Riff's best friend (and
former Jet) Tony and Bernardo's younger sister Maria meet at a dance,
no one can do anything to stop their love. Maria and Tony begin
meeting in secret, planning to run away. Then the Sharks and Jets plan
a rumble under the highway--whoever wins gains control of the streets.
Maria sends Tony to stop it, hoping it can end the violence. It goes
terribly wrong, and before the lovers know what's happened, tragedy
strikes and doesn't stop until the climactic and heartbreaking ending.
```
#### Raku Solution

In Raku, again we have the Bag type to construct our bag of words, which gathers our frequency values automatically. I’ve chosen to skip reading in the data, as that would distract from the method, so here we are given the summary snippet as a heredoc assignment at the beginning. Because the frequency analysis is automatic with the Bag type, we can simply chain that into the processing and immediately invert the data into an output `%freq` hash. Then the frequency hash is sorted and displayed. Of note, we do need to apply `trim` to remove trailing whitespace and the newline from the original heredoc, or else we end up with a spurious null key in our Bag.

```
unit sub MAIN () ;

my $input = q:to/__END__/;
West Side Story

The award-winning adaptation of the classic romantic tragedy "Romeo
and Juliet". The feuding families become two warring New York City
gangs, the white Jets led by Riff and the Latino Sharks, led by
Bernardo. Their hatred escalates to a point where neither can coexist
with any form of understanding. But when Riff's best friend (and
former Jet) Tony and Bernardo's younger sister Maria meet at a dance,
no one can do anything to stop their love. Maria and Tony begin
meeting in secret, planning to run away. Then the Sharks and Jets plan
a rumble under the highway--whoever wins gains control of the streets.
Maria sends Tony to stop it, hoping it can end the violence. It goes
terribly wrong, and before the lovers know what's happened, tragedy
strikes and doesn't stop until the climactic and heartbreaking ending.
__END__

## preproc
$input ~~ s:g/ <[ . " ( ) , ]> | \'s | \-\- / /;
$input .= lc;
$input .= trim;

## freq analysis
my %freq;
for $input.split(/\s+/)
          .Bag
          .kv -> $key, $val {
    %freq{$val}.push: $key;
}

## out
for %freq.keys.sort({ $^a <=> $^b }) {
    say $_.fmt("%-5s") ~ %freq{$_}.sort.join("\n     ") ~ "\n";
}
```