DNA Fragments Are Multiplying Exponentially in North-East Africa

Wherein we learn not to trust the titles at face value, as the headlines, while true, may not accurately reflect the meaning of the contents within

THE WEEKLY CHALLENGE – PERL & RAKU #90


episode one:
“Fire Up The Peptide Synthesizer, this Party’s About to Get Lit!”


TASK #1

DNA Sequence

Submitted by: Mohammad S Anwar

DNA is a long, chainlike molecule which has two strands twisted into a double helix. The two strands are made up of simpler molecules called nucleotides. Each nucleotide is composed of one of the four nitrogen-containing nucleobases cytosine (C), guanine (G), adenine (A) and thymine (T).

You are given DNA sequence,
GTAAACCCCTTTTCATTTAGACAGATCGACTCCTTATCCATTCTCAGAGATGTGTTGCTGGTCGCCG.

Write a script to print nucleiobase count in the given DNA sequence. Also print the complementary sequence where Thymine (T) on one strand is always facing an adenine (A) and vice versa; guanine (G) is always facing a cytosine (C) and vice versa.

To get the complementary sequence use the following mapping:

T => A
A => T
G => C
C => G

Method

Perl was not designed to process genetic data, but it has been argued that it saved the human genome project. With that heady accolade towering above us, it seems right that at some point among these challenges we should do a bit of token dabbling in the dark arts wound up in our DNA.

For all of that, the actual challenge put forth is pretty simple: report the nucleotide counts for the four bases in the given fragment, and find its complementary sequence. I think we should throw in a total count for all the bases as well, just to give a fuller report and allow us to more easily check our work.

Perl Solution

To swap out one character for another throughout a string in Perl there exists a fine tool for the job, the translation operator tr///. Borrowed from the stream editor sed, this can also be called y///, as it would be over there.

As all this does is find instances of a set of characters and map them over to another set of characters in a 1-to-1 manner, it’s very fast, much faster than firing up the whole regex engine. It also returns the number of changes it makes, which again makes it ideal to count the occurrences of an individual character.

We will base our whole solution around this operator.

At first glance it may seem inefficient to translate the whole array 5 times, but this does not take into account the remarkable speed of the operation. Why not break the string apart into an array and process every element only once, counting and transforming the values, say with a hash lookup (very fast itself) along the way?

I considered this notion and to assuage my curiosity wrote that routine up, while composing a wrapper for the original multiple translations version as well. Using Benchmark to compare the subroutines, I got this:

           Rate array trans
array   54810/s    --  -97%
trans 1787160/s 3161%    --

Which is to say the translation version is 30 times faster, give or take. And before anyone asks, extending the string length only widens the gap. So yes, it is remarkably fast.

=====================================================
Dec 11, 2020 at 3:19:00 PM
~/Code/PWC/90_1_dna.pl
-----------------------------------------------------
seq:    GTAAACCCCTTTTCATTTAGACAGATCGACTCCTTATCCATTCTCAGAGATGTGTTGCTGGTCGCCG
length  : 67
thymine : 22
adenine : 14
guanine : 13
cytosine: 18
comp:   CATTTGGGGAAAAGTAAATCTGTCTAGCTGAGGAATAGGTAAGAGTCTCTACACAACGACCAGCGGC

It’s a nice little report. The nucleotide counts add up as expected. Considering that it does one thing, and does that one thing extremely well, I don’t think the translation operator gets the love it truly deserves.

use warnings;
use strict;
use feature ":5.26";

## ## ## ## ## MAIN:

my $seq = 'GTAAACCCCTTTTCATTTAGACAGATCGACTCCTTATCCATTCTCAGAGATGTGTTGCTGGTCGCCG';

say "seq:    ", $seq;

say "length  : ", length $seq;
say "thymine : ", $seq =~ tr/T/1/;
say "adenine : ", $seq =~ tr/A/2/;
say "guanine : ", $seq =~ tr/G/3/;
say "cytosine: ", $seq =~ tr/C/4/;

$seq =~ tr/1234/ATCG/;

say "comp:   ", $seq;
Raku Solution

In Raku, we can do the accounting quickly by breaking the string up into a list and using a Bag type to gather our characters into a hash of keys with the counts as values. The rest of the processing is done with the report phase itself, inside a single heredoc.

And about that heredoc: we will use the more advanced interpolation of Raku to place code blocks { ... } inside the text, and these are evaluated and the results inserted at the appropriate spots before output, which is kind of a great feature. The TR/// (capitalized) operator non-destructively translates the sequence string, held here in the topic, returning the result and leaving the original unchanged. This nicely sidesteps the fact that as-is the input $seq remains immutable.

unit sub MAIN (Str $seq = 'GTAAACCCCTTTTCATTTAGACAGATCGACTCCTTATCCATTCTCAGAGATGTGTTGCTGGTCGCCG') ;

my %bases = $seq.comb.Bag;


## report
say qq:to/END/;
    seq:    $seq

    length  : {$seq.chars}
    thymine : %bases{'T'}
    adenine : %bases{'A'}
    guanine : %bases{'G'}
    cytosine: %bases{'C'}

    comp:   {TR/TAGC/ATCG/ with $seq}

    END

episode two:
“Who’d Think Peasants Would Be So Good at Binary Math?”


TASK #2

Ethiopian Multiplication

Submitted by: Mohammad S Anwar

You are given two positive numbers $A and $B.

Write a script to demonstrate Ethiopian Multiplication using the given numbers.

Method

Egyptian Multiplication, or Russian Peasant Multiplication, here called Ethiopian Multiplication, is means to multiply whole numbers that does not involve any knowledge of multiplication tables, only requiring the user to half one value and double the other. As such it can be performed with tokens or marks, such a bag of pebbles or beans, in addition to being able to be performed using pencil and paper. Because it is simple and effective, apparently it’s still in use in some markets today. Perhaps Ethiopia. I’m not sure exactly.

The process is straightforward. It’s a little clearer laid out on a piece of paper, but I’ll describe the method involving tokens to show that even paper isn’t necessary. Start with two piles of tokens for the two numbers you wish to multiply. Take one of the piles and divide it into two equal parts by alternately placing tokens into one pile and then the other. If there’s a token left over place it back in the column as an indicator. That means the number of tokens in the pile was odd. Take up one of the two piles for the next round and return the other to the bag.

In the other column make a new pile twice the size of the previous by counting through the tokens in place and replicating each token twice in the new entry. Keep laying out the two piles next to each other, a column of increasing, doubling piles lined up with a column of either single tokens or blank spaces.

Continue this way until the halving pile is reduced to a single token.

Now look down the column of doubled amounts. If a pile has a single token next to it keep it, if it does not, remove it and return the tokens to the bag. The result of the multiplication will be the remaining tokens from the doubling column combined.

In unary notation, where one dot equals one unit, you will have something like this:

11 × 7

starting    ending      doubling
pile        pile        pile

.....       .           .....    [1] started with 11 pebbles 
.....                   ..           divided into 5 + 5 + 1 
.                                    Took 5, left 1

.....       .           .....    [2] started with 5 pebbles
                        ..           divided into 2 + 2 + 1
                        .....        Took 2, left 1     
                        ..
                        
..                      .....    [3] started with 2 pebbles 
                        ..           divided into 1 + 1
                        .....        Took 1, left none
                        ..
                        .....
                        ..
                        .....
                        ..
              
.           .           .....    [4] started with 1 pebble.
                        ..           We are done.
                        .....
                        ..
                        .....
                        ..
                        .....
                        ..
                        .....
                        ..
                        .....
                        ..
                        .....
                        ..
                        .....
                        ..

Remove any doubling piles without a pebble next to them and 
put them back in the bag, in this case pile [3]. This is what  
is left when we combine the remaining piles in the doubling 
column:

.....   .....
.....   .....
.....   .....
.....   .....
.....   .....
.....   .....
.....   .....
.....   ..

There are 77 pebbles left.

In case it wasn’t clear, we gathered and combined the contents of piles [1], [2] and [4], and left the indicators where they lay.

7 + 14 + 56 = 77, which is the correct answer to 11 × 7.

Neat, huh? The reason this works is because we’re essentially converting the values into binary, and doing grammar-school long multiplication on that. Wait, what? Who does multiplication in binary like that? Well, no one, but the technique remains the same whatever base you work in. In decimal we’d multiply 7 × 11 by saying 7 × 1 = 7, then add 7 × 1 and shift the position 1, to make the result 70. Add the two rows and you get 77. The position shifting is the same as multiplying by 100, then 101, 102, 103 etc. Most people just do this step without thinking too much about exactly what they’re doing, but that is what they are in fact doing.

       7     
    × 11    
    —————
       7    = 7 x 1 x 10^0   = 7
    + 70    = 7 x 1 x 10^1   = 70
    —————
      77    total

So now let’s do the same thing in binary. No really. It’s perhaps not the most familiar thing to visualize, so I’ve done the work for you, all laid out in long form, with annotations and decimal equivalents:

    0111    = 7
  × 1011    = 11
  ——————
     111    = 0111 x 1 x 2^0   = 7
    1110    = 0111 x 1 x 2^1   = 14
  000000    = 0111 x 0 x 2^2   = 0
+ 111000    = 0111 x 1 x 2^3   = 56
————————
 1111100    (carry row is written in above the total)
+ 112221    sum from above 
————————
 1001101    total              = 77

This is the same example above done in long multiplication in binary. Notice the method is, as we said, the same no matter the base. We multiply each digit in turn, adding an additional 0 place for each as we go, then sum the results. Because it’s binary, the options for the multiplication step are to either write the number, for times 1, or alternately write 0, for times 0. You can plainly see the piles of 7, 14 and 56 laid out here. Notice that we’re still doing single-digit multiplication, but what we’ve done by converting to binary is reduced our multiplication table to a much simpler form, yielding either the identity or zero in all cases. It hardly looks like multiplication at all any more, but technically it is.

The test for oddness, of leaving a token if the division is uneven, is really an indicator that that bit in the binary representation is 1. Shifting to the right is the same as halving a number in binary, and a number with a 1 in the least significant digit will be odd. So the test isn’t so much about oddness, but rather that once bitshifted a certain number of places the number is odd, which is the same as saying the bit as a certain number of places from the right is a 1. Doesn’t that just work out nicely?

Perl Solution

Because the challenge description was phrased to “demonstrate Ethiopian Multiplication”, I’ve decided that rather than just implement the algorithm, to instead write a little report detailing the process for any numbers input, including the intermediate steps and commentary:

===========================================================
Dec 12, 2020 at 4:53:57 AM
~/Code/PWC/90_2_ethiopian_multiplication.pl
-----------------------------------------------------------

Input: 149 × 127

These numbers we will call m and n.
We will halve m, ignoring any remainder,
and double n until m reaches 1. At each
step along the way, if m is odd, we will
add n to the sum. When m reaches 1, the
running sum is the product of m × n

     m       n          sum

   149 |   127 |        127
-------+-------+------------
    74 |   254 |        127
    37 |   508 |        635
    18 |  1016 |        635
     9 |  2032 |       2667
     4 |  4064 |       2667
     2 |  8128 |       2667
     1 | 16256 |      18923
----------------------------
  Product:            18923

To demonstrate the coding, I’ve delivered the algorithm three ways as subroutines beneath the report. The first repeats the original form, stripped of commentary: this follows the verbal description of the method in the clearest way. For the second, I’ve replaced most of the operators with bitwise versions; the doubling and halving is done by bit shifting, and the conditionals are replaced by $m & 1, as a test for oddness, and $m ^ 1 as a test for not equal to 1.

In the third version We’ve gone all out to mimic the binary nature of the algorithm, and used the actual bits, 1 or 0, of the multiplier, viewed as binary, to determine whether to add in a given amount to the total. By keeping track of the bit indexes we can shift the multiplicand the appropriate amount whenever a quantity is to be added, and don’t need to keep track of intermediary doubling if it’s not to be used. I considered going even further and replacing the addition and subtraction with bitwise versions as well, but decided the idea was getting rather far afield and a bit pathological. This last version, though true to the technique, is hardly clear in its action already, and replacing addition with a recursive routine using bitwise ANDs and carry bits was just too much. It would be an instructive model, however, of how some computers have historically done multiplication. So from North Africa and Russian peasants to modern digital computing we have gone full circle.

Math is just math and does not care at all who’s using it. It’s very egalitarian that way, which is nice.

use warnings;
use strict;
use feature ":5.26";

## ## ## ## ## MAIN:

## explanation in the heredoc

my $m = shift @ARGV // 149;
my $n = shift @ARGV // 127;

my $acc = 0;
$acc = $n unless $m % 2 == 0;

say< 1) {
    $m = int $m/2;
    $n *= 2;
    $acc += $n unless $m % 2 == 0;
    printf " %5d | %5d | %10d\n", $m, $n, $acc;
}

say      '----------------------------';
printf "  Product:       %10d\n", $acc;


## ## ## ## ## SUBS:

## stripped version sans narrative
sub eth_mult {
    my ($m, $n) = @_;
    my $acc;
    $m % 2 == 0 ? $acc = 0 : $acc = $n;

    while ($m > 1) {
        $m = int $m/2;
        $n *= 2;
        $acc += $n unless $m % 2 == 0;
    }
    return $acc;
}

## the same function using bitwise operators
## to do the halving, doubling and comparisons
sub eth_mult_bitwise {
    my ($m, $n) = @_;
    my $acc = $m & 1 ? $n : 0;

    while ($m ^ 1) {
        $m = $m>>1;
        $n = $n<<1;
        $m & 1 and $acc += $n;
    }
    return $acc;
}

## the same function basing the switch on
## individual binary bits
sub eth_mult_binary {
    my ($m, $n) = @_;
    my $mb = sprintf "%b", $m;
    my $acc;

    for my $idx (reverse 0..length $mb) {
        next unless substr $mb, $idx, 1;
        $acc += $n<<(length($mb) - $idx - 1);
    }
    return $acc;
}
Raku Solution

In Raku, we’ll just skip directly to the scripts and leave the report to the Perl side of things. We will follow the method two ways this time.

The first method is a fairly straight port of the first Perl subroutine, benefitting from the built in boolean modulus %% operator and the .floor routine. For the second way, however, we’ve tried to do things in a much more Raku-esque manner. In it we start with a list of the index positions of the bits in the multiplier viewed as binary, then apply a function mapped across the indices that looks at the digit and if it’s a 1, then it substitutes the multiplier to its output list, bitshifted by the inverted position number. In other words working through bit positions 7 through 0 the shifts would be 0 through 7. To follow the original method we will look at the highest bit index, the least significant bit, first, so reversing the index list is the first order of business.

If the bit at the position is 0, on the other hand, we can use a control flow next to skip to the next element without producing an output for that pass. We could alternately have listed 0 for the element to the same effect, but being able to put control structures inside map and grep blocks is pretty cool stuff and I felt like doing it this way. When the mapping is done the resulting list is summed to product the output, which is the same as tabulating up the column.

I’ve decided I really like the “air” in a well-formatted Raku function. They just look nice.

unit sub MAIN (Int $m where $m > 0 = 149, 
               Int $n where $n > 0 = 127) ;

say eth_mult($m, $n);
say eth_mult_binary($m, $n);

## same algorithm served 2 ways:

sub eth_mult ($m is copy, $n is copy) {
## follows the multiplication method closest
## divide $m down by 2, double $n up until $m = 1
## add $n to accumulator every time unless $m is even
    my $acc;
    $acc = $m %% 2 ??  0
                   !!  $n;

    while $m > 1 {
        $m = ($m/2).floor;
        $n *= 2;
        $acc += $n unless $m %% 2;
    }
    return $acc;
}


sub eth_mult_binary ($m, $n) {
## more Raku-esque and mimics binary long multiplication
## 1. list the binary bit positions of the multiplier 
##        in reverse order 
## 2. apply mapping:
##       if bit at position is a 1 (TRUE)
##           then add multiplicand bitshifted by 
##           inverse position number
##       if 0 (FALSE)
##           then skip to next element
## 3. sum list

    my $acc = (0..$m.base(2).chars-1)
                .reverse
                .map( {  $m.base(2).substr($_,1).Int
                           ?? $n+<($m.base(2).chars - $_ - 1)
                           !! next } )
                .sum;

    return $acc;
}

One thought on “DNA Fragments Are Multiplying Exponentially in North-East Africa

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Google photo

You are commenting using your Google account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s