NLP – Programming Excursions in Perl and Raku

Wherein we discover once we’ve put everything away in its correct spot we can no longer find anything. Such is life.

THE WEEKLY CHALLENGE – PERL & RAKU #181 Task 1

“Ending a sentence with a preposition is something up with which I will not put.”

— Winston Churchill

Sentence Order

Submitted by: Mohammad S Anwar

You are given a paragraph.

Write a script to order each sentence alphanumerically and print the whole paragraph.

Example

Input:
    All he could think about was how it would all end. There was
    still a bit of uncertainty in the equation, but the basics
    were there for anyone to see. No matter how much he tried to
    see the positive, it wasn't anywhere to be seen. The end was
    coming and it wasn't going to be pretty.

Output:
    about All all could end he how it think was would. a anyone
    basics bit but equation, for in of see still the the There
    there to uncertainty was were. anywhere be he how it matter
    much No positive, see seen the to to tried wasn't. and be
    coming end going it pretty The to was wasn't.

Background

One of the biggest obstacles facing any new reader is that the words, simply put, are all over the place. We might start a phrase using a word near the front of the alphabet, then immediately jump haphazardly to near the end, then whiplash back to the front again only to land somewhere in the middle. To a casual observer this placement could reasonably be considered random. It seems completely nonsensical. It’s also exhausting, both mentally and physically.

Even thinking about this state of affairs has left me emotionally drained and dangerously unstable. This affront cannot stand.

So what can we do to make this reading thing people do more… user-friendly?

Well, one thing that comes to mind is we could minimize the dictionary page gaps when looking up adjacent words in a sentence. We start with all the “A” words and end with “zymurgy”, should we be unlucky enough to have to walk that far. Then, for the next sentence (the next idea), we simply stroll back to the beginning of the alphabet at our own pace, resting as long as required thanks to the broad latitude for pausing granted by the terminal punctuation mark.

That totally sounds way better. Let’s do that thing.

METHOD

So we’ll need to split our paragraph up two ways: once to break out the individual sentences, then again within each sentence to sort the words. Using split on the input paragraph, we can define a regular expression to match the terminal punctuation: period, question mark or exclamation point.

/[.?!]/

Remember that a dot inside a character class is just a dot, not a wildcard.

But, you ask, what of that really long word, “alphanumerically” in the challenge description? This does imply the possible existence of numbers, and with them decimal points. And those decimal points do look suspiciously like periods. This could cause a problem should, say, we ever need to define pi within our sentence. Or discuss converting inches to centimeters:

“One international inch, in 1959, was defined to be exactly 2.54 centimeters.”

Fortunately punctuation that actually ends a sentence is invariably followed by a space, so if we include that in our delimiter we’re good again.

/[.?!]\s/
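As a quick sketch of the difference, here are both patterns applied to a shortened version of the inches sentence. The second sentence in the sample text is made up purely for illustration.

```perl
use strict;
use warnings;

# Sample text: the second sentence is invented for this illustration.
my $text = 'One inch is exactly 2.54 centimeters. That is not very long.';

# Splitting on bare punctuation also cuts at the decimal point.
my @naive = split /[.?!]/, $text;

# Requiring whitespace after the mark leaves the decimal point alone.
my @better = split /[.?!]\s/, $text;

print scalar(@naive),  " pieces: ", join(' | ', @naive),  "\n";
print scalar(@better), " pieces: ", join(' | ', @better), "\n";
```

The naive pattern yields three pieces, with “2.54” torn in half, while the second pattern yields two. Note that the final period stays attached to the last piece, since nothing follows it.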

However, as it stands, our split does not simply divide the text at the terminal punctuation but removes and discards it! Holy crap! How did we miss that?

Wait, wait, take a deep breath and hold it. That’s right. Now release, and feel the rage flow out of your being, attracted by the heat of the sun, which gathers it and returns it to Earth on the healing solar wind. Can we continue now? Take another moment if you must, this is important. Together? Good.

So, the reason the world will not end today is that if we surround the delimiter match with capture parentheses, the matched delimiter is returned as an element in the split array. Then, because every proper sentence has some sort of terminal punctuation, we end up with pairs of sentences and closing marks.
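A small sketch of that behavior, using an invented two-sentence string with a trailing space so that both marks are followed by whitespace:

```perl
use strict;
use warnings;

# Invented input; note the space after the final period.
my $text = 'Is it late? It is. ';

# Capture parentheses make split return each delimiter as a list element.
my @parts = split /([.?!]\s)/, $text;

# @parts alternates sentence text and the mark-plus-space that closed it.
print join('', map { "[$_]" } @parts), "\n";
```

Each sentence lands in an even-numbered slot and its closing mark, with the space that followed it, in the odd-numbered slot right after.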

There’s only one loose end remaining: the last line of the input, which likely ends with a period and no following space, as there’s nothing beyond it to be separated from. There may not even be a terminal linefeed. Fortunately we have a metacharacter for that: the anchor $, signifying the end of the string.

Ultimately what we end up with is

split /([.?!](?:\s|$))/, $input

which does exactly what we need. I love it when a plan comes together.
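To see the end-of-string case in isolation, here is a minimal sketch on an invented three-sentence string whose final period has nothing after it:

```perl
use strict;
use warnings;

# Invented input with no whitespace after the final period.
my $text = 'It works! Does it really? Yes.';

# A mark counts as a delimiter if whitespace or the end of the string follows.
my @parts = split /([.?!](?:\s|$))/, $text;

# Walk the list two elements at a time: sentence text, then its closing mark.
for (my $i = 0; $i < @parts; $i += 2) {
    my $mark = $parts[$i + 1] // '';
    print "sentence: '$parts[$i]'  mark: '$mark'\n";
}
```

The first two marks come back with their trailing space attached, and the last one comes back as a bare period, so joining the pieces in order reproduces the original spacing.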

The last piece of the puzzle is to rearrange the words in each sentence irrespective of any capitalization. We need to take each sentence element from the first array, split it again, on simple whitespace this time, and sort the resulting list. The traditional way to sort case-insensitively is to convert both strings to lower- or uppercase and compare those. With Unicode, however, we have an improvement in “casefolding”, a generalized way of comparing strings regardless of case, whatever that concept means in non-Latin alphabets. Using fc() is a bit less hacky, as the function is designed specifically for case-insensitive comparisons.
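A short sketch comparing the default sort with a casefolded one, using the words of the example's final sentence:

```perl
use strict;
use warnings;
use feature 'fc';    # fc() is available from Perl 5.16 onward

my @words = qw(The end was coming and it wasn't going to be pretty);

# Default sort compares code points, so uppercase letters come first.
my @plain = sort @words;

# Casefolding both sides compares the words without regard to case.
my @folded = sort { fc($a) cmp fc($b) } @words;

print "@plain\n";
print "@folded\n";
```

With the default sort, “The” jumps to the front of the list. With fc() it settles between “pretty” and “to”, as in the expected output.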

PERL 5 SOLUTION

use warnings;
use strict;
use utf8;
use feature ":5.26";
use feature qw(signatures);
no warnings 'experimental::signatures';



my $input = q(All he could think about was how it would all end. There was still a bit of uncertainty in the equation, but the basics were there for anyone to see. No matter how much he tried to see the positive, it wasn't anywhere to be seen. The end was coming and it wasn't going to be pretty.);

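# The capture group keeps each delimiter, so the list alternates
# sentence text and closing punctuation.
my @sentences = split /([.?!](?:\s|$))/, $input;

while (my ($sent, $punct) = splice @sentences, 0, 2) {
    # split ' ' skips leading whitespace and collapses runs of it
    my @w = sort { fc($a) cmp fc($b) } split ' ', $sent;
    # a trailing fragment without terminal punctuation has no $punct
    print "@w", $punct // '';
}
say '';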

The Perl Weekly Challenge, that idyllic glade wherein we stumble upon the holes for these sweet descents, is now known as

The Weekly Challenge – Perl and Raku

It is the creation of the lovely Mohammad Sajid Anwar and a veritable swarm of contributors from all over the world, who gather, as might be expected, weekly online to solve puzzles. Everyone is encouraged to visit, learn and contribute at

https://theweeklychallenge.org


An Orderly Sentencing