AlgoDiscoDev Software Development

Vocabulary tests for English-speaking Yoruba language learners


"...and why not start with the problems on your doorstep?"

- rhetorical questioner

Summary of this project

This project is now deployed at yorubasystems.com. The original scope of this project was a simple, online vocabulary quiz. This page documents the interesting parts of my research and software development process.

Background

The system of writing the Yoruba language using Latin-based characters was not the first, but for historical reasons it has become our de facto standard.

Unfortunately, the bus on which Unicode code points for Latin characters were being standardised left about 20 years ago.

Thankfully, some smart Unicode people developed the 'combining diacritical marks' workaround (which I'll only touch on briefly below). Being a workaround, however, and not a core standard, its implementation is far from universal.

The Challenge

The web is now the way most of the world consumes its written information - so, not so much the printed media anymore. Web and information standards are the foundations of the stacks on which we're all building. But what happens when a language is not yet a standard part of that foundation?

So, my main 'interesting challenge' would be discovering for myself whether I could create a web application whose source data came from a character set for which some Unicode code points do not currently exist. Eh?

Research

Unicode Normal Form

Ok, just a bit about combining diacritical marks.

The Latin character é has the Unicode code point U+00E9. Its base character e has the code point U+0065. It's possible to encode é using the combined sequence U+0065 U+0301, where U+0301 is the combining acute accent. However, since a single code point is already available for é, that sequence would not be in Unicode Normal Form. HTML validators would complain.

It's only possible to achieve this precomposed Normal Form, though, where a single Unicode code point exists for the character.
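
To make this concrete, here is a minimal sketch using PHP's intl extension (PHP being the stack this project eventually settles on); it simply illustrates composition, it isn't code from the application:

    <?php
    // Requires the intl extension (Normalizer class).

    $precomposed = "\u{00E9}";   // é as a single code point
    $decomposed  = "e\u{0301}";  // e + combining acute accent

    var_dump($precomposed === $decomposed);                        // bool(false)
    var_dump(Normalizer::isNormalized($decomposed));               // bool(false) - not in NFC
    var_dump(Normalizer::normalize($decomposed) === $precomposed); // bool(true)  - composed to é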

Read this Wikipedia article to learn more about Unicode character composition and decomposition.

Interpreting Unicode Characters

In the Linux world, there are a couple of quick ways to turn a hex Unicode code point into the corresponding character: the Ctrl+Shift+U input shortcut available in many Linux terminals, and an escape sequence passed to echo -e in a bash shell:

    Ctrl + Shift + u, then 00e9   # outputs: é

    damola@host:~$ echo -e '\u00e9'
    # outputs: é


Developers need to be aware of which Unicode hex formats are understood by the various interpreters that make up their web applications. For the Yoruba language character set, developers then need to use reliable pipelines to encode, decode, compress, encrypt, input, output, store and transmit data between these different interpreters.

To be transmitted safely across the Internet, data also needs to be encoded using schemes like UTF-8 or Base64, then decoded by your application.
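
For example, a quick Base64 round trip of a Yorùbá phrase in PHP (a toy illustration, not code from the application):

    <?php
    // Round-trip a Yorùbá phrase through Base64 for safe transport.
    $phrase  = "fil\u{00E0}";             // filà
    $encoded = base64_encode($phrase);    // Zmlsw6A=
    $decoded = base64_decode($encoded, true);

    var_dump($decoded === $phrase);       // bool(true)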

Some of the default escape sequence formats I found were (a short decoding sketch follows the list):

  • Unix-style hex escapes, understood by bash interpreters: \u00E9
  • XML numeric character references (hex), understood by most interpreters: &#x00E9;
  • HTML character entity references (decimal), used by browsers (which are interpreters too): &#233;
  • PHP's Unicode escape syntax, used by PHP interpreters: \u{00E9}
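
A minimal sketch of decoding the same character from each of these formats in PHP (assuming PHP 7+; nothing here is specific to this application):

    <?php
    // Each expression below should yield the same UTF-8 string: é
    $fromPhp  = "\u{00E9}";                                           // PHP escape syntax
    $fromJson = json_decode('"\u00e9"');                              // JSON / unix-style \uXXXX escape
    $fromHtml = html_entity_decode('&#233;', ENT_QUOTES, 'UTF-8');    // HTML decimal reference
    $fromXml  = html_entity_decode('&#x00E9;', ENT_QUOTES, 'UTF-8');  // XML hex reference

    var_dump($fromPhp === $fromJson && $fromJson === $fromHtml && $fromHtml === $fromXml); // bool(true)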

Then, there's the font support to consider. For reasons of efficiency, some font families (many monospaced fonts among them) are designed to implement only a core subset of Unicode. Even correctly decoded combining diacritical marks are therefore rendered incorrectly by these fonts.

As explained in the Mozilla docs, some HTML5 elements like <samp> are typically rendered using the browser's default monospaced font.

Probably one for further research.

At this point in the research, my excitement levels had become 'quite high'. The opportunities to develop services in this space seemed massive.

Software Development

Prototyping as a Bash Shell Program

Next, I had the idea that the simplest form of application in which the reliable interpreting of these weird Yorùbá characters could be demonstrated would be a language quiz app. Now doing software development, I created a Yorùbá language quiz bash shell program as a kind of prototype of how such an application would handle JSON source data like this:

{ "englishPhrase": "'hat'", "yorubaPhrase": "'fil\\u{00E0}'", "category": "", "shortAudioFile": "" }

Terminal Font Configuration

Of the 13 pre-installed font styles that can be configured in the Ubuntu Terminal, 7 are considered Yorùbá-safe. That is, they're able to render Yorùbá character symbols without errors.

  • Noto Mono
  • Courier New
  • FreeMono
  • Noto Color Emoji
  • Tlwg Mono
  • Tlwg Typo
  • Ubuntu Mono

The prototype worked as expected. From that point, I was confident enough to start developing a web application, briefly summarised as follows:

Application Architecture

Classic, monolithic, LAMP stack architecture would do for now. So, something like an Ubuntu Linux server operating system would be running an Apache HTTP server, on which PHP and MySQL/MariaDB modules would need to be installed. All very tightly coupled, I know.

Database Design

Entity Relationship modelling was used to create a data model. For an application with static data (nothing changed or created by users), I could have got away with reading source data from a JSON-structured data file. Up to a point, this might even have performed much better than database reads. However, by this point I was beginning to develop 'big ideas' about the destiny of this web app. No point delaying. Only full CRUD functionality would do for where this application was going! A MySQL relational database, now!
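
One Unicode-related detail worth pinning down at this layer is the character set of the database connection and columns. A minimal sketch, assuming PDO and a utf8mb4 schema (credentials, database and table names below are placeholders, not the live configuration):

    <?php
    // Placeholder credentials and names - illustrative only.
    // utf8mb4 lets the connection and storage carry any Unicode character,
    // including combining diacritical marks.
    $pdo = new PDO(
        'mysql:host=localhost;dbname=yoruba_quiz;charset=utf8mb4',
        'quiz_user',
        'secret',
        [PDO::ATTR_ERRMODE => PDO::ERRMODE_EXCEPTION]
    );

    $stmt = $pdo->prepare('SELECT yorubaPhrase FROM phrase WHERE englishPhrase = ?');
    $stmt->execute(['hat']);
    echo $stmt->fetchColumn(); // e.g. filà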

Application Software Development

UML techniques were used to model the quiz domain.

Implementation

This 'minimum viable product' server-side web app implemented the Front Controller pattern, with one main page as the single entry point of control. The application was coded, and the database created, within a local development environment. PHPUnit was installed locally to the project using the Composer dependency manager, and used to run suites of unit tests on our classes and object methods as they were being developed.
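
A minimal sketch of the Front Controller idea (the route names and dispatch targets below are hypothetical placeholders, not the application's actual controllers):

    <?php
    // index.php - the single entry point of control.
    declare(strict_types=1);

    $action = $_GET['action'] ?? 'quiz';

    switch ($action) {
        case 'quiz':
            // e.g. (new QuizController())->show();
            echo 'Render the next vocabulary question';
            break;
        case 'answer':
            // e.g. (new AnswerController())->check($_POST);
            echo 'Mark the submitted answer';
            break;
        default:
            http_response_code(404);
            echo 'Page not found';
    }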

Deployment on AWS Cloud

Based on the application architecture I'd decided upon, but also as a bit of a side-challenge, it looked like I'd be utilising the IaaS public cloud offerings provided by AWS. A single EC2 instance was launched running an Ubuntu Linux server OS, on which an Apache web server, configured with various modules, was built by hand (slow, but satisfying). So, web and database servers running on the same instance as the application itself should be enough for now. Or so I thought. I won't go into that now.

Securing the Web App

Being a server-side web application gives an inherent layer of security: the most sensitive application code stays on the server. Additionally, the Apache 'web application firewall' module mod_security was installed as an application-layer inspector of web server traffic.

Monitoring the App

  • Web server logs: /var/log/apache/yorubasystems.com
  • Ubuntu server logs: /var/log/syslog
  • AWS CloudWatch logs

Site Analytics and Testing

None yet.

Hero message from the ṢàngóTech Network.

"Designing software systems is very nice."

Let's discuss this...