Archive for the ‘Programming’ Category

Computer Terminology part 1: Strategy

Tuesday, April 27th, 2010

One of the main motivations for developing the translation system was to automatically generate localised user interfaces for computer software. This task is likely to make heavy use of computer-related terminology, making that topic an obvious one to address at an early stage. I’ve refrained from doing this previously because there were some basic questions about predicates which needed to be settled first, and it was easier to do that using well-ordered sets of concepts such as numbers and colours. Now that I’ve gained some experience I think it is feasible to attempt something more ambitious.

Identification of the required predicates has been largely ad hoc, and I am expecting substantial additions to be necessary in the future. Writing definitions has been time-consuming but mostly straightforward. I doubt they are up to dictionary standard, but they should be adequate to indicate what is intended.

More difficult has been deciding how to divide the resulting predicates between namespaces. I’ve explored two alternative approaches:

  • A subtype-supertype hierarchy, in which members of a given namespace share an is-a relationship with their parent.
  • A topic-based hierarchy, in which predicates with fundamental semantic differences can and should share the same namespace if they relate to the same subject area.

For example, the first method would place the user interface elements ‘window’, ‘icon’ and ‘menu’ into one namespace, and the user actions ‘click’ and ‘drag’ into another, because the former are user interface elements whereas the latter are actions. The second method would more likely place all of these in a single namespace because they all relate to the topic of computer user interfaces.

The first method is attractive because it is relatively objective, but I’ve found that it has three major drawbacks:

  1. The resulting namespaces are often very small (and therefore numerous).
  2. The namespaces can be difficult to name concisely.
  3. Homonyms, which namespaces were supposed to resolve, are quite likely to end up in the same namespace.

The third point is arguably the most serious one, because there is no point having namespaces unless they provide a way to specify which sense of a word is intended. For example, using the supertype-subtype approach, there isn’t a large difference between hanging out the washing and hanging a person: both are actions which involve suspending something from a rope or line. However the topics with which they are associated are entirely different: housekeeping versus criminal justice.

Another question that arose was how far to go in creating predicates to represent concepts that could be expressed in terms of other predicates. Here are some examples where I think a good case can be made one way or the other:

  • Although it is true to say that a ‘laser printer’ is a printer that contains a laser, there is much more to its meaning than that. A full description would be impracticable, on grounds of both verbosity and fragility. It therefore makes good sense for there to be a separate predicate corresponding to the concept of a laser printer.
  • The alternative would be to create a predicate that captures the full, specialised meaning of the word ‘laser’ as used in the term ‘laser printer’. However, such a predicate would be so specialised that it could, I suspect, only be used to modify the concept of a printer. If the predicate language was being developed for semantic analysis then this added orthogonality might be useful, but I don’t think a translation system has any need for it.
  • A ‘printer cable’ is merely a cable for a printer. If a separate predicate were created for this concept then, for consistency, it would also be necessary to add ‘keyboard cable’, ‘mouse cable’, ‘plotter cable’ and many others. The need to systematically replicate a complete class of predicates is a clear indication that orthogonality has been violated, and while I wouldn’t be above that if there were a clear practical benefit, I can’t see any justification for it in this instance.

The distinction here is between what would be an idiom and what is merely a collocation. Idioms are forbidden in BabelScript, because they violate the rule for combining predicates, so if a concept would be too complex to be described analytically then it needs to become a predicate in its own right.

However there is an exception to this rule. Even if a predicate is not strictly needed, it can be added anyway as an alias for the decomposed form. In this case the criterion is whether the convenience of having a single predicate to express a concept outweighs the clutter resulting from additional and unnecessary predicates. In the third example above I’ve said no, but this is very much a value judgment.

For many of the potential predicates I’ve considered the correct answer is unclear. For example:

  • Is a package manager merely a program for managing packages?
  • A compiler is (I would say), merely a program for compiling, but is it a sufficiently well-used concept to justify an alias?
  • Should constructors and destructors be expressed as functions for constructing and destroying?

The approach I’ve taken is to err on the side of caution and avoid creating predicates that are questionable. Adding new predicates is straightforward, especially if they are aliases, whereas removing them is more disruptive (because it breaks compatibility with texts that use them).

Dialects

Wednesday, October 14th, 2009

Currently the translation system supports only one dialect of any given language, so English means British English and Portuguese means European Portuguese. The United States and Brazilian dialects of these languages are sufficiently different (and popular) that they certainly ought to be supported too, but similar enough that writing entirely separate language descriptions would result in an undesirable amount of duplication.

What is needed is a mechanism which allows language description files to share common data. This would bring two benefits:

  • a reduction the amount of memory and disc space consumed;
  • automatic propagation of any corrections or improvements to a language to its dialects.

There are two ways in which sharing could be achieved:

  • by merging related dialects into a single language description file, then switching sections of that file in and out using some form of conditional notation; or
  • by requiring a separate language description file for each dialect, but allowing inheritance relationships between dialects such that only the differences need be specified.

Drawbacks of the first method are reduced modularity and readability. All dialects of a language would have to be loaded together, even if only one were needed. The language descriptions are likely to be quite complex enough handling one dialect, and if anything I would prefer to be looking at ways to break them down into smaller units rather than making them larger.

The main drawback of the second method is that it scales very poorly if the language can vary in several dimensions independently. For example, in Celtic languages the use of decimal versus vigesimal numbers is only loosely correlated with dialect and to a large extent is a matter of personal choice. You could write two language description files for each language, one for decimal and one for vigesimal, but then what happens when another issue is found where a similar choice is needed?

For these reasons I don’t think that either method provides a complete solution, so am inclined to implement both. This is not an unreasonable extravagance: many programming languages provide comparable facilities (such as #ifdef and #include in the C preprocessor).

Broadly speaking my intention is to use inheritance for regional dialects, and conditional rules for preferences which cut across those dialects. Inheritance is in the process of being implemented, and I will describe the syntax shortly. Conditional rules I don’t have a clear strategy for yet, but they are a less urgent requirement.

Migration to the GNU build system

Thursday, April 2nd, 2009

Originally I had expected that it would take several years at least to reach the point where a formal release was worthwhile. In terms of creating a general-purpose translation system I think that is still true, but the ability to generate cardinal numbers - now in over 40 languages - is a useful ability in its own right. For this reason I’m now looking to identify and resolve any issues which would stand in the way of a release.

Until very recently the project has been built using a bespoke set of makefiles. I’ve now removed these and replaced them with scripts for driving the GNU Autotools suite. There were two main drivers for this:

  • portability, in particular with respect to shared library implementations; and
  • providing a full set of standard makefile targets.

As a result of this, the procedure for building from the Subversion repository is now as follows:

./bootstrap
./configure
make

and to run the regression tests:

make check

There are some important documentation files missing from the distribution tarball, but it will now pass a distcheck test and on that basis can be declared to be ‘working’.

One important change to the code is that the path for locating language definition files is now compiled into the library: for a default prefix of /usr/local the path would be /usr/local/share/babel/languages. In order to allow the regression tests to run without installing the files first, the path can be overridden at runtime by setting the environment variable LIBBABELPATH.

The library API is not at all stable yet, so do not place any reliance on the soname. I intend to greatly cut down on the size of the public interface before releasing anything, quite possibly by hiding the entire C++ interface and providing a minimal one written in C. (Some of the C++ interface might then be re-exposed in the future, but only when it is much more stable than it is now.)

Design considerations for compiled language definitions

Friday, March 6th, 2009

As noted previously, one of the options for improving efficiency is to introduce a compilation phase whereby language definition files are converted into a binary format than can be loaded into memory much more quickly than they are at present, and which is suitable for sharing between processes. I’m now fully convinced that this is the right way forward, partly it is a less intrusive solution than a daemon, but mostly because I think a long delay while loading would be undesirable even if it were a one-off act at boot time.

This will require quite radical changes to the structure of the library, to the extent that it would probably qualify as a rewrite. It will also make future changes to the library more cumbersome to implement. For that reason I think it would be a mistake to introduce compilation right now: the library is evolving too quickly, and I don’t want to do anything that would make experimentation more difficult. What I do have are some initial thoughts about how compilation could be implemented.

I identified mmap and dlopen as possible mechanisms for sharing the compiled data between tasks. On reflection, I can’t see any real advantage to using dlopen unless the compiled language definitions are to contain directly-executable code. Although this probably would allow some performance gains, it is a level of complexity beyond what I envisioned and I hope it won’t be necessary. dlopen does have significant disadvantages in terms of portability and flexibility. In particular, a solution based on mmap could be very easily adapted to other methods for loading data into memory (or even using data stored in ROM, in the case of an embedded system). Shared libraries require much more infrastructure to be present.

The next choice is whether the compiled files should be architecture-dependent or -independent. The latter would allow the compiled files to be freely moved between machines, but at a cost: it would not be possible to map the data directly onto native data structures, and they would instead have to be accessed through a translation layer to handle issues such as endianness and alignment. I don’t believe this is a price worth paying because I can’t see any compelling need for a portable file format.

(Note this does not mean that the files absolutely have to be compiled on the machine on which they are to be used. For example, a GNU/Linux distribution could compile them once for each architecture then distribute them alongside other binaries. They would need to be updated on an ABI change, but then so would most other binaries.)

Key factors that will influence the design of the file format are that:

  • It needs to be position-independent, because mmap cannot be relied upon to load it at any fixed memory address. This means that it cannot contain ordinary pointers (but could contain offsets from the start of the file which would serve the same purpose).
  • Once created the files do not need to be modifiable. This will influence the data structures used. For example, a C++ std::map would typically be implemented as a balanced tree, but if the data is fixed then it would be simpler and more efficient to use an ordered array.
  • Related data items will benefit from being stored in close proximity to each other, in order to minimise the number of pages that need to be pulled into RAM.

The API will need changes too, because I certainly wouldn’t want to create a separate, long-lived object to proxy every item of data in the compiled language definition. In some cases it will be possible to take a pointer to the data and cast it into an object, requiring no extra memory to be allocated, but this won’t work if virtual functions or pointers are needed. A third possibility is for there to be short-lived proxy objects with pointer- instead of value-semantics. None of these options is ideal by itself, and my expectation is that some combination of them will be needed.

Efficiency

Saturday, February 7th, 2009

A word about efficiency, and in particular the amount of time needed to parse a language definition file. I’m already conscious that this is not entirely negligible, even for files of just a few hundred lines. Expand this to tens or hundreds of thousands of lines and the time to read it will almost certainly become unacceptable if it has to be done on a per-process basis. However I have at least two options to reduce the impact of this problem:

  • Run the translation system as a daemon.
  • Translate the language definition file into an intermediate form which is loaded using mmap of dlopen and is capable of being interrogated directly.

The first option has the virtue of being easy to implement across a wide range of operating systems. It would even be conceivable for the translation system to operate as a network service, in which case the overhead of keeping the language definition in memory could be spread between many machines. However this is a relatively heavyweight option, and has more chance of going wrong (due to the daemon dying, failing to start, failing to bind to its server socket, or any of the other reasons why daemons don’t always work as they should).

The second method is not dissimilar to the way PO files are translated to MO files for use by the GNU gettext library, although the format would need to be much more complex. As with MO files, the intermediate form would not need to be architecture-independent because each machine would generate its own copy from the language definition file (which, being a text file, is architecture-independent). However libbabel would have to change significantly for this to work because standard containers such as std::map and std::list can’t be copied bit-for-bit into a file: new data structures would be needed, preferably ones with good locality of reference.

I’m inclined to favour the second option on platforms where it is feasible, as it better follows the principle of ‘not paying for what you don’t use’, and because mapping a read-only file into memory is less of a burden than allocating an equivalent volume of writable memory. However, no decision is needed at this time, and for now it is enough to know that options exist.

UTF-8 support added

Saturday, January 3rd, 2009

I wrote in November about the limitations of the standard regex engine when processing UTF-8 encoded text. These have now become sufficiently annoying to make an upgrade worthwhile. The regex engine I’ve decided to use is the one provided by the PCRE library.

The issue that brought this to a head was the realisation that, while adding accented characters to a set was relatively straightforward, subtracting them was not (other than by inverting the set and listing all allowed characters explicitly, which can be rather cumbersome).

Currently it is possible to switch between the two regex engines at compile time by setting the makefile variable USE_PCRE to true (1) or false (0). This is only a temporary measure: as the language definitions come to depend on UTF-8 support the older engine will soon become unusable, at which point it will be removed to avoid clutter.

An important topic that I have not addressed yet is now Unicode input should be normalised. (I don’t think that there is any doubt that it should be normalised, but there is more than one algorithm that could be used and it isn’t yet clear to me which is preferable.) There may also be a need to support character classes (both the standard ones, and perhaps also use-defined character classes to handle language-dependent groupings such as ‘vowels’ or ‘consonants’). Two-level morphology remains an option for the future, but definitely not the near future.

Using UTF-8 with a regex engine that is not Unicode-aware

Monday, November 24th, 2008

The regular expression engine provided by POSIX - which I’m currently using in the C++ library - does not provide any explicit support for Unicode. Fortunately, thanks to the properties of UTF-8, for the most part it doesn’t need to.

All of the characters that have special meanings within a regular expression pattern have Unicode values of less than 128, and therefore have the same representation in both ASCII and UTF-8. Furthermore, a UTF-8 byte stream will not contain these codes for any other purpose. Other characters will be represented by a sequence of two or more codes in the range 128-255, but so long as the regex engine is 8-bit clean it will be able to match these when they occur literally in the pattern.

Where a difference can be seen is that patterns which attempt to count characters in some way will actually be counting bytes. For this reason, codes greater than 127 within a bracket expression (a list of possible characters in square brackets) will not have the intended effect, nor will the wildcard (dot) character if there is any finite limit on the number of repetitions.

The workaround for bracket expressions is to use branches instead (pipe characters), thereby presenting the multi-byte sequences as strings instead of characters. This is what I’ve done in the language definition files that are currently being written.

I definitely don’t intend for this to be a permanent feature of the system. The replacement may be a regex engine that is Unicode-aware, in which case there are obvious benefits to ensuring forward compatibility (which the above workaround does: it will function correctly whether or not the engine supports Unicode).

Alternatively, I am looking seriously at replacing regular expressions with a system known as two-level morphology. This would be entirely incompatible, but has the advantage of being reversible (and therefore equally applicable to text generation and analysis). My main reservations concern its efficiency and readability, but given the considerable effort that will be needed to create the language definitions, anything that would widen their potential use has to be worth considering.