Archive for March, 2009

Living organisms

Sunday, March 29th, 2009

At this point it would be very useful to have some ordinary, countable nouns in the language definition files. One very large category of countable nouns, which I’ve been lining up for some time, is that of living organisms.

My stated policy is to use existing naming standards when defining predicates unless there is a very good reason not to. In this case, the obvious standard to turn to is that of Linnaean taxonomy - the source of what are popularly known as the ‘scientific’ or ‘Latin’ names of plant and animal species.

In addition to providing unique names for individual species, this serves the important purpose of grouping species into larger categories. It is necessary to do this because many common names refer to groups rather than individual species. Examples include ‘rodent’ (the order Rodentia) and ‘insect’ (the class Insecta).

The specific syntax I have in mind is to place everything in the namespace ‘bio’ (biology), then have secondary namespaces for each taxonomic rank. These would be followed by the taxon name itself (making further use of the namespace separator in the case of binomial or trinomial taxa). For example:

bio:species:suricata:suricatta (meerkat)
bio:ordo:primates (primate)
bio:infraclassis:marsupialia (marsupial)

Note the use of Latin names for ranks, non-abbreviated genus names, and all-lowercase script. The result of applying one of these predicates is true to the extent that its argument is composed of instances of the specified taxon. It is not true for subordinate taxa, nor (without appropriate qualification) for parts or derivatives of the relevant organisms (you can make beef from a cow but not vice versa). Quantity, gender and age are unspecified.
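As a rough illustration, the naming convention could be sketched in Python as follows. The helper name `taxon_predicate` and the rank table are assumptions made for this example; the Latin rank names are the ones that appear in the examples in this post, and the table is not exhaustive.

```python
# Latin rank names as they appear in the examples in this post (not exhaustive).
LATIN_RANKS = {
    "kingdom": "regnum",
    "phylum": "phylum",
    "subphylum": "subphylum",
    "class": "classis",
    "infraclass": "infraclassis",
    "order": "ordo",
    "family": "familia",
    "genus": "genus",
    "species": "species",
}


def taxon_predicate(rank, name):
    """Build a predicate name such as 'bio:ordo:primates' (hypothetical helper).

    Binomial or trinomial names are split on whitespace so that each part
    becomes a further namespace level, and everything is lower-cased.
    """
    parts = name.lower().split()
    return ":".join(["bio", LATIN_RANKS[rank], *parts])


print(taxon_predicate("species", "Vulpes vulpes"))  # bio:species:vulpes:vulpes
print(taxon_predicate("order", "Primates"))         # bio:ordo:primates
```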

The Linnaean system is not a problem-free choice, for several reasons:

  • Linnaean names sometimes change in response to new discoveries.
  • Biologists do not always agree on how a species should be named or classified.
  • The relationship between Linnaean names and common names can be less than straightforward.

However, in my opinion the alternatives are even less attractive:

  • Competing scientific systems tend to map even less neatly onto common names than the Linnaean system, due to their emphasis on genotype (genetic code) rather than phenotype (physical form).
  • Common names could be used in the source language, but to maintain any pretense of rigour they would need to be defined somehow - probably using the Linnaean system. Furthermore, while they would undoubtedly translate well into languages which draw their semantic boundaries in the same place, the problems caused by any deviation from this ideal would be magnified (there then being two sets of idiosyncrasies to resolve instead of one).

For these reasons I’m satisfied that the Linnaean system provides the best available basis for the naming scheme, but with two reservations which mean that it will be necessary to use it selectively.

The first is that, unlike chemical elements, it would be neither practicable nor desirable to achieve exhaustive coverage: there are simply too many species out there, not to mention other ranks such as classes, legions, infraorders, superfamilies and subspecies. Secondly, in order to provide sufficient coverage without using an unreasonably large number of names, it will be necessary for the depth of coverage to be non-uniform.

To illustrate why the coverage cannot reasonably be uniform, compare the subspecies Ursus arctos horribilis (grizzly bear) with the phylum Nematoda (nematode worm). Phyla and subspecies are almost at opposite ends of the taxonomic scale, but their corresponding common names are both at the limit of detail that would be expressed in normal speech. (If anything, ‘grizzly bear’ has much more right to be considered a common name than ‘nematode worm’.)

To put this in perspective, the phylum to which grizzly bears belong is Chordata, but that includes all mammals, birds, reptiles, amphibians and fish. Not all of these have distinct common names, but there are several hundred at least which do. For comparison there are between 80,000 and 500,000 species of nematode worm, but as Wikipedia says they are ‘very difficult to distinguish’ and non-specialists generally don’t. If any fixed taxonomic rank were chosen as a cut-off then the system would clearly either fail to provide sufficient detail, or include a huge amount of detail that was not linguistically significant.

Similar issues arise for any fixed set of intermediate ranks. My default policy will be to draw names from what are called the ‘major ranks’ (kingdom, phylum/division, class, order, family, genus and species), but not from minor ranks unless there is a good reason to.

An example of a species for which these ranks map relatively well to common names is Vulpes vulpes:

bio:species:vulpes:vulpes (red fox)
bio:genus:vulpes (fox)
bio:familia:canidae (dog)
bio:ordo:carnivora (carnivore)
bio:classis:mammalia (mammal)
bio:subphylum:vertebrata (vertebrate)
bio:phylum:chordata (chordate)
bio:regnum:animalia (animal)

Even here, some difficulties can be seen. The common name for members of the phylum Chordata is ‘chordate’, but this is not a word that non-specialists would be likely to use in everyday speech. The more common name ‘vertebrate’ corresponds to the sub-phylum Vertebrata, which is a minor rank, but if we were to include all minor ranks then many would not have common names at all (or names sufficiently obscure that they would only be meaningful to biologists).

Because of this selectivity it won’t be possible to simply pick a species or genus name and use it in BabelScript source text. Instead there will need to be a list of names that are allowed (just as there will be for most other types of predicate). That also solves the problem of what happens when species are renamed or reclassified, or when there is disagreement as to what the name or classification should be: so far as the translation system is concerned, the list will be definitive.

I’m undecided what to do about common names which map very poorly onto the hierarchy. An example is the concept of ‘fish’, which at best maps to a group of vertebrates with no more in common with each other than with amphibians, reptiles and mammals. At worst it includes an assortment of crustaceans and other organisms which are entirely unrelated. One option is to resurrect obsolete terms such as the class Pisces, which are no longer used scientifically but which map well to the corresponding linguistic concepts. Another would be to revert to common names in those cases.

Update 2009-07-13: it is apparently permissible to use the same name for both a plant and an animal taxon. That means it will be necessary to further distinguish the predicate names (preferably without adding another level, as they are already quite long enough).

Letter Apostrophes

Saturday, March 21st, 2009

Attempting to process some Ukrainian text raised the problem that the apostrophe character - or to be more precise, the typewriter apostrophe (U+0027) that was inherited from ASCII - is not currently allowed to occur within BabelCode/BabelScript identifiers.

However, this turns out not to be a problem at all, because there is another character called ‘letter apostrophe’ (U+02BC) specifically intended for use when it has the function of a letter, as opposed to a separator or a delimiter. This is already allowed to occur within an identifier, therefore no change to the library has been necessary.

For use as a punctuation mark the preferred character is apparently U+2019. This too is already allowed to occur within an identifier, as will be necessary when (for example) forming possessives in English.
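The three characters are easy to confuse visually, so here is a small Python sketch that prints their Unicode properties and converts a typewriter apostrophe to the letter apostrophe. The function `to_letter_apostrophe` and its rule (convert only when there is a letter on both sides) are assumptions for illustration, not part of BabelCode; for English contractions and possessives the same rule would be wrong, since the preferred character there is U+2019.

```python
import unicodedata

TYPEWRITER = "\u0027"   # APOSTROPHE, inherited from ASCII
LETTER = "\u02BC"       # MODIFIER LETTER APOSTROPHE
PUNCTUATION = "\u2019"  # RIGHT SINGLE QUOTATION MARK


def to_letter_apostrophe(word):
    """Replace a typewriter apostrophe flanked by letters with U+02BC.

    Hypothetical helper: apostrophes at the start or end of the word are
    left alone, on the assumption that they are quotation marks.
    """
    chars = list(word)
    for i in range(1, len(chars) - 1):
        if (chars[i] == TYPEWRITER
                and chars[i - 1].isalpha()
                and chars[i + 1].isalpha()):
            chars[i] = LETTER
    return "".join(chars)


# Show the code point, general category and official name of each character.
for ch in (TYPEWRITER, LETTER, PUNCTUATION):
    print(f"U+{ord(ch):04X}", unicodedata.category(ch), unicodedata.name(ch))

# Ukrainian word typed with a typewriter apostrophe.
print(to_letter_apostrophe("м'ясо"))
```

The general category of U+02BC is a letter category (Lm), whereas the other two are punctuation, which is consistent with only the letter apostrophe being treated as part of a word.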

Demonstration of Cardinal Number Generation

Friday, March 20th, 2009

There are now twenty-two languages for which cardinal numbers can be generated. You could already see this for yourself by downloading and compiling the source code, but not everyone will want or be able to do this. I’ve therefore created a small web application to demonstrate the capability. To try it, click here. You simply choose a language, a number to start from and a step size, and it will display up to twenty results in order.

The languages currently supported are Catalan (ca), Danish (da), German (de), English (en), Spanish (es), Estonian (et), Basque (eu), Finnish (fi), Faroese (fo), French (fr), Hungarian (hu), Italian (it), Icelandic (is), Lithuanian (lt), Latvian (lv), Norwegian (nb), Dutch (nl), Polish (pl), Portuguese (pt), Romanian (ro), Portuguese (pt) and Swedish (sv). Known limitations of the system as it stands are noted at the bottom of the page.

Agglutination and Hyphenation (part 2)

Wednesday, March 11th, 2009

I’ve described how two or more morphemes can be merged into a single word by applying the ‘agglutinate’ tag, but few languages are so purely agglutinative that this is sufficient: in most cases there are at least some combinations which require further processing to create the correct surface form. For example, when the Swedish words ‘ett’ and ‘tusen’ are merged the result is ‘ettusen’, dropping one of the letters ‘t’ in order to avoid creating a treble ‘t’. Languages that use complex methods for combining morphemes are described as being fusional.

It is unclear (to me) exactly how powerful the mangling mechanism needs to be, and I won’t know until I have gained a great deal more experience working with fusional languages. However, it seems likely that the mechanisms at work will be similar to those which occur during inflection - after all, inflection can be considered to be a form of compounding. For this reason I’ve decided to implement mangling using the same mechanism as inflection. In addition to regular expression chains, which form the basis of the mechanism, this allows the mangling to be influenced by tags, provides access to variables, and allows rules to be ordered using the keywords ‘depends’ and ‘forces’.

There is a possibility that this decision is overkill, but re-use of an existing mechanism brings two very important advantages:

  • It re-uses code in the translation engine, helping to keep its size and complexity to a minimum.
  • It re-uses BabelCode syntax, meaning that there is less to learn.

For this reason I would like the mechanisms to remain unified if at all possible (meaning that if any improvements are necessary, they should apply equally to both inflection and compounding).

One obstacle to unification was the fact that inflectional rules are designed to operate on a single word, and are not by themselves able to join two words together. My first solution to this problem was for the substitution chain to contain pairs of regular expression patterns. Both patterns would have to match for the rule to apply, and both would be able to supply substrings for use in the replacement string. This would have worked, but would have required substantial changes to the syntax and would not have achieved my objective of code reuse.

What I’ve done is therefore to join the components together before the mangling takes place, but separated by a plus sign in the first instance so that the regular expression can see where the boundary is located. (A plus sign is the character customarily used to represent a morpheme boundary.) Afterwards, the plus sign is deleted.
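The join-then-mangle idea can be sketched in Python as below. This is not BabelCode syntax and not the engine's actual implementation: the function `compound`, the representation of rules as (pattern, replacement) pairs, and the Swedish rule itself are assumptions chosen to reproduce the ‘ettusen’ example.

```python
import re

BOUNDARY = "+"  # customary morpheme-boundary marker


def compound(left, right, rules=()):
    """Join two components, apply regex rules that can see the boundary,
    then delete the boundary marker (illustrative sketch only)."""
    joined = left + BOUNDARY + right
    for pattern, replacement in rules:
        joined = re.sub(pattern, replacement, joined)
    return joined.replace(BOUNDARY, "")


# Hypothetical rule: drop one 't' where 'tt' meets 't' across the boundary.
SWEDISH_RULES = [(r"tt\+t", "t+t")]

print(compound("ett", "tusen", SWEDISH_RULES))  # ettusen
print(compound("acht", "zehn"))                 # achtzehn
```

Because the boundary is an ordinary character in the joined string, a single pattern can constrain both sides of it, which is what the paired-pattern design would otherwise have had to provide.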

Compounding rules are distinguished from normal inflection rules by replacing the keyword inflection with either prefix or suffix. Compounding of morphemes A and B to form AB causes the prefix rules associated with A and suffix rules associated with B to be applied. In the more complex case where AB is compounded with CD, it is the prefix rules of B and the suffix rules of C which are applied.
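A minimal sketch of that selection logic, assuming each compound is held as a list of morphemes and the rules as a dictionary keyed by morpheme (both representations are invented for this example):

```python
def boundary_rules(left, right, rules):
    """Pick the rules that apply where two compounds meet.

    left, right: lists of morphemes, e.g. ["A", "B"] and ["C", "D"].
    rules: maps a morpheme to {"prefix": [...], "suffix": [...]}.
    Returns the prefix rules of the last morpheme on the left and the
    suffix rules of the first morpheme on the right.
    """
    prefix_owner = left[-1]
    suffix_owner = right[0]
    return (
        rules.get(prefix_owner, {}).get("prefix", []),
        rules.get(suffix_owner, {}).get("suffix", []),
    )


# Placeholder rule names, purely for demonstration.
RULES = {
    "A": {"prefix": ["A-prefix-rule"]},
    "B": {"prefix": ["B-prefix-rule"], "suffix": ["B-suffix-rule"]},
    "C": {"suffix": ["C-suffix-rule"]},
}

print(boundary_rules(["A"], ["B"], RULES))
print(boundary_rules(["A", "B"], ["C", "D"], RULES))
```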

Implementing this arrangement is not entirely trivial. The inflection process begins with an uninflected morpheme, so it can easily be looked up in the language definition to see which rules are applicable. The compounding process begins with morphemes which may have been inflected, and may already have been partially compounded, so a lookup at that stage would not work. The way I have solved this is to preserve a copy of the uninflected text for use during the compounding phase. Inflection does not change the structure of the text, so it is easy enough to navigate the two structures in parallel in order to match up pre- and post-inflected forms.

Agglutination and Hyphenation (part 1)

Monday, March 9th, 2009

The number systems of English, French, Spanish and Portuguese are largely analytic in nature, meaning that they draw on a small, fixed set of words and do not create new ones. (By ‘word’ I mean a sequence of letters with no intervening spaces or punctuation marks.) There are exceptions, for example the English words ‘thirteen’ through to ‘nineteen’ and ‘twenty’ through to ‘ninety’ can be decomposed further, but the important point is that the number of words is small enough to be enumerated.

This is not the case for many other languages. For example, in German, every non-negative number up to one million is represented by a unique word. Some of these, such as:

neunhundertneunundneunzigtausendneunhundertneunundneunzig (999,999)

are very long, but they are formed according to a regular pattern from a small, fixed set of morphemes. In fact, the pattern is broadly similar to the ones found in English, French, Spanish and Portuguese: it is only the process of concatenation which makes the result appear so very different. Languages which do this are called either ‘agglutinative’ or ‘fusional’, depending on the amount of mangling needed to join the words. German performs little mangling, so is towards the agglutinative end of the spectrum.

In order to support these word-formation processes, a compounding mechanism has now been implemented within the translation system. It is invoked by adding the tag ‘agglutinate’ to the phrase that is to be merged into a single word. For example:

(acht zehn)[agglutinate]

would be rendered as ‘achtzehn’.

Currently one instance of this tag at the root of a subtree will affect every word beneath it. I’m unsure whether this is the best behaviour - it would be possible for the tag to merge only the rightmost node of the left subtree and the leftmost node of the right subtree - but I’m going to leave it that way unless and until I see evidence that more flexibility is needed. My justification for the current rule is that if compounding cannot follow the phrase hierarchy then the hierarchy is wrong.

In the interests of economy, hyphenation is performed using the same mechanism but a different tag: ‘hyphenate’. I’m open to the possibility of adding further methods of compounding using other symbols, and there may even be a case for compounding methods to be defined in the language definition file (which would avoid the need for ‘agglutinate’ and ‘hyphenate’ to be hard-coded in the translation engine).
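The current behaviour, where one tag at the root affects every word beneath it, can be illustrated with a short Python sketch. The tree representation (nested tuples of strings), the `render` function and the choice to separate untagged words with spaces are all assumptions made for this example.

```python
SEPARATORS = {"agglutinate": "", "hyphenate": "-"}


def leaves(tree):
    """Yield every word beneath a node, left to right."""
    if isinstance(tree, str):
        yield tree
    else:
        for child in tree:
            yield from leaves(child)


def render(tree, tag=None):
    """Merge all words in the subtree using the separator implied by the tag.

    Untagged phrases are joined with spaces (an assumption of this sketch).
    """
    separator = SEPARATORS.get(tag, " ")
    return separator.join(leaves(tree))


print(render(("acht", "zehn"), "agglutinate"))
# achtzehn
print(render((("neun", "hundert"), ("neun", "und", "neunzig")), "agglutinate"))
# neunhundertneunundneunzig
print(render(("twenty", "one"), "hyphenate"))
# twenty-one
```

Keeping the separators in a table rather than in the code hints at how compounding methods might eventually be moved into the language definition file.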

Compounding is performed after the inflection phase so that individual morphemes can be inflected if required. Typically it is only the final morpheme that is subject to inflection, but I think there are enough exceptions to this rule not to want to hard-code it (especially when considering hyphenation as well as agglutination). If it makes a difference, the inflectional rules can be told in advance which morphemes are going to be compounded - I would rather do that than have to write inflectional rules able to handle compound words.

Design considerations for compiled language definitions

Friday, March 6th, 2009

As noted previously, one of the options for improving efficiency is to introduce a compilation phase whereby language definition files are converted into a binary format that can be loaded into memory much more quickly than they are at present, and which is suitable for sharing between processes. I’m now fully convinced that this is the right way forward, partly because it is a less intrusive solution than a daemon, but mostly because I think a long delay while loading would be undesirable even if it were a one-off act at boot time.

This will require quite radical changes to the structure of the library, to the extent that it would probably qualify as a rewrite. It will also make future changes to the library more cumbersome to implement. For that reason I think it would be a mistake to introduce compilation right now: the library is evolving too quickly, and I don’t want to do anything that would make experimentation more difficult. What I do have are some initial thoughts about how compilation could be implemented.

I identified mmap and dlopen as possible mechanisms for sharing the compiled data between tasks. On reflection, I can’t see any real advantage to using dlopen unless the compiled language definitions are to contain directly-executable code. Although this probably would allow some performance gains, it is a level of complexity beyond what I envisioned and I hope it won’t be necessary. dlopen does have significant disadvantages in terms of portability and flexibility. In particular, a solution based on mmap could be very easily adapted to other methods for loading data into memory (or even using data stored in ROM, in the case of an embedded system). Shared libraries require much more infrastructure to be present.

The next choice is whether the compiled files should be architecture-dependent or -independent. The latter would allow the compiled files to be freely moved between machines, but at a cost: it would not be possible to map the data directly onto native data structures, and they would instead have to be accessed through a translation layer to handle issues such as endianness and alignment. I don’t believe this is a price worth paying because I can’t see any compelling need for a portable file format.

(Note this does not mean that the files absolutely have to be compiled on the machine on which they are to be used. For example, a GNU/Linux distribution could compile them once for each architecture then distribute them alongside other binaries. They would need to be updated on an ABI change, but then so would most other binaries.)

Key factors that will influence the design of the file format are that:

  • It needs to be position-independent, because mmap cannot be relied upon to load it at any fixed memory address. This means that it cannot contain ordinary pointers (but could contain offsets from the start of the file which would serve the same purpose).
  • Once created the files do not need to be modifiable. This will influence the data structures used. For example, a C++ std::map would typically be implemented as a balanced tree, but if the data is fixed then it would be simpler and more efficient to use an ordered array.
  • Related data items will benefit from being stored in close proximity to each other, in order to minimise the number of pages that need to be pulled into RAM.
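To make the first two points concrete, here is a minimal Python sketch of a position-independent, read-only table: a sorted array of offsets into a string pool, searched by bisection. The file layout, the names `compile_table` and `lookup`, and the sample data are all invented for illustration; the real format would be designed around the library's own data structures. Native byte order is used, in keeping with the decision not to aim for a portable format.

```python
import mmap
import struct
import tempfile

HEADER = struct.Struct("=I")   # number of entries
ENTRY = struct.Struct("=II")   # (key offset, value offset) from start of file


def compile_table(table):
    """Serialise a str -> str mapping as a sorted offset array plus string pool."""
    keys = sorted(table, key=lambda k: k.encode("utf-8"))
    pool_start = HEADER.size + ENTRY.size * len(keys)
    pool = bytearray()
    entries = []
    for key in keys:
        offsets = []
        for text in (key, table[key]):
            offsets.append(pool_start + len(pool))
            pool += text.encode("utf-8") + b"\0"  # key and value stored adjacently
        entries.append(ENTRY.pack(*offsets))
    return HEADER.pack(len(keys)) + b"".join(entries) + bytes(pool)


def read_string(buf, offset):
    """Read a NUL-terminated byte string starting at the given offset."""
    end = buf.find(b"\0", offset)
    return bytes(buf[offset:end])


def lookup(buf, key):
    """Binary search over the ordered array; no pointers, only offsets."""
    target = key.encode("utf-8")
    (count,) = HEADER.unpack_from(buf, 0)
    lo, hi = 0, count
    while lo < hi:
        mid = (lo + hi) // 2
        key_off, value_off = ENTRY.unpack_from(buf, HEADER.size + mid * ENTRY.size)
        candidate = read_string(buf, key_off)
        if candidate == target:
            return read_string(buf, value_off).decode("utf-8")
        if candidate < target:
            lo = mid + 1
        else:
            hi = mid
    return None


if __name__ == "__main__":
    with tempfile.TemporaryFile() as f:
        f.write(compile_table({"1": "uno", "2": "dos", "3": "tres"}))
        f.flush()
        # The mapping may land at any address; offsets remain valid regardless.
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as view:
            print(lookup(view, "2"))
```

The same `lookup` works on any bytes-like buffer, which mirrors the point that an mmap-based design adapts easily to other ways of getting the data into memory.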

The API will need changes too, because I certainly wouldn’t want to create a separate, long-lived object to proxy every item of data in the compiled language definition. In some cases it will be possible to take a pointer to the data and cast it into an object, requiring no extra memory to be allocated, but this won’t work if virtual functions or pointers are needed. A third possibility is for there to be short-lived proxy objects with pointer- instead of value-semantics. None of these options is ideal by itself, and my expectation is that some combination of them will be needed.

Generating numbers in Spanish and Portuguese

Monday, March 2nd, 2009

Spanish and Portuguese are closely-related languages, and it is instructive to compare how their number systems work. Both are fairly regular decimal systems, using words that are for the most part phonologically similar. However they are sufficiently different - both orthographically and syntactically - for there to be little or no usable commonality between their respective implementations within the translation system (or at least, no more so than there is, for example, between Spanish and English).

Spanish has single words for all numbers up to 30, after which additive compounds are used. In Portuguese, single words stop at 20. In both cases, the words used for the numbers 16 and upwards are highly regular in form (but currently no attempt is made to exploit that regularity). Both languages also use single words for multiples of 100 up to 1000. Larger values are expressed as multiples of a thousand, million, billion or upwards.

The rules for additive combination are noticeably different:

  • In Spanish, where tens and units are added together this is done using the word ‘y’ (‘and’). In all other cases the components are concatenated.
  • In Portuguese, tens and units are always added to other components using ‘e’ (‘and’). Also, the final two components of the number are always added using ‘e’. In other cases the components are concatenated.

Some examples to illustrate how these rules differ from each other and from English:

  • 101 = ‘ciento uno’ (es), ‘cento e um’ (pt), ‘one hundred and one’ (en)
  • 199 = ‘ciento noventa y nueve’ (es), ‘cento e noventa e nove’ (pt), ‘one hundred and ninety nine’ (en)
  • 1100 = ‘mil cien’ (es), ‘mil e cem’ (pt), ‘one thousand one hundred’ (en)
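The two sets of joining rules can be compared side by side in a short Python sketch. It models only the choice of connective: the caller supplies the component words and labels each as thousands, hundreds, tens or units, so word selection (including the short and long forms of 100) is not handled. The function names and the component representation are assumptions made for this example, not the translation system's rule syntax.

```python
def join_es(components):
    """Spanish: 'y' only between tens and units; otherwise concatenate."""
    words = []
    for i, (word, kind) in enumerate(components):
        if i > 0 and kind == "units" and components[i - 1][1] == "tens":
            words.append("y")
        words.append(word)
    return " ".join(words)


def join_pt(components):
    """Portuguese: 'e' before any tens or units component, and always
    between the final two components; otherwise concatenate."""
    words = []
    last = len(components) - 1
    for i, (word, kind) in enumerate(components):
        if i > 0 and (kind in ("tens", "units") or i == last):
            words.append("e")
        words.append(word)
    return " ".join(words)


print(join_es([("ciento", "hundreds"), ("noventa", "tens"), ("nueve", "units")]))
# ciento noventa y nueve
print(join_pt([("cento", "hundreds"), ("noventa", "tens"), ("nove", "units")]))
# cento e noventa e nove
print(join_pt([("mil", "thousands"), ("cem", "hundreds")]))
# mil e cem
```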

Both languages have feminine forms for some numbers, which agree with the noun being counted if there is one. Also in both languages the number 100 has two forms: a shorter one for when it acts directly on a noun (including nouns such as million or billion), and a longer one for when it does not. Neither of these language features has been fully implemented yet; however, the latter is present to the extent necessary to support simple counting.

I’ve not encountered any use of hyphens for expressing cardinal numbers in either language.

Both Spanish and (European) Portuguese use the ‘long scale’ for large numbers, so ‘billón’ (es) and ‘bilião’ (pt) translate to 10^12 as opposed to 10^9. Note, however, that Brazilian Portuguese uses the ‘short scale’. This will require a slightly different set of decomposition rules, a fact to be borne in mind when support for dialects is introduced.
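The difference between the two scales reduces to a simple formula, sketched here in Python. The function name `illion` and its arguments are invented for the example; n counts the ‘-illion’ words, so 1 is a million and 2 is a billion.

```python
def illion(n, scale):
    """Value of the n-th '-illion' word (1 = million, 2 = billion, ...)."""
    if scale == "long":
        return 10 ** (6 * n)       # each step multiplies by a million
    if scale == "short":
        return 10 ** (3 * n + 3)   # each step multiplies by a thousand
    raise ValueError(f"unknown scale: {scale!r}")


print(illion(2, "long"))   # 1000000000000
print(illion(2, "short"))  # 1000000000
```

Both scales agree on the value of a million and diverge from a billion upwards, which is why only the rules for larger numbers would need to differ between dialects.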

It’s worth saying that definitive rules have been difficult to come by in some cases, and a number of conflicting examples have been seen. (It doesn’t help that, as in English, large numbers are rarely written out in full unless they are round ones.) I’ve done my best to identify and implement a defensible set of rules, but I’m open to correction by those more familiar with these languages. (The same applies to any other language or topic.)

Enhanced decomposition rules

Sunday, March 1st, 2009

The numerical decomposition rules that I described previously have now been used to generate cardinal numbers in English, French, Spanish and Portuguese, and so far have worked well. Italian and Welsh have been looked at, and in both cases the decomposition phase is expected to be straightforward, but there are other issues (compounding and mutation respectively) which will need to be addressed before a full implementation is attempted.

Where I have run into difficulty is the translation of ordinal numbers. Some method is needed to distinguish ordinals from cardinals, and I intend to use a predicate (provisionally called ordinal) which takes a cardinal as its argument and modifies it to yield the corresponding ordinal. Unfortunately, numerical decomposition rules cannot currently ‘see’ this modifier so they cannot behave differently for ordinals and cardinals.

I did find one way around this problem, but it involves a certain amount of cheating. The words ‘first’, ‘second’, ‘third’ and upwards would be given readings of 1, 2 and 3 just like their cardinal counterparts, but for a category called pre-ordinal. The predicate ordinal would map to a made-up placeholder, which would participate in part-of-speech selection but afterwards be deleted. There would be a production rule in which the placeholder acts on the category pre-ordinal to give the category ordinal. If this is the only production rule which consumes a pre-ordinal then it follows that the words ‘first’, ‘second’ and ‘third’ will only be chosen when ordinals have been called for in the source text. Conversely, if this is the only production rule matching the placeholder, then the words ‘one’, ‘two’ and ‘three’ will not be chosen unless ordinals have been called for.

Cunning though this plan might be, I’m not keen on it for two reasons. Firstly, it assumes that ordinals and cardinals will always be decomposed in the same way. That appears to be true of English, and could well be true for most languages, but I would prefer not to hard-code assumptions of this nature into the translation system. Secondly, it involves declaring that ‘one’ and ‘first’ are merely different parts of speech with the same semantic content. That is not true: what ought to happen is for ‘one’ to mean ‘1’ and ‘first’ to mean ‘(ordinal 1)’. While this can’t be done using the existing numerical decomposition rule syntax, I said at the time they might need to be extended and that is what I now propose to do.

Clearly decomposition rules need to be made conditional on something other than the value of the number. Tags are an obvious possibility, but as previously noted, few if any tags will have been set at this stage of the translation process. One answer to that would be to provide an opportunity to set some tags before the decomposition rules are applied. An extra translation phase would be needed for this (probably very similar to the existing agreement phase).

Alternatively, decomposition rules could incorporate a pattern in much the same way that transformation and agreement rules do already. This would allow them to take account not just of the number to be decomposed, but of nearby structure too. For example, an English ordinal could be decomposed into tens and units with the rule:

decomposition (ordinal $x) { ((eval:ge 10) $x) } = ((internal:add (ordinal ((eval:mod 10) $x))) ((eval:mul 10) ((eval:div 10) $x)))

Application needs to start at the root of the tree and work downwards in order for this rule to take priority over the corresponding one for cardinals. All subtrees will need to be checked, not merely those consisting of a number, so the decomposition phase will become more computationally expensive (but no more so than the transformation and agreement phases are already). In the first instance I’m not going to require that variables be numbers (although it would be more efficient to perform this check within the pattern rather than waiting for non-numbers to fail the list of conditions).
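For readers who find the rule syntax unfamiliar, the effect of the example rule and of top-down application can be approximated in Python. Trees are represented here as nested tuples and the curried applications are flattened, both of which are simplifications made for this sketch; only the single ordinal rule is modelled, so plain cardinals pass through unchanged.

```python
def decompose(tree):
    """Apply the example ordinal rule, starting at the root and working down."""
    # Pattern: (ordinal $x), with the condition $x >= 10.
    if (isinstance(tree, tuple) and len(tree) == 2
            and tree[0] == "ordinal"
            and isinstance(tree[1], int) and tree[1] >= 10):
        x = tree[1]
        # (internal:add (ordinal (x mod 10)) (10 * (x div 10)))
        return ("internal:add", ("ordinal", x % 10), (x // 10) * 10)
    # No match at this node: check every subtree, not only numbers.
    if isinstance(tree, tuple):
        return tuple(decompose(child) for child in tree)
    return tree


print(decompose(("ordinal", 42)))
# ('internal:add', ('ordinal', 2), 40)
```

Because the ordinal pattern is tested at each node before its children are visited, the number inside `(ordinal 42)` is never offered to a cardinal rule on its own, which is the priority ordering described above.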

Although matching on a pattern undoubtedly counts as a significant extension of the rule syntax (and is not therefore to be done lightly), I am comfortable with the idea because it follows a path that is already well-trodden (and consequently requires very little new code). It is certainly less intrusive than an entirely new translation phase, which in any case would need to support very similar functionality.

(Indeed, a case could be made that it is actually a simplification, by making numerical decomposition more like other rule types. A topic I may investigate in the future is whether there is any further scope for convergence between rule types.)