Archive for October, 2009

Fallbacks

Thursday, October 22nd, 2009

One of the main challenges that the translation system must address is how to behave when there is no word in the target language to express one of the concepts requested. I intend to address this by providing fallback translations which are available across all languages. There will be two types of fallback:

  • Fallbacks to a description. For example, an ‘aeroplane’ might be described as a ‘flying machine’, or a ‘desk’ as a ‘writing table’. These are not perfect replacements, but in a language with limited vocabulary there may be no better way to convey the required meaning.
  • Fallbacks to a reading. Sometimes it is better to borrow a foreign word than to attempt any form of translation. Good examples of this would be animal names such as ‘kangaroo’ or ‘meerkat’ (which made their way into English through this mechanism). Any useful description would be unreasonably long, and falling back to ‘marsupial’ or ‘mongoose’ is unlikely to be helpful.

It is also worth mentioning one other type of construct which, though not strictly a fallback, has similar behaviour in that it can be translated without the need for an explicit reading:

  • Aliases. These are predicates which can be exactly expressed in terms of other, more primitive predicates. The alias exists only as a convenience for those writing source texts and language descriptions. Any analysis occurs after decomposition into primitives.

The natural place for descriptive fallbacks to be specified is in the corresponding predicate definition. (One consequence of this is that descriptive fallbacks will only be able to replace atomic predicates, not compounds. I think this is a reasonable restriction.) I’ve chosen the following syntax:

predicate vehicle:aeroplane
{
  fallback ((for-purpose-of flying) machine);
};

The intended behaviour is straightforward: if no reading is found then the predicate is replaced by the fallback expression and an attempt made to translate that. I would like to allow multiple fallbacks (to be tried sequentially until one translates successfully) provided that this is not overly difficult to implement.
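The lookup order described above might be sketched as follows. This is a minimal Python illustration, not the translation system's actual code: the `translate` function, the nested-tuple representation of expressions, and the `readings` and `fallbacks` tables are all assumptions made for the example, and joining words with spaces stands in for real grammar.

```python
from typing import Dict, FrozenSet, List, Optional, Tuple, Union

# Hypothetical representation: an expression is either an atomic predicate
# name or a tuple of sub-expressions.
Expr = Union[str, Tuple["Expr", ...]]


def translate(
    expr: Expr,
    readings: Dict[str, str],
    fallbacks: Dict[str, List[Expr]],
    seen: FrozenSet[str] = frozenset(),
) -> Optional[str]:
    """Return a translation of expr, or None if nothing can be produced."""
    if isinstance(expr, tuple):
        parts = [translate(e, readings, fallbacks, seen) for e in expr]
        if any(p is None for p in parts):
            return None
        # Naive stand-in for real grammar: join the parts with spaces.
        return " ".join(parts)
    if expr in readings:
        return readings[expr]
    if expr in seen:
        return None  # guard against fallbacks that refer back to themselves
    # Try each fallback in turn until one translates successfully.
    for candidate in fallbacks.get(expr, []):
        result = translate(candidate, readings, fallbacks, seen | {expr})
        if result is not None:
            return result
    return None


# Hypothetical target language with no reading for 'ship' or 'aeroplane'.
readings = {"flying": "flying", "machine": "machine"}
fallbacks = {"vehicle:aeroplane": [("flying", "ship"), ("flying", "machine")]}
result = translate("vehicle:aeroplane", readings, fallbacks)
```

Here the first fallback fails because 'ship' has no reading, so the second is used. A predicate with neither a reading nor a usable fallback yields `None`.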

Originally I had intended to make fallback readings part of the predicate definition too, but eventually decided that there is no need, since ordinary readings already provide all of the functionality required:

reading meerkat[noun] = zoo:species:suricata:suricatta;

Should an attempt be made to classify fallback readings into parts of speech (as in the example above), or should they simply be tagged as ‘foreign words’ which are somehow outside the grammar of the target language? My expectation is that most fallback readings will be noun-like, but if there are differences then it must surely be useful for the target language to know about them (and if there are none then ‘foreign word’ is simply an alias for noun).

This will make it necessary for parts of speech used in fallback readings to be standardised across languages, so French would use tags such as ‘noun’ and ‘adjective’ as opposed to ‘substantif’ and ‘adjectif’. Fortunately I’ve been doing this anyway (albeit largely for my own convenience rather than as part of any grand plan). The same will apply to other types of tag such as ‘singular’ and ‘plural’.

Should there be any attempt to create default inflection rules for fallback readings? I think probably yes. When words are borrowed from one language to another they often retain their original morphology in the first instance. At the very least it can’t do any harm to know which language the fallback was taken from. If a language wants to follow its own rules regardless of this information then it can override the default.

I’m aware that different languages have different parts of speech, and that while most languages have words which pass for nouns, adjectives, verbs and adverbs, that does not mean they have the same semantics or behaviour as English nouns, adjectives, verbs and adverbs. I also appreciate that inflecting completely alien words could prove to be difficult in some languages (although knowing that they are alien should help). However this is about producing something when the alternative is to provide nothing, so perfection is not a requirement.

Should languages omit readings if there is a suitable fallback reading? That’s a tricky question. On the one hand, to say no results in a large amount of duplication within the language description files. This is surely undesirable. On the other hand I do think that lists of differences will be more difficult to write, check and maintain, and that automatic propagation of changes is not desirable in the way that it is between dialects.

A good example of this is chemical elements in Danish. There is very nearly a 50-50 split between native and international names, so any saving would not be of the same order as (for example) British versus US English where there are only three differences. If a full list is given then it is possible to check that every element has been considered, whereas a partial list cannot be checked for completeness without redoing much of the research used to compile it in the first place.

For these reasons I’m minded to stick with full lists for the time being. However when compilation is introduced it would certainly be possible for any redundant readings to be automatically removed by the compiler, and I see no harm in that. Finally, I would only intend to duplicate readings for words which have been or are being assimilated into the language in question. Where a language has no word for a concept, the fallback will not be duplicated.
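The automatic removal of redundant readings could be as simple as the following sketch. It assumes, purely for illustration, that readings can be reduced to a mapping from predicate name to a single word; the function name and the `element:` predicate names are hypothetical.

```python
from typing import Dict


def strip_redundant_readings(
    language: Dict[str, str], fallback: Dict[str, str]
) -> Dict[str, str]:
    """Keep only the readings that differ from the fallback reading."""
    return {
        meaning: word
        for meaning, word in language.items()
        if fallback.get(meaning) != word
    }


# The source file keeps the full list; only the compiled form is reduced.
fallback = {"element:hydrogen": "hydrogen", "element:helium": "helium"}
danish = {"element:hydrogen": "brint", "element:helium": "helium"}
compiled = strip_redundant_readings(danish, fallback)
```

The full list remains in the source file, where it can be checked for completeness, while the compiled form retains only the entry for 'brint'.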

Dialects (Inheritance)

Saturday, October 17th, 2009

There are several decisions to be made regarding implementation of the inheritance mechanism. First and most obviously, there needs to be some form of command to indicate that one language is derived from another. I considered two alternatives for the syntax:

  • a Java-style extends keyword, which would form part of a header at the start of the language description, or
  • a Perl-style use statement, which would be placed in the body of the language description.

The problem with a use statement is that the natural expectation would be for it to be allowed anywhere in the language description: later declarations would override, earlier ones would be overridden. This is easy enough to implement: create a master-index (for each declaration type) listing all of the declarations that have not been overridden. The only snag is that I might not want to generate such an index (if optimising for start-up time and/or memory usage), in which case implementation becomes much more fiddly. I could live with this if there were a compelling need for mid-file inclusion, but I’m not convinced there is one.
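A master-index of this kind might look something like the sketch below. The `(kind, name, body)` representation of a declaration and the function name are assumptions made for the example; the list is taken to be in file order, with any included file already expanded at the point of its use statement.

```python
from typing import Dict, List, Tuple

# Hypothetical declaration record: (kind, name, body).
Declaration = Tuple[str, str, str]


def build_master_index(
    declarations: List[Declaration],
) -> Dict[str, Dict[str, str]]:
    """Build one index per declaration type; later entries win."""
    index: Dict[str, Dict[str, str]] = {}
    for kind, name, body in declarations:
        # Assigning by name discards any earlier declaration of that name.
        index.setdefault(kind, {})[name] = body
    return index


declarations = [
    ("paradigm", "noun", "from included file"),
    ("paradigm", "verb", "from included file"),
    ("paradigm", "noun", "local redefinition"),
]
index = build_master_index(declarations)
```

The cost is that the whole index must be built and held in memory before any lookup can be answered, which is the snag mentioned above.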

That leaves the extends keyword. This already exists and is used for specifying inheritance relationships between morphological paradigms, so using it between languages would be a natural extension. It forces the point of inclusion to be at the start of the language description, so avoids the implementation issue described above. The specific syntax I have in mind is of the form:

language "en_US" extends "en";

Should multiple inheritance be supported? I don’t see any reason why not, but I can’t say I have any particular use in mind for it either. On that basis I’ve decided to allow only single inheritance in the first instance, but keep open the option of adding multiple inheritance later if needed.

Having established a hierarchy, it is then necessary to decide in detail how parent and child languages interact with each other. The general rules are that:

  • Everything defined in the parent is imported into the child.
  • Where search order is important (as is the case when objects are referred to by name), the child is searched first.
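These two rules can be illustrated with a short Python sketch. The `Language` class, its `paradigms` table and the paradigm names are hypothetical; the point is only that a lookup by name walks from the child up through its single parent chain and stops at the first hit.

```python
from typing import Dict, Optional


class Language:
    """Hypothetical language description with single inheritance."""

    def __init__(self, name: str, parent: Optional["Language"] = None) -> None:
        self.name = name
        self.parent = parent
        self.paradigms: Dict[str, str] = {}

    def lookup_paradigm(self, name: str) -> Optional[str]:
        # Search the child first, then each ancestor in turn.
        lang: Optional[Language] = self
        while lang is not None:
            if name in lang.paradigms:
                return lang.paradigms[name]
            lang = lang.parent
        return None


en = Language("en")
en.paradigms["paradigm-a"] = "parent definition of a"
en.paradigms["paradigm-b"] = "parent definition of b"

en_us = Language("en_US", parent=en)
en_us.paradigms["paradigm-a"] = "child definition of a"

found_a = en_us.lookup_paradigm("paradigm-a")
found_b = en_us.lookup_paradigm("paradigm-b")
```

Everything in the parent remains visible from the child unless the child defines something of the same name. The parent itself is unaffected.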

The specific effect on each type of object is currently as follows:

  • Tags could be made to override earlier definitions, but I can’t think of any good reason why you would want to do this: the effect on the parent language would likely be both profound and undesirable. I’ve therefore forbidden tag definitions which conflict with an existing name.
  • Predicate definitions override any previous definition of the same name. This is potentially quite awkward to implement, because they are indexed both by name and by meaning. If the meaning of a predicate is overridden then searches of the latter index do not necessarily yield the correct result. Unlike tags, predicates definitely need to be able to override each other, but their meanings probably don’t. I’ve therefore forbidden predicate definitions which alter the meaning of an existing predicate, but allowed the fallback and any other attributes to be changed.
  • Paradigms and morpheme definitions override any previous definition of the same name. These are indexed only by name, so implementation is straightforward.
  • Grammatical productions are effectively cumulative. The dialect is searched before the parent language, but since there is currently no way to indicate that a production does not occur, it is not possible to override in any meaningful way.
  • Decomposition rules are searched in the same order as productions, but only the first match is applied. This makes overriding possible in most cases, but because decompositions are applied recursively it requires significant trickery to turn one off entirely. (Replacing the pattern by itself won’t work.)
  • Agreement rules are searched in the same order again, but all matching rules are applied. This makes overriding difficult or impossible.
  • Transformations behave in a similar way to decomposition rules, but since they are not applied recursively to any single point in the text they are somewhat easier to override.
  • Readings don’t override each other within a given language description file, but they do override any readings with the same meaning in any parent language. (Coincidentally, this is very similar to how inheritance and overloading interact with each other in C++.) There is some potential for unexpected behaviour, but I think the alternatives would be worse.
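The behaviour of readings is perhaps the least obvious item in this list, so here is a minimal Python sketch of it. It assumes each language description can be reduced to a mapping from meaning to a list of words, supplied child first; the function name and the meaning identifiers are hypothetical.

```python
from typing import Dict, List


def effective_readings(
    meaning: str, chain: List[Dict[str, List[str]]]
) -> List[str]:
    """Return the readings for a meaning, given descriptions child first."""
    for description in chain:
        words = description.get(meaning)
        if words:
            # Readings in one file coexist, but they hide every reading
            # with the same meaning further up the chain.
            return list(words)
    return []


en = {
    "season:autumn": ["autumn"],
    "element:aluminium": ["aluminium"],
    "element:helium": ["helium"],
}
en_us = {
    "season:autumn": ["fall", "autumn"],
    "element:aluminium": ["aluminum"],
}

autumn = effective_readings("season:autumn", [en_us, en])
aluminium = effective_readings("element:aluminium", [en_us, en])
helium = effective_readings("element:helium", [en_us, en])
```

Note that the child has to restate 'autumn' if it wants to keep it alongside 'fall': once a child supplies any reading for a meaning, none of the parent's readings for that meaning are inherited. This is the potential for unexpected behaviour mentioned above.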

Of these I would consider the behaviour of tags, predicates and readings to be satisfactory, but intend to look further at decomposition rules, productions, agreement rules and transformations. I think it is clear that inheritance works best with objects that have names. I’m reluctant to give every rule a name, but grouping them together into named ‘translation phases’ would probably suffice (and bring other benefits too).
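The difference between first-match and all-match rules can be seen in a small Python sketch. The rule representation (a predicate paired with a label) and the function names are assumptions for illustration only, and the recursive application of decomposition rules is not modelled.

```python
from typing import Callable, List, Tuple

# Hypothetical rule: a test on the input plus a label naming the action.
Rule = Tuple[Callable[[str], bool], str]


def search_order(child: List[Rule], parent: List[Rule]) -> List[Rule]:
    """The dialect's rules are searched before those of the parent."""
    return list(child) + list(parent)


def apply_first_match(rules: List[Rule], item: str) -> List[str]:
    """Decomposition-style: only the first matching rule is applied."""
    for matches, action in rules:
        if matches(item):
            return [action]
    return []


def apply_all_matches(rules: List[Rule], item: str) -> List[str]:
    """Agreement-style: every matching rule is applied."""
    return [action for matches, action in rules if matches(item)]


parent_rules: List[Rule] = [(lambda s: s.startswith("x"), "parent rule")]
child_rules: List[Rule] = [(lambda s: s.startswith("x"), "child rule")]
rules = search_order(child_rules, parent_rules)

first = apply_first_match(rules, "xyz")
every = apply_all_matches(rules, "xyz")
```

With first-match semantics the child rule shadows the parent rule, whereas with all-match semantics the parent rule still fires and the child has no way to suppress it.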

Finally, a word about when inheritance should be used. The answer is: whenever it makes good engineering sense to do so. It does not matter whether linguists would classify the difference as one of language or dialect, or the manner in which it evolved historically. The most important considerations are:

  • whether (and to what extent) inheritance reduces duplication, and
  • whether it makes sense for changes to the parent language to propagate into the child.

Update 2009-10-18: I’ve now implemented Brazilian Portuguese and American English using inheritance. In addition to some changed spellings, this includes the American style of writing numbers in hundreds up to 9999, and omitting the ‘and’ in compound numbers. I’m not completely convinced regarding the last of these points: some speakers are quite adamant that use of ‘and’ is wrong, but evidence of usage in reasonably formal contexts is not so conclusive. I’d welcome comment as to whether I’ve got this right or wrong.

Dialects

Wednesday, October 14th, 2009

Currently the translation system supports only one dialect of any given language, so English means British English and Portuguese means European Portuguese. The United States and Brazilian dialects of these languages are sufficiently different (and popular) that they certainly ought to be supported too, but similar enough that writing entirely separate language descriptions would result in an undesirable amount of duplication.

What is needed is a mechanism which allows language description files to share common data. This would bring two benefits:

  • a reduction in the amount of memory and disc space consumed;
  • automatic propagation of any corrections or improvements to a language to its dialects.

There are two ways in which sharing could be achieved:

  • by merging related dialects into a single language description file, then switching sections of that file in and out using some form of conditional notation; or
  • by requiring a separate language description file for each dialect, but allowing inheritance relationships between dialects such that only the differences need be specified.

Drawbacks of the first method are reduced modularity and readability. All dialects of a language would have to be loaded together, even if only one were needed. The language descriptions are likely to be quite complex enough handling one dialect, and if anything I would prefer to be looking at ways to break them down into smaller units rather than making them larger.

The main drawback of the second method is that it scales very poorly if the language can vary in several dimensions independently. For example, in Celtic languages the use of decimal versus vigesimal numbers is only loosely correlated with dialect and to a large extent is a matter of personal choice. You could write two language description files for each language, one for decimal and one for vigesimal, but then what happens when another issue is found where a similar choice is needed?

For these reasons I don’t think that either method provides a complete solution, so am inclined to implement both. This is not an unreasonable extravagance: many programming languages provide comparable facilities (such as #ifdef and #include in the C preprocessor).

Broadly speaking my intention is to use inheritance for regional dialects, and conditional rules for preferences which cut across those dialects. Inheritance is in the process of being implemented, and I will describe the syntax shortly. Conditional rules I don’t have a clear strategy for yet, but they are a less urgent requirement.