Dialects (Inheritance)
There are several decisions to be made regarding implementation of the inheritance mechanism. First and most obviously, there needs to be some form of command to indicate that one language is derived from another. I considered two alternatives for the syntax:
- a Java-style extends keyword, which would form part of a header at the start of the language description, or
- a Perl-style use statement, which would be placed in the body of the language description.
The problem with a use statement is that the natural expectation would be for it to be allowed anywhere in the language description: later declarations would override, earlier ones would be overridden. This is easy enough to implement: create a master index (one for each declaration type) listing all of the declarations that have not been overridden. The only snag is that I might not want to generate such an index (if optimising for start-up time and/or memory usage), in which case implementation becomes much more fiddly. I could live with this if there were a compelling need for mid-file inclusion, but I’m not convinced there is one.
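As a rough illustration of the master-index idea, here is a minimal Python sketch. It is not the actual implementation: the (kind, name, body) representation of a declaration, the function name build_master_index and the sample declarations are all assumptions made for the example. It assumes that a mid-file use statement has already been expanded in place, so that imported declarations appear in file order alongside local ones.

```python
from typing import Dict, List, Tuple

# Hypothetical representation: a declaration is a (kind, name, body) triple.
Declaration = Tuple[str, str, str]


def build_master_index(declarations: List[Declaration]) -> Dict[str, Dict[str, str]]:
    """Build one index per declaration kind.

    Declarations are processed in file order, so a later declaration
    with the same kind and name replaces an earlier one.
    """
    index: Dict[str, Dict[str, str]] = {}
    for kind, name, body in declarations:
        index.setdefault(kind, {})[name] = body
    return index


# File order, with a hypothetical mid-file `use` expanded in place.
declarations = [
    ("paradigm", "regular_noun", "local, declared before the use statement"),
    ("paradigm", "regular_noun", "imported by the use statement"),
    ("morpheme", "plural", "imported by the use statement"),
    ("morpheme", "plural", "local, declared after the use statement"),
]

index = build_master_index(declarations)
print(index["paradigm"]["regular_noun"])
print(index["morpheme"]["plural"])
```

The cost mentioned above is visible here: every declaration has to be visited and stored before anything can be looked up, which is exactly what one might want to avoid when optimising for start-up time or memory.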
That leaves the extends keyword. This already exists and is used for specifying inheritance relationships between morphological paradigms, so using it between languages would be a natural extension. It forces the point of inclusion to be at the start of the language description, so avoids the implementation issue described above. The specific syntax I have in mind is of the form:
language "en_US" extends "en";
Should multiple inheritance be supported? I don’t see any reason why not, but I can’t say I have any particular use in mind for it either. On that basis I’ve decided to allow only single inheritance in the first instance, but keep open the option of adding multiple inheritance later if needed.
Having established a hierarchy, it is then necessary to decide in detail how parent and child languages interact with each other. The general rules are that:
- Everything defined in the parent is imported into the child.
- Where search order is important (as is the case when objects are referred to by name), the child is searched first.
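These two general rules can be sketched in Python as a lookup that walks up a single-inheritance chain. This is only an illustration under assumed names: the Language class, its define and lookup methods, and the sample definitions are hypothetical and do not come from the real code.

```python
from typing import Dict, Optional


class Language:
    """A language description with at most one parent (single inheritance)."""

    def __init__(self, name: str, parent: Optional["Language"] = None) -> None:
        self.name = name
        self.parent = parent
        self.definitions: Dict[str, str] = {}

    def define(self, name: str, value: str) -> None:
        self.definitions[name] = value

    def lookup(self, name: str) -> Optional[str]:
        """Search this language first, then each ancestor in turn."""
        lang: Optional[Language] = self
        while lang is not None:
            if name in lang.definitions:
                return lang.definitions[name]
            lang = lang.parent
        return None


# Illustrative data only; the object names and values are made up.
en = Language("en")
en.define("colour_word", "colour")
en.define("centre_word", "centre")

en_us = Language("en_US", parent=en)
en_us.define("colour_word", "color")

print(en_us.lookup("colour_word"))  # found in the child
print(en_us.lookup("centre_word"))  # inherited from the parent
```

Nothing is copied from parent to child in this sketch: importing everything from the parent falls out of the fallback in lookup, and the child-first search order is what allows a named object to be overridden.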
The specific effect on each type of object is currently as follows:
- Tags could be made to override earlier definitions, but I can’t think of any good reason why you would want to do this: the effect on the parent language would likely be both profound and undesirable. I’ve therefore forbidden tag definitions which conflict with an existing name.
- Predicate definitions override any previous definition of the same name. This is potentially quite awkward to implement, because they are indexed both by name and by meaning. If the meaning of a predicate is overridden then searches of the latter index do not necessarily yield the correct result. Unlike tags, predicates definitely need to be able to override each other, but their meanings probably don’t. I’ve therefore forbidden predicate definitions which alter the meaning of an existing predicate, but allowed the fallback and any other attributes to be changed.
- Paradigms and morpheme definitions override any previous definition of the same name. These are indexed only by name, so implementation is straightforward.
- Grammatical productions are effectively cumulative. The dialect is searched before the parent language, but since there is currently no way to indicate that a production does not occur, it is not possible to override in any meaningful way.
- Decomposition rules are searched in the same order as productions, but only the first match is applied. This makes overriding possible in most cases, but because decompositions are applied recursively it requires significant trickery to turn one off entirely. (Replacing the pattern by itself won’t work.)
- Agreement rules are searched in the same order again, but all matching rules are applied. This makes overriding difficult or impossible.
- Transformations behave in a similar way to decomposition rules, but since they are not applied recursively to any single point in the text they are somewhat easier to override.
- Readings don’t override each other within a given language description file, but they do override any readings with the same meaning in any parent language. (Coincidentally, this is very similar to how inheritance and overloading interact with each other in C++.) There is some potential for unexpected behaviour, but I think the alternatives would be worse.
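The difference between the unnamed rule types comes down to how many matches are applied once the rules have been put in child-first order. The following Python sketch contrasts the two strategies. It is a simplification under stated assumptions: a rule is modelled as a function returning a rewritten string or None, all names are hypothetical, and the recursive application of decomposition rules is not modelled.

```python
from typing import Callable, List, Optional

# Hypothetical model: a rule returns rewritten text, or None if it does not match.
Rule = Callable[[str], Optional[str]]


def search_order(child_rules: List[Rule], parent_rules: List[Rule]) -> List[Rule]:
    """The dialect's rules are searched before those of the parent language."""
    return list(child_rules) + list(parent_rules)


def apply_first_match(rules: List[Rule], text: str) -> Optional[str]:
    """Decomposition-style: only the first matching rule is applied."""
    for rule in rules:
        result = rule(text)
        if result is not None:
            return result
    return None


def apply_all_matches(rules: List[Rule], text: str) -> List[str]:
    """Agreement-style: every matching rule is applied."""
    results = []
    for rule in rules:
        result = rule(text)
        if result is not None:
            results.append(result)
    return results


def parent_rule(text: str) -> Optional[str]:
    return "parent result" if text == "x" else None


def child_rule(text: str) -> Optional[str]:
    return "child result" if text == "x" else None


rules = search_order([child_rule], [parent_rule])
print(apply_first_match(rules, "x"))
print(apply_all_matches(rules, "x"))
```

With first-match semantics the child's rule shadows the parent's, which is why overriding usually works for decomposition rules. With all-match semantics the parent's rule still fires after the child's, which is why agreement rules are difficult or impossible to override.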
Of these I would consider the behaviour of tags, predicates and readings to be satisfactory, but intend to look further at decomposition rules, productions, agreement rules and transformations. I think it is clear that inheritance works best with objects that have names. I’m reluctant to give every rule a name, but grouping them together into named ‘translation phases’ would probably suffice (and bring other benefits too).
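For the named objects, the restrictions on tags and predicates described above can be sketched as checks made at definition time. This is a hedged illustration rather than the real code: the Predicate and Definitions classes, the DefinitionError exception and the sample predicate are all hypothetical, and a single Definitions object stands in for the combined view of parent and child.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Set


class DefinitionError(Exception):
    """Raised when a definition conflicts with an existing one."""


@dataclass
class Predicate:
    name: str
    meaning: str
    fallback: Optional[str] = None


class Definitions:
    def __init__(self) -> None:
        self.tags: Set[str] = set()
        self.predicates: Dict[str, Predicate] = {}

    def define_tag(self, name: str) -> None:
        # Tags may not be redefined at all.
        if name in self.tags:
            raise DefinitionError(f"tag {name!r} is already defined")
        self.tags.add(name)

    def define_predicate(self, pred: Predicate) -> None:
        # A predicate may be overridden, but its meaning may not change.
        existing = self.predicates.get(pred.name)
        if existing is not None and existing.meaning != pred.meaning:
            raise DefinitionError(f"predicate {pred.name!r} changes meaning")
        self.predicates[pred.name] = pred


defs = Definitions()
defs.define_tag("plural")
defs.define_predicate(Predicate("greet", meaning="GREETING", fallback="hello"))
# Allowed: same name and meaning, different fallback.
defs.define_predicate(Predicate("greet", meaning="GREETING", fallback="hi"))
print(defs.predicates["greet"].fallback)
```

Because the meaning of an existing predicate can never change, an index keyed by meaning would not need to be revised when a predicate is overridden, which avoids the awkwardness noted for predicates above.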
Finally, a word about when inheritance should be used. The answer is: whenever it makes good engineering sense to do so. It does not matter whether linguists would classify the difference as one of language or dialect, or the manner in which it evolved historically. The most important considerations are:
- whether (and to what extent) inheritance reduces duplication, and
- whether it makes sense for changes to the parent language to propagate into the child.
Update 2009-10-18: I’ve now implemented Brazilian Portuguese and American English using inheritance. In addition to some changed spellings, this includes the American style of writing numbers in hundreds up to 9999, and omitting the ‘and’ in compound numbers. I’m not completely convinced regarding the last of these points: some speakers are quite adamant that use of ‘and’ is wrong, but evidence of usage in reasonably formal contexts is not so conclusive. I’d welcome comment as to whether I’ve got this right or wrong.