Archive for the ‘Rules’ Category

Computer Terminology part 1: Strategy

Tuesday, April 27th, 2010

One of the main motivations for developing the translation system was to automatically generate localised user interfaces for computer software. This task is likely to make heavy use of computer-related terminology, making that topic an obvious one to address at an early stage. I’ve refrained from doing this previously because there were some basic questions about predicates which needed to be settled first, and it was easier to do that using well-ordered sets of concepts such as numbers and colours. Now that I’ve gained some experience I think it is feasible to attempt something more ambitious.

Identification of the required predicates has been largely ad hoc, and I am expecting substantial additions to be necessary in the future. Writing definitions has been time-consuming but mostly straightforward. I doubt they are up to dictionary standard, but they should be adequate to indicate what is intended.

More difficult has been deciding how to divide the resulting predicates between namespaces. I’ve explored two alternative approaches:

  • A subtype-supertype hierarchy, in which members of a given namespace share an is-a relationship with their parent.
  • A topic-based hierarchy, in which predicates with fundamental semantic differences can and should share the same namespace if they relate to the same subject area.

For example, the first method would place the user interface elements ‘window’, ‘icon’ and ‘menu’ into one namespace, and the user actions ‘click’ and ‘drag’ into another, because the former are user interface elements whereas the latter are actions. The second method would more likely place all of these in a single namespace because they all relate to the topic of computer user interfaces.

The first method is attractive because it is relatively objective, but I’ve found that it has three major drawbacks:

  1. The resulting namespaces are often very small (and therefore numerous).
  2. The namespaces can be difficult to name concisely.
  3. Homonyms, which namespaces were supposed to resolve, are quite likely to end up in the same namespace.

The third point is arguably the most serious one, because there is no point having namespaces unless they provide a way to specify which sense of a word is intended. For example, using the supertype-subtype approach, there isn’t a large difference between hanging out the washing and hanging a person: both are actions which involve suspending something from a rope or line. However the topics with which they are associated are entirely different: housekeeping versus criminal justice.

Another question that arose was how far to go in creating predicates to represent concepts that could be expressed in terms of other predicates. Here are some examples where I think a good case can be made one way or the other:

  • Although it is true to say that a ‘laser printer’ is a printer that contains a laser, there is much more to its meaning than that. A full description would be impracticable, on grounds of both verbosity and fragility. It therefore makes good sense for there to be a separate predicate corresponding to the concept of a laser printer.
  • The alternative would be to create a predicate that captures the full, specialised meaning of the word ‘laser’ as used in the term ‘laser printer’. However, such a predicate would be so specialised that it could, I suspect, only be used to modify the concept of a printer. If the predicate language was being developed for semantic analysis then this added orthogonality might be useful, but I don’t think a translation system has any need for it.
  • A ‘printer cable’ is merely a cable for a printer. If a separate predicate were created for this concept then, for consistency, it would also be necessary to add ‘keyboard cable’, ‘mouse cable’, ‘plotter cable’ and many others. The need to systematically replicate a complete class of predicates is a clear indication that orthogonality has been violated, and while I wouldn’t be above that if there were a clear practical benefit, I can’t see any justification for it in this instance.

The distinction here is between what would be an idiom and what is merely a collocation. Idioms are forbidden in BabelScript, because they violate the rule for combining predicates, so if a concept would be too complex to be described analytically then it needs to become a predicate in its own right.

However there is an exception to this rule. Even if a predicate is not strictly needed, it can be added anyway as an alias for the decomposed form. In this case the criterion is whether the convenience of having a single predicate to express a concept outweighs the clutter resulting from additional and unnecessary predicates. In the third example above I’ve said no, but this is very much a value judgment.

For many of the potential predicates I’ve considered the correct answer is unclear. For example:

  • Is a package manager merely a program for managing packages?
  • A compiler is (I would say), merely a program for compiling, but is it a sufficiently well-used concept to justify an alias?
  • Should constructors and destructors be expressed as functions for constructing and destroying?

The approach I’ve taken is to err on the side of caution and avoid creating predicates that are questionable. Adding new predicates is straightforward, especially if they are aliases, whereas removing them is more disruptive (because it breaks compatibility with texts that use them).

Progress towards Unification

Thursday, December 31st, 2009

I’ve now created data structures to represent features, conjuncts and disjuncts, and can perform unification in a few simple cases — but not yet generally in the presence of disjuncts. The latter is hard to do efficiently, so in the first instance I plan to use an inefficient but simple algorithm (probably expanding both inputs to disjunctive normal form prior to unification).
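The simple-but-inefficient approach mentioned above can be sketched in Python. This is an illustration only, not the translation system's code: the `Or` class, the use of nested dicts for conjuncts, and the use of `None` to signal failure are all assumptions made for the example.

```python
from itertools import product

FAIL = None  # assumption: None marks a failed unification (so None is not a valid atom)


class Or:
    """Hypothetical disjunct: any one of the alternatives may hold."""
    def __init__(self, *alternatives):
        self.alternatives = alternatives


def unify(a, b):
    """Unify two disjunct-free structures (nested dicts or atomic values)."""
    if isinstance(a, dict) and isinstance(b, dict):
        result = dict(a)
        for key, value in b.items():
            if key in result:
                merged = unify(result[key], value)
                if merged is FAIL:
                    return FAIL
                result[key] = merged
            else:
                result[key] = value
        return result
    return a if a == b else FAIL


def expand(fd):
    """Expand a description into its list of disjunct-free alternatives (DNF)."""
    if isinstance(fd, Or):
        return [x for alt in fd.alternatives for x in expand(alt)]
    if isinstance(fd, dict):
        keys = list(fd)
        choices = [expand(fd[k]) for k in keys]
        return [dict(zip(keys, combo)) for combo in product(*choices)]
    return [fd]


def unify_dnf(a, b):
    """Unify every pair of alternatives and keep the pairs that succeed."""
    results = []
    for x, y in product(expand(a), expand(b)):
        u = unify(x, y)
        if u is not FAIL:
            results.append(u)
    return results


noun = {"cat": "noun", "number": Or("sing", "plur")}
context = {"number": "plur", "person": 3}
survivors = unify_dnf(noun, context)
```

The cost is visible in `expand`: the number of alternatives is the product of the sizes of all the disjuncts, which is why this is only suitable as a first implementation.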

Also not implemented yet are paths, constituent sets and patterns. An open question is whether to implement paths explicitly, or instead use named variables to tie branches of the tree together. So far as I can see, these options provide essentially the same functionality; however, I suspect that variables may be more convenient to implement given that the data structures I am using only allow you to move down the tree, not upwards.

I’m now confident that unification can provide all of the current functionality of the translation system except (perhaps) for morphological operations and left-right agreement. Furthermore, it would do so in a significantly more flexible and elegant manner than the ad-hoc mechanisms that exist at present. My main concern remains that of efficiency. I’m also uncertain as to how best to support dialects.

One mechanism which could be eliminated is the use of namespaces. For example, the predicate zoo:species:tyrannosaurus:rex could be replaced by:

[type=animal,rank=species,genus=tyrannosaurus,species=rex]

This is more verbose, but self-describing (a bit like the difference between X.500 and the DNS) and able to be processed in ways that an opaque predicate name cannot.
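To illustrate the kind of processing that becomes possible, here is a small Python sketch in which predicates are plain dictionaries. The feature set for the second entry (based on zoo:genus:vulpes) and the helper name `matches` are assumptions made for the example.

```python
def matches(description, query):
    """True if every feature in the query appears in the description with the same value."""
    return all(description.get(key) == value for key, value in query.items())


predicates = [
    {"type": "animal", "rank": "species", "genus": "tyrannosaurus", "species": "rex"},
    {"type": "animal", "rank": "genus", "genus": "vulpes"},  # assumed decomposition
]

# Select by any feature, without parsing an opaque name.
in_genus = [p for p in predicates if matches(p, {"genus": "tyrannosaurus"})]
animals = [p for p in predicates if matches(p, {"type": "animal"})]
```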

Unification

Monday, November 30th, 2009

I’ve begun experimenting with some changes to the internal data structures so that the translation system could support unification. This is a concept borrowed from formal logic which has become popular as a natural language processing technique. A good introduction can be found in Functional Unification Grammar: A Formalism for Machine Translation by Martin Kay.

The most notable difference this will bring to the translation system is that functional unification grammars are declarative in nature, whereas the language descriptions I am using currently are (for the most part) procedural. A declarative approach has two important advantages:

  • The language description need not be closely tied to any particular processing task, or to any particular implementation method. This makes it more likely to be reusable for other purposes. For example, a grammar described exclusively in terms of unification would be reversible.
  • Because the whole grammar applies in parallel (as opposed to individual rules being applied in series), the order in which components of the grammar are specified is unimportant.

Against this I expect there will be a price to be paid in terms of efficiency and/or complexity. Declarative languages tend not to be naturally efficient, by which I mean that when implemented simplistically they tend to be very inefficient. It is often possible to claw back some or all of this loss by using more sophisticated algorithms, but the programming effort required to do that can be considerable.

To obtain the full benefit of a unification grammar it would be necessary to eliminate all of the existing rule types, but that isn’t practicable in the short term. For this reason I’m going to provide the means to translate phrases into functional descriptions and back again. This is straightforward enough:

  • If the phrase is atomic then the corresponding token becomes the value of an appropriately-named feature (eg. ‘atom’).
  • If the phrase is composite then the left- and right-hand sides each become features (eg. ‘lhs’ and ‘rhs’).
  • Each tag becomes a feature with a value of ‘true’.
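The three conversion rules can be sketched in Python as follows. The `Phrase` class is a hypothetical stand-in for the system's internal phrase structure, and the feature names ‘atom’, ‘lhs’ and ‘rhs’ are the examples given above. The sketch assumes that no tag shares a name with those features.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Phrase:
    """Hypothetical phrase: atomic if token is set, otherwise composite."""
    token: Optional[str] = None
    lhs: Optional["Phrase"] = None
    rhs: Optional["Phrase"] = None
    tags: frozenset = frozenset()


def to_fd(phrase):
    """Convert a phrase into a functional description (a nested dict)."""
    fd = {tag: True for tag in phrase.tags}      # each tag becomes a feature set to True
    if phrase.token is not None:
        fd["atom"] = phrase.token                # atomic phrase
    else:
        fd["lhs"] = to_fd(phrase.lhs)            # composite phrase
        fd["rhs"] = to_fd(phrase.rhs)
    return fd


def from_fd(fd):
    """Convert a functional description back into a phrase."""
    tags = frozenset(key for key, value in fd.items() if value is True)
    if "atom" in fd:
        return Phrase(token=fd["atom"], tags=tags)
    return Phrase(lhs=from_fd(fd["lhs"]), rhs=from_fd(fd["rhs"]), tags=tags)


phrase = Phrase(lhs=Phrase(token="red"), rhs=Phrase(token="fox"),
                tags=frozenset({"animate"}))
fd = to_fd(phrase)
```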

Because functional descriptions can only grow, it may be necessary to use different feature names for the conversions to and from functional descriptions (depending on what is being attempted at the time).

Most likely it will be agreement rules that I attempt to replace first. This will require at least three significant changes to the way the language description is written:

  • In an agreement rule the absence of a tag can be used to indicate with certainty that a particular condition is false, whereas in a unification grammar the absence of a feature means that a value is unknown: to indicate falsehood it is necessary to explicitly specify a value of ‘false’.
  • When propagating attributes such as gender it will not be necessary to provide a separate rule for each of the possible values.
  • Agreement rules match only when they need to change something. A unification grammar must provide rules to match every valid input (otherwise the input would not be allowed), and normally you would want only one rule to match each input.
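The first of these points can be made concrete with a deliberately simplified Python sketch. It handles flat feature sets only, and the function name `unify_flat` and the feature names are assumptions made for the example.

```python
def unify_flat(a, b):
    """Unify two flat feature sets; return None if any feature conflicts."""
    result = dict(a)
    for key, value in b.items():
        if key in result and result[key] != value:
            return None
        result[key] = value
    return result


rule = {"plural": True}

# An absent feature is merely unknown, so the rule unifies and fills it in.
unknown = unify_flat({"cat": "noun"}, rule)

# Only an explicit value of False blocks the rule.
blocked = unify_flat({"cat": "noun", "plural": False}, rule)
```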

The changes are strictly experimental for now, and are being developed in a branch so that they can be easily abandoned should that be necessary. The main criteria will be whether they improve readability and maintainability of the language description files, and how much they add to memory and processing requirements. If they are successful then ideally they would replace all of the current phases from part-of-speech discovery through to transformation. Alternatively, they may simply be used to add to the existing rule types available when writing a language description.

Tags in Predicate Definitions

Sunday, November 1st, 2009

There are several stages of the translation process during which tags can be applied, but currently almost all of the raw information on which they are based is provided by the lexicon. Some of that information is language-independent, and I want to move it into the predicate dictionary so that it can be shared between languages. Examples of the type of information I have in mind are the fact that a predicate represents:

  • a colour, or
  • a substance, or
  • an animal.

As I’ve previously indicated, language-independence need not be absolute: if a particular language needs to handle a particular predicate differently then it can be overridden. I wouldn’t want to over-use this facility, because attributes which are substantially language-dependent belong in the lexicon, but it will simplify the handling of edge cases and idiosyncrasies.

The syntax for attaching tags to a predicate is as follows:

predicate foo
{
  tag bar,baz,quux;
};

Multiple tag statements are permitted as an alternative to listing them on one line. Tags are applied as the very first stage of the translation process. I’ve also made some changes to later stages so that tags survive up to and beyond word selection.

What can’t easily be achieved at this point is tagging of composite predicates prior to word selection, so while the system can be told that zoo:genus:vulpes has the characteristic of being animate, it is not able to deduce that (colour:red zoo:genus:vulpes) is equally animate. This is an issue I intend to address soon.

Dialects (Inheritance)

Saturday, October 17th, 2009

There are several decisions to be made regarding implementation of the inheritance mechanism. First and most obviously, there needs to be some form of command to indicate that one language is derived from another. I considered two alternatives for the syntax:

  • a Java-style extends keyword, which would form part of a header at the start of the language description, or
  • a Perl-style use statement, which would be placed in the body of the language description.

The problem with a use statement is that the natural expectation would be for it to be allowed anywhere in the language description: later declarations would override, earlier ones would be overridden. This is easy enough to implement: create a master-index (for each declaration type) listing all of the declarations that have not been overridden. The only snag is that I might not want to generate such an index (if optimising for start-up time and/or memory usage), in which case implementation becomes much more fiddly. I could live with this if there was a compelling need for mid-file inclusion, but I’m not convinced there is one.

That leaves the extends keyword. This already exists and is used for specifying inheritance relationships between morphological paradigms, so using it between languages would be a natural extension. It forces the point of inclusion to be at the start of the language description, so avoids the implementation issue described above. The specific syntax I have in mind is of the form:

language "en_US" extends "en";

Should multiple inheritance be supported? I don’t see any reason why not, but I can’t say I have any particular use in mind for it either. On that basis I’ve decided to allow only single inheritance in the first instance, but keep open the option of adding multiple inheritance later if needed.

Having established a hierarchy, it is then necessary to decide in detail how parent and child languages interact with each other. The general rules are that:

  • Everything defined in the parent is imported into the child.
  • Where search order is important (as is the case when objects are referred to by name), the child is searched first.
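For objects that are referred to by name, these two rules behave much like a chained dictionary lookup. The sketch below uses Python's `collections.ChainMap`; the entries are hypothetical and stand in for whatever named definitions a language description contains.

```python
from collections import ChainMap

# Hypothetical named definitions for a parent language and its dialect.
en = {"colour": "colour", "lift": "lift"}
en_us_overrides = {"colour": "color"}

# The child is searched first; anything it lacks is found in the parent.
en_us = ChainMap(en_us_overrides, en)

overridden = en_us["colour"]
inherited = en_us["lift"]
```

The parent mapping is left unmodified, so changes made to it later remain visible through the child.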

The specific effect on each type of object is currently as follows:

  • Tags could be made to override earlier definitions, but I can’t think of any good reason why you would want to do this: the effect on the parent language would likely be both profound and undesirable. I’ve therefore forbidden tag definitions which conflict with an existing name.
  • Predicate definitions override any previous definition of the same name. This is potentially quite awkward to implement, because they are indexed both by name and by meaning. If the meaning of a predicate is overridden then searches of the latter index do not necessarily yield the correct result. Unlike tags, predicates definitely need to be able to override each other, but their meanings probably don’t. I’ve therefore forbidden predicate definitions which alter the meaning of an existing predicate, but allowed the fallback and any other attributes to be changed.
  • Paradigms and morpheme definitions override any previous definition of the same name. These are indexed only by name, so implementation is straightforward.
  • Grammatical productions are effectively cumulative. The dialect is searched before the parent language, but since there is currently no way to indicate that a production does not occur, it is not possible to override in any meaningful way.
  • Decomposition rules are searched in the same order as productions, but only the first match is applied. This makes overriding possible in most cases, but because decompositions are applied recursively it requires significant trickery to turn one off entirely. (Replacing the pattern by itself won’t work.)
  • Agreement rules are searched in the same order again, but all matching rules are applied. This makes overriding difficult or impossible.
  • Transformations behave in a similar way to decomposition rules, but since they are not applied recursively to any single point in the text they are somewhat easier to override.
  • Readings don’t override each other within a given language description file, but they do override any readings with the same meaning in any parent language. (Coincidentally, this is very similar to how inheritance and overloading interact with each other in C++.) There is some potential for unexpected behaviour, but I think the alternatives would be worse.

Of these I would consider the behaviour of tags, predicates and readings to be satisfactory, but intend to look further at decomposition rules, productions, agreement rules and transformations. I think it is clear that inheritance works best with objects that have names. I’m reluctant to give every rule a name, but grouping them together into named ‘translation phases’ would probably suffice (and bring other benefits too).

Finally, a word about when inheritance should be used. The answer is: whenever it makes good engineering sense to do so. It does not matter whether linguists would classify the difference as one of language or dialect, or the manner in which it evolved historically. The most important considerations are:

  • whether (and to what extent) inheritance reduces duplication, and
  • whether it makes sense for changes to the parent language to propagate into the child.

Update 2009-10-18: I’ve now implemented Brazilian Portuguese and American English using inheritance. In addition to some changed spellings, this includes the American style of writing numbers in hundreds up to 9999, and omitting the ‘and’ in compound numbers. I’m not completely convinced regarding the last of these points: some speakers are quite adamant that use of ‘and’ is wrong, but evidence of usage in reasonably formal contexts is not so conclusive. I’d welcome comment as to whether I’ve got this right or wrong.

Agreement Rules with Conditions

Monday, May 25th, 2009

In many languages it is possible for the spelling of one word to have an effect on adjacent words. For example:

  • In English, the indefinite article ‘a’ becomes ‘an’ before a vowel.
  • In Welsh, the conjunction ‘a’ becomes ‘ac’ before a vowel.
  • In French, the definite articles ‘le’ and ‘la’ become ‘l’’ (l-apostrophe) before a vowel.

The left-right agreement rules described previously go some way towards meeting this requirement, but their pattern-matching ability is limited to whole words only: they cannot look inside a word to make decisions according to how it is spelt. It would be possible to manually enter all the required tags into the lexicon, but I would prefer an automated solution for obvious reasons.

The method I’ve adopted is to extend the agreement rule syntax to allow conditions to be specified, much as decomposition rules do already. The syntax will be a little different:

agreement <direction> <pattern>
  where <condition>
  and <condition>
  ...
  and <condition>;

(I think this is more self-explanatory than the current decomposition rule syntax, and if decomposition remains as a distinct rule type - see below - then I will be changing it to match.)

The conditions themselves will use the existing expression syntax, but with the addition of a new operator called eval:matches. This takes two arguments, a regular expression and a token, and returns true if and only if the token matches the regular expression.

A suitable rule for distinguishing ‘a’ from ‘an’ in English might then be:

agreement rightward (a[+vowel-after] $x)
  where ((eval:match "^[aeiou]") $x);

Alternatively, the word itself could be tagged as having an initial vowel, then this information transferred to the preceding word using an ordinary agreement rule:

agreement upward $x[+initial-vowel]
  where ((eval:match "^[aeiou]") $x);
agreement rightward
  (a[+vowel-after] $x[initial-vowel]);

Once a word has been tagged it can be altered as necessary using an inflectional rule. The second method would be appropriate when there are several ways in which the surface form can be affected by an initial vowel, so as to avoid performing the same regular expression match more than once.
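The two-step method can be mimicked in Python to show how the pieces fit together. This is only an analogy for the rule syntax above: the function names are hypothetical, words are held in a flat list, and (like the regular expression in the rules) the test is based on spelling rather than pronunciation.

```python
import re

INITIAL_VOWEL = re.compile(r"^[aeiou]")


def tag_initial_vowels(words):
    """Step 1: tag each word whose spelling starts with a vowel."""
    return [(w, {"initial-vowel"} if INITIAL_VOWEL.search(w) else set())
            for w in words]


def propagate_rightward(tagged):
    """Step 2: tag 'a' with vowel-after when the next word is tagged initial-vowel."""
    for (word, tags), (_, next_tags) in zip(tagged, tagged[1:]):
        if word == "a" and "initial-vowel" in next_tags:
            tags.add("vowel-after")
    return tagged


def inflect(tagged):
    """Step 3: an inflectional rule turns a tagged 'a' into 'an'."""
    return ["an" if w == "a" and "vowel-after" in t else w for w, t in tagged]


words = ["a", "apple", "and", "a", "pear"]
result = inflect(propagate_rightward(tag_initial_vowels(words)))
```

The regular expression is evaluated once per word in step 1, and any number of later rules can then consult the resulting tag.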

In the interests of orthogonality I intend to allow conditions to be applied to transformation rules too using the same syntax. Interestingly this would make the behaviour of transformation rules and decomposition rules very similar, the main difference being that they are applied at different stages of the translation process. If some mechanism were introduced for explicitly defining translation stages (which has the potential to be a very useful feature in its own right) then decomposition rules - as a distinct rule type - may become unnecessary.

Left-Right Agreement Rules

Tuesday, May 5th, 2009

One language feature which I have not yet been able to implement in a satisfactory manner is that of initial consonant mutation as found in Welsh and other Celtic languages. The difficulty lies not with the morphological process itself, which is for the most part straightforward, but rather the decision as to when to perform it.

For example, one of the ways in which the aspirate mutation is triggered in Welsh is when a word is preceded by ‘a’ or ‘ac’ (meaning ‘and’). This cannot easily be expressed using agreement rules (as currently implemented) for two reasons:

  • Agreement rules act on the text as a tree rather than a sequence. Words which are linearly adjacent to each other may be arbitrarily far apart in terms of tree structure. It follows that adjacency cannot be expressed by any fixed set of agreement rules.
  • Agreement rules are applied before transformations (necessarily so, because transformations often depend on tags that have been set by agreement rules). One of the most common uses of a transformation rule is to rearrange the order of the text, and therefore change which words are adjacent to each other.

It is possible to work around these restrictions to a limited extent by working backwards from the surface forms that are of interest and determining which intermediate forms could produce them, but this approach is both tedious and deeply unsatisfactory. If mutation (or any other phenomenon) occurs because words are adjacent (as opposed to having a particular structural relationship) then that is how the corresponding rules should be expressed.

I’m therefore satisfied that a new type of rule is justified. It needs to work in much the same way as an agreement rule, but using a pattern which is linear rather than tree-structured. It needs to be applied after any transformations have been completed (so that it can see the final word order) but prior to inflectional rules (so that it is able to influence them).

Agreement rules can already be marked as ‘upward’ or ‘downward’. Since these new rules will be so similar, I think extending this syntax to allow ‘leftward’ or ‘rightward’ is appropriate. The pattern will follow exactly the same syntax as now, but with the constraint that it must have the form of a list rather than an arbitrary tree. For example, the mutation rule described above might be expressed as:

agreement rightward (a $x[+aspirate]);

Longer patterns have additional space-separated terms but no extra parentheses. (Internally these are trees such that the left-hand side is atomic and the right-hand side is another list.)

Unlike upward and downward I doubt it will make any difference whether leftward or rightward rules are applied first, so I am going to arbitrarily say rightward first, leftward second.
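The essential difference from a tree-structured rule is that the pattern is matched against the words in their final linear order. A minimal Python sketch, assuming a tree of nested tuples with hypothetical `Word` leaves, might look like this:

```python
class Word:
    """Hypothetical leaf node carrying its text and a set of tags."""
    def __init__(self, text):
        self.text = text
        self.tags = set()


def leaves(tree):
    """Yield the leaves of a tree of nested tuples in left-to-right order."""
    if isinstance(tree, tuple):
        for child in tree:
            yield from leaves(child)
    else:
        yield tree


def apply_rightward(tree, trigger, tag):
    """Tag any word that immediately follows the trigger word in linear order."""
    words = list(leaves(tree))
    for left, right in zip(words, words[1:]):
        if left.text == trigger:
            right.tags.add(tag)


bara, a, caws = Word("bara"), Word("a"), Word("caws")
tree = ((bara, a), caws)   # 'a' and 'caws' are adjacent but in different subtrees
apply_rightward(tree, "a", "aspirate")
```

Here the tag only records that mutation is required; performing it is left to the inflectional rules, as described above.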

Agglutination and Hyphenation (part 2)

Wednesday, March 11th, 2009

I’ve described how two or more morphemes can be merged into a single word by applying the ‘agglutinate’ tag, but few languages are so purely agglutinative that this is sufficient: in most cases there are at least some combinations which require further processing to create the correct surface form. For example, when the Swedish words ‘ett’ and ‘tusen’ are merged the result is ‘ettusen’, dropping one of the letters ‘t’ in order to avoid creating a treble ‘t’. Languages that use complex methods for combining morphemes are described as being fusional.

It is unclear (to me) exactly how powerful the mangling mechanism needs to be, and I won’t know until I have gained a great deal more experience working with fusional languages. However, it seems likely that the mechanisms at work will be similar to those which occur during inflection - after all, inflection can be considered to be a form of compounding. For this reason I’ve decided to implement mangling using the same mechanism as inflection. In addition to regular expression chains, which form the basis of the mechanism, this allows the mangling to be influenced by tags, provides access to variables, and allows rules to be ordered using the keywords ‘depends’ and ‘forces’.

There is a possibility that this decision is overkill, but re-use of an existing mechanism brings two very important advantages:

  • It re-uses code in the translation engine, helping to keep its size and complexity to a minimum.
  • It re-uses BabelCode syntax, meaning that there is less to learn.

For this reason I would like the mechanisms to remain unified if at all possible (meaning that if any improvements are necessary, they should apply equally to both inflection and compounding).

One obstacle to unification was the fact that inflectional rules are designed to operate on a single word, and are not by themselves able to join two words together. My first solution to this problem was for the substitution chain to contain pairs of regular expression patterns. Both patterns would have to match for the rule to apply, and both would be able to supply substrings for use in the replacement string. This would have worked, but would have required substantial changes to the syntax and would not have achieved my objective of code reuse.

What I’ve done is therefore to join the components together before the mangling takes place, but separated by a plus sign in the first instance so that the regular expression can see where the boundary is located. (A plus sign is the character customarily used to represent a morpheme boundary.) Afterwards, the plus sign is deleted.
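The join, mangle and delete sequence can be sketched in Python. The function name `compound` and the single substitution rule shown for Swedish are assumptions made for the example, not the rules used by the translation system.

```python
import re


def compound(left, right, rules):
    """Join two components with '+', apply substitution rules, then drop the '+'."""
    joined = left + "+" + right
    for pattern, replacement in rules:
        joined = re.sub(pattern, replacement, joined)
    return joined.replace("+", "")


# Hypothetical rule: 'tt' followed by 't' across a boundary loses one 't'.
swedish_rules = [(r"tt\+t", "t+t")]

merged = compound("ett", "tusen", swedish_rules)
plain = compound("acht", "zehn", [])
```

Because the boundary is an ordinary character in the joined string, each rule needs only one pattern rather than a pair.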

Compounding rules are distinguished from normal inflection rules by replacing the keyword inflection with either prefix or suffix. Compounding of morphemes A and B to form AB causes the prefix rules associated with A and suffix rules associated with B to be applied. In the more complex case where AB is compounded with CD, it is the prefix rules of B and the suffix rules of C which are applied.

Implementing this arrangement is not entirely trivial. The inflection process begins with an uninflected morpheme, so it can easily be looked up in the language definition to see which rules are applicable. The compounding process begins with morphemes which may have been inflected, and may already have been partially compounded, so a lookup at that stage would not work. The way I have solved this is to preserve a copy of the uninflected text for use during the compounding phase. Inflection does not change the structure of the text, so it is easy enough to navigate the two structures in parallel in order to match up pre- and post-inflected forms.

Agglutination and Hyphenation (part 1)

Monday, March 9th, 2009

The number systems of English, French, Spanish and Portuguese are largely analytic in nature, meaning that they draw on a small, fixed set of words and do not create new ones. (By ‘word’ I mean a sequence of letters with no intervening spaces or punctuation marks.) There are exceptions, for example the English words ‘thirteen’ through to ‘nineteen’ and ‘twenty’ through to ‘ninety’ can be decomposed further, but the important point is that the number of words is small enough to be enumerated.

This is not the case for many other languages. For example, in German, every non-negative number up to one million is represented by a unique word. Some of these, such as:

neunhundertneunundneunzigtausendneunhundertneunundneunzig (999,999)

are very long, but they are formed according to a regular pattern from a small, fixed set of morphemes. In fact, the pattern is broadly similar to the ones found in English, French, Spanish and Portuguese: it is only the process of concatenation which makes the result appear so very different. Languages which do this are called either ‘agglutinative’ or ‘fusional’, depending on the amount of mangling needed to join the words. German performs little mangling, so is towards the agglutinative end of the spectrum.

In order to support these word-formation processes, a compounding mechanism has now been implemented within the translation system. It is invoked by adding the tag ‘agglutinate’ to the phrase that is to be merged into a single word. For example:

(acht zehn)[agglutinate]

would be rendered as ‘achtzehn’.

Currently one instance of this tag at the root of a subtree will affect every word beneath it. I’m unsure whether this is the best behaviour - it would be possible for the tag to merge only the rightmost node of the left subtree and the leftmost node of the right subtree - but I’m going to leave it that way unless and until I see evidence that more flexibility is needed. My justification for the current rule is that if compounding cannot follow the phrase hierarchy then the hierarchy is wrong.
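The current behaviour, in which a tag at the root of a subtree affects every word beneath it, can be sketched in Python as follows. The tree representation (a string for a word, or a tuple of left side, right side and tag set) is an assumption made for the example.

```python
def render(node, inherited=False):
    """Render a phrase tree; node is a str or a (lhs, rhs, tags) tuple."""
    if isinstance(node, str):
        return node
    lhs, rhs, tags = node
    # Once 'agglutinate' has been seen, it applies to the whole subtree.
    merge = inherited or "agglutinate" in tags
    separator = "" if merge else " "
    return render(lhs, merge) + separator + render(rhs, merge)


eighteen = render(("acht", "zehn", {"agglutinate"}))
untagged = render(("acht", "zehn", set()))
nested = render((("neun", "hundert", set()),
                 ("neun", ("und", "neunzig", set()), set()),
                 {"agglutinate"}))
```

Only the root of the last example carries the tag, yet all five morphemes are merged.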

In the interests of economy, hyphenation is performed using the same mechanism but a different tag: ‘hyphenate’. I’m open to the possibility of adding further methods of compounding using other symbols, and there may even be a case for compounding methods to be defined in the language definition file (which would avoid the need for ‘agglutinate’ and ‘hyphenate’ to be hard-coded in the translation engine).

Compounding is performed after the inflection phase so that individual morphemes can be inflected if required. Typically it is only the final morpheme that is subject to inflection, but I think there are enough exceptions to this rule not to want to hard-code it (especially when considering hyphenation as well as agglutination). If it makes a difference, the inflectional rules can be told in advance which morphemes are going to be compounded - I would rather do that than have to write inflectional rules able to handle compound words.

Enhanced decomposition rules

Sunday, March 1st, 2009

The numerical decomposition rules that I described previously have now been used to generate cardinal numbers in English, French, Spanish and Portuguese, and so far have worked well. Italian and Welsh have been looked at, and in both cases the decomposition phase is expected to be straightforward, but there are other issues (compounding and mutation respectively) which will need to be addressed before a full implementation is attempted.

Where I have run into difficulty is the translation of ordinal numbers. Some method is needed to distinguish ordinals from cardinals, and I intend to use a predicate (provisionally called ordinal) which takes a cardinal as its argument and modifies it to yield the corresponding ordinal. Unfortunately, numerical decomposition rules cannot currently ‘see’ this modifier so they cannot behave differently for ordinals and cardinals.

I did find one way around this problem, but it involves a certain amount of cheating. The words ‘first’, ‘second’, ‘third’ and upwards would be given readings of 1, 2 and 3 just like their cardinal counterparts, but for a category called pre-ordinal. The predicate ordinal would map to a made-up placeholder, which would participate in part-of-speech selection but afterwards be deleted. There would be a production rule in which the placeholder acts on the category pre-ordinal to give the category ordinal. If this is the only production rule which consumes a pre-ordinal then it follows that the words ‘first’, ‘second’ and ‘third’ will only be chosen when ordinals have been called for in the source text. Conversely, if this is the only production rule matching the placeholder, then the words ‘one’, ‘two’ and ‘three’ will not be chosen when ordinals have been called for.

Cunning though this plan might be, I’m not keen on it for two reasons. Firstly, it assumes that ordinals and cardinals will always be decomposed in the same way. That appears to be true of English, and could well be true for most languages, but I would prefer not to hard-code assumptions of this nature into the translation system. Secondly, it involves declaring that ‘one’ and ‘first’ are merely different parts of speech with the same semantic content. That is not true: what ought to happen is for ‘one’ to mean ‘1’ and ‘first’ to mean ‘(ordinal 1)’. While this can’t be done using the existing numerical decomposition rule syntax, I said at the time they might need to be extended and that is what I now propose to do.

Clearly decomposition rules need to be made conditional on something other than the value of the number. Tags are an obvious possibility, but as previously noted, few if any tags will have been set at this stage of the translation process. One answer to that would be to provide an opportunity to set some tags before the decomposition rules are applied. An extra translation phase would be needed for this (probably very similar to the existing agreement phase).

Alternatively, decomposition rules could incorporate a pattern in much the same way that transformation and agreement rules do already. This would allow them to take account not just of the number to be decomposed, but of nearby structure too. For example, an English ordinal could be decomposed into tens and units with the rule:

decomposition (ordinal $x) { ((eval:ge 10) $x) } = ((internal:add (ordinal ((eval:mod 10) $x))) ((eval:mul 10) ((eval:div 10) $x)))

Application needs to start at the root of the tree and work downwards in order for this rule to take priority over the corresponding one for cardinals. All subtrees will need to be checked, not merely those consisting of a number, so the decomposition phase will become more computationally expensive (but no more so than the transformation and agreement phases are already). In the first instance I’m not going to require that variables be numbers (although it would be more efficient to perform this check within the pattern rather than waiting for non-numbers to fail the list of conditions).
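The effect of the example rule, applied from the root downwards, can be sketched in Python. The tuple representation and the name `decompose` are assumptions made for the example. The sketch applies the one rule literally and does not attempt the other cases (cardinals, or ordinals that are exact multiples of ten) that a full rule set would need.

```python
def decompose(node):
    """Apply the ordinal tens-and-units rule top-down to a tree of nested tuples."""
    # Pattern (ordinal $x) with the condition $x >= 10.
    if (isinstance(node, tuple) and node[0] == "ordinal"
            and isinstance(node[1], int) and node[1] >= 10):
        x = node[1]
        # (add (ordinal (x mod 10)) (10 * (x div 10)))
        node = ("add", ("ordinal", x % 10), 10 * (x // 10))
    # Every subtree is visited, not just those consisting of a number.
    if isinstance(node, tuple):
        return (node[0],) + tuple(decompose(child) for child in node[1:])
    return node


twenty_third = decompose(("ordinal", 23))
seventh = decompose(("ordinal", 7))
```

Because the ordinal pattern is tested at the root before the number beneath it is reached, it takes priority over any rule that would match the bare number.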

Although matching on a pattern undoubtedly counts as a significant extension of the rule syntax (and is not therefore to be done lightly), I am comfortable with the idea because it follows a path that is already well-trodden (and consequently requires very little new code). It is certainly less intrusive than an entirely new translation phase, which in any case would need to support very similar functionality.

(Indeed, a case could be made that it is actually a simplification, by making numerical decomposition more like other rule types. A topic I may investigate in the future is whether there is any further scope for convergence between rule types.)