Archive for February, 2009

Bi-directional agreement rules

Thursday, February 26th, 2009

Agreement rules are applied recursively, currently starting at the leaf nodes and working upwards towards the root. As I mentioned at the time they were introduced, this allows indirect propagation to occur upwards but not downwards. (Direct propagation is what can be achieved by applying a single rule once; indirect propagation requires two or more such steps.) So far this has not been a problem, but I think it’s almost certain to become one.

The solution I was toying with was to apply the agreement rules repeatedly until a stable solution was obtained. On reflection, I have two concerns about this idea. The first and very obvious one is that there might not be a stable solution. Worse, it would be very difficult to check in advance whether a given set of rules could fail to give a stable solution - certainly it would not be obvious by inspection. Secondly, there are useful techniques which work when rules are executed once but not when they are executed twice or more (and especially if you don’t know in advance how many times or in what order).

An example that relies on once-only execution is setting a tag at level n on condition A then clearing it at leven n-1 on condition B. This implements the condition ‘A and not B’, but with the qualification that if level n is the root (so n-1 does not exist) then it reduces to condition A. I’ve yet to find any other way of achieving this effect that works in all cases.

For this reason, I think a compromise is in order. Rather than allow an arbitrary number of iterations, my intention is to perform two passes: one upwards and one downwards. My reasoning is that this is (minimally) sufficient to allow information to propagate from any point in the tree to any other point in the tree. (Note that would not be so if the downward pass were formed first, so the order is important.)

So that rules are applied once only, they will be marked to indicate whether they refer to the upward or downward pass. The syntax will be:

agreement upward phrase;
agreement downward phrase;

For order of application I see three options:

  1. Leave the order undefined, as at present.
  2. Leave the order undefined, but require that the result not be dependent upon it.
  3. Apply all rules simultaneously to any given node, and require that they not conflict with each other.
  4. Define the order in which rules are applied.

The difficulty with option 1 is that it provides no assurance that language definition files are portable from one implementation (or version) of the translation system to another. Options 2 and 3 ensure portability, but only if the specified restrictions are enforced, and it turns out that option 3 is rather easier to enforce than option 2. Option 4 would ensure portability too, and in some respects would be more powerful than option 3, but I think the rulesets would be less readable, and I’ve not (yet) seen any evidence that a more powerful syntax is needed.

For these reasons I’m inclined to favour option 3. The method for detecting conflicts will be to iterate through all rules making note of how the tags attached to each node will change, but not actually making those changes immediately. Each node is then re-examined and the changes executed. A conflict is raised if any tag is required to be both added to and removed from a node. As a further refinement it would be helpful to mark nodes during the first phase that will need to be re-visited during the second phase, in order to avoid having to re-visit every node.

[Update 2009-02-27: taking this policy to its logical conclusion, implementation-dependent behaviour is something which ought to be eliminated across the whole of the translation system so far as is possible. This probably won't always be done by applying rules in parallel - for example, the inflection system already has a mechanism for specifying the order in which rules are applied - but I would expect for the most part that the same set of options will be available.]

Generating numbers in French

Saturday, February 21st, 2009

The French number system is more complex than that of English, partly due to its use of vigesimal in some numbers, and partly due to minor irregularities relating to the use of plurals and the word ‘et’. None of these have proved difficult to implement, however a somewhat larger number of rules have been needed.

In French there are three different ways in which a number acting as a multiplier can be pluralised:

  • The number ‘mille’ (1000) is never pluralised (or at least, not visibly so).
  • The numbers ‘vingt’ (20) and ‘cent’ (100) may be pluralised when they appear at the end of a number, but not otherwise.
  • Other multipliers such as ‘million’ (1000000) and ‘milliard’ (1000000000) may be pluralised wherever they appear.

(When I say ‘may be pluralised’ I mean that they will be iff there is more than one of them.)

I’ve chosen to handle ‘mille’ by allowing it to be marked as plural, but making the plural inflection do nothing. In this sense I’m treating it like the English word ’sheep’: it isn’t uncountable, because you can have ‘two sheep’ or ‘deux mille’, but it is invariable. Whether this is linguistically correct I don’t know, but it appears to produce the correct behaviour.

For ‘vingt’ and ‘cent’ I’ve introduced a special tag (drops-plural) to distinguish them, and added two agreement rules: one to make them plural when used as a multiplier, and one to convert them back to singular if they are followed by another number. Note that this does depend on the rules being applied from the bottom of the tree upwards, and if that were to change then a different method of implementation may be needed.

The components of a compound number may be separated by hyphens,by spaces, or by the conjunction ‘et’ (’and’). The examples I’ve looked at have not been entirely consistent as to what should be used when, but an acceptable policy would appear to be:

  • use ‘et’ for the numbers 21, 31, 41, 51, 61 and 71 (surrounded by spaces, not hyphens);
  • otherwise use hyphens for numbers less than 100;
  • use spaces elsewhere.

(These rules are intended to apply both to complete numbers and to components of a larger number.)

Insertion of ‘et’ has been handled by creating two special-purpose tags for words which can precede or follow ‘et’. When both are present in the correct order, combined using internal:add and with no substructure, the ‘et’ is inserted by a transformation.

Hyphens are not implemented at present because there is no mechanism available to support them (other than enumeration, which would be feasible but undesirable). This is an important issue in its own right, but I’m not ready to address it yet.

The transition from simple numbers (’seize’, 16) to compounds (’dix-sept’, 17) presents no difficulty, and is implemented in essentially the same way as the corresponding English transition (20 to 21) but with a lower threshold.

It will be necessary to insert a ‘de’ following multipliers ‘million’, ‘milliard’ and upwards when they are used to qualify a noun, however that is outside the scope of the grammar that has been written so far.

Tag Declarations

Thursday, February 19th, 2009

Currently all tags are treated equally by the translation system, irrespective of what they are used for. This means that any tags not needed to generate the parts of speech that had been chosen for a particular translation are deleted. Until very recently this was not a problem, mainly because agreement had not been implemented, and therefore there were no end-to-end tests involving both part-of-speech assignment and inflection. This has now changed and there is a need for a more subtle approach.

To show why there is an issue, consider the tagging of nouns in a language which has grammatical gender. The reading for each noun will have at least two tags: one to indicate that it is a noun, and one to indicate the gender. The noun tag will still be there for use by the agreement, transformation and inflection phases, but the gender tag will not because it isn’t necessary to satisfy the applicable productions.

Unwanted parts of speech do need to be discarded because a word may allow several. (This could be avoided by making each a separate reading, but the amount of bloat would be considerable.) The later phases need to know which part of speech has been chosen, because the required behaviour often depends on that decision (particularly with regard to inflection).

The simplest solution is to tell the translation engine which tags represent parts of speech and which do not. The syntax is:

tag tag-name = tag-type;

where tag-type is either part-of-speech or attribute. For example:

tag noun = part-of-speech;
tag plural = attribute;

This scheme can be very easily extended to allow further tag types, or even user-defined tag types should the need arise.

A secondary benefit of having tag declarations is that they can be made compulsory. This is no great hardship, because in a full-scale language definition the number of tags will be small compared to the number of words, and it has a good chance of catching at load-time any tags that have been mis-spelled.

(I’m considering whether to make morpheme declarations compulsory too. The cost is higher in that case, but the benefit of catching errors early may make it worthwhile.)

Agreement

Wednesday, February 18th, 2009

Currently the translation system has the means to inflect a word according to number, gender, and any other criteria required by the target language. What it lacks is a way to decide which inflections are needed in a given situation.

Typically it is necessary for the inflection of one word to match the characteristics of some other word with which it is associated. For example, the French definite article may be ‘le’, ‘la’ or ‘les’ depending on whether the noun it refers to is masculine singular, feminine singular or plural. This phenomenon is known as ‘agreement’, and it occurs to at least some extent in most languages.

Agreement isn’t something that can be built into the source text because some of the relevant attributes (notably gender) are language-dependent. For example, the French and Italian words for ‘house’ are grammatically feminine (’maison’,'casa’), whereas the Welsh and Irish Gaelic words are masculine (’tŷ’, ‘teach’). In German, Dutch, and of course English, they are neuter (’haus’,'huis’,'house’). The only place where this information can come from is the lexicon for the relevant target language.

Attributes which are subject to agreement can and will be represented by tags - that is one of the main purposes for which tags were introduced (the other being representation of parts of speech). What is needed is a means by which those tags can propagate from one word to another. The extent of that propagation must be guided by the grammatical structure of the sentence (so, for example, the gender of a noun might propagate to an attached adjective or determiner, but not to adjectives or determiners attached to other nouns).

The mechanism I’ve implemented to do this is similar to a transformation rule, in that there is a pattern which the text must match if the rule is to apply. However there is no replacement text, so changes to the structure of the text are not allowed. Instead, instructions for adding or removing tags within the existing structure are embedded within the pattern.

I’ve embellished the tag syntax, allowing a plus or minus sign or an exclamation mark to appear as a prefix. The meanings are:

  • no prefix: the tag must be present for the pattern to match;
  • exclamation mark: the tag must be absent for the pattern to match;
  • plus sign: the tag is added if the pattern matches
  • minus sign: the tag is removed if the pattern matches

The number of nodes to which a tag needs to propagate may be arbitrarily large. For example, a noun may be acted upon by a number of adjectives, determiners and other qualifiers. Any or all of these may depend on the grammatical gender of the noun, therefore the gender must be able to propagate to all of them. This can’t be done directly from the noun without either (a) writing a separate rule for each possible combination of qualifiers or (b) extending the pattern syntax to match structures of variable size. Neither of these options is at all palatable, so I intend to handle this case by performing the propagation indirectly. The rules then need only to specify how tags are copied between neighbouring words (or at least, words with a fixed relationship between them).

Note that agreement rules are not limited to verbatim copying. For example, if a tag needs to be copied from one end of a chain of words to the other, but without being added to the words in between, then a different tag name can be used for the intermediate steps. Similarly, if a verb needed to agree with the number or gender of both its subject and object then clearly a name change is needed in order to prevent the two from interfering with each other.

Agreement rules are applied recursively, starting at the leaf nodes of the text and working upwards towards the root. The order in which they are applied is not currently defined, on the assumption that they will not interfere with each other. Both of these decisions are subject to review once more experience has been gained. One particular limitation I’m aware of is that indirect propagation can occur upwards under these rules but not downwards. If this becomes an issue then some form of iterative application may be necessary, repeated until a stable result is obtained.

Finally, there is a problem with the part-of-speech selection process because this removes tags that are not needed by the active production rule, including tags that are needed by the agreement rules. The way to solve this will be to distinguish between tags which represent parts of speech and those with other purposes.

Generating numbers in English

Saturday, February 14th, 2009

The first number system that I’ve attempted to implement is that of English. Though not perfectly regular, it is one of the less complicated systems in existence, so if the translation system is to have any chance of working generically then it should have no great difficulty implementing it.

English numbers less than one thousand are typically expressed as a sum of hundreds, tens and units, with the exception that the values eleven to nineteen are single numbers (both in isolation and as part of a bigger number). Larger values are expressed as a number of thousands, millions, billions and upwards. Any component with a value of zero is omitted (hence ‘one million and one’, with no mention of the lack of tens, hundreds or thousands).

Implementation of this structure is straightforward using the numerical decomposition rules that were recently added to the language definition syntax. The following specific rules are needed:

  • Numbers in the range 20 to 99 that are not already multiples of ten have their tens and units separated.
  • Numbers in the range 100 to 999 have their hundreds separated from the remainder and expressed as a multiple.
  • Numbers in the range 1000 to 999999 have their thousands separated from the remainder and expressed as a multiple.
  • Similarly for millions, billions, trillions and upwards.

I am reluctantly defaulting to the ’short scale’ for all dialects of English, where a billion is 109.

There is one other valid method of decomposition:

  • Numbers in the range 1100 to 1999 (or 9999 in American English) which are not part of a larger number may alternatively have their hundreds (not thousands) separated from the remainder.

According to the Chicago Manual of Style this is the preferred format where it is allowed. That makes sense because fewer words are needed, and it is natural to prefer the more concise form where a choice is available. I intend to implement it, but for now it has been omitted in the name of simplicity.

For syntactic reasons many of the rules need to be split into two parts: one to handle the case where the remainder is non-zero, and one for when it is zero (and hence omitted).

As I’ve noted previously, it is neither necessary nor generally desirable for the decomposition rules to produce output which directly corresponds to the surface form: decomposition is just the first phase of the translation process, so it needs to produce output suitable for input to subsequent phases. For example, the surface form (prior to linearisation) that the current implementation produces for the number 256 is:

((2 100) and) (50 6))

but the output of the decomposition phase is:

((internal:add ((internal:add 6) 50)) (internal:mul internal:hundred) 2).

Note the use of internal:add and internal:mul to represent, in the abstract, combination by addition and multiplication respectively. Also internal:hundred, which is needed to distinguish between hundreds that have already been decomposed and those which have not.

Following the decomposition phase, parts of speech are selected given the available readings and the set of allowed grammatical productions. There is a choice to be made here: the grammar can be made sufficiently prescriptive to distinguish comprehensively between allowed utterances and those which are not, or just prescriptive enough to organise phrases that are actually produced by the decomposition rules. The former would be preferable if the productions might also be used for parsing, but as they refer to the untransformed text there is little prospect of that happening. The latter option has therefore been chosen on the grounds of brevity and readability.

The only other rules needed are a set of transformations to convert instances of internal:add and internal:mul into surface form. In most cases this is done just by concatenating the two values. The exception (in British but not American English) is the addition of a one- or two- digit value and one of three digits or more, which is done using ‘and’ (as in ‘two hundred and fifty six’). The transformations are able to distinguish between these situations by looking at how the two parts of the number are tagged.

Currently each multiplier (hundreds, thousands, millions and so on) needs an explicit pair of decomposition rules to support it, even though the progression (for thousands and upwards) is entirely regular. Because of this the ruleset has to stop somewhere, and I’ve chosen trillions as they are the largest multipliers used in everyday speech. It would be nice to remove this restriction and state the rule more generally. This could be done with a small extension to the rule syntax, but each multiplier would still need to be listed in the lexicon, and for most purposes the current limit is entirely adequate.

One form which I haven’t addressed yet is the use of the indefinite article in place of a leading ‘one’, as in “there are a hundred billion stars in the Galaxy”. I’m not sure to what extent this is ever truly obligatory, but it does appear to be preferable in some cases. It is also possible for ‘one’ to be omitted following the definite article. (To see the distinction between these cases, compare ‘a hundred’ and ‘two hundred’ with ‘the hundred’ and ‘the two hundred’.) The required behavior here may become clearer when more of the grammar has been written.

Finally, there will be a need to select between singular and plural according to the size of a number. I think this will best be done by assigning a singular or plural tag to the number itself, and then using agreement rules to propagate those tags to other relevant predicates. However, as there are no facilities for doing this yet, nothing has been implemented.

Numerical decomposition rules

Sunday, February 8th, 2009

I said previously that I would be looking at how natural languages represent numbers, and would use that information to decide what rule syntax is needed to represent the necessary transformations. Here are my findings. Most of the raw information was obtained from Number Systems of the World, which describes how numbers are formed in 69 different languages.

Not all languages allow large values to be represented, but those that do normally have a small number of words (or possibly morphemes) which are combined together to create a much larger number of values. The most common principles used to perform this combination are addition and multiplication. A much smaller number (including Ainu, Yoruba, and optionally Latin) use subtraction too. Finally, languages such as Danish multiply by a fraction in some cases, which effectively amounts to division.

Most of the number systems I’ve looked at use base 10 (decimal) either exclusively or almost so. Base 20 (vigesimal) was the next most popular radix, often as part of a hybrid system in combination with base 10. Some systems could be interpreted either as base 10, or as an alternating system of base 5 and base 2. Other radixes are rare, though as I’ve previously noted there are instances of base 6 (Ndom) and base 15 (Huli).

I’m disregarding any isolated exceptions to a more general rule. For example, the number 18 in Welsh is traditionally deunaw (two nines) and in Breton it is triwec’h (three sixes), but since neither forms part of a larger pattern there is very little point in distorting the numerical rules to accommodate them: they can simply be treated as exceptions (which the translation system as a whole is already well able to handle). Similarly, whereas a theorist would presumably want to treat the English numbers thirteen to nineteen as morphological compounds, they are small enough in number (and sufficiently irregular) that a good case can be made for enumerating them rather than attempting anything more sophisticated.

The strategy I’m going to adopt is to take any number in need of translation and decompose it into a numerical expression which has the same structure to its natural language counterpart. For example:

  • the French number “vingt et un” means literally ‘twenty and one’, so would correspond to the expression 20+1;
  • the English number “six thousand” corresponds to 6×1000;
  • the German number “zweihundertzweiunddreißig” corresponds to 2×100+2+30.

The rules used to achieve this will be called ‘numerical decomposition rules’, and will begin with the keyword decomposition. Following this will be a formal name to represent the number to be processed, then a set of conditions which that number must satisfy for the rule to apply, then an expression which will replace the number if the rule does apply. For example:

decomposition $x { ((eval:ge 10) $x) } = ((internal:add ((eval:mod 10) $x)) ((eval:mul 10) ((eval:div 10) $x)))

would decompose any number greater or equal to ten into a number of tens and a number of units. The predicate internal:add is used to indicate that these are added together to obtain the value. Unlike eval:add this does not evaluated by the translation engine, and is merely a symbol that can be matched by later rules. It may ultimately have a corresponding surface form (such as “and” in English), or it may be removed entirely by a transformation.

Any number of conditions may be specified as a comma-separated list. Multiple conditions are often needed, for example when deciding whether a number should be broken down into tens and units in English. This is true for numbers less than one hundred (eg. “forty two“), except for numbers less than twenty (eg. “nineteen“), and except for exact multiples of ten (eg. “seventy“). These three conditions could be combined into a single one if a logical-and operator were provided, but the use of prefix notation makes this significantly more difficult to read than allowing the three to be stated separately.

Although almost anything could be placed on the right hand side of the decomposition rule, I think it will be best not to try to approach the surface form too closely at such an early stage of the translation process. That is the reasoning behind defining the predicates internal:add and internal:mul to represent the abstract relationship between two numbers. Deciding how to represent this relationship is a linguistic problem, not a numerical one, and there is no need for the decomposition rule to attempt it.

There may be a need to introduce further selection criteria to handle the case where the numerical representation depends on the context, for example what is being counted. This could be awkward, because numerical decomposition occurs too early in the translation process to make full use of tags. I’m therefore going to wait and see whether any necessary transformations can be performed using existing translation mechanisms, but the option is there if it is needed.

Efficiency

Saturday, February 7th, 2009

A word about efficiency, and in particular the amount of time needed to parse a language definition file. I’m already conscious that this is not entirely negligible, even for files of just a few hundred lines. Expand this to tens or hundreds of thousands of lines and the time to read it will almost certainly become unacceptable if it has to be done on a per-process basis. However I have at least two options to reduce the impact of this problem:

  • Run the translation system as a daemon.
  • Translate the language definition file into an intermediate form which is loaded using mmap of dlopen and is capable of being interrogated directly.

The first option has the virtue of being easy to implement across a wide range of operating systems. It would even be conceivable for the translation system to operate as a network service, in which case the overhead of keeping the language definition in memory could be spread between many machines. However this is a relatively heavyweight option, and has more chance of going wrong (due to the daemon dying, failing to start, failing to bind to its server socket, or any of the other reasons why daemons don’t always work as they should).

The second method is not dissimilar to the way PO files are translated to MO files for use by the GNU gettext library, although the format would need to be much more complex. As with MO files, the intermediate form would not need to be architecture-independent because each machine would generate its own copy from the language definition file (which, being a text file, is architecture-independent). However libbabel would have to change significantly for this to work because standard containers such as std::map and std::list can’t be copied bit-for-bit into a file: new data structures would be needed, preferably ones with good locality of reference.

I’m inclined to favour the second option on platforms where it is feasible, as it better follows the principle of ‘not paying for what you don’t use’, and because mapping a read-only file into memory is less of a burden than allocating an equivalent volume of writable memory. However, no decision is needed at this time, and for now it is enough to know that options exist.

Numerical operations

Monday, February 2nd, 2009

Having established that the translation system needs a way to manipulate numbers, a method is needed for representing those operations within the language definition file.

The obvious choice would be to adopt the symbols used by most programming languages (’+', ‘-’, ‘*’, ‘/’ and so on), but this would mean devoting quite a large proportion of the available symbol repertoire to functionality which isn’t really at the heart of the translation system. There are also several areas where this would clash with existing or planned usage. Hyphens (aka minus signs) are already allowed within identifiers, and it is quite likely that I will want to use plus signs to represent morpheme boundaries. Round, square and curly brackets all have other uses, as do equals signs. These problems are surmountable, but I have no great appetite for the disruption or compromises that would be necessary. Numerical operations simply aren’t important enough to make it worthwhile.

Much more palatable would be a solution based on the existing syntax for compound predicates. This turns out to be quite feasible provided that one is willing to tolerate:

  1. a certain amount of verbosity, and
  2. the use of prefix rather than infix notation.

The idea is to create a predicate to represent each one of the required arithmetic operations. I’m going to go with:

eval:add
eval:sub
eval:mul
eval:div
eval:mod

The namespace prefix eval indicates that these predicates are to be evaluated numerically by the translation system (as opposed to being translated into the target language). Each of the operators takes a pair of integer arguments, for example ((eval:mul 9) 6). The first argument (the one most closely bound to the operator) corresponds to the right hand argument of an infix operator, so ((eval:div 2) 6) means six divided by two.

If the operation cannot be performed for any reason (for example, because the arguments are not integers, or there has been an overflow or a division by zero) then the result is the predicate eval:error.

Alongside the arithmetic operators are the following comparisons:

eval:eq
eval:ne
eval:lt
eval:ge
eval:gt
eval:le

The result of these operations is either eval:true or eval:false.

Currently all numerical operations are limited in magnitude to the range provided by the C++ data type unsigned long long. There is no particular justification for this restriction (beyond short-term convenience), and I would hope to remove it at some point in the future.

I can’t see any need for expressions to be evaluated except than when performing a numerical decomposition, and until some other need presents itself that is the only place they will be evaluated.