Archive for the ‘Translation’ Category

Tags in Predicate Definitions

Sunday, November 1st, 2009

There are several stages of the translation process during which tags can be applied, but currently almost all of the raw information on which they are based is provided by the lexicon. Some of that information is language-independent, and I want to move it into the predicate dictionary so that it can be shared between languages. Examples of the type of information I have in mind are the fact that a predicate represents:

  • a colour, or
  • a substance, or
  • an animal.

As I’ve previously indicated, language-independence need not be absolute: if a particular language needs to handle a particular predicate differently then it can be overridden. I wouldn’t want to over-use this facility, because attributes which are substantially language-dependent belong in the lexicon, but it will simplify the handling of edge cases and idiosyncrasies.

The syntax for attaching tags to a predicate is as follows:

predicate foo
{
  tag bar,baz,quux;
};

Multiple tag statements are permitted as an alternative to listing them on one line. Tags are applied as the very first stage of the translation process. I’ve also made some changes to later stages so that tags survive up to and beyond word selection.
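To make the equivalence between the one-line and multi-statement forms concrete, here is a minimal sketch in Python (not the translation system's own parser). The function name parse_tags and the regular expression are assumptions made purely for illustration; the real grammar may well differ.

```python
import re

def parse_tags(block):
    """Collect tags from every 'tag a,b,c;' statement in a predicate body.

    Hypothetical helper: tags from several statements are accumulated in
    order, and duplicates are ignored.
    """
    tags = []
    for names in re.findall(r"\btag\s+([^;]+);", block):
        for name in names.split(","):
            name = name.strip()
            if name and name not in tags:
                tags.append(name)
    return tags

# The two spellings described above should yield the same tag list.
one_line = "predicate foo { tag bar,baz,quux; };"
several = "predicate foo { tag bar; tag baz; tag quux; };"
```

Under these assumptions both inputs produce the list bar, baz, quux.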

What can’t easily be achieved at this point is tagging of composite predicates prior to word selection, so while the system can be told that zoo:genus:vulpes has the characteristic of being animate, it is not able to deduce that (colour:red zoo:genus:vulpes) is equally animate. This is an issue I intend to address soon.

Fallbacks

Thursday, October 22nd, 2009

One of the main challenges that the translation system must address is how to behave when there is no word in the target language to express one of the concepts requested. I intend to address this by providing fallback translations which are available across all languages. There will be two types of fallback:

  • Fallbacks to a description. For example, an ‘aeroplane’ might be described as a ‘flying machine’, or a ‘desk’ as a ‘writing table’. These are not perfect replacements, but in a language with limited vocabulary there may be no better way to convey the required meaning.
  • Fallbacks to a reading. Sometimes it is better to borrow a foreign word than to attempt any form of translation. Good examples of this would be animal names such as ‘kangaroo’ or ‘meerkat’ (which made their way into English through this mechanism). Any useful description would be unreasonably long, and falling back to ‘marsupial’ or ‘mongoose’ is unlikely to be helpful.

It is also worth mentioning one other type of construct which, though not strictly a fallback, has similar behaviour in that it can be translated without the need for an explicit reading:

  • Aliases. These are predicates which can be exactly expressed in terms of other, more primitive predicates. The alias exists only as a convenience for those writing source texts and language descriptions. Any analysis occurs after decomposition into primitives.

The natural place for descriptive fallbacks to be specified is in the corresponding predicate definition. (One consequence of this is that descriptive fallbacks will only be able to replace atomic predicates, not compounds. I think this is a reasonable restriction.) I’ve chosen the following syntax:

predicate vehicle:aeroplane
{
  fallback ((for-purpose-of flying) machine);
};

The intended behaviour is straightforward: if no reading is found then the predicate is replaced by the fallback expression and an attempt made to translate that. I would like to allow multiple fallbacks (to be tried sequentially until one translates successfully) provided that this is not overly difficult to implement.
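The behaviour described above, including the hoped-for sequential fallbacks, can be sketched in Python as follows. This is only an illustration: the function translate, the dictionaries passed to it and the simplified expression (flying machine) in place of ((for-purpose-of flying) machine) are all assumptions, and compounds are rendered by naive concatenation.

```python
def translate(expr, readings, fallbacks):
    """Translate a predicate (str) or compound (tuple); return None on failure.

    Hypothetical sketch: 'readings' maps predicates to words and
    'fallbacks' maps predicates to an ordered list of replacement
    expressions. Cyclic fallbacks are not guarded against.
    """
    if isinstance(expr, tuple):
        parts = [translate(part, readings, fallbacks) for part in expr]
        return None if None in parts else " ".join(parts)
    if expr in readings:
        return readings[expr]
    # No reading: try each fallback in turn until one translates.
    for alternative in fallbacks.get(expr, ()):
        result = translate(alternative, readings, fallbacks)
        if result is not None:
            return result
    return None

# A language with words for 'flying' and 'machine' but not 'aeroplane'.
readings = {"flying": "flying", "machine": "machine"}
fallbacks = {
    "vehicle:aeroplane": [("air", "vehicle"), ("flying", "machine")],
}
```

Here the first fallback fails because neither of its parts has a reading, so the second is used and the result is 'flying machine'.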

Originally I had intended to make fallback readings part of the predicate definition too, but eventually decided that there is no need: ordinary readings already provide all of the functionality that is needed:

reading meerkat[noun] = zoo:species:suricata:suricatta;

Should an attempt be made to classify fallback readings into parts of speech (as in the example above), or should they simply be tagged as ‘foreign words’ which are somehow outside the grammar of the target language? My expectation is that most fallback readings will be noun-like, but if there are differences then it must surely be useful for the target language to know about them (and if there are none then ‘foreign word’ is simply an alias for noun).

This will make it necessary for parts of speech used in fallback readings to be standardised across languages, so French would use tags such as ‘noun’ and ‘adjective’ as opposed to ‘substantif’ and ‘adjectif’. Fortunately I’ve been doing this anyway (albeit largely for my own convenience rather than as part of any grand plan). The same will apply to other types of tag such as ‘singular’ and ‘plural’.

Should there be any attempt to create default inflection rules for fallback readings? I think probably yes. When words are borrowed from one language to another they often retain their original morphology in the first instance. At the very least it can’t do any harm to know which language the fallback was taken from. If a language wants to follow its own rules regardless of this information then it can override the default.

I’m aware that different languages have different parts of speech, and that while most languages have words which pass for nouns, adjectives, verbs and adverbs, that does not mean they have the same semantics or behaviour as English nouns, adjectives, verbs and adverbs. I also appreciate that inflecting completely alien words could prove to be difficult in some languages (although knowing that they are alien should help). However this is about producing something when the alternative is to provide nothing, so perfection is not a requirement.

Should languages omit readings if there is a suitable fallback reading? That’s a tricky question. On the one hand, to say no results in a large amount of duplication within the language description files. This is surely undesirable. On the other hand I do think that lists of differences will be more difficult to write, check and maintain, and that automatic propagation of changes is not desirable in the way that it is between dialects.

A good example of this is chemical elements in Danish. There is very nearly a 50-50 split between native and international names, so any saving would not be of the same order as (for example) British versus US English where there are only three differences. If a full list is given then it is possible to check that every element has been considered, whereas a partial list cannot be checked for completeness without redoing much of the research used to compile it in the first place.

For these reasons I’m minded to stick with full lists for the time being. However when compilation is introduced it would certainly be possible for any redundant readings to be automatically removed by the compiler, and I see no harm in that. Finally, I would only intend to duplicate readings for words which have been or are being assimilated into the language in question. Where a language has no word for a concept, the fallback will not be duplicated.

Merging the predicate dictionary into the language description

Wednesday, September 30th, 2009

I’m going to eliminate any formal distinction between the predicate dictionary and the language description. Instead there will be only one type of file, which can hold any of the currently supported types of declaration (predicate, morpheme, reading and so on).

Of course, I certainly don’t want to copy and paste the same set of predicate declarations into many separate files, but that won’t be necessary. There will need to be a way for one language to inherit declarations from another language in order to efficiently support dialects. The same mechanism, once it has been implemented, can be used across all languages to share a common set of predicate declarations.

One important difference between this and the current arrangement is that it will provide a basis for predicate declarations to be overridden. This will allow information to be associated with a predicate even if it is not strictly language-independent.

For example, Polish treats nouns differently according to whether they are animate or inanimate. For the most part animacy is defined as you would expect it to be, but there are marginal cases which are decided by convention (plants are generally inanimate, but viruses, bacteria and fungi are animate), and more than a few outright exceptions (units of currency, such as the złoty, are animate). To the extent that the classification is based on objective criteria it can and should be shared between languages, but exceptions rightly belong within the relevant language description.
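One way to picture inheritance with overriding is as a chain of lookups, sketched here in Python with collections.ChainMap. The predicate name money:unit:zloty and the attribute values are invented for the example, and the real system is written in C++, so this shows only the lookup order.

```python
from collections import ChainMap

# Shared declarations, based on objective criteria (hypothetical data).
shared = {
    "zoo:genus:vulpes": "animate",
    "money:unit:zloty": "inanimate",
}

# Polish overrides only the exceptions it needs.
polish_overrides = {"money:unit:zloty": "animate"}

# Lookups consult the language's own declarations first, then the shared set.
polish = ChainMap(polish_overrides, shared)
english = ChainMap({}, shared)
```

With this arrangement polish["money:unit:zloty"] gives 'animate' while english["money:unit:zloty"] gives 'inanimate', and both inherit the shared entry for the fox unchanged.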

Implementing this capability will not be a great burden. Arguably it simplifies the translation system slightly, and it avoids the annoyance (within the internal C++ API) of having to explicitly instantiate the predicate dictionary and provide a reference to it when constructing a language object.

Names of Colours part 4: Implementation

Monday, August 31st, 2009

I’ve now had some experience working with the colour predicates described previously, and so far they have proved to be satisfactory. Certainly I have not yet had cause to wish that they were defined differently. However there are a number of constructions which cannot currently be translated, due in large part to how the word selection algorithm works. Here is an outline of what works, what doesn’t, and how the situation could be improved.

Unqualified hues present no difficulty provided that the readings given cover all of the allowed predicates. This can and often does result in several readings for the same colour name. For example, both colour:azure and colour:blue have been translated as ‘blue’ in English.

Hues qualified with colour:dark, colour:light or colour:bright also work as intended provided that they are bound together as a compound predicate, for example:

(colour:dark colour:orange) ⇒ ‘brown’

With a few additions to the language description this can be expanded to:

((colour:dark colour:orange) bio:genus:vulpes) ⇒ ‘brown fox’.

However, other permutations of these predicates have a less desirable surface form:

((colour:orange colour:dark) bio:genus:vulpes) ⇒ ‘orange dark fox’
(colour:dark (colour:orange bio:genus:vulpes)) ⇒ ‘dark orange fox’
(colour:orange (colour:dark bio:genus:vulpes)) ⇒ ‘orange dark fox’

The question is, should the translation system do better with these inputs, or should these inputs be avoided?

A partial answer to this question is that it shouldn’t matter whether colour:dark or colour:orange is specified first, because the effect of these predicates on the membership function is linear. By this I mean that if f(x) represents darkness and g(x) represents orangeness then:

f(g(x)) ∝ f(x) × g(x)

Since multiplication is commutative it follows that:

f(g(x)) ∝ g(f(x))

This does not mean that (colour:dark colour:orange) and (colour:orange colour:dark) should necessarily produce the same output, but the surface forms should at least be of similar quality, which is clearly not the case at present.

One solution would be to provide two readings for the word ‘brown’, but this would be inelegant, and scales poorly if more than two predicates were involved. The alternatives are to improve the word selection algorithm so as to recognise when permutations are equivalent to each other, or to force the predicates into a particular canonical order.
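The canonical-order option could look something like the following Python sketch, in which a single reading for ‘brown’ matches either permutation. Alphabetical sorting stands in for whatever canonical order would really be chosen, and the names canonical, READINGS and select_word are hypothetical.

```python
def canonical(predicates):
    """Put commutative predicates into one agreed order (alphabetical here)."""
    return tuple(sorted(predicates))

# One reading per colour, stored under its canonical key.
READINGS = {
    canonical(["colour:dark", "colour:orange"]): "brown",
}

def select_word(predicates):
    """Look up a reading regardless of the order the predicates arrived in."""
    return READINGS.get(canonical(predicates))
```

This scales to three or more predicates without extra readings, but it only helps when the predicates concerned really are interchangeable.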

Many languages have a preferred order for adjectives, so some reordering will be needed whether or not it is a requirement for word selection. For those languages which don’t have a preferred order, there is no reason why one can’t be imposed anyway. Even for those languages which use adjective order to indicate emphasis, there is no need to preserve the original order of the predicates, because that would not be a correct way to deduce what should be emphasised.

However I can see one situation where reordering won’t help. Where a noun encompasses the meaning of one or more adjectives (such as ‘lamb’ or ‘ewe’ in place of ‘sheep’) there is no guarantee that the predicates replaced will be canonically adjacent to each other (for example ‘young black sheep’). For this reason I think improvements to the word selection algorithm will be needed, even if canonicalisation is introduced too.

Regarding the question of associativity, (colour:dark (colour:orange bio:genus:vulpes)) is certainly acceptable: it merely applies the three predicates in sequence. As they are all descriptive and do not contradict each other there is no reason why this shouldn’t happen, but it is not something that the translation system can handle currently.

The colour in isolation, however, must be expressed as (colour:dark colour:orange) (or vice versa), because there is no other way in which two predicates can be combined. It follows that all of these forms need to be matched. The options are similar to before, except that there will almost certainly need to be changes to word selection (because the current system cannot replace a set of predicates which do not form a subtree).

It has occurred to me that I may be making life unnecessarily difficult for myself by using an explicit binary tree structure as opposed to something more akin to the list structures used in Lisp. In the latter case there is a terminator at the end of the list, so there is no structural difference when colour:dark and colour:orange are applied to each other or applied to something else. This would be a radical change that would affect the whole translation system, but I think it is worth considering.
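To illustrate the point about structure, the Python sketch below represents the binary tree as nested two-element tuples (an assumption made for the example) and flattens it into a list. Both bracketings of the same three predicates then become identical.

```python
def flatten(tree):
    """Flatten a binary tree of predicates into a left-to-right list.

    Hypothetical representation: a compound is a (left, right) tuple and
    a leaf is a string. Grouping information is deliberately discarded.
    """
    if isinstance(tree, tuple):
        left, right = tree
        return flatten(left) + flatten(right)
    return [tree]

grouped = (("colour:dark", "colour:orange"), "bio:genus:vulpes")
nested = ("colour:dark", ("colour:orange", "bio:genus:vulpes"))
```

Whether it is always safe to throw away the grouping is exactly the question that would need to be answered before making such a change.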

Geography: Adjectival Forms

Sunday, April 26th, 2009

Many geographical names have a corresponding adjectival form, for example:

  • Africa (African)
  • Mongolia (Mongolian)
  • Cornwall (Cornish)
  • Liverpool (Liverpudlian)

The syntactic behaviour of these terms is straightforward but their semantics are not:

  • ‘Australian wine’ is wine which originated from Australia (indicating origin).
  • ‘Chinese food’ is food of a type which originated from China (indicating origin, but of the type rather than the food itself).
  • an ‘American state’ is one which forms part of the United States of America (indicating inalienable possession).
  • ‘Russian gold reserves’ are gold reserves owned by Russia (indicating alienable possession).
  • the ‘English victory at Agincourt’ was a victory by England (indicating the identity of the agent).
  • the ‘French defeat at Waterloo’ was a defeat of France (indicating the identity of the patient).

(I’ve excluded from this list idiomatic usage such as ‘Chinese whispers’ or ‘Spanish practices’ because idioms can only be handled as special cases: it is simply not possible to deduce their meaning analytically [1]. Also excluded is usage referring to the language rather than the location, such as ‘Italian verb’, because the meaning would not then be expressed in terms of the geographical predicate.)

This is not an issue of ambiguity. On the contrary, in any given context the meaning of the adjective is usually well-defined even if (and this is the important point) more than one of the options is physically plausible. This is particularly apparent in the case of nouns like ‘defeat’ where there can be both an agent and a patient. In the examples given above it is perfectly clear as a matter of language who occupies which role: it is not necessary to know military history to work it out.

To make matters even more complicated, the role can depend on more than just the noun. For example, when referring to the ‘English defeat of the Spanish Armada’ it is clear that England is the actor, not the patient. (One way of explaining this effect would be to take the view that ‘English’ is not acting directly on the noun ‘defeat’, but rather on the noun phrase ‘defeat of the Spanish Armada’. Since this is a different concept with different characteristics, the fact that it casts the modifying adjective in a different role is unsurprising.)

I don’t think it is feasible to fully address issues like these in the lexicon. For starters I have no particular desire to list half a dozen separate readings against each adjective. Even if I did, this would not give the correct behaviour because in addition to the allowed usage it would also permit a wide variety of invalid usage.

My tentative solution is therefore to introduce a level of indirection so that only one reading is needed, and so that there is more opportunity for rules to influence the word selection process. (At present the only rules that execute prior to word selection are decomposition rules, but it was always likely that would change.) The specific mechanism I’m proposing is as follows:

  • A predicate is introduced for internal use within language description files when specifying the meaning of adjectival forms such as ‘English’ and ‘French’. I’m going to call this adjective:genitive.
  • This predicate will not have any fixed meaning, but rather, will represent the difference between the adjectival form and the noun (‘English’ vs ‘England’, ‘French’ vs ‘France’).
  • Further predicates are introduced to represent more specific relationships, such as alienable possession, in cases where use of the adjectival form would be permissible. I’m going to give these names of the form adjective:possessive:alienable.
  • There are no readings for the specific predicates, so generation occurs only by means of fallbacks.
  • The first fallback is to adjective:genitive, so if a suitable adjectival form exists then it is used.
  • The second fallback is to an appropriate preposition (such as ‘of’ or ‘by’) for use when the adjectival form is unsuitable or nonexistent.
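The two-stage fallback proposed in this list might be sketched in Python as follows. The function realise, the predicate name adjective:agent and the lookup tables are assumptions for illustration, and the sketch covers only the case where the adjectival form is missing, not the case where it exists but is unsuitable.

```python
# Hypothetical prepositions for each specific relationship.
PREPOSITIONS = {
    "adjective:possessive:alienable": "of",
    "adjective:agent": "by",
}

def realise(relation, place, head, adjectival_forms):
    """Render a place/noun relationship, preferring the adjectival form."""
    adjective = adjectival_forms.get(place)
    if adjective is not None:
        # First fallback: adjective:genitive has a reading for this place.
        return f"{adjective} {head}"
    # Second fallback: use the preposition associated with the relation.
    return f"{head} {PREPOSITIONS[relation]} {place}"
```

For example, with an entry mapping ‘Russia’ to ‘Russian’ the alienable possessive gives ‘Russian gold reserves’, and without one it gives ‘gold reserves of Russia’.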

This is a fairly complicated arrangement, but having looked at the alternatives I’m satisfied that it is warranted. It will not be possible to implement it until word selection is upgraded to allow fallbacks. One redeeming characteristic is that this need not stand in the way of adding lexicon entries: these can be defined in terms of adjective:genitive even if there is no supported method for generating it.

[1] You might ask why terms such as these need to be handled at all if the aim is text generation rather than analysis. There is a good reason: to prevent the term from being generated in inappropriate circumstances if its idiomatic meaning would override the natural one. Also, I want the lexicon to be usable in both directions even if the grammar is not, simply because the cost is low and the potential future benefit is large.

When to count in hundreds rather than thousands

Sunday, April 19th, 2009

In English, numbers such as 1100 can be expressed in two ways: ‘one thousand one hundred’ or ‘eleven hundred’. Both forms are acceptable, but ‘eleven hundred’ appears to be significantly more common (presumably because it is shorter). The same is true in several other languages.

In most languages that allow it at all, use of the hundreds form stops (or becomes much less common) above 1999. Also, it tends not to be used when the number of hundreds is itself an exact multiple of ten (that is to say, in the range 1000 to 1099, 2000 to 2099 and so on).
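These two tendencies can be captured in a small predicate, sketched here in Python. The function name use_hundreds_form and its limit parameter are hypothetical; the default of 1999 reflects the cut-off described above, and a language with a different cut-off would pass a different value.

```python
def use_hundreds_form(n, limit=1999):
    """Decide whether n would naturally be counted in hundreds.

    Returns False below 1000, above the limit, and whenever the number
    of hundreds is an exact multiple of ten (1000-1099, 2000-2099, ...).
    """
    if n < 1000 or n > limit:
        return False
    return (n // 100) % 10 != 0
```

So 1100 and 1999 qualify, 1050 does not, and 2100 qualifies only if the limit is raised.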

Some sources claim that the hundreds form should be used only for exact multiples of a hundred and not for intermediate values. However this certainly doesn’t hold true for dates, and I can see no good reason why it should make a difference for other usage. (It is true that non-multiples are more likely to be written using digits rather than words, and this makes examples hard to find, but I’ve seen enough of them to be confident that they are acceptable.)

I’ve been attempting to find out how this issue is handled in some of the languages supported by the translation system, and my results so far can be found below. Where a clear preference can be identified then I think the translation system should follow it. In marginal or uncertain cases I’m going to err on the side of the thousands form (mainly on the grounds that it is the simpler of the two formats). I would greatly welcome input from those with personal knowledge of these or other languages.

English
Both forms are acceptable. In British English the hundreds form is preferred up to 1999; beyond that point I think it would be considered odd but comprehensible. In American English the hundreds form is preferred up to 9999.

Dutch and Afrikaans
Both forms are acceptable, with the hundreds form preferred up to 1999. (Wikipedia claims 9999 for Dutch, but the evidence I’ve found does not support this. See, for example, this wikibook, which gives ‘elf honderd elf’ for 1111 but ‘negenduizend negenhonderd negenennegentig’ for 9999.)

German
The hundreds form would appear to be dominant (and perhaps obligatory) for dates. For other types of usage it is unclear to what extent the hundreds form is permissible, but it is clearly not preferred.

Norwegian, Swedish and Danish
The very limited amount of evidence that I’ve seen so far suggests that both forms are permissible, but it is unclear which is preferable.

Icelandic, Faroese
Evidence is again very limited, but points towards a fairly strong preference for the hundreds form up to 1999.

French
Both forms are permissible, but the hundreds form is preferred (strongly for 1200-1600, and very strongly for 1100).

Spanish, Portuguese, Catalan, Italian, Romanian
Although the hundreds form is feasible in these languages, it would be considered incorrect (or at least non-standard).

Finnish
According to Numbers and Finnish Numerals the hundreds form is “possible” but “less common than in English”. Usage on the web appears to support this view, to the extent that the hundreds form is significantly less common than the thousands form.

Chinese, Japanese, Korean
The hundreds form would be difficult or impossible to write in these languages.

Sources:
University of Chicago Press, The Chicago Manual of Style, 13th ed
Wikipedia, Names of Numbers in English
T. G. G. Valette, Dutch Conversation-Grammar
Carol Fehringer, A reference grammar of Dutch
Bruce C. Donaldson, A grammar of Afrikaans
Elke Gschossmann-Hendershot, Lois M. Feuerle, Schaum’s Outline of German Grammar
M. H. Offord, A Student Grammar of French
Glanville Price, A comprehensive French grammar
Sonia Celegatti Althoff, Portuguese Grammar
Elijah Clarence Hills et al, A Portuguese grammar
Max Wheeler, Alan Yates, Nicolau Dols, Catalan
E. Lemmi, A Theoretical and Practical Italian Grammar
Giuseppe Rampini, A grammar of the Italian language
Dana Cojocaru, Romanian Grammar
Lauri Karttunen, Numbers and Finnish Numerals

Handling descriptive names of species and subspecies

Saturday, April 4th, 2009

When naming closely-related types of animal it is common practice to make reference to their colour, size, geographic range, or other such attribute. For example, English has the following common names for members of the genus Ursus:

  • Brown Bear (Ursus arctos)
  • American Black Bear (Ursus americanus)
  • Polar Bear (Ursus maritimus)
  • Asiatic Black Bear (Ursus thibetanus)

These compounds are to a significant extent language-independent. For example, names for Ursus arctos (according to WikiSpecies) include:

  • Bruine beer (nl)
  • Braunbär (de)
  • Brun bjørn (da)
  • Brunbjörn (sv)
  • Brunbjørn (nb)
  • Ours brun (fr)
  • Orso Bruno (it)
  • Oso pardo (es)
  • Urso-pardo (pt)
  • Niedźwiedź brunatny (pl)

In the case of Ursus maritimus the correspondence between languages is less impressive:

  • Ijsbeer (nl)
  • Eisbär (de)
  • Isbjørn (da)
  • Isbjörn (sv)
  • Isbjørn (nb)
  • Ours blanc (fr)
  • Orso polare (it)
  • Oso polar (es)
  • Urso-polar (pt)
  • Niedźwiedź polarny (pl)

However, even an imperfect correspondence may be useful as a fallback translation:

  1. Where a fallback holds in a majority of cases, it will greatly reduce the number of language-specific readings that are needed.
  2. Even if the fallback holds only in a minority of cases, it may still be a good choice for languages which do not have an established common name for the species in question.

For example, if English had no name of its own for ‘polar bear’ then ‘white bear’ would be a reasonable enough translation, and while ‘ice bear’ sounds a bit odd it would probably be understood.

A possible explanation for why ‘white bear’ sounds better (IMO) than ‘ice bear’ is that ‘white’ is an adjective whereas ‘ice’ is functioning here as a noun adjunct. Adjectives can typically be used to qualify any noun for which they make sense physically, whereas noun adjuncts tend to be more idiomatic in nature. For this reason, names containing noun adjuncts probably won’t make good fallbacks. (They would in any event be quite difficult to express as fallbacks, due to the unspecified semantic relationship between adjunct and noun.)

[Update 2009-04-08: apparently there is precedent for use of the term 'ice bear' in English, albeit in an alternate reality, and in reference to the panserbjørne as opposed to ordinary non-talking polar bears.]

What is needed, then, is a method for expressing fallbacks within the predicate dictionary. This is a feature I had intended to implement anyway for tasks such as describing a ‘computer’ as a ‘calculating machine’ in languages with no word for the former. Doing so serves the important task of reducing the number of predicates that a language must provide readings for in order to guarantee successful translation.

Note that this concept of a fallback is not the same as the current, partially-implemented mechanism for defining one predicate in terms of others. Fallbacks will not generally be equal in meaning, but merely an acceptable substitute if no better one is available.

One point I have not yet decided is to what extent, if ever, languages should provide readings which merely duplicate what would otherwise be provided by a fallback. If no reading is provided then the language description will become dependent on the content of the fallback. Where the choice of fallback is clear-cut that is probably OK, but often there will be several plausible options. I’m reluctant to guarantee that fallbacks will remain unchanged for all time when they are based on value judgements.

On the other hand, if readings are provided for everything then some of the potential benefits of fallbacks (reduced effort, smaller language definitions) will not be realised. I suspect that some form of compromise is in order, but it is not clear what the basis for that should be.

Generating numbers in Spanish and Portuguese

Monday, March 2nd, 2009

Spanish and Portuguese are closely-related languages, and it is instructive to compare how their number systems work. Both are fairly regular decimal systems, using words that are for the most part phonologically similar. However they are sufficiently different - both orthographically and syntactically - for there to be little or no usable commonality between their respective implementations within the translation system (or at least, no more so than there is, for example, between Spanish and English).

Spanish has single words for all numbers up to 30, after which additive compounds are used. In Portuguese, single words stop at 20. In both cases, the words used for the numbers 16 and upwards are highly regular in form (but currently no attempt is made to exploit that regularity). Both languages also use single words for multiples of 100 up to 1000. Larger values are expressed as multiples of a thousand, million, billion or upwards.

The rules for additive combination are noticeably different:

  • In Spanish, where tens and units are added together this is done using the word ‘y’ (‘and’). In all other cases the components are concatenated.
  • In Portuguese, tens and units are always added to other components using ‘e’ (‘and’). Also, the final two components of the number are always added using ‘e’. In other cases the components are concatenated.

Some examples to illustrate how these rules differ from each other and from English:

  • 101 = ‘ciento uno’ (es), ‘cento e um’ (pt), ‘one hundred and one’ (en)
  • 199 = ‘ciento noventa y nueve’ (es), ‘cento e noventa e nove’ (pt), ‘one hundred and ninety nine’ (en)
  • 1100 = ‘mil cien’ (es), ‘mil e cem’ (pt), ‘one thousand one hundred’ (en)
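The two sets of combination rules can be compared side by side in a Python sketch. It takes components that have already been chosen (each a word paired with a kind such as "hundreds", "tens" or "units") and only decides what goes between them. The function names and the component representation are assumptions, not part of the translation system.

```python
def join_spanish(components):
    """Insert 'y' only between a tens component and a units component."""
    words = [components[0][0]]
    for (_, prev_kind), (word, kind) in zip(components, components[1:]):
        if prev_kind == "tens" and kind == "units":
            words.append("y")
        words.append(word)
    return " ".join(words)

def join_portuguese(components):
    """Insert 'e' before any tens or units, and before the final component."""
    words = [components[0][0]]
    last = len(components) - 1
    for index in range(1, len(components)):
        word, kind = components[index]
        if kind in ("tens", "units") or index == last:
            words.append("e")
        words.append(word)
    return " ".join(words)

# 199 split into its components in each language.
es_199 = [("ciento", "hundreds"), ("noventa", "tens"), ("nueve", "units")]
pt_199 = [("cento", "hundreds"), ("noventa", "tens"), ("nove", "units")]
```

Applied to the components of 199 these give ‘ciento noventa y nueve’ and ‘cento e noventa e nove’ respectively, matching the examples above.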

Both languages have feminine forms for some numbers, which agree with the noun being counted if there is one. Also in both languages the number 100 has two forms: a shorter one for when it acts directly on a noun (including nouns such as million or billion), and a longer one for when it does not. Neither of these language features has been fully implemented yet; however, the latter is present to the extent necessary to support simple counting.

I’ve not encountered any use of hyphens for expressing cardinal numbers in either language.

Both Spanish and (European) Portuguese use the ‘long scale’ for large numbers, so ‘billón’ (es) and ‘bilião’ (pt) translate to 10^12 as opposed to 10^9. Note, however, that Brazilian Portuguese uses the ‘short scale’. This will require a slightly different set of decomposition rules, a fact to be borne in mind when support for dialects is introduced.

It’s worth saying that definitive rules have been difficult to come by in some cases, and a number of conflicting examples have been seen. (It doesn’t help that, as in English, large numbers are rarely written out in full unless they are round ones.) I’ve done my best to identify and implement a defensible set of rules, but I’m open to correction by those more familiar with these languages. (The same applies to any other language or topic.)

Generating numbers in French

Saturday, February 21st, 2009

The French number system is more complex than that of English, partly due to its use of vigesimal in some numbers, and partly due to minor irregularities relating to the use of plurals and the word ‘et’. None of these has proved difficult to implement, but a somewhat larger number of rules has been needed.

In French there are three different ways in which a number acting as a multiplier can be pluralised:

  • The number ‘mille’ (1000) is never pluralised (or at least, not visibly so).
  • The numbers ‘vingt’ (20) and ‘cent’ (100) may be pluralised when they appear at the end of a number, but not otherwise.
  • Other multipliers such as ‘million’ (1000000) and ‘milliard’ (1000000000) may be pluralised wherever they appear.

(When I say ‘may be pluralised’ I mean that they will be iff there is more than one of them.)

I’ve chosen to handle ‘mille’ by allowing it to be marked as plural, but making the plural inflection do nothing. In this sense I’m treating it like the English word ‘sheep’: it isn’t uncountable, because you can have ‘two sheep’ or ‘deux mille’, but it is invariable. Whether this is linguistically correct I don’t know, but it appears to produce the correct behaviour.

For ‘vingt’ and ‘cent’ I’ve introduced a special tag (drops-plural) to distinguish them, and added two agreement rules: one to make them plural when used as a multiplier, and one to convert them back to singular if they are followed by another number. Note that this does depend on the rules being applied from the bottom of the tree upwards, and if that were to change then a different method of implementation may be needed.
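The net effect of the three pluralisation behaviours can be summarised in a short Python sketch. Note that this computes the result directly rather than using a pair of agreement rules as described above, and that the names multiplier_form, INVARIABLE and DROPS_PLURAL are hypothetical (the last borrowed from the drops-plural tag).

```python
INVARIABLE = {"mille"}             # plural inflection does nothing
DROPS_PLURAL = {"vingt", "cent"}   # plural only at the end of the number

def multiplier_form(word, count, is_final):
    """Return the surface form of a multiplier word.

    'count' is how many of the multiplier there are, and 'is_final' says
    whether the word is the last component of the number.
    """
    if count <= 1 or word in INVARIABLE:
        return word
    if word in DROPS_PLURAL and not is_final:
        return word
    return word + "s"
```

This gives ‘cents’ in ‘deux cents’ but ‘cent’ in ‘deux cent un’, leaves ‘mille’ unchanged, and pluralises ‘million’ wherever it appears.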

The components of a compound number may be separated by hyphens, by spaces, or by the conjunction ‘et’ (‘and’). The examples I’ve looked at have not been entirely consistent as to what should be used when, but an acceptable policy would appear to be:

  • use ‘et’ for the numbers 21, 31, 41, 51, 61 and 71 (surrounded by spaces, not hyphens);
  • otherwise use hyphens for numbers less than 100;
  • use spaces elsewhere.

(These rules are intended to apply both to complete numbers and to components of a larger number.)
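
The policy can be stated compactly as a function of the value of the number (or sub-number) whose two parts are being joined. The following Python sketch is only an illustration of the policy, not of how the system implements it; the names ET_VALUES and separator_for are hypothetical.

```python
# Values whose two parts are joined by 'et' under the policy above.
ET_VALUES = {21, 31, 41, 51, 61, 71}


def separator_for(value):
    """Choose the separator used when joining the two parts of `value`.

    `value` is the numeric value of the number, or component of a larger
    number, that is being assembled.
    """
    if value in ET_VALUES:
        return " et "   # surrounded by spaces, not hyphens
    if value < 100:
        return "-"
    return " "
```

So "vingt" + separator_for(21) + "un" yields ‘vingt et un’, while "vingt" + separator_for(22) + "deux" yields ‘vingt-deux’.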

Insertion of ‘et’ has been handled by creating two special-purpose tags for words which can precede or follow ‘et’. When both are present in the correct order, combined using internal:add and with no substructure, the ‘et’ is inserted by a transformation.

Hyphens are not implemented at present because there is no mechanism available to support them (other than enumeration, which would be feasible but undesirable). This is an important issue in its own right, but I’m not ready to address it yet.

The transition from simple numbers (‘seize’, 16) to compounds (‘dix-sept’, 17) presents no difficulty, and is implemented in essentially the same way as the corresponding English transition (20 to 21) but with a lower threshold.

It will be necessary to insert a ‘de’ following the multipliers ‘million’, ‘milliard’ and upwards when they are used to qualify a noun, but that is outside the scope of the grammar that has been written so far.

Generating numbers in English

Saturday, February 14th, 2009

The first number system that I’ve attempted to implement is that of English. Though not perfectly regular, it is one of the less complicated systems in existence, so if the translation system is to have any chance of working generically then it should have no great difficulty implementing it.

English numbers less than one thousand are typically expressed as a sum of hundreds, tens and units, with the exception that the values eleven to nineteen are single numbers (both in isolation and as part of a bigger number). Larger values are expressed as a number of thousands, millions, billions and upwards. Any component with a value of zero is omitted (hence ‘one million and one’, with no mention of the lack of tens, hundreds or thousands).

Implementation of this structure is straightforward using the numerical decomposition rules that were recently added to the language definition syntax. The following specific rules are needed:

  • Numbers in the range 20 to 99 that are not already multiples of ten have their tens and units separated.
  • Numbers in the range 100 to 999 have their hundreds separated from the remainder and expressed as a multiple.
  • Numbers in the range 1000 to 999999 have their thousands separated from the remainder and expressed as a multiple.
  • Similarly for millions, billions, trillions and upwards.

I am reluctantly defaulting to the ‘short scale’ for all dialects of English, where a billion is 10^9.
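
To make the rules concrete, here is a minimal Python sketch of the same decomposition. It does not use the language definition syntax, and it does not reproduce the exact internal form that the system emits: the nested tuples ('add', a, b) and ('mul', count, size), and the names MULTIPLIERS and decompose, are assumptions made purely for illustration. It uses the short scale and stops at trillions.

```python
# Short-scale multipliers, largest first (trillion down to hundred).
MULTIPLIERS = [10**12, 10**9, 10**6, 1000, 100]


def decompose(n):
    """Decompose an integer 0 <= n < 10**15 into nested tuples.

    ('add', a, b) stands for a + b and ('mul', count, size) for
    count * size; plain integers are leaves expressed as a single word.
    """
    if not 0 <= n < 10**15:
        raise ValueError("value outside the range covered by the rules")
    if n < 20:
        return n                      # zero to nineteen are single words
    if n < 100:
        tens, units = divmod(n, 10)
        if units == 0:
            return n                  # already a multiple of ten
        return ("add", tens * 10, units)
    for size in MULTIPLIERS:
        if n >= size:
            count, rest = divmod(n, size)
            product = ("mul", decompose(count), size)
            # A zero remainder is omitted rather than added.
            return product if rest == 0 else ("add", product, decompose(rest))
```

With these assumptions, decompose(256) returns ('add', ('mul', 2, 100), ('add', 50, 6)).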

There is one other valid method of decomposition:

  • Numbers in the range 1100 to 1999 (or 9999 in American English) which are not part of a larger number may alternatively have their hundreds (not thousands) separated from the remainder.

According to the Chicago Manual of Style this is the preferred format where it is allowed. That makes sense because fewer words are needed, and it is natural to prefer the more concise form where a choice is available. I intend to implement it, but for now it has been omitted in the name of simplicity.

For syntactic reasons many of the rules need to be split into two parts: one to handle the case where the remainder is non-zero, and one for when it is zero (and hence omitted).

As I’ve noted previously, it is neither necessary nor generally desirable for the decomposition rules to produce output which directly corresponds to the surface form: decomposition is just the first phase of the translation process, so it needs to produce output suitable for input to subsequent phases. For example, the surface form (prior to linearisation) that the current implementation produces for the number 256 is:

(((2 100) and) (50 6))

but the output of the decomposition phase is:

((internal:add ((internal:add 6) 50)) (internal:mul internal:hundred) 2).

Note the use of internal:add and internal:mul to represent, in the abstract, combination by addition and multiplication respectively. Also internal:hundred, which is needed to distinguish between hundreds that have already been decomposed and those which have not.

Following the decomposition phase, parts of speech are selected given the available readings and the set of allowed grammatical productions. There is a choice to be made here: the grammar can be made sufficiently prescriptive to distinguish comprehensively between allowed utterances and those which are not, or just prescriptive enough to organise phrases that are actually produced by the decomposition rules. The former would be preferable if the productions might also be used for parsing, but as they refer to the untransformed text there is little prospect of that happening. The latter option has therefore been chosen on the grounds of brevity and readability.

The only other rules needed are a set of transformations to convert instances of internal:add and internal:mul into surface form. In most cases this is done just by concatenating the two values. The exception (in British but not American English) is the addition of a one- or two-digit value and one of three digits or more, which is done using ‘and’ (as in ‘two hundred and fifty six’). The transformations are able to distinguish between these situations by looking at how the two parts of the number are tagged.
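
The effect of these transformations can be sketched in Python as follows. This is an illustration only: it represents a number as nested tuples ('add', a, b) and ('mul', count, size) with integer leaves, which is an assumed stand-in for the internal:add and internal:mul structure, and it decides where to insert ‘and’ by comparing numeric values rather than by inspecting tags as the real transformations do. The names WORDS, value and render are hypothetical.

```python
UNITS = ("zero one two three four five six seven eight nine ten eleven "
         "twelve thirteen fourteen fifteen sixteen seventeen eighteen "
         "nineteen").split()
TENS = "twenty thirty forty fifty sixty seventy eighty ninety".split()

# Minimal lexicon mapping leaf values to English words.
WORDS = {i: w for i, w in enumerate(UNITS)}
WORDS.update({20 + 10 * i: w for i, w in enumerate(TENS)})
WORDS.update({100: "hundred", 1000: "thousand", 10**6: "million",
              10**9: "billion", 10**12: "trillion"})


def value(tree):
    """Numeric value of a leaf or of an ('add'|'mul', a, b) tuple."""
    if isinstance(tree, int):
        return tree
    op, a, b = tree
    return value(a) + value(b) if op == "add" else value(a) * value(b)


def render(tree, british=True):
    """Convert a decomposed number into its surface form."""
    if isinstance(tree, int):
        return WORDS[tree]
    op, left, right = tree
    joiner = " "                      # default: plain concatenation
    if op == "add" and british and value(left) >= 100 and value(right) < 100:
        joiner = " and "              # three or more digits + one or two
    return render(left, british) + joiner + render(right, british)
```

For the tree ('add', ('mul', 2, 100), ('add', 50, 6)) this gives ‘two hundred and fifty six’ with british=True and ‘two hundred fifty six’ with british=False.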

Currently each multiplier (hundreds, thousands, millions and so on) needs an explicit pair of decomposition rules to support it, even though the progression (for thousands and upwards) is entirely regular. Because of this the ruleset has to stop somewhere, and I’ve chosen trillions as they are the largest multipliers used in everyday speech. It would be nice to remove this restriction and state the rule more generally. This could be done with a small extension to the rule syntax, but each multiplier would still need to be listed in the lexicon, and for most purposes the current limit is entirely adequate.

One form which I haven’t addressed yet is the use of the indefinite article in place of a leading ‘one’, as in “there are a hundred billion stars in the Galaxy”. I’m not sure to what extent this is ever truly obligatory, but it does appear to be preferable in some cases. It is also possible for ‘one’ to be omitted following the definite article. (To see the distinction between these cases, compare ‘a hundred’ and ‘two hundred’ with ‘the hundred’ and ‘the two hundred’.) The required behaviour here may become clearer when more of the grammar has been written.

Finally, there will be a need to select between singular and plural according to the size of a number. I think this will best be done by assigning a singular or plural tag to the number itself, and then using agreement rules to propagate those tags to other relevant predicates. However, as there are no facilities for doing this yet, nothing has been implemented.