Archive for April, 2009

Geography: Adjectival Forms

Sunday, April 26th, 2009

Many geographical names have a corresponding adjectival form, for example:

  • Africa (African)
  • Mongolia (Mongolian)
  • Cornwall (Cornish)
  • Liverpool (Liverpudlian)

The syntactic behaviour of these terms is straightforward but their semantics are not:

  • ‘Australian wine’ is wine which originated from Australia (indicating origin).
  • ‘Chinese food’ is food of a type which originated from China (indicating origin, but of the type rather than the food itself).
  • an ‘American state’ is one which forms part of the United States of America (indicating inalienable possession).
  • ‘Russian gold reserves’ are gold reserves owned by Russia (indicating alienable possession).
  • the ‘English victory at Agincourt’ was a victory by England (indicating the identity of the agent).
  • the ‘French defeat at Waterloo’ was a defeat of France (indicating the identity of the patient).

(I’ve excluded from this list idiomatic usage such as ‘Chinese whispers’ or ‘Spanish practices’ because idioms can only be handled as special cases: it is simply not possible to deduce their meaning analytically [1]. Also excluded is usage referring to the language rather than the location, such as ‘Italian verb’, because the meaning would not then be expressed in terms of the geographical predicate.)

This is not an issue of ambiguity. On the contrary, in any given context the meaning of the adjective is usually well-defined even if (and this is the important point) more than one of the options is physically plausible. This is particularly apparent in the case of nouns like ‘defeat’ where there can be both an agent and a patient. In the examples given above it is perfectly clear as a matter of language who occupies which role: it is not necessary to know military history to work it out.

To make matters even more complicated, the role can depend on more than just the noun. For example, when referring to the ‘English defeat of the Spanish Armada’ it is clear that England is the actor, not the patient. (One way of explaining this effect would be to take the view that ‘English’ is not acting directly on the noun ‘defeat’, but rather on the noun phrase ‘defeat of the Spanish Armada’. Since this is a different concept with different characteristics, the fact that it casts the modifying adjective in a different role is unsurprising.)

I don’t think it is feasible to fully address issues like these in the lexicon. For starters I have no particular desire to list half a dozen separate readings against each adjective. Even if I did, this would not give the correct behaviour because in addition to the allowed usage it would also permit a wide variety of invalid usage.

My tentative solution is therefore to introduce a level of indirection so that only one reading is needed, and so that there is more opportunity for rules to influence the word selection process. (At present the only rules that execute prior to word selection are decomposition rules, but it was always likely that would change.) The specific mechanism I’m proposing is as follows:

  • A predicate is introduced for internal use within language description files when specifying the meaning of adjectival forms such as ‘English’ and ‘French’. I’m going to call this adjective:genitive.
  • This predicate will not have any fixed meaning, but rather, will represent the difference between the adjectival form and the noun (‘English’ vs ‘England’, ‘French’ vs ‘France’).
  • Further predicates are introduced to represent more specific relationships, such as alienable possession, in cases where use of the adjectival form would be permissible. I’m going to give these names of the form adjective:possessive:alienable.
  • There are no readings for the specific predicates, so generation occurs only by means of fallbacks.
  • The first fallback is to adjective:genitive, so if a suitable adjectival form exists then it is used.
  • The second fallback is to an appropriate preposition (such as ‘of’ or ‘by’), for use when the adjectival form is unsuitable or nonexistent.
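
The two-stage fallback listed above can be sketched roughly as follows. This is only an illustration under stated assumptions: the table of adjectival forms, the choice of preposition for each predicate, and the function name generate are all hypothetical, and the name adjective:agent is invented here simply to follow the naming pattern of adjective:possessive:alienable.

```python
# Hypothetical lexicon: place name -> adjectival form (the adjective:genitive reading).
ADJECTIVAL_FORMS = {
    "England": "English",
    "France": "French",
    "Russia": "Russian",
}

# Hypothetical second-stage fallbacks: specific predicate -> preposition.
# 'adjective:agent' is an invented name used for illustration only.
PREPOSITION_FALLBACKS = {
    "adjective:possessive:alienable": "of",
    "adjective:agent": "by",
}


def generate(predicate, place, noun_phrase):
    """Render noun_phrase as modified by place under a specific predicate."""
    if predicate not in PREPOSITION_FALLBACKS:
        raise KeyError(f"unknown predicate: {predicate}")
    # The specific predicates have no readings of their own, so generation
    # proceeds by fallback. First fallback: adjective:genitive.
    adjective = ADJECTIVAL_FORMS.get(place)
    if adjective is not None:
        return f"{adjective} {noun_phrase}"
    # Second fallback: a preposition, used when no adjectival form is available.
    return f"{noun_phrase} {PREPOSITION_FALLBACKS[predicate]} {place}"


print(generate("adjective:possessive:alienable", "Russia", "gold reserves"))
print(generate("adjective:possessive:alienable", "Hong Kong", "gold reserves"))
```

The first call takes the adjectival route and the second, for a place with no entry in the table, falls through to the preposition. Deciding when an existing adjectival form is unsuitable (as opposed to merely missing) is left out of this sketch.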

This is a fairly complicated arrangement, but having looked at the alternatives I’m satisfied that it is warranted. It will not be possible to implement it until word selection is upgraded to allow fallbacks. One redeeming characteristic is that it need not stand in the way of adding lexicon entries: these can be defined in terms of adjective:genitive even if there is no supported method for generating it.

[1] You might ask why terms such as these need to be handled at all if the aim is text generation rather than analysis. There is a good reason: to prevent the term from being generated in inappropriate circumstances if its idiomatic meaning would override the natural one. Also, I want the lexicon to be usable in both directions even if the grammar is not, simply because the cost is low and the potential future benefit is large.

Geography: Names of Continents

Thursday, April 23rd, 2009

Having decided to provide fallbacks for plant and animal species based on attributes such as geographical range, I now need predicates to represent those attributes - preferably before creating large numbers of readings rather than afterwards. I have begun with ones to represent the continents of the world.

The format I have chosen for the predicate names is:

geog:continent:<name>

Unfortunately there is no consensus as to what criteria a body of land must satisfy in order to qualify as a separate continent. For example, some people consider Europe to be a continent in its own right, but to others it is merely part of a larger continent called Eurasia. A third view is that Europe, Asia and Africa are all part of a single continent called Afro-Eurasia.

I don’t feel that it is necessary or appropriate for the translation system to take sides on this issue. Europe, Eurasia and Afro-Eurasia are all useful concepts which people might want to talk about, whether or not they qualify as continents according to any particular definition. For this reason I have decided on an inclusive approach which is able to accommodate any of the plausible systems that are in common use. The full list is:

geog:continent:africa
geog:continent:afro-eurasia
geog:continent:america
geog:continent:antarctica
geog:continent:asia
geog:continent:australia
geog:continent:eurasia
geog:continent:europe
geog:continent:north-america
geog:continent:south-america

A few points to note:

  • I’ve chosen ‘America’ (as opposed to ‘The Americas’) as the basis for the corresponding predicate name because the namespace makes clear that it refers to the continent, not the country. This decision need not have any bearing on how the name is translated (and indeed, my expectation is that it would normally be rendered in English as ‘The Americas’).
  • Whilst ‘Oceania’ is undoubtedly a useful concept which ought to be represented somehow, it isn’t a continent by any reasonable definition.
  • Similarly ‘Australasia’ is not a continent because it encompasses both Australia and New Zealand.
  • I’m deferring consideration of microcontinents, historical continents and mythical continents for the time being on the grounds that there are more important topics to address first.
  • The precise borders of each continent are unspecified, on the grounds that they are often hard to define and have no bearing on what the continent is called. However I am treating ‘continent’ as a term of physical geography, not a social or geopolitical one. On that basis Switzerland is unquestionably part of Europe (but outside the EU), whereas French Guiana is not (despite being part of the EU).

The semantics that I’m provisionally attaching to these predicates is that they are true when applied to the corresponding continent itself. This is a departure from previous policy, in that it associates the predicate with the noun rather than the adjective (so, for example, geog:continent:europe translates to ‘Europe’ rather than ‘European’). The reason for this approach is the difficulty of defining the adjectival form, about which I will say more soon.
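
As a rough illustration of the naming scheme and the noun-based semantics, the sketch below builds predicate names from the list above and maps each one to an English noun. The function names and the table are hypothetical; the only renderings taken from the text are ‘Europe’ for geog:continent:europe and ‘The Americas’ for geog:continent:america.

```python
PREFIX = "geog:continent:"

# English noun for each continent predicate (table layout is illustrative only).
ENGLISH_NOUNS = {
    "africa": "Africa",
    "afro-eurasia": "Afro-Eurasia",
    "america": "The Americas",  # the name in the predicate need not match the rendering
    "antarctica": "Antarctica",
    "asia": "Asia",
    "australia": "Australia",
    "eurasia": "Eurasia",
    "europe": "Europe",
    "north-america": "North America",
    "south-america": "South America",
}


def continent_predicate(name):
    """Build a predicate name, rejecting anything outside the agreed list."""
    if name not in ENGLISH_NOUNS:
        raise ValueError(f"not a continent predicate: {name}")
    return PREFIX + name


def english_noun(predicate):
    """Translate a continent predicate to the noun (not the adjective)."""
    if not predicate.startswith(PREFIX):
        raise ValueError(f"not in the geog:continent namespace: {predicate}")
    return ENGLISH_NOUNS[predicate[len(PREFIX):]]


print(continent_predicate("europe"))
print(english_noun("geog:continent:america"))
```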

When to count in hundreds rather than thousands

Sunday, April 19th, 2009

In English numbers such as 1100 can be expressed in two ways: ‘one thousand one hundred’ or ‘eleven hundred’. Both forms are acceptable, but ‘eleven hundred’ appears to be significantly more common (presumably because it is shorter). The same is true in several other languages.

In most languages that allow it at all, use of the hundreds form stops (or becomes much less common) above 1999. Also, it tends not to be used when the number of hundreds is itself an exact multiple of ten (that is to say, in the range 1000 to 1099, 2000 to 2099 and so on).

Some sources claim that the hundreds form should be used only for exact multiples of a hundred and not for intermediate values. However this certainly doesn’t hold true for dates, and I can see no good reason why it should make a difference for other usage. (It is true that non-multiples are more likely to be written using digits rather than words, and this makes examples hard to find, but I’ve seen enough of them to be confident that they are acceptable.)
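
The rule described so far (hundreds form up to some limit, but not where the number of hundreds is a multiple of ten) can be sketched for English as below. This is a minimal illustration rather than the translation system's actual decomposition rules: the function names are invented, only 0 to 9999 is handled, and the word ‘and’ is left out for simplicity.

```python
ONES = ["zero", "one", "two", "three", "four", "five", "six", "seven", "eight",
        "nine", "ten", "eleven", "twelve", "thirteen", "fourteen", "fifteen",
        "sixteen", "seventeen", "eighteen", "nineteen"]
TENS = ["", "", "twenty", "thirty", "forty", "fifty", "sixty", "seventy",
        "eighty", "ninety"]


def below_hundred(n):
    """Spell out 0..99."""
    if n < 20:
        return ONES[n]
    tens, ones = divmod(n, 10)
    return TENS[tens] if ones == 0 else f"{TENS[tens]}-{ONES[ones]}"


def use_hundreds_form(n, limit=1999):
    """True if n should be counted in hundreds rather than thousands."""
    # Excluded when the number of hundreds is a multiple of ten
    # (1000-1099, 2000-2099 and so on) or when n is above the limit.
    return 1100 <= n <= limit and (n // 100) % 10 != 0


def spell(n, limit=1999):
    """Spell out 0..9999, counting in hundreds where the rule allows it."""
    if not 0 <= n <= 9999:
        raise ValueError("only 0..9999 is supported in this sketch")
    if n < 100:
        return below_hundred(n)
    if n < 1000 or use_hundreds_form(n, limit):
        hundreds, rest = divmod(n, 100)
        head = f"{below_hundred(hundreds)} hundred"
    else:
        thousands, remainder = divmod(n, 1000)
        hundreds, rest = divmod(remainder, 100)
        head = f"{ONES[thousands]} thousand"
        if hundreds:
            head += f" {ONES[hundreds]} hundred"
    return head if rest == 0 else f"{head} {below_hundred(rest)}"


print(spell(1100))              # eleven hundred
print(spell(1100, limit=0))     # one thousand one hundred
print(spell(2500, limit=9999))  # twenty-five hundred
```

Passing the limit as a parameter keeps the cut-off (1999, 9999, or none at all) separate from the spelling logic, so that it can vary by language or dialect.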

I’ve been attempting to find out how this issue is handled in some of the languages supported by the translation system, and my results so far can be found below. Where a clear preference can be identified then I think the translation system should follow it. In marginal or uncertain cases I’m going to err on the side of the thousands form (mainly on the grounds that it is the simpler of the two formats). I would greatly welcome input from those with personal knowledge of these or other languages.

English
Both forms are acceptable. In British English the hundreds form is preferred up to 1999; beyond that point I think it would be considered odd but comprehensible. In American English the hundreds form is preferred up to 9999.

Dutch and Afrikaans
Both forms are acceptable, with the hundreds form preferred up to 1999. (Wikipedia claims 9999 for Dutch, but the evidence I’ve found does not support this. See, for example, this wikibook, which gives ‘elf honderd elf’ for 1111 but ‘negenduizend negenhonderd negenennegentig’ for 9999.)

German
The hundreds form would appear to be dominant (and perhaps obligatory) for dates. For other types of usage it is unclear to what extent the hundreds form is permissible, but it is clearly not preferred.

Norwegian, Swedish and Danish
The very limited amount of evidence that I’ve seen so far suggests that both forms are permissible, but it is unclear which is preferable.

Icelandic, Faroese
Evidence is again very limited, but points towards a fairly strong preference for the hundreds form up to 1999.

French
Both forms are permissible, but the hundreds form is preferred (strongly for 1200-1600, and very strongly for 1100).

Spanish, Portuguese, Catalan, Italian, Romanian
Although the hundreds form is feasible in these languages, it would be considered incorrect (or at least non-standard).

Finnish
According to Numbers and Finnish Numerals the hundreds form is “possible” but “less common than in English”. Usage on the web appears to support this view, to the extent that the hundreds form is significantly less common than the thousands form.

Chinese, Japanese, Korean
The hundreds form would be difficult or impossible to write in these languages.
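
One way to record these provisional findings is as a table of upper limits, with the thousands form as the default wherever the evidence is marginal or unclear. The sketch below is hypothetical throughout: the language codes, the table name and the function are illustrative, and the French limit of 1699 is an assumed reading of ‘1200-1600’ above.

```python
# Upper limit for the hundreds form in non-date usage; None means "always use
# the thousands form". Values follow the provisional findings listed above.
HUNDREDS_LIMIT = {
    "en-GB": 1999,
    "en-US": 9999,
    "nl": 1999,
    "af": 1999,
    "is": 1999,
    "fo": 1999,
    "fr": 1699,   # assumption: '1200-1600' read as covering up to 1699
    "de": None,   # permissible extent unclear, and clearly not preferred
    "nb": None, "sv": None, "da": None,   # unclear: err towards thousands
    "es": None, "pt": None, "ca": None, "it": None, "ro": None,
    "fi": None,   # possible but less common
    "zh": None, "ja": None, "ko": None,
}


def prefers_hundreds(lang, n):
    """True if n would be counted in hundreds under the table above."""
    limit = HUNDREDS_LIMIT.get(lang)
    if limit is None:
        return False
    # Not used when the number of hundreds is itself a multiple of ten.
    return 1100 <= n <= limit and (n // 100) % 10 != 0


print(prefers_hundreds("en-US", 2500))
print(prefers_hundreds("en-GB", 2500))
```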

Sources:
University of Chicago Press, The Chicago Manual of Style, 13th ed
Wikipedia, Names of Numbers in English
T. G. G. Valette, Dutch Conversation-Grammar
Carol Fehringer, A reference grammar of Dutch
Bruce C. Donaldson, A grammar of Afrikaans
Elke Gschossmann-Hendershot, Lois M. Feuerle, Schaum’s Outline of German Grammar
M. H. Offord, A Student Grammar of French
Glanville Price, A comprehensive French grammar
Sonia Celegatti Althoff, Portuguese Grammar
Elijah Clarence Hills et al, A Portuguese grammar
Max Wheeler, Alan Yates, Nicolau Dols, Catalan
E. Lemmi, A Theoretical and Practical Italian Grammar
Giuseppe Rampini, A grammar of the Italian language
Dana Cojocaru, Romanian Grammar
Lauri Karttunen, Numbers and Finnish Numerals

Introduction of the ‘cardinal’ predicate

Thursday, April 16th, 2009

Once the library is released I will want to avoid making unnecessary changes to its public interface that are not backwards-compatible. The interface obviously includes the API of the library, but also extends to the syntax and semantics of the source language that the library accepts. Attempting to stabilise the whole of the source language at this stage would seriously hinder its development, so I need to think carefully before declaring that any language feature will be a permanent fixture.

(This is not to say that incompatible changes will never be made, but I don’t want to make a habit of it.)

One language feature that is worrying me is the way in which raw integers are interpreted as cardinals. This makes it quite difficult to prevent the decomposition rules for cardinals from acting on any integers in the source text: you need to either make sure that any alternative rules take precedence (as I’ve tended to do with ordinals) or accept that the decomposition will take place and work with the result.

So far this has not caused any insurmountable problems, and it may never do so, but I’m not yet convinced that raw integers as cardinals should become a permanent language feature. I need to provide something, though, because cardinal number generation will be one of the main features of the first release. The solution I’ve chosen is to qualify cardinal numbers with the predicate cardinal (just as ordinal numbers are qualified by ordinal).

This approach is future-proof because, in the event of a future change of policy, it will be very easy for the language descriptions to strip off the cardinal predicate and ignore it. (In fact that is precisely what the language descriptions do right now, but the point is that they are not currently required to work that way whereas they could be in the future.) The only commitment I am making is not to use the predicate cardinal for anything else in the global namespace, which is not a major burden.
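
To illustrate how cheap it is to strip the qualifier, here is a small sketch. It does not use the real source-language syntax: representing a qualified term as a (predicate, argument) tuple is purely an assumption made for the example, as is the function name.

```python
def strip_cardinal(term):
    """Return the bare integer if term is qualified by 'cardinal'.

    Terms are modelled here as (predicate, argument) tuples, which is a
    hypothetical representation chosen for this sketch only.
    """
    if isinstance(term, tuple) and len(term) == 2 and term[0] == "cardinal":
        return term[1]
    # Anything else (for example an ordinal) is passed through untouched.
    return term


print(strip_cardinal(("cardinal", 42)))
print(strip_cardinal(("ordinal", 3)))
```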

Handling descriptive names of species and subspecies

Saturday, April 4th, 2009

When naming closely-related types of animal it is common practice to make reference to their colour, size, geographic range, or other such attribute. For example, English has the following common names for members of the genus Ursus:

  • Brown Bear (Ursus arctos)
  • American Black Bear (Ursus americanus)
  • Polar Bear (Ursus maritimus)
  • Asiatic Black Bear (Ursus thibetanus)

These compounds are to a significant extent language-independent. For example, names for Ursus arctos (according to WikiSpecies) include:

  • Bruine beer (nl)
  • Braunbär (de)
  • Brun bjørn (da)
  • Brunbjörn (sv)
  • Brunbjørn (nb)
  • Ours brun (fr)
  • Orso Bruno (it)
  • Oso pardo (es)
  • Urso-pardo (pt)
  • Niedźwiedź brunatny (pl)

In the case of Ursus maritimus the correspondence between languages is less impressive:

  • Ijsbeer (nl)
  • Eisbär (de)
  • Isbjørn (da)
  • Isbjörn (sv)
  • Isbjørn (nb)
  • Ours blanc (fr)
  • Orso polare (it)
  • Oso polar (es)
  • Urso-polar (pt)
  • Niedźwiedź polarny (pl)

However, even an imperfect correspondence may be useful as a fallback translation:

  1. Where a fallback holds in a majority of cases, it will greatly reduce the number of language-specific readings that are needed.
  2. Even if the fallback holds only in a minority of cases, it may still be a good choice for languages which do not have an established common name for the species in question.

For example, if English had no name of its own for ‘polar bear’ then ‘white bear’ would be a reasonable enough translation, and while ‘ice bear’ sounds a bit odd it would probably be understood.

A possible explanation for why ‘white bear’ sounds better (IMO) than ‘ice bear’ is that ‘white’ is an adjective whereas ‘ice’ is functioning here as a noun adjunct. Adjectives can typically be used to qualify any noun for which they make sense physically, whereas noun adjuncts tend to be more idiomatic in nature. For this reason, names containing noun adjuncts probably won’t make good fallbacks. (They would in any event be quite difficult to express as fallbacks, due to the unspecified semantic relationship between adjunct and noun.)

[Update 2009-04-08: apparently there is precedent for use of the term 'ice bear' in English, albeit in an alternate reality, and in reference to the panserbjørne as opposed to ordinary non-talking polar bears.]

What is needed, then, is a method for expressing fallbacks within the predicate dictionary. This is a feature I had intended to implement anyway for tasks such as describing a ‘computer’ as a ‘calculating machine’ in languages with no word for the former. Doing so serves the important task of reducing the number of predicates that a language must provide readings for in order to guarantee successful translation.

Note that this concept of a fallback is not the same as the current, partially-implemented mechanism for defining one predicate in terms of others. Fallbacks will not generally be equal in meaning, but merely an acceptable substitute if no better one is available.
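
A minimal sketch of how such a fallback might behave is given below. None of it reflects the actual predicate dictionary format: the predicate names, the made-up language code ‘xx’ and the function translate are all hypothetical, and joining the parts with a space in a fixed order is a simplification that a real language description would need to control.

```python
# Hypothetical readings: language -> predicate -> word.
READINGS = {
    "en": {
        "species:ursus-maritimus": "polar bear",
        "animal:bear": "bear",
        "colour:white": "white",
    },
    # 'xx' stands for an imaginary language with no name for the species.
    "xx": {
        "animal:bear": "bear",
        "colour:white": "white",
    },
}

# Hypothetical fallback: an acceptable substitute, not an equivalent definition.
FALLBACKS = {
    "species:ursus-maritimus": ["colour:white", "animal:bear"],
}


def translate(predicate, lang):
    """Use the language's own reading if it has one, else the fallback."""
    readings = READINGS[lang]
    if predicate in readings:
        return readings[predicate]
    parts = FALLBACKS.get(predicate)
    if parts is None:
        raise LookupError(f"no reading or fallback for {predicate} in {lang}")
    return " ".join(translate(part, lang) for part in parts)


print(translate("species:ursus-maritimus", "en"))  # polar bear
print(translate("species:ursus-maritimus", "xx"))  # white bear
```

A language's own reading always wins in this sketch, so the fallback is consulted only when nothing better is available.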

One point I have not yet decided is to what extent, if ever, languages should provide readings which merely duplicate what would otherwise be provided by a fallback. If no reading is provided then the language description will become dependent on the content of the fallback. Where the choice of fallback is clear-cut that is probably OK, but often there will be several plausible options. I’m reluctant to guarantee that fallbacks will remain unchanged for all time when they are based on value judgements.

On the other hand, if readings are provided for everything then some of the potential benefits of fallbacks (reduced effort, smaller language definitions) will not be realised. I suspect that some form of compromise is in order, but it is not clear what the basis for that should be.

Migration to the GNU build system

Thursday, April 2nd, 2009

Originally I had expected that it would take several years at least to reach the point where a formal release was worthwhile. In terms of creating a general-purpose translation system I think that is still true, but the ability to generate cardinal numbers - now in over 40 languages - is useful in its own right. For this reason I’m now looking to identify and resolve any issues which would stand in the way of a release.

Until very recently the project has been built using a bespoke set of makefiles. I’ve now removed these and replaced them with scripts for driving the GNU Autotools suite. There were two main drivers for this:

  • portability, in particular with respect to shared library implementations; and
  • providing a full set of standard makefile targets.

As a result of this, the procedure for building from the Subversion repository is now as follows:

./bootstrap
./configure
make

and to run the regression tests:

make check

There are some important documentation files missing from the distribution tarball, but it will now pass a distcheck test and on that basis can be declared to be ‘working’.

One important change to the code is that the path for locating language definition files is now compiled into the library: for a default prefix of /usr/local the path would be /usr/local/share/babel/languages. In order to allow the regression tests to run without installing the files first, the path can be overridden at runtime by setting the environment variable LIBBABELPATH.
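
The lookup order can be sketched as follows. This is not the library's code (which is C++); it is a Python illustration of the behaviour described above, with a hypothetical function name and the assumption that an empty LIBBABELPATH is treated the same as an unset one.

```python
import os

DEFAULT_PREFIX = "/usr/local"


def language_path(prefix=DEFAULT_PREFIX, environ=None):
    """Return the directory searched for language definition files."""
    env = os.environ if environ is None else environ
    # A runtime override takes priority over the compiled-in path.
    # Assumption: an empty value is treated as if the variable were unset.
    override = env.get("LIBBABELPATH")
    if override:
        return override
    # Otherwise use the path derived from the installation prefix.
    return f"{prefix}/share/babel/languages"


print(language_path(environ={}))
print(language_path(environ={"LIBBABELPATH": "./languages"}))
```

The second call shows the situation the regression tests rely on: pointing the library at an uninstalled copy of the language files.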

The library API is not at all stable yet, so do not place any reliance on the soname. I intend to greatly cut down on the size of the public interface before releasing anything, quite possibly by hiding the entire C++ interface and providing a minimal one written in C. (Some of the C++ interface might then be re-exposed in the future, but only when it is much more stable than it is now.)