Archive for June, 2009

First Release, and a Request for Assistance

Saturday, June 20th, 2009

When I first announced this project a year ago I indicated that a generally-usable release of the translation system was a long way off. That is still true so far as its ultimate goals are concerned, but the system need not be complete in order to be useful, and I believe the point has now been reached where a release is appropriate.

Of course the code has been released in one sense already, as its source code repository is publicly accessible, but that is not the same as having checked and tested releases with version numbers. Two types of download are available: a platform-independent source tarball, and a binary distribution for RISC OS. I hope to provide binary packages for Debian and Ubuntu at some point in the future.

The features supported by this release are the generation of cardinal numbers (in 54 languages) and ordinal numbers (in 24 languages). This can be done either by using the translation system directly as a library, or via one of the command-line utility programs with which it is supplied.

One major caveat is that the output for most languages has not been checked by anyone fluent in those languages. For this reason I am now asking for help from volunteers with the necessary language skills. This could be:

  • as a reviewer, with responsibility for checking the output produced for a given language;
  • as a consultant, willing to answer occasional questions about a language.

If you would be willing to help then please contact me (username=gdshaw, domain=riscpkg.org). At the time of writing, assistance with any language except English (but including non-UK dialects of English) would be useful.

Predicate Orthogonality

Sunday, June 14th, 2009

Words in natural languages often represent a combination of distinct concepts which could otherwise be expressed independently of each other. For the most part I intend to avoid this practice when defining predicates, keeping independent concepts separate unless there is good reason to do otherwise.

For example, a ‘colt’ is an animal which is (a) male, (b) under four years of age, and (c) a member of the subspecies Equus ferus caballus. Each of these concepts is capable of being changed independently of the others, so it makes sense for each to be represented by a separate predicate. This increases the length of the source text, but has two notable benefits:

  • It reduces the number of predicates that need to be provided by the source language.
  • By making the source text semantics more explicit, it reduces the likelihood that they will be unnecessarily overspecified.
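
As a minimal sketch of the idea (in Python, not the actual source language of the translation system), the 'colt' example can be modelled as a set of independent predicates. The names `Predicate`, `describe`, `MALE`, `FEMALE`, `YOUNG` and `HORSE` are hypothetical and chosen purely for illustration; `YOUNG` stands in loosely for 'under four years of age'.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Predicate:
    """A single, independent concept (hypothetical representation)."""
    name: str


# Hypothetical predicates, one per orthogonal concept.
MALE = Predicate("male")
FEMALE = Predicate("female")
YOUNG = Predicate("young")  # stands in for 'under four years of age'
HORSE = Predicate("Equus ferus caballus")


def describe(*predicates):
    """Combine independent predicates into one description."""
    return frozenset(predicates)


# A 'colt' is expressed as three separate predicates rather than one.
colt = describe(MALE, YOUNG, HORSE)

# Changing one concept leaves the others untouched.
filly = (colt - {MALE}) | {FEMALE}
```

Here four predicates are enough to describe both animals, whereas a word-per-combination approach would need a dedicated predicate for each of 'colt' and 'filly', and more for every further combination.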

The second point is an important one because different languages do not necessarily provide words that express the same combinations of concepts. For example, some languages might have a word for Equus ferus caballus, but not for male, female, young or old members of that subspecies. In that case the word selection algorithm needs to know whether it is sufficient to identify the animal as a ‘horse’ or necessary to describe it as a ‘young male horse’. For this to be possible the source text must distinguish between essential and optional attributes, and that is most easily achieved when the attributes in question are expressed separately.

(To be clear, the word selection algorithm isn’t able to support this behaviour yet, but I’m designing the source language on the assumption that it will need to in the future.)
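
To make the distinction between essential and optional attributes concrete, here is a rough sketch of how word selection might work once attributes are expressed separately. It is an assumption-laden illustration, not the algorithm used by the translation system: the lexicons, the concept names and the function `select_words` are all hypothetical, and the greedy strategy shown is just one simple possibility that does not guarantee the shortest wording.

```python
# Hypothetical lexicons: each word maps to the concepts it expresses.
WITH_COLT = {
    "colt": frozenset({"horse", "male", "young"}),
    "horse": frozenset({"horse"}),
    "male": frozenset({"male"}),
    "young": frozenset({"young"}),
}
# A language with no single word for a young male horse.
WITHOUT_COLT = {w: c for w, c in WITH_COLT.items() if w != "colt"}


def select_words(lexicon, essential, optional=frozenset()):
    """Greedily pick words that cover every essential concept.

    A word is usable only if it expresses nothing beyond the essential
    and optional concepts, so the output never says more than the
    source text permits.
    """
    allowed = essential | optional
    uncovered = set(essential)
    chosen = []
    while uncovered:
        candidates = [
            (word, concepts)
            for word, concepts in lexicon.items()
            if concepts <= allowed and concepts & uncovered
        ]
        if not candidates:
            raise LookupError(f"cannot express: {sorted(uncovered)}")
        # Prefer the word covering the most uncovered concepts, then
        # the least specific word, then alphabetical order.
        word, concepts = min(
            candidates,
            key=lambda wc: (-len(wc[1] & uncovered), len(wc[1]), wc[0]),
        )
        chosen.append(word)
        uncovered -= concepts
    return chosen


ALL = frozenset({"horse", "male", "young"})
SPECIES = frozenset({"horse"})
EXTRAS = frozenset({"male", "young"})

select_words(WITH_COLT, ALL)                 # ['colt']
select_words(WITHOUT_COLT, ALL)              # ['horse', 'male', 'young']
select_words(WITHOUT_COLT, SPECIES, EXTRAS)  # ['horse']
```

The returned list is in selection order only. Arranging the words into a grammatical phrase such as 'young male horse' is a separate, language-specific step that this sketch does not attempt.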

Maximally independent sets of concepts can be said to be ‘orthogonal’, and typically that is what I’ll be aiming for. However, it is possible to have too much of a good thing, and there are some concepts for which a strictly orthogonal approach may be more trouble than it is worth. Examples include:

  • the points of the compass, which could be separated into north-south and east-west components. This would reduce the number of predicates needed, but only by a small number, and at the expense of significant verbosity. On balance I think it will be simpler and easier just to list the available directions.
  • colours, for which I think some decomposition will be useful, but only up to a point. For example, I think it will be useful to separate hue from saturation and value (or lightness), but I would not want to define yellow as being half-way between green and red.
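
Both trade-offs can be illustrated with a short sketch. The compass part assumes an eight-point compass, which is my assumption for the example rather than a decision about which directions will be available. The colour part uses Python's standard `colorsys` module to show that, in a hue/saturation/value model, yellow does indeed sit half-way between red and green on the hue axis; the helper `hue_degrees` is a name introduced here for illustration.

```python
import colorsys

# Compass: listing directions versus composing them from components.
# An eight-point compass is assumed purely for this example.
LISTED = ["north", "north-east", "east", "south-east",
          "south", "south-west", "west", "north-west"]
COMPONENTS = ["north", "south", "east", "west"]

predicates_saved = len(LISTED) - len(COMPONENTS)  # 4

# The cost: an intercardinal direction needs two predicates, not one.
north_east_listed = ["north-east"]
north_east_composed = ["north", "east"]


def hue_degrees(r, g, b):
    """Return the hue of an RGB colour (components 0.0 to 1.0) in degrees."""
    hue, _saturation, _value = colorsys.rgb_to_hsv(r, g, b)
    return round(hue * 360)


red = hue_degrees(1.0, 0.0, 0.0)     # 0
green = hue_degrees(0.0, 1.0, 0.0)   # 120
yellow = hue_degrees(1.0, 1.0, 0.0)  # 60
```

Separating hue from saturation and value is a decomposition that tools already support, but a definition such as "hue 60" is of little use to a reader or translator who simply needs the word 'yellow', which is why a dedicated predicate for each named hue seems the more practical choice.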

Finally, it is worth noting that expressions with substantial idiomatic content cannot and should not be decomposed into orthogonal components. For example, ‘North Dakota’ means more than simply the intersection of ‘North’ and ‘Dakota’: it refers to a very specific geopolitical entity. It would be difficult to define ‘North Dakota’ in terms of other predicates, and impossible to do so concisely, so the only viable option is to provide a separate predicate dedicated to that meaning.