UTF-8 support added

January 3rd, 2009

I wrote in November about the limitations of the standard regex engine when processing UTF-8 encoded text. These have now become sufficiently annoying to make an upgrade worthwhile. The regex engine I’ve decided to use is the one provided by the PCRE library.

The issue that brought this to a head was the realisation that, while adding accented characters to a set was relatively straightforward, subtracting them was not (other than by inverting the set and listing all allowed characters explicitly, which can be rather cumbersome).
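As a rough illustration of the subtraction problem, the sketch below uses Python's re module rather than the POSIX or PCRE engines discussed here. Matching on encoded bytes stands in for an engine that is not Unicode-aware, and matching on strings stands in for one that is. The helper names are hypothetical.

```python
import re

def matches_bytes(pattern: str, text: str) -> bool:
    """Match on UTF-8 bytes, as an engine that is not Unicode-aware would."""
    return re.fullmatch(pattern.encode("utf-8"), text.encode("utf-8")) is not None

def matches_chars(pattern: str, text: str) -> bool:
    """Match on code points, as a Unicode-aware engine would."""
    return re.fullmatch(pattern, text) is not None

# Intended meaning: "any single character except e-acute".
NOT_E_ACUTE = "[^é]"

# At byte level the set excludes the two bytes of é individually, and can
# only ever consume one byte, so the two-byte character è is rejected.
byte_result = matches_bytes(NOT_E_ACUTE, "è")

# At character level the same pattern does what was intended.
char_result = matches_chars(NOT_E_ACUTE, "è")
```

Here byte_result is False and char_result is True, which is the behaviour that makes a negated set unusable for accented characters on a byte-oriented engine.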

Currently it is possible to switch between the two regex engines at compile time by setting the makefile variable USE_PCRE to true (1) or false (0). This is only a temporary measure: as the language definitions come to depend on UTF-8 support the older engine will soon become unusable, at which point it will be removed to avoid clutter.

An important topic that I have not addressed yet is how Unicode input should be normalised. (I don’t think that there is any doubt that it should be normalised, but there is more than one algorithm that could be used and it isn’t yet clear to me which is preferable.) There may also be a need to support character classes (both the standard ones, and perhaps also user-defined character classes to handle language-dependent groupings such as ‘vowels’ or ‘consonants’). Two-level morphology remains an option for the future, but definitely not the near future.
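To show why normalisation matters, here is a minimal sketch using Python's standard unicodedata module. NFC and NFD are two of the standard Unicode normalisation forms; which form this project should adopt is left open, and the example does not imply a choice.

```python
import unicodedata

composed = "\u00e9"      # é as a single precomposed code point
decomposed = "e\u0301"   # e followed by a combining acute accent

# The two strings render identically but compare unequal until normalised.
equal_before = composed == decomposed

nfc = unicodedata.normalize("NFC", decomposed)  # composes where possible
nfd = unicodedata.normalize("NFD", composed)    # decomposes where possible

equal_after_nfc = nfc == composed
equal_after_nfd = nfd == decomposed
```

Without a normalisation step, a pattern written with one form of é would fail to match input that uses the other.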

Ambiguity versus imprecision

December 22nd, 2008

One of the main design goals of the source language (BabelScript) is to avoid ambiguity, but this does not mean that information must be conveyed in perfect detail. On the contrary, it has become very clear as I’ve started writing predicate definitions that a degree of imprecision is often a necessity.

For example, consider the statement “the sky is blue”. What is the meaning of ‘blue’ in this sentence? The shade known as ‘sky blue’ is an obvious possibility, but the writer hasn’t said that, and in reality the colour of the sky varies enormously depending on the time of day, weather, and zenith angle. There is not enough information for the reader to deduce whether the sky is actually sky blue, or medium blue, or some other shade. Significantly, it is quite conceivable that the writer did not know the precise colour (or had not given the matter any thought).

Does this make the statement defective in some way? Not at all. There are details which it does not specify, but that is true of most writing. Indeed, natural language would be considerably less useful if it were not able to convey incomplete or imprecise information: it should be possible to comment on the colour of the sky without first measuring it with a photometer.

Why, then, all of the fuss about avoiding ambiguity? The concern is to avoid creating predicates that require a deep understanding of the context before they can be translated.

An example is the English word ‘revolting’, which can refer either to a cause of revulsion or to an act of rebellion. Individual sentences, such as “the peasants are revolting”, do not necessarily contain enough information to deduce what is intended, but the intent must be known prior to translation because other languages are likely to use different words for the two concepts.

Key differences between ambiguity and imprecision are that:

  1. Ambiguities must be resolved by the reader in order to correctly interpret the text, whereas imprecisions need not be.
  2. Ambiguities typically correspond to large (often qualitative) differences of meaning, imprecisions to smaller (usually quantitative) variations.
  3. Imprecision is a useful tool for conveying incomplete information, whereas ambiguity is useful only where the intent is to pun or dissemble.

If imprecision is considered useful then some means should be found for representing it within the source language. One way in which this can be done is through a class hierarchy, an example being the taxonomic classification of animals and plants. This gives the means to identify an individual species if that information is known (such as V. vulpes, the red fox), but also provides the option of stating only the genus (Vulpes, foxes), the family (Canidae, dogs), the order (Carnivora, carnivores) or the class (Mammalia, mammals). Some species are subdivided into subspecies, allowing even greater precision.
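The class-hierarchy approach can be sketched as a table of parent links, using the taxonomic example above. The data structure and the function names are hypothetical and say nothing about how BabelScript will actually store such hierarchies.

```python
# Hypothetical parent links, taken from the red fox example in the text.
PARENT = {
    "Vulpes vulpes": "Vulpes",
    "Vulpes": "Canidae",
    "Canidae": "Carnivora",
    "Carnivora": "Mammalia",
}

def ancestors(name):
    """Return the increasingly imprecise classes above `name`, nearest first."""
    chain = []
    while name in PARENT:
        name = PARENT[name]
        chain.append(name)
    return chain

def is_a(specific, general):
    """True if `specific` is the same class as `general` or lies beneath it."""
    return specific == general or general in ancestors(specific)

fox_chain = ancestors("Vulpes vulpes")
```

A writer who knows only that the animal is some kind of fox would name Vulpes, and a statement about Canidae would still apply to it because is_a("Vulpes", "Canidae") holds.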

An alternative method is to take a predicate with a more precise meaning, then qualify it in some way to indicate that the precision should be reduced. This is a process which often happens in natural language (hence words such as reddish, smallish, and roundish), and is particularly useful where the imprecise concepts would otherwise be difficult to name.

Whichever method is chosen, the aim should be to approximate a similar level of granularity to that present in typical natural languages. For this reason, it may be appropriate for the degree of granularity to vary between different parts of a hierarchy. (Compare, for example, the phylum Chordata - which includes all mammals, birds, reptiles, amphibians and fish - with the phylum Nematoda - which consists entirely of nematode worms.)

Finally, there will be a need to identify what are known as ‘base level’ concepts. These represent the preferred level of detail if there is no good reason for specifying more or less. Exactly what form this will take has not been decided yet; however, my working assumption is that it is a language-dependent phenomenon and therefore belongs in the language definition, not the predicate dictionary.

Predicate semantics: adjectives or abstract nouns?

December 16th, 2008

A topic I’ve touched on before, but which deserves a more thorough explanation, is exactly what semantics are attached to a predicate when it corresponds to a concept such as the colour green, or the metal iron, or the property of being triangular in shape. There are two possibilities that I’ve considered:

  • the predicate is true for anything that is green, or which is composed of iron, or which is triangular in shape;
  • the predicate is true only for the abstract entities of the colour green, or the element iron, or the shape of a triangle.

There is a distinction between the two, most clearly shown in the case of shapes: the difference between saying that “x is triangular” and “x is a triangle”. Since it results in different text there is a clear need to represent this distinction. That is straightforward enough, but the method needed to achieve it depends on the semantics chosen:

  • if a predicate is true for anything green then it can be qualified by a second attribute true for anything that is an abstract colour. The combination is true for anything that is the abstract colour green.
  • if a predicate is true for the abstract colour green then it can be qualified by a second attribute meaning ‘is-coloured’. The combination is true for anything that is coloured green.

I favour the first option for two reasons, one theoretical and one practical. The theoretical consideration is one of orthogonality: a desire to create predicates that are fully independent of each other. The property of being green is orthogonal to the property of being an abstract colour: being one neither implies nor prohibits the other. A predicate corresponding to the abstract colour green fails to separate these concepts. I therefore conclude that the first two are suitable candidates to be represented by atomic predicates, whereas the third is more naturally represented by a compound.
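The orthogonality argument can be illustrated with a toy model in which predicates are ordinary boolean functions and a compound is their conjunction. Everything here (the Entity type, its fields, and the function names) is a hypothetical device for illustration, not part of the system.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Entity:
    is_green: bool
    is_abstract_colour: bool

# Two atomic predicates that are independent of each other.
def green(x: Entity) -> bool:
    return x.is_green

def abstract_colour(x: Entity) -> bool:
    return x.is_abstract_colour

# The abstract colour green is then a compound of the two.
def colour_green(x: Entity) -> bool:
    return green(x) and abstract_colour(x)

leaf = Entity(is_green=True, is_abstract_colour=False)
the_colour_green = Entity(is_green=True, is_abstract_colour=True)
the_colour_red = Entity(is_green=False, is_abstract_colour=True)
```

All four combinations of the two atomic predicates are possible, which is what it means for them to be orthogonal; only the_colour_green satisfies the compound.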

The practical issue is that it is much more common to talk about objects that are green than about the colour green itself. Given the choice, it is preferable for the more frequently used form to be the more concise one. Although this is ultimately just a matter of convenience I would attach significant weight to it: having to refer to a fox that ‘is coloured brown’ and with ‘movement that is quick’ would soon become very tiresome.

Fortunately both arguments lead in the same direction, so I think the decision is an easy one. There may sometimes be a need to explicitly spell out concepts such as ‘coloured green’, in which case a method will need to be found to express that, but my intention is that unqualified predicates such as ‘green’ or ‘brown’ will correspond semantically to adjectives and not to abstract nouns.

Using UTF-8 with a regex engine that is not Unicode-aware

November 24th, 2008

The regular expression engine provided by POSIX - which I’m currently using in the C++ library - does not provide any explicit support for Unicode. Fortunately, thanks to the properties of UTF-8, for the most part it doesn’t need to.

All of the characters that have special meanings within a regular expression pattern have Unicode values of less than 128, and therefore have the same representation in both ASCII and UTF-8. Furthermore, a UTF-8 byte stream will not contain these codes for any other purpose. Other characters will be represented by a sequence of two or more codes in the range 128-255, but so long as the regex engine is 8-bit clean it will be able to match these when they occur literally in the pattern.
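These properties of UTF-8 are easy to check. The sketch below uses Python only to inspect the encoding; the list of special characters is an illustrative selection, not the exact set defined by POSIX.

```python
# An illustrative selection of regex metacharacters (not an exhaustive list).
SPECIALS = ".[]()*+?{}|^$\\"

# Each metacharacter is below 128, so it is a single identical byte in
# both ASCII and UTF-8.
specials_are_ascii = all(ord(c) < 128 for c in SPECIALS)
specials_are_single_bytes = all(len(c.encode("utf-8")) == 1 for c in SPECIALS)

def utf8_bytes(ch: str) -> list:
    """Return the UTF-8 encoding of `ch` as a list of integer byte values."""
    return list(ch.encode("utf-8"))

e_acute = utf8_bytes("é")   # two bytes: 0xC3 0xA9
euro = utf8_bytes("€")      # three bytes: 0xE2 0x82 0xAC

# Every byte of a non-ASCII character is in the range 128-255, so none
# of them can be mistaken for a metacharacter.
high_only = all(b >= 128 for ch in "éñ€" for b in ch.encode("utf-8"))
```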

Where a difference can be seen is that patterns which attempt to count characters in some way will actually be counting bytes. For this reason, codes greater than 127 within a bracket expression (a list of possible characters in square brackets) will not have the intended effect, nor will the wildcard (dot) character if there is any finite limit on the number of repetitions.

The workaround for bracket expressions is to use branches instead (pipe characters), thereby presenting the multi-byte sequences as strings instead of characters. This is what I’ve done in the language definition files that are currently being written.
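The difference between bracket expressions and branches can be reproduced with any byte-oriented matcher. The sketch below uses Python's re module on encoded bytes as a stand-in for the POSIX engine, so it shows the principle rather than the exact behaviour of the C++ library; the helper name is hypothetical.

```python
import re

def matches_bytes(pattern: str, text: str) -> bool:
    """Match on UTF-8 bytes, as an engine that is not Unicode-aware would."""
    return re.fullmatch(pattern.encode("utf-8"), text.encode("utf-8")) is not None

# A bracket expression becomes a set of individual bytes, and can only
# consume one of them, so it cannot match the two-byte character é.
bracket = matches_bytes("[éè]", "é")

# Branches present each multi-byte sequence as a string, which works.
branch = matches_bytes("(é|è)", "é")

# The dot counts bytes: one dot is not enough for é, two dots are.
dot_one = matches_bytes(".", "é")
dot_two = matches_bytes(".{2}", "é")

# With no finite limit on repetition the byte count no longer matters.
dot_star = matches_bytes(".*", "é")
```

The branch form is also forward compatible: the same pattern matched on strings rather than bytes gives the same result.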

I definitely don’t intend for this to be a permanent feature of the system. The replacement may be a regex engine that is Unicode-aware, in which case there are obvious benefits to ensuring forward compatibility (which the above workaround does: it will function correctly whether or not the engine supports Unicode).

Alternatively, I am looking seriously at replacing regular expressions with a system known as two-level morphology. This would be entirely incompatible, but has the advantage of being reversible (and therefore equally applicable to text generation and analysis). My main reservations concern its efficiency and readability, but given the considerable effort that will be needed to create the language definitions, anything that would widen their potential use has to be worth considering.

Chemical Elements

November 15th, 2008

The first group of predicates that I’m going to tackle are the chemical elements. This ought to be a relatively straightforward task: there are a manageable number of them, and they have very well-defined identities. The main issues are:

  1. deciding what the predicates should be called,
  2. defining their precise semantics,
  3. obtaining authoritative translations into various target languages,
  4. establishing what parts of speech those translations correspond to, and
  5. modelling any associated morphological behaviour.

The namespace I’ve chosen is chem:element, so the element with an atomic number of 6 would be chem:element:carbon. I have some reservations about using names with three components rather than two, but they read well and the length is not grossly excessive. So, for relatively low-frequency predicates such as these, I’m going to take the view that having an obvious meaning is more important than brevity.

As a matter of policy I want to use existing naming standards when forming predicates unless there is a very good reason for inventing something different, and for that reason the element names will be those preferred by the International Union of Pure and Applied Chemistry (IUPAC).

(This does not mean that IUPAC names will necessarily be used in the translated output, even if the target language is a dialect of English. If the standard names are wanted then they should be written as literals: use of a predicate implies that translation to the target language is desired.)

An alternative approach would have been to number the elements instead of naming them. This would have reduced the number of required predicates to one, but was rejected on the grounds that:

  1. names make the source text more readable, and
  2. if an element is missing from a language, fallback to the IUPAC name is (IMO) preferable to giving the atomic number.

(The case for numbering would have carried more weight if there had been some prospect of doing the same for other predicate groups, because that could have significantly reduced the size of the predicate dictionary, but most would be difficult or impossible to number - plants and animals, for example.)

The use of one of these predicates will indicate that its argument is composed of the element in question. It will not therefore be necessary to explicitly spell out the is-made-of relationship (just as I don’t intend to make is-coloured relationships explicit when the predicate is a colour). The predicates will be commutative. They will say nothing about the isotopic composition, phase, or other such characteristics of their argument, but they will not be applicable when the element is chemically combined with a different element (sodium chloride is not sodium).

In the (mostly European) languages that I have looked at, elements are normally listed as nouns and sometimes also as adjectives. My expectation was that the nouns would be considered uncountable (in formal writing at least), but many of the dictionaries that I have looked at imply otherwise. I’m investigating whether this is intended, or is merely the result of uncountable nouns not being consistently marked. More about these topics in a future post.

I’m going to pass, for now, on elements with temporary names (ununbium, ununtrium and so on), although it would appear that some languages do have translated forms for them (ununbio, ununtrio in Italian, for example). These probably do want to be numbered, and the ideal would be to handle them systematically, but that will have to wait until more work has been done on (a) processing numbers and (b) constructing compound words.

The existing language definition files, which were used for some of the early experimentation, are going to be deleted to allow for a fresh start. This isn’t a great loss: the number of readings is quite small, and when I do revisit those topics then no doubt much of the content can be resurrected - but at present I don’t have sufficiently well-considered plans for how the predicates or inflectional paradigms should be structured.

Namespaces

November 13th, 2008

It is becoming clear that drawing all predicate names from a single unstructured namespace will require compromises which would harm both readability and writeability.

For example, consider the need to define predicates to represent (a) the chemical element iron, (b) the appliance used to iron clothes, and (c) the type of golf club of the same name. At most one of these can lay claim to the predicate named iron, leaving two possible alternatives for naming the others:

  1. qualify the word ‘iron’ in some way, for example by making one of the predicates golfing-iron;
  2. choose a different word entirely, such as ferrum (or just Fe) for the element.

The problem with synonyms is that they would be difficult to remember, and while abbreviations would work well enough for elements, there are few other categories of predicate to which they would be applicable. For this reason, qualifiers are the favoured option.

For predicates of the same type it is clearly desirable to use the same qualifier - in effect creating a namespace for that class of predicate. Less obvious is whether predicates should be qualified for the purpose of consistency, even if there is no ambiguity to resolve. This would add a significant amount of verbosity, but I’m inclined to think that it is worthwhile: not only would the alternative be inelegant, it would also be very difficult to learn.

(This is not to say that every predicate should be placed in a namespace: for those which are used very frequently the organisational benefits would be low - because there will be fewer such predicates to organise - and the cost of the added verbosity would be high. However my expectation is that the bulk of them would be.)

For the namespace separator I’ve chosen a single colon, on the grounds that it works well visually, and its use for this purpose is already well-established by XML. To avoid ambiguity, where colons were already used in other roles they have been replaced by equals signs. (This inconvenience could have been avoided by using a double-colon, as in C++, but that would have been likely to increase the size of the language definition files by several percent when they are already expected to be very large.)

I have no current plans to allow the equivalent of C++ namespace or using constructs, because I’m not convinced that this would be of sufficient benefit to offset the loss of readability. For this reason, no explicit software support is needed at present beyond allowing the namespace separator within predicate names.
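Allowing the separator within predicate names might look something like the following sketch. The permitted character set (lowercase letters, digits and hyphens) is an assumption made for the example, as is the function name; the actual lexical rules are not specified here.

```python
import re

# Assumed shape: one or more components separated by single colons, each
# starting with a lowercase letter. This is a guess, not the real grammar.
PREDICATE_NAME = re.compile(r"[a-z][a-z0-9-]*(?::[a-z][a-z0-9-]*)*")

def split_predicate(name: str):
    """Split a predicate name into (namespace components, local name)."""
    if PREDICATE_NAME.fullmatch(name) is None:
        raise ValueError(f"not a valid predicate name: {name!r}")
    *namespace, local = name.split(":")
    return namespace, local

carbon = split_predicate("chem:element:carbon")
plain = split_predicate("golfing-iron")
```

A name with no colon, such as golfing-iron, simply has an empty namespace, which fits the idea that very frequently used predicates need not be placed in one.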

Translating Names

November 9th, 2008

Some thoughts about the translation of names - such as those of plant and animal species, geographical locations, and individual people.

Some types of name undergo translation when moved from one language to another:

  • fox in English becomes renard in French and zorro in Spanish.
  • Deutschland in German becomes Allemagne in French and Germany in English.
  • The apostle known as Sanctus Petrus in Latin is called Saint Peter in English, Saint Pierre in French, and Simon Petrus in German.

Other names do not:

  • Linnaean species names, such as Tyrannosaurus rex, are language-independent and should not be changed from their Latin form.
  • The French name Jean-Luc Picard has an anglicised counterpart, John-Luke Pickard, but it would not be correct to use the latter when referring to the well-known Starfleet captain.
  • Zorro should not be translated to Fox or Renard when referring to the swashbuckling alter-ego of Don Diego de la Vega.

(It may be appropriate for names such as these to be transliterated, but that is a less invasive process which represents a change only of writing system, not of language.)

These requirements can be satisfied by defining predicates for names which might need translation, and a mechanism for the literal inclusion of names which don’t. It is then for the author of the text to decide how a particular name should be handled.

Many names are of limited geographical relevance, and therefore only have translations in a small number of languages. Providing readings in every language would be a large and unnecessary burden, so fallbacks should be given in the main predicate dictionary. (However, it is probably best that languages do not rely on those fallbacks if a specific outcome is required, because then there would be no clear distinction between a name that has not been thought about versus a name for which a positive decision has been made.)
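The fallback arrangement could be sketched as a two-step lookup. The predicate name place:germany, the choice of Deutschland as its fallback, the language codes and the table layout are all hypothetical, chosen only to reuse the example names given earlier.

```python
# Hypothetical fallback table, standing in for the main predicate dictionary.
FALLBACKS = {"place:germany": "Deutschland"}

# Hypothetical per-language readings, standing in for language definitions.
READINGS = {
    "fr": {"place:germany": "Allemagne"},
    "en": {"place:germany": "Germany"},
}

def render_name(predicate: str, language: str):
    """Return (text, used_fallback) for a name predicate in a language."""
    reading = READINGS.get(language, {}).get(predicate)
    if reading is not None:
        return reading, False
    # No reading in this language: fall back to the dictionary entry.
    return FALLBACKS[predicate], True

french = render_name("place:germany", "fr")
spanish = render_name("place:germany", "es")
```

Returning the used_fallback flag keeps visible the distinction between a name that a language has deliberately defined and one that has merely not been considered yet.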

Babel Blog

November 8th, 2008

I’ve started this blog as a way to:

  • report progress with the translation system,
  • record, and potentially discuss, some of the more important design decisions, and
  • compensate for the lack of formal documentation at this stage.

Comments are most welcome, either here or on the mailing list.

Currently I’m looking at how to best go about creating language definitions on a large scale, as will be necessary if this project is to amount to anything. Originally there appeared to be merit in concentrating on two or three languages in the first instance, on the grounds that finishing one task is preferable to starting many. However some practical considerations have emerged which favour the development of many languages in parallel.

First and foremost, it is possible to take a more balanced view of what predicates are needed and how they should be defined if the needs of many languages are considered together. In particular, this will help to counter the bias towards English that is likely to result from it being both my first language, and the one from which most predicate names will be drawn.

Secondly, choosing good predicates is a long and slow process which I don’t want to rush, whereas translating those predicates is relatively straightforward. Working on more languages will provide time to think about predicate selection and other common issues.

Finally, a small vocabulary that translates into a large number of languages has the potential to be very useful for applications where that vocabulary is sufficient, whereas a large vocabulary with a small number of languages is of marginal benefit to any application. For these reasons I’m minded to greatly expand the number of languages in progress, up to several dozen if I can assemble enough reliable source material.