Archive for November, 2008

Using UTF-8 with a regex engine that is not Unicode-aware

Monday, November 24th, 2008

The regular expression engine provided by POSIX - which I’m currently using in the C++ library - does not provide any explicit support for Unicode. Fortunately, thanks to the properties of UTF-8, for the most part it doesn’t need to.

All of the characters that have special meanings within a regular expression pattern have Unicode values of less than 128, and therefore have the same representation in both ASCII and UTF-8. Furthermore, a UTF-8 byte stream will not contain these byte values for any other purpose. Other characters will be represented by a sequence of two or more bytes in the range 128-255, but so long as the regex engine is 8-bit clean it will be able to match these when they occur literally in the pattern.
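As a rough illustration of these two properties, the sketch below uses Python's `re` module with byte patterns as a stand-in for a byte-oriented, 8-bit clean engine (it is not the POSIX engine itself, and the helper names are invented for this example). It checks that non-ASCII characters only ever produce bytes of 128 or more, and that multi-byte sequences match when they appear literally in a pattern.

```python
import re

# Characters with special meaning in POSIX extended regular expressions.
METACHARACTERS = ".[]()*+?{}|^$\\"


def non_ascii_bytes_are_high(text: str) -> bool:
    """Return True if every byte of every non-ASCII character is in 128-255."""
    return all(
        byte >= 128
        for ch in text
        if ord(ch) >= 128
        for byte in ch.encode("utf-8")
    )


def byte_search(pattern: str, subject: str) -> bool:
    """Search with a byte-oriented engine: both arguments are UTF-8 encoded first."""
    return re.search(pattern.encode("utf-8"), subject.encode("utf-8")) is not None


# Every metacharacter is plain ASCII, so its byte can never be part of
# a multi-byte sequence.
print(all(ord(c) < 128 for c in METACHARACTERS))   # True
print(non_ascii_bytes_are_high("naïve café 日本語"))  # True
# Literal multi-byte sequences match without any Unicode support.
print(byte_search("café", "un café noir"))          # True
```

Grouping a multi-byte sequence before applying a repetition operator, as in `(é)+t`, keeps the operator from binding to the final byte alone.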

Where a difference can be seen is that patterns which attempt to count characters in some way will actually be counting bytes. For this reason, codes greater than 127 within a bracket expression (a list of possible characters in square brackets) will not have the intended effect, nor will the wildcard (dot) character if there is any finite limit on the number of repetitions.

The workaround for bracket expressions is to use branches instead (pipe characters), thereby presenting the multi-byte sequences as strings instead of characters. This is what I’ve done in the language definition files that are currently being written.
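The difference between the two forms can be sketched as follows, again using Python byte patterns as a stand-in for a non-Unicode-aware engine. The helpers `full_match` and `alternation` are hypothetical names for this example and are not part of the library.

```python
import re

# ASCII characters that need escaping inside a generated pattern.
METACHARACTERS = ".[]()*+?{}|^$\\"


def full_match(pattern: str, subject: str) -> bool:
    """Match the whole subject at byte level, as a byte-oriented engine would."""
    compiled = re.compile(pattern.encode("utf-8"))
    return compiled.fullmatch(subject.encode("utf-8")) is not None


def alternation(chars: str) -> str:
    """Rewrite a set of characters as branches, e.g. 'éü' becomes '(é|ü)'."""
    escaped = ("\\" + c if c in METACHARACTERS else c for c in chars)
    return "(" + "|".join(escaped) + ")"


# A bracket expression becomes a set of single bytes, so it cannot match
# a two-byte character...
print(full_match("[éā]", "é"))        # False
# ...and, when repeated, it accepts byte combinations that were never listed:
# 'é' is C3 A9 and 'ā' is C4 81, so C3 81 ('Á') slips through.
print(full_match("[éā]{2}", "Á"))     # True
# Branches treat each multi-byte sequence as a string.
print(full_match(alternation("éā"), "é"))  # True
print(full_match(alternation("éā"), "Á"))  # False
# The dot counts bytes, not characters.
print(full_match(".", "é"), full_match(".{2}", "é"))  # False True
```

Because the branch form contains only literal sequences, it should give the same result on an engine that does understand Unicode.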

I definitely don’t intend for this to be a permanent feature of the system. The replacement may be a regex engine that is Unicode-aware, in which case there are obvious benefits to ensuring forward compatibility (which the above workaround does: it will function correctly whether or not the engine supports Unicode).

Alternatively, I am looking seriously at replacing regular expressions with a system known as two-level morphology. This would be entirely incompatible, but has the advantage of being reversible (and therefore equally applicable to text generation and analysis). My main reservations concern its efficiency and readability, but given the considerable effort that will be needed to create the language definitions, anything that would widen their potential use has to be worth considering.

Countability

Tuesday, November 18th, 2008

I’m seeing a lot of inconsistency both between and within dictionaries as to whether nouns are countable or uncountable.

The basic problem is that most supposedly uncountable nouns can be used countably if you are willing to inflict sufficient violence upon the language in question. This inevitably results in differences of opinion as to what is acceptable and what is not.

For example, in answer to the question "what atoms are needed to make a molecule of hydrogen peroxide?" you might answer "two hydrogens and two oxygens". Does that make hydrogen and oxygen countable? If you are willing to accept that response as a valid utterance then clearly in some sense they must be, but not to the same extent as words like 'dog' or 'car' which pluralise more readily.

My analysis of this usage is that each of these element names has at least two readings:

  • as an uncountable noun, meaning a quantity of that element, and acceptable in formal or informal contexts;
  • as a countable noun, as shorthand for an atom of the element, and acceptable only in very informal contexts.

The implied unit need not be an atom: if the reference were to "two golds and a silver" then that could very plausibly be counting in units of medals. This is really no different to the process that makes the word 'iron' countable when referring to household appliances or golf clubs, except that you probably won’t find the informal readings in a dictionary.

[Update 2008-12-23: it would appear that Wiktionary, for one, lists Oxygen as countable (plural 'oxygens') with the reading "an atom of the element". Carbon, however, is listed as countable only in the sense of 'carbon paper' or 'carbon copy'.]

Is there a need for the language definition files to include the informal readings? I wouldn’t be against this in principle, provided that the readings were tagged appropriately, but see three objections in practice:

  1. The semantics are highly context-dependent. Counting atoms is one of the more likely possibilities, but in principle it could be almost anything made of, containing, or otherwise related to the word in question. This isn’t something that the current framework can handle well.
  2. Attempting to model usage of this nature would be a major distraction. The required information would be time-consuming to collect and difficult to objectively verify.
  3. None of the software currently envisaged has a need to generate or accept informal text of this nature (and writing software with that ability would probably be a tall order).

For these reasons I’m inclined to exclude marginal readings at present, but with one important concession: the inflectional paradigm should attempt to produce a reasonable plural, even if the grammatical rules say that there isn’t one. This will allow text containing the plural to be correctly parsed by relaxing the grammar. Plurals that are well established, such as golfing irons or pencil leads in English, are obviously acceptable in the language definition.
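A minimal sketch of this concession, under the assumption that a reading carries a countability tag and that the parser has some way of relaxing the grammar. The `Noun` class, the `relaxed` flag and the crude English plural rule are all invented for illustration; they do not reflect the actual language definition format.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Noun:
    lemma: str
    countable: bool

    def plural(self) -> str:
        """Regular English plural, produced even when the noun is uncountable."""
        if self.lemma.endswith(("s", "x", "z", "ch", "sh")):
            return self.lemma + "es"
        return self.lemma + "s"


def accepts(noun: Noun, word: str, relaxed: bool = False) -> bool:
    """Decide whether `word` is an acceptable form of `noun`.

    The plural of an uncountable noun is rejected under the strict grammar
    but accepted when the grammar is relaxed.
    """
    if word == noun.lemma:
        return True
    return word == noun.plural() and (noun.countable or relaxed)


oxygen = Noun("oxygen", countable=False)
iron = Noun("iron", countable=True)   # golfing irons, household irons

print(accepts(oxygen, "oxygens"))                # False
print(accepts(oxygen, "oxygens", relaxed=True))  # True
print(accepts(iron, "irons"))                    # True
```

The point of the design is that the paradigm always has a plural available, so the decision to accept it is a matter of grammar rather than morphology.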

One point I am not certain about is whether countability needs to be indicated in the language definition file at all, or whether it can be inferred from the reading. Put another way, is countability a characteristic of language, or is it a real-world characteristic which languages can be expected to follow in a consistent manner? Not sure, but the safer option will be to place countability tags in the language definition file for now (it is a lot easier to delete tags retrospectively than to add them). I’ll review this when I have more experience handling the issue in different languages.

Chemical Elements

Saturday, November 15th, 2008

The first group of predicates that I’m going to tackle are the chemical elements. This ought to be a relatively straightforward task: there are a manageable number of them, and they have very well-defined identities. The main issues are:

  1. deciding what the predicates should be called,
  2. defining their precise semantics,
  3. obtaining authoritative translations into various target languages,
  4. establishing what parts of speech those translations correspond to, and
  5. modelling any associated morphological behaviour.

The namespace I’ve chosen is chem:element, so the element with an atomic number of 6 would be chem:element:carbon. I have some reservations about using names with three components rather than two, but they read well and the length is not grossly excessive. So, for relatively low-frequency predicates such as these, I’m going to take the view that having an obvious meaning is more important than brevity.

As a matter of policy I want to use existing naming standards when forming predicates unless there is a very good reason for inventing something different, and for that reason the element names will be those preferred by the International Union of Pure and Applied Chemistry (IUPAC).

(This does not mean that IUPAC names will necessarily be used in the translated output, even if the target language is a dialect of English. If the standard names are wanted then they should be written as literals: use of a predicate implies that translation to the target language is desired.)

An alternative approach would have been to number the elements instead of naming them. This would have reduced the number of required predicates to one, but was rejected on the grounds that:

  1. names make the source text more readable, and
  2. if an element is missing from a language, fallback to the IUPAC name is (IMO) preferable to giving the atomic number.

(The case for numbering would have carried more weight if there had been some prospect of doing the same for other predicate groups, because that could have significantly reduced the size of the predicate dictionary, but most would be difficult or impossible to number - plants and animals, for example.)
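The fallback behaviour mentioned in the list above might look something like the following sketch. It assumes that the final component of a predicate name is the IUPAC name, and the `READINGS` table is a made-up stand-in for real language definition files (it deliberately omits tungsten from French to show the fallback).

```python
# Hypothetical per-language readings, keyed by predicate name.
READINGS = {
    "fr": {
        "chem:element:carbon": "carbone",
        "chem:element:iron": "fer",
    },
}


def translate_element(predicate: str, language: str) -> str:
    """Return the reading for `predicate`, falling back to the IUPAC name.

    The IUPAC name is taken to be the last component of the predicate,
    so no separate lookup table is needed for the fallback.
    """
    reading = READINGS.get(language, {}).get(predicate)
    if reading is not None:
        return reading
    return predicate.rsplit(":", 1)[-1]


print(translate_element("chem:element:carbon", "fr"))    # carbone
print(translate_element("chem:element:tungsten", "fr"))  # tungsten
```

Had the elements been numbered instead, the fallback in the second call would have been an atomic number rather than a readable name.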

The use of one of these predicates will indicate that its argument is composed of the element in question. It will not therefore be necessary to explicitly spell out the is-made-of relationship (just as I don’t intend to make is-coloured relationships explicit when the predicate is a colour). The predicates will be commutative. They will say nothing about the isotopic composition, phase, or other such characteristics of their argument, but they will not be applicable when the element is chemically combined with a different element (sodium chloride is not sodium).

In the (mostly European) languages that I have looked at, elements are normally listed as nouns and sometimes also as adjectives. My expectation was that the nouns would be considered uncountable (in formal writing at least), but many of the dictionaries that I have looked at imply otherwise. I’m investigating whether this is intended, or is merely the result of uncountable nouns not being consistently marked. More about these topics in a future post.

I’m going to pass, for now, on elements with temporary names (ununbium, ununtrium and so on), although it would appear that some languages do have translated forms for them (ununbio, ununtrio in Italian, for example). These probably do want to be numbered, and the ideal would be to handle them systematically, but that will have to wait until more work has been done on (a) processing numbers and (b) constructing compound words.

The existing language definition files, which were used for some of the early experimentation, are going to be deleted to allow for a fresh start. This isn’t a great loss: the number of readings is quite small, and when I do revisit those topics then no doubt much of the content can be resurrected - but at present I don’t have sufficiently well-considered plans for how the predicates or inflectional paradigms should be structured.

Namespaces

Thursday, November 13th, 2008

It is becoming clear that drawing all predicate names from a single unstructured namespace will require compromises which would harm both readability and writeability.

For example, consider the need to define predicates to represent (a) the chemical element iron, (b) the appliance used to iron clothes, and (c) the type of golf club of the same name. At most one of these can lay claim to the predicate named iron, leaving two possible alternatives for naming the others:

  1. qualify the word ‘iron’ in some way, for example by making one of the predicates golfing-iron;
  2. choose a different word entirely, such as ferrum (or just Fe) for the element.

The problem with synonyms is that they would be difficult to remember, and while abbreviations would work well enough for elements, there are few other categories of predicate to which they would be applicable. For this reason, qualifiers are the favoured option.

For predicates of the same type it is clearly desirable to use the same qualifier - in effect creating a namespace for that class of predicate. Less obvious is whether predicates should be qualified for the purpose of consistency, even if there is no ambiguity to resolve. This would add a significant amount of verbosity, but I’m inclined to think that it is worthwhile: not only would the alternative be inelegant, it would also be very difficult to learn.

(This is not to say that every predicate should be placed in a namespace: for those which are used very frequently the organisational benefits would be low - because there will be fewer such predicates to organise - and the cost of the added verbosity would be high. However my expectation is that the bulk of them would be.)

For the namespace separator I’ve chosen a single colon, on the grounds that it works well visually, and its use for this purpose is already well-established by XML. To avoid ambiguity, where colons were already used in other roles they have been replaced by equals signs. (This inconvenience could have been avoided by using a double-colon, as in C++, but that would have been likely to increase the size of the language definition files by several percent when they are already expected to be very large.)

I have no current plans to allow the equivalent of C++ namespace or using constructs, because I’m not convinced that this would be of sufficient benefit to offset the loss of readability. For this reason, no explicit software support is needed at present beyond allowing the namespace separator within predicate names.
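Allowing the separator within predicate names could be as simple as the sketch below. The set of characters permitted in a name component (lower-case letters, digits and hyphens) is an assumption made for this example, not a rule taken from the project.

```python
import re

# Assumed syntax: one or more components separated by single colons,
# each starting with a lower-case letter.
PREDICATE_RE = re.compile(r"[a-z][a-z0-9-]*(?::[a-z][a-z0-9-]*)*")


def is_valid_predicate(name: str) -> bool:
    """Accept names such as 'iron', 'golfing-iron' or 'chem:element:carbon'."""
    return PREDICATE_RE.fullmatch(name) is not None


def split_predicate(name: str) -> tuple[str, str]:
    """Split a predicate into (namespace, local name) at the last colon.

    An unqualified predicate has an empty namespace.
    """
    namespace, _, local = name.rpartition(":")
    return namespace, local


print(split_predicate("chem:element:carbon"))  # ('chem:element', 'carbon')
print(split_predicate("iron"))                 # ('', 'iron')
print(is_valid_predicate("chem::carbon"))      # False
```

Since there is no equivalent of a using construct, every predicate is always written with its full name and no lookup context is required.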

Translating Names

Sunday, November 9th, 2008

Some thoughts about the translation of names - such as those of plant and animal species, geographical locations, and individual people.

Some types of name undergo translation when moved from one language to another:

  • fox in English becomes renard in French and zorro in Spanish.
  • Deutschland in German becomes Allemagne in French and Germany in English.
  • The apostle known as Sanctus Petrus in Latin is called Saint Peter in English, Saint Pierre in French, and Simon Petrus in German.

Other names do not:

  • Linnaean species names, such as Tyrannosaurus rex, are language-independent and should not be changed from their Latin form.
  • The French name Jean-Luc Picard has an anglicised counterpart, John-Luke Pickard, but it would not be correct to use the latter when referring to the well-known Starfleet captain.
  • Zorro should not be translated to Fox or Renard when referring to the swashbuckling alter-ego of Don Diego de la Vega.

(It may be appropriate for names such as these to be transliterated, but that is a less invasive process which represents a change only of writing system, not of language.)

These requirements can be satisfied by defining predicates for names which might need translation, and a mechanism for the literal inclusion of names which don’t. It is then for the author of the text to decide how a particular name should be handled.

Many names are of limited geographical relevance, and therefore only have translations in a small number of languages. Providing readings in every language would be a large and unnecessary burden, so fallbacks should be given in the main predicate dictionary. (However, it is probably best that languages do not rely on those fallbacks if a specific outcome is required, because then there would be no clear distinction between a name that has not been thought about versus a name for which a positive decision has been made.)
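One way to picture the combination of predicates, literals and fallbacks is the sketch below. The `geo:country` namespace, the table contents and the `("predicate", ...)` / `("literal", ...)` tagging are all hypothetical and chosen only to make the example self-contained.

```python
# Fallbacks from the main predicate dictionary (hypothetical entries).
FALLBACKS = {"geo:country:germany": "Deutschland"}

# Readings defined explicitly by individual languages (hypothetical entries).
READINGS = {
    "en": {"geo:country:germany": "Germany"},
    "fr": {"geo:country:germany": "Allemagne"},
}


def has_explicit_reading(predicate: str, language: str) -> bool:
    """True if the language has made a positive decision about this name."""
    return predicate in READINGS.get(language, {})


def render_name(item: tuple[str, str], language: str) -> str:
    """Render a name that is either a translatable predicate or a literal."""
    kind, value = item
    if kind == "literal":
        return value  # the author has asked for no translation
    if has_explicit_reading(value, language):
        return READINGS[language][value]
    return FALLBACKS[value]


print(render_name(("predicate", "geo:country:germany"), "fr"))  # Allemagne
print(render_name(("predicate", "geo:country:germany"), "es"))  # Deutschland
print(render_name(("literal", "Zorro"), "fr"))                  # Zorro
```

Keeping `has_explicit_reading` separate from the fallback lookup preserves the distinction between a name that a language has considered and one that it has not.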

Babel Blog

Saturday, November 8th, 2008

I’ve started this blog as a way to:

  • report progress with the translation system,
  • record, and potentially discuss, some of the more important design decisions, and
  • compensate for the lack of formal documentation at this stage.

Comments are most welcome, either here or on the mailing list.

Currently I’m looking at how to best go about creating language definitions on a large scale, as will be necessary if this project is to amount to anything. Originally there appeared to be merit in concentrating on two or three languages in the first instance, on the grounds that finishing one task is preferable to starting many. However some practical considerations have emerged which favour the development of many languages in parallel.

First and foremost, it is possible to take a more balanced view of what predicates are needed and how they should be defined if the needs of many languages are considered together. In particular, this will help to counter the bias towards English that is likely to result from it being both my first language, and the one from which most predicate names will be drawn.

Secondly, choosing good predicates is a long and slow process which I don’t want to rush, whereas translating those predicates is relatively straightforward. Working on more languages will provide time to think about predicate selection and other common issues.

Finally, a small vocabulary that translates into a large number of languages has the potential to be very useful for applications where that vocabulary is sufficient, whereas a large vocabulary with a small number of languages is of marginal benefit to any application. For these reasons I’m minded to greatly expand the number of languages in progress, up to several dozen if I can assemble enough reliable source material.