Archive for the ‘Strategy’ Category

The Need for a Bootstrap Lexicon

Saturday, August 7th, 2010

The method I ultimately intend to use for compiling a lexicon is to task a parser with searching for words that must belong to a particular lexical category for the surrounding text to be grammatical. The catch-22 is that this parser will only be able to do its work when the lexical categories for most of the words in a sentence are already known. In other words, in order to compile a lexicon by this method I need to already have a lexicon.

The parser can be run a number of times if necessary, so it need not catch every word at the first attempt. The only requirement is that the parser add enough new words to the lexicon each time it is run to make subsequent runs more successful. However, my expectation is that this will only happen when the lexicon is above a certain critical size: any smaller and the number of sentences successfully parsed will be too small to sustain growth. It follows that an initial lexicon, generated by other means, will be needed to bootstrap the process.
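The feedback loop described here can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the project's parser: the two-pattern toy grammar, the function names `infer_new_words` and `bootstrap`, and the rule that a word is accepted only when it is the single unknown in a sentence are all invented for the example.

```python
from typing import Dict, List, Tuple

# Hypothetical toy grammar: each pattern is a sequence of lexical categories
# that counts as a grammatical sentence.
PATTERNS: List[Tuple[str, ...]] = [
    ("det", "noun", "verb"),
    ("det", "adj", "noun", "verb"),
]


def infer_new_words(sentence: List[str], lexicon: Dict[str, str]) -> Dict[str, str]:
    """Return the category of an unknown word when the context forces it.

    A word is accepted only when it is the single unknown in the sentence and
    exactly one category makes the sentence match a pattern.
    """
    unknown = [i for i, word in enumerate(sentence) if word not in lexicon]
    if len(unknown) != 1:
        return {}
    pos = unknown[0]
    candidates = set()
    for pattern in PATTERNS:
        if len(pattern) != len(sentence):
            continue
        if all(lexicon[w] == pattern[i] for i, w in enumerate(sentence) if i != pos):
            candidates.add(pattern[pos])
    return {sentence[pos]: candidates.pop()} if len(candidates) == 1 else {}


def bootstrap(corpus: List[List[str]], seed: Dict[str, str]) -> Dict[str, str]:
    """Re-run the parser over the corpus until a pass adds no new words."""
    lexicon = dict(seed)
    while True:
        added: Dict[str, str] = {}
        for sentence in corpus:
            for word, category in infer_new_words(sentence, lexicon).items():
                added.setdefault(word, category)
        if not added:
            return lexicon
        lexicon.update(added)
```

With the corpus "the dog barks", "the lazy dog sleeps", "the lazy cat sleeps" and a seed containing "the", "barks" and "lazy", each pass unlocks one more sentence until all three missing words are found. With a seed containing only "the", no sentence has a single unknown and the lexicon never grows, which is the critical-size effect in miniature.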

To generate the bootstrap lexicon I’ve chosen to look for methods with a very low false-positive rate, even if this results in a relatively poor yield. There are two main reasons for this:

  • I can remove false positives from the English lexicon by manually reviewing it, but will not have this luxury for languages with which I am less familiar. In the absence of an army of helpers, I need to develop methods that can deliver useful results without supervision.
  • The corpus I am working with is sufficiently large that yield is not a major concern during the bootstrap phase.

I don’t intend to search for closed-class words automatically, in English or in any other language: these can be entered manually. I will be looking for automatic methods for isolating nouns, verbs, adjectives and adverbs. Of these, I have so far had most success searching for nouns with regular plurals. My next post will describe the method used and the outcome.

Redesigning the Lexicon

Wednesday, June 30th, 2010

The lexicon on the Project Babel website was originally set up as a means for generating language description files. I’ve since concluded that making it the authoritative source for the files isn’t practicable: they need to live in the Subversion repository like everything else. Nor is it particularly well suited for managing readings: my experience is that these are better handled systematically, one topic at a time.

Where I think the lexicon can add value is by providing a dataset against which morphological rules can be evaluated and regression-tested, but some changes are needed if it is to perform this function effectively.

First and foremost the process of populating it needs to become much, much faster. That means automatically and accurately guessing much of the required content, so that my rôle is largely reduced to approving or correcting those guesses. The user interface needs to be designed for doing this in bulk: minimising the number of clicks and page reloads by placing many entries on one page.

Secondly, better coverage of each language is needed. The texts I was using previously (from Project Gutenberg) don’t do this effectively enough. The problem isn’t that the texts are unrepresentative: on the contrary, they are too closely representative of normal writing, with not enough coverage of rare words.

For example, of the chemical elements I found hydrogen, oxygen, aluminium (and aluminum), sulphur (but not sulfur), argon, iron, nickel, copper, silver, tin, gold, mercury and lead. (Some of these have multiple meanings, but that doesn’t matter provided the words in question enter the database). Including some texts about chemistry would improve matters, but even then it would take a huge amount of raw material to complete the list.

Importing a copy of the periodic table would clearly solve this particular problem, but I’m not convinced that including tabular data in the corpus is a good idea: there would be little or no opportunity to deduce parts of speech automatically, and I’m concerned that it would add significantly to the number of non-words in the database. In any event, it is not a general solution because most of the words that need to be covered won’t appear in neat tables.

A type of prose that could provide much better coverage is an encyclopedia, and the obvious candidate is Wikipedia. There was a good case for using this anyway because of its sheer size, and also the number of languages in which an edition is published. I did have some concerns that it would be statistically unrepresentative, but it is now clear that a representative sample is not what is needed.

(Of course I could obtain much of the information needed from Wiktionary, but my preference is to keep this and other dictionaries in reserve as a means for checking data that I have generated independently. This, I hope, will provide more opportunity for detecting errors.)

The third change is to make the language description files an input to the lexicon rather than an output. There then needs to be a means to compare generated word forms with ones that have been reviewed and approved.

Progress towards Unification

Thursday, December 31st, 2009

I’ve now created data structures to represent features, conjuncts and disjuncts, and can perform unification in a few simple cases — but not yet generally in the presence of disjuncts. The latter is hard to do efficiently, so in the first instance I plan to use an inefficient but simple algorithm (probably expanding both inputs to disjunctive normal form prior to unification).
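To make the "inefficient but simple" approach concrete, here is a minimal Python sketch of unification by expansion to disjunctive normal form. It is an illustration only, not the translation system's data structures: conjuncts are modelled as plain dicts, disjuncts as a hypothetical `Or` class, atomic values as strings, and `None` is reserved to signal failure.

```python
from itertools import product
from typing import Any, List, Optional


class Or:
    """A disjunct: any one of the alternatives may hold."""

    def __init__(self, *alternatives: Any) -> None:
        self.alternatives = alternatives


def to_dnf(desc: Any) -> List[Any]:
    """Expand a description into a list of disjunct-free alternatives."""
    if isinstance(desc, Or):
        return [d for alt in desc.alternatives for d in to_dnf(alt)]
    if isinstance(desc, dict):
        names = list(desc)
        expanded = [to_dnf(desc[name]) for name in names]
        return [dict(zip(names, combo)) for combo in product(*expanded)]
    return [desc]  # atomic value


def unify_simple(a: Any, b: Any) -> Optional[Any]:
    """Unify two disjunct-free descriptions; return None on failure."""
    if isinstance(a, dict) and isinstance(b, dict):
        result = dict(a)
        for name, value in b.items():
            if name in result:
                merged = unify_simple(result[name], value)
                if merged is None:
                    return None
                result[name] = merged
            else:
                result[name] = value
        return result
    return a if a == b else None


def unify(a: Any, b: Any) -> List[Any]:
    """Unify two descriptions, returning every alternative that succeeds."""
    results = []
    for x in to_dnf(a):
        for y in to_dnf(b):
            merged = unify_simple(x, y)
            if merged is not None:
                results.append(merged)
    return results
```

Unifying `{"cat": "noun", "number": Or("sg", "pl")}` with `{"number": "pl"}` leaves the single alternative `{"cat": "noun", "number": "pl"}`. The cost is visible in `to_dnf`: a description with n independent two-way disjuncts expands to 2^n alternatives, which is why this can only be a first step.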

Also not implemented yet are paths, constituent sets and patterns. An open question is whether to implement paths explicitly, or instead use named variables to tie branches of the tree together. So far as I can see, these options provide essentially the same functionality; however, I suspect that variables may be more convenient to implement given that the data structures I am using only allow you to move down the tree, not upwards.

I’m now confident that unification can provide all of the current functionality of the translation system except (perhaps) for morphological operations and left-right agreement. Furthermore, it would do so in a significantly more flexible and elegant manner than the ad-hoc mechanisms that exist at present. My main concern remains that of efficiency. I’m also uncertain as to how best to support dialects.

One mechanism which could be eliminated is the use of namespaces. For example, the predicate zoo:species:tyrannosaurus:rex could be replaced by:

[type=animal,rank=species,genus=tyrannosaurus,species=rex]

This is more verbose, but self-describing (a bit like the difference between X.500 and the DNS) and able to be processed in ways that an opaque predicate name cannot.
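One example of processing that the feature form allows is matching on part of a description. The sketch below is an assumption about how that might look in Python; the `matches` helper and the dict representation are invented for illustration and are not part of the translation system.

```python
from typing import Dict

Features = Dict[str, str]

# The feature description from the text, written as a Python dict.
T_REX: Features = {
    "type": "animal",
    "rank": "species",
    "genus": "tyrannosaurus",
    "species": "rex",
}


def matches(query: Features, description: Features) -> bool:
    """True if every feature in the query appears in the description with the same value."""
    return all(description.get(name) == value for name, value in query.items())
```

A query such as `matches({"genus": "tyrannosaurus"}, T_REX)` succeeds without any knowledge of how the name is laid out, whereas the namespaced string would have to be split and interpreted by position.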

Unification

Monday, November 30th, 2009

I’ve begun experimenting with some changes to the internal data structures so that the translation system could support unification. This is a concept borrowed from formal logic which has become popular as a natural language processing technique. A good introduction can be found in Functional Unification Grammar: A Formalism for Machine Translation by Martin Kay.

The most notable difference this will bring to the translation system is that functional unification grammars are declarative in nature, whereas the language descriptions I am using currently are (for the most part) procedural. A declarative approach has two important advantages:

  • The language description need not be closely tied to any particular processing task, or to any particular implementation method. This makes it more likely to be reusable for other purposes. For example, a grammar described exclusively in terms of unification would be reversible.
  • Because the whole grammar applies in parallel (as opposed to individual rules being applied in series), the order in which components of the grammar are specified is unimportant.

Against this I expect there will be a price to be paid in terms of efficiency and/or complexity. Declarative languages tend not to be naturally efficient, by which I mean that when implemented simplistically they tend to be very inefficient. It is often possible to claw back some or all of this loss by using more sophisticated algorithms, but the programming effort required to do that can be considerable.

To obtain the full benefit of a unification grammar it would be necessary to eliminate all of the existing rule types, but that isn’t practicable in the short term. For this reason I’m going to provide the means to translate phrases into functional descriptions and back again. This is straightforward enough:

  • If the phrase is atomic then the corresponding token becomes the value of an appropriately-named feature (eg. ‘atom’).
  • If the phrase is composite then the left- and right-hand sides each become features (eg. ‘lhs’ and ‘rhs’).
  • Each tag becomes a feature with a value of ‘true’.
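The three conversion rules can be sketched as a pair of Python functions. The `Phrase` class here is a hypothetical stand-in for the system's internal phrase type, Python's `True` stands in for the value ‘true’, and the sketch assumes that no tag is itself called ‘atom’, ‘lhs’ or ‘rhs’.

```python
from dataclasses import dataclass
from typing import Any, Dict, FrozenSet, Optional


@dataclass(frozen=True)
class Phrase:
    """Hypothetical phrase: atomic (has a token) or composite (has lhs and rhs)."""

    token: Optional[str] = None
    lhs: Optional["Phrase"] = None
    rhs: Optional["Phrase"] = None
    tags: FrozenSet[str] = frozenset()


def to_description(phrase: Phrase) -> Dict[str, Any]:
    """Convert a phrase into a functional description (nested dicts)."""
    desc: Dict[str, Any] = {tag: True for tag in phrase.tags}
    if phrase.token is not None:
        desc["atom"] = phrase.token
    else:
        desc["lhs"] = to_description(phrase.lhs)
        desc["rhs"] = to_description(phrase.rhs)
    return desc


def from_description(desc: Dict[str, Any]) -> Phrase:
    """Rebuild a phrase; every feature whose value is True becomes a tag."""
    tags = frozenset(name for name, value in desc.items() if value is True)
    if "atom" in desc:
        return Phrase(token=desc["atom"], tags=tags)
    return Phrase(
        lhs=from_description(desc["lhs"]),
        rhs=from_description(desc["rhs"]),
        tags=tags,
    )
```

In this form the round trip is lossless: converting a phrase to a description and back yields an equal phrase, provided nothing has been unified into the description in between.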

Because functional descriptions can only grow, it may be necessary to use different feature names for the conversions to and from functional descriptions (depending on what is being attempted at the time).

Most likely it will be agreement rules that I attempt to replace first. This will require at least three significant changes to the way the language description is written:

  • In an agreement rule the absence of a tag can be used to indicate with certainty that a particular condition is false, whereas in a unification grammar the absence of a feature means that a value is unknown: to indicate falsehood it is necessary to explicitly specify a value of ‘false’.
  • When propagating attributes such as gender it will not be necessary to provide a separate rule for each of the possible values.
  • Agreement rules match only when they need to change something. A unification grammar must provide rules to match every valid input (otherwise the input would not be allowed), and normally you would want only one rule to match each input.
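The first two points can be illustrated with a small Python sketch. It assumes flat feature sets held in dicts; the helper names `unify_flat` and `agree`, and the French words used to exercise them, are chosen for the example and do not come from the language descriptions.

```python
from typing import Dict, Optional, Sequence, Tuple

Features = Dict[str, object]


def unify_flat(a: Features, b: Features) -> Optional[Features]:
    """Merge two flat feature sets; return None if any shared feature conflicts."""
    result = dict(a)
    for name, value in b.items():
        if name in result and result[name] != value:
            return None
        result[name] = value
    return result


def agree(
    left: Features,
    right: Features,
    names: Sequence[str] = ("gender", "number"),
) -> Optional[Tuple[Features, Features]]:
    """Make two constituents share the named features, whatever their values.

    A single rule covers every gender and number: values known on one side
    are copied to the other, and a conflict makes the rule fail.
    """
    shared = unify_flat(
        {n: left[n] for n in names if n in left},
        {n: right[n] for n in names if n in right},
    )
    if shared is None:
        return None
    return {**left, **shared}, {**right, **shared}
```

Here `unify_flat({"plural": False}, {})` succeeds because an absent feature is merely unknown, while `unify_flat({"plural": False}, {"plural": True})` fails. Likewise `agree({"atom": "l'"}, {"atom": "école", "gender": "f"})` passes the gender to the article, and pairing a masculine "le" with the feminine "maison" is rejected.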

The changes are strictly experimental for now, and are being developed in a branch so that they can be easily abandoned should that be necessary. The main criteria will be whether they improve readability and maintainability of the language description files, and how much they add to memory and processing requirements. If they are successful then ideally they would replace all of the current phases from part-of-speech discovery through to transformation. Alternatively, they may simply be used to add to the existing rule types available when writing a language description.

Dialects

Wednesday, October 14th, 2009

Currently the translation system supports only one dialect of any given language, so English means British English and Portuguese means European Portuguese. The United States and Brazilian dialects of these languages are sufficiently different (and popular) that they certainly ought to be supported too, but similar enough that writing entirely separate language descriptions would result in an undesirable amount of duplication.

What is needed is a mechanism which allows language description files to share common data. This would bring two benefits:

  • a reduction in the amount of memory and disc space consumed;
  • automatic propagation of any corrections or improvements to a language to its dialects.

There are two ways in which sharing could be achieved:

  • by merging related dialects into a single language description file, then switching sections of that file in and out using some form of conditional notation; or
  • by requiring a separate language description file for each dialect, but allowing inheritance relationships between dialects such that only the differences need be specified.

Drawbacks of the first method are reduced modularity and readability. All dialects of a language would have to be loaded together, even if only one were needed. The language descriptions are likely to be quite complex enough handling one dialect, and if anything I would prefer to be looking at ways to break them down into smaller units rather than making them larger.

The main drawback of the second method is that it scales very poorly if the language can vary in several dimensions independently. For example, in Celtic languages the use of decimal versus vigesimal numbers is only loosely correlated with dialect and to a large extent is a matter of personal choice. You could write two language description files for each language, one for decimal and one for vigesimal, but then what happens when another issue is found where a similar choice is needed?

For these reasons I don’t think that either method provides a complete solution, so am inclined to implement both. This is not an unreasonable extravagance: many programming languages provide comparable facilities (such as #ifdef and #include in the C preprocessor).

Broadly speaking my intention is to use inheritance for regional dialects, and conditional rules for preferences which cut across those dialects. Inheritance is in the process of being implemented, and I will describe the syntax shortly. Conditional rules I don’t have a clear strategy for yet, but they are a less urgent requirement.
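Purely as an illustration of how the two mechanisms could sit side by side (this is not the syntax mentioned above, which has yet to be described), here is a Python sketch. The word lists, the preference names and the `select` helper are all assumptions made for the example.

```python
from collections import ChainMap
from typing import Dict, FrozenSet, List, Mapping, Tuple

# Inheritance: a dialect stores only its differences from the parent.
BRITISH: Dict[str, str] = {"colour": "colour", "lorry": "lorry", "cat": "cat"}
AMERICAN: Mapping[str, str] = ChainMap({"colour": "color", "lorry": "truck"}, BRITISH)

# Conditional rules: (required preference, predicate, reading).
WELSH_RULES: List[Tuple[str, str, str]] = [
    ("decimal", "forty", "pedwar deg"),
    ("vigesimal", "forty", "deugain"),
]


def select(rules: List[Tuple[str, str, str]], preferences: FrozenSet[str]) -> Dict[str, str]:
    """Keep only the readings whose condition is among the chosen preferences."""
    return {pred: reading for cond, pred, reading in rules if cond in preferences}
```

Looking up "cat" in `AMERICAN` falls through to the British entry, so a correction made there reaches both dialects, while "colour" resolves to the American override. The numeral choice is independent of either: `select(WELSH_RULES, frozenset({"vigesimal"}))` picks the vigesimal reading without needing a separate description file.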

Merging the predicate dictionary into the language description

Wednesday, September 30th, 2009

I’m going to eliminate any formal distinction between the predicate dictionary and the language description. Instead there will be only one type of file, which can hold any of the currently supported types of declaration (predicate, morpheme, reading and so on).

Of course, I certainly don’t want to copy and paste the same set of predicate declarations into many separate files, but that won’t be necessary. There will need to be a way for one language to inherit declarations from another language in order to efficiently support dialects. The same mechanism, once it has been implemented, can be used across all languages to share a common set of predicate declarations.

One important difference between this and the current arrangement is that it will provide a basis for predicate declarations to be overridden. This will allow information to be associated with a predicate even if it is not strictly language-independent.

For example, Polish treats nouns differently according to whether they are animate or inanimate. For the most part animacy is defined as you would expect it to be, but there are marginal cases which are decided by convention (plants are generally inanimate, but viruses, bacteria and fungi are animate), and more than a few outright exceptions (units of currency, such as the złoty, are animate). To the extent that the classification is based on objective criteria it can and should be shared between languages, but exceptions rightly belong within the relevant language description.
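A sketch of how such an override might behave, written in Python rather than the system's C++ and using invented names throughout (`Description`, `declare`, `lookup`, and the predicates `dog` and `zloty`):

```python
from typing import Dict, Optional


class Description:
    """Hypothetical language description that can inherit from a parent."""

    def __init__(self, parent: Optional["Description"] = None) -> None:
        self.parent = parent
        self.predicates: Dict[str, Dict[str, object]] = {}

    def declare(self, name: str, **attributes: object) -> None:
        """Declare a predicate, or add attributes to an existing declaration."""
        self.predicates.setdefault(name, {}).update(attributes)

    def lookup(self, name: str) -> Dict[str, object]:
        """Return inherited attributes with local ones layered on top."""
        inherited = self.parent.lookup(name) if self.parent else {}
        return {**inherited, **self.predicates.get(name, {})}


# Shared declarations follow the objective criteria...
common = Description()
common.declare("dog", animate=True)
common.declare("zloty", animate=False, kind="currency")

# ...and the Polish description records only its exception.
polish = Description(parent=common)
polish.declare("zloty", animate=True)
```

Because the override is applied per attribute, `polish.lookup("zloty")` still inherits `kind` from the shared declaration while reporting the currency as animate, and the shared declaration itself is left untouched for other languages.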

Implementing this capability will not be a great burden. Arguably it simplifies the translation system slightly, and it avoids the annoyance (within the internal C++ API) of having to explicitly instantiate the predicate dictionary and provide a reference to it when constructing a language object.

Translating Names

Sunday, November 9th, 2008

Some thoughts about the translation of names - such as those of plant and animal species, geographical locations, and individual people.

Some types of name undergo translation when moved from one language to another:

  • fox in English becomes renard in French and zorro in Spanish.
  • Deutschland in German becomes Allemagne in French and Germany in English.
  • The apostle known as Sanctus Petrus in Latin is called Saint Peter in English, Saint Pierre in French, and Simon Petrus in German.

Other names do not:

  • Linnaean species names, such as Tyrannosaurus rex, are language-independent and should not be changed from their Latin form.
  • The French name Jean-Luc Picard has an anglicised counterpart, John-Luke Pickard, but it would not be correct to use the latter when referring to the well-known Starfleet captain.
  • Zorro should not be translated to Fox or Renard when referring to the swashbuckling alter-ego of Don Diego de la Vega.

(It may be appropriate for names such as these to be transliterated, but that is a less invasive process which represents a change only of writing system, not of language.)

These requirements can be satisfied by defining predicates for names which might need translation, and a mechanism for the literal inclusion of names which don’t. It is then for the author of the text to decide how a particular name should be handled.

Many names are of limited geographical relevance, and therefore only have translations in a small number of languages. Providing readings in every language would be a large and unnecessary burden, so fallbacks should be given in the main predicate dictionary. (However, it is probably best that languages do not rely on those fallbacks if a specific outcome is required, because then there would be no clear distinction between a name that has not been thought about versus a name for which a positive decision has been made.)
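The three ways a name can be resolved (literal inclusion, a language-specific reading, or a fallback) might be sketched as follows. The tables, the `render` function and the choice of "Deutschland" as a fallback are assumptions made for the example; recording where each result came from is one possible way to keep deliberate decisions distinguishable from fallbacks.

```python
from typing import Dict, NamedTuple


class Rendered(NamedTuple):
    text: str
    source: str  # "literal", "reading" or "fallback"


# Hypothetical data: fallbacks from the main dictionary, readings per language.
FALLBACKS: Dict[str, str] = {"germany": "Deutschland"}
READINGS: Dict[str, Dict[str, str]] = {
    "en": {"germany": "Germany", "fox": "fox"},
    "fr": {"germany": "Allemagne", "fox": "renard"},
    "es": {"fox": "zorro"},
}


def render(name: str, language: str, literal: bool = False) -> Rendered:
    """Resolve a name, recording whether a reading or a fallback supplied it."""
    if literal:
        return Rendered(name, "literal")  # the author chose not to translate
    reading = READINGS.get(language, {}).get(name)
    if reading is not None:
        return Rendered(reading, "reading")
    return Rendered(FALLBACKS[name], "fallback")
```

So `render("fox", "es")` gives the translated common noun, `render("Zorro", "fr", literal=True)` leaves the swashbuckler's name alone, and `render("germany", "es")` is marked as a fallback because the toy Spanish table has no entry for it.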

Babel Blog

Saturday, November 8th, 2008

I’ve started this blog as a way to:

  • report progress with the translation system,
  • record, and potentially discuss, some of the more important design decisions, and
  • compensate for the lack of formal documentation at this stage.

Comments are most welcome, either here or on the mailing list.

Currently I’m looking at how best to go about creating language definitions on a large scale, as will be necessary if this project is to amount to anything. Originally there appeared to be merit in concentrating on two or three languages in the first instance, on the grounds that finishing one task is preferable to starting many. However, some practical considerations have emerged which favour the development of many languages in parallel.

First and foremost, it is possible to take a more balanced view of what predicates are needed and how they should be defined if the needs of many languages are considered together. In particular, this will help to counter the bias towards English that is likely to result from it being both my first language, and the one from which most predicate names will be drawn.

Secondly, choosing good predicates is a long and slow process which I don’t want to rush, whereas translating those predicates is relatively straightforward. Working on more languages will provide time to think about predicate selection and other common issues.

Finally, a small vocabulary that translates into a large number of languages has the potential to be very useful for applications where that vocabulary is sufficient, whereas a large vocabulary with a small number of languages is of marginal benefit to any application. For these reasons I’m minded to greatly expand the number of languages in progress, up to several dozen if I can assemble enough reliable source material.