Archive for November, 2009

Unification

Monday, November 30th, 2009

I’ve begun experimenting with some changes to the internal data structures so that the translation system could support unification. This is a concept borrowed from formal logic which is has become popular as a natural language processing technique. A good introduction can be found in Functional Unification Grammar: A Formalism for Machine Translation by Martin Kay.

The most notable difference this will bring to the translation system is that functional unification grammars are declarative in nature, whereas the languages descriptions I am using currently are (for the most part) procedural. A declarative approach has two important advantages:

  • The language description need not be closely tied to any particular processing task, or to any particular implementation method. This makes it more likely to be reusable for other purposes. For example, a grammar described exclusively in terms of unification would be reversible.
  • Because the whole grammar applies in parallel (as opposed to individual rules being applied in series), the order in which components of the grammar are specified is unimportant.

Against this I expect there will be a price to be paid in terms of efficiency and/or complexity. Declarative languages tend not to be naturally efficient, by which I mean that when implemented simplistically they tend to be very inefficient. It is often possible to claw back some or all of this loss by using more sophisticated algorithms, but the programming effort required to do that can be considerable.

To obtain the full benefit of a unification grammar it would be necessary to eliminate all of the existing rule types, but that isn’t practicable in the short term. For this reason I’m going to provide the means to translate phrases into functional descriptions and back again. This is straightforward enough:

  • If the phrase is atomic then the corresponding token becomes the value of an appropriately-named feature (eg. ‘atom’).
  • If the phrase is composite then the left- and right-hand sides each become features (eg. ‘lhs’ and ‘rhs’).
  • Each tag becomes a feature with a value of ‘true’.

Because functional descriptions can only grow, it may be necessary to use different feature names for the conversions to and from functional descriptions (depending on what is being attempted at the time).

Most likely it will be agreement rules that I attempt to replace first. This will require at least three significant changes to the way the language description is written:

  • In an agreement rule the absence of a tag can be used to indicate with certainty that a particular condition is false, whereas in a unification grammar the absence of a feature means that a value is unknown: to indicate falsehood it is necessary to explicitly specify a value of ‘false’.
  • When propagating attributes such as gender it will not be necessary to provide a separate rule for each of the possible values.
  • Agreement rules match only when they need to change something. A unification grammar must provide rules to match every valid input (otherwise the input would not be allowed), and normally you would want only one rule to match each input.

The changes are strictly experimental for now, and are being developed in a branch so that they can be easily abandoned should that be necessary. The main criteria will be whether they improve readability and maintainability of the language description files, and how much they add to memory and processing requirements. If they are successful then ideally they would replace all of the current phases from part-of-speech discovery through to transformation. Alternatively, they may simply be used to add to the existing rule types available when writing a language description.

Tags in Predicate Definitions

Sunday, November 1st, 2009

There are several stages of the translation process during which tags can be applied, but currently almost all of the raw information on which they are based is provided by the lexicon. Some of that information is language-independent, and I want to move it into the predicate dictionary so that it can be shared between languages. Examples of the type of information I have in mind are the fact that a predicate represents:

  • a colour, or
  • a substance, or
  • an animal.

As I’ve previously indicated, language-independence need not be absolute: if a particular language needs to handle a particular predicate differently then it can be overridden. I wouldn’t want to over-use this facility, because attributes which are substantially language-dependent belong in the lexicon, but it will simplify the handling of edge cases and idiosyncrasies.

The syntax for attaching tags to a predicate is as follows:

predicate foo
{
  tag bar,baz,quux;
};

Multiple tag statements are permitted as an alternative to listing them on one line. Tags are applied as the very first stage of the translation process. I’ve also made some changes to later stages so that tags survive up to and beyond word selection.

What can’t easily be achieved at this point is tagging of composite predicates prior to word selection, so while the system can be told that zoo:genus:vulpes has the characteristic of being animate, it is not able to deduce that (colour:red zoo:genus:vulpes) is equally animate. This is an issue I intend to address soon.