Progress towards Unification

December 31st, 2009

I’ve now created data structures to represent features, conjuncts and disjuncts, and can perform unification in a few simple cases — but not yet generally in the presence of disjuncts. The latter is hard to do efficiently, so in the first instance I plan to use an inefficient but simple algorithm (probably expanding both inputs to disjunctive normal form prior to unification).

Also not implemented yet are paths, constituent sets and patterns. An open question is whether to implement paths explicitly, or instead use named variables to tie branches of the tree together. So far as I can see, these options provide essentially the same functionality, however I suspect that variables may be more convenient to implement given that the data structures I am using only allow you to move down the tree, not upwards.

I’m now confident that unification can provide all of the current functionality of the translation system except (perhaps) for morphological operations and left-right agreement. Furthermore, it would do so in a significantly more flexible and elegant manner than the ad-hoc mechanisms that exist at present. My main concern remains that of efficiency. I’m also uncertain as to how best to support dialects.

One mechanism which could be eliminated is the use of namespaces. For example, the predicate zoo:species:tyrannosaurus:rex could be replaced by:

[type=animal,rank=species,genus=tyrannosaurus,species=rex]

This is more verbose, but self-describing (a bit like the difference between X.500 and the DNS) and able to be processed in ways that an opaque predicate name cannot.

Unification

November 30th, 2009

I’ve begun experimenting with some changes to the internal data structures so that the translation system could support unification. This is a concept borrowed from formal logic which is has become popular as a natural language processing technique. A good introduction can be found in Functional Unification Grammar: A Formalism for Machine Translation by Martin Kay.

The most notable difference this will bring to the translation system is that functional unification grammars are declarative in nature, whereas the languages descriptions I am using currently are (for the most part) procedural. A declarative approach has two important advantages:

  • The language description need not be closely tied to any particular processing task, or to any particular implementation method. This makes it more likely to be reusable for other purposes. For example, a grammar described exclusively in terms of unification would be reversible.
  • Because the whole grammar applies in parallel (as opposed to individual rules being applied in series), the order in which components of the grammar are specified is unimportant.

Against this I expect there will be a price to be paid in terms of efficiency and/or complexity. Declarative languages tend not to be naturally efficient, by which I mean that when implemented simplistically they tend to be very inefficient. It is often possible to claw back some or all of this loss by using more sophisticated algorithms, but the programming effort required to do that can be considerable.

To obtain the full benefit of a unification grammar it would be necessary to eliminate all of the existing rule types, but that isn’t practicable in the short term. For this reason I’m going to provide the means to translate phrases into functional descriptions and back again. This is straightforward enough:

  • If the phrase is atomic then the corresponding token becomes the value of an appropriately-named feature (eg. ‘atom’).
  • If the phrase is composite then the left- and right-hand sides each become features (eg. ‘lhs’ and ‘rhs’).
  • Each tag becomes a feature with a value of ‘true’.

Because functional descriptions can only grow, it may be necessary to use different feature names for the conversions to and from functional descriptions (depending on what is being attempted at the time).

Most likely it will be agreement rules that I attempt to replace first. This will require at least three significant changes to the way the language description is written:

  • In an agreement rule the absence of a tag can be used to indicate with certainty that a particular condition is false, whereas in a unification grammar the absence of a feature means that a value is unknown: to indicate falsehood it is necessary to explicitly specify a value of ‘false’.
  • When propagating attributes such as gender it will not be necessary to provide a separate rule for each of the possible values.
  • Agreement rules match only when they need to change something. A unification grammar must provide rules to match every valid input (otherwise the input would not be allowed), and normally you would want only one rule to match each input.

The changes are strictly experimental for now, and are being developed in a branch so that they can be easily abandoned should that be necessary. The main criteria will be whether they improve readability and maintainability of the language description files, and how much they add to memory and processing requirements. If they are successful then ideally they would replace all of the current phases from part-of-speech discovery through to transformation. Alternatively, they may simply be used to add to the existing rule types available when writing a language description.

Tags in Predicate Definitions

November 1st, 2009

There are several stages of the translation process during which tags can be applied, but currently almost all of the raw information on which they are based is provided by the lexicon. Some of that information is language-independent, and I want to move it into the predicate dictionary so that it can be shared between languages. Examples of the type of information I have in mind are the fact that a predicate represents:

  • a colour, or
  • a substance, or
  • an animal.

As I’ve previously indicated, language-independence need not be absolute: if a particular language needs to handle a particular predicate differently then it can be overridden. I wouldn’t want to over-use this facility, because attributes which are substantially language-dependent belong in the lexicon, but it will simplify the handling of edge cases and idiosyncrasies.

The syntax for attaching tags to a predicate is as follows:

predicate foo
{
  tag bar,baz,quux;
};

Multiple tag statements are permitted as an alternative to listing them on one line. Tags are applied as the very first stage of the translation process. I’ve also made some changes to later stages so that tags survive up to and beyond word selection.

What can’t easily be achieved at this point is tagging of composite predicates prior to word selection, so while the system can be told that zoo:genus:vulpes has the characteristic of being animate, it is not able to deduce that (colour:red zoo:genus:vulpes) is equally animate. This is an issue I intend to address soon.

Fallbacks

October 22nd, 2009

One of the main challenges that the translation system must address is how to behave when there is no word in the target language to express one of the concepts requested. I intend to address this by providing fallback translations which are available across all languages. There will be two types of fallback:

  • Fallbacks to a description. For example, an ‘aeroplane’ might be described as a ‘flying machine’, or a ‘desk’ as a ‘writing table’. These are not perfect replacements, but in a language with limited vocabulary there may be no better way to convey the required meaning.
  • Fallbacks to a reading. Sometimes it is better to borrow a foreign word than to attempt any form of translation. Good examples of this would be animal names such as ‘kangaroo’ or ‘meerkat’ (which made their way into English through this mechanism). Any useful description would be unreasonably long, and falling back to ‘marsupial’ or ‘mongoose’ is unlikely to be helpful.

It is also worth mentioning one other type of construct which, though not strictly a fallback, has similar behaviour in that it can be translated without the need for an explicit reading:

  • Aliases. These are predicates which can be exactly expressed in terms of other, more primitive predicates. The alias exists only as a convenience for those writing source texts and language descriptions. Any analysis occurs after decomposition into primitives.

The natural place for descriptive fallbacks to be specified is in the corresponding predicate definition. (One consequence of this is that descriptive fallbacks will only be able to replace atomic predicates, not compounds. I think this is a reasonable restriction.) I’ve chosen the following syntax:

predicate vehicle:aeroplane
{
  fallback ((for-purpose-of flying) machine);
};

The intended behaviour is straightforward: if no reading is found then the predicate is replaced by the fallback expression and an attempt made to translate that. I would like to allow multiple fallbacks (to be tried sequentially until one translates successfully) provided that this is not overly difficult to implement.

Originally I had intended to make fallback readings part of the predicate definition too, but eventually decided that there is no need: ordinary readings already provide all of the functionality that is needed:

reading meerkat[noun] = zoo:species:suricata:suricatta;

Should an attempt be made to classify fallback readings into parts of speech (as in the example above), or should they simply be tagged as ‘foreign words’ which are somehow outside the grammar of the target language? My expectation is that most fallback readings will be noun-like, but if there are differences then it must surely be useful for the target language to know about them (and if there are none then ‘foreign word’ is simply an alias for noun).

This will make it necessary for parts of speech used in fallback readings to be standardised across languages, so French would use tags such as ‘noun’ and ‘adjective’ as opposed to ’substantif’ and ‘adjectif’. Fortunately I’ve being doing this anyway (albeit largely for my own convenience rather than as part of any grand plan). The same will apply to other types of tag such as ’singular’ and ‘plural’.

Should there be any attempt to create default inflection rules for fallback readings? I think probably yes. When words are borrowed from one language to another they often retain their original morphology in the first instance. At the very least it can’t do any harm to know which language the fallback was taken from. If a language wants to follow its own rules regardless of this information then it can override the default.

I’m aware that different languages have different parts of speech, and that while most languages have words which pass for nouns, adjectives, verbs and adverbs, that does not mean they have the same semantics or behaviour as English nouns, adjectives, verbs and adverbs. I also appreciate that inflecting completely alien words could prove to be difficult in some languages (although knowing that they are alien should help). However this is about producing something when the alternative is to provide nothing, so perfection is not a requirement.

Should languages omit readings if there is a suitable fallback reading? That’s a tricky question. On the one hand, to say no results in a large amount of duplication within the language description files. This is surely undesirable. On the other hand I do think that lists of differences will be more difficult to write, check and maintain, and that automatic propagation of changes is not desirable in the way that it is between dialects.

A good example of this is chemical elements in Danish. There is very nearly a 50-50 split between native and international names, so any saving would not be of the same order as (for example) British versus US English where there are only three differences. If a full list is given then it is possible to check that every element has been considered, whereas a partial list cannot be checked for completeness without redoing much of the research used to compile it in the first place.

For these reasons I’m minded to stick with full lists for the time being. However when compilation is introduced it would certainly be possible for any redundant readings to be automatically removed by the compiler, and I see no harm in that. Finally, I would only intend to duplicate readings for words which have been or are being assimilated into the language in question. Where a language has no word for a concept, the fallback will not be duplicated.

Dialects (Inheritance)

October 17th, 2009

There are several decisions to be made regarding implementation of the inheritance mechanism. First and most obviously, there needs to be some form of command to indicate that one language is derived from another. I considered two alternatives for the syntax:

  • a Java-style extends keyword, which would form part of a header at the start of the language description, or
  • a Perl-style use statement, which would be placed in the body of the language description.

The problem with a use statement is that the natural expectation would be for it to be allowed anywhere in the language description: later declarations would override, earlier ones would be overridden. This is easy enough to implement: create a master-index (for each declaration type) listing all of the declarations that have not been overridden. The only snag is that I might not want to generate such an index (if optimising for start-up time and/or memory usage), in which case implementation becomes much more fiddly. I could live with this if there was an compelling need for mid-file inclusion, but I’m not convinced there is one.

That leaves the extends keyword. This already exists and is used for specifying inheritance relationships between morphological paradigms, so using it between languages would be a natural extension. It forces the point of inclusion to be at the start of the language description, so avoids the implementation issue described above. The specific syntax I have in mind is of the form:

language "en_US" extends "en";

Should multiple inheritance be supported? I don’t see any reason why not, but I can’t say I have any particular use in mind for it either. On that basis I’ve decided to allow only single inheritance in the first instance, but keep open the option of adding multiple inheritance later if needed.

Having established a hierarchy, it is then necessary to decide in detail how parent and child languages interact with each other. The general rules are that:

  • Everything defined in the parent is imported into the child.
  • Where search order is important (as is the case when objects are referred to by name), the child is searched first.

The specific effect on each type of object is currently as follows:

  • Tags could be made to override earlier definitions, but I can’t think of any good reason why you would want to do this: the effect on the parent language would likely be both profound undesirable. I’ve therefore forbidden tag definitions which conflict with an existing name.
  • Predicate definitions override any previous definition of the same name. This is potentially quite awkward to implement, because they are indexed both by name and by meaning. If the meaning of a predicate is overridden then searches of the latter index do not necessarily yield the correct result. Unlike tags, predicates definitely need to be able to override each other, but their meanings probably don’t. I’ve therefore forbidden predicate definitions which alter the meaning of an existing predicate, but allowed the fallback and any other attributes to be changed.
  • Paradigms and morpheme definitions override any previous definition of the same name. These are indexed only by name, so implementation is straightforward.
  • Grammatical productions are effectively cumulative. The dialect is searched before the parent language, but since there is currently no way to indicate that a production does not occur, it is not possible to override in any meaningful way.
  • Decomposition rules are searched in the same order as productions, but only the first match is applied. This makes overriding is possible in most cases, but because decompositions are applied recursively it requires significant trickery to turn one off entirely. (Replacing the pattern by itself won’t work.)
  • Agreement rules are searched in the same order again, but all matching rules are applied. This makes overriding difficult or impossible.
  • Transformations behave in a similar way to decomposition rules, but since they are not applied recursively to any single point in the text they are somewhat easier to override.
  • Readings don’t override each other within a given language description file, but they do override any readings with the same meaning in any parent language. (Coincidently, this is very similar to how inheritance and overloading interact with each other in C++.) There is some potential for unexpected behaviour, but I think the alternatives would be worse.

Of these I would consider the behaviour of tags, predicates and readings to be satisfactory, but intend to look further at decomposition rules, productions, agreement rules and transformations. I think it is clear that inheritance works best with objects that have names. I’m reluctant to give every rule a name, but grouping them together into named ‘translation phases’ would probably suffice (and bring other benefits too).

Finally, a word about when inheritance should be used. The answer is: whenever it makes good engineering sense to do so. It does not matter whether linguists would classify the difference as one of language or dialect, or the manner in which it evolved historically. The most important considerations are:

  • whether (and to what extent) inheritance reduces duplication, and
  • whether it makes sense for changes to the parent language to propagate into the child.

Update 2009-10-18: I’ve now implemented Brazilian Portuguese and American English using inheritance. In addition to some changed spellings, this includes the American style of writing numbers in hundreds up to 9999, and omitting the ‘and’ in compound numbers. I’m not completely convinced regarding the last of these points: some speakers are quite adamant that use of ‘and’ is wrong, but evidence of usage in reasonably formal contexts is not so conclusive. I’d welcome comment as to whether I’ve got this right or wrong.

Dialects

October 14th, 2009

Currently the translation system supports only one dialect of any given language, so English means British English and Portuguese means European Portuguese. The United States and Brazilian dialects of these languages are sufficiently different (and popular) that they certainly ought to be supported too, but similar enough that writing entirely separate language descriptions would result in an undesirable amount of duplication.

What is needed is a mechanism which allows language description files to share common data. This would bring two benefits:

  • a reduction the amount of memory and disc space consumed;
  • automatic propagation of any corrections or improvements to a language to its dialects.

There are two ways in which sharing could be achieved:

  • by merging related dialects into a single language description file, then switching sections of that file in and out using some form of conditional notation; or
  • by requiring a separate language description file for each dialect, but allowing inheritance relationships between dialects such that only the differences need be specified.

Drawbacks of the first method are reduced modularity and readability. All dialects of a language would have to be loaded together, even if only one were needed. The language descriptions are likely to be quite complex enough handling one dialect, and if anything I would prefer to be looking at ways to break them down into smaller units rather than making them larger.

The main drawback of the second method is that it scales very poorly if the language can vary in several dimensions independently. For example, in Celtic languages the use of decimal versus vigesimal numbers is only loosely correlated with dialect and to a large extent is a matter of personal choice. You could write two language description files for each language, one for decimal and one for vigesimal, but then what happens when another issue is found where a similar choice is needed?

For these reasons I don’t think that either method provides a complete solution, so am inclined to implement both. This is not an unreasonable extravagance: many programming languages provide comparable facilities (such as #ifdef and #include in the C preprocessor).

Broadly speaking my intention is to use inheritance for regional dialects, and conditional rules for preferences which cut across those dialects. Inheritance is in the process of being implemented, and I will describe the syntax shortly. Conditional rules I don’t have a clear strategy for yet, but they are a less urgent requirement.

Merging the predicate dictionary into the language description

September 30th, 2009

I’m going to eliminate any formal distinction between the predicate dictionary and the language description. Instead there will be only one type of file, which can hold any of the currently supported types of declaration (predicate, morpheme, reading and so on).

Of course, I certainly don’t want to copy and paste the same set of predicate declarations into many separate files, but that won’t be necessary. There will need to be a way for one language to inherit declarations from another language in order to efficiently support dialects. The same mechanism, once it has been implemented, can be used across all languages to share a common set of predicate declarations.

One important difference between this and the current arrangement is that it will provide a basis for predicate declarations to be overridden. This will allow information to be associated with a predicate even if it is not strictly language-independent.

For example, Polish treat nouns differently according to whether they are animate or inanimate. For the most part animacy is defined as you would expect it to be, but there are marginal cases which are decided by convention (plants are generally inanimate, but viruses, bacteria and fungi are animate), and more than a few outright exceptions (units of currency, such as the złoty, are animate). To the extent that the classification is based on objective criteria it can and should be shared between languages, but exceptions rightly belong within the relevant language description.

Implementing this capability will not be a great burden. Arguably it simplifies the translation system slightly, and it avoids the annoyance (within the internal C++ API) of having to explicitly instantiate the predicate dictionary and provide a reference to it when constructing a language object.

Licensing

September 27th, 2009

The translation system is first and foremost an Open Source project, so it needs to be released under at least one OSD-compatible licence. Broadly speaking there are three classes of licence from which to choose:

  • fully reciprocal licences (such as the GPL),
  • reciprocal licences with a library exception (such as the LGPL), and
  • non-reciprocal licences (such as the modified BSD licence).

My preference is for a fully-reciprocal licence because that encourages other developers to give back to the community by releasing their own programs as Open Source, and the most widely used licence which does this is the GPL. However, I also want the translation system to be used as widely as possible, and imposition of the GPL by itself would prevent such use:

  • by Open Source projects that have chosen a licence that is incompatible with the GPL, and
  • by proprietary vendors who either cannot or choose not to release under the terms of an Open Source licence.

For this reason I want to retain the ability to grant further licences, including fully commercial ones if the recipient is not giving back in some other way. Currently that is straightforward because the code is entirely written by me, but the situation becomes more complicated if third-party contributions are added to the codebase.

In the case of the library itself I don’t foresee any substantial third-party contributions being needed, but for the language definition files they will be essential if the project is to be a success. There are several ways to preserve my ability to issue new types of licence:

  • by assigning copyright to me;
  • by giving me the right to sublicense on terms of my choosing; or
  • by licensing contributions permissively enough to coexist with any other licence I might want or need use.

Sun use a combination of the first two methods for contributions to Java and MySQL, and there are other companies with similar arrangements, but they are sometimes criticised for being overly one-sided. I wouldn’t personally object to contributing on such terms, but can see why others might.

I won’t claim that the third method is completely fair — dual-licensing schemes generally aren’t — but it is at least fairer than the other two. The protection it provides is weaker, because not all of the code is covered by the GPL, but still much better than placing everything under the BSD licence or the LGPL. I think it is an effective and defensible solution for the current situation where the project is first and foremost my work. (If others start making contributions of comparable size and value then I’m willing to negotiate.)

I’ve not settled on a precise form of words yet, but it will most likely be similar to the copyright disclaimer used by the GNU Project.

Names of Colours part 4: Implementation

August 31st, 2009

I’ve now had some experience working with the colour predicates described previously, and so far they have proved to be satisfactory. Certainly I have not yet had cause to wish that they were defined differently. However there are a number of constructions which cannot currently be translated, due in large part to how the word selection algorithm works. Here is an outline of what works, what doesn’t, and how the situation could be improved.

Unqualified hues present no difficulty provided that the readings given cover all of the allowed predicates. This can and often does result in several readings for the same colour name. For example, both colour:azure and colour:blue have been translated as ‘blue’ in English.

Hues qualified with colour:dark, colour:light or colour:bright also work as intended provided that they are bound together as a compound predicate, for example:

(colour:dark colour:orange) ⇒ ‘brown’

With a few additions to the language description this can be expanded to:

((colour:dark colour:orange) bio:genus:vulpes) ⇒ ‘brown fox’.

However, other permutations of these predicates have a less desirable surface form:

((colour:orange colour:dark) bio:genus:vulpes) ⇒ ‘orange dark fox’
(colour:dark (colour:orange bio:genus:vulpes)) ⇒ ‘dark orange fox’
(colour:orange (colour:dark bio:genus:vulpes)) ⇒ ‘orange dark fox’

The question is, should the translation system do better with these inputs, or should these inputs be avoided?

A partial answer to this question is that it shouldn’t matter whether colour:dark or colour:orange is specified first, because the effect of these predicates on the membership function is linear. By this I mean that if f(x) represents darkness and g(x) represents orangeness then:

f(g(x)) ∝ f(x) × g(x)

Since multiplication is commutative it follows that:

f(g(x)) ∝ g(f(x))

This does not mean that (colour:dark colour:orange) and (colour:orange colour:dark) should necessarily produce the same output, but the surface forms should at least be of similar quality, which is clearly not the case at present.

One solution would be to provide two readings for the word ‘brown’, but this would be inelegant, and scales poorly if more than two predicates were involved. The alternatives are to improve the word selection algorithm so as to recognise when permutations are equivalent to each other, or to force the predicates into a particular canonical order.

Many languages have a preferred order for adjectives, so some reordering will be needed whether or not it is a requirement for word selection. For those languages which don’t have a preferred order, there is no reason why one can’t be imposed anyway. Even for those languages which use adjective order to indicate emphasis, there is no need to preserve the original order of the predicates, because that would not be a correct way to deduce what should be emphasised.

However I can see one situation where reordering won’t help. Where a noun encompasses the meaning of one or more adjectives (such as ‘lamb’ or ‘ewe’ in place of ’sheep’) there is no guarantee that the predicates replaced will be canonically adjacent to each other (for example ‘young black sheep’). For this reason I think impovements to the word selection algorithm will be needed, even if canonicalisation is introduced too.

Regarding the question of associativity, (colour:dark (colour:orange bio:genus:vulpes)) is certainly acceptable: it merely applies the three predicates in sequence. As they are all descriptive and do not contradict each other there is no reason why this shouldn’t happen, but it is not something that the translation system can handle currently.

The colour in isolation, however, must be expressed as (colour:dark colour:orange) (or vice versa), because there is no other way in which two predicates can be combined. It follows that all of these forms need to be matched. The options are similar to before, except that there will almost certainly need to be changes to word selection (because the current system cannot replace a set of predicates which do not form a subtree).

It has occurred to me that I may be making life unnecessarily difficult for myself by using an explicit binary tree structure as opposed to something more akin to the list structures used in Lisp. In the latter case there is a terminator at the end of the list, so there is no structural difference when colour:dark and colour:orange are applied to each other or applied to something else. This would be a radical change that would affect the whole translation system, but I think it is worth considering.

Hedges

August 9th, 2009

In fuzzy logic, a ‘hedge’ is a type of unary operator which acts on the membership function of a fuzzy set. Hedges typically correspond in meaning to adverbs such as ‘very’ and ’slightly’. Effects may include displacing the peak of the membership function, or causing it to become more concentrated or spread out.

Many predicates need to have fuzzy semantics if they are to accurately represent their linguistic counterparts. For example, colour:bright is fully true for colours with a saturation and value of 100%, but would still be mostly true if the value were lowered to 90%. The way I envisage the membership function to be defined, it would become completely false only on reaching the grey line. Similar considerations apply to colour:pale and colour:dark.

Because there are degrees of truth to these predicates it is possible to talk of colours which are (for example) ‘very bright’ or ’slightly dark’. All that is needed are predicates to represent the required hedges. Here are the ones I intend to make available in the first instance:

  • not[1] inverts the membership function, so that members become non-members and vice versa. It is normally implemented by the function f(x) = 1−x.
  • very concentrates the membership function, peaking at the highest possible values. It is often implemented by the function f(x) = x2 but there are other ways to define it.
  • slightly concentrates the membership function, peaking at what would otherwise be the lowest possible values. A possible definition is (very not).
  • moderately concentrates the membership function, peaking at what would otherwise be middling values. A possible definition is the union of (not slightly) and (not very).

Hedges are sufficiently fundamental that I see no objection to their being placed in the global namespace.

Given that slightly and moderately could be defined in terms of other predicates it may be asked whether they are necessary. I’ve included them for three reasons: in the interests of brevity, to avoid committing to any particular definition of the membership function, and because I’ve not seen any evidence that decomposing these hedges would be of value linguistically.

On the other hand, I do intend to make use of constructs such as (not very) and (very slightly) because equivalent forms are used in many languages.

I’ve not (yet) defined a somewhat hedge because I’m not convinced that the way this term is used in fuzzy logic (typically implemented by means of the square root function) corresponds to ’somewhat’ in English, or indeed to any word that I have been able to identify. To my mind, if there were to be a somewhat hedge then it should peak between slightly and moderately.

A potentially useful benefit of defining very to mean f(x) = x2 is that it allows ((very dark) orange) to be re-expressed as (dark dark orange), which could then be translated as ‘dark brown’. I’m not going to commit to saying that very is defined this way, because I want to avoid that level of detail. I will say that it is an acceptable approximation, which is all the translation system needs to know in order to make use of it (since translation is not an exact process).

There will probably need to be restrictions on when and where these predicates can be used, and in what combinations. For example, whereas (very dark) makes sense and ought, (very (cardinal 5)) does not (because being fivefold is not an attribute of degree). I’ll return to this subject when I have more experience of implementation.

[1] Some sources class not as a hedge, others as part of a larger class of functions called modifiers.