The Need for a Bootstrap Lexicon

August 7th, 2010

The method I ultimately intend to use for compiling a lexicon is by tasking a parser to search for words that need to be of a particular lexical category for the surrounding text to be grammatical. The catch-22 is that this parser will only be able to do its work when the lexical categories for most of the words in a sentence are already known. In other words, in order to compile a lexicon by this method I need to already have a lexicon.

The parser can be run a number of times if necessary, so it need not catch every word at the first attempt. The only requirement is that the parser add enough new words to the lexicon each time it is run to make subsequent runs more successful. However, my expectation is that this will only happen when the lexicon is above a certain critical size: any smaller and the number of sentences successfully parsed will be too small to sustain growth. It follows that an initial lexicon, generated by other means, will be needed to bootstrap the process.

To generate the bootstrap lexicon I’ve chosen to look for methods with a very low false-positive rate, even if this results in a relatively poor yield. There are two main reasons for this:

  • I can remove false positives from the English lexicon by manually reviewing it, but will not have this luxury for languages with which I am less familiar. In the absence of an army of helpers, I need to develop methods that can deliver useful results without supervision.
  • The corpus I am working with is sufficiently large that yield is not a major concern during the bootstrap phase.

I don’t intend to search for closed-class words automatically, in English or in any other language: these can be entered manually. I will be looking for automatic methods for isolating nouns, verbs, adjectives and adverbs. Of these, I have so far had most success searching for nouns with regular plurals. My next post will describe the method used and the outcome.

Harvesting text from Wikipedia

August 2nd, 2010

Having imported about a quarter of Wikipedia onto a local server, I now need a way to harvest the text from those articles in a form that is suitable for analysis. There are three stages to this: fetching the HTML, parsing the HTML, then isolating the useful parts of the text.

I did briefly consider using the MediaWiki API to obtain the text as XML rather than HTML. This had some attractions, in that the markup was more orientated to semantics rather than formatting, but XML is less forgiving than HTML and there were enough irregularities to make parsing difficult: in some cases the data was not even well-formed. HTML parsers are usually quite good at handling marginal input, so this appeared to be the more promising approach.

In order to fetch pages using HTTP I need the page titles, preferably in the same order as they appear in the XML dump file. There are two reasons why this order is helpful: so that I can start work without waiting for the whole of Wikipedia to import, and (in the interests of repeatability) to make it easy to describe which subset of articles I used on a particular occasion. I’ve written a script to do this called extract-wikipedia-titles. After extracting the titles it filters out redirects, and also pages in the namespaces Template, File, Category, Wikipedia or MediaWiki (as these are less likely to contain text suitable for analysis).

The Perl script that performs the harvesting is called harvest-wikipedia-text. It uses LWP::UserAgent to fetch the pages, and HTML::TreeBuilder to parse them. This is not the most efficient approach, as there is no real need to build an explicit parse tree, however it was an easy way to avoid having to worry about implied or missing tags.

Once the tree has been built, the script performs a depth-first traversal. Harvesting takes place only within a <div> element with an id of bodyContent, this being where the body of the article resides. Some types of element are not entered during the traversal, including <script>, <li> and <td>, on the grounds that they either won’t contain or are less likely to contain usable text. (I don’t doubt that many lists and tables do contain well-formed sentences, but with tens of gigabytes to work with I can afford to be picky.)

I didn’t want to break the text into sentences at this stage, because while this can be done up to a point using simple heuristics, better results can be obtained by a parser that understands the grammar of the language. However there are formatting features in the text that are very likely to be sentence boundaries, and I didn’t want to lose this information. I’ve therefore chosen to produce output consisting of one line per paragraph. Each line may contain several sentences, but the assumption is that sentences don’t span multiple lines.

Text in the HTML parse tree is liable to be split into fragments because of intervening structures such as elements and comments. Whenever my script sees a text fragment, it appends it to the end of the current paragraph. There is a function break_paragraph, called whenever a <p> element begins, which outputs then clears the current paragraph.

Wikipedia makes heavy use of templates, often nested several levels deep, and if any of these are missing then the text does not render correctly. I had therefore thought it would be best to import all templates before harvesting. To expedite this I wrote a script to extract the required subset of pages from the XML dump file, allowing them to be imported separately from the bulk of the text. The result was disappointing: page rendering times were greatly increased, and much of the extra material generated was not helpful. I therefore intend to work without a full set of templates in the first instance, but this is an area that would benefit from further investigation.

In order to compensate for the lack of proper template processing, my script recognises and ignores the resulting broken link elements that appear in the HTML. It also stops processing a page when it sees a section called “References”, “See also” or “Links” (as there is usually no more usable material after that point).

I’ve performed an initial export of 100,000 articles, yielding 635 megabytes of text containing (according to wc) 107 million words. I will certainly want to add to this in time, but for the next phase of work it should be ample. This will be to construct a small- to medium-sized lexicon using very simple methods, which can then be used to bootstrap a more accurate parser-based analysis of the full corpus.

Importing Wikipedia

July 27th, 2010

Wikipedia generally encourage reuse of their material, but discourage spidering because of the load it places on their servers and bandwidth. Instead they provide copies of the underlying database in an XML-based format. This is many times smaller than the rendered HTML, and can be served statically, so it is a much more efficient method of transfer from both their point of view and mine. It doesn’t include images, but for the analysis I have in mind the text is all I need.

I tried processing this XML file directly, but while it is easy enough to strip away the outer layer of XML, what that leaves is unparsed wikitext. Unfortunately there are many variants of wikitext, and (currently) no authoritative specification for the particular dialect used by Wikipedia (beyond the MediaWiki source code). Writing a parser would be easy enough, but writing one that resolved ambiguities in exactly the same way as MediaWiki would be more difficult. From discussion of this topic elsewhere, my understanding is that the content of Wikipedia is quite sensitive to these subtleties.

The obvious solution to this problem is to use MediaWiki itself to perform the parsing. The simplest way to do that (without spidering) is to create a local copy of Wikipedia, then download from there. Building a local copy is fairly straightforward, but there are a few wrinkles that are worth recording.

MediaWiki is available as a package for both Debian and Ubuntu (my favoured distributions), but the specific versions provided by Lenny and Lucid rendered many pages incorrectly: the content following a reference was often rendered as preformatted text, due to an interaction between the different layers of syntax. I therefore ran up a copy of Squeeze (which is still in testing but quite usable for this purpose) and that appears to work much better. (I could have downloaded and installed the upstream version instead, but I prefer to use packages where feasible.)

There are several ways to import the text. I chose to use importDump.php, which is definitely not the quickest method in terms of elapsed time, but it appeared to be the least risky and required the smallest amount of effort from me. The only problem I encountered is that script chokes on <redirect /> elements. Fortunately they are easy to remove due to the particular way in which the XML is formatted:

egrep -v "^\s*<redirect />\s*$"

What effect this will have on the behaviour of the wiki is unclear (I presume these elements were there for a reason), but redirects to normal articles I can definitely live without: one copy is sufficient. Redirects to templates may be more of an issue.

In order to correctly render Wikipedia there are some MediaWiki extensions which must be enabled. The most important of these are Cite and ParserFunctions (Cite being the extension that handles references, and which had been erroneously generating preformatted text). They are located in a separate package, mediawiki-extensions-base, and need to be enabled using the mwenext command. I’ve also enabled Poem, InputBox and CategoryTree from the same package. Wikipedia itself uses several dozen more (full list here), but I don’t think any of them significantly affect how the text is rendered.

I chose to use the current article text only (enwiki-20100622-pages-articles.xml.bz2). The page histories would have been likely to add a lot of repetition and noise, but little that was genuinely useful. A better case could be made for including the discussion pages, but these are by their nature quite informally written (with little or no effort made to correct mistakes), and I thought on balance that they were more likely to hinder than help.

The XML file is approximately 6GB compressed and 27GB uncompressed. I’ve allowed 128GB for the resulting MySQL database, and current indications are that this should be sufficient. The machine used to perform the import had 1GB of RAM and this proved to be grossly inadequate, but I was able to work around it by stopping then restarting the import script several times. It quickly skips over any pages that are already present, using little or no memory in the process. I wouldn’t want to recommend this technique in general, because I don’t know what it does to the database, but the result appears to be good enough for the intended use of text harvesting.

The import has already taken two weeks, and it is likely to be several more weeks before it finishes. That’s OK, because the text it has imported is more than enough to start working on. More soon about how to extract and process the text.

Redesigning the Lexicon

June 30th, 2010

The lexicon on the Project Babel website was originally set up as a means for generating language description files. I’ve since concluded that making it the authoritative source for the files isn’t practicable: they need to live in the Subversion repository like everything else. Nor is it particularly well suited for managing readings: my experience is that these are better handled systematically, one topic at a time.

Where I think the lexicon can add value is by providing a dataset against which morphological rules can be evaluated and regression-tested, but some changes are needed if it is to perform this function effectively.

First and foremost the process of populating it needs to become much, much faster. That means automatically and accurately guessing much of the required content, so that my rĂ´le is largely reduced to approving or correcting those guesses. The user interface needs to be designed for doing this in bulk: minimising the number of clicks and page reloads by placing many entries on one page.

Secondly, better coverage of each language is needed. The texts I was using previously (from Project Gutenberg) don’t do this effectively enough. The problem isn’t that the texts are unrepresentative: on the contrary, they are too closely representative of normal writing, with not enough coverage of rare words.

For example, of the chemical elements I found hydrogen, oxygen, aluminium (and aluminum), sulphur (but not sulfur), argon, iron, nickel, copper, silver, tin, gold, mercury and lead. (Some of these have multiple meanings, but that doesn’t matter provided the words in question enter the database). Including some texts about chemistry would improve matters, but even then it would take a huge amount of raw material to complete the list.

Importing a copy of the periodic table would clearly solve this particular problem, but I’m not convinced that including tabular data in the corpus is a good idea: there would be little or no opportunity to deduce parts of speech automatically, and I’m concerned that it would add significantly to the number of non-words in the database. In any event, it is not a general solution because most of the words that need to be covered won’t appear in neat tables.

A type of prose that could provide much better coverage is an encyclopedia,
and the obvious candidate is Wikipedia. There was a good case for using this anyway because of its sheer size, and also the number of languages in which an edition is published. I did have some concerns that it would be statistically unrepresentative, but it is now clear that a representative sample is not what is needed.

(Of course I could obtain much of the information needed from Wiktionary, but my preference is to keep this and other dictionaries in reserve as a means for checking data that I have generated independently. This, I hope, will provide more opportunity for detecting errors.)

The third change is to make the language description files an input to the lexicon rather than an output. There then needs to be a means to compare generated word forms with ones that have been reviewed and approved.

Computer Terminology part 2: Topics

May 11th, 2010

In choosing what the topics should be, my main priority has been to avoid creating large grey areas. This is partly to simplify the initial classification process, but mostly to make the resulting predicates easier to memorise: if the classifications are obvious given the topics, then all you have to remember are the topics.

Computer technology makes a good choice for a top-level namespace because it is a large subject that has much specialised vocabulary and reasonably well-defined boundaries. I’ve called this namespace ‘comp‘. The namespaces immediately below it wil include:

comp:alg (algorithms)
comp:arch (machine architecture)
comp:code (source and object code)
comp:data (data types and structures)
comp:dev (software development)
comp:exec (execution)
comp:io (input/output)
comp:net (networking)
comp:os (operating systems)
comp:ui (the user interface)

Some of the second-level namespaces have third-level namespaces within them. Examples include comp:alg:sort and comp:io:printer. The main reason for introducing them was the groups of predicate names that would otherwise share a common suffix, such as inkjet, laser and thermal printers, or bubble, insertion and selection sorts. The word ‘printer’ or ’sort’ was going to have to appear in the predicate name anyway, so why not make it a namespace and obtain the benefit of grouping like terms together?

Much computer-related terminology has been adapted from other fields. It will therefore be no surprise to find the same predicate names appearing in other namespaces with similar definitions, both inside and outside the comp hierarchy. An example would be comp:dev:fork and comp:os:fork. One could argue that these are essentially the same process applied respectively to source code and to a running process, but there are two good reasons for keeping them separate:

  • Although they have much in common, there is enough contextual baggage associated with each that it would be difficult for a generic description to convey their full meanings
  • The concept of forking is one that applies only to a very small number of isolated concepts. It is therefore quite practicable to enumerate the alternatives, and there is little to be gained by attempting to generalise.

A subtopic that I am unsure about adding is one for types of software. Originally I had put together quite a number of predicates that could be grouped together in this way, but it became increasing apparent that most of them could be expressed using the pattern ‘program to do X’. Thus, a compiler is a program for compilation, a text editor is a program for editing text, and so on. There may be a case for providing these anyway, as aliases, but I’m not yet convinced that is justified.

There may also be a need to add subtopics relating to specific technologies. For example, ‘method’ is a generic term for a subroutine that is a member of a class, but for methods written in C++ it is more common and more appropriate to use the term ‘member function’. This difference needs to be captured somehow, and one way to do that would be to treat a C++ member function as a type of method that is sufficiently distinctive to be given its own predicate. However I’m not necessarily convinced that this is the right answer, not least because it would violate orthogonality.

One final point: I think my work on this namespace is sufficiently mature to start committing parts of it to the repository, but this does not mean that the predicate names should be considered stable. On the contrary, I expect change as it beds in, and I won’t be making any particular effort to ensure backward compatibility. The only predicates that have been declared stable so far are those relating to numbers. At some point in the future it will be necessary to identify stable and non-stable predicates in the BabelScript documentation, but for now it is enough to say that everything apart from cardinal and ordinal numbers is non-stable.

A Markup Language for Documenting BabelScript

May 1st, 2010

So far the very small amount of documentation that I’ve written concerning BabelScript, aside from these blog posts, has been created using DocBook. I will shortly be adding the predicate definitions covering computer terminology, but rather than continue in the same manner this strikes me as an almost ideal candidate for a custom markup language that can be transformed using XSL:

  • The final result will be large but highly structured, consisting of many thousands of predicate descriptions that (with a few notable exceptions such as numbers) all follow broadly the same pattern.
  • I would strongly prefer not to commit to any particular layout at the present time.
  • It is quite conceivable that there will be a need to present the documentation in several ways, with different layout and perhaps also different content

The elements currently defined are:

  • babelscriptml, the root of the document
  • topic, corresponding to a top-level namespace
  • subtopic, corresponding to a second- or lower-level namespace
  • title, the title of a topic or subtopic
  • naming, specifying how predicates are named in a topic or subtopic if this is done systematically
  • predicate-def, the definition of a predicate containing (at least) the predicate defined and a description
  • predicate, the name of a predicate
  • description, a description of a predicate, topic or subtopic
  • compare, a predicate that it is instructive to compare with the one being described
  • note, an additional comment within a topic, subtopic or predicate definition

This is not intended to be a final list, but I won’t be adding anything that is not needed for the task in hand. The markup is strictly ad hoc in nature, and I do not intend to worry about forward or backward compatibility with anything outside the Project Babel source repository.

Computer Terminology part 1: Strategy

April 27th, 2010

One of the main motivations for developing the translation system was to automatically generate localised user interfaces for computer software. This task is likely to make heavy use of computer-related terminology, making that topic an obvious one to address at an early stage. I’ve refrained from doing this previously because there were some basic questions about predicates which needed to be settled first, and it was easier to do that using well-ordered sets of concepts such as numbers and colours. Now that I’ve gained some experience I think it is feasible to attempt something more ambitious.

Identification of the required predicates has been largely ad hoc, and I am expecting substantial additions to be necessary in the future. Writing definitions has been time-consuming but mostly straightforward. I doubt they are up to dictionary standard, but they should be adequate to indicate what is intended.

More difficult has been deciding how to divide the resulting predicates between namespaces. I’ve explored two alternative approaches:

  • A subtype-supertype hierarchy, in which members of a given namespace share an is-a relationship with their parent.
  • A topic-based hierarchy, in which predicates with fundamental semantic differences can and should share the same namespace if they relate to the same subject area.

For example, the first method would place the user interface elements ‘window’, ‘icon’ and ‘menu’ into one namespace, and the user actions ‘click’ and ‘drag’ into another, because the former are user interface elements whereas the latter are actions. The second method would more likely place all of these in a single namespace because they all relate to the topic of computer user interfaces.

The first method is attractive because it is relatively objective, but I’ve found that it has three major drawbacks:

  1. The resulting namespaces are often very small (and therefore numerous).
  2. The namespaces can be difficult to name concisely.
  3. Homonyms, which namespaces were supposed to resolve, are quite likely to end up in the same namespace.

The third point is arguably the most serious one, because there is no point having namespaces unless they provide a way to specify which sense of a word is intended. For example, using the supertype-subtype approach, there isn’t a large difference between hanging out the washing and hanging a person: both are actions which involve suspending something from a rope or line. However the topics with which they are associated are entirely different: housekeeping versus criminal justice.

Another question that arose was how far to go in creating predicates to represent concepts that could be expressed in terms of other predicates. Here are some examples where I think a good case can be made one way or the other:

  • Although it is true to say that a ‘laser printer’ is a printer that contains a laser, there is much more to its meaning than that. A full description would be impracticable, on grounds of both verbosity and fragility. It therefore makes good sense for there to be a separate predicate corresponding to the concept of a laser printer.
  • The alternative would be to create a predicate that captures the full, specialised meaning of the word ‘laser’ as used in the term ‘laser printer’. However, such a predicate would be so specialised that it could, I suspect, only be used to modify the concept of a printer. If the predicate language was being developed for semantic analysis then this added orthogonality might be useful, but I don’t think a translation system has any need for it.
  • A ‘printer cable’ is merely a cable for a printer. If a separate predicate were created for this concept then, for consistency, it would also be necessary to add ‘keyboard cable’, ‘mouse cable’, ‘plotter cable’ and many others. The need to systematically replicate a complete class of predicates is a clear indication that orthogonality has been violated, and while I wouldn’t be above that if there were a clear practical benefit, I can’t see any justification for it in this instance.

The distinction here is between what would be an idiom and what is merely a collocation. Idioms are forbidden in BabelScript, because they violate the rule for combining predicates, so if a concept would be too complex to be described analytically then it needs to become a predicate in its own right.

However there is an exception to this rule. Even if a predicate is not strictly needed, it can be added anyway as an alias for the decomposed form. In this case the criterion is whether the convenience of having a single predicate to express a concept outweighs the clutter resulting from additional and unnecessary predicates. In the third example above I’ve said no, but this is very much a value judgment.

For many of the potential predicates I’ve considered the correct answer is unclear. For example:

  • Is a package manager merely a program for managing packages?
  • A compiler is (I would say), merely a program for compiling, but is it a sufficiently well-used concept to justify an alias?
  • Should constructors and destructors be expressed as functions for constructing and destroying?

The approach I’ve taken is to err on the side of caution and avoid creating predicates that are questionable. Adding new predicates is straightforward, especially if they are aliases, whereas removing them is more disruptive (because it breaks compatibility with texts that use them).

Issues with the use of Unification for Word Selection

March 19th, 2010

Unification can be used to translate from one lexicon to another by means of a ‘transfer dictionary’ of the form:

{[lang1=foo,lang2=bar], [lang1=baz,lang2=qux],…}

I had hoped to use something very similar to perform word selection, for example:

{[class=animal,rank=familia,familia=accipitridae,
category=noun,lexeme=eagle], …}

This correctly selects

[category=noun,lexeme=eagle]

when presented with the input

[class=animal,rank=familia,familia=accipitridae].

Unfortunately it would also select

[category=noun,lexeme=eaglet]

if there were a dictionary entry of the form:

{[class=animal,rank=familia,familia=accipitridae,
maturity=child,category=noun,lexeme=eaglet], …}

because my selection criterion does not specify the required degree of maturity. Similarly, source text that you might expect to select the word ‘vehicle’ would also match ‘hovercraft’ and ‘penny farthing’.

It is possible to circumvent this problem by providing a special symbol ‘none’ that is recognised during the unification process and matches only if its counterpart is unspecified. However, requiring the source text author to use this symbol — possibly several times per FD to exclude different attributes — would be an unreasonable burden, and unification does not (so far as I can tell) provide the means to insert it automatically.

The second issue is how to distinguish between information that:

  • need not be expressed, but is provided to assist with word selection, or
  • must be expressed, but can be subsumed into another word, or
  • must be expressed, as a word in its own right.

For example, suppose there is a requirement to refer to a young, female sheep. In the first case the animal would be described as a ‘lamb’, because that is a more accurate word than ’sheep’ if it is young. The fact that it is female would not be used in English, because (so far as I’m aware) there is no word to express all three concepts together.

In the second case we are saying that ‘young’ and ‘female’ are essential to the meaning of the sentence and must be expressed somehow. The concept ‘young’ is subsumed into the meaning of ‘lamb’, but ‘female’ is not, therefore the output would be ‘female lamb’.

In the third case we are saying that ‘young’ and ‘female’ are so important that they must be expressed as entirely separate words. The output would therefore be ‘young female sheep’. Note that it needs to be a ’sheep’ in this case, because ‘young lamb’ — by encoding the age twice — would imply a very young animal (in much the same was that dark brown is very dark orange).

How to implement this behaviour is a task for another day. My current concern is how to request these different types of behaviour from within the source language. For the current, predicate-based format this is straightforward and I had already reserved two tags for the purpose:

explicit — must be expressed separately
optional — need not be expressed

These would be applicable to both atomic and compound predicates, so (for example) tagging (dark orange) as explicit would prevent the compound as a whole from being merged with anything else, but should not prevent dark and orange from being expressed using the single word ‘brown’.

A further refinement would be to provide shortcuts to allow these tags to be added more concisely: ‘?’ for optional and some other symbol for explicit. This would allow notation of the form:

(young? female? zoo:species:ovis:ares)

Functional descriptions can be qualified in a very similar manner, but individual features can’t be if their values are atomic: to be able to qualify the value of a feature, that value must itself be a functional description. A more concise approach would be to use atomic-valued features for one of the alternatives and separate FDs for the other two:

  • information provided solely by atomic-valued features is optional, whereas
  • each item of essential information is placed in a separate FD.

I don’t have any objection to this solution in principle: it is very similar to my existing intent to use predicates for data and tags for metadata. However, despite my efforts above, the resulting source text would still be much more verbose than one based on predicates — so much so that I don’t think I could reasonably ask developers to produce source text in that format. For example:

(young female zoo:species:ovis:ares)

might become something like:

[category=np,
age=[class=maturity,maturity=child],
gender=[class=gender,gender=female],
head=[class=animal,rank=genus,genus=ares]]

It might be possible to shorten this by tweaking the syntax, but there seems little prospect of it improving on the conciseness or readability of the existing predicate-based notation. Since they convey essentially the same information, and since this is an external interface to the translation system which has to be easy to use, I think that makes a strong case for not using functional descriptions in the source text.

That doesn’t mean that there is no role for unification later in the translation process, and if there needs to be a conversion between two different formalisms then word selection would be a convenient point at which to do it.

Progress towards Unification

December 31st, 2009

I’ve now created data structures to represent features, conjuncts and disjuncts, and can perform unification in a few simple cases — but not yet generally in the presence of disjuncts. The latter is hard to do efficiently, so in the first instance I plan to use an inefficient but simple algorithm (probably expanding both inputs to disjunctive normal form prior to unification).

Also not implemented yet are paths, constituent sets and patterns. An open question is whether to implement paths explicitly, or instead use named variables to tie branches of the tree together. So far as I can see, these options provide essentially the same functionality, however I suspect that variables may be more convenient to implement given that the data structures I am using only allow you to move down the tree, not upwards.

I’m now confident that unification can provide all of the current functionality of the translation system except (perhaps) for morphological operations and left-right agreement. Furthermore, it would do so in a significantly more flexible and elegant manner than the ad-hoc mechanisms that exist at present. My main concern remains that of efficiency. I’m also uncertain as to how best to support dialects.

One mechanism which could be eliminated is the use of namespaces. For example, the predicate zoo:species:tyrannosaurus:rex could be replaced by:

[type=animal,rank=species,genus=tyrannosaurus,species=rex]

This is more verbose, but self-describing (a bit like the difference between X.500 and the DNS) and able to be processed in ways that an opaque predicate name cannot.

Unification

November 30th, 2009

I’ve begun experimenting with some changes to the internal data structures so that the translation system could support unification. This is a concept borrowed from formal logic which is has become popular as a natural language processing technique. A good introduction can be found in Functional Unification Grammar: A Formalism for Machine Translation by Martin Kay.

The most notable difference this will bring to the translation system is that functional unification grammars are declarative in nature, whereas the languages descriptions I am using currently are (for the most part) procedural. A declarative approach has two important advantages:

  • The language description need not be closely tied to any particular processing task, or to any particular implementation method. This makes it more likely to be reusable for other purposes. For example, a grammar described exclusively in terms of unification would be reversible.
  • Because the whole grammar applies in parallel (as opposed to individual rules being applied in series), the order in which components of the grammar are specified is unimportant.

Against this I expect there will be a price to be paid in terms of efficiency and/or complexity. Declarative languages tend not to be naturally efficient, by which I mean that when implemented simplistically they tend to be very inefficient. It is often possible to claw back some or all of this loss by using more sophisticated algorithms, but the programming effort required to do that can be considerable.

To obtain the full benefit of a unification grammar it would be necessary to eliminate all of the existing rule types, but that isn’t practicable in the short term. For this reason I’m going to provide the means to translate phrases into functional descriptions and back again. This is straightforward enough:

  • If the phrase is atomic then the corresponding token becomes the value of an appropriately-named feature (eg. ‘atom’).
  • If the phrase is composite then the left- and right-hand sides each become features (eg. ‘lhs’ and ‘rhs’).
  • Each tag becomes a feature with a value of ‘true’.

Because functional descriptions can only grow, it may be necessary to use different feature names for the conversions to and from functional descriptions (depending on what is being attempted at the time).

Most likely it will be agreement rules that I attempt to replace first. This will require at least three significant changes to the way the language description is written:

  • In an agreement rule the absence of a tag can be used to indicate with certainty that a particular condition is false, whereas in a unification grammar the absence of a feature means that a value is unknown: to indicate falsehood it is necessary to explicitly specify a value of ‘false’.
  • When propagating attributes such as gender it will not be necessary to provide a separate rule for each of the possible values.
  • Agreement rules match only when they need to change something. A unification grammar must provide rules to match every valid input (otherwise the input would not be allowed), and normally you would want only one rule to match each input.

The changes are strictly experimental for now, and are being developed in a branch so that they can be easily abandoned should that be necessary. The main criteria will be whether they improve readability and maintainability of the language description files, and how much they add to memory and processing requirements. If they are successful then ideally they would replace all of the current phases from part-of-speech discovery through to transformation. Alternatively, they may simply be used to add to the existing rule types available when writing a language description.