Lexicon User Guide

Introduction

This is the user guide for the Project Babel Lexicon, a multi-lingual database of paradigms, morphemes and readings from which language definition files are generated.

Access

The lexicon is publicly readable but not writeable.

The Corpus

The main source of words used to populate the lexicon is a corpus of texts. To qualify for inclusion, a text must be:

  • freely redistributable in the UK;
  • presented as a plain text file in a supported encoding;
  • of high quality (to avoid adding an unmanageably large number of non-words to the lexicon);
  • largely written in a dialect that is within the bounds of what the corresponding language definition is intended to cover.

(The last of these conditions will tend to exclude texts beyond a certain age due to language change.)

The corpus also provides an indication of how common each word is, by recording how many times it has been seen. It would be unwise to read too much into this count, as the corpus is not a balanced sample (and in many respects it is very unbalanced). However, the counts do serve a number of useful purposes, including prioritisation (handling the most common words first) and checking that paradigms have not missed any common word forms.

It is possible to limit processing of a text to a particular set of line ranges. Anything which is not part of the text as such (for example, notes about how it was digitised) should be excluded. The same applies to any substantial sections which are written in a different language, or which otherwise do not meet the criteria for inclusion.
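As an illustration only, the following sketch shows how word counts might be collected from a restricted set of line ranges. The function name `count_words`, the use of 1-indexed inclusive ranges, and the tokenisation rule are all assumptions made for the example; they are not taken from the lexicon software itself.

```python
import re
from collections import Counter


def count_words(lines, ranges):
    """Count words in the lines covered by `ranges`.

    `ranges` is a list of (first, last) pairs, 1-indexed and inclusive.
    This convention is an assumption for the sketch.
    """
    counts = Counter()
    for number, line in enumerate(lines, start=1):
        if any(first <= number <= last for first, last in ranges):
            # Treat any run of letters as a word (a simplification).
            counts.update(re.findall(r"[^\W\d_]+", line.lower()))
    return counts


text = [
    "Digitised by a volunteer.",  # line 1: not part of the text as such
    "The cat sat on the mat.",    # line 2
    "The dog sat too.",           # line 3
    "End of file notes.",         # line 4: not part of the text as such
]

# Only lines 2 to 3 contribute to the counts.
counts = count_words(text, [(2, 3)])
```

Here the digitisation note and the trailing note are excluded simply by leaving their line numbers out of the ranges.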

Words

A database entry is created for a word when it is seen in the corpus, or if it is generated by the paradigm of a confirmed stem. The first of these processes will naturally result in many non-words being added to the database. These are not deleted (otherwise they would merely be re-added when next encountered), but can instead be marked as words to be ignored.

When a word is added, an attempt will be made to predict the corresponding stem by means of a stemming algorithm. This is done for convenience, to avoid having to type the stem in cases where the prediction was correct. In the main word list, the prediction will be displayed in the absence of a confirmed stem.
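The behaviour described in this section can be sketched roughly as follows. The class and method names are hypothetical, and `predict_stem` is a deliberately crude stand-in for whatever stemming algorithm the lexicon actually uses.

```python
from dataclasses import dataclass
from typing import Optional


def predict_stem(word: str) -> str:
    """Toy predictor: strip a trailing 's' (not the real stemmer)."""
    if word.endswith("s") and len(word) > 3:
        return word[:-1]
    return word


@dataclass
class WordEntry:
    word: str
    count: int = 0
    ignored: bool = False
    confirmed_stem: Optional[str] = None

    def displayed_stem(self) -> str:
        # A confirmed stem takes priority; otherwise show the prediction.
        if self.confirmed_stem is not None:
            return self.confirmed_stem
        return predict_stem(self.word)


class Lexicon:
    def __init__(self):
        self.entries = {}

    def see(self, word: str) -> WordEntry:
        # Create the entry on first sight, then count the occurrence.
        entry = self.entries.setdefault(word, WordEntry(word))
        entry.count += 1
        return entry

    def ignore(self, word: str) -> None:
        # Non-words are flagged rather than deleted, so that seeing
        # them again does not create a fresh entry.
        self.entries[word].ignored = True


lexicon = Lexicon()
lexicon.see("cats")
lexicon.see("cats")
lexicon.see("teh")
lexicon.ignore("teh")
lexicon.see("teh")  # still one entry, still ignored
```

Keeping the ignored entry in place is what prevents the non-word from reappearing as a new candidate on the next pass through the corpus.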

Stems

For the purposes of the lexicon, a stem is a word form which has undergone derivation but not inflection. (Exactly where the boundary is drawn between derivation and inflection is an issue which will be addressed in the policy manual for each language.)

When a stem has been fully analysed, the database records how it decomposes into morphemes, which major lexical categories it belongs to, and what word forms it can produce. Because the analysis may be incomplete, there is also a status indicator, which can be one of:

  • proposed: an entry for the stem has been created in the database, but it is not necessarily in a usable state.
  • confirmed: the stem should be in a usable state, but it has not necessarily been fully checked.
  • checked:

Even in the checked state, you should not assume that every lexical category has been identified (although an attempt should have been made by that point to identify all that are in common use).

Stems are recorded in the database in order to account for words harvested from the corpus. For lightly derivational languages such as English, it is expected that most valid stems will be explicitly recorded. For languages which use derivation more heavily, such as tlhIngan Hol, this may be neither possible nor desirable. For this reason it is only necessary to record a stem if it has actually been encountered in the corpus, or if there is a desire to use it as an example.

It is possible for a morpheme or paradigm definition to change after one or more stems that make use of it have been confirmed. Preventing this from happening would be undesirable, as it would (for example) require that the default paradigm be frozen at a very early stage of each language. Instead, any material changes are detected after the event by comparing the computed paradigm with the word forms recorded when the stem was confirmed.
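A minimal sketch of this after-the-event check is given below. It assumes that both the recorded and the computed paradigm can be reduced to plain sets of word forms; the function name and the shape of the result are hypothetical.

```python
def compare_forms(recorded, computed):
    """Compare recorded word forms with a freshly computed paradigm.

    Returns the forms that have disappeared and the forms that are new.
    """
    recorded, computed = set(recorded), set(computed)
    return {
        "missing": recorded - computed,  # recorded at confirmation, no longer produced
        "added": computed - recorded,    # produced now, not recorded at confirmation
    }


# Forms stored when the stem was confirmed.
recorded = {"walk", "walks", "walked", "walking"}
# Forms produced after a (hypothetical) faulty paradigm edit.
computed = {"walk", "walks", "walkt", "walking"}

diff = compare_forms(recorded, computed)
material_change = bool(diff["missing"] or diff["added"])
```

A stem whose two sets are identical is unaffected by the edit, so only stems with a non-empty difference would need to be reviewed.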

It is sometimes possible for a given stem to have more than one paradigm (for example the English past tense ‘hung’ versus ‘hanged’). This situation is handled by treating them as separate stems. To distinguish them, a suffix beginning with a colon may be added to the stem name.
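By way of illustration, a stem name carrying such a suffix could be separated into its parts as shown below. The names `hang:1` and `hang:2` are invented for the example; the guide does not specify what follows the colon.

```python
def split_stem_name(name):
    """Split a stem name into (stem, tag), where tag is None if absent."""
    stem, colon, tag = name.partition(":")
    return stem, (tag if colon else None)


# Two hypothetical entries for the same written stem.
first = split_stem_name("hang:1")
second = split_stem_name("hang:2")
plain = split_stem_name("walk")
```

Both tagged names share the written stem `hang` but remain distinct database keys, which is what allows each to carry its own paradigm.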

Morphemes

To further analyse each stem it is broken into a sequence of morphemes. At present these are handled very simplistically:

  • the stem is formed by concatenating the list of morphemes;
  • the inflectional paradigm is always determined by the last morpheme.
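The two rules above amount to very little code, as this sketch suggests. The function names and the example decomposition are assumptions made for illustration.

```python
def build_stem(morphemes):
    """Form the stem by concatenating the morphemes in order."""
    return "".join(morphemes)


def governing_morpheme(morphemes):
    """Return the morpheme that determines the inflectional paradigm."""
    return morphemes[-1]


# A hypothetical decomposition of an English stem.
morphemes = ["un", "kind", "ness"]
stem = build_stem(morphemes)
governor = governing_morpheme(morphemes)
```

Under this scheme the stem inflects according to its final morpheme regardless of what precedes it.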

Paradigms

A paradigm consists of a set of rules which determine how a morpheme is inflected. As in a language definition file, paradigms may either be directly associated with a morpheme or may be named separately. To distinguish them, current practice is to prefix stand-alone paradigm names with an exclamation mark. Most paradigms have a parent from which rules are inherited.
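One plausible reading of rule inheritance is sketched below, in which a paradigm's own rules take priority and anything it lacks is looked up in its parent. This override behaviour, the representation of a rule as a suffix keyed by name, and the paradigm names themselves are all assumptions; the guide does not define them.

```python
class Paradigm:
    def __init__(self, name, rules=None, parent=None):
        self.name = name
        self.rules = dict(rules or {})
        self.parent = parent

    def lookup(self, key):
        """Find a rule here, or failing that in the nearest ancestor."""
        paradigm = self
        while paradigm is not None:
            if key in paradigm.rules:
                return paradigm.rules[key]
            paradigm = paradigm.parent
        raise KeyError(key)


# Stand-alone paradigm names carry a leading exclamation mark.
noun = Paradigm("!noun", {"singular": "", "plural": "s"})
noun_es = Paradigm("!noun-es", {"plural": "es"}, parent=noun)

plural = noun_es.lookup("plural")      # overridden in the child
singular = noun_es.lookup("singular")  # inherited from the parent
```

Only the rules that differ need to be stated in the child, which keeps each paradigm definition short.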

Attributes

Attributes represent grammatical and lexical concepts such as tense, number, case and mood. Each type of inflection is associated with a list of attributes, as is each rule. A rule is triggered if its attribute list is a subset (proper or otherwise) of that associated with the inflection.
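The triggering condition can be expressed directly as a subset test, as in this sketch. The attribute names and the rule list are invented for the example.

```python
def is_triggered(rule_attributes, inflection_attributes):
    """A rule fires if its attributes are a subset of the inflection's."""
    return frozenset(rule_attributes) <= frozenset(inflection_attributes)


inflection = {"past", "participle"}

# Hypothetical rules, each identified by its attribute list.
rules = [
    set(),                   # no attributes: a subset of everything
    {"past"},                # proper subset
    {"past", "participle"},  # equal set
    {"plural"},              # not a subset
]

fired = [rule for rule in rules if is_triggered(rule, inflection)]
```

Note that a rule with an empty attribute list is triggered by every inflection, since the empty set is a subset of any set.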

Dialects

Support for multiple dialects is not currently implemented, although the need to support them has been recognised.

Licence

The lexicon, and the language definition files derived from it, are Copyright © Graham Shaw. Redistribution and modification are permitted within the terms of the GNU General Public License (version 3 or any later version).

In order to guarantee the provenance and legal status of the lexicon, current policy is to avoid any substantial import of third-party lexical data. It is acceptable for third-party data to be used to validate the lexicon, provided that this is done only to identify errors (as opposed to providing corrections), and provided that the process does not rely so heavily on any one source that there is a risk of becoming substantially similar to that work.

Texts for import into the corpus are not required to be OSD-free; however, they must be freely redistributable within the UK in the form presented.