Lexicon User Guide
Introduction
This is the user guide for the Project Babel Lexicon, a multi-lingual
database of paradigms, morphemes and readings from which language definition
files are generated.
Access
The lexicon is publicly readable but not writeable.
The Corpus
The main source of words used to populate the lexicon is a corpus of
texts. To qualify for inclusion, a text must be:
- freely redistributable in the UK;
- presented as a plain text file in a supported encoding;
- of high quality (to avoid adding an unmanageably large number
of non-words to the lexicon);
- largely written in a dialect that is within the bounds of
what the corresponding language definition is intended to cover.
(The last of these conditions will tend to exclude texts beyond a
certain age due to language change.)
The corpus also provides an indication of how common each word is, by
recording how many times it has been seen. It would be unwise to read too
much into this count, as the corpus is not a balanced sample (and in many
respects it is very unbalanced); however, the counts do serve a number of
useful purposes, including prioritisation (handling the most common words
first) and checking that paradigms have not missed any common word forms.
It is possible to limit processing of a text to a particular set of
line ranges. Anything which is not part of the text as such (for example,
notes about how it was digitised) should be excluded. The same applies to
any substantial sections which are written in a different language, or
which otherwise do not meet the criteria for inclusion.
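As an illustration of restricting a text to line ranges and counting the words seen, here is a minimal sketch. The function name `count_words`, the 1-based inclusive range convention and the tokenisation rule are all assumptions made for this example; they are not taken from the lexicon's own implementation.

```python
import re
from collections import Counter


def count_words(lines, ranges):
    """Count words on lines covered by (first, last) ranges.

    Ranges are assumed to be 1-based and inclusive (hypothetical convention).
    """
    counts = Counter()
    for number, line in enumerate(lines, start=1):
        if any(first <= number <= last for first, last in ranges):
            # Hypothetical tokenisation: runs of Unicode letters, lower-cased.
            counts.update(w.lower() for w in re.findall(r"[^\W\d_]+", line))
    return counts


text = [
    "Digitised by a volunteer",  # line 1: a note about digitisation, excluded
    "The cat sat on the mat",    # line 2: part of the text
    "The dog sat too",           # line 3: part of the text
    "End of file notes",         # line 4: not part of the text, excluded
]
counts = count_words(text, [(2, 3)])

# The most frequently seen words can then be handled first.
most_common = counts.most_common(2)
```

Here only lines 2 and 3 contribute to the counts, so words that appear solely in the excluded notes never reach the lexicon.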
Words
A database entry is created for a word when it is seen in the corpus,
or if it is generated by the paradigm of a confirmed stem. The first of
these processes will naturally result in many non-words being added to
the database. These are not deleted (otherwise they would merely be
re-added when next encountered), but can instead be marked as words to
be ignored.
When a word is added, an attempt will be made to predict the
corresponding stem by means of a stemming algorithm. This is done for
convenience, to avoid having to type the stem in cases where the
prediction was correct. In the main word list, the prediction will be
displayed in the absence of a confirmed stem.
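The behaviour described above can be sketched as a small record type. The class `WordEntry`, its field names and the toy stemmer are hypothetical and chosen only for illustration; the real database schema and stemming algorithm are not described in this guide.

```python
from dataclasses import dataclass
from typing import Optional


def toy_stem(word):
    """Hypothetical stand-in for a stemming algorithm: strip a final 's'."""
    return word[:-1] if word.endswith("s") and len(word) > 2 else word


@dataclass
class WordEntry:
    form: str
    ignored: bool = False                 # set for non-words rather than deleting them
    predicted_stem: Optional[str] = None  # filled in automatically when the word is added
    confirmed_stem: Optional[str] = None  # entered or accepted by a person

    def displayed_stem(self):
        """Show the confirmed stem if there is one, otherwise the prediction."""
        if self.confirmed_stem is not None:
            return self.confirmed_stem
        return self.predicted_stem


cats = WordEntry("cats", predicted_stem=toy_stem("cats"))
mice = WordEntry("mice", predicted_stem=toy_stem("mice"), confirmed_stem="mouse")
junk = WordEntry("xqzt", ignored=True, predicted_stem=toy_stem("xqzt"))
```

In this sketch `cats` displays the predicted stem, `mice` displays the confirmed stem because the prediction was wrong, and `junk` stays in the database but is flagged to be ignored.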
Stems
For the purposes of the lexicon, a stem is a word form which has
undergone derivation but not inflection. (Exactly where the boundary
is drawn between derivation and inflection is an issue which will be
addressed in the policy manual for each language.)
When a stem has been fully analysed the database records how it
decomposes into morphemes, which major lexical categories it belongs to,
and what word forms it can produce. Because the analysis may be incomplete,
there is also a status indicator which can be one of:
- proposed: an entry for the stem has been created in the database,
but it is not necessarily in a usable state.
- confirmed: the stem should be in a usable state; however, it has
not necessarily been fully checked.
- checked:
Even in the checked state, you should not assume that every lexical
category has been identified (although an attempt should have been
made by that point to identify all that are in common use).
Stems are recorded in the database in order to account for words
harvested from the corpus. For lightly derivational languages such as
English it is expected that most valid stems will be explicitly
recorded. For languages which use derivation more heavily, such as
tlhIngan Hol, this may be neither possible nor desirable. For this reason
it is only necessary to record a stem if it has actually been encountered
in the corpus, or if there is a desire to use it as an example.
It is possible for a morpheme or paradigm definition to change after
one or more stems that make use of it have been confirmed. Preventing this
from happening would be undesirable, as it would (for example) require
that the default paradigm be frozen at a very early stage of each language.
Instead, any material changes are detected after the event by comparing
the computed paradigm with the word forms recorded when the stem was
confirmed.
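A minimal sketch of this after-the-event check follows. It assumes that word forms can be compared as plain sets of strings; the function name `changed_forms` and the shape of its result are hypothetical.

```python
def changed_forms(recorded, computed):
    """Compare forms recorded at confirmation with the freshly computed paradigm."""
    recorded, computed = set(recorded), set(computed)
    return {
        # Forms that were recorded but are no longer generated.
        "missing": sorted(recorded - computed),
        # Forms that are now generated but were not recorded.
        "unexpected": sorted(computed - recorded),
    }


recorded = ["walk", "walks", "walked", "walking"]
# Suppose a later edit to the paradigm broke the "-ing" rule.
computed = ["walk", "walks", "walked", "walkin"]
result = changed_forms(recorded, computed)
```

If both lists in the result are empty, the change to the morpheme or paradigm was not material for that stem.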
It is sometimes possible for a given stem to have more than one
paradigm (for example the English past tense ‘hung’ versus
‘hanged’). This situation is handled by treating them as
separate stems. To distinguish them, a suffix beginning with a colon
may be added to the stem name.
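One way to picture such names is sketched here. Splitting at the first colon and the suffix labels `1` and `2` are assumptions made for this example; the guide does not specify what the suffix should contain.

```python
def split_stem_name(name):
    """Split a stem name into its base and the optional suffix after the colon."""
    base, _, suffix = name.partition(":")
    return base, suffix


# Two separate stem entries sharing the same base (hypothetical suffixes).
first = split_stem_name("hang:1")
second = split_stem_name("hang:2")
plain = split_stem_name("walk")  # no suffix, so the second element is empty
```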
Morphemes
To further analyse each stem it is broken into a sequence of morphemes.
At present these are handled very simplistically:
- the stem is formed by concatenating the list of morphemes;
- the inflectional paradigm is always determined by the last morpheme.
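The two rules above can be sketched as follows. The function names and the example morphemes are hypothetical, and morphemes are assumed to be plain strings.

```python
def build_stem(morphemes):
    """Form the stem by concatenating the morphemes in order."""
    return "".join(morphemes)


def governing_morpheme(morphemes):
    """Return the morpheme that determines the inflectional paradigm (the last one)."""
    return morphemes[-1]


morphemes = ["un", "kind", "ness"]
stem = build_stem(morphemes)
governing = governing_morpheme(morphemes)
```

In this example the stem is `unkindness`, and it would be inflected according to whichever paradigm is associated with `ness`.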
Paradigms
A paradigm consists of a set of rules which determine how a morpheme
is inflected. As in a language definition file, paradigms may either be
directly associated with a morpheme or may be named separately. To
distinguish them, current practice is to prefix stand-alone paradigm
names with an exclamation mark. Most paradigms have a parent from which
rules are inherited.
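Rule inheritance might be modelled as in the sketch below. The `Paradigm` class, the representation of a rule as a suffix keyed by an inflection label, and the assumption that a child's rule takes precedence over its parent's are all hypothetical; only the exclamation-mark prefix and the existence of a parent come from the description above.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class Paradigm:
    name: str
    rules: Dict[str, str] = field(default_factory=dict)
    parent: Optional["Paradigm"] = None

    def lookup(self, inflection):
        """Find a rule here or, failing that, in the nearest ancestor that has one."""
        paradigm = self
        while paradigm is not None:
            if inflection in paradigm.rules:
                return paradigm.rules[inflection]
            paradigm = paradigm.parent
        return None


# Stand-alone paradigm names are prefixed with an exclamation mark.
noun = Paradigm("!noun", {"singular": "", "plural": "s"})
noun_es = Paradigm("!noun-es", {"plural": "es"}, parent=noun)

plural_suffix = noun_es.lookup("plural")      # defined on the child itself
singular_suffix = noun_es.lookup("singular")  # inherited from the parent
```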
Attributes
Attributes represent grammatical and lexical concepts such as tense,
number, case and mood. Each type of inflection is associated with a list
of attributes, as is each rule. A rule is triggered if its attribute
list is a subset (proper or otherwise) of that associated with the
inflection.
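The triggering condition is an ordinary subset test, which can be sketched as below. The function name `triggered` and the attribute names are hypothetical, and attribute lists are assumed to be comparable as sets.

```python
def triggered(rule_attributes, inflection_attributes):
    """A rule fires if its attributes are a subset (proper or not) of the inflection's."""
    return set(rule_attributes) <= set(inflection_attributes)


inflection = {"past", "participle"}

proper_subset = triggered({"past"}, inflection)              # True
equal_sets = triggered({"past", "participle"}, inflection)   # True
unrelated = triggered({"plural"}, inflection)                # False
no_attributes = triggered(set(), inflection)                 # True
```

One consequence of the subset test is that a rule with an empty attribute list is triggered for every inflection.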
Dialects
Support for multiple dialects is not currently implemented; however,
the need to support them has been recognised.
Licence
The lexicon, and the language definition files derived from it, are
Copyright © Graham Shaw.
Redistribution and modification are permitted within the terms of the
GNU General Public License (version 3 or any later version).
In order to guarantee the provenance and legal status of the lexicon,
current policy is to avoid any substantial import of third-party lexical
data. It is acceptable for third-party data to be used to validate the
lexicon, provided that this is done only to identify errors (as opposed to
providing corrections), and provided that the process does not rely so
heavily on any one source that there is a risk of becoming substantially
similar to that work.
Texts for import into the corpus are not required to be OSD-free;
however, they must be freely redistributable within the UK in the form
presented.