The Need for a Bootstrap Lexicon
Saturday, August 7th, 2010
The method I ultimately intend to use for compiling a lexicon is to task a parser with finding words that must belong to a particular lexical category for the surrounding text to be grammatical. The catch-22 is that this parser can only do its work when the lexical categories of most of the words in a sentence are already known. In other words, to compile a lexicon by this method I need to already have a lexicon.
The parser can be run a number of times if necessary, so it need not catch every word on the first attempt. The only requirement is that each run add enough new words to the lexicon to make subsequent runs more successful. However, my expectation is that this will only happen when the lexicon is above a certain critical size: any smaller and the number of sentences successfully parsed will be too small to sustain growth. It follows that an initial lexicon, generated by other means, will be needed to bootstrap the process.
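The grow-or-stall behaviour described above can be illustrated with a toy simulation. This is only a sketch under strong assumptions, not the parser itself: each sentence is a list of (word, category) pairs, and the stand-in "parser" is assumed to infer a category only when exactly one distinct word in the sentence is missing from the lexicon. The names `run_pass`, `bootstrap` and `CORPUS` are hypothetical and exist purely for illustration.

```python
# Toy corpus: every word is paired with the category a real parser
# would (hypothetically) infer for it from the surrounding text.
CORPUS = [
    [("the", "DET"), ("cat", "N"), ("sleeps", "V")],
    [("the", "DET"), ("cat", "N"), ("purrs", "V")],
    [("a", "DET"), ("dog", "N"), ("purrs", "V")],
]


def run_pass(sentences, lexicon):
    """One run over the corpus.

    Assumption: a word can be learned only when it is the sole
    unknown word in its sentence. Returns the newly learned words.
    """
    learned = {}
    for sentence in sentences:
        # Distinct words in this sentence that the lexicon lacks.
        unknown = {w: c for w, c in sentence if w not in lexicon}
        if len(unknown) == 1:
            learned.update(unknown)
    return learned


def bootstrap(sentences, seed):
    """Repeat runs until a run adds nothing new, then return the lexicon."""
    lexicon = dict(seed)
    while True:
        learned = run_pass(sentences, lexicon)
        if not learned:
            return lexicon
        lexicon.update(learned)


# A seed above the critical size keeps growing across runs...
large_seed = {"the": "DET", "a": "DET", "cat": "N"}
grown = bootstrap(CORPUS, large_seed)

# ...while a seed below it never parses a single sentence.
small_seed = {"the": "DET"}
stalled = bootstrap(CORPUS, small_seed)
```

With the larger seed, the first run learns "sleeps" and "purrs", which in turn makes the third sentence parsable on the second run. With the smaller seed every sentence has at least two unknown words, so the lexicon never grows.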
To generate the bootstrap lexicon I’ve chosen to look for methods with a very low false-positive rate, even if this results in a relatively poor yield. There are two main reasons for this:
- I can remove false positives from the English lexicon by manually reviewing it, but will not have this luxury for languages with which I am less familiar. In the absence of an army of helpers, I need to develop methods that can deliver useful results without supervision.
- The corpus I am working with is sufficiently large that yield is not a major concern during the bootstrap phase.
I don’t intend to search for closed-class words automatically, in English or in any other language: these can be entered manually. I will be looking for automatic methods for isolating nouns, verbs, adjectives and adverbs. Of these, I have so far had most success searching for nouns with regular plurals. My next post will describe the method used and the outcome.