Finding Adjectives and Adverbs

November 7th, 2010

It has taken me several attempts to find an effective method for isolating adverbs, partly due to the many different grammatical contexts in which they can occur, and partly because (unlike nouns and verbs) they are uninflected in English. Inflection makes a difference because it determines how many word forms are needed to create a false positive. For uninflected categories only one form is needed, therefore false positives occur more readily than when a complete paradigm must be seen.

To overcome this problem I decided to look for adverbs and adjectives in pairs as a means to provide the cross-checking that was needed. Derivation of adverbs from adjectives is highly productive in English, and morphologically very regular, so there are few adverbs that are excluded in principle by this methodology.

(Some authors have argued that the relationship between adjectives and adverbs is so productive that it should be modelled as inflection rather than derivation. The traditional objection to this idea is that inflection should not change the lexical category of a word, however that begs the question because it assumes that adverbs and adjectives are separate categories. My difficulty lies with the semantics rather than the syntax: the usual meaning of an adverb derived from an adjective is ‘in an X manner’, but there are too many exceptions to comfortably get away without explicit entries in the lexicon.)

The morphological relationship is sufficient to act as an initial filter for the adverbs. For adjectives I looked for words preceded by a hedge such as ‘slightly’ or ‘very’. The script can be found here and the results were as follows:

threshold  matches  modifiers  mistakes  accuracy  efficiency
1              935        913        22     97.6%
2              594        584        10     98.3%        3.6%
3              463        457         6     98.7%        3.1%
4              386        383         3     99.2%        4.1%
6              302        300         2     99.3%        1.2%

Most of the false positives (17 out of 22) were noun-adjective pairs that had slipped through the hedge filter. A significant fraction of these were due to the use of ‘very’ as an adjective, notably within the phrase ‘the very time’. Others formed part of expressions such as ‘time consuming’ and ‘time critical’.

The total number of pairs found is quite low, even without setting a threshold. This is largely attributable to the hedge filter, for two reasons. Firstly, by its nature it is specific to comparable adjectives and adverbs (or, to be more precise, those that have been used comparably in the corpus, hence the appearance of ‘unique’ in the output). Secondly, even for comparable adjectives, only about one instance in a hundred is accepted. Both of these characteristics are undesirable, but I don’t currently have a better solution.
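
As a rough illustration of the pairing idea, the following Python sketch collects words that follow a hedge, derives a candidate adverb from each, and keeps only those pairs where both forms are attested. It is not the original script: the derivation rules in derive_adverb, the function names, and the way the threshold is applied to both forms are all assumptions made for the example, and only the two hedges named in the text are used.

```python
import re
from collections import Counter

HEDGES = {"slightly", "very"}   # only these two hedges are named in the text
VOWELS = set("aeiou")

def derive_adverb(adjective):
    """Assumed regular derivation rules (not taken from the original script)."""
    if adjective.endswith("ic"):
        return adjective + "ally"                      # basic -> basically
    if len(adjective) > 2 and adjective.endswith("le") and adjective[-3] not in VOWELS:
        return adjective[:-1] + "y"                    # simple -> simply
    if len(adjective) > 1 and adjective.endswith("y") and adjective[-2] not in VOWELS:
        return adjective[:-1] + "ily"                  # happy -> happily
    return adjective + "ly"                            # quick -> quickly

def find_pairs(text, threshold=2):
    """Return (adjective, adverb) pairs where both forms are attested."""
    words = re.findall(r"[a-z]+", text.lower())
    word_counts = Counter(words)
    # Words immediately preceded by a hedge are candidate adjectives.
    hedged = Counter(b for a, b in zip(words, words[1:]) if a in HEDGES)
    pairs = []
    for adjective, count in hedged.items():
        adverb = derive_adverb(adjective)
        # Assumption: both forms must be seen at least `threshold` times.
        if count >= threshold and word_counts[adverb] >= threshold:
            pairs.append((adjective, adverb))
    return sorted(pairs)

sample = ("It was very quick. It was slightly quick. He ran quickly and "
          "she ran quickly. The very time was slightly odd.")
print(find_pairs(sample))
```

In the sample, ‘time’ passes the hedge filter because of ‘the very time’, but it is rejected because no matching adverb is found, which is the cross-checking effect described above.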

In the interests of consistency I will be trying to stick to the policy of using a threshold of two. The resulting word list can be found here.

Finding regular verbs in English

October 19th, 2010

My method for finding regular verbs is the same as for irregular ones, except that there is no upper limit for occurrences of the past participle, and the results were screened to exclude irregular verbs. Verbs were excluded if any of the four regular forms matched a corresponding irregular form. The results were manually checked to see how the accuracy varied with the threshold:

threshold  matches  verbs  mistakes  accuracy  efficiency
1             3761   3629       132     96.5%
2             2900   2851        49     98.3%       10.7%
3             2484   2457        27     98.9%        5.6%
4             2199   2182        17     99.2%        3.6%
6             1857   1844        13     99.3%        1.2%
8             1639   1627        12     99.3%        0.5%
12            1350   1339        11     99.2%        0.3%
16            1159   1150         9     99.2%        1.1%

The trend here is similar to that seen for nouns, and I intend to follow the same policy: use the unedited list resulting from a threshold of two (corrected from one on 2010-11-07) if feasible, but be prepared to revisit this decision if necessary. The verb list can be found here.

Finding irregular verbs in English

October 17th, 2010

This post was originally intended to be about finding regular verbs, but my initial attempts were finding too many irregular verbs that had been mis-conjugated. After a couple of attempts to filter them out I came to the conclusion that it would be better to actively search for them, review the list manually, then use the result as a blacklist when searching for regular verbs.

I said previously that I wanted to use fully automatic processes where possible, but that was in the context of large, open word classes. English irregular verbs are not an open class, and they are sufficiently few in number that manual checking would be feasible even if English were not my native language. Better therefore to take a more pragmatic, ad hoc approach in this instance.

The method used in both the regular and irregular cases was to search for potential infinitives, finite third-person singular forms, gerund participles and past participles:

  • Words preceded by ‘will’, ‘do’, ‘does’ or ‘did’ (with an optional ‘not’) were considered to be potential infinitives.
  • All words ending in ‘s’ were considered to be potential third-person singular forms.
  • Words ending in ‘ing’ and preceded by ‘am’, ‘are’, ‘is’, ‘was’ or ‘were’ (with an optional ‘not’) or ‘be’ or ‘been’ (with no ‘not’) were considered to be potential gerund participles.
  • Words ending in ‘ed’ and preceded by ‘was’, ‘were’, ‘have’, ‘has’ or ‘had’ (with an optional ‘not’) or ‘be’ or ‘been’ (with no ‘not’) were considered to be potential past participles.
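
The four context rules above can be sketched with regular expressions. This Python version is only an approximation written for illustration (the original script is not reproduced here); the pattern names, the lower-casing of the text, and the decision never to treat ‘not’ itself as a candidate infinitive are assumptions.

```python
import re
from collections import Counter

NOT = r"(?:\s+not)?"   # the optional 'not' allowed after most auxiliaries

PATTERNS = {
    # 'will', 'do', 'does' or 'did', optional 'not', then the candidate word.
    "infinitive": re.compile(
        rf"\b(?:will|do|does|did){NOT}\s+(?!not\b)([a-z]+)\b"),
    # Every word ending in 's', regardless of context.
    "third_singular": re.compile(r"\b([a-z]+s)\b"),
    # '...ing' after a form of 'be'; no 'not' is allowed after 'be'/'been'.
    "gerund_participle": re.compile(
        rf"\b(?:(?:am|are|is|was|were){NOT}|be|been)\s+([a-z]+ing)\b"),
    # '...ed' after 'was', 'were', 'have', 'has', 'had', 'be' or 'been'.
    "past_participle": re.compile(
        rf"\b(?:(?:was|were|have|has|had){NOT}|be|been)\s+([a-z]+ed)\b"),
}

def candidate_forms(text):
    """Count the candidate words found by each context rule."""
    text = text.lower()
    return {name: Counter(pattern.findall(text))
            for name, pattern in PATTERNS.items()}

forms = candidate_forms(
    "She did not walk. He walks. They were walking and had not walked. "
    "It has been walked.")
for name, counts in forms.items():
    print(name, dict(counts))
```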

The next step was to regularly conjugate the infinitives as follows:

  • The third person singular was formed by the same method used previously for plural nouns.
  • The gerund participle was formed by adding ‘ing’ after removing any final ‘e’.
  • The past participle was formed by adding ‘ed’ after removing any final ‘e’ and changing final ‘y’ into ‘i’.
  • Where the infinitive ended in a consonant (not counting ‘y’) an additional paradigm was formed in which the consonant was doubled.
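
A minimal Python sketch of these conjugation rules follows. The function names are invented for the example, and one detail is an assumption: the text does not say whether the change of ‘y’ to ‘i’ is restricted to a ‘y’ that follows a consonant, so the sketch assumes it is, by analogy with the plural rule. The rules are deliberately naive, as described, so forms such as ‘seing’ for ‘see’ are generated too.

```python
VOWELS = set("aeiou")

def third_singular(verb):
    # Same rules as those given for regular plural nouns.
    if len(verb) > 1 and verb.endswith("y") and verb[-2] not in VOWELS:
        return verb[:-1] + "ies"
    if verb.endswith(("s", "x", "sh", "ch")):
        return verb + "es"
    return verb + "s"

def gerund_participle(verb):
    stem = verb[:-1] if verb.endswith("e") else verb
    return stem + "ing"

def past_participle(verb):
    stem = verb[:-1] if verb.endswith("e") else verb
    # Assumption: 'y' becomes 'i' only after a consonant (carry -> carried).
    if len(stem) > 1 and stem.endswith("y") and stem[-2] not in VOWELS:
        stem = stem[:-1] + "i"
    return stem + "ed"

def paradigms(infinitive):
    """Return one or two (infinitive, 3sg, gerund, past participle) tuples.

    Assumes a non-empty, lower-case infinitive.
    """
    result = [(infinitive, third_singular(infinitive),
               gerund_participle(infinitive), past_participle(infinitive))]
    last = infinitive[-1]
    if last.isalpha() and last not in VOWELS and last != "y":
        # Additional paradigm with the final consonant doubled.
        result.append((infinitive, third_singular(infinitive),
                       infinitive + last + "ing", infinitive + last + "ed"))
    return result

for verb in ("stop", "carry", "like"):
    print(paradigms(verb))
```

Generating both the plain and the doubled paradigm avoids having to decide when doubling applies: whichever paradigm is attested in the corpus is the one that counts.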

Regular verbs would be found by looking for instances in which all four forms (infinitive, third-person singular, gerund participle and past participle) are represented in the corpus. In this case it is irregular verbs that are wanted, therefore the search should instead be for instances where the generated past participle is absent.
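
The distinction can be expressed as a small classification step. In this hedged Python sketch the function name, the return labels and the sample sets are all hypothetical; the candidate sets stand for the words found by the four context rules, and no occurrence threshold or ratio is modelled.

```python
def classify(paradigms, infinitives, third_singulars, gerunds, past_participles):
    """Classify a verb from its generated paradigms and the attested forms.

    paradigms is a list of (infinitive, 3sg, gerund, past participle) tuples;
    the other arguments are sets of candidate words found in the corpus.
    """
    three_forms_seen = False
    for infinitive, third, gerund, past in paradigms:
        if (infinitive in infinitives and third in third_singulars
                and gerund in gerunds):
            if past in past_participles:
                return "regular"            # all four forms are attested
            three_forms_seen = True         # generated past participle absent
    return "irregular candidate" if three_forms_seen else None

# Hypothetical attested forms, for illustration only.
infinitives = {"keep", "walk"}
third_singulars = {"keeps", "walks"}
gerunds = {"keeping", "walking"}
past_participles = {"walked"}

keep = [("keep", "keeps", "keeping", "keeped"),
        ("keep", "keeps", "keepping", "keepped")]
walk = [("walk", "walks", "walking", "walked"),
        ("walk", "walks", "walkking", "walkked")]
print(classify(keep, infinitives, third_singulars, gerunds, past_participles))
print(classify(walk, infinitives, third_singulars, gerunds, past_participles))
```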

I’m making some assumptions here: that if a verb is irregular then its past participle will be irregular, but its third-person singular and gerund participle will be regular. This is true in most cases, but with some notable exceptions that will need to be dealt with separately. Ones that I’m aware of include ‘to be’, ‘to have’ (which was found despite this property), the defective auxiliary verbs, and verbs ending in ‘o’ that take ‘es’ in the third-person singular.

It should also be noted that this technique is designed to find verbs that are rarely or never conjugated in a regular manner. Thus ‘keep’ is detected because ‘keeped’ is wrong: it occurs only six times in the corpus, and not at all in a context that is detected by the script. The verb ‘hang’ is not detected because ‘hanged’ is correct in a judicial context and occurs many thousands of times in the corpus.

The script can be found here. With a threshold of 10 and a ratio of 5:1 it yielded 162 matches, containing 132 distinct irregular verbs. To assess its suitability I looked at how well it counteracted the false positives found when searching for regular verbs, and what coverage it gave of irregular verbs generally.

My initial attempt at searching for regular verbs gave 199 false positives. Of these 81 out of 199 were irregular verbs, and 69 out of 81 were on the list of 132. In this respect it is 85% effective.

With the notable exception of ‘to be’, coverage of the most common irregular verbs appears to be quite good. For example, out of 145 listed at www.englishirregularverbs.com it covers 121. However the list of 370 at www.englishpage.com shows that there are many more to be found.

No attempt was made to match up preterites or past participles automatically: these were added manually. Further verbs will be added as and when they are identified. The list can be found here; the initial version is at revision 36.

Finding English Countable Nouns

October 2nd, 2010

I said previously that I was having some success in isolating nouns from the Wikipedia corpus by means other than fully parsing the text. The method I am using is as follows:

  1. Search for words preceded by ‘a’, ‘an’, ‘this’ or ‘that’, and followed by a full stop or a comma. These are provisionally assumed to be singular nouns.
  2. Search for words preceded by ‘some’, ‘these’, ‘those’, or a numeral other than ‘one’, and followed by a full stop or comma. These are provisionally assumed to be plural nouns.
  3. Search for singular-plural pairs amongst the words found above that are related by one of the regular pluralisation rules.

By ‘regular pluralisation rules’ I mean:

  1. replacing ‘y’ with ‘ies’ if it is preceded by a consonant; otherwise
  2. appending ‘es’ if the word ends in ‘s’, ‘x’, ‘sh’ or ‘ch’; otherwise
  3. appending ‘s’.
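
The search and the pluralisation rules can be sketched together in Python. This is an illustration rather than the processing script itself: the tokenisation, the function names and the treatment of numerals (digit strings other than 1, plus a short list of number words) are assumptions.

```python
import re
from collections import Counter

VOWELS = set("aeiou")
SINGULAR_CUES = {"a", "an", "this", "that"}
PLURAL_CUES = {"some", "these", "those"}
# Assumption: a short list of number words stands in for 'a numeral'.
NUMBER_WORDS = {"two", "three", "four", "five", "six",
                "seven", "eight", "nine", "ten"}

def pluralise(noun):
    """Apply the three regular pluralisation rules in order."""
    if len(noun) > 1 and noun.endswith("y") and noun[-2] not in VOWELS:
        return noun[:-1] + "ies"
    if noun.endswith(("s", "x", "sh", "ch")):
        return noun + "es"
    return noun + "s"

def is_plural_cue(token):
    if token in PLURAL_CUES or token in NUMBER_WORDS:
        return True
    return token.isdigit() and int(token) != 1

def find_noun_pairs(text, threshold=1):
    """Return (singular, plural) pairs attested in both contexts."""
    tokens = re.findall(r"[a-z0-9]+|[.,]", text.lower())
    singular, plural = Counter(), Counter()
    # Look at each (cue, word, punctuation) window of three tokens.
    for cue, word, punct in zip(tokens, tokens[1:], tokens[2:]):
        if punct not in ".," or not word.isalpha():
            continue
        if cue in SINGULAR_CUES:
            singular[word] += 1
        elif is_plural_cue(cue):
            plural[word] += 1
    return sorted((s, pluralise(s)) for s in singular
                  if singular[s] >= threshold
                  and plural[pluralise(s)] >= threshold)

sample = ("I saw a dog. He fed these dogs, then bought a box. She owns two "
          "boxes. This city, unlike those cities, has a quay. There were 3 quays.")
print(find_noun_pairs(sample))
```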

Individually these rules are far from perfect, but together they give an impressively low false-positive rate. The false negative rate is much higher, but that is acceptable because there is no requirement for the bootstrap lexicon to be complete (otherwise there would be no need for the ‘bootstrap’ qualification). By assuming a set of pluralisation rules I am arguably begging the question, but if the rules are wrong in any way then this will come out in the wash.

A copy of the processing script can be found here. I checked the results manually against the following criteria:

  • Archaic spellings (such as ‘churche’) were rejected, but words describing archaic concepts (such as ‘arquebus’) were kept.
  • Foreign borrowings were allowed, even if they could not be found in any dictionary, if they described cultural concepts that would lose meaning if translated to a native English word.
  • Abbreviations (such as ‘bio’) were accepted if they could be plausibly classified as words, but mere initialisms (such as ‘pdf’) were not.
  • Where a word has more than one accepted spelling (either within or between major dialects), any of those variants was accepted. However, spelling mistakes were rejected whatever their frequency of usage.
  • Fictional words specific to the works of a particular author were rejected, but those for which there was significant evidence of generic usage (such as ‘ringworld’ or ‘holodeck’) were accepted.
  • Words used as nouns in order to refer to the word (‘ifs and buts’, ‘whys and wherefores’) were rejected on the grounds that in principle you could do that to any word. Listing every word in the language as a potential noun would destroy the value of the list, and I am reluctant to draw any distinction based on frequency of usage if the mechanism is as productive as it appears to be.

I do not claim to have done this absolutely consistently or objectively, so would caution against placing too much faith in the absolute values reported below, but in relative terms they should be comparable because I used the same reject list throughout.

Some of the noun pairs that I accepted merit particular comment:

  • ‘deeps’ I had been poised to reject, because while I was comfortable with phrases such as ‘the deeps of the ocean’ I was not aware of any circumstances where you could actually count them. However it would appear that the deity Ēl dwells near the ‘spring of the two deeps’, which appears countable if a little poetic. More persuasive was the abstract of a paper entitled ‘Sediment Deposition in Offshore Deeps of the Western North Sea: Questions for Models’, which concerns “offshore bathymetric deeps in the North Sea” and in which “two deeps are considered”.
  • ‘axises’ was facing rejection too. All of the sixteen instances in the corpus were wrong, because they referred to the geometric concept which should be pluralised as ‘axes’. However I have since discovered that the species of deer known as an ‘axis’ takes the regular plural instead, so this result was in fact valid even though it had been found for the wrong reasons.
  • ‘wug’ is apparently the name of an imaginary creature, invented for the purpose of testing the ability of children to form plurals of words they had not previously been exposed to. The experiment is called the ‘Wug Test’, and the fact that this word was detected presumably means that my computer passed.

Out of a total of 10925 matching singular-plural pairs I found 191 false positives, corresponding to an accuracy of 98.25%. The most common types of false positive were:

  • Words that were being used as nouns, but which fail to meet the criteria listed above for some reason. For the most part there is little that can be done to exclude them automatically, however I was able to remove some of the obviously unpronounceable abbreviations by excluding words with no vowels (taking ‘y’ to be a vowel).
  • Valid nouns which appear superficially to be a singular/plural pair but are not. For example, ‘ironworks’ is not the plural of ‘ironwork’ and ‘corpses’ is certainly not the plural of ‘corps’. Some of these may be detectable in principle, but I have not found any simple method for doing so.
  • A small number of verbs were misclassified as nouns. For example, ‘develop’ slipped through as a result of its use in the relative clause “that develop” and the phrase “a relationship between the two develops”. This is a weakness of the original filtering algorithm which can be addressed by removing ‘that’ from the list of singular indicators.

Excluding words with no vowels reduced the number of false positives by 8 with no new false negatives. Excluding ‘that’ as a singular indicator removed a further 17 false positives at the cost of 31 false negatives. Having done this the list was reduced to 10869 matches composed of 10703 valid nouns and 166 false positives. The accuracy rose to 98.47%, a small but worthwhile improvement on the previous rate.

The other modification I considered was to increase the number of occurrences that must be seen before a noun is recognised. I felt there was a good case for requiring at least two occurrences, for the following reason. The number of possible spelling errors is vastly greater than the number of valid words. Most of the mistakes are individually very rare, but in a large corpus they are collectively very numerous. If every observed word is let through this means that the quality of the result tends towards zero as the corpus grows. Setting an appropriate threshold counterbalances this tendency and prevents the result from becoming dominated by noise.

I ran a number of trials with different thresholds to determine which was best. The results were as follows:

threshold  matches  nouns  mistakes  accuracy  efficiency
1            10869  10703       166    98.47%
2             7341   7302        39    99.47%       3.73%
3             6002   5982        20    99.67%       1.42%
4             5195   5182        13    99.75%       0.87%
6             4246   4243         3    99.93%       1.05%
8             3657   3654         3    99.92%       0
12            2972   2969         3    99.90%       0
16            2574   2571         3    99.88%       0

The column labelled ‘efficiency’ is the number of false positives removed divided by the number of false negatives created, as compared to the previous row. For comparison, the removal of ‘that’ as a singular indicator had an efficiency of 54.8%. Raising the threshold is clearly much less efficient and it suffers from diminishing returns, but considering the difficulty of removing errors manually (even in my native language) I am inclined to consider it worthwhile.
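
To make the two derived columns concrete, here is a short Python sketch that recomputes accuracy and efficiency from the matches and mistakes columns of the noun table. The function names are invented for the example, and recomputed values may differ from the table in the last digit.

```python
# (threshold, matches, mistakes) taken from the noun table.
ROWS = [(1, 10869, 166), (2, 7341, 39), (3, 6002, 20), (4, 5195, 13),
        (6, 4246, 3), (8, 3657, 3), (12, 2972, 3), (16, 2574, 3)]

def accuracy(matches, mistakes):
    """Fraction of matches that are valid."""
    return (matches - mistakes) / matches

def efficiency(previous, current):
    """False positives removed divided by false negatives created.

    Each argument is a (matches, mistakes) pair; the valid entries lost
    between the two rows are the false negatives created.
    """
    prev_matches, prev_mistakes = previous
    cur_matches, cur_mistakes = current
    removed = prev_mistakes - cur_mistakes
    created = (prev_matches - prev_mistakes) - (cur_matches - cur_mistakes)
    return removed / created

previous = None
for threshold, matches, mistakes in ROWS:
    line = f"{threshold:>2}  accuracy {accuracy(matches, mistakes):.2%}"
    if previous is not None:
        line += f"  efficiency {efficiency(previous, (matches, mistakes)):.2%}"
    print(line)
    previous = (matches, mistakes)

# Removing 'that' as a singular indicator: 17 false positives removed
# at the cost of 31 false negatives.
print(f"{17 / 31:.1%}")
```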

The final three false positives that were not eliminated in the trials were ‘corps’ / ‘corpses’, ‘live’ / ‘lives’ and ‘new’ / ‘news’. Although they could certainly be removed by raising the threshold further, the level needed (230) would be absurdly high. I have not found any other method for removing them automatically, which leaves manual checking as the only option if a completely clean list of nouns is wanted.

In this case I’m not convinced that a completely clean list is essential, and I would much prefer to use unsupervised methods if possible. This may sound wasteful having already done the checking, but makes sense when the exercise is viewed as a dry run for other languages. I’ve therefore decided to try using the list that is produced using a threshold of two with no checking. The result can be found here. If this doesn’t work well then I will consider other options.

I could probably use the method above to search for groups of irregular nouns too. By this I don’t mean nouns that are one of a kind, but it would be possible (for example) to search for ones that follow a particular Latin or Greek declension. However I see little need to do this during the bootstrap phase, and will instead be looking at irregular nouns after parsing. Uncountable nouns are a more significant issue and a simple method for identifying them would be desirable.

Parsing Wikitext

September 22nd, 2010

Having extracted the wikitext for every page in the English Wikipedia I now need to strip away the markup and reduce the content to plain text. This task is made more difficult by the lack of any formal specification for the syntax or behaviour of the markup, and those descriptions which do exist omit some important details. However I believe my understanding is now good enough to give acceptable results, and there are indications that this will ultimately prove to be a more robust and effective method of text harvesting than the HTML page-scraping I was using previously.

An introduction to the markup syntax can be found here. It covers most of the markup used in Wikipedia, with the notable exception of the ref and references tags provided by the Cite module. However it says nothing about how the different types of markup interact with each other if they overlap, or what should happen if the markup is malformed in some way. I therefore ran some tests to determine how MediaWiki behaves under those circumstances.

(An alternative approach would have been to use the existing parser code from MediaWiki as a starting point. This has some potential to give better results, however there are two issues which discouraged me from taking this path. One was the (understandable) degree to which MediaWiki is geared towards producing HTML. I wasn’t confident that the earlier stages could be adapted to the production of plaintext without affecting the behaviour of later stages. The other was that Wikipedia uses a post-processor to clean up its output, but this can only happen if the output consists of HTML. Both problems could be solved by producing HTML then reducing it to plaintext, but then there would be little point in me writing my own parser.)

The most notable finding concerned the precedence of HTML-style tags. This is normally very low, but for tags provided by extension modules (such as ref or math) the precedence is extremely high — higher even than template expansion. This suggests that the extension modules are implemented as preprocessing stages and are not closely integrated into the main parsing algorithm.

Using the information gained from the tests I wrote a parser with four phases:

  • Handle extension tags (for example, the refs and math).
  • Expand (or rather at present, ignore) templates and remove comments.
  • Handle layout-dependent markup (for example, lists indicated by asterisks and/or hash signs).
  • Handle layout-independent markup (for example, links indicated by square brackets).
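
The ordering of the four phases can be illustrated with a much-simplified Python sketch. It is not the parser described here (which is written in Perl): the function names and regular expressions are assumptions, only a few kinds of markup are covered, and headings and multi-part links such as images are left untouched.

```python
import re

def handle_extension_tags(text):
    # Phase 1: drop <ref> and <math>, self-closing forms first.
    text = re.sub(r"<(?:ref|math)\b[^>]*/>", "", text, flags=re.I)
    return re.sub(r"<(ref|math)\b[^>]*>.*?</\1\s*>", "", text,
                  flags=re.I | re.S)

def remove_templates_and_comments(text):
    # Phase 2: remove comments, then templates from the innermost outwards.
    text = re.sub(r"<!--.*?-->", "", text, flags=re.S)
    innermost = re.compile(r"\{\{[^{}]*\}\}")
    while True:
        text, count = innermost.subn("", text)
        if count == 0:
            return text

def split_paragraphs(text):
    # Phase 3: discard lists, indented or preformatted lines and tables;
    # blank lines and discarded lines both end the current paragraph.
    paragraphs, current, in_table = [], [], False
    def flush():
        if current:
            paragraphs.append(" ".join(current))
            current.clear()
    for line in text.split("\n"):
        if line.startswith("{|"):
            in_table = True
        if in_table or line[:1] in ("*", "#", ":", ";", " ") or not line.strip():
            flush()
        else:
            current.append(line)
        if line.startswith("|}"):
            in_table = False
    flush()
    return paragraphs

def handle_inline_markup(text):
    # Phase 4: unwrap links, strip bold/italic quotes, tidy whitespace.
    text = re.sub(r"\[\[(?:[^\[\]|]*\|)?([^\[\]|]*)\]\]", r"\1", text)
    text = re.sub(r"\[https?://[^\s\]]+\s+([^\]]*)\]", r"\1", text)
    text = re.sub(r"\[https?://[^\s\]]+\]", "", text)
    text = re.sub(r"'{2,}", "", text)
    return re.sub(r"\s+", " ", text).strip()

def wikitext_to_plaintext(wikitext):
    text = handle_extension_tags(wikitext)
    text = remove_templates_and_comments(text)
    lines = [handle_inline_markup(p) for p in split_paragraphs(text)]
    return "\n".join(line for line in lines if line)   # one paragraph per line

sample = (
    "'''Paris''' is the capital of [[France]].<ref>{{cite web|url=x}}</ref>\n"
    "It lies on the [[Seine|river Seine]]. <!-- check -->{{citation needed}}\n"
    "\n"
    "* a list item\n"
    "Second paragraph with <math>x^2</math> removed.\n"
)
print(wikitext_to_plaintext(sample))
```

Handling the extension tags first means that a template inside a reference disappears along with the reference, which matches the high precedence of extension tags noted above.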

Some of the content is simply discarded, including the content of lists and tables, and any text that is indented or preformatted. These are regions that I expect to have a below-average proportion of usable text, and since text is not in short supply, discarding them is the most expedient behaviour.

Paragraph breaks are identified and preserved (represented as linefeeds), but no attempt is made to break the text into sentences. The reason for this is that paragraph breaks can be derived from the markup whereas sentence breaks cannot. (The latter process is non-trivial and requires a detailed knowledge of the language in question.) Line breaks are not preserved because they are of no significance once the wikitext has been parsed out.

The parser can be downloaded from here. The main known limitation is the lack of support for templates: these are recognised but discarded. There is at least one memory leak (possibly several) which I have not been able to track down, but there is reason to suspect that this is an issue with Perl (and specifically named capture buffers) as opposed to the code presented here. It can be worked around by processing articles in batches of at most a few thousand.

The parsed text occupies about 17GB spread across roughly 3.4 million files. According to wc it contains a little over one billion words. It appears to be of good enough quality for constructing the bootstrap lexicon, but further work may be needed before the text is suitable for more detailed analysis.

Extracting Wikitext

September 20th, 2010

Loading articles into a local instance of MediaWiki allowed me to begin harvesting text quickly, but has proved to have shortcomings. Without all templates loaded it produces significant amounts of spurious text, but with all templates it runs very slowly. I therefore decided to revisit the option of parsing the wikitext directly into plaintext. This is not trivial, for the reasons I described previously, but it is a significantly more attractive method now that I know the limitations of the alternative.

Extracting the wikitext from the XML file is straightforward. Each article corresponds to a page element, which contains title and text elements containing the article name and content respectively. I’ve written a script extract-mediawiki-pages.pl to search for these elements and write the (wikitext) content of each page to a file.
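
For illustration, the same extraction can be sketched in Python with the standard library's streaming XML parser. This is not extract-mediawiki-pages.pl; the function names are invented, and the namespace URI in the sample is an assumption about the dump format (the code ignores namespaces in any case).

```python
import io
import xml.etree.ElementTree as ET

def local_name(tag):
    """Strip any '{namespace}' prefix from an element tag."""
    return tag.rsplit("}", 1)[-1]

def iter_pages(source):
    """Yield (title, wikitext) for each page element in a dump."""
    title = text = None
    for _event, elem in ET.iterparse(source, events=("end",)):
        name = local_name(elem.tag)
        if name == "title":
            title = elem.text
        elif name == "text":
            text = elem.text or ""
        elif name == "page":
            yield title, text
            title = text = None
            elem.clear()   # release the children of the finished page

# A tiny stand-in for the dump file; the namespace URI is an assumption.
sample = b"""<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.4/">
<page><title>Apple</title><revision><text>An '''apple''' is a fruit.</text></revision></page>
<page><title>Pear</title><redirect /><revision><text>#REDIRECT [[Apple]]</text></revision></page>
</mediawiki>"""

pages = list(iter_pages(io.BytesIO(sample)))
for page_title, wikitext in pages:
    print(page_title, "->", wikitext)
```

Streaming with iterparse matters at this scale because the whole dump never needs to be held in memory at once.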

The script does not filter out redirects. The reason for this is that if I want to process templates (a decision I’ve not made yet) then I need the ability to fetch templates by name — including templates that have been redirected.

Storage of the resulting files is in a hierarchy intended to limit the number of files per directory. This is less essential on Linux than it once was: during testing I was able to create ten million files in one directory on an ext3 volume without any intolerable loss of performance. However, that doesn’t mean that every component of a GNU/Linux system (or even the kernel) can cope gracefully with large directories, and there are enough that can’t (such as NFS) to make them an inconvenience.

The titles need to be encoded before being used as filenames because some of them contain forbidden characters. In theory the only characters that need this treatment are null and forward slash. In practice there are many others that cause problems: hyphens at the start of filenames are perhaps the worst (because they can cause the filename to be mistaken for a command-line argument), but even spaces are a significant nuisance.

Ideally I would have identified the problematic characters and escaped them, leaving the remainder as they were. However in the interests of simplicity I instead decided to hash the filenames using MD5 and convert the result to ASCII hex. This makes the filenames opaque, but guarantees that there are no problem characters and has the added benefit of limiting the filename length.
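
A minimal Python sketch of the naming scheme follows. The MD5-to-hex step is as described above; the directory layout (two levels taken from the leading hex digits), the root directory name and the function name are assumptions, since the text does not specify how the hierarchy is built.

```python
import hashlib
import os

def page_path(title, root="pages", levels=2, width=2):
    """Map a page title to an opaque, filesystem-safe path.

    The file name is the MD5 digest of the UTF-8 title in hex. The
    directory layout is an assumed one: `levels` directories, each named
    after `width` leading hex digits of the digest.
    """
    digest = hashlib.md5(title.encode("utf-8")).hexdigest()
    parts = [digest[i * width:(i + 1) * width] for i in range(levels)]
    return os.path.join(root, *parts, digest)

for title in ("-rf", "AC/DC", "Main Page"):
    print(title, "->", page_path(title))
```

With two hex digits per level, each directory has at most 256 subdirectories, and ten million files spread over 65,536 leaf directories comes to roughly 150 files each.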

The resulting wikitext occupied a total of about 57GB spread across approximately ten million files. However the large amount of wiki markup that the text still contains makes it unsuitable for natural language processing in this form. The next step is therefore to parse and remove that markup.

The Need for a Bootstrap Lexicon

August 7th, 2010

The method I ultimately intend to use for compiling a lexicon is by tasking a parser to search for words that need to be of a particular lexical category for the surrounding text to be grammatical. The catch-22 is that this parser will only be able to do its work when the lexical categories for most of the words in a sentence are already known. In other words, in order to compile a lexicon by this method I need to already have a lexicon.

The parser can be run a number of times if necessary, so it need not catch every word at the first attempt. The only requirement is that the parser add enough new words to the lexicon each time it is run to make subsequent runs more successful. However, my expectation is that this will only happen when the lexicon is above a certain critical size: any smaller and the number of sentences successfully parsed will be too small to sustain growth. It follows that an initial lexicon, generated by other means, will be needed to bootstrap the process.

To generate the bootstrap lexicon I’ve chosen to look for methods with a very low false-positive rate, even if this results in a relatively poor yield. There are two main reasons for this:

  • I can remove false positives from the English lexicon by manually reviewing it, but will not have this luxury for languages with which I am less familiar. In the absence of an army of helpers, I need to develop methods that can deliver useful results without supervision.
  • The corpus I am working with is sufficiently large that yield is not a major concern during the bootstrap phase.

I don’t intend to search for closed-class words automatically, in English or in any other language: these can be entered manually. I will be looking for automatic methods for isolating nouns, verbs, adjectives and adverbs. Of these, I have so far had most success searching for nouns with regular plurals. My next post will describe the method used and the outcome.

Harvesting text from Wikipedia

August 2nd, 2010

Having imported about a quarter of Wikipedia onto a local server, I now need a way to harvest the text from those articles in a form that is suitable for analysis. There are three stages to this: fetching the HTML, parsing the HTML, then isolating the useful parts of the text.

I did briefly consider using the MediaWiki API to obtain the text as XML rather than HTML. This had some attractions, in that the markup was more orientated to semantics rather than formatting, but XML is less forgiving than HTML and there were enough irregularities to make parsing difficult: in some cases the data was not even well-formed. HTML parsers are usually quite good at handling marginal input, so this appeared to be the more promising approach.

In order to fetch pages using HTTP I need the page titles, preferably in the same order as they appear in the XML dump file. There are two reasons why this order is helpful: so that I can start work without waiting for the whole of Wikipedia to import, and (in the interests of repeatability) to make it easy to describe which subset of articles I used on a particular occasion. I’ve written a script to do this called extract-wikipedia-titles. After extracting the titles it filters out redirects, and also pages in the namespaces Template, File, Category, Wikipedia or MediaWiki (as these are less likely to contain text suitable for analysis).

The Perl script that performs the harvesting is called harvest-wikipedia-text. It uses LWP::UserAgent to fetch the pages, and HTML::TreeBuilder to parse them. This is not the most efficient approach, as there is no real need to build an explicit parse tree, however it was an easy way to avoid having to worry about implied or missing tags.

Once the tree has been built, the script performs a depth-first traversal. Harvesting takes place only within a <div> element with an id of bodyContent, this being where the body of the article resides. Some types of element are not entered during the traversal, including <script>, <li> and <td>, on the grounds that they either won’t contain or are less likely to contain usable text. (I don’t doubt that many lists and tables do contain well-formed sentences, but with tens of gigabytes to work with I can afford to be picky.)

I didn’t want to break the text into sentences at this stage, because while this can be done up to a point using simple heuristics, better results can be obtained by a parser that understands the grammar of the language. However there are formatting features in the text that are very likely to be sentence boundaries, and I didn’t want to lose this information. I’ve therefore chosen to produce output consisting of one line per paragraph. Each line may contain several sentences, but the assumption is that sentences don’t span multiple lines.

Text in the HTML parse tree is liable to be split into fragments because of intervening structures such as elements and comments. Whenever my script sees a text fragment, it appends it to the end of the current paragraph. There is a function break_paragraph, called whenever a <p> element begins, which outputs then clears the current paragraph.
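
The traversal and paragraph handling can be approximated in Python using the standard library's event-based HTML parser. This is a sketch, not harvest-wikipedia-text: it does not build a tree, so unlike HTML::TreeBuilder it assumes that skipped elements are explicitly closed, and the class and function names (apart from break_paragraph) are invented.

```python
from html.parser import HTMLParser

SKIPPED = {"script", "li", "td"}   # elements that are not entered

class ParagraphHarvester(HTMLParser):
    def __init__(self):
        super().__init__()
        self.body_depth = 0   # <div> nesting depth inside bodyContent
        self.skip_depth = 0   # nesting depth inside skipped elements
        self.current = []     # text fragments of the current paragraph
        self.paragraphs = []

    def break_paragraph(self):
        """Output the current paragraph (if any), then clear it."""
        text = " ".join("".join(self.current).split())
        if text:
            self.paragraphs.append(text)
        self.current = []

    def handle_starttag(self, tag, attrs):
        if tag == "div":
            if self.body_depth:
                self.body_depth += 1
            elif dict(attrs).get("id") == "bodyContent":
                self.body_depth = 1
        if not self.body_depth:
            return
        if tag in SKIPPED:
            self.skip_depth += 1
        elif tag == "p":
            self.break_paragraph()

    def handle_endtag(self, tag):
        if not self.body_depth:
            return
        if tag in SKIPPED and self.skip_depth:
            self.skip_depth -= 1
        elif tag == "div":
            self.body_depth -= 1
            if self.body_depth == 0:
                self.break_paragraph()   # flush at the end of the body

    def handle_data(self, data):
        # Fragments are appended to the end of the current paragraph.
        if self.body_depth and not self.skip_depth:
            self.current.append(data)

def harvest(html):
    parser = ParagraphHarvester()
    parser.feed(html)
    parser.close()
    return parser.paragraphs

sample = ('<html><body><div id="header">Menu</div>'
          '<div id="bodyContent"><p>First <b>bold</b> sentence.'
          '<!-- note --> More text.</p>'
          '<ul><li>List item</li></ul>'
          '<p>Second paragraph.</p>'
          '<div class="x"><script>var a;</script></div></div>'
          '<div id="footer">Footer</div></body></html>')
print("\n".join(harvest(sample)))
```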

Wikipedia makes heavy use of templates, often nested several levels deep, and if any of these are missing then the text does not render correctly. I had therefore thought it would be best to import all templates before harvesting. To expedite this I wrote a script to extract the required subset of pages from the XML dump file, allowing them to be imported separately from the bulk of the text. The result was disappointing: page rendering times were greatly increased, and much of the extra material generated was not helpful. I therefore intend to work without a full set of templates in the first instance, but this is an area that would benefit from further investigation.

In order to compensate for the lack of proper template processing, my script recognises and ignores the resulting broken link elements that appear in the HTML. It also stops processing a page when it sees a section called “References”, “See also” or “Links” (as there is usually no more usable material after that point).

I’ve performed an initial export of 100,000 articles, yielding 635 megabytes of text containing (according to wc) 107 million words. I will certainly want to add to this in time, but for the next phase of work it should be ample. This will be to construct a small- to medium-sized lexicon using very simple methods, which can then be used to bootstrap a more accurate parser-based analysis of the full corpus.

Importing Wikipedia

July 27th, 2010

Wikipedia generally encourage reuse of their material, but discourage spidering because of the load it places on their servers and bandwidth. Instead they provide copies of the underlying database in an XML-based format. This is many times smaller than the rendered HTML, and can be served statically, so it is a much more efficient method of transfer from both their point of view and mine. It doesn’t include images, but for the analysis I have in mind the text is all I need.

I tried processing this XML file directly, but while it is easy enough to strip away the outer layer of XML, what that leaves is unparsed wikitext. Unfortunately there are many variants of wikitext, and (currently) no authoritative specification for the particular dialect used by Wikipedia (beyond the MediaWiki source code). Writing a parser would be easy enough, but writing one that resolved ambiguities in exactly the same way as MediaWiki would be more difficult. From discussion of this topic elsewhere, my understanding is that the content of Wikipedia is quite sensitive to these subtleties.

The obvious solution to this problem is to use MediaWiki itself to perform the parsing. The simplest way to do that (without spidering) is to create a local copy of Wikipedia, then download from there. Building a local copy is fairly straightforward, but there are a few wrinkles that are worth recording.

MediaWiki is available as a package for both Debian and Ubuntu (my favoured distributions), but the specific versions provided by Lenny and Lucid rendered many pages incorrectly: the content following a reference was often rendered as preformatted text, due to an interaction between the different layers of syntax. I therefore ran up a copy of Squeeze (which is still in testing but quite usable for this purpose) and that appears to work much better. (I could have downloaded and installed the upstream version instead, but I prefer to use packages where feasible.)

There are several ways to import the text. I chose to use importDump.php, which is definitely not the quickest method in terms of elapsed time, but it appeared to be the least risky and required the smallest amount of effort from me. The only problem I encountered is that the script chokes on <redirect /> elements. Fortunately they are easy to remove, because each one sits on a line of its own in the XML:

egrep -v "^\s*<redirect />\s*$"

What effect this will have on the behaviour of the wiki is unclear (I presume these elements were there for a reason), but redirects to normal articles I can definitely live without: one copy is sufficient. Redirects to templates may be more of an issue.
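For anyone who would rather do the filtering inside a script, the following is a rough Python equivalent of the egrep command above. It relies on the same assumption, namely that each redirect element occupies a line of its own. The name strip_redirects is invented for the example.

```python
import re

# Same pattern as the egrep command: a line holding only <redirect />,
# optionally surrounded by whitespace.
REDIRECT_LINE = re.compile(r"^\s*<redirect />\s*$")

def strip_redirects(lines):
    """Yield every line except those consisting solely of <redirect />."""
    for line in lines:
        if not REDIRECT_LINE.match(line):
            yield line

# Typical use as a filter between the decompressed dump and the importer:
#   import sys
#   sys.stdout.writelines(strip_redirects(sys.stdin))
```

Like the egrep version, this leaves a redirect element alone if it shares a line with anything else, so it is only as reliable as the formatting of the dump.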

In order to correctly render Wikipedia there are some MediaWiki extensions which must be enabled. The most important of these are Cite and ParserFunctions (Cite being the extension that handles references, and which had been erroneously generating preformatted text). They are located in a separate package, mediawiki-extensions-base, and need to be enabled using the mwenext command. I’ve also enabled Poem, InputBox and CategoryTree from the same package. Wikipedia itself uses several dozen more (full list here), but I don’t think any of them significantly affect how the text is rendered.

I chose to use the current article text only (enwiki-20100622-pages-articles.xml.bz2). The page histories would probably have added a lot of repetition and noise, but little that was genuinely useful. A better case could be made for including the discussion pages, but these are by their nature quite informally written (with little or no effort made to correct mistakes), and I thought on balance that they were more likely to hinder than help.

The XML file is approximately 6GB compressed and 27GB uncompressed. I’ve allowed 128GB for the resulting MySQL database, and current indications are that this should be sufficient. The machine used to perform the import had 1GB of RAM and this proved to be grossly inadequate, but I was able to work around it by stopping then restarting the import script several times. It quickly skips over any pages that are already present, using little or no memory in the process. I wouldn’t want to recommend this technique in general, because I don’t know what it does to the database, but the result appears to be good enough for the intended use of text harvesting.

The import has already taken two weeks, and it is likely to be several more weeks before it finishes. That’s OK, because the text it has imported is more than enough to start working on. More soon about how to extract and process the text.

Redesigning the Lexicon

June 30th, 2010

The lexicon on the Project Babel website was originally set up as a means for generating language description files. I’ve since concluded that making it the authoritative source for the files isn’t practicable: they need to live in the Subversion repository like everything else. Nor is it particularly well suited for managing readings: my experience is that these are better handled systematically, one topic at a time.

Where I think the lexicon can add value is by providing a dataset against which morphological rules can be evaluated and regression-tested, but some changes are needed if it is to perform this function effectively.

First and foremost the process of populating it needs to become much, much faster. That means automatically and accurately guessing much of the required content, so that my rôle is largely reduced to approving or correcting those guesses. The user interface needs to be designed for doing this in bulk: minimising the number of clicks and page reloads by placing many entries on one page.

Secondly, better coverage of each language is needed. The texts I was using previously (from Project Gutenberg) don’t do this effectively enough. The problem isn’t that the texts are unrepresentative: on the contrary, they are too closely representative of normal writing, with not enough coverage of rare words.

For example, of the chemical elements I found hydrogen, oxygen, aluminium (and aluminum), sulphur (but not sulfur), argon, iron, nickel, copper, silver, tin, gold, mercury and lead. (Some of these have multiple meanings, but that doesn’t matter provided the words in question enter the database.) Including some texts about chemistry would improve matters, but even then it would take a huge amount of raw material to complete the list.
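A coverage check of this kind is easy to automate. The sketch below is only an illustration: the function names vocabulary and coverage are invented, the tokeniser is a crude one that recognises unaccented lower-case letters only, and the sample sentence is made up rather than taken from the corpus.

```python
import re

def vocabulary(text):
    """Return the set of lower-cased alphabetic tokens found in text."""
    return set(re.findall(r"[a-z]+", text.lower()))

def coverage(targets, text):
    """Split targets into (found, missing) relative to the words in text."""
    words = vocabulary(text)
    found = [t for t in targets if t in words]
    missing = [t for t in targets if t not in words]
    return found, missing

# Invented sample sentence, standing in for a much larger corpus.
sample_text = "Sulphur burns in oxygen with a blue flame."
found, missing = coverage(["oxygen", "sulphur", "sulfur"], sample_text)
print("found:", found)
print("missing:", missing)
```

Run against a full list of element names, the second list shows directly which words the corpus has yet to supply.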

Importing a copy of the periodic table would clearly solve this particular problem, but I’m not convinced that including tabular data in the corpus is a good idea: there would be little or no opportunity to deduce parts of speech automatically, and I’m concerned that it would add significantly to the number of non-words in the database. In any event, it is not a general solution because most of the words that need to be covered won’t appear in neat tables.

A type of prose that could provide much better coverage is an encyclopedia, and the obvious candidate is Wikipedia. There was a good case for using this anyway because of its sheer size, and also the number of languages in which an edition is published. I did have some concerns that it would be statistically unrepresentative, but it is now clear that a representative sample is not what is needed.

(Of course I could obtain much of the information needed from Wiktionary, but my preference is to keep this and other dictionaries in reserve as a means for checking data that I have generated independently. This, I hope, will provide more opportunity for detecting errors.)

The third change is to make the language description files an input to the lexicon rather than an output. There then needs to be a means to compare generated word forms with ones that have been reviewed and approved.
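In its simplest form that comparison is a matter of set arithmetic. The following Python sketch shows the idea. It is not a description of how the lexicon actually works: the function name compare_forms, the three category labels and the example word forms are all invented for illustration.

```python
def compare_forms(generated, approved):
    """Classify word forms by comparing rule output with reviewed forms."""
    generated, approved = set(generated), set(approved)
    return {
        # Produced by the rules and already approved: the rules agree.
        "confirmed": sorted(generated & approved),
        # Produced by the rules but never approved: errors or new guesses.
        "unreviewed": sorted(generated - approved),
        # Approved but not produced: the rules fail to cover these.
        "not_generated": sorted(approved - generated),
    }

# Hypothetical example: a regular past-tense rule applied to an irregular verb.
report = compare_forms(
    generated=["go", "goes", "going", "goed"],
    approved=["go", "goes", "going", "went"],
)
for category, forms in report.items():
    print(category, forms)
```

Used as a regression test, any form that moves out of the confirmed list after a rule change would be a signal to investigate.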