Archive for January, 2009

Currying

Saturday, January 24th, 2009

Predicates are single-argument functions. They can act on another predicate and change its behaviour, but they don’t have any direct ability to act on two or more predicates.

Sometimes this ability is needed. For example, suppose you wish to wish to express concepts such as ‘taller than’, ’shorter than’, ‘quicker than’, and ‘cleverer than’. Clearly there is a common theme to these concepts which allows them to be derived from ‘tall’, ’short’, ‘quick’ and ‘clever’, so it would not be appropriate to implement new atomic predicates for each.

A more generic solution to provide a predicate with the meaning ‘more x than y’. This apparently takes two arguments: the type of comparison to be made, and the entity with which to compare. However, since predicates take one argument only, it cannot be implemented that way.

The restriction on the number of arguments can be circumvented by breaking the problem down into stages:

  • the ‘more than’ predicate first acts on x to produce an intermediate form which the meaning ‘more x than’;
  • this intermediate form then acts on y to give a final form with the meaning ‘more x than y’.

This technique is called ‘currying‘ (after Haskell Curry, although it had previously been described by Moses Schönfinkel and Gottlob Frege).

In principle there is no limit to the number of arguments which can be handled in this way, or the order in which they are processed. In practice these factors can have a large impact on readability. The number of arguments obviously should be kept as small as possible, because otherwise it will be difficult to remember what they should be. The order of the arguments should be chosen, so far as possible, to make any intermediate forms meaningful in their own right. This makes the meaning of the predicate easier to describe and remember, and avoids creating purely artificial constructs which have no linguistic counterpart.

Numbers

Monday, January 19th, 2009

One of the most important classes of predicate that will be needed is one for representing numbers. The main complication is that there are an infinite number of values to be represented, therefore they cannot simply be enumerated. Two possible solutions are:

  • to provide an atomic predicate for each number, named according to some algorithm.
  • to provide a small set of atomic predicates, which can be combined to represent any value.

The first option effectively creates new syntax, which I’m reluctant to do without very good reason, so my initial preference was for the second method. Some of the more obvious variants of this include:

  • a base-10 additive system (for example, forty two);
  • a base-10 positional system (for example, four two);
  • a base-2 positional system (for example, one zero one zero one zero).

(I’m only considering whole numbers in this post, but any scheme will need to be extensible to negative, rational, and maybe fixed-point numbers too.)

One issue with these formats is that they are potentially very verbose, even without a namespace prefix (which they probably ought to have, but which could perhaps be dispensed with in the name of usability).

A more serious objection is that they are difficult to manipulate. This is most obvious in the case of a binary system, because most languages would then require conversion to base ten (which in turn requires implementation of arithmetic operations such as division and remainder). Starting from base ten would greatly reduce the need for conversions, and make some (such as base twenty) more straightforward, but handling the general case would remain just as difficult. This is not an entirely theoretical concern: languages are known which use base 6 (Ndom) and base 15 (Huli).

For this reason I’m inclined to think that numbers do need to be treated as a special case, with special rules for handling them. That implies that new syntax will be needed regardless of the representation used, in which case the representation may as well be as simple and readable as possible. One of the simplest (and certainly the most familiar notation for most users) would be to use base ten and the digits 0 to 9 (for example, 42).

This, then, is an entirely new type of predicate. They do not need to be explicitly pre-defined, which is just as well because there are an infinite number of possible ones. Their meanings can be determined algorithmically from their names. They can be distinguished from symbolic predicates by the fact that they begin with a digit. Rules will exist specifically for handling them, and those rules will be able to process them as numbers rather than as abstract symbols.

The next task will be to determine exactly what types of rule are needed, and to do that I intend to look at a variety of natural languages and how they handle numbers - with particular emphasis on the more unusual ones.

UTF-8 support added

Saturday, January 3rd, 2009

I wrote in November about the limitations of the standard regex engine when processing UTF-8 encoded text. These have now become sufficiently annoying to make an upgrade worthwhile. The regex engine I’ve decided to use is the one provided by the PCRE library.

The issue that brought this to a head was the realisation that, while adding accented characters to a set was relatively straightforward, subtracting them was not (other than by inverting the set and listing all allowed characters explicitly, which can be rather cumbersome).

Currently it is possible to switch between the two regex engines at compile time by setting the makefile variable USE_PCRE to true (1) or false (0). This is only a temporary measure: as the language definitions come to depend on UTF-8 support the older engine will soon become unusable, at which point it will be removed to avoid clutter.

An important topic that I have not addressed yet is now Unicode input should be normalised. (I don’t think that there is any doubt that it should be normalised, but there is more than one algorithm that could be used and it isn’t yet clear to me which is preferable.) There may also be a need to support character classes (both the standard ones, and perhaps also use-defined character classes to handle language-dependent groupings such as ‘vowels’ or ‘consonants’). Two-level morphology remains an option for the future, but definitely not the near future.