Archive for the ‘Readings’ Category

Computer Terminology part 1: Strategy

Tuesday, April 27th, 2010

One of the main motivations for developing the translation system was to automatically generate localised user interfaces for computer software. This task is likely to make heavy use of computer-related terminology, making that topic an obvious one to address at an early stage. I’ve refrained from doing this previously because there were some basic questions about predicates which needed to be settled first, and it was easier to do that using well-ordered sets of concepts such as numbers and colours. Now that I’ve gained some experience I think it is feasible to attempt something more ambitious.

Identification of the required predicates has been largely ad hoc, and I am expecting substantial additions to be necessary in the future. Writing definitions has been time-consuming but mostly straightforward. I doubt they are up to dictionary standard, but they should be adequate to indicate what is intended.

More difficult has been deciding how to divide the resulting predicates between namespaces. I’ve explored two alternative approaches:

  • A subtype-supertype hierarchy, in which members of a given namespace share an is-a relationship with their parent.
  • A topic-based hierarchy, in which predicates with fundamental semantic differences can and should share the same namespace if they relate to the same subject area.

For example, the first method would place the user interface elements ‘window’, ‘icon’ and ‘menu’ into one namespace, and the user actions ‘click’ and ‘drag’ into another, because the former are user interface elements whereas the latter are actions. The second method would more likely place all of these in a single namespace because they all relate to the topic of computer user interfaces.

The first method is attractive because it is relatively objective, but I’ve found that it has three major drawbacks:

  1. The resulting namespaces are often very small (and therefore numerous).
  2. The namespaces can be difficult to name concisely.
  3. Homonyms, which namespaces were supposed to resolve, are quite likely to end up in the same namespace.

The third point is arguably the most serious one, because there is no point having namespaces unless they provide a way to specify which sense of a word is intended. For example, using the subtype-supertype approach, there isn’t a large difference between hanging out the washing and hanging a person: both are actions which involve suspending something from a rope or line. However, the topics with which they are associated are entirely different: housekeeping versus criminal justice.
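The homonym problem can be sketched in a few lines of Python. This is only an illustration: the predicate list, the supertype and topic labels, and the helper names are all assumptions made up for the example, not part of the actual predicate set or of BabelScript.

```python
from collections import defaultdict

# Hypothetical predicates as (word, supertype, topic) triples.
PREDICATES = [
    ("window", "ui_element", "user_interface"),
    ("icon", "ui_element", "user_interface"),
    ("menu", "ui_element", "user_interface"),
    ("click", "action", "user_interface"),
    ("drag", "action", "user_interface"),
    ("hang", "action", "housekeeping"),       # hanging out the washing
    ("hang", "action", "criminal_justice"),   # hanging a person
]

def group(predicates, key_index):
    """Group predicate words into namespaces keyed by the chosen field."""
    namespaces = defaultdict(list)
    for entry in predicates:
        namespaces[entry[key_index]].append(entry[0])
    return dict(namespaces)

def clashes(namespaces):
    """Return, per namespace, the words that occur more than once in it."""
    result = {}
    for name, words in namespaces.items():
        repeated = sorted({w for w in words if words.count(w) > 1})
        if repeated:
            result[name] = repeated
    return result

by_supertype = group(PREDICATES, 1)  # subtype-supertype scheme
by_topic = group(PREDICATES, 2)      # topic-based scheme

print(clashes(by_supertype))  # both senses of 'hang' land in 'action'
print(clashes(by_topic))      # no namespace contains a repeated word
```

With this toy data the supertype grouping leaves both senses of ‘hang’ in the same namespace, whereas the topic grouping separates them and also keeps all five user interface predicates together.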

Another question that arose was how far to go in creating predicates to represent concepts that could be expressed in terms of other predicates. Here are some examples where I think a good case can be made one way or the other:

  • Although it is true to say that a ‘laser printer’ is a printer that contains a laser, there is much more to its meaning than that. A full description would be impracticable, on grounds of both verbosity and fragility. It therefore makes good sense for there to be a separate predicate corresponding to the concept of a laser printer.
  • The alternative would be to create a predicate that captures the full, specialised meaning of the word ‘laser’ as used in the term ‘laser printer’. However, such a predicate would be so specialised that it could, I suspect, only be used to modify the concept of a printer. If the predicate language were being developed for semantic analysis then this added orthogonality might be useful, but I don’t think a translation system has any need for it.
  • A ‘printer cable’ is merely a cable for a printer. If a separate predicate were created for this concept then, for consistency, it would also be necessary to add ‘keyboard cable’, ‘mouse cable’, ‘plotter cable’ and many others. The need to systematically replicate a complete class of predicates is a clear indication that orthogonality has been violated, and while I wouldn’t be above that if there were a clear practical benefit, I can’t see any justification for it in this instance.

The distinction here is between what would be an idiom and what is merely a collocation. Idioms are forbidden in BabelScript, because they violate the rule for combining predicates, so if a concept would be too complex to be described analytically then it needs to become a predicate in its own right.

However, there is an exception to this rule. Even if a predicate is not strictly needed, it can be added anyway as an alias for the decomposed form. In this case the criterion is whether the convenience of having a single predicate to express a concept outweighs the clutter resulting from additional and unnecessary predicates. In the third example above I’ve said no, but this is very much a value judgment.

For many of the potential predicates I’ve considered the correct answer is unclear. For example:

  • Is a package manager merely a program for managing packages?
  • A compiler is (I would say) merely a program for compiling, but is it a sufficiently well-used concept to justify an alias?
  • Should constructors and destructors be expressed as functions for constructing and destroying?

The approach I’ve taken is to err on the side of caution and avoid creating predicates that are questionable. Adding new predicates is straightforward, especially if they are aliases, whereas removing them is more disruptive (because it breaks compatibility with texts that use them).
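To make the alias idea concrete, here is a minimal Python sketch of how an alias might expand to its decomposed form. Everything in it is an assumption for illustration: the tuple representation, the ‘for’ relation, the predicate names, and the choice of ‘compiler’ as an alias (which the questions above leave open). It is not BabelScript syntax.

```python
# Hypothetical primitive predicates (those with no decomposition).
PRIMITIVES = {"program", "for", "compile", "cable", "printer", "laser_printer"}

# Hypothetical aliases: one name standing for a decomposed form.
ALIASES = {
    "compiler": ("program", "for", "compile"),
}

def expand(term):
    """Replace every alias in a term with its decomposed form."""
    result = []
    for name in term:
        if name in ALIASES:
            result.extend(expand(ALIASES[name]))
        elif name in PRIMITIVES:
            result.append(name)
        else:
            raise KeyError(f"unknown predicate: {name}")
    return tuple(result)

# A text using the alias and a text using the decomposed form
# normalise to the same sequence of primitives.
print(expand(("compiler",)))
print(expand(("program", "for", "compile")))
```

In this model, adding an alias is a one-line change to `ALIASES` that leaves existing texts valid, whereas deleting an entry makes `expand` fail on any text that still uses it, which is the compatibility cost described above.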

Countability

Tuesday, November 18th, 2008

I’m seeing a lot of inconsistency both between and within dictionaries as to whether nouns are countable or uncountable.

The basic problem is that most supposedly uncountable nouns can be used countably if you are willing to inflict sufficient violence upon the language in question. This inevitably results in differences of opinion as to what is acceptable and what is not.

For example, in answer to the question ‘what atoms are needed to make a molecule of hydrogen peroxide?’ you might answer ‘two hydrogens and two oxygens’. Does that make ‘hydrogen’ and ‘oxygen’ countable? If you are willing to accept that response as a valid utterance then clearly in some sense they must be, but not to the same extent as words like ‘dog’ or ‘car’, which pluralise more readily.

My analysis of this usage is that each of these element names has at least two readings:

  • as an uncountable noun, meaning a quantity of that element, and acceptable in formal or informal contexts;
  • as a countable noun, as shorthand for an atom of the element, and acceptable only in very informal contexts.

The implied unit need not be an atom: if the reference were to ‘two golds and a silver’ then that could very plausibly be counting in units of medals. This is really no different to the process that makes the word ‘iron’ countable when referring to household appliances or golf clubs, except that you probably won’t find the informal readings in a dictionary.
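One way to picture this analysis is as a set of readings per word, each carrying its own countability and register. The Python sketch below is only an assumption about how such data might be modelled: the `Reading` class, the tag values and the glosses are invented for illustration and do not reflect the actual language definition format.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Reading:
    word: str
    gloss: str
    countable: bool
    register: str  # hypothetical tag values: "any" or "informal"

READINGS = [
    Reading("oxygen", "a quantity of the element", False, "any"),
    Reading("oxygen", "an atom of the element", True, "informal"),
    Reading("iron", "a quantity of the element", False, "any"),
    Reading("iron", "a household appliance for pressing clothes", True, "any"),
    Reading("iron", "a type of golf club", True, "any"),
]

def readings_for(word, allow_informal=False):
    """Return the readings of a word, optionally including informal ones."""
    return [
        r for r in READINGS
        if r.word == word and (allow_informal or r.register != "informal")
    ]

def is_countable(word, allow_informal=False):
    """A word counts as countable if any admitted reading is countable."""
    return any(r.countable for r in readings_for(word, allow_informal))

print(is_countable("oxygen"))                       # formal readings only
print(is_countable("oxygen", allow_informal=True))  # informal readings admitted
print(is_countable("iron"))                         # established countable senses
```

Under this model the question ‘is oxygen countable?’ has no single answer: it depends on which readings are admitted.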

[Update 2008-12-23: it would appear that Wiktionary, for one, lists Oxygen as countable (plural 'oxygens') with the reading "an atom of the element". Carbon, however, is listed as countable only in the sense of 'carbon paper' or 'carbon copy'.]

Is there a need for the language definition files to include the informal readings? I wouldn’t be against this in principle, provided that the readings were tagged appropriately, but I see three objections in practice:

  1. The semantics are highly context-dependent. Counting atoms is one of the more likely possibilities, but in principle it could be almost anything made of, containing, or otherwise related to the word in question. This isn’t something that the current framework can handle well.
  2. Attempting to model usage of this nature would be a major distraction. The required information would be time-consuming to collect and difficult to objectively verify.
  3. None of the software currently envisaged has a need to generate or accept informal text of this nature (and writing software with that ability would probably be a tall order).

For these reasons I’m inclined to exclude marginal readings at present, but with one important concession: the inflectional paradigm should attempt to produce a reasonable plural, even if the grammatical rules say that there isn’t one. This will allow text containing the plural to be correctly parsed by relaxing the grammar. Plurals that are well established, such as golfing irons or pencil leads in English, are obviously acceptable in the language definition.
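The concession can be sketched as follows. This is a hedged illustration in Python, not the actual paradigm machinery: the `Noun` class, the deliberately naive English plural rule, and the `relaxed` flag are all assumptions introduced for the example.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Noun:
    lemma: str
    countable: bool

    def plural(self):
        """Always produce a plural form, even for uncountable nouns.

        Naive rule for illustration only; it ignores irregular plurals.
        """
        if self.lemma.endswith(("s", "x", "z", "ch", "sh")):
            return self.lemma + "es"
        return self.lemma + "s"

def recognise(word, lexicon, relaxed=False):
    """Return (lemma, number) if the word matches a lexicon entry, else None.

    Plurals of uncountable nouns are accepted only when relaxed is True.
    """
    for noun in lexicon:
        if word == noun.lemma:
            return (noun.lemma, "singular")
        if word == noun.plural() and (noun.countable or relaxed):
            return (noun.lemma, "plural")
    return None

LEXICON = [Noun("dog", True), Noun("oxygen", False), Noun("hydrogen", False)]

print(recognise("dogs", LEXICON))                   # accepted by the strict grammar
print(recognise("oxygens", LEXICON))                # rejected by the strict grammar
print(recognise("oxygens", LEXICON, relaxed=True))  # accepted once relaxed
```

The point of the design is that the paradigm never refuses to inflect; whether the resulting form is accepted is a separate decision made by the grammar.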

One point I am not certain about is whether countability needs to be indicated in the language definition file at all, or whether it can be inferred from the reading. Put another way, is countability a characteristic of language, or is it a real-world characteristic which languages can be expected to follow in a consistent manner? I am not sure, but the safer option is to record countability tags in the language definition file for now (it is a lot easier to delete tags retrospectively than to add them). I’ll review this when I have more experience handling the issue in different languages.