Concepts in the Lexicon: Introduction

John F. Sowa

The lexicon is the bridge between a language and the knowledge expressed in that language. Every language has a different vocabulary, but every language provides the grammatical mechanisms for combining its stock of words to express an open-ended range of concepts. Languages differ, however, in their grammar, their words, and the concepts they express. The differences arise from three kinds of variation:

Accidental. The most obvious differences result from arbitrary choices of sounds, such as hand in English and mano in Italian. Other variations depend on arbitrary choices of where to draw boundaries. In English, hand refers to the part of the body from the fingertips to the wrist. But in Russian, the corresponding word ruka extends all the way to the elbow.
Systematic. The grammar of a language determines how the conceptual structures are linearized as strings of words in a sentence. English and Chinese, for example, put the subject first, the verb in the middle, and the object at the end for an SVO word order. Irish and Biblical Hebrew are VSO languages that put the verb first. Latin and Japanese are SOV languages that put the verb at the end. The grammar also determines how the units of meaning, called morphemes, are combined to form words. Chinese is an extreme example of an analytic language in which almost all the morphemes can be used as stand-alone words. German is an agglutinative language, which forms compound words like Lebensversicherungsgesellschaftsangestellter (life insurance company employee). Old English was an agglutinative language like German, but as it evolved into modern English, it became almost as analytic as Chinese.
Cultural. The concepts expressed by a language are determined by the environment, activities, and culture of the people who speak the language. Since French, Chinese, and Indian cuisines are based on very different ingredients, methods of preparation, and cooking utensils, the people who cook and eat each kind of food use words for it that have no counterparts in the other cultures. The specialized concepts, however, can be transferred with the culture whenever a cook opens a new restaurant in a foreign land. Cultural and conceptual shifts occur across time as well as space. A book on science or business, for example, is easier to translate from modern English to modern Japanese than from modern English to the language of Shakespeare.
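
The word-order contrast described under systematic variation can be illustrated with a small Python sketch. It assumes a conceptual structure reduced to just a subject, a verb, and an object; the names Clause, WORD_ORDERS, and linearize are hypothetical, and the sketch ignores morphology, agreement, and everything else a real grammar must handle.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Clause:
    """A toy conceptual structure: who did what to whom (illustrative only)."""
    subject: str
    verb: str
    object: str


# Each word-order type is a different permutation of the same three roles.
WORD_ORDERS = {
    "SVO": ("subject", "verb", "object"),  # e.g. English, Chinese
    "VSO": ("verb", "subject", "object"),  # e.g. Irish, Biblical Hebrew
    "SOV": ("subject", "object", "verb"),  # e.g. Latin, Japanese
}


def linearize(clause: Clause, order: str) -> list[str]:
    """Return the clause's words in the sequence dictated by the word order."""
    return [getattr(clause, role) for role in WORD_ORDERS[order]]


clause = Clause(subject="cat", verb="chases", object="mouse")
for order in WORD_ORDERS:
    print(order, linearize(clause, order))
```

The same structure yields three different strings of words, which is the sense in which the grammar, rather than the concepts, determines the linear order.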

Grammars and words belong to the province of linguistics, but the concepts they express belong to the extra-linguistic knowledge about the world. For each language, the lexicon must provide the links that enable a language processor to carry messages from one province to the other.

Besides accommodating the idiosyncrasies of each language, the lexicon must support all the possible uses of language. Each use has a different purpose, which requires a different kind of information. A simple spelling checker, for example, can catch many errors with nothing but a list of words. To distinguish there from their, however, it must contain syntactic information. To distinguish sight from site, it must also contain semantic information. And to distinguish infer from imply, it must contain enough information to enable a language processor to recognize the context, the topic, and the logical inferences necessary to determine what was being inferred or implied.
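
The first two levels of the spelling-checker example can be sketched as follows. This is a deliberately tiny illustration, not a real checker: the six-word LEXICON, its single category per word, and the one rule (a possessive directly followed by a verb is suspect) are all assumptions made for the example.

```python
# Toy lexicon mapping each known word to one syntactic category.
# Real lexicons allow several categories per word; this is a simplification.
LEXICON = {
    "there": "adverb",
    "their": "possessive",
    "is": "verb",
    "barks": "verb",
    "a": "determiner",
    "dog": "noun",
}


def unknown_words(words):
    """Word-list check: flag words that do not appear in the lexicon at all."""
    return [w for w in words if w not in LEXICON]


def misplaced_possessives(words):
    """Syntactic check: flag a possessive that is directly followed by a verb."""
    flagged = []
    for current, following in zip(words, words[1:]):
        if LEXICON.get(current) == "possessive" and LEXICON.get(following) == "verb":
            flagged.append(current)
    return flagged


sentence = "their is a dog".split()
print(unknown_words(sentence))          # the word list alone finds nothing
print(misplaced_possessives(sentence))  # the syntactic rule flags "their"
```

Every word in "their is a dog" is correctly spelled, so the word list passes it; only the category information exposes the error. Distinguishing sight from site or infer from imply would need semantic and contextual information that this sketch does not attempt.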

The demands on the lexicon also vary with the type of application: speech transcription, information retrieval, information extraction, text summarization, message classification, question answering, machine translation, and discourse understanding. Each application can also be processed at levels of detail ranging from a rough approximation triggered by keywords to a deep understanding that applies all the resources of syntax, semantics, and pragmatics. As a bridge, the lexicon is partly language dependent, partly language independent, and partly domain and application dependent. It need not contain all information about the language and domain, but it must contain the hooks that link the language-dependent words both to the language-dependent grammar and to the language-independent but domain-dependent conceptual structures.
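
One way to picture these hooks is as a record that ties a word form to a grammatical category on one side and a concept type on the other. The sketch below is an assumption about how such an entry might look, not the representation developed later in this document; LexicalEntry, ENTRIES, words_for, and the concept label "Hand" are hypothetical names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LexicalEntry:
    word: str      # language-dependent word form
    language: str  # language the word form belongs to
    category: str  # hook into the language-dependent grammar
    concept: str   # hook into the language-independent concept type


# Two entries that differ in sound but share a (hypothetical) concept type.
ENTRIES = [
    LexicalEntry("hand", "English", "noun", "Hand"),
    LexicalEntry("mano", "Italian", "noun", "Hand"),
]


def words_for(concept, language):
    """Find the word forms a given language uses to express a concept type."""
    return [e.word for e in ENTRIES if e.concept == concept and e.language == language]


print(words_for("Hand", "Italian"))
```

The entry itself holds little information; its job is to point from the word into the grammar and into the conceptual structures, which are stored elsewhere.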

This document is a revised, reorganized, and updated compilation of material extracted from several papers by John Sowa. The major contributions are taken from three papers (Sowa 1988, 1992a, 1993). Additional material has been excerpted from several other papers (Sowa 1991, 1998, 1999, Sowa & Way 1986), and the terminology and notation have been revised to conform to the book Knowledge Representation (Sowa 2000). The result is organized in three parts:

Problems and Issues. Part I is a survey of linguistic examples that impose requirements on the kinds of knowledge that must be represented in the lexicon. It emphasizes the problems and their implications rather than the details of any particular theory or notation.
Representations. Part II addresses the structure of the lexicon and its links to syntax, semantics, and world knowledge. It uses logic as a theory-neutral representation and shows how other representations, both theoretical and computational, can be translated to logic in either the predicate calculus or conceptual graph notations.
Language Processing. Part III shows how the lexicon is used in language parsing, information extraction, semantic interpretation, discourse analysis, and ambiguity resolution. It shows how the problems and issues raised in Part I can be addressed by using the lexical representations introduced in Part II.

The combined bibliography is located in the reference section. Clicking on any citation represented in blue transfers the browser to the corresponding reference; clicking on the back button of the browser returns to the previous text.

Send comments to John F. Sowa.
