Germanic Lexicon Project
The goal of this project is to create comprehensive electronic documentation of the lexicons of the early Germanic languages, particularly of the etymological relationships among the words in those languages.
A central principle is that the resulting data are to be free, meaning that they are under no legal encumbrance from copyright or any other intellectual property restriction. Copying, sharing, and modifying of the data is permitted and encouraged.
The short-term task of the project, which has been in progress since 1998, is to digitize copyright-expired dictionaries and glossaries on the early Germanic languages.
The three texts judged to be of greatest importance in terms of complete lexical and etymological coverage are:
- Fick/Falk/Torp: Wörterbuch der Indogermanischen Sprachen: Dritter Teil: Wortschatz der Germanischen Spracheinheit (Dictionary of the Indo-European Languages: Third Part: Vocabulary of the Germanic Language Unity) by August Fick with contributions by Hjalmar Falk, entirely revised by Alf Torp in 1909
- Bosworth/Toller: An Anglo-Saxon dictionary, based on the manuscript collections of the late Joseph Bosworth; edited and enlarged by T. Northcote Toller (main volume published in 1898; supplement published in 1921)
- Cleasby/Vigfusson: An Icelandic-English Dictionary, by Richard Cleasby and Gudbrand Vigfusson (1874)
Hand correction of Fick/Falk/Torp is largely completed. The current phase of the project is focusing on the correction of Bosworth/Toller and Cleasby/Vigfusson.
Consider the entire mass of information on the lexicons of the older Germanic languages. This is the sum of every bit of lexical information which can be extracted from the corpora of these languages. It would at least include every piece of verifiably correct information in every published dictionary of these languages; and probably more, since there is no reason to think that every word token in the corpora of these languages has yet been properly documented.
This information can be thought of as a huge network consisting of lexical items and their properties, and the various kinds of connections among those items.
For example, consider the Old English word dæg. This word has various inherent synchronic properties: it is in the nominative singular; it is a masculine noun; it belongs to the a-stem class of nouns; it has a meaning which is glossed in modern English as "day".
The word form dæg has connections with other word forms in Old English. It is connected to dæges, dæge, dagas, daga, dagum because those are members of the same noun paradigm, differing only in case and/or number. It is related to deag because this is an attested alternate spelling for the same word. It is connected to words such as ge-win-dæg and dæg-weorc because dæg is one of the elements in these compound words. It is connected to the verb dægian "to dawn" because this verb derives from the noun by means of a derivational suffix.
This word form has connections to various Old English texts because it is attested in them. dæg is found in Menol. Fox 347, in Gen 1,5, in Bd. App. S. 771, 45., etc. To state this a different way, this abstract word form has a connection to the position in the text of each of its tokens.
This word has connections with words in other Germanic languages as well. For example, it is cognate with Old Icelandic dagr, Gothic dags, OHG tac, and others. Each of these cognate words has its own complex network of connections to other lexical items in its own respective language.
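One way to picture such a network is as a labeled graph. The following Python sketch is only an illustration and not the project's actual data model: the names `Word` and `LexicalNetwork`, the relation labels, and the property keys are all assumptions invented for this example. It records only the facts about dæg mentioned above.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Word:
    """A word form in a particular language (hypothetical node type)."""
    lang: str
    form: str


class LexicalNetwork:
    """A toy labeled graph: nodes carry properties, edges carry a relation name."""

    def __init__(self):
        self.properties = defaultdict(dict)  # node -> {property: value}
        self.edges = defaultdict(set)        # node -> {(relation, target)}

    def link(self, source, relation, target, symmetric=False):
        """Add a labeled edge; symmetric relations are stored in both directions."""
        self.edges[source].add((relation, target))
        if symmetric:
            self.edges[target].add((relation, source))

    def related(self, node, relation):
        """Return every target reachable from `node` by the given relation."""
        return {target for rel, target in self.edges[node] if rel == relation}


net = LexicalNetwork()
daeg = Word("Old English", "dæg")

# Inherent synchronic properties of the word form.
net.properties[daeg].update(
    case="nominative", number="singular", gender="masculine",
    stem_class="a-stem", gloss="day",
)

# Connections within Old English.
for form in ["dæges", "dæge", "dagas", "daga", "dagum"]:
    net.link(daeg, "same_paradigm", Word("Old English", form), symmetric=True)
net.link(daeg, "spelling_variant", Word("Old English", "deag"), symmetric=True)
for form in ["ge-win-dæg", "dæg-weorc"]:
    net.link(Word("Old English", form), "has_element", daeg)
net.link(Word("Old English", "dægian"), "derived_from", daeg)

# Attestations, stored here as plain citation strings.
for citation in ["Menol. Fox 347", "Gen 1,5", "Bd. App. S. 771, 45"]:
    net.link(daeg, "attested_in", citation)

# Cognates in other Germanic languages.
for lang, form in [("Old Icelandic", "dagr"), ("Gothic", "dags"), ("OHG", "tac")]:
    net.link(daeg, "cognate", Word(lang, form), symmetric=True)
```

A call such as `net.related(daeg, "cognate")` then returns the three cognate nodes, and the same query run from the Gothic node leads back to dæg, because the cognate relation is stored symmetrically.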
Traditional paper references (synchronic and etymological dictionaries, and concordances) are views of this network, but only partial ones. The network comprises so much information, with so many different types of connections, that it is impractical to explicitly represent the entire network in a paper medium in any sort of unified form.
The long-term goal of this project is to electronically represent this entire network as a single body, with as close an approximation of completeness as practical limitations will allow.
From this perspective, the digitization of existing paper dictionaries should not be seen as merely the conversion of an existing work to a more readily searchable or transportable form. It is simply the most efficient strategy for the initial stages of populating the electronic lexical network. The texts will be parsed, and the information converted into a suitable form and merged with data from other texts into a single entity.
In a shallow sense, one might view this process as involving e.g. the mere addition of a link from the dæg entry in Bosworth/Toller's dictionary of Old English to the dagr entry of Cleasby/Vigfusson's dictionary of Old Icelandic. But this is setting one's sights too low. The source of the individual pieces of information in this novel work will not be important except for bibliographic purposes, much in the same way that one often references earlier paper dictionaries when creating a new paper dictionary.
The merged information will generally not be recognizable as individual entries from particular earlier dictionaries (except perhaps in the case of specific wordings of glosses). It will be so heavily processed into a computer-readable format with explicit semantics, and so substantially augmented and validated, that it will not be a mere copy of the source dictionaries in any obvious sense.
Nor will the process end with the integration of the information from the earlier paper dictionaries. The corpora of some of the languages in question are fairly large, and there are very likely to be substantial omissions in many of the existing paper dictionaries, since it simply was not possible to consider the entire corpora of some of these languages when working in a paper medium. A check of the lexical network against the language corpora is likely to reveal both previously undocumented words and also word tokens for which no concordance reference previously existed.
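Such a check could look roughly like the following Python sketch. It assumes, purely for illustration, that the corpus is available as (location, word form) pairs and that the lexicon maps each documented form to the set of locations already cited for it. The function name `check_corpus`, both data layouts, and the placeholder locations and the non-word "wug" are invented for the example; only the citation Gen 1,5 for dæg comes from the discussion above.

```python
def check_corpus(tokens, lexicon):
    """Compare corpus tokens against the lexicon.

    tokens  -- iterable of (location, form) pairs
    lexicon -- dict mapping each documented form to its set of cited locations
    Returns (undocumented forms, tokens lacking a concordance reference).
    """
    undocumented = set()
    missing_refs = []
    for location, form in tokens:
        if form not in lexicon:
            undocumented.add(form)
        elif location not in lexicon[form]:
            missing_refs.append((form, location))
    return undocumented, missing_refs


# Placeholder data: "Text X" and "wug" are made up for the example.
lexicon = {"dæg": {"Gen 1,5"}}
tokens = [
    ("Gen 1,5", "dæg"),         # already documented and cited
    ("Text X, line 1", "dæg"),  # documented word, but this token is uncited
    ("Text X, line 2", "wug"),  # form with no lexicon entry at all
]
undocumented, missing_refs = check_corpus(tokens, lexicon)
```

A real check would also have to map inflected and variant spellings onto their lexical entries before the comparison; this sketch matches forms exactly.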
Using the data
Different scholars will be able to use this resource for different purposes. My own (Sean Crist's) impetus for this project is to improve our understanding of the Germanic sound changes, and our understanding of Proto-Germanic, by running the sound changes (as we understand them) against this entire body of data.
For example, given an Old English word such as beorgan, we can automatically project this word upstream (backwards in time) through the known sequence of Old English sound changes to produce a set of possible Proto-Germanic forms which would have produced this form. (This set of possible PGmc forms may contain more than one member since some of the sound changes involve mergers.) Similarly, we can automatically project the Gothic word baírgan upstream to Proto-Germanic.
Now we can ask: can OE beorgan and Goth. baírgan be cognate? What does this question mean? In this context, the answer is yes if there is at least one form which appears both in the upstream-from-OE set and in the upstream-from-Gothic set.
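The upstream projection and the cognate test can be sketched in Python as follows. This is a toy under stated assumptions, not the project's implementation. Each rule is a triple (before, after, context) meaning that `before` became `after` when followed by `context`. The rule lists `OE_RULES` and `GOTHIC_RULES` are drastically simplified stand-ins, loosely modeled on Old English breaking and on the Gothic treatment of e and i, and endings are left as attested rather than reconstructed. All function names are invented for the example.

```python
import re
import unicodedata
from itertools import product


def apply_rule(form, rule):
    """Apply one sound change forward in time."""
    before, after, context = rule
    return re.sub(f"{re.escape(before)}(?={context})", after, form)


def undo_rule(form, rule):
    """Return every earlier form that this rule would turn into `form`."""
    before, after, context = rule
    sites = [m.start() for m in re.finditer(f"{re.escape(after)}(?={context})", form)]
    candidates = set()
    # At each site the sound may or may not have come from `before`.
    for choice in product([False, True], repeat=len(sites)):
        pieces, last = [], 0
        for pos, revert in zip(sites, choice):
            pieces.append(form[last:pos])
            pieces.append(before if revert else after)
            last = pos + len(after)
        pieces.append(form[last:])
        candidate = "".join(pieces)
        # Keep only candidates that really yield `form` when run forward.
        if apply_rule(candidate, rule) == form:
            candidates.add(candidate)
    return candidates


def upstream(form, rules):
    """Project a form backwards through `rules` (listed in chronological order)."""
    forms = {unicodedata.normalize("NFC", form)}
    for rule in reversed(rules):
        forms = {earlier for f in forms for earlier in undo_rule(f, rule)}
    return forms


def possibly_cognate(form_a, rules_a, form_b, rules_b):
    """True if the two upstream sets share at least one form."""
    return bool(upstream(form_a, rules_a) & upstream(form_b, rules_b))


# Toy rules (assumptions for illustration only).
OE_RULES = [("e", "eo", "r[^aeiou]")]    # e > eo before r + consonant
GOTHIC_RULES = [
    ("e", "i", ""),                      # e > i everywhere (a merger)
    ("i", "a\u00ed", "[rh]"),            # i > aí before r or h
]

oe_set = upstream("beorgan", OE_RULES)
gothic_set = upstream("ba\u00edrgan", GOTHIC_RULES)
shared = oe_set & gothic_set
```

With these toy rules both upstream sets contain the form bergan, so `possibly_cognate` reports that the two words can be cognate. Each set has more than one member because the program cannot tell whether a given sound is original or the output of a change, which is the effect of mergers described above.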
With this tool, we can loop through the entire lexicon of each of the early Germanic languages. For every etymological claim which has been made, we can automatically ask whether this claim can be correct or not; we can mark the matching items as having been verified, and kick out problematic forms for consideration by a human.
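As a rough sketch of that loop, the Python below sorts a list of cognate claims into verified and flagged piles. It assumes that the set of possible Proto-Germanic forms for each word has already been computed by some projection step and is simply looked up here. The names `Claim`, `triage`, and `UPSTREAM` are invented, the sets are toy values, and the second claim is deliberately bogus so that something gets flagged.

```python
from typing import NamedTuple


class Claim(NamedTuple):
    """An etymological claim that two words are cognate (hypothetical record)."""
    lang_a: str
    form_a: str
    lang_b: str
    form_b: str


def triage(claims, upstream_sets):
    """Split claims into those whose upstream sets overlap and those that need review."""
    verified, flagged = [], []
    for claim in claims:
        set_a = upstream_sets.get((claim.lang_a, claim.form_a), set())
        set_b = upstream_sets.get((claim.lang_b, claim.form_b), set())
        if set_a & set_b:
            verified.append(claim)  # at least one shared pre-form
        else:
            flagged.append(claim)   # hand over to a human
    return verified, flagged


# Toy upstream sets, assumed to come from an earlier projection step.
UPSTREAM = {
    ("OE", "beorgan"): {"bergan", "beorgan"},
    ("OE", "helpan"): {"helpan"},
    ("Goth", "baírgan"): {"bergan", "birgan", "baírgan"},
}

claims = [
    Claim("OE", "beorgan", "Goth", "baírgan"),
    Claim("OE", "helpan", "Goth", "baírgan"),  # deliberately wrong pairing
]
verified, flagged = triage(claims, UPSTREAM)
```

A flagged claim is not necessarily false; it only means that the sound changes as currently encoded cannot derive both words from one form.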
Similarly, there are often cases where we do not know which of two sound changes came earlier in the relative chronology of sound changes. With the proposed network, a query can be structured to search through thousands of words to identify words which are of a form where the ordering of the sound changes makes a crucially different prediction.
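A query of this kind could be sketched in Python as below. The two rules and the three input forms are invented toy examples, not claims about any actual Germanic sound change; `RULE_A`, `RULE_B`, and `diagnostic_words` are hypothetical names. The function applies the two rules in both orders and keeps only the words for which the outcomes differ.

```python
import re


def apply_rule(form, rule):
    """Apply one sound change, given as a (regex pattern, replacement) pair."""
    pattern, replacement = rule
    return re.sub(pattern, replacement, form)


def diagnostic_words(words, rule_x, rule_y):
    """Return {word: (x-then-y outcome, y-then-x outcome)} where the outcomes differ."""
    hits = {}
    for word in words:
        x_then_y = apply_rule(apply_rule(word, rule_x), rule_y)
        y_then_x = apply_rule(apply_rule(word, rule_y), rule_x)
        if x_then_y != y_then_x:
            hits[word] = (x_then_y, y_then_x)
    return hits


# Toy rules (assumptions for illustration only).
RULE_A = (r"a(?=n)", "o")  # a > o before n
RULE_B = (r"n(?=s)", "")   # n is lost before s

hits = diagnostic_words(["gans", "hand", "dag"], RULE_A, RULE_B)
```

Here only the toy form gans is diagnostic: applying A first gives gos, while applying B first removes the n that conditions A and gives gas. The other two forms come out the same under either ordering and so say nothing about the chronology.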
This discussion is simply a brief overview; it glosses over a number of complications (e.g. how to handle known cases of analogy; how to encode the "cognate" relation when it only applies between substrings of words as in cases of compounding and derivational morphology, etc.).