  1. What is a lemma?
  2. Lemmatisation
  3. Why is the lemma important for computational linguistics?

 

 

What is a lemma?

 

In linguistics and lexicography, a lemma is the form of a word under which it is listed in a dictionary. It is, so to speak, the headword in the respective reference work. This is helpful because not every possible form of a word gets its own entry in a lexicon. For example, run, runs, running and ran are all forms of one and the same basic form, run, which is the lemma.

The concept of the lemma is closely related to that of the lexeme. A lexeme is an abstract unit of meaning that helps to group inflected word forms. A word form is regarded as inflected if grammatical adaptation has moved it away from its basic form, that is, whenever it has been conjugated or declined. For example, the word forms give, gives, gave, giving and given together make up the lexeme GIVE. The lexeme can therefore be thought of as an abstraction over the set of possible forms of a word.

But how do lexemes and lemmata differ? The distinction is a functional one. The lexeme is the abstract unit behind all forms of a word, whereas the lemma is defined above all by its role as the headword in a dictionary or lexicon. Depending on the language, the form used to name the lexeme often serves as the lemma as well.
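The relationship between the two terms can be sketched as a small data structure. This is only an illustration under simple assumptions: the class name Lexeme and its fields are hypothetical, and the set of forms is the GIVE example from the text, written out by hand.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Lexeme:
    """Hypothetical model: a set of word forms plus the form chosen as lemma."""

    lemma: str          # citation form, i.e. the dictionary headword
    forms: frozenset    # all word forms that belong to the lexeme

    def __contains__(self, word_form: str) -> bool:
        # Case-insensitive membership test against the known forms.
        return word_form.lower() in self.forms


# The lexeme GIVE from the text, with "give" as its lemma.
GIVE = Lexeme(
    lemma="give",
    forms=frozenset({"give", "gives", "gave", "giving", "given"}),
)

print("gave" in GIVE)  # True
print(GIVE.lemma)      # give
```

The lexeme is the whole object, while the lemma is just one of its forms, singled out by convention.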
 

Lemmatisation

 

The process that determines which lemma is used for each word is called lemmatisation. The choice of this basic form is arbitrary in principle, but it follows certain conventions. In German, it has become established practice to cite verbs in the present active infinitive (e.g. laufen) and nouns in the nominative singular (e.g. Lauf). In other languages, other conventions may apply to the choice of lemmata.
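In software, lemmatisation is often approximated by a lookup from word form to lemma. The following minimal sketch assumes a tiny, hand-written lexicon for the German examples laufen and Lauf. The names LEMMA_LEXICON and lemmatise and the part-of-speech labels are hypothetical and are not taken from any particular library.

```python
# Hypothetical, hand-written lexicon: (word form, part of speech) -> lemma.
# Verbs map to the infinitive, nouns to the nominative singular.
LEMMA_LEXICON = {
    ("läuft", "VERB"): "laufen",
    ("lief", "VERB"): "laufen",
    ("gelaufen", "VERB"): "laufen",
    ("Läufe", "NOUN"): "Lauf",
    ("Laufs", "NOUN"): "Lauf",
}


def lemmatise(word_form: str, pos: str) -> str:
    """Return the lemma for a word form, or the form itself if it is unknown."""
    return LEMMA_LEXICON.get((word_form, pos), word_form)


print(lemmatise("lief", "VERB"))   # laufen
print(lemmatise("Läufe", "NOUN"))  # Lauf
```

The part of speech is part of the key because the convention for the basic form differs between verbs and nouns.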
 

Why is the lemma important for computational linguistics?

 

In computational linguistics, lemmatisation is helpful, among other things, for determining how often a word occurs in a text. This in turn can become important in a later step, when a system captures and classifies content in a semantic analysis.

A lemma lexicon is fundamental for this, since the number of word forms produced by inflection is significantly greater than the number of underlying lemmata. The following example illustrates this:

“When we ran through the forest I told her that I wanted to run away from everything. She just laughed and said, ‘You never run away.’ But then I was already running and ran until my legs could no longer carry me.”

Here we find five syntactic words with three different word forms (ran/run/run/running/ran). Without a lemma lexicon, the machine would not be able to recognise that the example above repeats inflected forms of the same word. Only exact repetitions, such as the two occurrences of ran, would be recognised as identical.

A lemma lexicon, by contrast, enables the machine to group these terms and recognise them as inflected forms of the same word. This is an important prerequisite for successful semantic text analysis.
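The effect on word counts can be shown with the example sentence. The sketch below is a simplified illustration: the lexicon covers only the forms of run, the tokeniser is a plain regular expression, and the names LEMMA_LEXICON, tokenise, count_forms and count_lemmata are hypothetical.

```python
import re
from collections import Counter

TEXT = (
    "When we ran through the forest I told her that I wanted to run away "
    "from everything. She just laughed and said, 'You never run away.' "
    "But then I was already running and ran until my legs could no longer "
    "carry me."
)

# Hypothetical mini lexicon: word form -> lemma (only the forms of "run").
LEMMA_LEXICON = {"run": "run", "runs": "run", "running": "run", "ran": "run"}


def tokenise(text):
    """Lower-case the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def count_forms(text):
    """Count surface word forms without any lemma information."""
    return Counter(tokenise(text))


def count_lemmata(text):
    """Count lemmata; forms missing from the lexicon count as themselves."""
    return Counter(LEMMA_LEXICON.get(token, token) for token in tokenise(text))


forms = count_forms(TEXT)
lemmata = count_lemmata(TEXT)
print(forms["ran"], forms["run"], forms["running"])  # 2 2 1
print(lemmata["run"])                                # 5
```

Without the lexicon the occurrences are spread over three separate counters. With it, all of them are attributed to the single lemma run.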

 


 
