Corpora for lemmatizing NEs

Introduction

In general, lemmatization can be carried out easily and reliably when a good dictionary and a list of suffixes are available. Named entities (NEs), however, cannot be listed exhaustively in a dictionary, so only the list of suffixes of the given language can be used to lemmatize them. Three typical problems can occur when determining the boundary between the lemma and the suffix:

The NE ends in an apparent suffix
Two (or more) NEs of the same type follow each other and are not separated by punctuation marks
The NE contains punctuation marks within

For the suffix problem and the separation problem, a number of NEs were collected. All possible cuts were performed on them (that is, every possible boundary between the potential lemma and its suffixes was identified), and each cut was then manually classified as a positive or negative example of cutting. The corpora serve as the training database for an algorithm that seeks to lemmatize NEs based on web frequency data.
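To make the notion of "all possible cuts" concrete, the following is a minimal sketch of how candidate lemma/suffix boundaries could be enumerated from a suffix list. It is not the procedure used to build the corpora: the function name `candidate_cuts` and the tiny English suffix list are assumptions made purely for illustration.

```python
def candidate_cuts(token, suffixes):
    """Return every (lemma, suffix) split of `token` licensed by `suffixes`.

    The uncut token is always included as a candidate, because an NE may
    merely end in an apparent suffix (the suffix problem).
    """
    cuts = [(token, "")]
    # Sort by (length, text) so the output order is deterministic.
    for suffix in sorted(set(suffixes), key=lambda s: (len(s), s)):
        # Skip empty suffixes and cuts that would leave an empty lemma.
        if suffix and token.endswith(suffix) and len(token) > len(suffix):
            cuts.append((token[: -len(suffix)], suffix))
    return cuts


# Hypothetical suffix list, for illustration only.
SUFFIXES = ["s", "es", "'s"]

print(candidate_cuts("Adidas", SUFFIXES))
# [('Adidas', ''), ('Adida', 's')]
```

Each pair returned this way corresponds to one cut that would then need to be labelled as a positive or negative example; in the sketch, only the uncut form of "Adidas" would be a positive one.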

Reference

Farkas, Richárd; Vincze, Veronika; Nagy, István; Ormándi, Róbert; Szarvas, György; Almási, Attila 2008: Web-based lemmatisation of Named Entities. In: Horák, Aleš; Kopeček, Ivan; Pala, Karel; Sojka, Petr (eds.): Proceedings of the 11th International Conference on Text, Speech and Dialogue (TSD 2008), Berlin, Heidelberg, Springer Verlag, LNCS 5246, pp. 53-60.

For further information please contact Veronika Vincze (vinczev AT inf.u-szeged.hu).

Download

The database used in the experiments can be downloaded.