Named Entity Recognition
Introduction
A Named Entity (NE) is a phrase in a text that uniquely refers to an entity in the world. Named entities include proper nouns, dates, identification numbers, phone numbers, e-mail addresses and so on. Because dates and other simple categories are usually identified with hand-written regular expressions, we focus on proper names such as organisations, persons, locations, genes and proteins.
The identification and classification of proper nouns in plain text is of key importance in numerous natural language processing applications. It is the first step of an information extraction (IE) system, because proper names generally carry important information about the text itself and are thus targets for extraction. Named Entity Recognition (NER) can also be a stand-alone application. Besides IE, machine translation also has to handle proper nouns and some other kinds of words differently, because specific translation rules apply to them.
For Hungarian NER we constructed manually annotated corpora. Based on these and on available English corpora, we developed a NER system. It employs a rich feature set and has been successfully applied to Hungarian and English newswire NER, and also to English clinical NER.
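To illustrate why the simpler categories are usually left to regular expressions, here is a minimal Java sketch. The class name `SimpleEntityPatterns`, the method `findAll` and both patterns are assumptions made for this example; they are not taken from the tool described on this page, and the patterns are deliberately simplified (no calendar validation, no full e-mail grammar).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustrative only: hypothetical patterns, not the ones used by the NER tool.
class SimpleEntityPatterns {
    // ISO-style dates such as 2006-10-07; the digits are not validated as a real date.
    static final Pattern ISO_DATE =
            Pattern.compile("\\b\\d{4}-\\d{2}-\\d{2}\\b");

    // A simplified e-mail pattern: local part, '@', then at least two dot-separated labels.
    static final Pattern EMAIL =
            Pattern.compile("[A-Za-z0-9._%+-]+@[A-Za-z0-9-]+(?:\\.[A-Za-z0-9-]+)+");

    // Returns every non-overlapping match of the pattern, in order of appearance.
    static List<String> findAll(Pattern pattern, String text) {
        List<String> hits = new ArrayList<>();
        Matcher matcher = pattern.matcher(text);
        while (matcher.find()) {
            hits.add(matcher.group());
        }
        return hits;
    }
}
```

Calling `SimpleEntityPatterns.findAll(SimpleEntityPatterns.ISO_DATE, text)` returns the date-like strings found in `text`. Proper names have no such regular surface form, which is why they are handled by a trained model instead.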
Download
Named Entity Recognition tool for Hungarian [download]
- (using the CRF implementation of MALLET)
- The English trained model will be available soon.
How to use from the command line
- Parameters:
- mode: Defines the process to be executed. The only value listed here is predicate.
- input: Defines the input file on which the process is executed. The input file must be a txt file containing running (raw) text.
- output: Defines the output file in which the analysis is saved. If this parameter is not set, the result is written to standard output.
- Examples:
- java -Xmx3G -jar ner.jar -mode predicate -input input.txt -output output.txt
- java -Xmx3G -jar ner.jar -mode predicate -input input.txt
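If the tool has to be started from another Java program without linking against it, the command lines above can be assembled and launched with `ProcessBuilder`. The following is a sketch under stated assumptions: the class `NerCommand` and its methods are hypothetical helpers written for this example, and only the flags shown in the examples above (`-mode`, `-input`, `-output`) are used.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: builds and runs the command line shown in the examples above.
class NerCommand {
    // Builds the argument list; pass null as outputPath to write to standard output.
    static List<String> build(String jarPath, String inputPath, String outputPath) {
        List<String> command = new ArrayList<>();
        command.add("java");
        command.add("-Xmx3G");          // heap size taken from the examples above
        command.add("-jar");
        command.add(jarPath);
        command.add("-mode");
        command.add("predicate");
        command.add("-input");
        command.add(inputPath);
        if (outputPath != null) {
            command.add("-output");
            command.add(outputPath);
        }
        return command;
    }

    // Starts the process, forwards its output to this process, and returns the exit code.
    static int run(List<String> command) throws IOException, InterruptedException {
        Process process = new ProcessBuilder(command).inheritIO().start();
        return process.waitFor();
    }
}
```

For example, `NerCommand.run(NerCommand.build("ner.jar", "input.txt", null))` corresponds to the second command line above. Passing the arguments as a list avoids shell quoting problems with file names that contain spaces.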
How to use from Java code
- NamedEntityRecognizer ner = new NamedEntityRecognizer();
- ner.predicate("Egyszerű szöveges tartalom Szegedről.");
References
- György Szarvas, Richárd Farkas, András Kocsor: A Multilingual Named Entity Recognition System Using Boosting and C4.5 Decision Tree Learning Algorithms. In: The Ninth International Conference on Discovery Science 2006, LNAI 4265
- György Szarvas: Feature Engineering for Domain Independent Named Entity Recognition and Biomedical Text Mining Applications. PhD thesis at University of Szeged (2008)
- John Lafferty, Andrew McCallum, Fernando Pereira: Conditional Random Fields: Probabilistic Models for Segmenting and Labeling Sequence Data. In: Proceedings of ICML (2001)