Szeged Corpus: a manually annotated Hungarian natural language corpus

Host: University of Szeged, Department of Informatics
Other institutions involved: 1. Research Institute for Linguistics at the Hungarian Academy of Sciences, Department of Corpus Linguistics; 2. MorphoLogic Ltd., Budapest

URL: http://www.inf.u-szeged.hu/hlt

Description: The Szeged Corpus is a manually annotated natural language corpus that currently comprises 1.2 million words and 225 thousand punctuation marks. Its texts are drawn from six topic areas: short business news, daily news, fiction, law, computer science texts, and compositions by 14- to 16-year-old students. The texts have passed through several phases of natural language analysis, including morpho-syntactic analysis, POS tagging, shallow syntactic parsing, and semantic annotation. Current work aims at a more detailed syntactic analysis of the texts, including the annotation of adverbial, preverbal, postpositional, and adjectival structures and the identification of verbs and their argument structures. With this, the consortium intends to lay the foundation for a Hungarian treebank, which is to be enriched with detailed semantic information at a later stage. Different versions of the Szeged Corpus are publicly available after online registration and can be used free of charge for educational and research purposes. For more information, visit http://www.inf.u-szeged.hu/hlt.

Implementation description: The corpus files are in XML format. Their internal structure was first described by the TEIXLITE DTD and later by the TEI P4 DTD.
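
Because the files are TEI-style XML, they can be read with any standard XML parser. The following is a minimal sketch in Python, not the corpus's documented schema: it assumes that sentences are marked with <s>, words with <w>, and punctuation marks with <c>, as is common in TEI. The sample fragment, the lemma and type attributes, and the function names are hypothetical and serve only as illustration; the actual element and attribute names should be checked against the DTD shipped with the corpus.

```python
import xml.etree.ElementTree as ET

# Hypothetical TEI-style fragment. The real corpus markup may use
# different element or attribute names; check the distributed DTD.
SAMPLE = """<text><body><p>
<s><w lemma="a" type="T">A</w><w lemma="kutya" type="N">kutya</w><w lemma="ugat" type="V">ugat</w><c>.</c></s>
</p></body></text>"""


def count_tokens(xml_text):
    """Return (number of <w> elements, number of <c> elements)."""
    root = ET.fromstring(xml_text)
    words = sum(1 for _ in root.iter("w"))
    punctuation = sum(1 for _ in root.iter("c"))
    return words, punctuation


def sentences(xml_text):
    """Return each <s> element as a list of its word and punctuation tokens."""
    root = ET.fromstring(xml_text)
    return [
        [tok.text for tok in s if tok.tag in ("w", "c")]
        for s in root.iter("s")
    ]


if __name__ == "__main__":
    print(count_tokens(SAMPLE))
    print(sentences(SAMPLE))
```

Counting <w> and <c> elements separately mirrors the way the description reports words and punctuation marks as two distinct figures.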

Other Related Resources: Hungarian National Corpus (http://corpus.nytud.hu/mnsz/index_eng.html); TELRI Corpus (http://www.telri.bham.ac.uk/)

Contact:

Zoltán Alexin

University of Szeged, Department of Informatics H-6720 Szeged, Árpád tér 2. Hungary

Tel: +36 62 544 222/3411

Fax: +36 62 546 397

Email: alexin@inf.u-szeged.hu