MULTEXT-East

Description:

The resources are a multilingual dataset for language engineering research and
development. This dataset contains, for Bulgarian, Croatian, Czech, English,
Estonian, Hungarian, Lithuanian, Resian, Romanian, Russian, Slovene, and
Serbian, some or all of the following language resources:

  • the morphosyntactic specifications, lexica, and annotated "1984" corpus;

  • parallel and comparable text and speech corpora;
  • and associated documentation.

The complete corpora as well as the documentation are encoded in TEI P4.

The project was a spin-off of MULTEXT and ran from 1995 to 1997. It developed
language resources for six languages: Bulgarian, Czech, Estonian, Hungarian,
Romanian, and Slovene, as well as for English, the hub language of the project.
It also adapted existing tools and standards to these languages. The main
results of the project were an annotated multilingual corpus and lexical
resources for the seven languages.

The extended results of the project were made available in 1998, first on
CD-ROM and then via TRACTOR, the TELRI Research Archive of Computational Tools
and Resources.

In the scope of the Concede project, a new release was made available in 2002;
it contained only the (updated and corrected) morphosyntactic resources from
the first release. This second release was made freely available for research
use via the Web.

Finally, the third release was made in 2004. It updates and brings together the
first two, adds new languages, and makes the move from SGML to XML, in
particular to TEI P4; this work was supported by the TEI task force on SGML to
XML migration. Version 3 is also available via the Web, from the home page of
the project.

For further information on the project, its results, and their exploitation,
you can consult the annotated bibliography, available in HTML and various other
formats from the project Web page.

(from the WWW page)

Contacts:

Tomaž Erjavec
Jožef Stefan Institute
Jamova 39
SI-1000 Ljubljana
Slovenia
Tel: +386 1 477-3507
Fax: +386 1 425-1038
Email: tomaz.erjavec@ijs.si