Dictionary Encoding Based on CSS and XML/HTML Parsers (poster)
Kadyr Momunaliev (Kyrgyzstan-Turkey Manas University), Joseph Ten (Kyrgyz State Technical University), and Nella Israilova (Kyrgyz State Technical University)
Kadyr Momunaliev has worked for over five years on the creation of electronic versions of dictionaries and encyclopedias in the Kyrgyz Republic, and is currently based in Bishkek, the capital city of the Kyrgyz Republic.
1. Source File Structure
The source file was obtained through pdf-to-doc-to-(x)html conversion. The resulting XHTML file preserved almost all typographic features except correct column rendering. The purpose of the conversion was to annotate the text at source level using the OHCO model, rather than in WYSIWYG editors as we did previously. CSS and XHTML parsing is based on the following tag hierarchy:
2. Parser Output Structure
3. Interpretation of Text Features
The lexicographic structure of the dictionary is conveyed through typography (font features, layout) and syntax (predefined indicators and punctuation). Below are some of the interpretation rules we used for parsing, in [text feature] : [interpretation] format:
4. Parsing Workflow
The parser follows a simple scheme to keep its logic clear and its complexity minimal.
The main principle: structural tokens are defined first. Key tokens are then used to identify the desired elements or their boundaries.
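The two-step principle above could be sketched roughly as follows. This is a minimal illustration, not the authors' parser: the CSS class names (`b`, `i`), the token types, and the choice of the headword as the key token are all assumptions made for the example, since the actual rules are not reproduced in this text.

```python
from html.parser import HTMLParser

# Hypothetical mapping from CSS class to structural token type.
# The real class names and rules are not given in the text.
CLASS_TO_TOKEN = {"b": "HEADWORD", "i": "LABEL"}


class SpanTokenizer(HTMLParser):
    """Step 1: turn each <span class="..."> into a (token_type, text) pair."""

    def __init__(self):
        super().__init__()
        self.tokens = []
        self._current = None  # token type of the span being read, if any

    def handle_starttag(self, tag, attrs):
        if tag == "span":
            css_class = dict(attrs).get("class", "")
            # Unknown classes fall back to plain text.
            self._current = CLASS_TO_TOKEN.get(css_class, "TEXT")

    def handle_endtag(self, tag):
        if tag == "span":
            self._current = None

    def handle_data(self, data):
        text = data.strip()
        if text and self._current is not None:
            self.tokens.append((self._current, text))


def split_entries(tokens):
    """Step 2: use a key token (here, HEADWORD) as the entry boundary."""
    entries = []
    for kind, text in tokens:
        if kind == "HEADWORD":
            entries.append({"headword": text, "parts": []})
        elif entries:
            entries[-1]["parts"].append((kind, text))
    return entries


def parse(xhtml):
    """Run both steps on an XHTML fragment."""
    tokenizer = SpanTokenizer()
    tokenizer.feed(xhtml)
    tokenizer.close()
    return split_entries(tokenizer.tokens)
```

Keeping tokenization separate from boundary detection means the typographic rules can change without touching the logic that groups tokens into entries.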
5. Structure According to TEI P5
6. Mapping Parser Output to TEI P5
To summarize, the structure of the dictionary is recursive. Generally speaking, a main entry contains a list of senses, optionally followed by a list of recursive entries, which we call sub-entries. The only structural difference between a main entry and a sub-entry is that the senses of the latter cannot contain examples. Figure 2 and Figure 6 illustrate that sense elements are represented differently: TEI P5 uses <cit> elements and our XHTML does not. Additionally, our schema does not deal with homographs.
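The recursive structure described above can be illustrated with a short sketch that converts a parsed entry into TEI-like XML. The input dictionary layout (`headword`, `senses`, `examples`, `subentries`) is hypothetical. Using `<re>` for sub-entries and `<cit type="example">` with `<quote>` for examples is an assumption based on the TEI P5 dictionary module, not a mapping confirmed by the text.

```python
import xml.etree.ElementTree as ET


def entry_to_tei(entry, is_sub=False):
    """Recursively convert a parsed entry dict into a TEI-like element.

    Main entries become <entry>; sub-entries become <re> (assumed mapping).
    """
    elem = ET.Element("re" if is_sub else "entry")
    form = ET.SubElement(elem, "form")
    ET.SubElement(form, "orth").text = entry["headword"]

    for sense in entry.get("senses", []):
        sense_el = ET.SubElement(elem, "sense")
        ET.SubElement(sense_el, "def").text = sense["def"]
        # Senses of sub-entries cannot contain examples, so any example
        # data found there is skipped.
        if not is_sub:
            for example in sense.get("examples", []):
                cit = ET.SubElement(sense_el, "cit", {"type": "example"})
                ET.SubElement(cit, "quote").text = example

    # The optional list of sub-entries follows the senses.
    for sub in entry.get("subentries", []):
        elem.append(entry_to_tei(sub, is_sub=True))
    return elem
```

Calling `ET.tostring(entry_to_tei(entry), encoding="unicode")` on such a dictionary yields the serialized markup. Homographs are not handled, in line with the schema described above.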