TEI 2017: Encoding Disappearing Characters: The case of 20 C Japanese-Canadian names

TEI 2017 Victoria, British Columbia, Canada November 11 - 15 #tei2017Vic

Encoding Disappearing Characters: The case of 20 C Japanese-Canadian names (paper)

Stewart Arneil

Stewart Arneil has worked for over 20 years on numerous Digital Humanities projects as a software developer (using largely XML-based technologies) and project manager, and has co-authored a number of papers over that time. He holds an MA in computational epistemology and certification in Instructional Design and Project Management.

1. The problem of disappearing characters

The Landscapes of Injustice project seeks to encode community directories and personal letters created throughout the twentieth century by (and about) the Japanese-Canadian community so that they are accessible to modern audiences. Kanji (Chinese characters used in Japanese script) evolve over time. For example, in 1946, 1981 and 2010 the Japanese government specified new kanji (known as shinjitai) to replace certain old kanji (known as kyūjitai) for most purposes (e.g., schooling). In addition, there are other old kanji (known as hyōgaiji) that lack an official connection to a new kanji and have thus been effectively deprecated.

Our pre-1945 source documents include both classes of old kanji, particularly in personal names. The project is sensitive to the representation of names because the community involved was largely erased from Canadian society in the 1940s. Names containing the old kanji are difficult to search for or read for people who learned Japanese after 1945, so changes to the kanji risk echoing the disappearance from history suffered by the people themselves. The specific problem is how to retain the old kanji for presentation while associating them with current kanji for the purposes of searching and readability.

2. Representing deprecated characters

Unicode has a remarkably complex (even for them) treatment for mapping between old and new kanji (unicode.org, 2017a). Unicode provides Ideographic Variation Sequences for associating code points and publishes a list of Standardized Variants (unicode.org, 2017b). Only some of the kyūjitai – shinjitai pairings and few of the hyōgaiji in our data appear on that list, so we are unable to rely on that Unicode feature. Thus, we sought a TEI encoding that captures all three situations listed in Table 1, and from which we could produce output that is both true to the original and as readable and searchable as possible.

Description	Example old kanji (code point)	Example new kanji (code point)
Pair appears in kyūjitai – shinjitai list and in Unicode Standardized Variants list	社︀ (FA4C)	社 (793E)
Pair appears in kyūjitai – shinjitai list but not in Unicode Standardized Variants list	會 (6703)	会 (4F1A)
Pair appears neither in kyūjitai – shinjitai list nor in Unicode Standardized Variants list	舘 (8218)	館 (9928)
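
As a rough illustration of why the situations in Table 1 behave differently in software, the following Python sketch (standard library only) applies Unicode normalization to the code points from the table. It assumes nothing about the project's own tooling, and the variable names are purely illustrative. The compatibility ideograph U+FA4C is replaced by the unified ideograph U+793E under normalization, whereas the variation sequence U+793E U+FE00 is left intact, and the pairs in the other two rows are not related by normalization at all.

```python
import unicodedata

# Code points taken from Table 1; the names are illustrative only.
OLD_SHA = "\uFA4C"                   # compatibility ideograph (old form of 社)
NEW_SHA = "\u793E"                   # unified ideograph 社
SHA_VARIATION_SEQ = "\u793E\uFE00"   # base character + variation selector 1

# Canonical normalization (NFC) maps the compatibility ideograph to the
# unified ideograph, so the old form is silently replaced.
normalized_old = unicodedata.normalize("NFC", OLD_SHA)

# The variation sequence is left unchanged by NFC.
normalized_seq = unicodedata.normalize("NFC", SHA_VARIATION_SEQ)

# The other two pairs in Table 1 are distinct unified ideographs, so even
# compatibility normalization (NFKC) does not turn the old kanji into the new.
PAIRS = {"\u6703": "\u4F1A", "\u8218": "\u9928"}  # 會 → 会, 舘 → 館
unrelated = all(
    unicodedata.normalize("NFKC", old) != new for old, new in PAIRS.items()
)

print(normalized_old == NEW_SHA, normalized_seq == SHA_VARIATION_SEQ, unrelated)
```

This is one reason an explicit, project-maintained mapping is attractive: it does not depend on how a given processing step happens to normalize its input.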

3. TEI-based treatment

We decided to use TEI to represent each old kanji, the new kanji associated with it, and whether the mapping appears in the kyūjitai – shinjitai list and/or the Unicode Standardized Variants list. TEI’s gaiji module provides elements and attributes to capture those distinctions (tei-c.org, 2016). For example:



<charDecl>
  <glyph xml:id="uFA4C_u793E">
    <mapping type="kyūjitai">社︀</mapping>
    <mapping type="shinjitai">社</mapping>
    <mapping type="uniStdVar">&#x793E;&#xFE00;</mapping>
  </glyph>
</charDecl>

...

<body>
  ... <g ref="#uFA4C_u793E">社︀</g> ...
</body>

The character inside the <g> element is an old kanji. The <mapping> elements in the <glyph> element that it references provide the various ways that character might be represented in output.
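
To make the relationship between <g> and <glyph> concrete, here is a minimal Python sketch of how a processor might resolve a reference to its mappings. It is an assumption-laden illustration, not the project's pipeline: the sample document is a cut-down, namespace-free copy of the example above (a real TEI file would use the TEI namespace), the characters are written as numeric references so that they stay visible, and the function names load_glyph_mappings and resolve are hypothetical.

```python
import xml.etree.ElementTree as ET

# ElementTree exposes xml:id under the XML namespace URI.
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"

# A cut-down, namespace-free document modelled on the example above.
SAMPLE = """<TEI>
  <charDecl>
    <glyph xml:id="uFA4C_u793E">
      <mapping type="kyūjitai">&#xFA4C;</mapping>
      <mapping type="shinjitai">&#x793E;</mapping>
      <mapping type="uniStdVar">&#x793E;&#xFE00;</mapping>
    </glyph>
  </charDecl>
  <body><p><g ref="#uFA4C_u793E">&#xFA4C;</g></p></body>
</TEI>"""


def load_glyph_mappings(root):
    """Return {glyph id: {mapping type: text}} for every <glyph> in the tree."""
    table = {}
    for glyph in root.iter("glyph"):
        table[glyph.get(XML_ID)] = {
            m.get("type"): (m.text or "") for m in glyph.iter("mapping")
        }
    return table


def resolve(g, table, mapping_type):
    """Look up the requested mapping for a <g> element via its ref attribute."""
    glyph_id = g.get("ref", "").lstrip("#")
    return table[glyph_id][mapping_type]


root = ET.fromstring(SAMPLE)
mappings = load_glyph_mappings(root)
first_g = next(root.iter("g"))
old_form = first_g.text                              # character as encoded in the text
new_form = resolve(first_g, mappings, "shinjitai")   # current kanji for search

print(old_form, new_form)
```

Keeping every alternative in the glyph declaration means the choice of which form to emit can be deferred to the output stage.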

4. Training encoders of texts containing deprecated characters

Identifying and classifying the old kanji correctly, and then making the correct association with a current kanji, requires a certain degree of knowledge of Japanese, the Unicode standard and TEI-XML. I will describe the implications for training suitable encoders: whether to train those with the necessary reading skills (largely elderly and not technically proficient) in the relevant TEI and Unicode, to train TEI-capable Japanese speakers (largely early-adult history students) to deal with the old kanji, or to use some combination of the two.

5. Output Issues and Conclusions

The next phase of the project focuses on processing the TEI to represent these characters in outputs (e.g., XHTML) so that users can see and search for both the old and new kanji. I will finish with a discussion of the implications of different output representations for usability, and conclude with the importance of the TEI encoding in light of issues raised by Unicode’s Standardized Variants and the behaviour of rendering agents.
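
One possible output representation, sketched here in Python purely as a hypothetical option rather than a design the project has settled on, is to display the old kanji while carrying the new kanji in attributes, and to build a separate search key in which each old kanji is replaced by its current equivalent. The class name kyujitai, the data-shinjitai attribute, and the function names are all assumptions made for this illustration.

```python
from html import escape


def render_kanji(old, new):
    """Display the old kanji and carry the new kanji in attributes."""
    return '<span class="kyujitai" title="{0}" data-shinjitai="{0}">{1}</span>'.format(
        escape(new, quote=True), escape(old)
    )


def search_key(text, old_to_new):
    """Return a copy of text with each old kanji replaced by its new kanji.

    This works one code point at a time, so it does not handle old forms
    that are encoded as a base character plus a variation selector.
    """
    return "".join(old_to_new.get(ch, ch) for ch in text)


# Pairs from rows two and three of Table 1: 會 → 会, 舘 → 館.
OLD_TO_NEW = {"\u6703": "\u4F1A", "\u8218": "\u9928"}

markup = render_kanji("\u6703", "\u4F1A")
key = search_key("\u6703\u8218", OLD_TO_NEW)

print(markup)
print(key)
```

Indexing the search key alongside the displayed text would let a query typed with current kanji find a name that is shown in its original form.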

Bibliography

unicode.org. 2017a. The Unicode® Standard Version 10.0 – Core Specification, 23. http://www.unicode.org/versions/Unicode10.0.0/ch23.pdf. Accessed 30 June 2017.
unicode.org. 2017b. Specification of the variation sequences that are defined in the Unicode Standard, v10.0.0. http://unicode.org/Public/UCD/latest/ucd/StandardizedVariants.txt. Accessed 30 June 2017.
tei-c.org. 2016. P5: Guidelines for Electronic Text Encoding and Interchange. http://www.tei-c.org/Vault/P5/3.2.0/doc/tei-p5-doc/en/html/WD.html#D25-20. Accessed 30 June 2017.