Encoding Disappearing Characters: The Case of 20th-Century Japanese-Canadian Names (paper)
Stewart Arneil
Stewart Arneil has worked for over 20 years on numerous Digital Humanities projects as a software developer (using largely XML-based technologies) and project manager, and has co-authored a number of papers over that time. He holds an MA in computational epistemology and certification in Instructional Design and Project Management.
1. The problem of disappearing characters
The Landscapes of Injustice project seeks to encode community directories and personal letters created
throughout the twentieth century by (and about) the Japanese-Canadian community so they are accessible
to modern audiences. Kanji (Chinese characters used in Japanese script) evolve over time. For
example, in 1946, 1981 and 2010 the Japanese government specified new kanji (known as shinjitai) to
replace certain old kanji (known as kyūjitai) for most purposes (e.g., schooling). In addition, there are
other old kanji which lack an official connection to a new kanji and have thus been effectively
deprecated (known as hyōgaiji).
Our pre-1945 source documents include both classes of old kanji, particularly in personal names. The
project is sensitive to the representation of names as the community of people involved was largely
erased from Canadian society in the 1940s. Changes to the kanji thus risk echoing the disappearance
from history suffered by the actual people because names containing the old kanji are difficult to search
or read for people who learned Japanese after 1945. The specific problem is how to retain the old kanji
for presentation and associate them with current kanji for the purposes of searching and readability.
2. Representing deprecated characters
Unicode's treatment of the mapping between old and new kanji is remarkably complex, even by its own standards
(unicode.org, 2017a). Unicode provides Ideographic Variation Sequences for associating code points
and publishes a list of Standardized Variants (unicode.org, 2017b). Only some of the shinjitai / kyūjitai
pairings and few of the hyōgaiji kanji in our data appear on that list, so we are unable to rely on that
Unicode feature. Thus, we sought a TEI encoding to capture all three situations listed in Table 1, from
which we could produce output both true to the original and as readable and searchable as possible.
Description | Example old kanji (code point) | Example new kanji (code point) |
Pair appears in kyūjitai – shinjitai list and in Unicode Standardized Variants list | 社︀ (FA4C) | 社 (793E) |
Pair appears in kyūjitai – shinjitai list but not in Unicode Standardized Variants list | 會 (6703) | 会 (4F1A) |
Pair appears in neither the kyūjitai – shinjitai list nor the Unicode Standardized Variants list | 舘 (8218) | 館 (9928) |
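The difference between the three rows can be checked directly. The following is a minimal Python sketch, not part of the project's tooling: the names PAIRS, survives_nfc and normalizes_to are illustrative, and it assumes the first old kanji is stored as the compatibility ideograph U+FA4C. It uses the standard unicodedata module to test what NFC normalization does to each old kanji in Table 1.

```python
import unicodedata

# Old/new pairs from Table 1, written as code points so that nothing
# depends on how an editor or browser displays the characters.
PAIRS = [
    (0xFA4C, 0x793E),  # compatibility ideograph for 社 / 社
    (0x6703, 0x4F1A),  # 會 / 会
    (0x8218, 0x9928),  # 舘 / 館
]


def survives_nfc(code_point: int) -> bool:
    """Return True if the character is left unchanged by NFC normalization."""
    ch = chr(code_point)
    return unicodedata.normalize("NFC", ch) == ch


def normalizes_to(old: int, new: int) -> bool:
    """Return True if NFC normalization turns the old character into the new one."""
    return unicodedata.normalize("NFC", chr(old)) == chr(new)


# One (label, survives NFC?, normalizes to the new kanji?) triple per row.
report = [
    (f"U+{old:04X}", survives_nfc(old), normalizes_to(old, new))
    for old, new in PAIRS
]

for label, survives, linked in report:
    print(label, "survives NFC:", survives, "| normalizes to new kanji:", linked)
```

Under these assumptions only U+FA4C is rewritten by normalization (to U+793E). The other two old kanji pass through unchanged and have no normalization link to their new forms, so any association between them has to be recorded explicitly.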
3. TEI-based treatment
We decided to use TEI to represent each old kanji, the new kanji associated with each old kanji, and
whether the mapping appears in the kyūjitai – shinjitai list and/or the Standardized Variants list. TEI’s gaiji
module provides elements and attributes to capture those distinctions (tei-c.org, 2016). In the following
example, the character inside the <g> element is an old kanji, and the <mapping> elements in the related
<glyph> element provide the various ways that character might be represented in output.
<charDecl>
<glyph xml:id="uFA4C_u793E">
<mapping type="kyūjitai">社︀</mapping>
<mapping type="shinjitai">社</mapping>
<mapping type="uniStdVar">社&#xFE00;</mapping>
</glyph>
</charDecl>
...
<body>
... <g ref="#uFA4C_u793E">社︀</g> ...
</body>
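To illustrate how such an encoding might be consumed, here is a minimal Python sketch using only the standard library. It is not the project's processing code: the SAMPLE fragment is a cut-down, namespace-free version of the example above, the surrounding letters A and B are placeholder text, the function names load_mappings and render are hypothetical, and the old kanji is assumed to be stored as U+FA4C. Numeric character references are used so the code points stay visible.

```python
import xml.etree.ElementTree as ET

# ElementTree expands the built-in xml: prefix to this namespace.
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"

# Simplified fragment modelled on the example above (assumption: a real
# TEI file would also carry the TEI namespace and a full header).
SAMPLE = """<TEI>
  <charDecl>
    <glyph xml:id="uFA4C_u793E">
      <mapping type="kyūjitai">&#xFA4C;</mapping>
      <mapping type="shinjitai">&#x793E;</mapping>
      <mapping type="uniStdVar">&#x793E;&#xFE00;</mapping>
    </glyph>
  </charDecl>
  <body><p>A<g ref="#uFA4C_u793E">&#xFA4C;</g>B</p></body>
</TEI>"""


def load_mappings(root):
    """Map each glyph's xml:id to a {mapping type: text} dictionary."""
    return {
        glyph.get(XML_ID): {m.get("type"): m.text or "" for m in glyph.iter("mapping")}
        for glyph in root.iter("glyph")
    }


def render(elem, mappings, mapping_type):
    """Flatten elem to text, replacing each <g> with the requested mapping.

    If a <g> has no mapping of that type, its own content is kept.
    """
    if elem.tag == "g":
        glyph = mappings.get(elem.get("ref", "").lstrip("#"), {})
        text = glyph.get(mapping_type, elem.text or "")
    else:
        text = (elem.text or "") + "".join(
            render(child, mappings, mapping_type) for child in elem
        )
    return text + (elem.tail or "")


root = ET.fromstring(SAMPLE)
mappings = load_mappings(root)
paragraph = root.find("body/p")

display_text = render(paragraph, mappings, "kyūjitai")   # true to the source
search_text = render(paragraph, mappings, "shinjitai")   # for modern readers
```

Keeping the lookup table separate from the rendering step means the same encoded text can yield a display form and a search form without being re-encoded.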
4. Training encoders of texts containing deprecated characters
Identifying and classifying the old kanji correctly, and then making the correct association with a
current kanji, requires some knowledge of Japanese, the Unicode standard and TEI-XML.
I will describe the implications for training suitable encoders: whether to train those with reading skills
(largely elderly and not technically proficient) in the relevant TEI and Unicode, or train TEI-capable
Japanese speakers (largely early-adult history students) to deal with the old kanji, or some combination.
5. Output Issues and Conclusions
The next phase of the project focuses on processing the TEI to represent these characters in outputs (e.g.,
XHTML) so users can see and search both the old and new kanji. I will finish with a discussion of the
implications of different output representations for usability and conclude with the importance of the
TEI encoding in light of issues raised by Unicode’s Standardized Variants and the behaviour of
rendering agents.
Bibliography
- unicode.org. 2017a. The Unicode® Standard Version 10.0 – Core Specification, 23. http://www.unicode.org/versions/Unicode10.0.0/ch23.pdf. Accessed 30 June 2017.
- unicode.org. 2017b. Specification of the variation sequences that are defined in the Unicode Standard, v10.0.0. http://unicode.org/Public/UCD/latest/ucd/StandardizedVariants.txt. Accessed 30 June 2017.
- tei-c.org. 2016. P5: Guidelines for Electronic Text Encoding and Interchange. http://www.tei-c.org/Vault/P5/3.2.0/doc/tei-p5-doc/en/html/WD.html#D25-20. Accessed 30 June 2017.