
Digitization of the Florence A. Kirk Correspondence Series

Background

Florence A. Kirk was a teacher at Regina College. In 1932 she went to China to teach at Ginling College in Nanking. Over the next 18 years she corresponded with family and friends. These letters were saved and have become the Florence A. Kirk Correspondence series held in Archives and Special Collections at the University of Regina.

In October 1991, Ron Robbins, retired former head of the School of Journalism at the University of Regina, gave Florence Kirk’s letters to the Archives. This was shortly after Kirk published Sunshine and storm : a Canadian teacher in China, 1932-1950. Kirk drew on her letters in writing the book, which reproduces portions of some of them. The letters are thus a kind of draft manuscript for the book.

The letters

The series was digitized to make the letters available to the people of Saskatchewan. The Saskatchewan War Experience project has further aims: digitization capacity, interoperability, and collaboration. Beyond providing online access, the digitization also establishes and develops the historical contexts of the letters themselves, and the wider contexts of which the series is, or may become, a part.

The series consists of 413 documents. Some of them are publications of Ginling College; these have not been digitized and will not be until permissions are obtained. Excluding these, Kirk’s correspondence totals 389 documents.

Most of the letters are indeed ‘letter’ sized, but paper sizes were not standardized then as they are today, so the sizes vary considerably. Many of the letters are on a translucent paper (possibly rice paper). These variations in size and media made the digitization more challenging. Kirk’s early letters were generally hand-written; after arriving in Nanking, she used a typewriter. Some correspondence was on ‘ready-to-send’ envelope-letters, and some was on telegram forms with a designated message area.

Digitization equipment and software

All correspondence was digitized on a Hewlett Packard hp scanjet 8250, and the resulting scans were saved as 200 dpi TIFF images, with each page of each leaf saved as a separate file.

Optical Character Recognition was performed on the type-written correspondence with ABBYY FineReader to produce searchable PDF documents.

Optical character recognition and transcription

The scans of each hand-written letter were combined to produce non-searchable PDF documents. Experiments in recognizing Kirk’s handwriting indicated that automated script recognition would need so much verification and correction that the letters would be better transcribed by hand. To date, no letters have been transcribed.

For both hand-written and type-written letters, the translucent paper presents the problem that the writing on the opposite side of the leaf shows through. Kirk’s letter of July 15, 1948 is an example. This is no great problem for the human reader, but it made ‘reading’ more difficult for optical character recognition software. The FineReader software worked very well but, not surprisingly, pages with more ‘bleed-through’ produced more errors or lower-confidence results.
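One way to direct human checking toward the pages most affected by bleed-through is to sort pages by recognition confidence. The following is a minimal sketch only, assuming that a mean confidence value between 0.0 and 1.0 is available for each page; the PageResult record, the threshold, and the sample values are all hypothetical and do not come from FineReader or from the Kirk project.

```python
from dataclasses import dataclass

@dataclass
class PageResult:
    letter_id: str     # hypothetical identifier, here the letter date
    page: int
    confidence: float  # assumed mean recognition confidence, 0.0 to 1.0

def pages_needing_review(results, threshold=0.90):
    """Return the pages below the threshold, lowest confidence first."""
    flagged = [r for r in results if r.confidence < threshold]
    return sorted(flagged, key=lambda r: r.confidence)

# Invented values for illustration only.
results = [
    PageResult("1948-07-15", 1, 0.81),
    PageResult("1948-07-15", 2, 0.95),
    PageResult("1946-07-14", 3, 0.62),
]

for r in pages_needing_review(results):
    print(r.letter_id, r.page, r.confidence)
```

The threshold is a judgment call: a lower value sends fewer pages to human readers but lets more recognition errors through.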

Preservation of appearance

The final type-written PDF documents were prepared by placing the scanned graphic image atop the recognized text. In this way the qualities and appearance of the original letters are preserved in these digital renditions, while text searching and copying remain possible.

Other preserved qualities are the annotations and corrections made on both hand-written and type-written letters. However, they present the problem of how to ‘digitize’ them, that is, how to make them machine-readable, and then how to integrate them with the document and the flow of information, making them searchable like the body of the text.

Encoding annotation and correction

Corrections are usually placed so that their location is unambiguous. The human reader can understand both the erroneous and the corrected text, but the machine reader needs an unambiguous encoding. Kirk made two kinds of correction. In hand-written corrections, she wrote over the erroneous text. In type-written corrections, she often overstruck the erroneous text with ‘x’s and then typed the correct text after it. In the former case, only the corrected text was rendered into the PDF document. In the latter case, both the overstrike ‘text’ and the corrected text were rendered. This would seem to be an opportunity to adopt a standard means of digitally rendering corrections and annotations in documents. Kirk’s letter of November 18, 1945 has examples of both kinds of correction.
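As a minimal sketch of what an unambiguous encoding of overstrikes might look like, the following assumes that the recognized text is available as a plain string and that a run of three or more ‘x’ characters marks overstruck text. Both are assumptions made for illustration; this is not the method used for the Kirk letters. The names Correction, find_overstrikes, and search_text are hypothetical, and the sample sentence is invented.

```python
import re
from dataclasses import dataclass

# Assumption: three or more x's standing alone mark overstruck text.
OVERSTRIKE = re.compile(r"\b[xX]{3,}\b")

@dataclass
class Correction:
    kind: str     # "overstrike" (typed x's) or "overwrite" (written over)
    start: int    # character offset in the recognized text
    end: int
    deleted: str  # the struck-out characters as recognized

def find_overstrikes(text):
    """Record each overstrike run with its position in the text."""
    return [Correction("overstrike", m.start(), m.end(), m.group())
            for m in OVERSTRIKE.finditer(text)]

def search_text(text):
    """Drop overstrike runs and collapse the whitespace they leave behind."""
    return re.sub(r"\s+", " ", OVERSTRIKE.sub("", text)).strip()

sample = "We reached xxxxxxx Nanking on Tuesday."  # invented example
print(find_overstrikes(sample))
print(search_text(sample))
```

Keeping the corrections as separate records with offsets leaves the recognized text untouched, so both the overstruck and the corrected readings stay recoverable. The pattern would also match a legitimate token such as the Roman numeral XXX, so each match would still need checking by a human reader.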

Annotations and the text they annotate are more ambiguous. An annotation may not fit into the flow of the document, and it may not refer to a single place in the document. Furthermore, whereas most of the corrections were likely made by Kirk, the author(s) of the annotations are less certain. Any solution for the digital rendition of corrections could also work for annotations, with extra consideration given to referents and authorship. The graphic layering preserves the annotations in the rendition, but no digital encoding of them was provided for the Kirk letters. Kirk’s letter of November 12, 1933 has many annotations. Some of them are easier to associate with the text than others.
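The extra considerations for annotations, referent and author, can be sketched as optional fields on a record kept apart from the body text. This is an illustration under stated assumptions, not an encoding that was applied to the Kirk letters: the Annotation record, the find function, and the sample text are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

@dataclass
class Annotation:
    text: str
    page: int
    # Character span in the body that the note refers to;
    # None when the note has no single referent.
    target: Optional[Tuple[int, int]] = None
    # None when the annotator cannot be identified.
    author: Optional[str] = None

def find(term, body, annotations):
    """Report where a search term occurs: in the body, in annotations, or both."""
    term = term.lower()
    hits = []
    if term in body.lower():
        hits.append("body")
    hits.extend("annotation" for a in annotations if term in a.text.lower())
    return hits

# Invented sample text for illustration only.
body = "Classes at Ginling began this week."
notes = [
    Annotation("see photograph", page=1),                 # no referent, unknown author
    Annotation("second term", page=1, target=(11, 18)),   # refers to "Ginling"
]
print(find("photograph", body, notes))
print(find("ginling", body, notes))
```

Because the referent and the author are optional, an uncertain annotation can still be made searchable without asserting more than is known about it.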

The telegram correspondence, with designated message area, presents another problem. Should the telegram form and instructions, outside the letter text, be encoded? That is, should the text needed for the communication of the telegram not be rendered, so that electronic searches in the document ignore the text not written by Kirk? Here again, a human reader can differentiate between form and content, but a machine reader needs some unambiguous standard in order to differentiate the two. For the Kirk telegram letters, only Kirk’s text was encoded. Kirk’s second letter of May 18, 1944 is such a telegram.
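The distinction between form and content could be made explicit by labelling each block of recognized text with a role, and indexing only the sender’s text. The sketch below assumes the blocks have already been separated and labelled, presumably by a human reader; the Zone record, the role names, and the sample wording are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class Zone:
    role: str  # "form" for pre-printed text, "message" for the sender's text
    text: str

def message_text(zones):
    """Join only the sender's text, leaving the pre-printed form out of the index."""
    return " ".join(z.text for z in zones if z.role == "message")

# Invented wording for illustration only.
telegram = [
    Zone("form", "WRITE MESSAGE IN BLOCK LETTERS"),
    Zone("message", "ARRIVED SAFELY"),
    Zone("form", "SENDER'S ADDRESS"),
    Zone("message", "LETTER FOLLOWS"),
]
print(message_text(telegram))
```

Because the form text is labelled rather than discarded, it could still be rendered or searched separately if that were wanted later.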

Recognition problems

Another problem in optical character recognition was that Kirk’s type-written letters were just that: typed on a typewriter. Modern printers produce text with consistent line spacing, character spacing, and character density. Kirk likely had a high-quality typewriter for its day, but variations in the quality of the type made character recognition difficult. In a few letters, Kirk set the line spacing so small that some characters overlapped and the FineReader software could not recognize them. In some letters, the characters were too close together to be accurately recognized, and in others they were too faint. The FineReader software performed excellently, and where it encountered problems, our human readers, Archives staff, did the recognition, so that correctly rendered PDF documents were produced. Note the overlapping lines at the end of Kirk’s letter of July 14, 1946.

Two additional problems were hyphenation and the spelling of Chinese personal names and place names. For the type-written letters, FineReader was usually able to recognize hyphenated text correctly. Spelling is a different matter: before Pinyin romanization was adopted, Nanjing was spelled Nanking. Again, human readers or researchers will know that these ‘two’ places are the same, but machine readers or search engines might not.
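Both problems can be reduced at search time. The following is a rough sketch, assuming the recognized text is available as a plain string with line breaks preserved: words split by a hyphen at a line end are rejoined, and known spelling variants are mapped to a single form before comparison. The function names and the sample sentence are hypothetical, and the variant table holds only the one pair mentioned above.

```python
import re

# Older romanization -> Pinyin. Only the pair discussed above;
# a real table would need to be built by someone who knows the letters.
VARIANTS = {"nanking": "nanjing"}

def join_hyphenated(text):
    """Rejoin words split by a hyphen at a line end: 'Nan-\\nking' -> 'Nanking'."""
    return re.sub(r"(\w)-\n(\w)", r"\1\2", text)

def normalize(token):
    """Lower-case a word and map a known variant spelling to its Pinyin form."""
    token = token.lower()
    return VARIANTS.get(token, token)

def matches(query, text):
    """True if the query word, or a known variant of it, occurs in the text."""
    words = {normalize(w) for w in re.findall(r"\w+", join_hyphenated(text))}
    return normalize(query) in words

sample = "I arrived in Nan-\nking yesterday."  # invented example
print(matches("Nanjing", sample))
print(matches("Nanking", sample))
```

Rejoining at line ends also removes the hyphen from genuine compounds that happen to break there (‘type-’ and ‘written’ would become ‘typewritten’), so the original text should be kept and the joined form used only for searching.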

Metadata

The digitized series is part of the Saskatchewan War Experience online collection, so that collection’s schema is used to record the Correspondence metadata. The variety of the letters can be seen in two schema fields: Physical Extent and Dimensions. Here, the number of leaves, the number of pages, and the sizes of the leaves and pages are recorded.
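As an illustration of how these two fields might be filled in consistently, the sketch below builds both strings from counts and measurements. Only the field names Physical Extent and Dimensions come from the schema as described here; the wording of the values, the use of centimetres, and the LetterDescription record are assumptions.

```python
from dataclasses import dataclass
from typing import List, Tuple

def count(n, singular, plural):
    """Format a count with the matching singular or plural noun."""
    return f"{n} {singular if n == 1 else plural}"

@dataclass
class LetterDescription:
    leaves: int
    pages: int
    leaf_sizes_cm: List[Tuple[float, float]]  # (width, height), one per distinct size

    def physical_extent(self):
        return f"{count(self.leaves, 'leaf', 'leaves')}, {count(self.pages, 'page', 'pages')}"

    def dimensions(self):
        return "; ".join(f"{w:g} x {h:g} cm" for w, h in self.leaf_sizes_cm)

# Invented measurements for illustration only.
letter = LetterDescription(leaves=2, pages=3, leaf_sizes_cm=[(21.5, 28.0)])
print(letter.physical_extent())
print(letter.dimensions())
```

Generating the strings from numbers keeps the wording uniform across all the records, which matters when the fields are later searched or sorted.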

Future work

The Kirk Correspondence is a fascinating series of letters, and more work will certainly be done with it in the future. This work could include the transcription of the hand-written letters and the encoding of annotations and corrections. Beyond the University Archives, the series could be connected with other collections and materials. In this way, the contexts of the Florence A. Kirk letters are developed through their availability to all visitors and researchers, in Saskatchewan and elsewhere.


Donald Johnson, Archives & Special Collections
May, 2011


References

Katrin Franke, Isabelle Guyon, Lambert Schomaker, and Louis Vuurpijl,
“The WANDAML markup language for digital document annotation” in
IWFHR '04 Proceedings of the Ninth International Workshop on Frontiers in Handwriting Recognition.
IEEE Computer Society. 2004. pp. 563-568.

Florence A. Kirk,
Sunshine and storm : a Canadian teacher in China, 1932-1950,
Kuai-lo Books, Victoria, B.C., Canada, 1991.

Michelle Light and Tom Hyry,
“Colophons and Annotations: New Directions for the Finding Aid” in
American Archivist, Vol. 65, Fall/Winter 2002. pp. 216-230.

ABBYY,
FineReader,
Milpitas, California, USA.

Society of American Archivists,
Diplomatics (definition and citations),
Chicago, Illinois, USA.


Last Revision: 2011-May-12