Mapping between Images and Texts of Completed Collection of Graphs and Writings of Ancient and Modern Times
Date Issued
2015
Date
2015
Author(s)
Chen, Kuan-Chung
Abstract
The Complete Collection of Graphs and Writings of Ancient and Modern Times (Gujintushujicheng, or Jicheng for short), completed in the early 18th century, is the largest extant book in the world. Containing over one million Chinese characters and almost 100,000 pages, and covering over 6,000 subjects, Jicheng is also difficult to use. During the past decade, several digital systems have been developed so that people can use Jicheng through full-text search. However, none of these systems attempted to match images and texts, which would make Jicheng even easier to use. This difficulty arises partly because OCR is still not an effective technology for old Chinese books. In this thesis we develop a method that tries to find a direct correspondence between a page image of Jicheng and its associated text without resorting to OCR. We first calibrate the images so that all 100,000 pages in the book have the same size and format. We then analyze characteristics such as the format, the number of lines, and the position of graphs, so that each line in the typed text maps to a line of text, a blank line, or part of a graph in a page image. Once this is done, we map each character in the typed text to a character in the page image. Our method is quite effective: the accuracy in mapping the entire content of Jicheng is 98.7%. The remaining mismatches are mainly due to typographical errors introduced when the full text was typed, which can easily be corrected by hand.
Subjects
Gujintushujicheng
Digital Humanities
Image Processing
Type
thesis
File(s)
Name
ntu-104-R99922126-1.pdf
Size
23.32 KB
Format
Adobe PDF
Checksum
(MD5):2da19f0e19758b5b51c8080002a0087b
