Statistically-based Model for Computer-Aided Transcription Application

Benjamin K. Tsou, Tom B. Y. Lai, Samuel W. K. Chan, Lawrence Y. L. Cheung, K. T. Ko, Gary K. K. Chan
2000-01-01
Abstract: The recent implementation of bilingualism in the Common Law system in Hong Kong has brought about an urgent need to develop a Computer-Aided Transcription (CAT) system to efficiently produce verbatim records of court proceedings conducted in Cantonese Chinese. The Cantonese Chinese CAT system essentially converts phonologically-based shorthand code, or stenograph code, into orthographic representation in Chinese characters. One major challenge in developing a Cantonese Chinese CAT system is resolving the ambiguity among homophonous Chinese characters that share identical stenograph code. To solve this problem, a bigram model is used as the language model. We implemented the Viterbi algorithm to efficiently compute the most likely Chinese character string for each sequence of stenograph code input. The CAT system is trained on a corpus of 0.85 million characters. By incorporating enhancement features such as earmarked treatment of numerals, special encoding and domain-specific transcription, the Cantonese Chinese CAT system achieves up to 96% transcription accuracy.
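The decoding step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the toy corpus, the romanized codes `faat` and `gun` standing in for stenograph codes, and the add-one smoothing are all assumptions made for illustration, since the abstract does not specify them. The sketch trains character bigram counts and then uses the Viterbi algorithm to pick the most likely character string among the homophonous candidates for each code.

```python
import math
from collections import defaultdict

START = "<s>"  # hypothetical sentence-start marker


def train_bigram(corpus):
    """Count character unigrams and bigrams over a list of sentences."""
    unigrams = defaultdict(int)
    bigrams = defaultdict(int)
    for sentence in corpus:
        prev = START
        unigrams[prev] += 1
        for ch in sentence:
            unigrams[ch] += 1
            bigrams[(prev, ch)] += 1
            prev = ch
    return unigrams, bigrams


def log_prob(prev, ch, unigrams, bigrams, vocab_size):
    """log P(ch | prev) with add-one smoothing (an assumption, not from the paper)."""
    numerator = bigrams.get((prev, ch), 0) + 1
    denominator = unigrams.get(prev, 0) + vocab_size
    return math.log(numerator / denominator)


def viterbi(codes, lexicon, unigrams, bigrams):
    """Return the most likely character string for a sequence of codes.

    lexicon maps each code to its list of candidate (homophonous) characters.
    """
    vocab_size = len(unigrams)
    # best[ch] = (log score of the best path ending in ch, that path)
    best = {START: (0.0, [])}
    for code in codes:
        new_best = {}
        for ch in lexicon[code]:
            # Keep only the highest-scoring predecessor for each candidate.
            score, path = max(
                (
                    (s + log_prob(prev, ch, unigrams, bigrams, vocab_size), p)
                    for prev, (s, p) in best.items()
                ),
                key=lambda item: item[0],
            )
            new_best[ch] = (score, path + [ch])
        best = new_best
    return "".join(max(best.values(), key=lambda item: item[0])[1])


# Toy data, invented for illustration only.
corpus = ["法官", "法官", "法庭", "發現"]
lexicon = {"faat": ["法", "發"], "gun": ["官", "觀"]}

unigrams, bigrams = train_bigram(corpus)
result = viterbi(["faat", "gun"], lexicon, unigrams, bigrams)
print(result)
```

Because only the best path into each candidate character is kept at every step, the work grows linearly with the length of the code sequence rather than with the number of possible character strings, which is what makes the search efficient.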