Jurilinguistic engineering in Cantonese Chinese: an N-gram-based speech to text transcription system

Benjamin K. Tsou, K. K. Sin, Samuel W. K. Chan, Tom B. Y. Lai, Caesar Suen Lun, K. T. Ko, Gary K. K. Chan, Lawrence Y. L. Cheung
DOI: https://doi.org/10.3115/992730.992817
2000-01-01
Abstract: A Cantonese Chinese transcription system to automatically convert stenograph code to Chinese characters is reported. The major challenge in developing such a system is the critical homocode problem because of homonymy. The statistical N-gram model is used to compute the best combination of characters. Supplemented with a 0.85 million character corpus of domain-specific training data and enhancement measures, the bigram and trigram implementations achieve 95% and 96% accuracy respectively, as compared with 78% accuracy in the baseline model. The system performance is comparable with other advanced Chinese Speech-to-Text input applications under development. The system meets an urgent need of the Judiciary of post-1997 Hong Kong.
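To illustrate how an N-gram model can resolve homocodes, the following is a minimal sketch of bigram decoding with a Viterbi-style search. It is not the authors' implementation. The toy corpus, the Jyutping-like keys standing in for stenograph codes, the add-one smoothing, and the names `CORPUS`, `CANDIDATES`, `train_bigrams`, `log_prob`, and `decode` are all assumptions made for illustration.

```python
import math
from collections import Counter

# Toy training text (hypothetical; the paper reports a 0.85 million character corpus).
CORPUS = ["法庭", "法官", "法庭", "頭髮", "停車"]

# Hypothetical code table: each code maps to several homophonous characters.
# The keys are Jyutping-like placeholders, not the paper's stenograph codes.
CANDIDATES = {
    "faat3": ["法", "髮"],
    "ting4": ["庭", "停"],
}

START = "<s>"  # sentence-start marker used as the first bigram context


def train_bigrams(sentences):
    """Count bigrams and their left contexts over character sequences."""
    contexts, bigrams, vocab = Counter(), Counter(), set()
    for sentence in sentences:
        chars = [START] + list(sentence)
        contexts.update(chars[:-1])            # every position that has a successor
        bigrams.update(zip(chars, chars[1:]))  # adjacent character pairs
        vocab.update(sentence)
    return contexts, bigrams, len(vocab)


def log_prob(prev, cur, contexts, bigrams, vocab_size):
    """Add-one smoothed log P(cur | prev); the smoothing choice is an assumption."""
    return math.log((bigrams[(prev, cur)] + 1) / (contexts[prev] + vocab_size))


def decode(codes, contexts, bigrams, vocab_size):
    """Return the highest-scoring character string for a sequence of codes."""
    # best maps the last character of a partial path to (log score, path so far).
    best = {START: (0.0, [])}
    for code in codes:
        step = {}
        for cur in CANDIDATES[code]:
            # Keep only the best-scoring way of reaching `cur` at this position.
            step[cur] = max(
                (score + log_prob(prev, cur, contexts, bigrams, vocab_size), path + [cur])
                for prev, (score, path) in best.items()
            )
        best = step
    return "".join(max(best.values())[1])


contexts, bigrams, vocab_size = train_bigrams(CORPUS)
print(decode(["faat3", "ting4"], contexts, bigrams, vocab_size))
```

On this toy data the two-code sequence decodes to 法庭 ("court"), because the bigram 法 followed by 庭 is frequent in the corpus, while the single code `ting4` on its own decodes to 停, which is the only one of its candidates seen at the start of a sentence. A trigram variant would condition each character on the two preceding characters instead of one.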