Abstract: As an international financial centre, Hong Kong is a metropolitan city that has become increasingly multilingual in recent years. In addition to Cantonese and English, which serve mostly as first and second languages, Hong Kong residents have increasingly begun to develop a third or even a fourth language. The biliteracy and trilingualism (兩文三語) language policy encourages Mandarin as the third language. This paper introduces a corpus-based online pronunciation learning platform that helps Mandarin teachers, learners, and researchers better understand the major problems encountered by Cantonese-speaking learners in Hong Kong when learning Mandarin pronunciation. A phonological corpus was established and analysed in order (a) to identify learners' recurring difficulties in using Mandarin segmental and suprasegmental features accurately and appropriately and (b) to suggest possible solutions to reduce or eliminate such difficulties. The corpus contains recordings of four spoken tasks (reading of monosyllabic words, reading of multisyllabic words, reading of a passage, and free speech) from Hong Kong Cantonese-speaking college students. The phonological annotations of the recordings focus mainly on two areas of segmental features (vowels and consonants), two areas of suprasegmental features (tone and retroflex finals), and mispronunciation. In addition to the corpus, a pronunciation learning website was developed for learners to (a) practise segmental and suprasegmental aspects of pronunciation through a variety of perception and production exercises and (b) discover the possible causes of common Mandarin pronunciation features found in the corpus. Based on the corpus, 40 datasets were analysed, and a checklist of common Mandarin pronunciation errors made by Cantonese learners was made available to teachers and learners. The use and evaluation of the pronunciation learning platform are also introduced and discussed.

Designing and implementing a corpus-based online pronunciation learning platform for Cantonese learners of Mandarin
