Abstract: Modern statistical machine learning (SML) methods share a major limitation with the early approaches to AI: there is no scalable way to adapt them to new domains. Human learning solves this in part by leveraging a rich, shared, updateable world model. Such scalability requires modularity: updating part of the world model should not impact unrelated parts. We have argued that such modularity will require both "correctability" (so that errors can be corrected without introducing new errors) and "interpretability" (so that we can understand which components need correcting). To achieve this, one could attempt to adapt state-of-the-art SML systems to be interpretable and correctable; or one could see how far the simplest possible interpretable, correctable learning methods can take us, and try to control the limitations of SML methods by applying them only where needed. Here we focus on the latter approach and investigate two main ideas: "Teacher Assisted Learning", which leverages crowdsourcing to learn language; and "Factored Dialog Learning", which factors the process of application development into roles where the language competencies needed are isolated, enabling non-experts to quickly create new applications. We test these ideas in an "Automated Personal Assistant" (APA) setting, with two scenarios: detecting user intent from a user-APA dialog; and creating a class of event reminder applications, from which a non-expert "teacher" can then create specific apps. For the intent detection task, we use a dataset of a thousand labeled utterances from user dialogs with Cortana, and we show that our approach matches state-of-the-art SML methods while also providing full transparency: the whole (editable) model can be summarized on one human-readable page. For the reminder app task, we ran small user studies to verify the efficacy of the approach.
