Abstract: Predictive learning models, which aim to predict future frames based on past observations, are crucial to constructing world models. These models need to maintain low-level consistency and capture high-level dynamics in unannotated spatiotemporal data. Transitioning from frame-wise to token-wise prediction is a viable strategy for addressing these needs. However, improving token representation and optimizing token decoding remain significant challenges. This paper introduces PredToken, a novel predictive framework that addresses these issues by decoupling space-time tokens into distinct components for iterative cascaded decoding. Concretely, we first design a “decomposition, quantization, and reconstruction” scheme based on VQGAN to improve the token representation. This scheme disentangles low- and high-frequency representations and employs a dimension-aware quantization model, allowing more low-level details to be preserved. Building on this, we present a “coarse-to-fine iterative decoding” method. It leverages dynamic soft decoding to refine coarse tokens and static soft decoding for fine tokens, enabling more high-level dynamics to be captured. Together, these designs enable PredToken to produce high-quality predictions. Extensive experiments demonstrate the superiority of our method on various real-world spatiotemporal predictive benchmarks. Furthermore, PredToken can be extended to other visual generative tasks to yield realistic outcomes.
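
The “decomposition, quantization, and reconstruction” idea from the abstract can be illustrated with a toy one-dimensional sketch. This is not the PredToken implementation: the box filter, the scalar codebooks, and every function name below are assumptions chosen for illustration, standing in for the VQGAN-based encoder and the dimension-aware quantization model that the abstract mentions but does not specify.

```python
def box_blur(signal, radius=1):
    """Low-pass filter (assumed): average each sample with its neighbours."""
    n = len(signal)
    out = []
    for i in range(n):
        lo, hi = max(0, i - radius), min(n, i + radius + 1)
        window = signal[lo:hi]  # window is truncated at the signal edges
        out.append(sum(window) / len(window))
    return out


def quantize(values, codebook):
    """Map each value to the index of its nearest codebook entry."""
    return [
        min(range(len(codebook)), key=lambda k: abs(codebook[k] - v))
        for v in values
    ]


def dequantize(indices, codebook):
    """Look up the codebook entry for each token index."""
    return [codebook[k] for k in indices]


def encode(signal, low_codebook, high_codebook, radius=1):
    """Decompose into low/high-frequency parts, then quantize each separately."""
    low = box_blur(signal, radius)
    high = [s - l for s, l in zip(signal, low)]  # residual detail
    return quantize(low, low_codebook), quantize(high, high_codebook)


def decode(low_tokens, high_tokens, low_codebook, high_codebook):
    """Reconstruct by summing the dequantized low and high components."""
    low = dequantize(low_tokens, low_codebook)
    high = dequantize(high_tokens, high_codebook)
    return [l + h for l, h in zip(low, high)]


if __name__ == "__main__":
    low_cb = [0.0, 0.5, 1.0]      # hypothetical coarse codebook
    high_cb = [-0.5, 0.0, 0.5]    # hypothetical detail codebook
    low_tokens, high_tokens = encode([0.0, 1.0, 0.0, 1.0], low_cb, high_cb)
    print(low_tokens, high_tokens)
    print(decode(low_tokens, high_tokens, low_cb, high_cb))
```

Keeping a separate codebook for the residual lets the detail component be represented with its own tokens instead of being averaged away with the coarse component, which is the intuition behind preserving low-level detail in this toy setting.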

Joint Tokenization and Translation

Joint tokenization, parsing, and translation

Joint Decoding with Multiple Translation Models.

Joint Parsing and Translation

Better Simultaneous Translation with Monotonic Knowledge Distillation.

Joint Training and Decoding Using Virtual Nodes for Cascaded Segmentation and Tagging Tasks.

PredToken: Predicting Unknown Tokens and Beyond with Coarse-to-Fine Iterative Decoding

Synchronous Speech Recognition and Speech-to-Text Translation with Interactive Decoding.

Joint Training for Pivot-based Neural Machine Translation.

Token-Level Serialized Output Training for Joint Streaming ASR and ST Leveraging Textual Alignments

Joint Decoding of Tandem and Hybrid Systems for Improved Keyword Spotting on Low Resource Languages

Sequence Generation with Mixed Representations.

Lattice-based System Combination for Statistical Machine Translation.

Flexible and Efficient Hypergraph Interactions for Joint Hierarchical and Forest-to-String Decoding.

Leveraging Timestamp Information for Serialized Joint Streaming Recognition and Translation

Joint Decoding of Tree Transduction Models for Sentence Compression

Language Tokens: A Frustratingly Simple Approach Improves Zero-Shot Performance of Multilingual Translation

Deconvolution-Based Global Decoding for Neural Machine Translation

Toward Joint Language Modeling for Speech Units and Text

Plug, Play, and Fuse: Zero-Shot Joint Decoding via Word-Level Re-ranking Across Diverse Vocabularies

Opportunistic Decoding with Timely Correction for Simultaneous Translation