Abstract: Sign language translation (SLT) is a challenging weakly supervised task without word-level annotations. An effective method of SLT is to leverage multimodal complementarity and to explore implicit temporal cues. In this work, we propose a graph-based multimodal sequential embedding network (MSeqGraph), in which multiple sequential modalities are densely correlated. Specifically, we build a graph structure to realize the intra-modal and inter-modal correlations. First, we design a graph embedding unit (GEU), which embeds a parallel convolution with channel-wise and temporal-wise learning into the graph convolution to learn the temporal cues in each modal sequence and cross-modal complementarity. Then, a hierarchical GEU stacker with a pooling-based skip connection is proposed. Unlike the state-of-the-art methods, to obtain a compact and informative representation of multimodal sequences, the GEU stacker gradually compresses the channel $d$ with multi-modalities $m$ rather than the temporal dimension $t$. Finally, we adopt the connectionist temporal decoding strategy to explore the entire video's temporal transition and translate the sentence. Extensive experiments on the USTC-CSL and BOSTON-104 datasets demonstrate the effectiveness of the proposed method.
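
The connectionist temporal decoding step mentioned in the abstract can be illustrated with a minimal sketch of greedy (best-path) decoding. This is an assumption for illustration only: the abstract does not say whether the method uses greedy or beam-search decoding, and the blank index, vocabulary, and per-frame scores below are hypothetical.

```python
from itertools import groupby

BLANK = 0  # assumption: index 0 is the blank label


def ctc_greedy_decode(frame_scores, blank=BLANK):
    """Best-path decoding: argmax per frame, merge repeats, drop blanks."""
    # Pick the highest-scoring label at every time step.
    best_path = [
        max(range(len(scores)), key=scores.__getitem__)
        for scores in frame_scores
    ]
    # Merge consecutive duplicates, then remove the blank label.
    collapsed = [label for label, _ in groupby(best_path)]
    return [label for label in collapsed if label != blank]


# Hypothetical vocabulary and per-frame scores (6 frames, 3 labels).
vocab = {1: "hello", 2: "world"}
scores = [
    [0.1, 0.8, 0.1],
    [0.2, 0.7, 0.1],
    [0.9, 0.05, 0.05],
    [0.1, 0.1, 0.8],
    [0.1, 0.1, 0.8],
    [0.8, 0.1, 0.1],
]
decoded = ctc_greedy_decode(scores)
sentence = " ".join(vocab[i] for i in decoded)
```

Because repeated labels are merged and blanks are dropped only afterwards, the output sentence can be much shorter than the number of frames, which is why no word-level alignment is needed. Keeping the temporal dimension $t$ uncompressed leaves one score vector per time step for this kind of decoding.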

MS2SL: Multimodal Spoken Data-Driven Continuous Sign Language Production

Continuous 3D Multi-Channel Sign Language Production via Progressive Transformers and Mixture Density Networks

DualSign: Semi-Supervised Sign Language Production with Balanced Multi-Modal Multi-Task Dual Transformation

Collaborative Multilingual Continuous Sign Language Recognition: A Unified Framework

Text2Sign: Towards Sign Language Production Using Neural Machine Translation and Generative Adversarial Networks

Sign Language Production with Latent Motion Transformer

Discrete to Continuous: Generating Smooth Transition Poses from Sign Language Observation

Boosting Continuous Sign Language Recognition via Cross Modality Augmentation

Neural Sign Actors: A diffusion model for 3D sign language production from text

SignLLM: Sign Language Production Large Language Models

CSLNSpeech: solving the extended speech separation problem with the help of Chinese Sign Language

G2P-DDM: Generating Sign Pose Sequence from Gloss Sequence with Discrete Diffusion Model

Video-Based Sign Language Recognition Without Temporal Segmentation

Multi-View Spatial-Temporal Network for Continuous Sign Language Recognition

SignDiff: Diffusion Models for American Sign Language Production

Graph-Based Multimodal Sequential Embedding for Sign Language Translation

Towards Online Continuous Sign Language Recognition and Translation

Attentional bias for hands: Cascade dual‐decoder transformer for sign language production

Signs as Tokens: An Autoregressive Multilingual Sign Language Generator

SignVTCL: Multi-Modal Continuous Sign Language Recognition Enhanced by Visual-Textual Contrastive Learning