Abstract: Several research groups (e.g., Microsoft Research and Google DeepMind) have identified limitations of the autoregressive next-word prediction paradigm used by GPT models, which manifest as a lack of planning, working memory, backtracking, and reasoning skills. GPT models rely on a local, greedy process of generating the next word, without a global understanding of the task or the output. We have confirmed these limitations through dedicated empirical studies of code comprehension. Although GPT-4 is good at producing fluent and coherent text, it cannot handle complex logic or generate new code that it has not seen before, and it relies too heavily on the formatting of the prompt to generate correct code. We propose a new paradigm for code understanding that goes beyond next-word prediction, inspired by the successful application of diffusion techniques to image generation (DALL-E 2, Sora) and protein structure generation (AlphaFold3), which have no autoregressive constraints. Instead of encoding code in a form that mimics natural language, we encode it as a heterogeneous image that retains a memory of global information and mimics both images and protein structures. We then follow the design of CLIP, the upstream text-to-image encoder model used by Sora, to build a text-to-code encoder model that can be applied to various downstream code understanding tasks. The model learns a global understanding of code under the new heterogeneous-image paradigm, connects the encoding spaces of text and code, and encodes an input text into the vector of the code most similar to it. Using self-supervised contrastive learning on 456,360 text-code pairs, the model achieves zero-shot prediction on new data. This work lays the foundation for future work on code generation using diffusion techniques under the new paradigm, avoiding autoregressive limitations.
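To make the CLIP-style training objective concrete, the following is a minimal sketch of a symmetric contrastive loss over paired text and code embeddings, plus zero-shot retrieval by cosine similarity. It is an illustration under stated assumptions, not the model described in the abstract: the function names, the toy 3-dimensional vectors, and the temperature of 0.07 are all hypothetical, and the actual text and code encoders are replaced by hand-written vectors.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def cosine_similarity_matrix(text_vecs, code_vecs):
    """Entry [i][j] is the cosine similarity between text i and code j."""
    t = [l2_normalize(v) for v in text_vecs]
    c = [l2_normalize(v) for v in code_vecs]
    return [[sum(a * b for a, b in zip(tv, cv)) for cv in c] for tv in t]


def cross_entropy_diagonal(logits):
    """Mean cross-entropy where the correct class for row i is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        m = max(row)  # subtract the max for numerical stability
        log_sum_exp = m + math.log(sum(math.exp(x - m) for x in row))
        total += log_sum_exp - row[i]
    return total / len(logits)


def contrastive_loss(text_vecs, code_vecs, temperature=0.07):
    """Symmetric loss: text-to-code and code-to-text directions are averaged.

    The temperature value is an assumption chosen for illustration.
    """
    sims = cosine_similarity_matrix(text_vecs, code_vecs)
    logits = [[s / temperature for s in row] for row in sims]
    transposed = [list(col) for col in zip(*logits)]
    return 0.5 * (cross_entropy_diagonal(logits) + cross_entropy_diagonal(transposed))


def retrieve(text_vec, code_vecs):
    """Zero-shot lookup: index of the code vector most similar to the text vector."""
    sims = cosine_similarity_matrix([text_vec], code_vecs)[0]
    return max(range(len(sims)), key=lambda j: sims[j])


# Toy embeddings (hypothetical): row i of text_vecs is paired with row i of code_vecs.
text_vecs = [[1.0, 0.1, 0.0], [0.0, 1.0, 0.2], [0.1, 0.0, 1.0]]
code_vecs = [[0.9, 0.0, 0.1], [0.1, 0.8, 0.0], [0.0, 0.2, 1.1]]

aligned = contrastive_loss(text_vecs, code_vecs)
# Rotating the code vectors breaks every pairing, so the loss should rise.
shuffled = contrastive_loss(text_vecs, code_vecs[1:] + code_vecs[:1])

print("aligned loss lower than shuffled:", aligned < shuffled)
print("retrieved code index:", retrieve([0.0, 0.9, 0.1], code_vecs))
```

In this sketch, correctly paired embeddings yield a lower loss than mismatched ones, which is the signal that pulls the text and code encoding spaces together during training. The `retrieve` helper mirrors the zero-shot step, where an unseen text is mapped to the closest code vector.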
