Abstract: Fiducial markers have been broadly used to identify objects or embed messages that can be detected by a camera. Existing detection methods primarily assume that markers are printed on ideally planar surfaces. Markers often fail to be recognized due to imaging artifacts such as optical and perspective distortion and motion blur. To overcome these limitations, we propose a novel deformable fiducial marker system that consists of three main parts. First, a fiducial marker generator creates a set of free-form color patterns that encode large-scale information in unique visual codes. Second, a differentiable image simulator creates a training dataset of photorealistic scene images containing the deformed markers, rendered in a differentiable manner during optimization. The rendered images include realistic shading with specular reflection, optical distortion, defocus and motion blur, color alteration, imaging noise, and shape deformation of markers. Lastly, a trained marker detector finds the regions of interest and recognizes multiple marker patterns simultaneously via inverse deformation transformation. The marker generator and detector networks are jointly optimized end to end via the differentiable photorealistic renderer, allowing us to robustly recognize a wide range of deformable markers with high accuracy. Our deformable marker system successfully decodes 36-bit messages at ~29 fps under severe shape deformation. Results validate that our system significantly outperforms traditional and data-driven marker methods. Our learning-based marker system opens up new applications of fiducial markers, including cost-effective motion capture of the human body, active 3D scanning using arrays of our fiducial markers as structured light patterns, and robust augmented reality rendering of virtual objects on dynamic surfaces.
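To make the 36-bit message size concrete, the following is a minimal sketch, assuming a message is simply an integer ID packed into a fixed-length bit vector. The abstract does not specify how messages are represented, so the helper names (`id_to_bits`, `bits_to_id`, `bit_accuracy`) and the big-endian bit order are hypothetical and are not the authors' implementation.

```python
# Illustrative only: the packing scheme and helper names are assumptions,
# not taken from the DeepFormableTag paper.

MESSAGE_BITS = 36  # message length stated in the abstract


def id_to_bits(marker_id: int, n_bits: int = MESSAGE_BITS) -> list[int]:
    """Pack a non-negative integer ID into a big-endian list of 0/1 bits."""
    if not 0 <= marker_id < 2 ** n_bits:
        raise ValueError(f"marker_id must fit in {n_bits} bits")
    return [(marker_id >> shift) & 1 for shift in reversed(range(n_bits))]


def bits_to_id(bits: list[int]) -> int:
    """Unpack a big-endian list of 0/1 bits back into an integer ID."""
    value = 0
    for bit in bits:
        value = (value << 1) | bit
    return value


def bit_accuracy(sent: list[int], received: list[int]) -> float:
    """Fraction of positions where the received bits match the sent bits."""
    if len(sent) != len(received):
        raise ValueError("bit vectors must have the same length")
    matches = sum(1 for s, r in zip(sent, received) if s == r)
    return matches / len(sent)


if __name__ == "__main__":
    # Number of distinct messages a 36-bit code can represent.
    print(2 ** MESSAGE_BITS)

    sent = id_to_bits(123456789)
    received = list(sent)
    received[0] ^= 1  # simulate a single decoding error
    print(bits_to_id(sent), bit_accuracy(sent, received))
```

Under this assumption, a 36-bit code distinguishes 2^36 (about 68.7 billion) messages, and a per-bit accuracy measure such as `bit_accuracy` is one simple way to compare a decoded message against the transmitted one.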

Visual FUDGE: Form Understanding via Dynamic Graph Editing

FUDGE: Controlled Text Generation With Future Discriminators

Language Independent Neuro-Symbolic Semantic Parsing for Form Understanding

FormNetV2: Multimodal Graph Contrastive Learning for Form Document Information Extraction

TreeForm: End-to-end Annotation and Evaluation for Form Document Parsing

FlexEdit: Marrying Free-Shape Masks to VLLM for Flexible Image Editing

Let's Fuse Step by Step: A Generative Fusion Decoding Algorithm with LLMs for Multi-modal Text Recognition

Florence-VL: Enhancing Vision-Language Models with Generative Vision Encoder and Depth-Breadth Fusion

Towards Flexible Visual Relationship Segmentation

Coarse-to-Fine Vision-Language Pre-training with Fusion in the Backbone

Flexible ViG: Learning the Self-Saliency for Flexible Object Recognition

FFA-GPT: an automated pipeline for fundus fluorescein angiography interpretation and question-answer

HADA: A Graph-based Amalgamation Framework in Image-text Retrieval

3MVRD: Multimodal Multi-task Multi-teacher Visually-Rich Form Document Understanding

Fuse Your Latents: Video Editing with Multi-source Latent Diffusion Models

F2FLDM: Latent Diffusion Models with Histopathology Pre-Trained Embeddings for Unpaired Frozen Section to FFPE Translation

DocFormerv2: Local Features for Document Understanding

DeepFormableTag: End-to-end Generation and Recognition of Deformable Fiducial Markers

FEditNet: Few-Shot Editing of Latent Semantics in GAN Spaces

FADE: Few-shot/zero-shot Anomaly Detection Engine using Large Vision-Language Model

Brain-like Flexible Visual Inference by Harnessing Feedback-Feedforward Alignment