CLIPSwarm: Converting text into formations of robots

Pablo Pueyo, Eduardo Montijano, Ana C. Murillo, Mac Schwager
2023-11-18
Abstract: We present CLIPSwarm, an algorithm to generate robot swarm formations from natural language descriptions. CLIPSwarm receives an input text and finds the positions of the robots that form a shape corresponding to the given text. To do so, we implement a variation of the Monte Carlo particle filter to obtain a matching formation iteratively. In every iteration, we generate a set of new formations and evaluate their CLIP similarity with the given text, selecting the best formations according to this metric. This metric is obtained using CLIP [1], an existing foundation model trained to encode images and texts into vectors within a common latent space. The comparison between these vectors determines how likely it is that the given text describes the shapes. Our initial proof of concept shows the potential of this solution to generate robot swarm formations from natural language descriptions alone and demonstrates a novel application of foundation models, such as CLIP, in the field of multi-robot systems. In this first approach, we create formations using a convex-hull approach. Next steps include more robust and generic representation and optimization steps in the process of obtaining a suitable swarm formation.
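The scoring step in the abstract compares a text vector with image vectors in a shared latent space. The sketch below illustrates that comparison under the assumption that it is a cosine similarity, which is the usual way CLIP embeddings are compared; the abstract does not state the exact formula. The function names and the three-dimensional toy vectors are hypothetical stand-ins for real CLIP embeddings, which are not computed here.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def rank_by_similarity(text_vec, image_vecs):
    """Return candidate indices sorted from best to worst match, plus raw scores.

    text_vec   -- stand-in for the embedding of the input text (assumption)
    image_vecs -- stand-ins for embeddings of rendered formations (assumption)
    """
    scores = [cosine_similarity(text_vec, v) for v in image_vecs]
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return order, scores


# Toy vectors only; a real pipeline would obtain these from a CLIP encoder.
text_vec = [1.0, 0.0, 0.0]
image_vecs = [[0.0, 1.0, 0.0], [2.0, 0.0, 0.0], [1.0, 1.0, 0.0]]
order, scores = rank_by_similarity(text_vec, image_vecs)
```

Because cosine similarity ignores vector length, the second candidate scores as a perfect match even though it is twice as long as the text vector; only direction in the latent space matters.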
Robotics
What problem does this paper attempt to address?
The paper proposes CLIPSwarm, an algorithm that generates formations for robot swarms (drone swarms in particular) from natural language descriptions. Given a text input, CLIPSwarm computes a set of robot positions whose overall shape matches the description. The main contributions of the paper are:

1. **Algorithm Introduction**: CLIPSwarm uses a variant of the Monte Carlo particle filter to generate candidate robot formations iteratively, and uses the CLIP model to score how well each formation matches the given text. CLIP is a pre-trained foundation model that encodes images and text into the same vector space, so their similarity can be computed directly.
2. **Experimental Validation**: The authors show the formations CLIPSwarm generates for different natural language descriptions and reproduce them in the high-fidelity drone simulator AirSim, illustrating the algorithm's potential for practical use.
3. **Discussion of Limitations**: The paper notes limitations of the current method: formations are evaluated through their convex hull contours, which simplifies evaluation but can lose shape details, and the search relies solely on CLIP similarity, which may not always yield the shape the user expects.

In summary, CLIPSwarm offers a novel way to create robot swarm formations automatically from natural language descriptions, opening new directions for research in multi-robot systems, particularly in the field of artistic robotics. Future work includes extending the algorithm to handle more complex inputs and using more diverse metrics to improve the accuracy and expressiveness of the formations.
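The iterative search described above can be sketched as a simple keep-the-best-and-perturb loop over candidate formations, each reduced to its convex hull before scoring. This is a simplified illustration in the spirit of the particle-filter variant, not the authors' implementation: `search_formation`, `perturb`, and every parameter value are hypothetical, and `score_fn` stands in for the CLIP similarity of a rendered hull image. Here the toy score is just the hull area, so the example runs without any model.

```python
import random


def convex_hull(points):
    """Andrew's monotone chain; returns hull vertices in counter-clockwise order."""
    pts = sorted(set(points))
    if len(pts) <= 2:
        return pts

    def cross(o, a, b):
        return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])

    lower, upper = [], []
    for p in pts:
        while len(lower) >= 2 and cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)
    for p in reversed(pts):
        while len(upper) >= 2 and cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)
    return lower[:-1] + upper[:-1]


def polygon_area(poly):
    """Shoelace formula; used here only as a toy stand-in for a CLIP score."""
    if len(poly) < 3:
        return 0.0
    total = 0.0
    for i in range(len(poly)):
        x1, y1 = poly[i]
        x2, y2 = poly[(i + 1) % len(poly)]
        total += x1 * y2 - x2 * y1
    return abs(total) / 2.0


def perturb(formation, sigma, rng):
    """Add Gaussian noise to every robot position, clamped to [-1, 1]^2."""
    def clamp(v):
        return max(-1.0, min(1.0, v))
    return [(clamp(x + rng.gauss(0.0, sigma)), clamp(y + rng.gauss(0.0, sigma)))
            for x, y in formation]


def search_formation(score_fn, n_robots=8, n_particles=20, n_keep=5,
                     n_iters=30, sigma=0.1, seed=0):
    """Hypothetical search loop: score, keep the best, refill by perturbation."""
    rng = random.Random(seed)
    particles = [[(rng.uniform(-1, 1), rng.uniform(-1, 1)) for _ in range(n_robots)]
                 for _ in range(n_particles)]
    history = []  # best score seen at each iteration
    for _ in range(n_iters):
        ranked = sorted(particles, key=lambda f: score_fn(convex_hull(f)),
                        reverse=True)
        best = ranked[:n_keep]
        history.append(score_fn(convex_hull(best[0])))
        # Survivors are kept unchanged, so the best score never decreases.
        particles = list(best)
        while len(particles) < n_particles:
            particles.append(perturb(rng.choice(best), sigma, rng))
    best_formation = max(particles, key=lambda f: score_fn(convex_hull(f)))
    return best_formation, history


best_formation, history = search_formation(polygon_area)
```

Scoring the hull rather than the raw points mirrors the limitation noted in the summary: any robot lying inside the hull has no effect on the score, so interior detail of a shape cannot be rewarded. Swapping `polygon_area` for a function that renders the hull and queries a real CLIP model would be the step that connects this sketch to the method described in the paper.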