Abstract: Humans judge the similarity of two objects not just based on their visual appearance but also based on their semantic relatedness. However, it remains unclear how humans learn about semantic relationships between objects and categories. One important source of semantic knowledge is that semantically related objects frequently co-occur in the same context. For instance, forks and plates are perceived as similar, at least in part, because they are often experienced together in a ``kitchen'' or ``eating'' context. Here, we investigate whether a bio-inspired learning principle exploiting such co-occurrence statistics suffices to learn a semantically structured object representation {\em de novo} from raw visual or combined visual and linguistic input. To this end, we simulate temporal sequences of visual experience by binding together short video clips of real-world scenes showing objects in different contexts. A bio-inspired neural network model aligns close-in-time visual representations while also aligning visual and category label representations to simulate visuo-language alignment. Our results show that our model clusters object representations based on their context, e.g., kitchen or bedroom, in particular in high-level layers of the network, akin to humans. In contrast, lower-level layers tend to better reflect object identity or category. To achieve this, the model exploits two distinct strategies: the visuo-language alignment ensures that different objects of the same category are represented similarly, whereas the temporal alignment leverages that objects from the same context are frequently seen in succession to make their representations more similar. Overall, our work suggests temporal and visuo-language alignment as plausible computational principles for explaining the origins of certain forms of semantic knowledge in humans.

What problem does this paper attempt to address?

### Problems the Paper Aims to Solve

This paper explores how humans learn semantic relationships between objects from the spatiotemporal structure of visual experience, and how they form context-based object representations. Specifically, the researchers use a biologically inspired learning principle that builds semantically structured object representations from the co-occurrence statistics of objects across scenes.

The main objectives of the paper include:

1. **Exploring the Temporal Slowness Principle**: The researchers test whether the temporal slowness principle, which aligns representations of inputs seen close together in time, is sufficient to learn semantically structured object representations from raw visual input or from combined visual and language input.
2. **Simulating Visuo-Language Alignment**: By aligning visual representations with category label representations, the researchers ensure that different objects of the same category are represented similarly.

To simulate continuous visual experience over time, the researchers construct first-person perspective video sequences of real-world objects and train a biologically inspired neural network model on them. Experimental results show that the model clusters object representations by the context in which the objects appear (such as kitchen or bedroom) in its higher layers, while its lower layers better reflect object identity or category.

Overall, this paper suggests temporal and visuo-language alignment as plausible computational principles for explaining the origins of certain forms of human semantic knowledge.
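The two alignment objectives described above can be illustrated with a minimal sketch. The source does not state the exact loss the paper uses, so the InfoNCE-style contrastive loss here is an assumption chosen for illustration, and the names `info_nce`, `combined_alignment_loss`, `temperature`, and `label_weight` are hypothetical. The sketch only shows how a temporal term (frame at time t versus the next frame) and a visuo-language term (frame versus category label embedding) could be combined into one objective.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v + 1e-12)


def info_nce(anchors, positives, temperature=0.1):
    """InfoNCE-style loss (an assumed choice, not confirmed by the source).

    Row i of `anchors` is pulled toward row i of `positives`;
    all other rows of `positives` act as negatives.
    """
    losses = []
    for i, anchor in enumerate(anchors):
        logits = [cosine(anchor, p) / temperature for p in positives]
        # Log-sum-exp with max subtraction for numerical stability.
        m = max(logits)
        log_norm = m + math.log(sum(math.exp(l - m) for l in logits))
        losses.append(log_norm - logits[i])
    return sum(losses) / len(losses)


def combined_alignment_loss(frames_t, frames_next, label_embeddings,
                            label_weight=1.0):
    """Sum of a temporal term and a visuo-language term (hypothetical weighting)."""
    # Temporal alignment: embeddings of frames seen in succession should match.
    temporal = info_nce(frames_t, frames_next)
    # Visuo-language alignment: each frame should match its category label embedding.
    visuo_language = info_nce(frames_t, label_embeddings)
    return temporal + label_weight * visuo_language
```

In this sketch, objects that often follow each other in time (for example, two kitchen objects) would be drawn together by the temporal term, while the label term draws together different instances of the same category. In a real model the embeddings would come from network layers and the loss would be minimized by gradient descent, which this sketch leaves out.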

Learning Object Semantic Similarity with Self-Supervision