Embodied Cognition in Virtual Environments with Diachronic Analysis of Linguistic and Visual Inputs

Jason Armitage

Recent work in machine learning opens the way to analysing changes in learned spatial structures with embodied AI, that is, by placing artificial agents in virtual environments. Vision-and-language multi-view alignment maps input pairs in which at least one modality is a collection of sets. This process is the object of measurement and forms the assumed or explicit modelling objective in problems with shared references between visual and linguistic inputs. In this work, methods are developed from three fundamental approaches to mapping information in the source modalities: 1) multivariate mutual information measures combined with a zeroth-order optimisation algorithm; 2) Score Distillation Sampling for image generation conditioned on text, in the setting where visual representations are observed from multiple viewpoints; and 3) a priority map module for the transformer architecture that conducts hierarchical alignment of textual spans and visual perspectives.
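
As a minimal sketch of the first approach (not the thesis's actual estimator or optimiser): total correlation, a multivariate generalisation of mutual information, is computed under a Gaussian assumption and maximised with SPSA, a classic zeroth-order algorithm that uses only function evaluations. The synthetic "visual" and "linguistic" features, the Gaussian estimator, and all names are illustrative assumptions.

```python
import numpy as np

def total_correlation_gaussian(X):
    """Total correlation (a multivariate MI measure) under a Gaussian
    assumption: 0.5 * (sum of log marginal variances - log|covariance|)."""
    cov = np.cov(X, rowvar=False)
    _, logdet = np.linalg.slogdet(cov)
    return 0.5 * (np.sum(np.log(np.diag(cov))) - logdet)

def spsa_maximize(f, theta, iters=200, a=0.1, c=0.1, seed=0):
    """SPSA: a zeroth-order optimiser that forms a gradient estimate from
    two function evaluations per step, with no analytic gradients."""
    rng = np.random.default_rng(seed)
    for k in range(1, iters + 1):
        ak, ck = a / k**0.602, c / k**0.101
        delta = rng.choice([-1.0, 1.0], size=theta.shape)
        ghat = (f(theta + ck * delta) - f(theta - ck * delta)) / (2 * ck) * delta
        theta = theta + ak * ghat  # ascend on the mutual information measure
    return theta

# Toy demo: paired "visual" and "linguistic" features share a latent z;
# learn projection weights that maximise total correlation with z.
rng = np.random.default_rng(1)
z = rng.normal(size=(1000, 1))
V = z + 0.5 * rng.normal(size=(1000, 2))   # stand-in visual features
L = z + 0.5 * rng.normal(size=(1000, 2))   # stand-in linguistic features

def objective(w):
    joint = np.hstack([z, V @ w[:2, None], L @ w[2:, None]])
    return total_correlation_gaussian(joint)

w_star = spsa_maximize(objective, np.ones(4))
```

SPSA is chosen here only because it is the textbook example of zeroth-order optimisation: it perturbs all parameters simultaneously, so the cost per step stays at two evaluations regardless of dimension.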
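
For the second approach, the standard Score Distillation Sampling gradient (as introduced for text-conditioned generation with pretrained diffusion models) can be written as below. Here x = g(θ, c) is the image rendered from scene parameters θ at viewpoint c, ε̂_φ is the diffusion model's noise prediction for prompt y at timestep t, and w(t) is a timestep weighting; taking the expectation over viewpoints c is an assumption about how the multi-view setting instantiates the formula.

```latex
\nabla_{\theta}\,\mathcal{L}_{\mathrm{SDS}}(\theta)
  = \mathbb{E}_{t,\,\epsilon,\,c}\!\left[\, w(t)\,
    \big( \hat{\epsilon}_{\phi}(x_t;\, y,\, t) - \epsilon \big)\,
    \frac{\partial x}{\partial \theta} \,\right],
\qquad x = g(\theta, c),
\quad x_t = \alpha_t\, x + \sigma_t\, \epsilon
```

Note that the Jacobian of the diffusion network itself is omitted from this gradient; that omission is what distinguishes SDS from directly backpropagating the diffusion training loss through the denoiser.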
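
For the third approach, the following is a hypothetical sketch, not the thesis's module: a small PyTorch layer that scores every (textual span, visual perspective) pair, turns the scores into a priority distribution, and injects it as an additive bias into cross-attention. The dimensions, names, and biasing scheme are all assumptions.

```python
import torch
import torch.nn as nn

class PriorityMap(nn.Module):
    """Illustrative priority map: pairwise span-view scores bias
    cross-attention toward high-priority (span, perspective) pairs."""
    def __init__(self, d_model=256, n_heads=4):
        super().__init__()
        self.score = nn.Sequential(nn.Linear(2 * d_model, d_model),
                                   nn.Tanh(), nn.Linear(d_model, 1))
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)

    def forward(self, spans, views):
        # spans: (B, S, d) textual span features; views: (B, V, d) perspectives
        B, S, _ = spans.shape
        V = views.shape[1]
        pairs = torch.cat([spans.unsqueeze(2).expand(-1, -1, V, -1),
                           views.unsqueeze(1).expand(-1, S, -1, -1)], dim=-1)
        priority = self.score(pairs).squeeze(-1)        # (B, S, V) pair scores
        bias = torch.log_softmax(priority, dim=-1)      # additive attention bias
        out, _ = self.attn(spans, views, views,
                           attn_mask=bias.repeat_interleave(self.attn.num_heads, 0))
        return out, priority

pm = PriorityMap()
spans, views = torch.randn(2, 5, 256), torch.randn(2, 8, 256)
fused, prio = pm(spans, views)   # fused spans plus the raw priority scores
```

The additive log-softmax bias keeps the mechanism hierarchical in spirit: the priority scores act as a coarse high-level alignment that the fine-grained attention heads then refine.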
