Multi-Modal Representation Learning with Tactile Data
Jan 1, 2024
Hyung-Gun Chi
Jose Barreiros
Jean Mercat
Karthik Ramani
Thomas Kollar
Abstract
Advancements in embodied language models such as PaLM-E and RT-2 have significantly enhanced language-conditioned robotic manipulation. However, these advances remain predominantly focused on vision and language, often overlooking the pivotal role of tactile feedback, which is advantageous in contact-rich interactions. Our research introduces a novel approach that synergizes tactile information with vision and language. We present the Multi-Modal Wand (MMWand) dataset, enriched with linguistic descriptions and tactile data. By integrating tactile feedback, we aim to bridge the divide between human linguistic understanding and robotic sensory interpretation. Our multi-modal representation model is trained on this dataset using the multi-modal embedding-alignment principle from ImageBind, and shows promising results that emphasize the potential of tactile data in robotic applications. We validate our approach on downstream robotics tasks, including texture-based object classification, cross-modality retrieval, and dense reward functions for visuomotor control. Our contributions underscore the importance of tactile feedback in multi-modal robotic learning and its potential to enhance robotic tasks. The MMWand dataset is publicly available.
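The ImageBind-style embedding alignment mentioned in the abstract is typically realized as a symmetric contrastive (InfoNCE) loss over paired embeddings from two modalities. The sketch below is illustrative only, assuming paired vision and tactile embedding batches; the function and variable names are not taken from the paper.

```python
import numpy as np

def l2_normalize(x, axis=-1):
    """Project embeddings onto the unit sphere, as in CLIP/ImageBind."""
    return x / np.linalg.norm(x, axis=axis, keepdims=True)

def infonce_alignment_loss(vision_emb, tactile_emb, temperature=0.07):
    """Symmetric InfoNCE loss aligning paired vision/tactile embeddings.

    Rows i of the two arrays are assumed to come from the same sample,
    so matching pairs lie on the diagonal of the similarity matrix.
    """
    v = l2_normalize(np.asarray(vision_emb, dtype=float))
    t = l2_normalize(np.asarray(tactile_emb, dtype=float))
    logits = v @ t.T / temperature          # (N, N) cosine similarities
    idx = np.arange(len(v))                 # diagonal entries are positives

    def cross_entropy(lg):
        # numerically stable log-softmax over each row
        lg = lg - lg.max(axis=1, keepdims=True)
        logp = lg - np.log(np.exp(lg).sum(axis=1, keepdims=True))
        return -logp[idx, idx].mean()

    # average the vision->tactile and tactile->vision directions
    return 0.5 * (cross_entropy(logits) + cross_entropy(logits.T))
```

Minimizing this loss pulls each sample's tactile embedding toward its paired vision embedding while pushing it away from the other samples in the batch, which is what makes cross-modality retrieval between touch, vision, and language possible downstream.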
Publication
In IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS)