We present Sparsh-X, the first multisensory touch representations spanning four tactile modalities: image, audio, motion, and pressure. Trained on \(\sim\)1M
contact-rich interactions collected with the Digit 360 sensor, Sparsh-X
captures complementary touch signals at diverse temporal and spatial scales. By leveraging
self-supervised learning, Sparsh-X fuses these modalities into a
unified representation that captures physical properties useful for robot manipulation
tasks.
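As a rough illustration of this fusion design (a minimal sketch under assumed module names and dimensions, not the released Sparsh-X implementation), per-modality encoders can feed a shared transformer that produces a single touch embedding:

\begin{verbatim}
# Illustrative sketch only: assumed encoder names, dims, and fusion scheme.
import torch
import torch.nn as nn

class TouchFusionEncoder(nn.Module):
    def __init__(self, dim=256):
        super().__init__()
        # One encoder per Digit 360 modality, each at its native scale.
        self.image_enc = nn.Sequential(nn.Flatten(), nn.LazyLinear(dim))
        self.audio_enc = nn.Sequential(nn.Flatten(), nn.LazyLinear(dim))
        self.motion_enc = nn.Sequential(nn.Flatten(), nn.LazyLinear(dim))
        self.pressure_enc = nn.Sequential(nn.Flatten(), nn.LazyLinear(dim))
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=4,
                                           batch_first=True)
        self.fusion = nn.TransformerEncoder(layer, num_layers=2)

    def forward(self, image, audio, motion, pressure):
        # One token per modality, fused into a unified touch representation.
        tokens = torch.stack([self.image_enc(image),
                              self.audio_enc(audio),
                              self.motion_enc(motion),
                              self.pressure_enc(pressure)], dim=1)
        fused = self.fusion(tokens)    # (batch, 4, dim)
        return fused.mean(dim=1)       # single multisensory embedding
\end{verbatim}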
We study how to effectively integrate real-world touch representations into both imitation learning and tactile adaptation of sim-trained policies, showing that Sparsh-X boosts policy success rates by 63% over an end-to-end model using tactile images and improves robustness by 90% in recovering object states from touch. Finally, we benchmark Sparsh-X's
ability to infer physical properties such as object-action identification, material-quantity estimation, and force estimation. Sparsh-X improves accuracy in characterizing physical properties by 48% compared to end-to-end approaches,
demonstrating the advantages of multisensory pretraining for capturing features essential
for dexterous manipulation.
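As a hypothetical sketch of the downstream usage pattern evaluated here, a frozen pretrained touch encoder (e.g., the TouchFusionEncoder sketched above) can feed a small task head for force regression, property classification, or policy learning; all names and dimensions are illustrative assumptions:

\begin{verbatim}
# Hypothetical downstream usage: frozen touch features + small task head.
import torch
import torch.nn as nn

class DownstreamHead(nn.Module):
    def __init__(self, touch_encoder: nn.Module, feat_dim=256, out_dim=3):
        super().__init__()
        self.touch_encoder = touch_encoder
        for p in self.touch_encoder.parameters():
            p.requires_grad = False      # keep pretrained representation frozen
        self.head = nn.Sequential(nn.Linear(feat_dim, 128), nn.ReLU(),
                                  nn.Linear(128, out_dim))

    def forward(self, image, audio, motion, pressure):
        with torch.no_grad():
            z = self.touch_encoder(image, audio, motion, pressure)
        return self.head(z)              # e.g., 3-axis force or policy action

# Example: probe = DownstreamHead(TouchFusionEncoder(), out_dim=3)
\end{verbatim}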