Sparsh-X: multisensory touch fusion transformer for general-purpose representations
Touch in robotics can be sensed through multiple modalities, including tactile images,
vibrations, motion,
and pressure. Sparsh-X is a transformer-based backbone that fuses
these modalities from the
Digit 360 sensor.
We show its versatility across diverse downstream tasks: manipulation via imitation learning
(plug insertion),
tactile adaptation (in-hand object rotation), and benchmark tasks to probe the understanding of
physical properties.
Abstract
We present Sparsh-X, the first multisensory touch representations learned
across four tactile modalities: image, audio, motion, and pressure. Trained on \( \sim \)1M
contact-rich interactions collected with the Digit 360 sensor, Sparsh-X
captures complementary touch signals at diverse temporal and spatial scales. By leveraging
self-supervised learning, Sparsh-X fuses these modalities into a
unified representation that captures physical properties useful for robot manipulation
tasks.
We study how to effectively integrate real-world touch representations for both imitation
learning and tactile adaptation of sim-trained policies, showing that Sparsh-X
boosts policy success rates by 63% over an end-to-end model using tactile images and
improves
robustness by 90% in recovering object states from touch. Finally, we benchmark Sparsh-X's
ability to make inferences about physical properties, such as object-action identification,
material-quantity estimation, and force estimation. Sparsh-X
improves
accuracy in characterizing physical properties by 48% compared to end-to-end approaches,
demonstrating the advantages of multisensory pretraining for capturing features essential
for dexterous manipulation.
Walkthrough Video
Sparsh-X model overview
Sparsh-X uses a transformer-based backbone where each input tactile modality is first processed
independently through \( L_f \) transformer layers trained via self-distillation, followed by cross-modal information flow through a fusion transformer
facilitated by \( B \) bottleneck fusion tokens. After each cross-modal update, the fusion tokens are averaged across modalities to promote
information sharing.
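The bottleneck fusion mechanism above can be sketched as follows. This is a minimal numpy illustration, not the paper's implementation: the toy self-attention uses identity projections in place of a full transformer layer, and all dimensions are made up for the example. What it preserves is the key data flow, in which each modality attends jointly over its own tokens plus \( B \) shared fusion tokens, and the fusion tokens are then averaged across modalities.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(tokens, d):
    # Toy single-head self-attention with identity Q/K/V projections,
    # standing in for one full transformer layer.
    scores = tokens @ tokens.T / np.sqrt(d)
    return softmax(scores) @ tokens

def fusion_step(modality_tokens, fusion_tokens):
    """One cross-modal update with B shared bottleneck fusion tokens.

    modality_tokens: list of (N_m, d) arrays, one per modality
    fusion_tokens:   (B, d) array shared across all modalities
    """
    B, d = fusion_tokens.shape
    updated_tokens, per_modality_fusion = [], []
    for tok in modality_tokens:
        # Each modality attends over [its own tokens; fusion tokens].
        joint = np.concatenate([tok, fusion_tokens], axis=0)
        out = self_attention(joint, d)
        updated_tokens.append(out[:-B])
        per_modality_fusion.append(out[-B:])
    # Average the fusion tokens across modalities so that information
    # flows between modalities only through the B-token bottleneck.
    fusion_tokens = np.mean(per_modality_fusion, axis=0)
    return updated_tokens, fusion_tokens

# Four modalities (image, audio, motion, pressure) with different token counts.
rng = np.random.default_rng(0)
mods = [rng.normal(size=(n, 8)) for n in (5, 6, 7, 4)]
toks, fus = fusion_step(mods, np.zeros((2, 8)))
```

Because cross-modal exchange happens only through the averaged bottleneck tokens, each modality keeps its own token budget and temporal resolution while still sharing a compact joint summary.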
We train Sparsh-X on a large-scale dataset of \( \sim \)1M contact-rich interactions collected with the Digit 360 sensor,
generated primarily from two sources: a) an Allegro hand with Digit 360 sensors on each fingertip, and b) a manual picker with the same sensors adapted to its gripping
mechanism, used to execute atomic manipulation actions such as picking up, sliding, tapping, and dropping against various surfaces and objects.
Sparsh-X downstream tasks
Plug insertion with Sparsh-X representations
For the first policy learning task, we demonstrate that Sparsh-X representations improve policies
that need tactile sensing when trained with behavior cloning.
Here, we show the results of a plug insertion task where the goal is to insert a pre-grasped plug
into the first socket of an extension power strip.
Specifically, we train a transformer decoder to predict action sequences. As observations,
the policy receives images from three third-person cameras and a wrist camera,
as well as tactile observations in the form of Sparsh-X representations.
Sparsh-X (All modalities) + Wrist camera
End-to-end with tactile images only
In-hand rotation
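How camera and tactile observations feed the action-chunking policy can be sketched as below. This is a hypothetical numpy sketch under assumed dimensions (D_CAM, D_TACTILE, D_MODEL, the chunk length H, and a 7-DoF action space are all illustrative), with random linear maps standing in for the pretrained vision encoders, the frozen Sparsh-X encoder, and the transformer decoder; mean pooling replaces real cross-attention.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical feature sizes; the paper's exact dimensions are not given here.
D_CAM, D_TACTILE, D_MODEL, H = 512, 256, 128, 16  # H = action-chunk length

W_cam = rng.normal(size=(D_CAM, D_MODEL)) * 0.01      # camera feature projection
W_tac = rng.normal(size=(D_TACTILE, D_MODEL)) * 0.01  # Sparsh-X feature projection
Q_act = rng.normal(size=(H, D_MODEL)) * 0.01          # per-step action queries
W_act = rng.normal(size=(D_MODEL, 7)) * 0.01          # head to 7-DoF actions

def policy(cam_feats, tactile_feat):
    """cam_feats: (4, D_CAM), one row per camera (3 third-person + 1 wrist);
    tactile_feat: (D_TACTILE,) Sparsh-X representation.
    Returns an (H, 7) chunk of predicted actions."""
    tokens = np.concatenate(
        [cam_feats @ W_cam, (tactile_feat @ W_tac)[None]], axis=0)
    ctx = tokens.mean(axis=0)      # pooled context; a real decoder attends instead
    return (Q_act + ctx) @ W_act   # one action per query in the chunk

actions = policy(rng.normal(size=(4, D_CAM)), rng.normal(size=(D_TACTILE,)))
```

The point of the sketch is the interface: the frozen Sparsh-X embedding enters the policy as just another observation token alongside the camera features, so behavior cloning needs no tactile-specific architecture changes.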
For many in-hand manipulation tasks, collecting demonstrations is difficult because teleoperating a robot hand without haptic feedback remains challenging today.
An alternative is to train a policy in simulation using reinforcement learning and then deploy it in the real world. However, sim-to-real transfer for tactile sensing
is also challenging because tactile sensors are difficult to simulate. In this paper, we propose a real-world tactile adaptation method that uses Sparsh-X representations to adapt a sim-trained policy to the real world.
Details are provided in the paper. Here we show results on the in-hand rotation task, where we demonstrate that after tactile adaptation with Sparsh-X representations, the robot policy is more robust to perturbations such as
changes in object weight and slippage.
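One simple way to realize such an adaptation (a sketch under our own assumptions, not the method detailed in the paper) is to keep the sim-trained policy frozen and fit a lightweight adapter that maps real-world Sparsh-X features to the latent the policy's simulated tactile encoder would have produced. The numpy example below uses synthetic data and a linear ridge-regression adapter; the dimensions and the linear form are illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)
D_SPARSH, D_LATENT, N = 256, 32, 1024  # hypothetical feature/latent sizes

# Synthetic stand-ins: Sparsh-X features from real rollouts (X), paired with
# the tactile latents the frozen sim-trained policy expects (Y).
X = rng.normal(size=(N, D_SPARSH))
W_true = rng.normal(size=(D_SPARSH, D_LATENT)) * 0.1
Y = X @ W_true + 0.01 * rng.normal(size=(N, D_LATENT))

# Fit a linear adapter by ridge regression: W = (X^T X + lam I)^{-1} X^T Y.
lam = 1e-2
W_adapt = np.linalg.solve(X.T @ X + lam * np.eye(D_SPARSH), X.T @ Y)

# At deployment, the frozen policy consumes adapter(sparsh_features)
# in place of the simulator's tactile latent.
mse = np.mean((X @ W_adapt - Y) ** 2)
```

In practice the adapter would be trained on paired real rollouts and could be a small MLP rather than a linear map; the sketch only shows where the adaptation sits relative to the frozen policy.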