Self-supervised perception for tactile skin covered dexterous hands

Akash Sharma¹,², Carolina Higuera²,³, Chaithanya Krishna Bodduluri², Zixi Liu², Taosha Fan², Tess Hellebrekers², Mike Lambeta², Byron Boots³, Michael Kaess¹, Tingfan Wu², Francois R. Hogan², Mustafa Mukadam²

¹Carnegie Mellon University, ²FAIR at Meta, ³University of Washington

CoRL 2025


Sparsh-skin is an approach to learn general representations for magnetic tactile skins covering dexterous robot hands. Sparsh-skin is trained via self-supervision on a large pretraining dataset (\(\sim 4\) hours) containing diverse atomic in-hand interactions. It takes as input a brief history of tactile observations \(\mathbf{x}_i\) and 3D sensor positions \(\mathbf{p}_i\) to produce performant full-hand contextual representations. Sparsh-skin representations are general purpose and can be used in a variety of contact-rich downstream tasks.
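To make the input format concrete, the following is a minimal sketch of how a tactile recording could be cut into short histories before being passed to an encoder together with the sensor positions. The 3-axis reading layout, the `sliding_windows` helper, and the `window` and `stride` parameters are assumptions for illustration; they are not taken from the Sparsh-skin code.

```python
from typing import List, Sequence, Tuple

# Assumed layout: one 3-axis sample per sensor per time step.
Reading = Tuple[float, float, float]
Frame = List[Reading]  # readings from all sensors at one time step


def sliding_windows(
    frames: Sequence[Frame], window: int, stride: int = 1
) -> List[List[Frame]]:
    """Split a recording into overlapping fixed-length tactile histories."""
    if window <= 0 or stride <= 0:
        raise ValueError("window and stride must be positive")
    last_start = len(frames) - window
    # Recordings shorter than one window yield no histories.
    return [list(frames[s:s + window]) for s in range(0, last_start + 1, stride)]
```

Each returned history would then be paired with the 3D sensor positions \(\mathbf{p}_i\) at the corresponding time steps; how the real pipeline chooses the history length is not specified here.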

Abstract

We present Sparsh-skin, a pre-trained encoder for magnetic skin sensors distributed across the fingertips, phalanges, and palm of a dexterous robot hand. Magnetic tactile skins offer a flexible form factor for hand-wide coverage with fast response times, in contrast to vision-based tactile sensors that are restricted to the fingertips and limited by bandwidth. Full-hand tactile perception is crucial for robot dexterity. However, the lack of general-purpose models, together with the difficulty of interpreting magnetic flux and of calibrating the sensors, has limited the adoption of these sensors. Sparsh-skin, given a history of kinematic and tactile sensing across a hand, outputs a latent tactile embedding that can be used in any downstream task. The encoder is self-supervised via self-distillation on a variety of unlabeled hand-object interactions using an Allegro hand sensorized with Xela uSkin. In experiments across several benchmark tasks, from state estimation to policy learning, we find that pretrained Sparsh-skin representations are sample efficient in learning downstream tasks and improve task performance by over 41% compared to prior work and over 56% compared to end-to-end learning.

Walkthrough Video

Sparsh-skin overview

We collect a pre-training dataset of the robot hand performing various atomic manipulation actions with 14 household objects and toys, including squeezing, sliding, rotation, pick-and-drop, circumduction, pressing, wiping, and articulation. Using a VR-based teleoperation system with Meta Quest 3, we record \( \sim \) 4 hours of varied interactions. Then, Sparsh-skin uses a self-distillation approach to learn sensor-level representations over small windows of tactile data. Specifically, the student encoder network is given corrupted tactile signal data and is trained to match the representations that the teacher network predicts from the complete tactile signal data. Once pre-trained, the representations can be used for downstream tasks such as force estimation, pose estimation, and policy learning.
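The student-teacher setup described above can be sketched in a few lines. This is a toy illustration under stated assumptions, not the Sparsh-skin implementation: the encoders are plain linear maps, corruption is random zero-masking, the matching loss is mean squared error, and the teacher is updated as an exponential moving average of the student, which is a common choice in self-distillation but is not stated in the text above. All function names are hypothetical.

```python
import random
from typing import List

Vector = List[float]
Matrix = List[Vector]


def mask_signal(x: Vector, mask_ratio: float, rng: random.Random) -> Vector:
    """Corrupt a signal by zeroing a random subset of its entries."""
    n_mask = int(round(mask_ratio * len(x)))
    masked = set(rng.sample(range(len(x)), n_mask))
    return [0.0 if i in masked else v for i, v in enumerate(x)]


def encode(weights: Matrix, x: Vector) -> Vector:
    """Toy linear encoder standing in for a real network."""
    return [sum(w * v for w, v in zip(row, x)) for row in weights]


def distillation_loss(student_out: Vector, teacher_out: Vector) -> float:
    """Mean squared error between student output and teacher target."""
    return sum((s - t) ** 2 for s, t in zip(student_out, teacher_out)) / len(student_out)


def student_step(student: Matrix, x_masked: Vector, target: Vector, lr: float) -> Matrix:
    """One gradient-descent step on the MSE loss; the teacher target is held fixed."""
    pred = encode(student, x_masked)
    scale = 2.0 / len(pred)
    return [
        [w - lr * scale * (p - t) * v for w, v in zip(row, x_masked)]
        for row, p, t in zip(student, pred, target)
    ]


def ema_update(teacher: Matrix, student: Matrix, momentum: float) -> Matrix:
    """Assumed teacher update: teacher <- m * teacher + (1 - m) * student."""
    return [
        [momentum * t + (1.0 - momentum) * s for t, s in zip(t_row, s_row)]
        for t_row, s_row in zip(teacher, student)
    ]
```

One training iteration would mask a window with `mask_signal`, compute the target with `encode(teacher, x)` on the complete signal, apply `student_step` on the masked signal, and then refresh the teacher with `ema_update`. Gradients flow only through the student, so the teacher provides a slowly moving target.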

Sparsh-skin downstream tasks

A snapshot of our robot system for downstream tasks.

Signal auto reconstructions from Sparsh-skin features

We visualize the auto-reconstruction of tactile signals from the latent features of Sparsh-skin. Red dots indicate the tactile sensor locations on the robot hand. The radius of the green blob over a sensor correlates directly with the normal force applied to that sensor. The offset of the blob center from the canonical sensor position, highlighted by a red vector, indicates the shear force applied to the sensor.
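The visual encoding described above can be written as a small mapping from one sensor reading to blob geometry. This is a sketch under assumptions: the linear scaling, the use of the magnitude of the normal component, and the `radius_gain` and `shear_gain` parameters are hypothetical and chosen only for illustration.

```python
from typing import Tuple


def blob_from_reading(
    sensor_xy: Tuple[float, float],
    force_xyz: Tuple[float, float, float],
    radius_gain: float = 1.0,
    shear_gain: float = 1.0,
) -> Tuple[float, Tuple[float, float]]:
    """Map a 3-axis reading to (blob radius, blob center) in plot coordinates."""
    fx, fy, fz = force_xyz
    # Normal component sets the blob size (sign convention is an assumption).
    radius = radius_gain * abs(fz)
    # Shear components shift the blob away from the canonical sensor position.
    center = (sensor_xy[0] + shear_gain * fx, sensor_xy[1] + shear_gain * fy)
    return radius, center
```

The red vector in the visualization would then run from `sensor_xy` to the returned `center`.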

Pose estimation

We also show that Sparsh-skin representations capture relative object pose information (see the paper for comparisons).

Plug insertion task

Sparsh-skin representations can enable planning for manipulation. Here, we show the results of a plug insertion task where the goal is to insert a pre-grasped plug into the first socket of an extension power strip. Specifically, we train a transformer decoder to predict action sequences. The policy observes images from three third-person cameras and a wrist camera, together with tactile observations in the form of Sparsh-skin representations. The robot setup is shown in the setup figure in the overview section.

Sparsh-skin frozen

Vision only

BibTeX

If you find our work useful, please consider citing our paper:

@inproceedings{sharma2025selfsupervised,
  title = {Self-supervised perception for tactile skin covered dexterous hands},
  author = {Akash Sharma and Carolina Higuera and Chaithanya Krishna Bodduluri and Zixi Liu and Taosha Fan and Tess Hellebrekers and Mike Lambeta and Byron Boots and Michael Kaess and Tingfan Wu and Francois Robert Hogan and Mustafa Mukadam},
  year = {2025},
  booktitle={9th Annual Conference on Robot Learning},
  url={https://openreview.net/forum?id=eLeCrM5PEO}
}