We present Sparsh-X, the first multisensory touch representations spanning four tactile modalities: image, audio, motion, and pressure. Trained on \(\sim\)1M
contact-rich interactions collected with the Digit 360 sensor, Sparsh-X
captures complementary touch signals at diverse temporal and spatial scales. By leveraging
self-supervised learning, Sparsh-X fuses these modalities into a
unified representation that captures physical properties useful for robot manipulation
tasks.
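As a rough illustration of this fusion design (a minimal sketch under assumed module names and dimensions, not the released Sparsh-X implementation), per-modality encoders can feed a shared transformer that produces a single touch embedding:

\begin{verbatim}
# Illustrative sketch only: assumed encoder names, dims, and fusion scheme.
import torch
import torch.nn as nn

class TouchFusionEncoder(nn.Module):
    def __init__(self, dim=256):
        super().__init__()
        # One encoder per Digit 360 modality, each at its native scale.
        self.image_enc = nn.Sequential(nn.Flatten(), nn.LazyLinear(dim))
        self.audio_enc = nn.Sequential(nn.Flatten(), nn.LazyLinear(dim))
        self.motion_enc = nn.Sequential(nn.Flatten(), nn.LazyLinear(dim))
        self.pressure_enc = nn.Sequential(nn.Flatten(), nn.LazyLinear(dim))
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=4,
                                           batch_first=True)
        self.fusion = nn.TransformerEncoder(layer, num_layers=2)

    def forward(self, image, audio, motion, pressure):
        # One token per modality, fused into a unified touch representation.
        tokens = torch.stack([self.image_enc(image),
                              self.audio_enc(audio),
                              self.motion_enc(motion),
                              self.pressure_enc(pressure)], dim=1)
        fused = self.fusion(tokens)    # (batch, 4, dim)
        return fused.mean(dim=1)       # single multisensory embedding
\end{verbatim}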
We study how to effectively integrate real-world touch representations into both imitation learning and tactile adaptation of sim-trained policies, showing that Sparsh-X boosts policy success rates by 63% over an end-to-end model using tactile images and improves robustness by 90% in recovering object states from touch. Finally, we benchmark Sparsh-X's
ability to infer physical properties such as object-action identification, material-quantity estimation, and force estimation. Sparsh-X improves accuracy in characterizing physical properties by 48% compared to end-to-end approaches,
demonstrating the advantages of multisensory pretraining for capturing features essential
for dexterous manipulation.
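As a hypothetical sketch of the downstream usage pattern evaluated here, a frozen pretrained touch encoder (e.g., the TouchFusionEncoder sketched above) can feed a small task head for force regression, property classification, or policy learning; all names and dimensions are illustrative assumptions:

\begin{verbatim}
# Hypothetical downstream usage: frozen touch features + small task head.
import torch
import torch.nn as nn

class DownstreamHead(nn.Module):
    def __init__(self, touch_encoder: nn.Module, feat_dim=256, out_dim=3):
        super().__init__()
        self.touch_encoder = touch_encoder
        for p in self.touch_encoder.parameters():
            p.requires_grad = False      # keep pretrained representation frozen
        self.head = nn.Sequential(nn.Linear(feat_dim, 128), nn.ReLU(),
                                  nn.Linear(128, out_dim))

    def forward(self, image, audio, motion, pressure):
        with torch.no_grad():
            z = self.touch_encoder(image, audio, motion, pressure)
        return self.head(z)              # e.g., 3-axis force or policy action

# Example: probe = DownstreamHead(TouchFusionEncoder(), out_dim=3)
\end{verbatim}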