This web page supports a dataset of natural language instructions for object specification in manipulation scenarios. It comprises 1582 individual written instructions collected via online crowdsourcing, each elicited in response to one of 28 scenario images. The dataset is particularly useful for researchers working in natural language processing, human-robot interaction, and robotic tabletop manipulation. In addition to serving as a rich corpus of domain-specific language, it provides a benchmark of image/instruction pairs for use in system evaluations and highlights inherent challenges in tabletop object specification.
Referenced Journal Publication
R. Scalise*, S. Li*, H. Admoni, S. Rosenthal, and S. Srinivasa. "Natural Language Instructions for Human-Robot Collaborative Manipulation." International Journal of Robotics Research, in press.
Data and Access Code
Primary Dataset: Natural Language Instructions Corpus
(We downsampled the data from 1582 to 1400 instructions and used those 1400 instructions in the evaluation study.)
Example of one row from the table containing the primary dataset:
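One way to pull such a row out of the released .csv yourself is sketched below. The filename natural_language_instructions.csv and the use of pandas are assumptions for illustration, not part of the release:

```python
import pandas as pd

# Hypothetical filename -- substitute the actual .csv name from this release.
corpus = pd.read_csv("natural_language_instructions.csv")

# Print the header fields, then one example row from the primary dataset.
print(corpus.columns.tolist())
print(corpus.iloc[0])
```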
Supplementary Dataset: Instruction Evaluation
- Note: in the access code for Study 2, r_target_block_index refers to the index of the target block. The indices of all blocks, and of the target blocks in both versions of each scenario on the tabletop, are annotated in images_code.pdf.
Example of one row from the table containing the evaluation dataset (non-averaged):
Note that 'Scenario' denotes the stimulus image paired with the shown instruction when it was elicited during the first study. In the evaluation study stimulus images, there is no distinction between 'v1' and 'v2' for each image, as there is no red arrow specifying a particular block. The first (of 14 total images) is Configuration_01.png.
Please refer to Table 3 in the IJRR Data Paper for further details on each header field.
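As a minimal sketch of accessing the non-averaged evaluation table, the snippet below tallies responses per target block and per stimulus image. The filename instruction_evaluation.csv is hypothetical; r_target_block_index and Scenario are the fields described above:

```python
import pandas as pd

# Hypothetical filename -- substitute the actual evaluation .csv name.
evals = pd.read_csv("instruction_evaluation.csv")

# r_target_block_index is the target block's index; block indices are
# annotated per scenario in images_code.pdf (see the note above).
print(evals.groupby("r_target_block_index").size())

# Group rows by the stimulus image each instruction was elicited from.
print(evals.groupby("Scenario").size())
```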
The stimulus images used in Study 1
"Configuration_example_page.png" was only used in the example page of the initial study.
Images "Configuration_01_**.png" through "Configuration_14_**.png" are the images used as stimuli. For each of the 14 configurations, there are 2 possible target blocks, each indicated by a red arrow ("Configuration_**_v1.png" and "Configuration_**_v2.png"). In total, there are 28 unique scenarios in the set of stimuli.
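The naming scheme above implies 28 file names, which can be enumerated directly; this is only a sketch of the stated convention:

```python
# 14 configurations x 2 target-block versions (v1, v2) = 28 stimulus images.
stimuli = [
    f"Configuration_{config:02d}_v{version}.png"
    for config in range(1, 15)
    for version in (1, 2)
]
assert len(stimuli) == 28
print(stimuli[0], stimuli[-1])  # Configuration_01_v1.png Configuration_14_v2.png
```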
The stimulus images used in Study 2
Contains the image from the example page as well as the 14 stimulus images (the original 28 scenarios with the red arrows removed, which leaves 14 unique images).
Publications based on this dataset
Shen Li*, Rosario Scalise*, Henny Admoni, Stephanie Rosenthal, and Siddhartha S Srinivasa. Spatial references and perspective in natural language instructions for collaborative manipulation. In Proceedings of the International Symposium on Robot and Human Interactive Communication Conference. IEEE, 2016.
Shen Li*, Rosario Scalise*, Henny Admoni, Stephanie Rosenthal, and Siddhartha S Srinivasa. Perspective in natural language instructions for collaborative manipulation. In Proceedings of the Robotics: Science and Systems Workshop on Model Learning for Human-Robot Communication, 2016.
Workshop at Robotics: Science and Systems 2016 - Model Learning for Human-Robot Communication
If you have any questions about the dataset, or would like to collaborate with us on human-robot communication, please contact us! We are excited to hear from you!
You can reach either of us via email:
Rosario Scalise firstname.lastname@example.org
Shen Li email@example.com
Source code is provided under the MIT License and the .csv data is available under a CC BY-SA 4.0 license.