Contributing Datasets#

We encourage everyone to contribute their eye-tracking datasets to the pymovements library. Doing so increases the visibility and impact of the associated research, as included datasets can be easily discovered and downloaded using the pymovements software.

This contributing guide provides all the resources you need to contribute a dataset to the pymovements library. It is offered on three levels. Pick the one that best matches your prior knowledge:

Basic: You have never used pymovements before and don’t have any programming experience. → You will learn how to create an issue that will allow pymovements maintainers to add your dataset
Intermediate: You have some experience with programming, but are not familiar with Git. → You will learn how to create a pull request with a draft of your dataset definition file
Advanced: You are proficient in Python and familiar with Git. → You will learn how to add and test your dataset definition on your local machine

Prerequisites: hosting your dataset#

pymovements does not host datasets; it only provides an interface for downloading and reading them. You will therefore need to upload your dataset to a location from which pymovements can download it.

Your data must be openly available and downloadable from a simple link without requiring additional steps like logging in. We recommend OSF for hosting your files, but other platforms like Zenodo or GitHub will also work.
Your data must be stored in one of the supported formats: CSV, ASC (EyeLink), IPC/Feather.
Your dataset may consist of multiple files, including ZIP files containing nested folders.
Trial information (e.g., trial or participant IDs) may be stored as additional columns in the data files, in the filenames, or as messages in ASC files.
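
If your dataset is packaged as a ZIP file, it can help to check which file types it contains before uploading. The following is a minimal sketch using only the Python standard library. The function name `summarize_archive` and the suffix set are assumptions made for illustration (pymovements does not prescribe these suffixes), so adjust them to your data.

```python
import zipfile
from collections import Counter
from pathlib import PurePosixPath

# Assumed suffixes for the formats named above (CSV, ASC, IPC/Feather).
# This mapping is illustrative and not taken from pymovements.
SUPPORTED_SUFFIXES = {'.csv', '.asc', '.feather', '.ipc'}


def summarize_archive(zip_path):
    """Count files per suffix in a ZIP archive and list files with other suffixes."""
    counts = Counter()
    other_files = []
    with zipfile.ZipFile(zip_path) as archive:
        for info in archive.infolist():
            if info.is_dir():
                continue  # skip directory entries of nested folders
            suffix = PurePosixPath(info.filename).suffix.lower()
            counts[suffix] += 1
            if suffix not in SUPPORTED_SUFFIXES:
                other_files.append(info.filename)
    return counts, other_files

# Example call (the path is a placeholder):
# counts, other_files = summarize_archive('path/to/my_dataset.zip')
```

Files reported in `other_files` are not necessarily a problem (a README, for example, is harmless); the list only points out what pymovements would probably not read as data.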

Basic#

To add your dataset, we will need some information on where and in what format you stored your data, as well as some metadata about your data collection. Specifically, we need:

Links to your data files (containing sample-based data, event-based data, and/or aggregated measures)
Information on where/how participant IDs, trial IDs, and related data are stored (within the data files, or in the filename)
Information on the screen you used to present the stimuli:
- Screen size in centimeters
- Screen resolution in pixels
- Eye-to-screen distance
Information on the eye-tracker you used:
- Model and manufacturer
- Sampling rate
- Where the origin (0, 0) of the gaze coordinates recorded by the eye-tracker is (e.g., top left of screen, center of screen)
Any paper(s) you would like to be referenced by users of your dataset

Once you have all the information, you can create an issue in the pymovements repository on GitHub. You need a GitHub account to do this. If you prefer not to create a GitHub account, please send the information above to pymovements@python.org instead.

After receiving your information, we will start working to include your dataset. It is likely that we will need some additional information from you, so please keep an eye on the GitHub issue. Once the inclusion is completed, your dataset will be included in the next release of pymovements. This process may take several weeks.

Intermediate#

To add a new dataset to the library, you will need to create a DatasetDefinition. This is a text file in the YAML format that contains information about where your dataset is hosted, what format it is stored in, how it was collected, and other metadata. You can find some examples of YAML files for existing datasets here.

You will need to draft a new YAML file and create a pull request on GitHub. This requires a GitHub account. You can use this link to create a new file directly in your browser: NEW YAML FILE

The most important fields are name, long_name, resources (the links to the data files), and experiment (metadata about the physical setup and the eye tracker):

name: MyDataset

long_name: "Long name of my dataset"

resources:
  - content: gaze
    url: "https://url.to/data/file/gaze.csv"
    filename: "gaze.csv"
    md5: "<MD5 hash of file content>"

experiment:
  eyetracker:
    sampling_rate: 1000
  screen:
    width_px: 1920
    height_px: 1080
    width_cm: 50
    height_cm: 28
    distance_cm: 60
    origin: "upper left"

The field resources contains a list of ResourceDefinition instances, which contain the necessary data to download and load a specific resource (group) of a dataset. It also includes the type of content that is contained in files of that particular resource group. In our example we only have a single type of resource: gaze (samples). Other supported content types are precomputed_events and precomputed_reading_measures.

Detailed documentation on the different fields can be found in the API references for DatasetDefinition and ResourceDefinition.

To get the MD5 hash of a data file, you can either use the command line:

md5sum path/to/gaze.csv

or Python code:

from pymovements.datasets._utils._downloads import _calculate_md5
_calculate_md5("path/to/gaze.csv")
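
If you would rather not import a private pymovements helper, the same hash can be computed with the Python standard library alone. This is a minimal sketch; the function name `md5_of_file` is our own and not part of pymovements.

```python
import hashlib
from pathlib import Path


def md5_of_file(path, chunk_size=1024 * 1024):
    """Return the hexadecimal MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with Path(path).open('rb') as file:
        # Read fixed-size chunks so that large files do not need to fit in memory.
        for chunk in iter(lambda: file.read(chunk_size), b''):
            digest.update(chunk)
    return digest.hexdigest()

# Example call (the path is a placeholder):
# md5_of_file('path/to/gaze.csv')
```

The returned string is what goes into the md5 field of the resource.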

After adding your information to the file, click “Commit changes” to create a pull request. Feel free to create a pull request even if the file is still missing some information; you (or the pymovements maintainers) will still be able to edit it later. If you are unsure about something, just add a comment on the pull request, and we will help you.

Once the pull request is created, we will start working to include your dataset. It is likely that we will need some additional information from you, so please keep an eye on the pull request. Once the pull request is completed and merged, your dataset will be included in the next release of pymovements. This process may take several weeks.

Advanced#

Follow the contributing guide to set up your development environment.

Setting up your dataset locally#

We recommend setting up and testing your dataset locally first. Please refer to the Working with a Local Dataset tutorial.

The DatasetDefinition that we get from that tutorial looks like this:

import pymovements as pm

experiment = pm.Experiment(
    sampling_rate=1000,
    screen=pm.Screen(
        width_px=1280,
        height_px=1024,
        width_cm=38,
        height_cm=30.2,
        distance_cm=68,
        origin='upper left',
    ),
)

dataset_definition = pm.DatasetDefinition(
    name='MyDataset',
    experiment=experiment,
    resources=[
        pm.ResourceDefinition(
            content='gaze',
            filename_pattern=r'trial_{text_id:d}_{page_id:d}.csv',
            filename_pattern_schema_overrides={
                'text_id': int, 'page_id': int,
            },
            load_kwargs={
                'read_csv_kwargs': {'separator': '\t'},
                'time_column': 'timestamp',
                'time_unit': 'ms',
                'pixel_columns': ['x', 'y'],
            },
        ),
    ],
)
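
To illustrate what a filename_pattern such as the one above expresses, the following sketch extracts text_id and page_id from a filename using only the standard library. It is a simplified stand-in for illustration, not the pymovements implementation: the helper `pattern_to_regex` is hypothetical and handles only `{name}` and `{name:d}` placeholders.

```python
import re


def pattern_to_regex(pattern):
    """Convert a curly-brace pattern like 'trial_{text_id:d}.csv' into a compiled regex.

    Hypothetical helper: '{name:d}' matches digits, '{name}' matches any text.
    """
    parts = []
    position = 0
    for placeholder in re.finditer(r'\{(\w+)(?::(\w))?\}', pattern):
        # Literal text between placeholders must match exactly.
        parts.append(re.escape(pattern[position:placeholder.start()]))
        name, conversion = placeholder.group(1), placeholder.group(2)
        value = r'\d+' if conversion == 'd' else r'.+?'
        parts.append(f'(?P<{name}>{value})')
        position = placeholder.end()
    parts.append(re.escape(pattern[position:]))
    return re.compile(''.join(parts))


regex = pattern_to_regex(r'trial_{text_id:d}_{page_id:d}.csv')
match = regex.fullmatch('trial_3_5.csv')
# Cast the captured strings to int, mirroring the schema overrides above.
fields = {key: int(value) for key, value in match.groupdict().items()}
print(fields)  # {'text_id': 3, 'page_id': 5}
```

Files whose names do not fit the pattern (for example `trial_a_5.csv`) produce no match, so it is worth checking that every data file in your dataset follows the pattern you declare.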

Adding resource definitions#

The dataset definition above enables you to load your local data files into pymovements data structures. However, to download online dataset resources through pymovements, each ResourceDefinition also needs the source URL.

Let’s add the URL and checksum to the DatasetDefinition of our toy dataset:

url = 'https://github.com/pymovements/pymovements-toy-dataset/archive/refs/heads/main.zip'

dataset_definition = pm.DatasetDefinition(
    name='MyDataset',
    resources=[
        pm.ResourceDefinition(
            content='gaze',
            url=url,
            filename='pymovements-toy-dataset.zip',
            md5='256901852c1c07581d375eef705855d6',
            filename_pattern=r'trial_{text_id:d}_{page_id:d}.csv',
            filename_pattern_schema_overrides={
                'text_id': int, 'page_id': int,
            },
            load_kwargs={
                'read_csv_kwargs': {'separator': '\t'},
                'time_column': 'timestamp',
                'time_unit': 'ms',
                'pixel_columns': ['x', 'y'],
            },
        ),
    ],
    experiment=experiment,
)

Note that some of the information previously defined at the definition level (has_files, filename_pattern, filename_pattern_schema_overrides) is now defined at the level of individual resources.

To get the MD5 hash of a file, you can either use the command line:

md5sum path/to/pymovements-toy-dataset.zip

or Python code:

from pymovements.datasets._utils._downloads import _calculate_md5
_calculate_md5("path/to/pymovements-toy-dataset.zip")

Let’s test if the data files can be downloaded and loaded into memory:

dataset = pm.Dataset(
    definition=dataset_definition,
    path='data/my_dataset',
)
dataset.download()

INFO:pymovements.dataset.dataset:
        You are downloading the MyDataset dataset. Please be aware that pymovements does not
        host or distribute any dataset resources and only provides a convenient interface to
        download the public dataset resources that were published by their respective authors.

        Please cite the referenced publication if you intend to use the dataset in your research.
        

Downloading https://github.com/pymovements/pymovements-toy-dataset/archive/refs/heads/main.zip to data/my_dataset/downloads/pymovements-toy-dataset.zip

Checking integrity of pymovements-toy-dataset.zip
Extracting pymovements-toy-dataset.zip to data/my_dataset/raw

Extracting archive:   0%|          | 0/23 [00:00<?, ?file/s]

Extracting archive: 100%|██████████| 23/23 [00:00<00:00, 352.38file/s]

Dataset

definition: DatasetDefinition
- acceleration_columns: None
- column_map: None
- custom_read_kwargs: None
- distance_column: None
- experiment: Experiment
  - eyetracker: EyeTracker
    - left: None
    - model: None
    - mount: None
    - right: None
    - sampling_rate: 1000
    - vendor: None
    - version: None
  - sampling_rate: 1000
  - screen: Screen
    - distance_cm: 68
    - height_cm: 30.2
    - height_px: 1024
    - origin: 'upper left'
    - width_cm: 38
    - width_px: 1280
    - x_max_dva: 15.599386487782953
    - x_min_dva: -15.599386487782953
    - y_max_dva: 12.508044410882546
    - y_min_dva: -12.508044410882546
- extract: None
- filename_format: dict (1 items)
  - gaze: 'trial_{text_id:d}_{page_id:d}.csv'
- filename_format_schema_overrides: dict (1 items)
  - gaze: dict (2 items)
    - text_id: <class 'int'>
    - page_id: <class 'int'>
- has_resources: True
- long_name: None
- mirrors: dict (0 items)
- name: 'MyDataset'
- pixel_columns: None
- position_columns: None
- resources: list (1 items)
  - ResourceDefinition
    - content: 'gaze'
    - filename: 'pymovements-toy-dataset.zip'
    - filename_pattern: 'trial_{text_id:d}_{page_id:d}.csv'
    - filename_pattern_schema_overrides: dict (2 items)
      - text_id: <class 'int'>
      - page_id: <class 'int'>
    - load_function: None
    - load_kwargs: dict (4 items)
      - read_csv_kwargs: dict (1 items)
        - separator: '\t'
      - time_column: 'timestamp'
      - (2 more)
    - md5: '256901852c1c07581d375eef705855d6'
    - mirrors: None
    - url: 'https://github.com/pymovements/pymovements-toy-dataset/archive/refs/heads/main.zip'
- time_column: None
- time_unit: None
- trial_columns: None
- velocity_columns: None
events: tuple (0 items)
fileinfo: DataFrame (0 columns, 0 rows)
  shape: (0, 0)
gaze: list (0 items)
path: PosixPath('data/my_dataset')
paths: DatasetPaths
- dataset: PosixPath('data/my_dataset')
- downloads: PosixPath('data/my_dataset/downloads')
- events: PosixPath('data/my_dataset/events')
- precomputed_events: PosixPath('data/my_dataset/precomputed_events')
- precomputed_reading_measures: PosixPath('data/my_dataset/precomputed_reading_measures')
- preprocessed: PosixPath('data/my_dataset/preprocessed')
- raw: PosixPath('data/my_dataset/raw')
- root: PosixPath('data/my_dataset')
precomputed_events: list (0 items)
precomputed_reading_measures: list (0 items)

And load the dataset into memory:

dataset.load()

Dataset

definition: DatasetDefinition
- acceleration_columns: None
- column_map: None
- custom_read_kwargs: None
- distance_column: None
- experiment: Experiment
  - eyetracker: EyeTracker
    - left: None
    - model: None
    - mount: None
    - right: None
    - sampling_rate: 1000
    - vendor: None
    - version: None
  - sampling_rate: 1000
  - screen: Screen
    - distance_cm: 68
    - height_cm: 30.2
    - height_px: 1024
    - origin: 'upper left'
    - width_cm: 38
    - width_px: 1280
    - x_max_dva: 15.599386487782953
    - x_min_dva: -15.599386487782953
    - y_max_dva: 12.508044410882546
    - y_min_dva: -12.508044410882546
- extract: None
- filename_format: dict (1 items)
  - gaze: 'trial_{text_id:d}_{page_id:d}.csv'
- filename_format_schema_overrides: dict (1 items)
  - gaze: dict (2 items)
    - text_id: <class 'int'>
    - page_id: <class 'int'>
- has_resources: True
- long_name: None
- mirrors: dict (0 items)
- name: 'MyDataset'
- pixel_columns: None
- position_columns: None
- resources: list (1 items)
  - ResourceDefinition
    - content: 'gaze'
    - filename: 'pymovements-toy-dataset.zip'
    - filename_pattern: 'trial_{text_id:d}_{page_id:d}.csv'
    - filename_pattern_schema_overrides: dict (2 items)
      - text_id: <class 'int'>
      - page_id: <class 'int'>
    - load_function: None
    - load_kwargs: dict (4 items)
      - read_csv_kwargs: dict (1 items)
        - separator: '\t'
      - time_column: 'timestamp'
      - (2 more)
    - md5: '256901852c1c07581d375eef705855d6'
    - mirrors: None
    - url: 'https://github.com/pymovements/pymovements-toy-dataset/archive/refs/heads/main.zip'
- time_column: None
- time_unit: None
- trial_columns: None
- velocity_columns: None
events: tuple (20 items)
- Events
  - frame: DataFrame (4 columns, 0 rows)
    shape: (0, 4)
    name onset offset duration
    str i64 i64 i64
  - trial_columns: None
- Events
  - frame: DataFrame (4 columns, 0 rows)
    shape: (0, 4)
    name onset offset duration
    str i64 i64 i64
  - trial_columns: None
- (18 more)

fileinfo: dict (1 items)
- gaze: DataFrame (3 columns, 20 rows)
  shape: (20, 3)
  text_id	page_id	filepath
  i64	i64	str
  0	1	"pymovements-toy-dataset-main/d…
  0	2	"pymovements-toy-dataset-main/d…
  0	3	"pymovements-toy-dataset-main/d…
  0	4	"pymovements-toy-dataset-main/d…
  0	5	"pymovements-toy-dataset-main/d…
  …	…	…
  3	1	"pymovements-toy-dataset-main/d…
  3	2	"pymovements-toy-dataset-main/d…
  3	3	"pymovements-toy-dataset-main/d…
  3	4	"pymovements-toy-dataset-main/d…
  3	5	"pymovements-toy-dataset-main/d…

gaze: list (20 items)
- Gaze
  - samples: DataFrame (4 columns, 17223 rows)
    shape: (17_223, 4)
    time	stimuli_x	stimuli_y	pixel
    i64	f64	f64	list[f64]
    1988145	-1.0	-1.0	[206.8, 152.4]
    1988146	-1.0	-1.0	[206.9, 152.1]
    1988147	-1.0	-1.0	[207.0, 151.8]
    1988148	-1.0	-1.0	[207.1, 151.7]
    1988149	-1.0	-1.0	[207.0, 151.5]
    …	…	…	…
    2005363	-1.0	-1.0	[361.0, 415.4]
    2005364	-1.0	-1.0	[358.0, 414.5]
    2005365	-1.0	-1.0	[355.8, 413.8]
    2005366	-1.0	-1.0	[353.1, 413.2]
    2005367	-1.0	-1.0	[351.2, 412.9]
  - events: Events
    - frame: DataFrame (4 columns, 0 rows)
      shape: (0, 4)
      name onset offset duration
      str i64 i64 i64
    - trial_columns: None
  - trial_columns: None
  - experiment: Experiment
    - eyetracker: EyeTracker
      - left: None
      - model: None
      - mount: None
      - right: None
      - sampling_rate: 1000
      - vendor: None
      - version: None
    - sampling_rate: 1000
    - screen: Screen
      - distance_cm: 68
      - height_cm: 30.2
      - height_px: 1024
      - origin: 'upper left'
      - width_cm: 38
      - width_px: 1280
      - x_max_dva: 15.599386487782953
      - x_min_dva: -15.599386487782953
      - y_max_dva: 12.508044410882546
      - y_min_dva: -12.508044410882546
- Gaze
  - samples: DataFrame (4 columns, 29799 rows)
    shape: (29_799, 4)
    time	stimuli_x	stimuli_y	pixel
    i64	f64	f64	list[f64]
    2008305	-1.0	-1.0	[141.4, 153.6]
    2008306	-1.0	-1.0	[141.1, 153.2]
    2008307	-1.0	-1.0	[140.7, 152.8]
    2008308	-1.0	-1.0	[140.6, 152.7]
    2008309	-1.0	-1.0	[140.5, 152.6]
    …	…	…	…
    2038099	-1.0	-1.0	[273.8, 773.8]
    2038100	-1.0	-1.0	[273.8, 774.1]
    2038101	-1.0	-1.0	[273.9, 774.5]
    2038102	-1.0	-1.0	[274.0, 774.4]
    2038103	-1.0	-1.0	[274.0, 773.9]
  - events: Events
    - frame: DataFrame (4 columns, 0 rows)
      shape: (0, 4)
      name onset offset duration
      str i64 i64 i64
    - trial_columns: None
  - trial_columns: None
  - experiment: Experiment
    - eyetracker: EyeTracker
      - left: None
      - model: None
      - mount: None
      - right: None
      - sampling_rate: 1000
      - vendor: None
      - version: None
    - sampling_rate: 1000
    - screen: Screen
      - distance_cm: 68
      - height_cm: 30.2
      - height_px: 1024
      - origin: 'upper left'
      - width_cm: 38
      - width_px: 1280
      - x_max_dva: 15.599386487782953
      - x_min_dva: -15.599386487782953
      - y_max_dva: 12.508044410882546
      - y_min_dva: -12.508044410882546
- (18 more)

path: PosixPath('data/my_dataset')
paths: DatasetPaths
- dataset: PosixPath('data/my_dataset')
- downloads: PosixPath('data/my_dataset/downloads')
- events: PosixPath('data/my_dataset/events')
- precomputed_events: PosixPath('data/my_dataset/precomputed_events')
- precomputed_reading_measures: PosixPath('data/my_dataset/precomputed_reading_measures')
- preprocessed: PosixPath('data/my_dataset/preprocessed')
- raw: PosixPath('data/my_dataset/raw')
- root: PosixPath('data/my_dataset')
precomputed_events: list (0 items)
precomputed_reading_measures: list (0 items)
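
The x_max_dva and y_max_dva values shown in the output above depend on the screen size, resolution, and viewing distance, which is why these measurements matter when describing your dataset. The sketch below reproduces the displayed numbers with the standard library. The formula is an assumption inferred from those numbers (angle from the screen center to the center of the outermost pixel), not taken from the pymovements source, and the function name `max_dva` is our own.

```python
import math


def max_dva(size_px, size_cm, distance_cm):
    """Visual angle in degrees from the screen center to the center of the outermost pixel.

    Assumed formula, inferred from the values printed in the dataset output.
    """
    half_extent_px = (size_px - 1) / 2
    half_extent_cm = half_extent_px * size_cm / size_px
    return math.degrees(math.atan2(half_extent_cm, distance_cm))


# Screen values from the toy dataset definition in this guide.
x_max_dva = max_dva(size_px=1280, size_cm=38, distance_cm=68)
y_max_dva = max_dva(size_px=1024, size_cm=30.2, distance_cm=68)
print(round(x_max_dva, 3), round(y_max_dva, 3))  # 15.599 12.508
```

If the values computed for your own setup look implausible (for example, far beyond 30 degrees on a desktop monitor), double-check the units of the screen size and distance you entered.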

Writing the YAML file#

All public datasets in the library are defined in YAML files (see here for examples). These YAML files contain exactly the same fields as the DatasetDefinition objects, and the two can be easily converted into each other.

Let’s convert the DatasetDefinition of our toy dataset to a YAML file:

dataset_definition.to_yaml('my_dataset.yaml')

Let’s check the content of the written file:

with open('my_dataset.yaml', encoding='utf-8') as f:
    print(f.read())

name: MyDataset
resources:
- content: gaze
  filename: pymovements-toy-dataset.zip
  url: https://github.com/pymovements/pymovements-toy-dataset/archive/refs/heads/main.zip
  md5: 256901852c1c07581d375eef705855d6
  filename_pattern: trial_{text_id:d}_{page_id:d}.csv
  filename_pattern_schema_overrides:
    text_id: '!int'
    page_id: '!int'
  load_kwargs:
    read_csv_kwargs:
      separator: "\t"
    time_column: timestamp
    time_unit: ms
    pixel_columns:
    - x
    - y
experiment:
  eyetracker:
    sampling_rate: 1000
  screen:
    width_px: 1280
    height_px: 1024
    width_cm: 38
    height_cm: 30.2
    distance_cm: 68
    origin: upper left

This YAML file can now be added to src/pymovements/datasets/. Commit the file in your fork of the pymovements repository and create a pull request. We will then review the pull request and request additional information or changes if necessary, so please keep an eye on the pull request. Once the pull request is completed and merged, your dataset will be included in the next release of pymovements. This process may take several weeks.

If you run into problems, feel free to create a draft pull request and explain the issue so that we can provide support.

Running integration tests#

To check whether the integration into the dataset library works properly, you can run the integration test for your dataset:

tox -e integration -- \
  'tests/integration/public_dataset_processing_test.py::test_public_dataset_processing[my_dataset]'