Working with a Local Dataset
In this tutorial, we will show how to use your own local dataset with the Dataset class. The Dataset class helps you manage and process your eye-tracking data.
Preparations
We import pymovements as the alias pm for convenience.
import pymovements as pm
For demonstration purposes, we will use the raw data provided by the Toy dataset, a sample dataset that comes with pymovements.
We will download the resources of this dataset into a local directory to simulate a local dataset for you.
All downloaded archive files are automatically extracted and then removed.
The directory of the dataset will be data/my_dataset.
After that we won’t use the Python class anymore and will delete the object (the files on your system will stay in place). Don’t worry if you’re confused about these lines, as they are not relevant to your use case.
Just keep in mind that we now have some files with gaze data in the directory data/my_dataset.
toy_dataset = pm.Dataset('ToyDataset', path='data/my_dataset')
toy_dataset.download(remove_finished=True)
del toy_dataset
INFO:pymovements.dataset.dataset:
You are downloading the pymovements Toy Dataset. Please be aware that pymovements does not
host or distribute any dataset resources and only provides a convenient interface to
download the public dataset resources that were published by their respective authors.
Please cite the referenced publication if you intend to use the dataset in your research.
Downloading https://github.com/pymovements/pymovements-toy-dataset/archive/refs/heads/main.zip to data/my_dataset/downloads/pymovements-toy-dataset.zip
Checking integrity of pymovements-toy-dataset.zip
Extracting pymovements-toy-dataset.zip to data/my_dataset/raw
Extracting archive: 100%|██████████| 23/23 [00:00<00:00, 353.41file/s]
Defining your Dataset
To load your dataset, you will need to specify a DatasetDefinition.
The following fields are required:
- name: the (abbreviated) name of your dataset
- experiment: the particular experiment setup
- resources: metadata on your available dataset resources
Some additional fields are optional:
- long_name: the long-form name of your dataset
Define your Experiment
To use the Dataset class, we first need to create an Experiment instance. This class represents the properties of the experiment, such as the screen dimensions and sampling rate.
experiment = pm.Experiment(
screen_width_px=1280,
screen_height_px=1024,
screen_width_cm=38,
screen_height_cm=30.2,
distance_cm=68,
origin='upper left',
sampling_rate=1000,
)
Defining your resources
Next we will define our dataset resources by setting up a ResourceDefinition.
A ResourceDefinition should always include the following fields:
- content: the type of content (e.g., gaze, precomputed_events)
- filename_pattern: the filename pattern of resource files
Some additional fields are optional but might be necessary for your dataset:
- filename_pattern_schema_overrides: the datatypes of named groups in filename_pattern
- load_function: the loading function, usually inferred automatically
- load_kwargs: additional keyword arguments that are passed to the loading function
In our tutorial dataset we only have one type of content: gaze sample data stored in CSV files. Hence we only need to set up a single ResourceDefinition.
The filename_pattern is a pattern expression used to match dataset filenames.
The named groups in the curly braces will be parsed as additional metadata.
In our tutorial dataset all files conform to the filename pattern:
filename_pattern = r'trial_{text_id:d}_{page_id:d}.csv'
This will match filenames like trial_1_2.csv and parse the values text_id=1 and page_id=2.
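To see what this matching does, here is a small sketch that mimics the curly-brace pattern with a regular expression. The actual parsing is handled by pymovements; the regex below is purely illustrative:

```python
import re

# The curly-brace pattern corresponds to a regex with named groups,
# where ':d' restricts a group to digits.
pattern = re.compile(r'trial_(?P<text_id>\d+)_(?P<page_id>\d+)\.csv')

match = pattern.fullmatch('trial_1_2.csv')
metadata = {key: int(value) for key, value in match.groupdict().items()}
print(metadata)  # {'text_id': 1, 'page_id': 2}
```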
As both text_id and page_id are numeric values, we can explicitly specify these values as int:
filename_pattern_schema_overrides = {
'text_id': int,
'page_id': int,
}
Column Definitions
The trial_columns argument can be used to specify which columns define a single trial.
This is important for correctly applying all preprocessing methods.
For this tiny single-user dataset, a trial is defined simply by text_id and page_id.
trial_columns = ['text_id', 'page_id']
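Conceptually, all rows sharing the same values in the trial columns belong to one trial, and preprocessing methods operate within each such group. A sketch with made-up sample data (not the library's implementation):

```python
from itertools import groupby

# Hypothetical gaze samples together with their parsed metadata.
samples = [
    {'text_id': 1, 'page_id': 1, 'x': 176.8},
    {'text_id': 1, 'page_id': 1, 'x': 176.7},
    {'text_id': 1, 'page_id': 2, 'x': 640.0},
]

trial_columns = ['text_id', 'page_id']

def trial_key(sample):
    return tuple(sample[column] for column in trial_columns)

# Grouping by the trial columns keeps computations (e.g. velocities)
# from running across trial boundaries.
trials = {key: list(group)
          for key, group in groupby(sorted(samples, key=trial_key), key=trial_key)}
print(list(trials))  # [(1, 1), (1, 2)]
```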
The time_column and pixel_columns arguments can be used to correctly map the columns in your dataframes. If the time unit differs from the default of milliseconds (ms), you must also specify time_unit for correct computations.
Depending on the content of your dataset, you can alternatively also provide position_columns, velocity_columns and acceleration_columns.
Specifying these columns is needed for correctly applying preprocessing methods. For example, if you want to apply the pix2deg() method, you will need to specify pixel_columns accordingly.
If your dataset has gaze positions available only in degrees of visual angle, you have to specify the position_columns instead.
time_column = 'timestamp'
time_unit = 'ms'
pixel_columns = ['x', 'y']
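Assuming the raw files look roughly like the made-up snippet below, these arguments tell pymovements that the timestamp column holds time in milliseconds and that the x and y columns hold pixel coordinates:

```python
import csv
import io

# A made-up, tab-separated sample in the shape of the toy dataset files.
raw = 'timestamp\tx\ty\n2415266\t176.8\t140.2\n2415267\t176.7\t139.8\n'

rows = list(csv.DictReader(io.StringIO(raw), delimiter='\t'))

# time_column='timestamp' selects the first column;
# pixel_columns=['x', 'y'] are combined into a single pixel coordinate.
first = rows[0]
print(first['timestamp'], [float(first['x']), float(first['y'])])
```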
Setting up loading function parameters
Now we must set up the parameters for our loading function.
As the content is gaze and the filename extension of the filename_pattern is .csv, the loading function is automatically inferred to be from_csv.
In case the loading function cannot be automatically inferred from your filename_pattern, you will have to specify it explicitly:
load_function = 'from_csv'
Have a look at the from_csv() reference to see what additional parameters you can set up.
We will use our defined values for time_column, time_unit and pixel_columns.
As our csv files are tab separated, we need to specify that separator via load_kwargs:
load_kwargs = {
'time_column': time_column,
'time_unit': time_unit,
'pixel_columns': pixel_columns,
'read_csv_kwargs': {'separator': '\t'},
}
We can now initialize our ResourceDefinition. The content keyword for our gaze sample files is ‘gaze’.
resource_definition = pm.ResourceDefinition(
content='gaze',
filename_pattern=filename_pattern,
filename_pattern_schema_overrides=filename_pattern_schema_overrides,
load_function=load_function,
load_kwargs=load_kwargs,
)
Define and load the Dataset
Next we use all these definitions to create a DatasetDefinition, passing in the dataset name, the Experiment instance, and our resource definitions.
dataset_definition = pm.DatasetDefinition(
name='my_dataset',
experiment=experiment,
resources=[resource_definition],
)
Finally, we create a Dataset instance by using the DatasetDefinition and specifying the directory path.
dataset = pm.Dataset(
definition=dataset_definition,
path='data/my_dataset/',
)
If you have a root data directory that holds all your local datasets, you can additionally define the directory structure of the dataset via DatasetPaths.
The dataset, raw, preprocessed, and events parameters define the names of the directories for the dataset, raw data, preprocessed data, and events data, respectively.
dataset_paths = pm.DatasetPaths(
root='data/',
raw='raw',
preprocessed='preprocessed',
events='events',
)
dataset = pm.Dataset(
definition=dataset_definition,
path=dataset_paths,
)
Now let’s load the dataset into memory. Here we select a subset including the first page of texts with ID 1 and 2.
subset = {
'text_id': [1, 2],
'page_id': 1,
}
dataset.load(subset=subset)
Dataset (output abbreviated):
fileinfo: shape: (2, 3)
text_id  page_id  filepath
i64      i64      str
1        1        "pymovements-toy-dataset-main/d…
2        1        "pymovements-toy-dataset-main/d…
gaze: list (2 items)
gaze[0].samples: shape: (23_054, 4) with columns time, stimuli_x, stimuli_y, pixel
gaze[1].samples: shape: (29_660, 4) with columns time, stimuli_x, stimuli_y, pixel
events: empty DataFrames with columns name, onset, offset, duration
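The subset argument filters on the metadata parsed from the filename pattern. Conceptually, a file is kept if every given key matches, where a scalar value means equality and a list means membership. This sketch is purely illustrative and not the library's actual implementation:

```python
# Parsed file metadata as produced by the filename pattern (illustrative).
fileinfo = [
    {'text_id': 1, 'page_id': 1},
    {'text_id': 1, 'page_id': 2},
    {'text_id': 2, 'page_id': 1},
    {'text_id': 3, 'page_id': 1},
]

subset = {'text_id': [1, 2], 'page_id': 1}

def matches(info, subset):
    # A scalar value means equality; a list means membership.
    for key, value in subset.items():
        allowed = value if isinstance(value, list) else [value]
        if info[key] not in allowed:
            return False
    return True

selected = [info for info in fileinfo if matches(info, subset)]
print(selected)  # [{'text_id': 1, 'page_id': 1}, {'text_id': 2, 'page_id': 1}]
```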
Use the Dataset
Once we have created the Dataset instance, we can use its methods to preprocess and analyze data in our local dataset.
dataset.gaze[0]
shape: (23_054, 4)
time     stimuli_x  stimuli_y  pixel
i64      f64        f64        list[f64]
2415266  -1.0       -1.0       [176.8, 140.2]
2415267  -1.0       -1.0       [176.7, 139.8]
2415268  -1.0       -1.0       [176.7, 139.3]
2415269  -1.0       -1.0       [176.6, 139.3]
2415270  -1.0       -1.0       [176.7, 139.3]
…        …          …          …
2438315  -1.0       -1.0       [649.9, 633.9]
2438316  -1.0       -1.0       [650.1, 633.7]
2438317  -1.0       -1.0       [650.2, 633.5]
2438318  -1.0       -1.0       [650.0, 633.2]
2438319  -1.0       -1.0       [649.7, 633.1]
Here we use the pix2deg() method to convert the pixel coordinates to degrees of visual angle.
dataset.pix2deg()
dataset.gaze[0]
shape: (23_054, 5)
time     stimuli_x  stimuli_y  pixel           position
i64      f64        f64        list[f64]       list[f64]
2415266  -1.0       -1.0       [176.8, 140.2]  [-11.420403, -9.148145]
2415267  -1.0       -1.0       [176.7, 139.8]  [-11.422806, -9.157834]
2415268  -1.0       -1.0       [176.7, 139.3]  [-11.422806, -9.169943]
2415269  -1.0       -1.0       [176.6, 139.3]  [-11.42521, -9.169943]
2415270  -1.0       -1.0       [176.7, 139.3]  [-11.422806, -9.169943]
…        …          …          …               …
2438315  -1.0       -1.0       [649.9, 633.9]  [0.260146, 3.038748]
2438316  -1.0       -1.0       [650.1, 633.7]  [0.265149, 3.033792]
2438317  -1.0       -1.0       [650.2, 633.5]  [0.26765, 3.028836]
2438318  -1.0       -1.0       [650.0, 633.2]  [0.262648, 3.021402]
2438319  -1.0       -1.0       [649.7, 633.1]  [0.255144, 3.018924]
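The conversion rests on the experiment's screen geometry: a pixel offset from the screen center is converted to centimeters and then to a visual angle via the arctangent. A simplified sketch for the first sample's x coordinate, assuming pixel coordinates are measured from the upper-left pixel center (which matches the position values above):

```python
import math

# Screen geometry from the Experiment definition.
screen_width_px = 1280
screen_width_cm = 38
distance_cm = 68

# Offset from the screen center, in pixels (origin='upper left').
offset_px = 176.8 - (screen_width_px - 1) / 2

# Convert to centimeters, then to degrees of visual angle.
offset_cm = offset_px * screen_width_cm / screen_width_px
deg_x = math.degrees(math.atan2(offset_cm, distance_cm))
print(round(deg_x, 6))  # -11.420403
```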
We can use the pos2vel() method to calculate the velocity of the gaze position.
dataset.pos2vel(method='savitzky_golay', degree=2, window_length=7)
dataset.gaze[0]
shape: (23_054, 6)
time     stimuli_x  stimuli_y  pixel           position                 velocity
i64      f64        f64        list[f64]       list[f64]                list[f64]
2415266  -1.0       -1.0       [176.8, 140.2]  [-11.420403, -9.148145]  [-0.772495, -4.238523]
2415267  -1.0       -1.0       [176.7, 139.8]  [-11.422806, -9.157834]  [-0.686663, -4.671012]
2415268  -1.0       -1.0       [176.7, 139.3]  [-11.422806, -9.169943]  [-0.257498, -3.806023]
2415269  -1.0       -1.0       [176.6, 139.3]  [-11.42521, -9.169943]   [1.459231, -1.557032]
2415270  -1.0       -1.0       [176.7, 139.3]  [-11.422806, -9.169943]  [4.034446, 1.556983]
…        …          …          …               …                        …
2438315  -1.0       -1.0       [649.9, 633.9]  [0.260146, 3.038748]     [0.268004, -3.451512]
2438316  -1.0       -1.0       [650.1, 633.7]  [0.265149, 3.033792]     [-0.357339, -3.982536]
2438317  -1.0       -1.0       [650.2, 633.5]  [0.26765, 3.028836]      [-0.982682, -3.982549]
2438318  -1.0       -1.0       [650.0, 633.2]  [0.262648, 3.021402]     [-1.69736, -3.54005]
2438319  -1.0       -1.0       [649.7, 633.1]  [0.255144, 3.018924]     [-2.233368, -2.389544]
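pos2vel differentiates the position signal over time. The Savitzky-Golay method fits a local polynomial to each window, but the core idea can be sketched with a plain central difference (illustrative only, with made-up position values; not the library's implementation):

```python
# Positions in dva, sampled at 1000 Hz (made-up values).
positions = [0.0, 0.1, 0.3, 0.6, 1.0]
sampling_rate = 1000

# Central difference: v[i] = (p[i+1] - p[i-1]) / 2 * sampling_rate, in dva/s.
velocities = [
    (positions[i + 1] - positions[i - 1]) / 2 * sampling_rate
    for i in range(1, len(positions) - 1)
]
print(velocities)  # [150.0, 250.0, 350.0]
```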