{ "cells": [ { "cell_type": "markdown", "id": "0", "metadata": {}, "source": [ "# Downloading Public Datasets" ] }, { "cell_type": "markdown", "id": "1", "metadata": {}, "source": [ "## What you will learn in this tutorial:\n", "\n", "* how to get an overview of the available public datasets\n", "* how to download and extract one of the available public datasets\n", "* how to customize the default directory structure" ] }, { "cell_type": "markdown", "id": "2", "metadata": {}, "source": [ "## Preparations" ] }, { "cell_type": "markdown", "id": "3", "metadata": {}, "source": [ "We import `pymovements` as the alias `pm` for convenience." ] }, { "cell_type": "code", "execution_count": null, "id": "4", "metadata": {}, "outputs": [], "source": [ "import pymovements as pm" ] }, { "cell_type": "markdown", "id": "5", "metadata": {}, "source": [ "pymovements provides a library of publicly available datasets.\n", "\n", "You can browse through the available dataset definitions here:\n", "{ref}`dataset_sec`\n", "\n", "To get the names of all currently available datasets, you can use the {py:meth}`DatasetLibrary.names()` method:\n" ] }, { "cell_type": "code", "execution_count": null, "id": "6", "metadata": {}, "outputs": [], "source": [ "pm.DatasetLibrary.names()" ] }, { "cell_type": "markdown", "id": "7", "metadata": {}, "source": [ "For this tutorial, we will limit ourselves to the {py:class}`~pymovements.datasets.ToyDataset` due to its minimal space requirements.\n", "\n", "Other datasets can be downloaded by replacing `ToyDataset` with the name of any other available dataset.\n", "\n", "If you want to get more information about a specific dataset without downloading it yet, you can use the {py:meth}`DatasetLibrary.get()` method:" ] }, { "cell_type": "code", "execution_count": null, "id": "8", "metadata": {}, "outputs": [], "source": [ "pm.DatasetLibrary.get('ToyDataset')" ] }, { "cell_type": "markdown", "id": "9", "metadata": {}, "source": [ "First, we initialize our 
public dataset by specifying its name and the root data directory.\n", "\n", "Our dataset will then be placed in a directory with the name of the dataset:" ] }, { "cell_type": "code", "execution_count": null, "id": "10", "metadata": {}, "outputs": [], "source": [ "dataset = pm.Dataset('ToyDataset', path='data/ToyDataset')\n", "\n", "dataset.path" ] }, { "cell_type": "markdown", "id": "11", "metadata": {}, "source": [ "If you only want to specify a root directory that contains all your datasets, you can pass a {py:class}`~pymovements.DatasetPaths` instance.\n", "\n", "The directory of your dataset will have the same name as in the dataset definition." ] }, { "cell_type": "code", "execution_count": null, "id": "12", "metadata": {}, "outputs": [], "source": [ "dataset_paths = pm.DatasetPaths(root='data/')\n", "dataset = pm.Dataset('ToyDataset', path=dataset_paths)\n", "\n", "dataset.path" ] }, { "cell_type": "markdown", "id": "13", "metadata": {}, "source": [ "You can also specify an alternative dataset directory for your downloaded dataset." 
] }, { "cell_type": "code", "execution_count": null, "id": "14", "metadata": {}, "outputs": [], "source": [ "dataset_paths_alt = pm.DatasetPaths(root='data/', dataset='my_dataset')\n", "dataset_alt = pm.Dataset('ToyDataset', path=dataset_paths_alt)\n", "\n", "dataset_alt.path" ] }, { "cell_type": "markdown", "id": "15", "metadata": {}, "source": [ "## Downloading" ] }, { "cell_type": "markdown", "id": "16", "metadata": {}, "source": [ "The dataset will then be downloaded by calling:" ] }, { "cell_type": "code", "execution_count": null, "id": "17", "metadata": {}, "outputs": [], "source": [ "dataset.download()" ] }, { "cell_type": "markdown", "id": "18", "metadata": {}, "source": [ "As we see from the download message, the dataset resource has been downloaded to the downloads directory.\n", "\n", "You can get the path to the downloads directory from the {py:attr}`~pymovements.DatasetPaths.downloads` attribute:" ] }, { "cell_type": "code", "execution_count": null, "id": "19", "metadata": {}, "outputs": [], "source": [ "dataset.paths.downloads" ] }, { "cell_type": "markdown", "id": "20", "metadata": {}, "source": [ "You can also specify a custom name for the downloads directory during initialization:" ] }, { "cell_type": "code", "execution_count": null, "id": "21", "metadata": {}, "outputs": [], "source": [ "dataset_paths_3 = pm.DatasetPaths(root='data/', downloads='new_downloads')\n", "dataset_3 = pm.Dataset('ToyDataset', path=dataset_paths_3)\n", "\n", "dataset_3.paths.downloads" ] }, { "cell_type": "markdown", "id": "22", "metadata": {}, "source": [ "By default, all archives are recursively extracted to `Dataset.paths.raw`:" ] }, { "cell_type": "code", "execution_count": null, "id": "23", "metadata": {}, "outputs": [], "source": [ "dataset.paths.raw" ] }, { "cell_type": "markdown", "id": "24", "metadata": {}, "source": [ "If you want to remove the downloaded archives after extraction to save some space, you can set `remove_finished` to `True`:" ] }, { "cell_type": "code", 
"execution_count": null, "id": "25", "metadata": {}, "outputs": [], "source": [ "dataset.extract(remove_finished=True)" ] }, { "cell_type": "markdown", "id": "26", "metadata": {}, "source": [ "This is also available for the `Dataset.download()` method:" ] }, { "cell_type": "code", "execution_count": null, "id": "27", "metadata": {}, "outputs": [], "source": [ "dataset.download(remove_finished=True)" ] }, { "cell_type": "markdown", "id": "28", "metadata": {}, "source": [ "## Inspecting the dataset\n", "The {py:class}`~pymovements.Dataset` class provides a method to scan the dataset files and create a fileinfo table. This is useful for getting an overview of the dataset structure, for example, to check whether all files have been downloaded correctly, or to work out how to specify a subset of files for further processing." ] }, { "cell_type": "code", "execution_count": null, "id": "29", "metadata": {}, "outputs": [], "source": [ "dataset.scan()" ] }, { "cell_type": "markdown", "id": "30", "metadata": {}, "source": [ "## Loading into memory" ] }, { "cell_type": "markdown", "id": "31", "metadata": {}, "source": [ "Based on the `fileinfo` table, we can define a subset of the dataset that we want to load into working memory.\n", "We do this by specifying a dictionary of the format `dict[str, float | int | str | list[float | int | str]]`, where the keys are column names of the fileinfo table and the values specify which files to load:\n", "\n", "`dataset.load(subset={'text_id': [1, 2], 'page_id': 1})`\n", "\n", "In this case, however, we will load the entire dataset, so we do not need to specify a subset.\n", "We simply load the data into working memory by using the {py:meth}`Dataset.load()` method without any additional arguments:" ] }, { "cell_type": "code", "execution_count": null, "id": "32", "metadata": {}, "outputs": [], "source": [ "dataset.load()" ] }, { "cell_type": "markdown", "id": "33", "metadata": {}, "source": [ "Let's verify that we have 
correctly scanned the dataset files:" ] }, { "cell_type": "code", "execution_count": null, "id": "34", "metadata": {}, "outputs": [], "source": [ "dataset.fileinfo" ] }, { "cell_type": "markdown", "id": "35", "metadata": {}, "source": [ "Wonderful, all of our data has been downloaded and loaded successfully!" ] }, { "cell_type": "markdown", "id": "36", "metadata": {}, "source": [ "## What you have learned in this tutorial:\n", "\n", "* how to initialize a public dataset\n", "* how to download and extract dataset resources\n", "* how to customize the default directory structure\n", "* how to load the dataset into your working memory" ] } ], "metadata": {}, "nbformat": 4, "nbformat_minor": 5 }