{ "cells": [ { "cell_type": "markdown", "id": "0", "metadata": {}, "source": [ "# Downloading Public Datasets" ] }, { "cell_type": "markdown", "id": "1", "metadata": {}, "source": [ "## What you will learn in this tutorial:\n", "\n", "* how to get an overview of the available public datasets\n", "* how to download and extract one of the available public datasets\n", "* how to customize the default directory structure" ] }, { "cell_type": "markdown", "id": "2", "metadata": {}, "source": [ "## Preparations" ] }, { "cell_type": "markdown", "id": "3", "metadata": {}, "source": [ "We import `pymovements` as the alias `pm` for convenience." ] }, { "cell_type": "code", "execution_count": null, "id": "4", "metadata": {}, "outputs": [], "source": [ "import pymovements as pm" ] }, { "cell_type": "markdown", "id": "5", "metadata": {}, "source": [ "pymovements provides a library of publicly available datasets.\n", "\n", "You can browse through the available dataset definitions here:\n", "{ref}`dataset_sec`\n", "\n", "To get the names of all currently available datasets, you can use the {py:meth}`DatasetLibrary.names()` method:\n" ] }, { "cell_type": "code", "execution_count": null, "id": "6", "metadata": {}, "outputs": [], "source": [ "pm.DatasetLibrary.names()" ] }, { "cell_type": "markdown", "id": "7", "metadata": {}, "source": [ "For this tutorial, we will limit ourselves to the {py:class}`~pymovements.datasets.ToyDataset` due to its minimal space requirements.\n", "\n", "Other datasets can be downloaded by replacing `ToyDataset` with the name of any other available dataset.\n", "\n", "If you want to get more information about a specific dataset without downloading it yet, you can use the {py:meth}`DatasetLibrary.get()` method:" ] }, { "cell_type": "code", "execution_count": null, "id": "8", "metadata": {}, "outputs": [], "source": [ "pm.DatasetLibrary.get('ToyDataset')" ] }, { "cell_type": "markdown", "id": "9", "metadata": {}, "source": [ "First, we initialize our 
public dataset by specifying its name and the root data directory.\n", "\n", "Our dataset will then be placed in a directory with the name of the dataset:" ] }, { "cell_type": "code", "execution_count": null, "id": "10", "metadata": {}, "outputs": [], "source": [ "dataset = pm.Dataset('ToyDataset', path='data/ToyDataset')\n", "\n", "dataset.path" ] }, { "cell_type": "markdown", "id": "11", "metadata": {}, "source": [ "If you only want to specify a root directory that contains all your datasets, you can pass a {py:class}`~pymovements.DatasetPaths` instance.\n", "\n", "The directory of your dataset will have the same name as in the dataset definition." ] }, { "cell_type": "code", "execution_count": null, "id": "12", "metadata": {}, "outputs": [], "source": [ "dataset_paths = pm.DatasetPaths(root='data/')\n", "dataset = pm.Dataset('ToyDataset', path=dataset_paths)\n", "\n", "dataset.path" ] }, { "cell_type": "markdown", "id": "13", "metadata": {}, "source": [ "You can also specify an alternative dataset directory for your downloaded dataset." 
] }, { "cell_type": "code", "execution_count": null, "id": "14", "metadata": {}, "outputs": [], "source": [ "dataset_paths_alt = pm.DatasetPaths(root='data/', dataset='my_dataset')\n", "dataset_alt = pm.Dataset('ToyDataset', path=dataset_paths_alt)\n", "\n", "dataset_alt.path" ] }, { "cell_type": "markdown", "id": "15", "metadata": {}, "source": [ "## Downloading" ] }, { "cell_type": "markdown", "id": "16", "metadata": {}, "source": [ "The dataset will then be downloaded by calling:" ] }, { "cell_type": "code", "execution_count": null, "id": "17", "metadata": {}, "outputs": [], "source": [ "dataset.download()" ] }, { "cell_type": "markdown", "id": "18", "metadata": {}, "source": [ "As we see from the download message, the dataset resource has been downloaded to the downloads directory.\n", "\n", "You can get the path to the downloads directory from the {py:attr}`~pymovements.DatasetPaths.downloads` attribute:" ] }, { "cell_type": "code", "execution_count": null, "id": "19", "metadata": {}, "outputs": [], "source": [ "dataset.paths.downloads" ] }, { "cell_type": "markdown", "id": "20", "metadata": {}, "source": [ "You can also specify a custom name for the downloads directory during initialization:" ] }, { "cell_type": "code", "execution_count": null, "id": "21", "metadata": {}, "outputs": [], "source": [ "dataset_paths_3 = pm.DatasetPaths(root='data/', downloads='new_downloads')\n", "dataset_3 = pm.Dataset('ToyDataset', path=dataset_paths_3)\n", "\n", "dataset_3.paths.downloads" ] }, { "cell_type": "markdown", "id": "22", "metadata": {}, "source": [ "By default, all archives are recursively extracted to `Dataset.paths.raw`:" ] }, { "cell_type": "code", "execution_count": null, "id": "23", "metadata": {}, "outputs": [], "source": [ "dataset.paths.raw" ] }, { "cell_type": "markdown", "id": "24", "metadata": {}, "source": [ "If you want to remove the downloaded archives after extraction to save some space, you can set `remove_finished` to `True`:" ] }, { "cell_type": "code", 
"execution_count": null, "id": "25", "metadata": {}, "outputs": [], "source": [ "dataset.extract(remove_finished=True)" ] }, { "cell_type": "markdown", "id": "26", "metadata": {}, "source": [ "This is also available for the `Dataset.download()` method:" ] }, { "cell_type": "code", "execution_count": null, "id": "27", "metadata": {}, "outputs": [], "source": [ "dataset.download(remove_finished=True)" ] }, { "cell_type": "markdown", "id": "28", "metadata": {}, "source": [ "## Inspecting the dataset\n", "The {py:class}`~pymovements.Dataset` class provides a method to scan the dataset files and create a fileinfo table. This is useful for getting an overview of the dataset structure, for example, to check whether all files have been downloaded correctly, or to work out how to specify a subset of files for further processing." ] }, { "cell_type": "code", "execution_count": null, "id": "29", "metadata": {}, "outputs": [], "source": [ "dataset.scan()" ] }, { "cell_type": "markdown", "id": "30", "metadata": {}, "source": [ "## Loading into memory" ] }, { "cell_type": "markdown", "id": "31", "metadata": {}, "source": [ "Based on the `fileinfo` table, we can define a subset of the dataset that we want to load into working memory.\n", "We do this by specifying a dictionary of the format `dict[str, float | int | str | list[float | int | str]]`, where the keys are column names of the fileinfo table and the values specify which files to load:\n", "\n", "`dataset.load(subset={'text_id': [1, 2], 'page_id': 1})`\n", "\n", "In this case, however, we will load the entire dataset, so we do not need to specify a subset.\n", "We simply load the data into working memory by using the {py:meth}`Dataset.load()` method without any additional arguments:" ] }, { "cell_type": "code", "execution_count": null, "id": "32", "metadata": {}, "outputs": [], "source": [ "dataset.load()" ] }, { "cell_type": "markdown", "id": "33", "metadata": {}, "source": [ "Let's verify that we have 
correctly scanned the dataset files:" ] }, { "cell_type": "code", "execution_count": null, "id": "34", "metadata": {}, "outputs": [], "source": [ "dataset.fileinfo" ] }, { "cell_type": "markdown", "id": "35", "metadata": {}, "source": [ "Wonderful, all of our data has been downloaded and loaded successfully!" ] }, { "cell_type": "markdown", "id": "36", "metadata": {}, "source": [ "## What you have learned in this tutorial:\n", "\n", "* how to initialize a public dataset\n", "* how to download and extract dataset resources\n", "* how to customize the default directory structure\n", "* how to load the dataset into your working memory" ] } ], "metadata": {}, "nbformat": 4, "nbformat_minor": 5 }