{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Example Aggregations"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Aggregating data with MDF"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Searches using `Forge.search()` are limited to 10,000 results. Two methods lift this restriction: `Forge.aggregate_sources()`, which retrieves every entry from the named sources, and `Forge.aggregate()`, which retrieves every entry matching a query."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [],
   "source": [
    "import json\n",
    "from mdf_forge.forge import Forge"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [],
   "source": [
    "mdf = Forge()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### aggregate_sources - NIST XPS DB\n",
    "Example: We want to collect all records from the NIST XPS Database and analyze the binding energies. This database has almost 30,000 records, which exceeds the 10,000-result search limit, so we use `aggregate_sources()`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "29190\n"
     ]
    }
   ],
   "source": [
    "# First, let's aggregate all the nist_xps_db data.\n",
    "all_entries = mdf.aggregate_sources(\"nist_xps_db\")\n",
    "print(len(all_entries))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {
    "scrolled": true
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "{\n",
      "    \"0\": 29189\n",
      "}\n"
     ]
    }
   ],
   "source": [
    "# Now, let's parse out the energy_uncertainty_ev and print the results for analysis.\n",
    "uncertainties = {}\n",
    "for record in all_entries:\n",
    "    # Count only record entries; other entries (such as the dataset entry) are skipped.\n",
    "    if record[\"mdf\"][\"resource_type\"] == \"record\":\n",
    "        # Records without the field are tallied under the default value 0.\n",
    "        unc = record.get(\"nist_xps_db_v1\", {}).get(\"energy_uncertainty_ev\", 0)\n",
    "        uncertainties[unc] = uncertainties.get(unc, 0) + 1\n",
    "print(json.dumps(uncertainties, sort_keys=True, indent=4, separators=(',', ': ')))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### aggregate - Multiple Datasets\n",
    "Example: We want to analyze how often each element is studied together with gallium (Ga), and which element is paired with it most frequently. More than 10,000 records contain gallium, so `Forge.search()` cannot return them all."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "18232\n"
     ]
    }
   ],
   "source": [
    "# First, let's aggregate everything that has \"Ga\" in the list of elements.\n",
    "all_results = mdf.aggregate(\"material.elements:Ga\")\n",
    "print(len(all_results))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {
    "scrolled": true
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "{\n",
      "    \"Ac\": 267,\n",
      "    \"Ag\": 323,\n",
      "    \"Al\": 322,\n",
      "    \"Ar\": 2,\n",
      "    \"As\": 872,\n",
      "    \"Au\": 372,\n",
      "    \"B\": 301,\n",
      "    \"Ba\": 342,\n",
      "    \"Be\": 281,\n",
      "    \"Bi\": 4172,\n",
      "    \"Br\": 38,\n",
      "    \"C\": 87,\n",
      "    \"Ca\": 370,\n",
      "    \"Cd\": 174,\n",
      "    \"Ce\": 325,\n",
      "    \"Cl\": 57,\n",
      "    \"Co\": 381,\n",
      "    \"Cr\": 315,\n",
      "    \"Cs\": 160,\n",
      "    \"Cu\": 403,\n",
      "    \"Dy\": 317,\n",
      "    \"Er\": 321,\n",
      "    \"Eu\": 304,\n",
      "    \"F\": 84,\n",
      "    \"Fe\": 2989,\n",
      "    \"Ga\": 18232,\n",
      "    \"Gd\": 156,\n",
      "    \"Ge\": 333,\n",
      "    \"H\": 159,\n",
      "    \"Hf\": 310,\n",
      "    \"Hg\": 282,\n",
      "    \"Ho\": 323,\n",
      "    \"I\": 41,\n",
      "    \"In\": 364,\n",
      "    \"Ir\": 305,\n",
      "    \"K\": 313,\n",
      "    \"La\": 312,\n",
      "    \"Li\": 469,\n",
      "    \"Lu\": 291,\n",
      "    \"Mg\": 683,\n",
      "    \"Mn\": 4357,\n",
      "    \"Mo\": 437,\n",
      "    \"N\": 137,\n",
      "    \"Na\": 339,\n",
      "    \"Nb\": 296,\n",
      "    \"Nd\": 179,\n",
      "    \"Ni\": 363,\n",
      "    \"Np\": 252,\n",
      "    \"O\": 1390,\n",
      "    \"On\": 6,\n",
      "    \"Os\": 288,\n",
      "    \"Ox\": 39,\n",
      "    \"P\": 153,\n",
      "    \"Pa\": 272,\n",
      "    \"Pb\": 278,\n",
      "    \"Pd\": 361,\n",
      "    \"Pm\": 273,\n",
      "    \"Pr\": 312,\n",
      "    \"Pt\": 338,\n",
      "    \"Pu\": 280,\n",
      "    \"Rb\": 163,\n",
      "    \"Re\": 134,\n",
      "    \"Rh\": 320,\n",
      "    \"Ru\": 304,\n",
      "    \"S\": 161,\n",
      "    \"Sb\": 327,\n",
      "    \"Sc\": 331,\n",
      "    \"Se\": 138,\n",
      "    \"Si\": 412,\n",
      "    \"Sm\": 330,\n",
      "    \"Sn\": 303,\n",
      "    \"Sr\": 221,\n",
      "    \"Ta\": 160,\n",
      "    \"Tb\": 174,\n",
      "    \"Tc\": 139,\n",
      "    \"Te\": 361,\n",
      "    \"Th\": 287,\n",
      "    \"Ti\": 211,\n",
      "    \"Tl\": 295,\n",
      "    \"Tm\": 312,\n",
      "    \"U\": 223,\n",
      "    \"V\": 1646,\n",
      "    \"Va\": 2,\n",
      "    \"W\": 259,\n",
      "    \"Xe\": 1,\n",
      "    \"Y\": 332,\n",
      "    \"Yb\": 324,\n",
      "    \"Zn\": 315,\n",
      "    \"Zr\": 167\n",
      "}\n"
     ]
    }
   ],
   "source": [
    "# Now, let's parse out the other elements in each record and keep a running tally to print out.\n",
    "elements = {}\n",
    "for record in all_results:\n",
    "    if record[\"mdf\"][\"resource_type\"] == \"record\":\n",
    "        elems = record[\"material\"][\"elements\"]\n",
    "        for elem in elems:\n",
    "            elements[elem] = elements.get(elem, 0) + 1\n",
    "print(json.dumps(elements, sort_keys=True, indent=4, separators=(',', ': ')))"
   ]
  },
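  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The tally above lists every element, but it does not single out the most frequent pairing. The cell below is a minimal sketch of one way to rank the partners. `top_partners` is a hypothetical helper, not part of `mdf_forge`, and `sample` holds a few counts copied from the output above so that the cell runs on its own. Passing the full `elements` dictionary instead should rank every partner."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Hypothetical helper (not part of mdf_forge): rank the elements that co-occur with a target.\n",
    "def top_partners(counts, target, n=3):\n",
    "    # Drop the target itself, since it appears in every matching record.\n",
    "    partners = {elem: count for elem, count in counts.items() if elem != target}\n",
    "    # Sort by count, highest first, and keep the first n pairs.\n",
    "    return sorted(partners.items(), key=lambda item: item[1], reverse=True)[:n]\n",
    "\n",
    "# A few tallies copied from the output above (illustration only).\n",
    "sample = {\"Ga\": 18232, \"Mn\": 4357, \"Bi\": 4172, \"Fe\": 2989, \"V\": 1646}\n",
    "print(top_partners(sample, \"Ga\"))"
   ]
  },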
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.6.4"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}