{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Example Aggregations" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Aggregating data with MDF" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Searches using `Forge.search()` are limited to 10,000 results. However, there are two methods to circumvent this restriction: `Forge.aggregate_source()` and `Forge.aggregate()`." ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "import json\n", "from mdf_forge.forge import Forge" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "mdf = Forge()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### aggregate_source - NIST XPS DB\n", "Example: We want to collect all records from the NIST XPS Database and analyze the binding energies. This database has almost 30,000 records, so we have to use `aggregate()`." ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "29190\n" ] } ], "source": [ "# First, let's aggregate all the nist_xps_db data.\n", "all_entries = mdf.aggregate_sources(\"nist_xps_db\")\n", "print(len(all_entries))" ] }, { "cell_type": "code", "execution_count": 4, "metadata": { "scrolled": true }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "{\n", " \"0\": 29189\n", "}\n" ] } ], "source": [ "# Now, let's parse out the enery_uncertainty_ev and print the results for analysis.\n", "uncertainties = {}\n", "for record in all_entries:\n", " if record[\"mdf\"][\"resource_type\"] == \"record\":\n", " unc = record.get(\"nist_xps_db_v1\", {}).get(\"energy_uncertainty_ev\", 0)\n", " if not uncertainties.get(unc):\n", " uncertainties[unc] = 1\n", " else:\n", " uncertainties[unc] += 1\n", "print(json.dumps(uncertainties, sort_keys=True, indent=4, separators=(',', ': ')))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### aggregate - Multiple Datasets\n", "Example: We want to analyze how often elements are studied with Gallium (Ga), and what the most frequent elemental pairing is. There are more than 10,000 records containing Gallium data." ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "18232\n" ] } ], "source": [ "# First, let's aggregate everything that has \"Ga\" in the list of elements.\n", "all_results = mdf.aggregate(\"material.elements:Ga\")\n", "print(len(all_results))" ] }, { "cell_type": "code", "execution_count": 6, "metadata": { "scrolled": true }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "{\n", " \"Ac\": 267,\n", " \"Ag\": 323,\n", " \"Al\": 322,\n", " \"Ar\": 2,\n", " \"As\": 872,\n", " \"Au\": 372,\n", " \"B\": 301,\n", " \"Ba\": 342,\n", " \"Be\": 281,\n", " \"Bi\": 4172,\n", " \"Br\": 38,\n", " \"C\": 87,\n", " \"Ca\": 370,\n", " \"Cd\": 174,\n", " \"Ce\": 325,\n", " \"Cl\": 57,\n", " \"Co\": 381,\n", " \"Cr\": 315,\n", " \"Cs\": 160,\n", " \"Cu\": 403,\n", " \"Dy\": 317,\n", " \"Er\": 321,\n", " \"Eu\": 304,\n", " \"F\": 84,\n", " \"Fe\": 2989,\n", " \"Ga\": 18232,\n", " \"Gd\": 156,\n", " \"Ge\": 333,\n", " \"H\": 159,\n", " \"Hf\": 310,\n", " \"Hg\": 282,\n", " \"Ho\": 323,\n", " \"I\": 41,\n", " \"In\": 364,\n", " \"Ir\": 305,\n", " \"K\": 313,\n", " \"La\": 312,\n", " \"Li\": 469,\n", " \"Lu\": 291,\n", " \"Mg\": 683,\n", " \"Mn\": 4357,\n", " \"Mo\": 437,\n", " \"N\": 137,\n", " \"Na\": 339,\n", " \"Nb\": 296,\n", " \"Nd\": 179,\n", " \"Ni\": 363,\n", " \"Np\": 252,\n", " \"O\": 1390,\n", " \"On\": 6,\n", " \"Os\": 288,\n", " \"Ox\": 39,\n", " \"P\": 153,\n", " \"Pa\": 272,\n", " \"Pb\": 278,\n", " \"Pd\": 361,\n", " \"Pm\": 273,\n", " \"Pr\": 312,\n", " \"Pt\": 338,\n", " \"Pu\": 280,\n", " \"Rb\": 163,\n", " \"Re\": 134,\n", " \"Rh\": 320,\n", " \"Ru\": 304,\n", " \"S\": 161,\n", " \"Sb\": 327,\n", " \"Sc\": 331,\n", " \"Se\": 138,\n", " \"Si\": 412,\n", " \"Sm\": 330,\n", " \"Sn\": 303,\n", " \"Sr\": 221,\n", " \"Ta\": 160,\n", " \"Tb\": 174,\n", " \"Tc\": 139,\n", " \"Te\": 361,\n", " \"Th\": 287,\n", " \"Ti\": 211,\n", " \"Tl\": 295,\n", " \"Tm\": 312,\n", " \"U\": 223,\n", " \"V\": 1646,\n", " \"Va\": 2,\n", " \"W\": 259,\n", " \"Xe\": 1,\n", " \"Y\": 332,\n", " \"Yb\": 324,\n", " \"Zn\": 315,\n", " \"Zr\": 167\n", "}\n" ] } ], "source": [ "# Now, let's parse out the other elements in each record and keep a running tally to print out.\n", "elements = {}\n", "for record in all_results:\n", " if record[\"mdf\"][\"resource_type\"] == \"record\":\n", " elems = record[\"material\"][\"elements\"]\n", " for elem in elems:\n", " if elem in elements.keys():\n", " elements[elem] += 1\n", " else:\n", " elements[elem] = 1\n", "print(json.dumps(elements, sort_keys=True, indent=4, separators=(',', ': ')))" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.4" } }, "nbformat": 4, "nbformat_minor": 2 }