{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Example Statistics - MDF Datasets\n", "Example: We want to know how many datasets are in MDF and which datasets have the most records.\n", "\n", "**Note: This example is not kept up to date with the latest statistics.**\n", "\n", "If you want the current MDF statistics, you must run this code yourself." ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "from tqdm import tqdm\n", "import pandas as pd\n", "from mdf_forge.forge import Forge" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "mdf = Forge()" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "100%|██████████| 373/373 [03:21<00:00, 1.85it/s]\n" ] } ], "source": [ "# First, let's search for all the datasets. There are currently fewer than 10,000, so `search()` will work fine.\n", "res = mdf.search(\"mdf.resource_type:dataset\", advanced=True)\n", "# Now, let's pull out the source_name, title, and number of records for each dataset.\n", "mdf_resources = []\n", "for r in tqdm(res):\n", "    q = \"mdf.resource_type:record AND mdf.source_name:\" + r[\"mdf\"][\"source_name\"]\n", "    # limit=0: we only need the match count from `info`, not the records themselves.\n", "    _, info = mdf.search(q, advanced=True, info=True, limit=0)\n", "    mdf_resources.append((r['mdf']['source_name'], r['dc'][\"titles\"][0]['title'], info[\"total_query_matches\"]))\n", "df = pd.DataFrame(mdf_resources, columns=['source_name', 'title', 'num_records'])" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Number of data resources: 373\n" ] }, { "data": { "text/html": [ "
\n", "| | source_name | title | num_records |\n", "
|---|---|---|---|\n", "
| 372 | sstein_stein_bandgap_2019 | Machine learning of optical properties of mate... | 478111 |\n", "
| 78 | oqmd | The Open Quantum Materials Database | 395348 |\n", "
| 338 | stein_bandgap_2019 | Machine learning of optical properties of mate... | 180900 |\n", "
| 75 | h2o_13 | Machine-learning approach for one- and two-bod... | 45482 |\n", "
| 74 | ab_initio_solute_database | High-throughput Ab-initio Dilute Solute Diffus... | 31488 |\n", "
| 249 | nist_xps_db | NIST X-ray Photoelectron Spectroscopy Database | 29189 |\n", "
| 4 | jarvis | JARVIS - Joint Automated Repository for Variou... | 26559 |\n", "
| 6 | amcs | The American Mineralogist Crystal Structure Da... | 19842 |\n", "
| 330 | w_14 | Accuracy and transferability of Gaussian appro... | 9693 |\n", "
| 76 | bfcc13 | Cluster expansion made easy with Bayesian comp... | 3783 |\n", "
| 246 | cip | Evaluation and comparison of classical interat... | 3291 |\n", "
| 2 | sluschi | Solid and Liquid in Ultra Small Coexistence wi... | 1618 |\n", "
| 331 | surface_crystal_energy | Data from: Surface energies of elemental crystals | 1216 |\n", "
| 5 | khazana_polymer | Khazana (Polymer) | 1073 |\n", "
| 327 | mdr_item_1496 | Ultrahigh Carbon Steel Micrographs | 1007 |\n", "