Example Statistics - MDF Datasets
Example: We want to know how many datasets are in MDF and which datasets have the most records.
Note: This example is not kept up-to-date with the latest statistics.
If you want the current MDF statistics, you must run this code yourself.
[1]:
from tqdm import tqdm
import pandas as pd
from mdf_forge.forge import Forge
[2]:
mdf = Forge()
[3]:
# First, let's search for all the datasets. There are fewer than 10,000 currently, so `search()` will work fine.
res = mdf.search("mdf.resource_type:dataset", advanced=True)
# Now, let's pull out the source_name, title, and number of records for each dataset.
mdf_resources = []
for r in tqdm(res):
    q = "mdf.resource_type:record AND mdf.source_name:" + r["mdf"]["source_name"]
    # limit=0 returns no entries; we only need the match count from `info`.
    _, info = mdf.search(q, advanced=True, info=True, limit=0)
    mdf_resources.append((r["mdf"]["source_name"], r["dc"]["titles"][0]["title"], info["total_query_matches"]))
df = pd.DataFrame(mdf_resources, columns=['source_name', 'title', 'num_records'])
100%|██████████| 373/373 [03:21<00:00, 1.85it/s]
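The per-dataset count in the loop above can be pulled out into a small helper, which makes it possible to check the logic without a network connection. This is only a sketch: `count_records` and `fake_search` are hypothetical names that are not part of `mdf_forge`, and the helper assumes that `search` behaves like the `mdf.search` call in the previous cell, returning a `(results, info)` pair when `info=True`.

```python
def count_records(search, source_name):
    """Return the number of records indexed for one dataset.

    `search` is any callable that follows the calling convention of the
    `mdf.search` call used in this example (assumed to return
    `(results, info)` when `info=True`).
    """
    query = "mdf.resource_type:record AND mdf.source_name:" + source_name
    # limit=0 requests no result entries; only the match count is needed.
    _, info = search(query, advanced=True, info=True, limit=0)
    return info["total_query_matches"]


def fake_search(query, advanced=False, info=False, limit=None):
    """Offline stand-in for `mdf.search` (hypothetical, for illustration only).

    The counts are copied from the output table in this example.
    """
    counts = {"oqmd": 395348, "jarvis": 26559}
    # The source name is whatever follows the last colon in the query.
    source_name = query.rsplit(":", 1)[1]
    return [], {"total_query_matches": counts.get(source_name, 0)}


oqmd_count = count_records(fake_search, "oqmd")
print(oqmd_count)
```

With a live connection, passing `mdf.search` in place of `fake_search` should give the current count, which will likely differ from the snapshot shown here.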
[4]:
# Finally, we can print the data we gathered.
print("Number of data resources: {n_datasets}".format(n_datasets=len(df)))
df.sort_values(by="num_records", ascending=False).head(15)
Number of data resources: 373
[4]:
| | source_name | title | num_records |
---|---|---|---|
372 | sstein_stein_bandgap_2019 | Machine learning of optical properties of mate... | 478111 |
78 | oqmd | The Open Quantum Materials Database | 395348 |
338 | stein_bandgap_2019 | Machine learning of optical properties of mate... | 180900 |
75 | h2o_13 | Machine-learning approach for one- and two-bod... | 45482 |
74 | ab_initio_solute_database | High-throughput Ab-initio Dilute Solute Diffus... | 31488 |
249 | nist_xps_db | NIST X-ray Photoelectron Spectroscopy Database | 29189 |
4 | jarvis | JARVIS - Joint Automated Repository for Variou... | 26559 |
6 | amcs | The American Mineralogist Crystal Structure Da... | 19842 |
330 | w_14 | Accuracy and transferability of Gaussian appro... | 9693 |
76 | bfcc13 | Cluster expansion made easy with Bayesian comp... | 3783 |
246 | cip | Evaluation and comparison of classical interat... | 3291 |
2 | sluschi | Solid and Liquid in Ultra Small Coexistence wi... | 1618 |
331 | surface_crystal_energy | Data from: Surface energies of elemental crystals | 1216 |
5 | khazana_polymer | Khazana (Polymer) | 1073 |
327 | mdr_item_1496 | Ultrahigh Carbon Steel Micrographs | 1007 |
[5]:
# Bonus: How many records are in MDF in total?
df["num_records"].sum()
[5]:
1230958
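As a follow-up to the total, one might ask how concentrated the records are in the largest datasets. The sketch below rebuilds a tiny DataFrame from the top three rows of the table above instead of querying MDF, so the numbers are this snapshot's values and not current statistics. The names `top`, `total_records`, and `top_share` are introduced here for illustration.

```python
import pandas as pd

# Top three rows of the table above (snapshot values, not live data).
top = pd.DataFrame(
    [
        ("sstein_stein_bandgap_2019", 478111),
        ("oqmd", 395348),
        ("stein_bandgap_2019", 180900),
    ],
    columns=["source_name", "num_records"],
)

# Total record count reported by df["num_records"].sum() above.
total_records = 1230958

# Fraction of all records held by each of the three datasets.
top["share"] = top["num_records"] / total_records
top_share = top["share"].sum()

print(top.to_string(index=False))
print("Top 3 share of all records: {:.1%}".format(top_share))
```

In this snapshot the three largest datasets account for roughly 86% of all records. With the full `df` from the earlier cell, the same column could be computed as `df["num_records"] / df["num_records"].sum()`.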