Part 4 - General Helper Functions

[1]:
from mdf_forge.forge import Forge
[2]:
mdf = Forge()

Generally Useful Help

current_query

You can see the query you’re currently building with current_query().

Note that your query may be enclosed in parentheses automatically. This does not alter the results of the query.

[3]:
mdf.match_field("mdf.source_name", "oqmd")
mdf.current_query()
[3]:
'(mdf.source_name:oqmd)'

reset_query

If you have a query in memory that you don’t want, you can use reset_query() to start a new query. This method will clear the current query entirely.

[4]:
mdf.reset_query()
[5]:
mdf.current_query()
[5]:
''

Query info

You can build a query with exclude_field() and match_field() and execute it with search(). To learn more about the query, including the actual query string that was sent, pass info=True to search().

[6]:
mdf.exclude_field("mdf.source_name", "sluschi").match_field("material.elements", "Al").exclude_field("mdf.source_name", "oqmd")
res, info = mdf.search(limit=10, info=True)

When you use the info=True argument, search() will return a tuple instead of a list. The first element in the tuple will be the same list of results you’re used to, but the second tuple element will be a dictionary of query info.

[7]:
res[0]
[7]:
{'crystal_structure': {'number_of_atoms': 108.0,
  'space_group_number': 225,
  'stoichiometry': 'A',
  'volume': 1779.162},
 'files': [{'data_type': 'ASCII text',
   'filename': 'INCAR',
   'globus': 'globus://e38ee745-6d04-11e5-ba46-22000b92c6ec/MDF/mdf_connect/prod/data/ab_initio_solute_database_v1-2/data/FCC_solute_AlCu_20140918T204831/perfect_stat/INCAR',
   'length': 169,
   'mime_type': 'text/plain',
   'sha512': 'da3b28318b6c8496dda80d81f89176edc55997c2b75dafbcf92fdd8bb6c30d0dc27d2c3cfff8383e541ccd85bd84629da60b1e68da6411b5974f31bd85de0f8a',
   'url': 'https://e38ee745-6d04-11e5-ba46-22000b92c6ec.e.globus.org/MDF/mdf_connect/prod/data/ab_initio_solute_database_v1-2/data/FCC_solute_AlCu_20140918T204831/perfect_stat/INCAR'},
  {'data_type': 'ASCII text',
   'filename': 'CONTCAR',
   'globus': 'globus://e38ee745-6d04-11e5-ba46-22000b92c6ec/MDF/mdf_connect/prod/data/ab_initio_solute_database_v1-2/data/FCC_solute_AlCu_20140918T204831/perfect_stat/CONTCAR',
   'length': 3348,
   'mime_type': 'text/plain',
   'sha512': '613498249ad2d01dc3cb4aa37a41bd63ba2c95a599ab0b0cb43f4e30c3f5381a8c2a1404d3c20d24da434b0785189bb603859a99f5abbcc242cc87cc9c3e0ca8',
   'url': 'https://e38ee745-6d04-11e5-ba46-22000b92c6ec.e.globus.org/MDF/mdf_connect/prod/data/ab_initio_solute_database_v1-2/data/FCC_solute_AlCu_20140918T204831/perfect_stat/CONTCAR'},
  {'data_type': 'ASCII text',
   'filename': 'KPOINTS',
   'globus': 'globus://e38ee745-6d04-11e5-ba46-22000b92c6ec/MDF/mdf_connect/prod/data/ab_initio_solute_database_v1-2/data/FCC_solute_AlCu_20140918T204831/perfect_stat/KPOINTS',
   'length': 42,
   'mime_type': 'text/plain',
   'sha512': '56f819a7cff23127409c48d69cef684f578600b02b9727fc3ab46aa297bb201b890dabcfc4b232088fb4d0cb938514283986f8eae68a85bdf303960d2f9058dd',
   'url': 'https://e38ee745-6d04-11e5-ba46-22000b92c6ec.e.globus.org/MDF/mdf_connect/prod/data/ab_initio_solute_database_v1-2/data/FCC_solute_AlCu_20140918T204831/perfect_stat/KPOINTS'},
  {'data_type': 'ASCII text',
   'filename': 'POSCAR',
   'globus': 'globus://e38ee745-6d04-11e5-ba46-22000b92c6ec/MDF/mdf_connect/prod/data/ab_initio_solute_database_v1-2/data/FCC_solute_AlCu_20140918T204831/perfect_stat/POSCAR',
   'length': 3348,
   'mime_type': 'text/plain',
   'sha512': '613498249ad2d01dc3cb4aa37a41bd63ba2c95a599ab0b0cb43f4e30c3f5381a8c2a1404d3c20d24da434b0785189bb603859a99f5abbcc242cc87cc9c3e0ca8',
   'url': 'https://e38ee745-6d04-11e5-ba46-22000b92c6ec.e.globus.org/MDF/mdf_connect/prod/data/ab_initio_solute_database_v1-2/data/FCC_solute_AlCu_20140918T204831/perfect_stat/POSCAR'}],
 'material': {'composition': 'Al108', 'elements': ['Al']},
 'mdf': {'ingest_date': '2018-11-24T08:12:11.852893Z',
  'resource_type': 'record',
  'scroll_id': 28093,
  'source_id': 'ab_initio_solute_database_v1.2',
  'source_name': 'ab_initio_solute_database',
  'version': 1}}
[8]:
info
[8]:
{'advanced': True,
 'errors': [],
 'index_uuid': '1a57bbe5-5272-477f-9d31-343b8258b7a5',
 'limit': 10,
 'query': '( NOT mdf.source_name:sluschi AND material.elements:Al AND  NOT mdf.source_name:oqmd)',
 'retries': 0,
 'total_query_matches': 14886}
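
The info dictionary can also be inspected in code, for example to see how many matches were left out by the limit. The following is a minimal sketch that works on a copy of the dictionary shown above rather than on a live search. The helper name summarize_info is hypothetical and not part of Forge, and it assumes the search returned as many results as the limit allowed.

```python
# A copy of the info dictionary shown above, so the example runs offline.
info = {
    "advanced": True,
    "errors": [],
    "index_uuid": "1a57bbe5-5272-477f-9d31-343b8258b7a5",
    "limit": 10,
    "query": "( NOT mdf.source_name:sluschi AND material.elements:Al AND  NOT mdf.source_name:oqmd)",
    "retries": 0,
    "total_query_matches": 14886,
}


def summarize_info(info):
    """Hypothetical helper: report how much of the match set was returned.

    Assumes the search returned min(limit, total_query_matches) results.
    """
    returned = min(info["limit"], info["total_query_matches"])
    return {
        "query": info["query"],
        "returned": returned,
        "remaining": info["total_query_matches"] - returned,
        "had_errors": bool(info["errors"]),
    }


summary = summarize_info(info)
print(summary["returned"], "of", info["total_query_matches"], "matches returned")
```

A large "remaining" value is a hint that the limit is hiding most of the matches.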

Repeat a query

By default, search() clears the current query once it has run. Pass reset_query=False to keep the query in memory, so that the next search() call reuses it.

[9]:
mdf.match_field("mdf.source_name", "nist_xps_db")
[9]:
<mdf_forge.forge.Forge at 0x7fe389fc6208>
[10]:
res, info = mdf.search(limit=10, info=True, reset_query=False)
info["query"]
[10]:
'(mdf.source_name:nist_xps_db)'
[11]:
res, info = mdf.search(limit=10, info=True)
info["query"]
[11]:
'(mdf.source_name:nist_xps_db)'

show_fields

How do you know what fields there are to search on? Use show_fields() to find out. If you just call show_fields() by itself, it will show you all fields currently in the MDF Search index.

[12]:
mdf.show_fields()
[12]:
{'calphad.phases': 'text',
 'cip.bv': 'text',
 'cip.energy': 'text',
 'cip.forcefield': 'text',
 'cip.gv': 'text',
 'cip.mpid': 'text',
 'cip.totenergy': 'text',
 'crystal_structure.cross_reference.icsd': 'long',
 'crystal_structure.number_of_atoms': 'long',
 'crystal_structure.space_group_number': 'long',
 'crystal_structure.stoichiometry': 'text',
 'crystal_structure.volume': 'float',
 'custom.all_materials_included': 'text',
 'custom.atom_fractions': 'text',
 'custom.experiment_holding_temperature': 'text',
 'custom.experiment_nominal_alloy_composition': 'text',
 'custom.experiment_pixel_size': 'text',
 'custom.experiment_time_between_scans': 'text',
 'custom.experiment_total_duration': 'text',
 'custom.experiment_xray_energy': 'text',
 'custom.funding_details': 'text',
 'custom.plate_id': 'text',
 'custom.processing_reconstruction_method': 'text',
 'custom.processing_segmentation_method': 'text',
 'custom.reduction_method': 'text',
 'custom.sample_id': 'text',
 'data.endpoint_path': 'text',
 'data.link': 'text',
 'dc.alternateIdentifiers.alternateIdentifier': 'text',
 'dc.alternateIdentifiers.alternateIdentifierType': 'text',
 'dc.contributors.affiliations': 'text',
 'dc.contributors.contributorName': 'text',
 'dc.contributors.contributorType': 'text',
 'dc.contributors.familyName': 'text',
 'dc.contributors.givenName': 'text',
 'dc.creators.affiliations': 'text',
 'dc.creators.creatorName': 'text',
 'dc.creators.familyName': 'text',
 'dc.creators.givenName': 'text',
 'dc.dates.date': 'date',
 'dc.dates.dateType': 'text',
 'dc.descriptions.description': 'text',
 'dc.descriptions.descriptionType': 'text',
 'dc.geoLocations.geoLocationPlace': 'text',
 'dc.identifier.identifier': 'text',
 'dc.identifier.identifierType': 'text',
 'dc.publicationYear': 'text',
 'dc.publisher': 'text',
 'dc.relatedIdentifiers.relatedIdentifier': 'text',
 'dc.relatedIdentifiers.relatedIdentifierType': 'text',
 'dc.relatedIdentifiers.relationType': 'text',
 'dc.resourceType.resourceType': 'text',
 'dc.resourceType.resourceTypeGeneral': 'text',
 'dc.rightsList.rights': 'text',
 'dc.rightsList.rightsURI': 'text',
 'dc.subjects.subject': 'text',
 'dc.titles.title': 'text',
 'dft.converged': 'text',
 'dft.cutoff_energy': 'float',
 'dft.exchange_correlation_functional': 'text',
 'electron_microscopy.beam_energy': 'float',
 'files.data_type': 'text',
 'files.filename': 'text',
 'files.globus': 'text',
 'files.length': 'long',
 'files.mime_type': 'text',
 'files.sha512': 'text',
 'files.url': 'text',
 'image.format': 'text',
 'image.height': 'long',
 'image.megapixels': 'float',
 'image.width': 'long',
 'jarvis.__custom.band_gap_desc': 'text',
 'jarvis.__custom.crossreference_desc': 'text',
 'jarvis.__custom.dimensionality_desc': 'text',
 'jarvis.__custom.elastic_moduli_desc': 'text',
 'jarvis.__custom.formation_enthalpy_desc': 'text',
 'jarvis.__custom.id_desc': 'text',
 'jarvis.__custom.landing_page_desc': 'text',
 'jarvis.__custom.total_energy_desc': 'text',
 'jarvis.bandgap.mbj': 'float',
 'jarvis.bandgap.optb88vdw': 'float',
 'jarvis.crossreference.materials_project': 'text',
 'jarvis.dimensionality': 'text',
 'jarvis.elastic_moduli.bulk': 'float',
 'jarvis.elastic_moduli.shear': 'float',
 'jarvis.formation_enthalpy': 'float',
 'jarvis.id': 'text',
 'jarvis.landing_page': 'text',
 'jarvis.total_energy': 'float',
 'material.composition': 'text',
 'material.elements': 'text',
 'mdf.ingest_date': 'date',
 'mdf.mdf_id': 'text',
 'mdf.organizations': 'text',
 'mdf.resource_type': 'text',
 'mdf.scroll_id': 'long',
 'mdf.source_id': 'text',
 'mdf.source_name': 'text',
 'mdf.version': 'long',
 'mrr.characterizationMethod': 'text',
 'mrr.materialType': 'text',
 'mrr.structuralFeature': 'text',
 'nist_xps_db.binding_energy_ev': 'text',
 'nist_xps_db.energy_uncertainty_ev': 'text',
 'nist_xps_db.notes': 'text',
 'nist_xps_db.temperature_k': 'text',
 'oqmd.__custom.band_gap_desc': 'text',
 'oqmd.__custom.configuration_desc': 'text',
 'oqmd.__custom.delta_e_desc': 'text',
 'oqmd.__custom.magnetic_moment_desc': 'text',
 'oqmd.__custom.stability_desc': 'text',
 'oqmd.__custom.total_energy_desc': 'text',
 'oqmd.__custom.volume_pa_desc': 'text',
 'oqmd.band_gap.units': 'text',
 'oqmd.band_gap.value': 'float',
 'oqmd.configuration': 'text',
 'oqmd.delta_e.units': 'text',
 'oqmd.delta_e.value': 'float',
 'oqmd.magnetic_moment.units': 'text',
 'oqmd.magnetic_moment.value': 'float',
 'oqmd.stability.units': 'text',
 'oqmd.stability.value': 'float',
 'oqmd.total_energy.units': 'text',
 'oqmd.total_energy.value': 'float',
 'oqmd.volume_pa.units': 'text',
 'oqmd.volume_pa.value': 'float',
 'origin.creator': 'text',
 'origin.name': 'text',
 'origin.type': 'text',
 'services.citrine': 'text',
 'services.mdf_publish': 'text',
 'services.mdf_search': 'text',
 'services.mrr': 'text'}

If you give show_fields() a top-level block, it will show you the mapping for that block, including the expected datatypes.

[13]:
mdf.show_fields("mdf")
[13]:
{'mdf.ingest_date': 'date',
 'mdf.mdf_id': 'text',
 'mdf.organizations': 'text',
 'mdf.resource_type': 'text',
 'mdf.scroll_id': 'long',
 'mdf.source_id': 'text',
 'mdf.source_name': 'text',
 'mdf.version': 'long'}
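
Because show_fields() returns a plain dictionary of field names and datatypes, it can be filtered like any other dictionary. The sketch below picks out the numeric fields from a copy of the "mdf" mapping shown above. The helper fields_of_type is hypothetical and not part of Forge.

```python
# A copy of the mapping returned by mdf.show_fields("mdf") above.
fields = {
    "mdf.ingest_date": "date",
    "mdf.mdf_id": "text",
    "mdf.organizations": "text",
    "mdf.resource_type": "text",
    "mdf.scroll_id": "long",
    "mdf.source_id": "text",
    "mdf.source_name": "text",
    "mdf.version": "long",
}


def fields_of_type(fields, *types):
    """Hypothetical helper: sorted names of fields whose datatype is in `types`."""
    return sorted(name for name, dtype in fields.items() if dtype in types)


numeric_fields = fields_of_type(fields, "long", "float")
print(numeric_fields)
```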

describe_field

To learn more about specific fields, use describe_field(). This method can tell you what a field means, what unit of measurement it uses, or other useful information. When you call describe_field(), you must pass in the resource_type you’re interested in (such as dataset or record). Since the full schema for a resource_type is very long, you can also pass in a field you’re interested in, in the standard dot notation (if you don’t, you will get the full schema for the resource_type instead).

[14]:
mdf.describe_field("dataset", field="mdf")
- acl (array of string): The IDs of users or groups allowed to view this entry (or ["public"]). Note that this field does not appear in Search results for security reasons.
  Must have at least 1 item(s)

- ingest_date (string): The RFC 3339 date of ingest.

- mdf_id (string): The BSON ID of the entry, which is not static between dataset versions.

- organizations (array of string): The organizations associated with the dataset.

- parent_id (string): The BSON ID of the entry's parent.

- resource_type (string): The type of entry.

- scroll_id (integer): A number to enable aggregating (via simulated scrolling) in Forge.

- source_id (string): A unique (globally) identifier for the dataset.

- source_name (string): A unique (to this dataset) program-friendly name for the dataset.

- version (integer): The version number for the dataset.

Required: ['source_name', 'source_id', 'mdf_id', 'acl', 'ingest_date', 'resource_type']

If you want your results in a dictionary instead of being printed out, you can set raw=True.

[15]:
mdf.describe_field("record", field="mdf.source_name", raw=True)
[15]:
{'error': None,
 'schema': {'description': 'A unique (to this dataset) program-friendly name for the dataset.',
  'type': 'string'},
 'status_code': 200,
 'success': True}
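
The raw dictionary is convenient when you want to use a description in your own code. The following sketch reads a copy of the dictionary shown above. The helper field_description is hypothetical, and the error handling is an assumption based only on the 'success' and 'error' keys visible in this output.

```python
# A copy of the raw result shown above, so the example runs offline.
raw = {
    "error": None,
    "schema": {
        "description": "A unique (to this dataset) program-friendly name for the dataset.",
        "type": "string",
    },
    "status_code": 200,
    "success": True,
}


def field_description(raw):
    """Hypothetical helper: return (type, description), or raise if the call failed."""
    if not raw["success"]:
        raise ValueError("describe_field failed: {}".format(raw["error"]))
    schema = raw["schema"]
    return schema.get("type"), schema.get("description")


field_type, description = field_description(raw)
print(field_type + ":", description)
```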

describe_organization

To learn more about an organization registered with MDF, use describe_organization(). This method can tell you more about an organization, including the provided description, homepage, and submission rules. When you call describe_organization(), you just pass in the name or alias of an organization (capitalization doesn’t matter).

[16]:
mdf.describe_organization("argonne national laboratory")

 Argonne National Laboratory
        aliases: ANL
        canonical_name: Argonne National Laboratory
        description: Argonne serves America as a science and energy laboratory distinguished by the breadth of our R&D capabilities in concert with our powerful suite of experimental and computational facilities.
        homepage: https://www.anl.gov/
        parent_organizations: None
        permission_groups: public
[17]:
mdf.describe_organization("CHiMaD")

 Center for Hierarchical Materials Design
        aliases: CHiMaD
        canonical_name: Center for Hierarchical Materials Design
        description: Center for Hierarchical Materials Design (CHiMaD) is a NIST-sponsored center of excellence for advanced materials research focusing on developing the next generation of computational tools, databases and experimental techniques in order to enable the accelerated design of novel materials and their integration to industry, one of the primary goals of the U.S. Government's Materials Genome Initiative (MGI).
        homepage: http://chimad.northwestern.edu/
        parent_organizations: National Institute of Standards and Technology
        permission_groups: public

You can also get a brief overview of an organization without the technical details by setting summary=True. describe_organization() also supports the raw argument to get results back as a dictionary (raw overrides summary).

[18]:
mdf.describe_organization("NIST", summary=True)

 National Institute of Standards and Technology
        aliases: NIST
        description: The National Institute of Standards and Technology (NIST) was founded in 1901 and is now part of the U.S. Department of Commerce. NIST is one of the nation's oldest physical science laboratories.
        homepage: https://www.nist.gov/
[19]:
mdf.describe_organization("NIST MDR", raw=True)
[19]:
{'error': None,
 'organization': {'aliases': ['NIST MDR', 'MDR'],
  'canonical_name': 'NIST Materials Data Repository',
  'description': 'The National Institute of Standards and Technology has created a materials science data repository as part of an effort in coordination with the Materials Genome Initiative (MGI) to establish data exchange protocols and mechanisms that will foster data sharing and reuse across a wide community of researchers, with the goal of enhancing the quality of materials data and models.',
  'homepage': 'https://materialsdata.nist.gov/',
  'parent_organizations': ['National Institute of Standards and Technology'],
  'permission_groups': ['public']},
 'status_code': 200,
 'success': True}
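
With the raw dictionary you can, for instance, check whether a name you have refers to the same organization. The sketch below does a case-insensitive comparison against the canonical name and aliases copied from the output above. It only illustrates the idea that capitalization doesn’t matter; it is not how Forge performs the lookup, and matches_organization is a hypothetical helper.

```python
# The relevant part of the 'organization' entry shown above.
organization = {
    "aliases": ["NIST MDR", "MDR"],
    "canonical_name": "NIST Materials Data Repository",
}


def matches_organization(name, organization):
    """Hypothetical helper: case-insensitive match on canonical name or alias."""
    candidates = [organization["canonical_name"]] + organization["aliases"]
    return name.casefold() in {candidate.casefold() for candidate in candidates}


print(matches_organization("nist mdr", organization))
print(matches_organization("ANL", organization))
```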

Fetching Datasets

fetch_datasets_from_results

This method automatically collects the dataset entries for all records returned from a search. In other words, if you search for material.elements:Al and a record from OQMD is returned, you can pass that record to fetch_datasets_from_results() and get the OQMD dataset entry back.

[20]:
records = mdf.search("dft.converged:true AND mdf.resource_type:record")
[21]:
res = mdf.fetch_datasets_from_results(records)
res[0]
[21]:
{'data': {'endpoint_path': 'globus://e38ee745-6d04-11e5-ba46-22000b92c6ec/MDF/mdf_connect/prod/data/mdr_item_775_v1/',
  'link': 'https://app.globus.org/file-manager?origin_id=e38ee745-6d04-11e5-ba46-22000b92c6ec&origin_path=/MDF/mdf_connect/prod/data/mdr_item_775_v1/'},
 'dc': {'alternateIdentifiers': [{'alternateIdentifier': 'http://hdl.handle.net/11115/166',
    'alternateIdentifierType': 'Handle'},
   {'alternateIdentifier': '775',
    'alternateIdentifierType': 'NIST DSpace ID'}],
  'creators': [{'creatorName': 'Valencia and P.N. Quested, J.J.',
    'familyName': 'Valencia and P.N. Quested',
    'givenName': 'J.J.'}],
  'publicationYear': '2013',
  'publisher': 'NIST Materials Data Repository',
  'resourceType': {'resourceType': 'Dataset',
   'resourceTypeGeneral': 'Dataset'},
  'titles': [{'title': 'Thermophysical Properties'}]},
 'mdf': {'ingest_date': '2018-11-15T19:09:44.202046Z',
  'organizations': ['National Institute of Standards and Technology',
   'U.S. Department of Commerce',
   'DOC',
   'MDR',
   'NIST',
   'NIST Materials Data Repository',
   'NIST MDR'],
  'resource_type': 'dataset',
  'scroll_id': 0,
  'source_id': 'mdr_item_775_v1.1',
  'source_name': 'mdr_item_775',
  'version': 1},
 'services': {'mdf_search': 'This dataset was ingested to MDF Search.',
  'mrr': 'This dataset was registered with the MRR.'}}

If you don’t need the records themselves, you can call fetch_datasets_from_results() without passing any results. It will then execute the current query and use those results instead.

[22]:
res = mdf.match_field("material.elements", "Al").fetch_datasets_from_results()
res[0]
[22]:
{'data': {'endpoint_path': 'globus://e38ee745-6d04-11e5-ba46-22000b92c6ec/MDF/mdf_connect/prod/data/schleife_al_channel_v1-1/',
  'link': 'https://app.globus.org/file-manager?origin_id=e38ee745-6d04-11e5-ba46-22000b92c6ec&origin_path=/MDF/mdf_connect/prod/data/schleife_al_channel_v1-1/'},
 'dc': {'contributors': [{'affiliations': ['University of Illinois Urbana-Champaign'],
    'contributorName': 'Schleife, Andre',
    'contributorType': 'ContactPerson',
    'familyName': 'Schleife',
    'givenName': 'Andre'}],
  'creators': [{'affiliations': ['University of Illinois Urbana-Champaign'],
    'creatorName': 'Schleife, Andre',
    'familyName': 'Schleife',
    'givenName': 'Andre'}],
  'dates': [{'date': '2017-10-10T15:45:40.065761Z', 'dateType': 'Collected'}],
  'publicationYear': '2015',
  'publisher': 'MDF (placeholder)',
  'resourceType': {'resourceType': 'JSON', 'resourceTypeGeneral': 'Dataset'},
  'subjects': [{'subject': 'data_link'}],
  'titles': [{'title': 'Schleife Al 256 Channel'}]},
 'mdf': {'ingest_date': '2018-11-30T21:04:03.431302Z',
  'resource_type': 'dataset',
  'scroll_id': 0,
  'source_id': 'schleife_al_channel_v1.1',
  'source_name': 'schleife_al_channel',
  'version': 1},
 'services': {'mdf_search': 'This dataset was ingested to MDF Search.'}}
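
In the outputs above, records and dataset entries both carry an mdf.source_name, which suggests that this field is what ties a record to its dataset. Under that assumption, the first step of such a lookup can be sketched as collecting the distinct source names from a list of records. This is an illustration on hand-trimmed sample records, not Forge’s implementation, and distinct_source_names is a hypothetical helper.

```python
# Hand-trimmed sample records; real records contain many more fields.
records = [
    {"mdf": {"source_name": "ab_initio_solute_database", "resource_type": "record"}},
    {"mdf": {"source_name": "oqmd", "resource_type": "record"}},
    {"mdf": {"source_name": "ab_initio_solute_database", "resource_type": "record"}},
]


def distinct_source_names(entries):
    """Hypothetical helper: unique mdf.source_name values, in first-seen order."""
    seen = []
    for entry in entries:
        name = entry["mdf"]["source_name"]
        if name not in seen:
            seen.append(name)
    return seen


names = distinct_source_names(records)
print(names)
```

Each distinct name would then correspond to one dataset entry to fetch.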

Aggregations

aggregate

Queries submitted with search() return at most 10,000 results. If this limit is too low, you can use aggregate() to retrieve all results from a query, no matter how many. Be careful with this function, as it is easy to retrieve a very large number of results by accident. Consider running search(your_query, limit=0, info=True) first to find out how many results you will get (see Query info above for more information).

For this example, we will see how many results the query will retrieve before aggregating.

[23]:
mdf.match_field("mdf.source_name", "oqmd*").match_field("material.elements", "Pb").exclude_field("material.elements", "Al")
res, info = mdf.search(limit=0, info=True, reset_query=False)
print("Number of results:", info["total_query_matches"])
Number of results: 15057

Assuming we want all of these results, we can use aggregate() on the same query.

[24]:
res = mdf.aggregate()
print("Number of results:", len(res))
Number of results: 15057
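
The check-then-aggregate pattern above can be reduced to a small decision: aggregate() is only needed when the match count exceeds what search() can return. A minimal sketch follows, using the 10,000-result limit stated above. The helper needs_aggregate is hypothetical and not part of Forge.

```python
SEARCH_LIMIT = 10000  # the search() result cap stated above


def needs_aggregate(total_query_matches, search_limit=SEARCH_LIMIT):
    """Hypothetical helper: True when search() alone cannot return every match."""
    return total_query_matches > search_limit


# 15057 is the match count from the example query above.
print(needs_aggregate(15057))
print(needs_aggregate(14))
```

In practice the argument would come from info["total_query_matches"] after a search with limit=0 and info=True.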