MDF Forge Client

class mdf_forge.Forge(index='mdf', local_ep=None, anonymous=False, clear_old_tokens=False, **kwargs)

Forge fetches metadata and files from the Materials Data Facility (MDF). Forge is intended to be the best way to access MDF data for all users. Queries are built and run through an internal Query object, so from the user's perspective a Forge instance treats searching as a black box.

__init__(index='mdf', local_ep=None, anonymous=False, clear_old_tokens=False, **kwargs)

Create an MDF Forge Client.

Parameters:

index (str) – The Search index to search on. Default: `"mdf"`.
local_ep (str) – The endpoint ID of the local Globus Connect Personal endpoint. If needed but not provided, the local endpoint will be autodetected if possible.
anonymous (bool) – If `True`, will not authenticate with Globus Auth. If `False`, will require authentication. Default: `False`.

Caution

Authentication is required for some Forge functionality, including viewing private datasets and using Globus Transfer.

clear_old_tokens (bool) – If `True`, will force reauthentication. If `False`, will use existing tokens if possible. Has no effect if `anonymous` is `True`. Default: `False`.

Keyword Arguments:

services (list of str) – Advanced users only. The services to authenticate with, using Toolbox. An empty list will disable authenticating with Toolbox. Note that even overwriting clients (with other keyword arguments) does not stop Toolbox authentication. Only a blank `services` argument will disable Toolbox authentication.
search_client (globus_sdk.SearchClient) – An authenticated SearchClient to overwrite the default.
transfer_client (globus_sdk.TransferClient) – An authenticated TransferClient to override the default.
data_mdf_authorizer (globus_sdk.GlobusAuthorizer) – An authenticated GlobusAuthorizer to overwrite the default for accessing the MDF NCSA endpoint.
petrel_authorizer (globus_sdk.GlobusAuthorizer) – An authenticated GlobusAuthorizer to override the default.
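The constructor options above could be wired together roughly as follows. This is a sketch, not the library's recommended setup: `forge_options` and `connect` are hypothetical helper names, and the import path `from mdf_forge import Forge` is assumed from the class path shown at the top of this page. The import is deferred so the option handling can be read and exercised without the package installed.

```python
def forge_options(index="mdf", anonymous=True, clear_old_tokens=False):
    """Collect the documented constructor arguments (hypothetical helper)."""
    options = {"index": index, "anonymous": anonymous}
    # clear_old_tokens has no effect for anonymous clients, so it is only
    # passed along when authentication will actually take place.
    if not anonymous:
        options["clear_old_tokens"] = clear_old_tokens
    return options


def connect(**overrides):
    """Create a Forge client; needs mdf_forge installed and network access."""
    from mdf_forge import Forge  # assumed import path, deferred on purpose
    return Forge(**forge_options(**overrides))
```

With these helpers, `connect()` would give an anonymous client, which cannot view private datasets or use Globus Transfer, while `connect(anonymous=False)` would require authentication with Globus Auth.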

aggregate_sources(source_names, index=None)

Aggregate all records with the given source_name values. There is no limit to the number of results returned. Please beware of aggregating very large datasets.

Caution

It is recommended that you check how many entries will be returned from your chosen datasets by running match_source_names(source_names).search(limit=0, info=True) before using aggregate_sources().

Note

This method will use terms from the current query, and resets the current query.

Parameters:

source_names (str or list of str) – The `source_name` values to aggregate.
index (str) – The Search index to search on. Default: The current index.

Returns:	All of the entries from the `source_name` matches.
Return type:	list of dict
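The count check recommended in the caution above can be folded into a small guard. The sketch below assumes that the information dictionary returned by `search(limit=0, info=True)` reports the match count under a `total_query_matches` key; that key name is not stated on this page, so verify it against your installed version. `aggregate_if_small` and `max_entries` are hypothetical names.

```python
def aggregate_if_small(forge, source_names, max_entries=10000):
    """Aggregate the given sources only if the match count is small enough."""
    # limit=0 with info=True requests the query information without results.
    _, info = forge.match_source_names(source_names).search(limit=0, info=True)
    # Assumption: the info dict exposes the count under this key.
    total = info["total_query_matches"]
    if total > max_entries:
        raise ValueError(
            "{} entries would be returned (limit {})".format(total, max_entries)
        )
    return forge.aggregate_sources(source_names)
```

Raising instead of truncating keeps the caller in control: the threshold can be raised deliberately once the size of the dataset is known.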

describe_field(resource_type, field=None, raw=False)

Fetch and display the description of a field in MDF, along with any subfields.

Parameters:

resource_type (str) – The type of MDF entry to describe a field from. This value can be `"dataset"` or `"record"`.
field (str) – The field to describe, in dot notation. The field must be a part of the provided `resource_type`. To see all fields in the given `resource_type`, use the value `None`. Default: `None`.
raw (bool) – When `False`, will format and print the schema. When `True`, will return the raw JSON dictionary instead. For human consumption, `False` is recommended. Default: `False`.
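For programmatic use, the `raw=True` mode described above is the one that returns a value. A minimal sketch, assuming `forge` is an already constructed client; `field_schema` and `VALID_RESOURCE_TYPES` are hypothetical names introduced here, and the up-front check simply mirrors the two documented `resource_type` values.

```python
VALID_RESOURCE_TYPES = ("dataset", "record")


def field_schema(forge, resource_type, field=None):
    """Return the raw schema dictionary for a field instead of printing it."""
    if resource_type not in VALID_RESOURCE_TYPES:
        raise ValueError(
            "resource_type must be one of {}".format(VALID_RESOURCE_TYPES)
        )
    # raw=True returns the JSON dictionary rather than a formatted printout.
    return forge.describe_field(resource_type, field=field, raw=True)
```

Calling `field_schema(forge, "dataset")` with `field` left as `None` would return the schema for every field of that resource type.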

describe_organization(organization, summary=False, raw=False)

Fetch and display the description of an organization registered with MDF.

Parameters:

organization (str) – The organization to describe. This value can also be `"list"` to list all organizations’ names, or `"all"` to fetch the metadata for every organization (not recommended).
summary (bool) – When `True`, will summarize the organization metadata. The summary just contains the non-technical information about the organization itself. When `False`, will print all of the metadata. This parameter has no effect if `raw=True`. Default: `False`.
raw (bool) – When `False`, will format and print the organization metadata. When `True`, will return the raw JSON dictionary instead. For human consumption, `False` is recommended. Default: `False`.

fetch_datasets_from_results(entries=None, query=None, reset_query=True)

Retrieve the dataset entries for given records. Note that this method may use the current query.

Note

This method will use terms from the current query, and resets the current query.

Parameters:

entries (dict, list of dict, or tuple of dict) – The records to parse to find the datasets. This argument can be a single entry, a list of entries, or a tuple with a list of entries. The latter two options support both return values of the `search()` method. If entries is `None`, the current query is executed and those results are used instead. Default: `None`.
query (str) – If not `None`, search for entries using this query instead of the current query. Has no effect if `entries` is not `None`. Default: `None`.
reset_query (bool) – Has no effect unless `entries` and `query` are both `None`. If `True`, will reset the current query after searching for entries. If `False`, will not reset the current query. Default: `True`.

Returns:	The dataset entries.
Return type:	list

get_dataset_version(source_name)

Get the version of a certain dataset.

Parameters:	source_name (string) – The `source_name` of the dataset.
Returns:	Version of the dataset in question.
Return type:	int

globus_download(results, dest='.', dest_ep=None, preserve_dir=False, inactivity_time=None, download_datasets=False, verbose=True)

Download data files from the provided results using Globus Transfer. This method requires Globus Connect to be installed on the destination endpoint.

Parameters:

results (dict) – The records from which files should be fetched. This should be the return value of a search method.
dest (str) – The destination path for the data files on the local machine. Default: The current directory.
dest_ep (str) – The destination endpoint ID. Default: The autodetected local GCP.
preserve_dir (bool) – If `True`, the directory structure for the data files will be recreated at the destination. The path to the new files will be relative to the `dest` path. If `False`, only the data files themselves will be saved. Default: `False`.
inactivity_time (int) – Number of seconds the Transfer is allowed to go without progress before being cancelled. Default: `self.__inactivity_time`.
download_datasets (bool) – If `True`, will download the full dataset for any dataset entries given. If `False`, will skip dataset entries with a notification. Default: `False`.

Caution

Datasets can be large. Additionally, if you do not filter out records from a dataset you provide, you may end up with duplicate files. Use with care.

verbose (bool) – If `True`, status and progress messages will be printed, and errors will prompt for continuation confirmation. If `False`, only error messages will be printed, and the Transfer will always continue. Default: `True`.

Returns:	The task IDs of the Globus transfers.
Return type:	list of str

http_download(results, dest='.', preserve_dir=False, verbose=True)

Download data files from the provided results using HTTPS. For a large number of files, you should use globus_download() instead, which uses Globus Transfer.

Parameters:

results (dict) – The records from which files should be fetched. This should be the return value of a search method.
dest (str) – The destination path for the data files on the local machine. Default: The current directory.
preserve_dir (bool) – If `True`, the directory structure for the data files will be recreated at the destination. If `False`, only the data files themselves will be saved. Default: `False`.
verbose (bool) – If `True`, status and progress messages will be printed. If `False`, only error messages will be printed. Default: `True`.

Returns:

The status information for the download:

success (bool): `True` if the download succeeded. `False` if it failed.
message (str): The error message, if the download failed.

Return type:

dict
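The advice to prefer `globus_download()` for a large number of files can be expressed as a simple dispatch. This is a sketch under stated assumptions: `results` is taken to be a plain list of entries (as returned by a search without `info=True`), and the `http_max_entries` cutoff of 50 is an arbitrary illustrative value, not a documented limit. `download_files` is a hypothetical helper name.

```python
def download_files(forge, results, dest=".", http_max_entries=50):
    """Use HTTPS for small result sets and Globus Transfer for larger ones."""
    if len(results) <= http_max_entries:
        status = forge.http_download(results, dest=dest, verbose=False)
        # The status dict reports success and, on failure, an error message.
        if not status["success"]:
            raise RuntimeError(status["message"])
        return []
    # globus_download returns the task IDs of the submitted transfers.
    return forge.globus_download(results, dest=dest, verbose=False)
```

The helper returns an empty list on the HTTPS path so that callers always receive a list of Globus task IDs, whichever transport was used.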

http_stream(results, verbose=True)

Yield data files from the provided results using HTTPS, through a generator. For a large number of files, you should use globus_download() instead, which uses Globus Transfer.

Parameters:

results (dict) – The records from which files should be fetched. This should be the return value of a search method.
verbose (bool) – If `True`, status and progress messages will be printed. If `False`, only error messages will be printed. Default: `True`.

Yields:	str – Text of each data file.
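Because `http_stream()` is a generator, files can be consumed lazily. The sketch below previews only the first few files using `itertools.islice` from the standard library; `preview_files`, `count`, and `chars` are hypothetical names, and `forge` is assumed to be an existing client.

```python
from itertools import islice


def preview_files(forge, results, count=3, chars=200):
    """Return the first `chars` characters of up to `count` data files."""
    stream = forge.http_stream(results, verbose=False)
    # islice stops pulling from the generator after `count` items, so the
    # remaining files are never requested.
    return [text[:chars] for text in islice(stream, count)]
```

This is useful for a quick look at file contents before committing to a full download.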

match_dois(dois)

Match the given Digital Object Identifiers.

Parameters:	dois (str or list of str) – DOIs to match and return.
Returns:	self
Return type:	Forge

match_elements(elements, match_all=True)

Add elemental abbreviations to the query.

Parameters:

elements (str or list of str) – The elements to match. For example, “Fe” for iron.
match_all (bool) – If `True`, will add with `AND`. If `False`, will use `OR`. Default: `True`.

Returns:	Self
Return type:	Forge

match_organizations(organizations, match_all=True)

Match the given Organizations. Organizations are MDF-registered groups that can apply rules to datasets.

Parameters:

organizations (str or list of str) – The organizations to match.
match_all (bool) – If `True`, will add with `AND`. If `False`, will use `OR`. Default: `True`.

Returns:	Self
Return type:	Forge

match_records(source_name, scroll_ids)

Match specific records from a given dataset. Multiple records may be matched, but only one dataset per call.

Parameters:

source_name (str) – The `source_name` of the records’ dataset. The `source_id` is also accepted for convenience.
scroll_ids (int or list of int) – The `scroll_id` values of the records to match.

Returns:	Self
Return type:	Forge

match_resource_types(types)

Match the given resource types.

Parameters:	types (str or list of str) – The `resource_type` values to match.
Returns:	Self
Return type:	Forge

match_source_names(source_names)

Add sources to match to the query.

Parameters:	source_names (str or list of str) – The `source_name` values to match. `source_id` values are also accepted, but are matched without the additional version information they have.
Returns:	Self
Return type:	Forge

match_titles(titles)

Add titles to the query.

Parameters:	titles (str or list of str) – The titles to match.
Returns:	Self
Return type:	Forge

match_years(years=None, start=None, stop=None, inclusive=True)

Add years and limits to the query.

Parameters:

years (int or string, or list of int or strings) – The years to match. Note that this argument overrides the `start`, `stop`, and `inclusive` arguments.
start (int or string) – The lower range of years to match.
stop (int or string) – The upper range of years to match.
inclusive (bool) – If `True`, the start and stop values will be included in the search. If `False`, they will be excluded. Default: `True`.

Returns:	Self
Return type:	Forge
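Since each `match_*` method returns the client itself, terms can be chained before a search is executed. A minimal sketch, assuming `forge` is an existing client; `recent_compounds_query` is a hypothetical helper name and the element and year values in the usage note are illustrative only.

```python
def recent_compounds_query(forge, elements, start, stop):
    """Add element and year-range terms to the current query."""
    # Each match_* call returns the client, which is what allows chaining.
    return forge.match_elements(elements, match_all=True).match_years(
        start=start, stop=stop, inclusive=True
    )
```

For example, `recent_compounds_query(forge, ["Fe", "O"], 2015, 2018).search()` would build the query and then run it. Note that passing `years` to `match_years()` would override `start`, `stop`, and `inclusive`, so the helper leaves it unset.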

search_by_dois(dois, index=None, limit=None, info=False)

Execute a search for the given Digital Object Identifiers. `search_by_dois([x])` is equivalent to `match_dois([x]).search()`.

Note

This method will use terms from the current query, and resets the current query.

Parameters:

dois (list of str) – The DOIs to find.
index (str) – The Search index to search on. Default: The current index.
limit (int) – The maximum number of results to return. The max for this argument is the `SEARCH_LIMIT` imposed by Globus Search. Default: `SEARCH_LIMIT`.
info (bool) – If `False`, search will return a list of the results. If `True`, search will return a tuple containing the results list and other information about the query. Default: `False`.

Returns:	If `info` is `False`, the search results. If `info` is `True`, a tuple of the search results and a dictionary of query information.
Return type:	list if `info` is `False`; tuple if `info` is `True`

search_by_elements(elements, source_names=[], index=None, limit=None, match_all=True, info=False)

Execute a search for the given elements in the given sources. search_by_elements([x], [y]) is equivalent to match_elements([x]).match_source_names([y]).search().

Note

This method will use terms from the current query, and resets the current query.

Parameters:

elements (list of str) – The elements to match. For example, “Fe” for iron.
source_names (list of str) – The `source_name` values to match. Default: `[]`.
index (str) – The Search index to search on. Default: The current index.
limit (int) – The maximum number of results to return. The max for this argument is the `SEARCH_LIMIT` imposed by Globus Search. Default: `SEARCH_LIMIT`.
match_all (bool) – If `True`, will add elements with `AND`. If `False`, will use `OR`. Default: `True`.
info (bool) – If `False`, search will return a list of the results. If `True`, search will return a tuple containing the results list and other information about the query. Default: `False`.

Returns:	If `info` is `False`, the search results. If `info` is `True`, a tuple of the search results and a dictionary of query information.
Return type:	list if `info` is `False`; tuple if `info` is `True`

search_by_titles(titles, index=None, limit=None, info=False)

Execute a search for the given titles. `search_by_titles([x])` is equivalent to `match_titles([x]).search()`.

Note

This method will use terms from the current query, and resets the current query.

Parameters:

titles (list of str) – The titles to match.
index (str) – The Search index to search on. Default: The current index.
limit (int) – The maximum number of results to return. The max for this argument is the `SEARCH_LIMIT` imposed by Globus Search. Default: `SEARCH_LIMIT`.
info (bool) – If `False`, search will return a list of the results. If `True`, search will return a tuple containing the results list and other information about the query. Default: `False`.

Returns:	If `info` is `False`, the search results. If `info` is `True`, a tuple of the search results and a dictionary of query information.
Return type:	list if `info` is `False`; tuple if `info` is `True`
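Because the `search_by_*` methods return either a list or a tuple depending on `info`, downstream code may want a single shape to work with. The helper below is one possible way to normalize the two cases; `unpack_search` is a hypothetical name and is not part of Forge.

```python
def unpack_search(result):
    """Split a search return value into (entries, info)."""
    # With info=True the methods return a tuple of (results list, info dict).
    if isinstance(result, tuple):
        entries, info = result
        return entries, info
    # With info=False only the results list is returned.
    return result, {}
```

For instance, `entries, info = unpack_search(forge.search_by_titles(titles, info=True))` and `entries, _ = unpack_search(forge.search_by_titles(titles))` would both yield the results list as `entries`.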