MDF Forge Client
-
class mdf_forge.Forge(index='mdf', local_ep=None, anonymous=False, clear_old_tokens=False, **kwargs)
Forge fetches metadata and files from the Materials Data Facility. Forge is intended to be the best way to access MDF data for all users. Queries are built and executed through an internal Query object, so from the user's perspective a Forge instance acts as a black box for searching.
-
__init__(index='mdf', local_ep=None, anonymous=False, clear_old_tokens=False, **kwargs)
Create an MDF Forge Client.
Parameters:
- index (str) – The Search index to search on. Default: "mdf".
- local_ep (str) – The endpoint ID of the local Globus Connect Personal endpoint. If needed but not provided, the local endpoint will be autodetected if possible.
- anonymous (bool) – If True, will not authenticate with Globus Auth. If False, will require authentication. Default: False.
  Caution: Authentication is required for some Forge functionality, including viewing private datasets and using Globus Transfer.
- clear_old_tokens (bool) – If True, will force reauthentication. If False, will use existing tokens if possible. Has no effect if anonymous is True. Default: False.
Keyword Arguments:
- services (list of str) – Advanced users only. The services to authenticate with, using Toolbox. An empty list will disable authenticating with Toolbox. Note that overriding clients (with the other keyword arguments) does not stop Toolbox authentication. Only an empty services argument will disable Toolbox authentication.
- search_client (globus_sdk.SearchClient) – An authenticated SearchClient to override the default.
- transfer_client (globus_sdk.TransferClient) – An authenticated TransferClient to override the default.
- data_mdf_authorizer (globus_sdk.GlobusAuthorizer) – An authenticated GlobusAuthorizer to override the default for accessing the MDF NCSA endpoint.
- petrel_authorizer (globus_sdk.GlobusAuthorizer) – An authenticated GlobusAuthorizer to override the default.
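As a minimal sketch of how the constructor parameters interact, the helper below builds the keyword arguments for Forge. The names forge_kwargs and make_forge are hypothetical and not part of mdf_forge; the import is deferred so the snippet loads even where the package is not installed.

```python
def forge_kwargs(anonymous=False, clear_old_tokens=False, index="mdf"):
    """Build keyword arguments for mdf_forge.Forge (hypothetical helper).

    clear_old_tokens is documented as having no effect when anonymous is
    True, so it is dropped in that case to keep the call unambiguous.
    """
    kwargs = {"index": index, "anonymous": anonymous}
    if not anonymous:
        kwargs["clear_old_tokens"] = clear_old_tokens
    return kwargs


def make_forge(**options):
    """Create a Forge client; requires the mdf_forge package."""
    from mdf_forge import Forge  # deferred import: only needed when called
    return Forge(**forge_kwargs(**options))
```

Calling make_forge(anonymous=True) would then create a client that skips Globus Auth, which, per the caution above, cannot view private datasets or use Globus Transfer.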
-
aggregate_sources(source_names, index=None)
Aggregate all records with the given source_name values. There is no limit to the number of results returned. Please beware of aggregating very large datasets.
Caution: It is recommended that you check how many entries will be returned from your chosen datasets by running match_source_names(source_names).search(limit=0, info=True) before using aggregate_sources().
Note: This method will use terms from the current query, and resets the current query.
Parameters:
- source_names (str or list of str) – The source_name values to aggregate.
- index (str) – The Search index to search on. Default: The current index.
Returns: All of the entries from the source_name matches.
Return type: list of dict
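The check recommended in the caution can be sketched as follows. The helper name aggregate_if_small and the max_entries threshold are hypothetical, and the "total_query_matches" key is an assumption about the contents of the query information dictionary; confirm it against the dictionary your installed version returns.

```python
def aggregate_if_small(forge, source_names, max_entries=10000):
    """Aggregate sources only if the match count is at most max_entries.

    `forge` is expected to be an mdf_forge.Forge instance.
    """
    # limit=0 returns no entries; info=True adds the query information.
    _, info = forge.match_source_names(source_names).search(limit=0, info=True)
    total = info["total_query_matches"]  # assumed key name, see lead-in
    if total > max_entries:
        raise ValueError(
            "{} entries exceed the limit of {}".format(total, max_entries)
        )
    return forge.aggregate_sources(source_names)
```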
-
describe_field(resource_type, field=None, raw=False)
Fetch and display the description of a field in MDF, along with any subfields.
Parameters:
- resource_type (str) – The type of MDF entry to describe a field from. This value can be "dataset" or "record".
- field (str) – The field to describe, in dot notation. The field must be a part of the provided resource_type. To see all fields in the given resource_type, use the value None. Default: None.
- raw (bool) – When False, will format and print the schema. When True, will return the raw JSON dictionary instead. For human consumption, False is recommended. Default: False.
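For programmatic use, raw=True is the relevant mode. A small sketch, with the hypothetical helper name fetch_field_schema, that validates resource_type against the two documented values before asking for the raw dictionary:

```python
def fetch_field_schema(forge, resource_type, field=None):
    """Return the raw schema dictionary for a field (hypothetical helper).

    `forge` is expected to be an mdf_forge.Forge instance.
    """
    if resource_type not in ("dataset", "record"):
        raise ValueError('resource_type must be "dataset" or "record"')
    # raw=True returns the JSON dictionary instead of printing it.
    return forge.describe_field(resource_type, field=field, raw=True)
```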
-
describe_organization(organization, summary=False, raw=False)
Fetch and display the description of an organization registered with MDF.
Parameters:
- organization (str) – The organization to describe. This value can also be "list" to list all organizations' names, or "all" to fetch the metadata for every organization (not recommended).
- summary (bool) – When True, will summarize the organization metadata. The summary just contains the non-technical information about the organization itself. When False, will print all of the metadata. This parameter has no effect if raw=True. Default: False.
- raw (bool) – When False, will format and print the organization metadata. When True, will return the raw JSON dictionary instead. For human consumption, False is recommended. Default: False.
-
fetch_datasets_from_results(entries=None, query=None, reset_query=True)
Retrieve the dataset entries for given records. Note that this method may use the current query.
Note: This method will use terms from the current query, and resets the current query.
Parameters:
- entries (dict, list of dict, or tuple of dict) – The records to parse to find the datasets. This argument can be a single entry, a list of entries, or a tuple with a list of entries. The latter two options support both return values of the search() method. If entries is None, the current query is executed and those results are used instead. Default: None.
- query (str) – If not None, search for entries using this query instead of the current query. Has no effect if entries is not None. Default: None.
- reset_query (bool) – Has no effect unless entries and query are both None. If True, will reset the current query after searching for entries. If False, will not reset the current query. Default: True.
Returns: The dataset entries.
Return type: list
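The three accepted shapes for entries can be pictured with the sketch below. It is an illustration of the description above, not the library's internal implementation, and normalize_entries is a hypothetical name.

```python
def normalize_entries(entries):
    """Return a list of entry dicts from any of the documented shapes."""
    if isinstance(entries, dict):
        return [entries]         # a single entry
    if isinstance(entries, tuple):
        return list(entries[0])  # (results, info), as from search(info=True)
    return list(entries)         # a plain list of entries
```

Because both the list and the tuple form are accepted, the return value of a search can be passed on directly whether or not info=True was used.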
-
get_dataset_version(source_name)
Get the version of a certain dataset.
Parameters: source_name (string) – The source_name of the dataset.
Returns: Version of the dataset in question.
Return type: int
-
globus_download(results, dest='.', dest_ep=None, preserve_dir=False, inactivity_time=None, download_datasets=False, verbose=True)
Download data files from the provided results using Globus Transfer. This method requires Globus Connect to be installed on the destination endpoint.
Parameters:
- results (dict) – The records from which files should be fetched. This should be the return value of a search method.
- dest (str) – The destination path for the data files on the local machine. Default: The current directory.
- dest_ep (str) – The destination endpoint ID. Default: The autodetected local GCP.
- preserve_dir (bool) – If True, the directory structure for the data files will be recreated at the destination. The path to the new files will be relative to the dest path. If False, only the data files themselves will be saved. Default: False.
- inactivity_time (int) – Number of seconds the Transfer is allowed to go without progress before being cancelled. Default: self.__inactivity_time.
- download_datasets (bool) – If True, will download the full dataset for any dataset entries given. If False, will skip dataset entries with a notification. Default: False.
  Caution: Datasets can be large. Additionally, if you do not filter out records from a dataset you provide, you may end up with duplicate files. Use with care.
- verbose (bool) – If True, status and progress messages will be printed, and errors will prompt for continuation confirmation. If False, only error messages will be printed, and the Transfer will always continue. Default: True.
Returns: The task IDs of the Globus transfers.
Return type: list of str
-
http_download(results, dest='.', preserve_dir=False, verbose=True)
Download data files from the provided results using HTTPS. For a large number of files, you should use globus_download() instead, which uses Globus Transfer.
Parameters:
- results (dict) – The records from which files should be fetched. This should be the return value of a search method.
- dest (str) – The destination path for the data files on the local machine. Default: The current directory.
- preserve_dir (bool) – If True, the directory structure for the data files will be recreated at the destination. If False, only the data files themselves will be saved. Default: False.
- verbose (bool) – If True, status and progress messages will be printed. If False, only error messages will be printed. Default: True.
Returns: The status information for the download:
- success (bool): True if the download succeeded. False if it failed.
- message (str): The error message, if the download failed.
Return type: dict
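A sketch of choosing between the two download methods and checking the returned status. The helper name download and the max_http_entries threshold are hypothetical; the threshold counts entries rather than files and is an arbitrary illustrative value, not one defined by mdf_forge.

```python
def download(forge, results, dest=".", max_http_entries=50):
    """Use HTTPS for small result sets and Globus Transfer otherwise.

    `forge` is expected to be an mdf_forge.Forge instance and `results`
    a list of entries returned by a search method.
    """
    if len(results) > max_http_entries:
        # Returns the task IDs of the Globus transfers.
        return forge.globus_download(results, dest=dest)
    status = forge.http_download(results, dest=dest, verbose=False)
    if not status["success"]:
        # "message" is documented as present when the download failed.
        raise RuntimeError(status.get("message", "download failed"))
    return status
```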
-
http_stream(results, verbose=True)
Yield data files from the provided results using HTTPS, through a generator. For a large number of files, you should use globus_download() instead, which uses Globus Transfer.
Parameters:
- results (dict) – The records from which files should be fetched. This should be the return value of a search method.
- verbose (bool) – If True, status and progress messages will be printed. If False, only error messages will be printed. Default: True.
Yields: str – Text of each data file.
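Because http_stream() is a generator, each file's text becomes available as the loop reaches it, rather than all files being collected first. A minimal sketch, using the hypothetical helper name preview_files:

```python
def preview_files(forge, results, max_chars=80):
    """Yield the first max_chars characters of each streamed data file.

    `forge` is expected to be an mdf_forge.Forge instance.
    """
    for text in forge.http_stream(results, verbose=False):
        yield text[:max_chars]
```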
-
match_dois(dois)
Match the given Digital Object Identifiers.
Parameters: dois (str or list of str) – DOIs to match and return.
Returns: self
Return type: Forge
-
match_elements(elements, match_all=True)
Add elemental abbreviations to the query.
Parameters:
- elements (str or list of str) – The elements to match. For example, "Fe" for iron.
- match_all (bool) – If True, will add elements with AND. If False, will use OR. Default: True.
Returns: self
Return type: Forge
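The effect of match_all can be pictured with a small model. This illustrates AND versus OR over a set of element abbreviations and is not how the search service evaluates the query; entry_matches is a hypothetical name.

```python
def entry_matches(entry_elements, wanted, match_all=True):
    """Model of match_all: AND requires every element, OR requires any one."""
    if isinstance(wanted, str):
        wanted = [wanted]  # a single abbreviation such as "Fe"
    present = set(entry_elements)
    if match_all:
        return all(element in present for element in wanted)
    return any(element in present for element in wanted)
```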
-
match_organizations(organizations, match_all=True)
Match the given Organizations. Organizations are MDF-registered groups that can apply rules to datasets.
Parameters:
- organizations (str or list of str) – The organizations to match.
- match_all (bool) – If True, will add organizations with AND. If False, will use OR. Default: True.
Returns: self
Return type: Forge
-
match_records(source_name, scroll_ids)
Match specific records from a given dataset. Multiple records may be matched, but only one dataset per call.
Parameters:
- source_name (str) – The source_name of the records' dataset. The source_id is also accepted for convenience.
- scroll_ids (int or list of int) – The scroll_id values of the records to match.
Returns: self
Return type: Forge
-
match_resource_types(types)
Match the given resource types.
Parameters: types (str or list of str) – The resource_type values to match.
Returns: self
Return type: Forge
-
match_source_names(source_names)
Add sources to match to the query.
Parameters: source_names (str or list of str) – The source_name values to match. source_id values are also accepted, but are matched without the additional version information they have.
Returns: self
Return type: Forge
-
match_titles(titles)
Add titles to the query.
Parameters: titles (str or list of str) – The titles to match.
Returns: self
Return type: Forge
-
match_years(years=None, start=None, stop=None, inclusive=True)
Add years and limits to the query.
Parameters:
- years (int or string, or list of int or strings) – The years to match. Note that this argument overrides the start, stop, and inclusive arguments.
- start (int or string) – The lower bound of the range of years to match.
- stop (int or string) – The upper bound of the range of years to match.
- inclusive (bool) – If True, the start and stop values will be included in the search. If False, they will be excluded. Default: True.
Returns: self
Return type: Forge
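The interaction between years, start, stop, and inclusive can be modelled as below. This is an illustrative reading of the parameter descriptions above, not library code, and year_matches is a hypothetical name.

```python
def year_matches(year, years=None, start=None, stop=None, inclusive=True):
    """Return True if `year` satisfies the described arguments."""
    year = int(year)
    if years is not None:
        # `years` overrides start, stop, and inclusive.
        if isinstance(years, (int, str)):
            years = [years]
        return year in {int(y) for y in years}
    if start is not None:
        start = int(start)
        if year < start or (year == start and not inclusive):
            return False
    if stop is not None:
        stop = int(stop)
        if year > stop or (year == stop and not inclusive):
            return False
    return True
```

Under this reading, match_years(start=2015, stop=2018) covers 2015 through 2018, while inclusive=False narrows it to 2016 and 2017.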
-
search_by_dois(dois, index=None, limit=None, info=False)
Execute a search for the given Digital Object Identifiers. search_by_dois([x]) is equivalent to match_dois([x]).search().
Note: This method will use terms from the current query, and resets the current query.
Parameters:
- dois (list of str) – The DOIs to find.
- index (str) – The Search index to search on. Default: The current index.
- limit (int) – The maximum number of results to return. The max for this argument is the SEARCH_LIMIT imposed by Globus Search. Default: SEARCH_LIMIT.
- info (bool) – If False, search will return a list of the results. If True, search will return a tuple containing the results list and other information about the query. Default: False.
Returns:
- If info is False, list: The search results.
- If info is True, tuple: The search results, and a dictionary of query information.
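With info=True the return value is a tuple, so it can be unpacked directly. A brief sketch with the hypothetical helper name find_by_doi; it also wraps a single DOI in a list, since dois is documented here as a list of str:

```python
def find_by_doi(forge, dois, limit=None):
    """Search by DOI and return (results, info).

    `forge` is expected to be an mdf_forge.Forge instance.
    """
    if isinstance(dois, str):
        dois = [dois]  # search_by_dois documents a list of str
    results, info = forge.search_by_dois(dois, limit=limit, info=True)
    return results, info
```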
-
search_by_elements(elements, source_names=[], index=None, limit=None, match_all=True, info=False)
Execute a search for the given elements in the given sources. search_by_elements([x], [y]) is equivalent to match_elements([x]).match_source_names([y]).search().
Note: This method will use terms from the current query, and resets the current query.
Parameters:
- elements (list of str) – The elements to match. For example, "Fe" for iron.
- source_names (list of str) – The source_name values to match. Default: [].
- index (str) – The Search index to search on. Default: The current index.
- limit (int) – The maximum number of results to return. The max for this argument is the SEARCH_LIMIT imposed by Globus Search. Default: SEARCH_LIMIT.
- match_all (bool) – If True, will add elements with AND. If False, will use OR. Default: True.
- info (bool) – If False, search will return a list of the results. If True, search will return a tuple containing the results list and other information about the query. Default: False.
Returns:
- If info is False, list: The search results.
- If info is True, tuple: The search results, and a dictionary of query information.
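The stated equivalence can be written out as a chain of match_* calls, each of which returns the Forge instance. A sketch under the assumption that search() accepts the limit and info keywords used elsewhere on this page; search_elements_chained is a hypothetical name.

```python
def search_elements_chained(forge, elements, source_names,
                            limit=None, match_all=True, info=False):
    """Spell out the chain that search_by_elements is documented to equal.

    `forge` is expected to be an mdf_forge.Forge instance.
    """
    query = forge.match_elements(elements, match_all=match_all)
    query = query.match_source_names(source_names)
    return query.search(limit=limit, info=info)
```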
-
search_by_titles(titles, index=None, limit=None, info=False)
Execute a search for the given titles. search_by_titles([x]) is equivalent to match_titles([x]).search().
Note: This method will use terms from the current query, and resets the current query.
Parameters:
- titles (list of str) – The titles to match.
- index (str) – The Search index to search on. Default: The current index.
- limit (int) – The maximum number of results to return. The max for this argument is the SEARCH_LIMIT imposed by Globus Search. Default: SEARCH_LIMIT.
- info (bool) – If False, search will return a list of the results. If True, search will return a tuple containing the results list and other information about the query. Default: False.
Returns:
- If info is False, list: The search results.
- If info is True, tuple: The search results, and a dictionary of query information.
-