Update NMDC ingest script to include Study name in data_collections field #30

@eecavanna

Description

As shown in the code snippet below, the ingest script currently includes only the study's ID and the URL of the study's page on the NMDC data portal, both of which it derives from information in the Biosample. Retrieving additional details about the study, such as its name and description, will require fetching data from the study_set collection (via a Runtime API endpoint such as GET /studies).

data/contrib/nmdc/ingest.py

Lines 152 to 169 in 87fab60

def get_part_of_collection(self) -> list[bertron.DataCollection]:
    """Returns a list of `DataCollection` instances, each describing one of the Biosample's associated studies.

    References:
    - https://ber-data.github.io/bertron-schema/DataCollection/
    - https://microbiomedata.github.io/nmdc-schema/associated_studies/

    TODO: Retrieve the name and description of the Study from the NMDC Runtime API, then include it here.
    """
    data_collections = []
    if self.associated_studies is not None and len(self.associated_studies) > 0:
        for study_id in self.associated_studies:
            data_collection = bertron.DataCollection(
                id=study_id,
                url=f"https://api.microbiomedata.org/studies/{study_id}",
            )
            data_collections.append(data_collection)
    return data_collections
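
A rough sketch of what resolving the TODO might look like is below. It is not the actual implementation. `DataCollectionSketch` is a hypothetical stand-in for `bertron.DataCollection`, and its `title` and `description` field names are assumptions that would need to be checked against the BERtron schema. The sketch also assumes that `GET /studies/{study_id}` returns a JSON Study document with `name` and `description` keys. The fetch function is passed in as a parameter so the logic can be exercised without network access.

```python
import json
import urllib.request
from dataclasses import dataclass
from typing import Callable, Iterable, Optional

API_BASE_URL = "https://api.microbiomedata.org"


@dataclass
class DataCollectionSketch:
    """Hypothetical stand-in for `bertron.DataCollection`.

    Only `id` and `url` appear in the current ingest script; the `title` and
    `description` field names are assumptions made for this sketch.
    """

    id: str
    url: str
    title: Optional[str] = None
    description: Optional[str] = None


def fetch_study(study_id: str) -> dict:
    """Fetches one Study document (assumes the endpoint returns a JSON object)."""
    url = f"{API_BASE_URL}/studies/{study_id}"
    with urllib.request.urlopen(url, timeout=30) as response:
        return json.load(response)


def build_data_collections(
    study_ids: Optional[Iterable[str]],
    fetch: Callable[[str], dict] = fetch_study,
) -> list:
    """Builds one data collection per study ID, enriched with name and description."""
    data_collections = []
    for study_id in study_ids or []:
        study = fetch(study_id)
        data_collections.append(
            DataCollectionSketch(
                id=study_id,
                url=f"{API_BASE_URL}/studies/{study_id}",
                # `.get()` leaves the field as None when the Study lacks the key.
                title=study.get("name"),
                description=study.get("description"),
            )
        )
    return data_collections
```

Fetching one study per request is the simplest approach. If many Biosamples share a study, caching the fetched Study documents would avoid repeated requests.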

I think this will be a straightforward change to make, but it may require renaming some variables and wrapping the cached data in a higher-level JSON object (e.g. one that has a biosamples property and a studies property).
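
One possible shape for that wrapped cache is sketched below. The function names are hypothetical. The sketch also assumes that the existing cache file is a bare JSON array of Biosamples, which is not confirmed here. Accepting both shapes would let an older cache file keep working until it is regenerated.

```python
import json
from typing import Any


def normalize_cache(raw: Any) -> dict:
    """Returns the cache as {"biosamples": [...], "studies": [...]}.

    Assumption: a legacy cache is a bare JSON array of Biosample documents.
    """
    if isinstance(raw, list):
        return {"biosamples": raw, "studies": []}
    return {
        "biosamples": raw.get("biosamples", []),
        "studies": raw.get("studies", []),
    }


def index_studies_by_id(cache: dict) -> dict:
    """Maps each cached Study's `id` to its document for quick lookup."""
    return {study["id"]: study for study in cache["studies"]}


# Example: a wrapped cache, as it might be read from disk.
cache_text = json.dumps(
    {
        "biosamples": [{"id": "nmdc:bsm-00-000001"}],
        "studies": [{"id": "nmdc:sty-00-000001", "name": "Example study"}],
    }
)
cache = normalize_cache(json.loads(cache_text))
studies_by_id = index_studies_by_id(cache)
```

With an index like `studies_by_id`, a method such as `get_part_of_collection` could look up each study's name locally instead of calling the API once per Biosample.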