CEDA/ESGF

CEDA+ESGF - an easy CMIP6 wrapper

After the lab meeting:

We discussed everyone's different ways of getting at CMIP6 data
The two prevailing camps are CEDA and intake ESGF
CEDA is incomplete but does contain useful datasets
intake ESGF methods are easy to use but rely on unlimited storage capacity
Bridging the gap between these two, and identifying where CEDA resources start and end, may lead to better data management practices

Goals:

ESGF web-browser-like search interface (autocomplete?)
CEDA-first comprehensive search
Identification of missing variables
Estimation of data download requirements
Easy download to cache / repeatable access
Storage statistics
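The "CEDA-first" search and the missing-variable check can be sketched in a few lines. This is only an illustration of the idea, not how `climdyn_tools` implements it: `split_by_source` is a hypothetical helper, and the availability sets below are made up for the example.

```python
def split_by_source(wanted, on_ceda, on_esgf):
    """Assign each wanted variable to CEDA first, then ESGF, else flag it as missing."""
    on_ceda, on_esgf = set(on_ceda), set(on_esgf)
    # Prefer CEDA whenever it holds the variable (no download needed)
    ceda_vars = [v for v in wanted if v in on_ceda]
    # Fall back to ESGF only for variables CEDA does not hold
    esgf_vars = [v for v in wanted if v not in on_ceda and v in on_esgf]
    # Anything found in neither place is reported rather than silently dropped
    missing = [v for v in wanted if v not in on_ceda and v not in on_esgf]
    return ceda_vars, esgf_vars, missing


# Hypothetical availability, for illustration only
wanted = ['tas', 'rsus', 'rlus', 'hfls', 'mrso']
on_ceda = {'tas', 'hfls'}
on_esgf = {'tas', 'rsus', 'rlus', 'hfls'}

ceda_vars, esgf_vars, missing = split_by_source(wanted, on_ceda, on_esgf)
print(f'CEDA: {ceda_vars}, ESGF: {esgf_vars}, missing: {missing}')
```

Keeping the three lists separate means the ESGF list doubles as the download list, and the missing list is what would need chasing elsewhere.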

Potential issues / planned fixes:

Different grids (gn, gr, gr1...) - could mess with things
No way of combining timelines at the moment, i.e. historical into SSPXXX
No way of grabbing land masks - relies on importing external ones
Would like a storage demand estimator / way of managing storage
A pivot-table view to visualize data availability
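As a starting point for the storage-management item, a cache directory can be summarised with the standard library alone. This is a rough sketch under assumptions: `cache_usage` and `format_bytes` are hypothetical helpers (not part of `climdyn_tools` or `intake_esgf`), and sizes are reported in decimal units.

```python
import os
from collections import defaultdict


def cache_usage(cache_root):
    """Return {top-level subdirectory: total bytes} for all files under cache_root."""
    usage = defaultdict(int)
    for dirpath, _dirnames, filenames in os.walk(cache_root):
        rel = os.path.relpath(dirpath, cache_root)
        # Files sitting directly in cache_root are grouped under "."
        top = "." if rel == "." else rel.split(os.sep)[0]
        for name in filenames:
            path = os.path.join(dirpath, name)
            # Skip symlinks so linked files are not counted as cache storage
            if os.path.isfile(path) and not os.path.islink(path):
                usage[top] += os.path.getsize(path)
    return dict(usage)


def format_bytes(n):
    """Format a byte count using decimal units (1 kB = 1000 B)."""
    for unit in ("B", "kB", "MB", "GB", "TB"):
        if n < 1000 or unit == "TB":
            return f"{n:.1f} {unit}"
        n /= 1000


# Example use (the path is a placeholder):
# for name, size in sorted(cache_usage("/path/to/cache").items()):
#     print(f"{name:30s} {format_bytes(size)}")
```

Grouping by top-level directory is an arbitrary choice; grouping by model or experiment would need knowledge of the cache's directory layout.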

Note that this notebook uses functions from the ceda_esgf section of the climdyn_tools package.

In [1]:

import glob
import os
from importlib import reload  # Python 3.4+

import cftime
import intake_esgf
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import xarray as xr
from intake_esgf import ESGFCatalog

import climdyn_tools.ceda_esgf.base as CEFunc
# import CEDAESGF_Funcs as CEFunc

reload(CEFunc)  # pick up local edits to the module without restarting the kernel

Out[1]:

<module 'climdyn_tools.ceda_esgf.base' from '/Users/joshduffield/Documents/StAndrews/Wiki/climdyn_tools/ceda_esgf/base.py'>

In [250]:

#### Config
activity_id = 'CMIP'
experiment_id = 'abrupt-4xCO2'
source_id = "GFDL-CM4"
do1member = True

# CMIP variables of interest
variableList = ['tas', 'huss', 'rsds', 'rsus', 'rlds', 'rlus', 'hfls', 'hfss']
table_id = 'Amon'  ## Time step

### Change to your personal cache
intake_esgf.conf.set(local_cache="/gws/nopw/j04/global_ex/chingosa/cache")
intake_esgf.conf.set(indices={  # Which index nodes to search - can be fiddly
    "esgf-node.llnl.gov": False,
    "esgf-node.ornl.gov": True,
    "esgf.ceda.ac.uk": True,
    "anl-dev": True,
    "ornl-dev": True,
    "ESGF2-US-1.5-Catalog": True,
    "esgf-data.dkrz.de": True,
    "esgf-node.ipsl.upmc.fr": True,
    "esg-dn1.nsc.liu.se": True,
    "esgf.nci.org.au": True,
})

### Metadata
CMIP6Meta = CEFunc.load_cmip6_source_id()                             # JSON of CMIP6 metadata
source_id_list = CEFunc.source_id_in_activity(activity_id, CMIP6Meta) # source_ids of models participating in activity_id - use this for looping over source_ids
M2I = CEFunc.getModel_to_inst(CMIP6Meta)                              # Dictionary linking institution name (for CEDA) to source_id

# Figure out what CEDA/ESGF have
CEDA_res = CEFunc.checkCEDA(source_id, activity_id, experiment_id, M2I, variableList, table_id, member_id = '*')
ESGF_res = CEFunc.checkESGF(source_id, activity_id, experiment_id, M2I, variableList, table_id, member_id = '*')

# Find Overlapping Variants and variables that make sense to grab
pivot = CEFunc.compare_cat_res_pivot(ESGF_res.df, CEDA_res)
ranking = CEFunc.rank_members_with_vars(pivot)
sensible_members = ranking[ranking.CEDA_CHOICE_count == ranking.CEDA_CHOICE_count.iloc[0]]

# Which variables are we getting from each source
ESGF_vars = sensible_members.ESGF_vars.iloc[0]

print(f'There are like {len(sensible_members)} that would make sense to analyse but will have to download {len(ESGF_vars)} of {len(variableList)} variables from ESGF')

# If looping through the sensible members, here is where we decide which ones to process
if do1member:
    member_ids = [sensible_members.member_id.iloc[0]]
else:
    member_ids = sensible_members.member_id

### Loop through Members
for member_id in member_ids:
    print(f'Starting Analysis for {member_id}')
    
    row = sensible_members[sensible_members.member_id == member_id].reset_index(drop=True)
    # Which variables are we getting from each source
    CEDA_vars = row.CEDA_vars.item()
    ESGF_vars = row.ESGF_vars.item()

    # If there are multiple sources, find them, divvy up the variables and combine everything into one dataset
    ds = CEFunc.getCombinedData(source_id, activity_id, experiment_id, M2I, CEDA_vars, ESGF_vars, table_id, member_id, doReadOut=True)

    ## do whatever you need to do with ds

Attempt 1 to initialize ESGFCatalog...
ESGFCatalog successfully initialized.

   Searching indices:   0%|          |0/9 [       ?index/s]

/home/users/chingosa/.local/lib/python3.11/site-packages/intake_esgf/catalog.py:316: UserWarning: SolrESGFIndex('esgf-node.ornl.gov') failed to return a response, results may be incomplete
  warnings.warn(

There are like 1 that would make sense to analyse but will have to download 2 of 8 variables from ESGF
Starting Analysis for r1i1p1f1
Attempt 1 to initialize ESGFCatalog...
ESGFCatalog successfully initialized.

   Searching indices:   0%|          |0/9 [       ?index/s]

/home/users/chingosa/.local/lib/python3.11/site-packages/intake_esgf/catalog.py:316: UserWarning: SolrESGFIndex('esgf-node.ornl.gov') failed to return a response, results may be incomplete
  warnings.warn(

Get file information:   0%|          |0/9 [       ?index/s]

/home/users/chingosa/.local/lib/python3.11/site-packages/intake_esgf/catalog.py:468: UserWarning: SolrESGFIndex('esgf-node.ornl.gov') failed to return a response, info may be incomplete
  warnings.warn(

Downloading 499.5 [Mb]...

rlus_Amon_GFDL-CM4_abrupt-4xCO2_r1i1p...:   0%|          |0.00/160M [?B/s]

rlus_Amon_GFDL-CM4_abrupt-4xCO2_r1i1p...:   0%|          |0.00/79.8M [?B/s]

rsus_Amon_GFDL-CM4_abrupt-4xCO2_r1i1p...:   0%|          |0.00/173M [?B/s]

rsus_Amon_GFDL-CM4_abrupt-4xCO2_r1i1p...:   0%|          |0.00/86.7M [?B/s]
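The per-file sizes in the progress bars above add up to the reported 499.5 total, which suggests a simple way to estimate download requirements before committing to them. The sketch below sums sizes per variable and checks them against free disk space. It is an illustration only: `download_by_variable` and `will_fit` are hypothetical helpers, the file sizes are typed in by hand from the output above, and treating them as decimal megabytes is an assumption.

```python
import shutil
from collections import defaultdict

# (variable, size in MB) read off the progress bars above; two files per variable
files_mb = [("rlus", 160.0), ("rlus", 79.8), ("rsus", 173.0), ("rsus", 86.7)]


def download_by_variable(files):
    """Sum file sizes per variable."""
    totals = defaultdict(float)
    for var, size_mb in files:
        totals[var] += size_mb
    return dict(totals)


def will_fit(required_mb, path, reserve_fraction=0.1):
    """True if required_mb fits on the disk holding path while keeping a reserve free."""
    usage = shutil.disk_usage(path)
    spare_bytes = usage.free - reserve_fraction * usage.total
    return required_mb * 1e6 <= spare_bytes


per_var = download_by_variable(files_mb)
total_mb = sum(per_var.values())
print({var: round(size, 1) for var, size in per_var.items()}, round(total_mb, 1))
print("Fits in current directory's disk:", will_fit(total_mb, "."))
```

The 10% reserve is arbitrary; on a shared group workspace a quota check would be more meaningful than raw free space.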

In [251]:

ds

Out[251]:

<xarray.Dataset> Size: 3GB
Dimensions:    (time: 1800, lat: 180, lon: 288, bnds: 2)
Coordinates:
  * bnds       (bnds) float64 16B 1.0 2.0
  * lat        (lat) float64 1kB -89.5 -88.5 -87.5 -86.5 ... 86.5 87.5 88.5 89.5
  * lon        (lon) float64 2kB 0.625 1.875 3.125 4.375 ... 356.9 358.1 359.4
  * time       (time) object 14kB 0001-01-16 12:00:00 ... 0150-12-16 12:00:00
    height     float64 8B 2.0
Data variables:
    hfls       (time, lat, lon) float32 373MB dask.array<chunksize=(1, 180, 288), meta=np.ndarray>
    lat_bnds   (time, lat, bnds) float64 5MB dask.array<chunksize=(1200, 180, 2), meta=np.ndarray>
    lon_bnds   (time, lon, bnds) float64 8MB dask.array<chunksize=(1200, 288, 2), meta=np.ndarray>
    time_bnds  (time, bnds) object 29kB dask.array<chunksize=(1, 2), meta=np.ndarray>
    hfss       (time, lat, lon) float32 373MB dask.array<chunksize=(1, 180, 288), meta=np.ndarray>
    huss       (time, lat, lon) float32 373MB dask.array<chunksize=(1, 180, 288), meta=np.ndarray>
    rlds       (time, lat, lon) float32 373MB dask.array<chunksize=(1, 180, 288), meta=np.ndarray>
    rsds       (time, lat, lon) float32 373MB dask.array<chunksize=(1, 180, 288), meta=np.ndarray>
    tas        (time, lat, lon) float32 373MB dask.array<chunksize=(1, 180, 288), meta=np.ndarray>
    rlus       (time, lat, lon) float32 373MB dask.array<chunksize=(1, 180, 288), meta=np.ndarray>
    rsus       (time, lat, lon) float32 373MB dask.array<chunksize=(1, 180, 288), meta=np.ndarray>
Attributes: (12/46)
    external_variables:     areacella
    history:                File was processed by fremetar (GFDL analog of CM...
    table_id:               Amon
    activity_id:            CMIP
    branch_method:          standard
    branch_time_in_child:   0.0
    ...                     ...
    variable_id:            hfls
    variant_info:           N/A
    references:             see further_info_url attribute
    variant_label:          r1i1p1f1
    branch_time_in_parent:  36500.0
    parent_time_units:      days since 0001-1-1
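The sizes in the dataset summary above follow directly from the dimensions and dtypes, which gives a cheap way to estimate in-memory (uncompressed) storage demand before loading anything. A minimal sketch, using only numbers shown in the output above; `array_bytes` is a hypothetical helper, and the result says nothing about compressed file sizes on disk.

```python
from math import prod


def array_bytes(shape, itemsize):
    """Uncompressed size in bytes of an array with the given shape and element size."""
    return prod(shape) * itemsize


n_time, n_lat, n_lon, n_bnds = 1800, 180, 288, 2

# Each of the 8 requested variables is float32 (4 bytes) on (time, lat, lon)
field_bytes = array_bytes((n_time, n_lat, n_lon), 4)

# Bounds variables are stored with 8-byte elements
bounds_bytes = (
    array_bytes((n_time, n_lat, n_bnds), 8)    # lat_bnds
    + array_bytes((n_time, n_lon, n_bnds), 8)  # lon_bnds
    + array_bytes((n_time, n_bnds), 8)         # time_bnds
)

total_bytes = 8 * field_bytes + bounds_bytes
print(f"{field_bytes / 1e6:.0f} MB per variable, {total_bytes / 1e9:.1f} GB in total")
```

This reproduces the 373MB per variable and roughly 3GB total reported by xarray. With 1800 monthly steps, the record covers 150 years.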