CEDA/ESGF
CEDA+ESGF - an easy CMIP6 wrapper
After the lab meeting:
- We discussed everyone's different ways of getting at CMIP6 data
- The two prevailing camps are CEDA and intake-esgf
- CEDA is incomplete but does contain useful datasets
- intake-esgf methods are easy to use but assume unlimited storage capacity
- Bridging the gap between the two, and identifying where CEDA resources start and end, may lead to better data management practices
Goals:
- ESGF web-browser-like search interface (autocomplete?)
- CEDA-first comprehensive search
- Identification of missing variables
- Estimation of data download requirements
- Easy download to cache / repeatable access
- Storage statistics
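The "CEDA-first" goal can be sketched as a local file check that runs before any ESGF request. This is a minimal illustration, not the climdyn_tools implementation: `split_by_source` is a hypothetical helper, and the archive root and directory layout (`activity/institution/source/experiment/member/table/variable/grid/latest`) are assumptions that should be checked on JASMIN before use. The demonstration runs against a throwaway directory tree so that it works anywhere.

```python
import tempfile
from pathlib import Path

# Assumed CEDA archive root on JASMIN; verify before relying on it.
CEDA_ROOT = Path("/badc/cmip6/data/CMIP6")


def split_by_source(root, activity_id, institution_id, source_id,
                    experiment_id, member_id, table_id, variables):
    """Hypothetical helper: split variables into (on_ceda, needs_esgf)."""
    base = (Path(root) / activity_id / institution_id / source_id
            / experiment_id / member_id / table_id)
    on_ceda, needs_esgf = [], []
    for var in variables:
        # Accept any grid label (gn, gr, gr1, ...) under the 'latest' version.
        found = any(base.glob(f"{var}/*/latest/*.nc"))
        (on_ceda if found else needs_esgf).append(var)
    return on_ceda, needs_esgf


# Demonstration on a temporary tree that mimics the assumed layout.
with tempfile.TemporaryDirectory() as tmp:
    leaf = (Path(tmp) / "CMIP" / "NOAA-GFDL" / "GFDL-CM4" / "abrupt-4xCO2"
            / "r1i1p1f1" / "Amon" / "tas" / "gr1" / "latest")
    leaf.mkdir(parents=True)
    (leaf / "tas_demo.nc").touch()

    on_ceda, needs_esgf = split_by_source(
        tmp, "CMIP", "NOAA-GFDL", "GFDL-CM4", "abrupt-4xCO2",
        "r1i1p1f1", "Amon", ["tas", "rlus"],
    )

print("from CEDA:", on_ceda)
print("from ESGF:", needs_esgf)
```

On JASMIN the same call would take `CEDA_ROOT` as its first argument; only the variables in `needs_esgf` would then need downloading to the cache.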
Potential Issues / Planned fixes:
- Different grids (gn, gr, gr1...) could cause problems
- No way of combining timelines at the moment, i.e. joining historical onto SSPXXX
- No way of grabbing land masks; relies on importing external ones
- Would like a storage demand estimator / way of managing storage
- A pivot table view to visualize data availability
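The availability view in the last item could look something like the following pandas sketch. The `hits` table is made-up illustrative data, and its column names (`member_id`, `variable_id`, `source`) are assumptions rather than the actual output format of the climdyn_tools search functions. Where both sources hold a variable, CEDA is preferred.

```python
import pandas as pd

# Hypothetical search results: one row per (member, variable, source) hit.
hits = pd.DataFrame(
    [
        ("r1i1p1f1", "tas", "CEDA"),
        ("r1i1p1f1", "tas", "ESGF"),
        ("r1i1p1f1", "rlus", "ESGF"),
        ("r2i1p1f1", "tas", "ESGF"),
    ],
    columns=["member_id", "variable_id", "source"],
)

# Prefer CEDA when a variable is available from both sources.
hits["rank"] = hits["source"].map({"CEDA": 0, "ESGF": 1})
best = hits.sort_values("rank").drop_duplicates(["member_id", "variable_id"])

# Rows are members, columns are variables, cells name the chosen source.
availability = best.pivot(index="member_id", columns="variable_id", values="source")
availability = availability.fillna("missing")
print(availability)
```

Cells marked `missing` flag the member and variable combinations that neither source can supply.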
Note that this notebook uses functions from the ceda_esgf section of the climdyn_tools
package.
In [1]:
import os, intake_esgf
import glob
import xarray as xr
import numpy as np
import cftime
import pandas as pd
from intake_esgf import ESGFCatalog
from importlib import reload # Python 3.4+
import matplotlib.pyplot as plt
import climdyn_tools.ceda_esgf.base as CEFunc
# import CEDAESGF_Funcs as CEFunc
reload(CEFunc)
Out[1]:
<module 'climdyn_tools.ceda_esgf.base' from '/Users/joshduffield/Documents/StAndrews/Wiki/climdyn_tools/ceda_esgf/base.py'>
In [250]:
#### Config
activity_id = 'CMIP'
experiment_id = 'abrupt-4xCO2'
source_id = "GFDL-CM4"
do1member = True  # if True, analyse only the top-ranked member
# CMIP variables of interest
variableList = ['tas', 'huss', 'rsds', 'rsus', 'rlds', 'rlus', 'hfls', 'hfss']
table_id = 'Amon'  # Time step
### Change to your own cache directory
intake_esgf.conf.set(local_cache="/gws/nopw/j04/global_ex/chingosa/cache")
intake_esgf.conf.set(indices={  # Sets which index nodes are searched - can be fiddly
    "esgf-node.llnl.gov": False,
    "esgf-node.ornl.gov": True,
    "esgf.ceda.ac.uk": True,
    "anl-dev": True,
    "ornl-dev": True,
    "ESGF2-US-1.5-Catalog": True,
    "esgf-data.dkrz.de": True,
    "esgf-node.ipsl.upmc.fr": True,
    "esg-dn1.nsc.liu.se": True,
    "esgf.nci.org.au": True,
})
### Metadata
CMIP6Meta = CEFunc.load_cmip6_source_id()  # JSON of CMIP6 metadata
# Models participating in activity_id - use this for looping over source_ids
source_id_list = CEFunc.source_id_in_activity(activity_id, CMIP6Meta)
M2I = CEFunc.getModel_to_inst(CMIP6Meta)  # Dictionary linking institution name (for CEDA) to source_id
# Figure out what CEDA/ESGF have
CEDA_res = CEFunc.checkCEDA(source_id, activity_id, experiment_id, M2I, variableList, table_id, member_id='*')
ESGF_res = CEFunc.checkESGF(source_id, activity_id, experiment_id, M2I, variableList, table_id, member_id='*')
# Find overlapping variants and variables that make sense to grab
pivot = CEFunc.compare_cat_res_pivot(ESGF_res.df, CEDA_res)
ranking = CEFunc.rank_members_with_vars(pivot)
# Keep every member that ties with the top-ranked row on CEDA_CHOICE_count
sensible_members = ranking[ranking.CEDA_CHOICE_count == ranking.CEDA_CHOICE_count.iloc[0]]
# Which variables are we getting from each source
ESGF_vars = sensible_members.ESGF_vars.iloc[0]
print(f'There are like {len(sensible_members)} that would make sense to analyse but will have to download {len(ESGF_vars)} of {len(variableList)} variables from ESGF')
# If looping through the sensible members, here is where we decide which ones to process
if do1member:
    member_ids = [sensible_members.member_id.iloc[0]]
else:
    member_ids = sensible_members.member_id
### Loop through members
for member_id in member_ids:
    print(f'Starting Analysis for {member_id}')
    row = sensible_members[sensible_members.member_id == member_id].reset_index(drop=True)
    # Which variables are we getting from each source
    CEDA_vars = row.CEDA_vars.item()
    ESGF_vars = row.ESGF_vars.item()
    # If there are multiple sources, fetch from each and combine into one dataset
    ds = CEFunc.getCombinedData(source_id, activity_id, experiment_id, M2I, CEDA_vars, ESGF_vars, table_id, member_id, doReadOut=True)
    ## do whatever you need to do with ds
Attempt 1 to initialize ESGFCatalog... ESGFCatalog successfully initialized.
Searching indices: 0%| |0/9 [ ?index/s]
/home/users/chingosa/.local/lib/python3.11/site-packages/intake_esgf/catalog.py:316: UserWarning: SolrESGFIndex('esgf-node.ornl.gov') failed to return a response, results may be incomplete
  warnings.warn(
There are like 1 that would make sense to analyse but will have to download 2 of 8 variables from ESGF
Starting Analysis for r1i1p1f1
Attempt 1 to initialize ESGFCatalog... ESGFCatalog successfully initialized.
Searching indices: 0%| |0/9 [ ?index/s]
/home/users/chingosa/.local/lib/python3.11/site-packages/intake_esgf/catalog.py:316: UserWarning: SolrESGFIndex('esgf-node.ornl.gov') failed to return a response, results may be incomplete
  warnings.warn(
Get file information: 0%| |0/9 [ ?index/s]
/home/users/chingosa/.local/lib/python3.11/site-packages/intake_esgf/catalog.py:468: UserWarning: SolrESGFIndex('esgf-node.ornl.gov') failed to return a response, info may be incomplete
  warnings.warn(
Downloading 499.5 [Mb]...
rlus_Amon_GFDL-CM4_abrupt-4xCO2_r1i1p...: 0%| |0.00/160M [?B/s]
rlus_Amon_GFDL-CM4_abrupt-4xCO2_r1i1p...: 0%| |0.00/79.8M [?B/s]
rsus_Amon_GFDL-CM4_abrupt-4xCO2_r1i1p...: 0%| |0.00/173M [?B/s]
rsus_Amon_GFDL-CM4_abrupt-4xCO2_r1i1p...: 0%| |0.00/86.7M [?B/s]
In [251]:
ds
Out[251]:
<xarray.Dataset> Size: 3GB
Dimensions:    (time: 1800, lat: 180, lon: 288, bnds: 2)
Coordinates:
  * bnds       (bnds) float64 16B 1.0 2.0
  * lat        (lat) float64 1kB -89.5 -88.5 -87.5 -86.5 ... 86.5 87.5 88.5 89.5
  * lon        (lon) float64 2kB 0.625 1.875 3.125 4.375 ... 356.9 358.1 359.4
  * time       (time) object 14kB 0001-01-16 12:00:00 ... 0150-12-16 12:00:00
    height     float64 8B 2.0
Data variables:
    hfls       (time, lat, lon) float32 373MB dask.array<chunksize=(1, 180, 288), meta=np.ndarray>
    lat_bnds   (time, lat, bnds) float64 5MB dask.array<chunksize=(1200, 180, 2), meta=np.ndarray>
    lon_bnds   (time, lon, bnds) float64 8MB dask.array<chunksize=(1200, 288, 2), meta=np.ndarray>
    time_bnds  (time, bnds) object 29kB dask.array<chunksize=(1, 2), meta=np.ndarray>
    hfss       (time, lat, lon) float32 373MB dask.array<chunksize=(1, 180, 288), meta=np.ndarray>
    huss       (time, lat, lon) float32 373MB dask.array<chunksize=(1, 180, 288), meta=np.ndarray>
    rlds       (time, lat, lon) float32 373MB dask.array<chunksize=(1, 180, 288), meta=np.ndarray>
    rsds       (time, lat, lon) float32 373MB dask.array<chunksize=(1, 180, 288), meta=np.ndarray>
    tas        (time, lat, lon) float32 373MB dask.array<chunksize=(1, 180, 288), meta=np.ndarray>
    rlus       (time, lat, lon) float32 373MB dask.array<chunksize=(1, 180, 288), meta=np.ndarray>
    rsus       (time, lat, lon) float32 373MB dask.array<chunksize=(1, 180, 288), meta=np.ndarray>
Attributes: (12/46)
    external_variables:     areacella
    history:                File was processed by fremetar (GFDL analog of CM...
    table_id:               Amon
    activity_id:            CMIP
    branch_method:          standard
    branch_time_in_child:   0.0
    ...                     ...
    variable_id:            hfls
    variant_info:           N/A
    references:             see further_info_url attribute
    variant_label:          r1i1p1f1
    branch_time_in_parent:  36500.0
    parent_time_units:      days since 0001-1-1
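The sizes in this output can be checked by hand, and the same arithmetic is a starting point for the storage estimator listed under Goals. Each data variable is a float32 array of shape (time, lat, lon) = (1800, 180, 288), so its uncompressed size is the element count times 4 bytes. The sketch below is only an illustration: `in_memory_bytes` is a hypothetical helper, the download figures are copied from the progress bars in the log above, and the log's "Mb" is assumed to mean megabytes.

```python
import math


def in_memory_bytes(shape, itemsize):
    """Uncompressed size of a dense array: element count times bytes per element."""
    return math.prod(shape) * itemsize


# Dimensions and dtype taken from the dataset output (float32 = 4 bytes).
per_variable = in_memory_bytes((1800, 180, 288), itemsize=4)
all_variables = 8 * per_variable  # eight requested variables; bounds arrays ignored

# File sizes reported in the download log for rlus and rsus (assumed megabytes).
downloaded_mb = 160 + 79.8 + 173 + 86.7
ratio = downloaded_mb * 1e6 / (2 * per_variable)

print(f"{per_variable / 1e6:.0f} MB per variable, {all_variables / 1e9:.1f} GB in total")
print(f"download / in-memory ratio for rlus + rsus: {ratio:.2f}")
```

The in-memory size is larger than the download because the NetCDF files on disk are compressed, so cache usage and memory usage need to be estimated separately.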