PR suggestions for from_znap_dataset#106
Conversation
There was a problem hiding this comment.
Pull request overview
This PR refactors SNAP ZNAP ingestion to better support “SNAP OU”/ZNAP-style archives by introducing explicit datalayer name patterns and a helper for reading/normalizing a single ZNAP archive, then building a stacked xr.Dataset with complex/amplitude/phase outputs.
Changes:
- Add
RE_PATTERNS_SNAP_DATALAYERplus ZNAP variable-name constants inconf.py. - Refactor
from_snap_datasetto use_read_one_znap_archive, build a time stack, and derivecomplex/amplitude/phase. - Add
_read_one_znap_archivehelper to drop tiepoint dims and rename ZNAP datalayers into canonical variable names.
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 4 comments.
| File | Description |
|---|---|
sarxarray/conf.py |
Adds regex patterns/constants for canonicalizing SNAP ZNAP datalayer names and lists of expected ZNAP variables. |
sarxarray/_io.py |
Refactors SNAP ZNAP reading/stacking logic and adds a helper to normalize each archive before concatenation. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
|
Hi @Simon-van-Diepen , I finished my implementations. These changes are only for As we discussed the flexible data var names still need to be implemented. |
Simon-van-Diepen
left a comment
There was a problem hiding this comment.
See comments below, it's mostly related to creating a more generic implementation that the original facilitated but this version does not. It would work for TU Delft-specific processing but likely not for others
|
|
||
| # SNAP Znap file data variable names | ||
| ZNAP_DATA_VAR_MOTHER = ["longitude", "latitude", "elevation"] | ||
| ZNAP_DATA_VAR_DAUGHTERS = ["i", "q", "h2ph"] |
There was a problem hiding this comment.
This implementation does not allow for any other layers to be present. h2ph is afaik a TU Delft-specific layer, and given that sarxarray is not TU Delft-specific I don't think we should limit it to just h2ph
There was a problem hiding this comment.
Also, ZNAP_DATA_VAR_DAUGHTERS is not used?
| "product_scene_raster_height" : 84, | ||
| "time_coverage_start" : "2023-03-31T05:50:35.812250Z", | ||
| "time_coverage_end" : "2023-03-31T05:50:35.983086Z", | ||
| "time_coverage_start" : "2023-03-19T05:50:35.812250Z", |
There was a problem hiding this comment.
I looked into this a little bit. In our GPT graphs we read both the metadata from the mother and the daughter, where the daughter is read in second. Apparently this means that SNAP assigns all metadata based on the mother instead of the daughter. I think the cleanest way would be to read this epoch from the metadata since I don't know if we can assume these attributes to be always present.
I don't know what the best way forward is: either changing our SNAP parser so that the metadata of the daughter is present, or simply adding an extra argument to from_snap_dataset with the corresponding dates. Let's discuss
|
|
||
| # Regular expressions for reading SNAP Zarr file (ZNAP) datalayers. | ||
| RE_PATTERNS_SNAP_DATALAYER = { | ||
| "i": r"^i_(VV|VH|HH|HV)_\d{1,2}[A-Za-z]{3}\d{4}$", # e.g., i_VV_19Mar2023 |
There was a problem hiding this comment.
I don't know if these regex patterns are always true. Specifically, I am doubting the case of multiple polarisations. Though I don't know what exactly the format would be then (i_VV_HH_19Mar2023 or i_VVVH_19Mar2023)?
| RE_PATTERNS_SNAP_DATALAYER = { | ||
| "i": r"^i_(VV|VH|HH|HV)_\d{1,2}[A-Za-z]{3}\d{4}$", # e.g., i_VV_19Mar2023 | ||
| "q": r"^q_(VV|VH|HH|HV)_\d{1,2}[A-Za-z]{3}\d{4}$", # e.g., q_VV_19Mar2023 | ||
| "h2ph": r"^h2ph_(VV|VH|HH|HV)_\d{1,2}[A-Za-z]{3}\d{4}$", # e.g., h2ph_VV_19Mar2023 |
There was a problem hiding this comment.
This should be more generic. It is possible that other users may have a similar layer for all daughters which cannot be detected now sicne we only detect h2ph
| for layer in list(data.data_vars): | ||
| for key, pattern in RE_PATTERNS_SNAP_DATALAYER.items(): | ||
| if re.match(pattern, layer): | ||
| data = data.rename({layer: key}) |
There was a problem hiding this comment.
Here we only rename the hardcoded layers i, q, h2ph, latitude, longitude, and elevation. In my original implementation I scanned the current datalayer name for presence of polarisation and a date instead of hardcoding it, and stripped them when found. This means that it is not limited to these six layers, and can be used by anyone in a generic sense. I am in favour of reverting to a more generic implementation to facilitate other users as well
| data, is_mother = _read_one_znap_archive(file) | ||
|
|
||
| # Get current epoch | ||
| cur_epoch = data.attrs["time_coverage_start"] |
There was a problem hiding this comment.
We need to discuss this. I agree my original implementation of retrieving this from the data layer names is not ideal, but currently this metadata field always corresponds to the mother, so I do not think we can use this. There are three questions here:
- Can we assume this field is always present?
- If so, can we assume that this field always refers to the daughter image?
- If both are yes, is this just a TU Delft-specific config related to our processing infrastructure, or is this generically true?
We could also consider the user to define the epochs of each znap archive as a second variable or a dictionary-like structure. Let's discuss
There was a problem hiding this comment.
Discussion with @rogerkuou : we cannot assume this field to be correct by default since our best attempt at a sarxarray workflow apparently produces incorrect metadata in this field. Let's therefore use the second best option: in the metadata there is a field called:
"elements" : [ {
"name" : "sources",
"attributes" : [ {
"name" : "sourceProduct",
"readOnly" : false,
"data" : {
"type" : "ascii",
"elems" : "product:S1A_IW_SLC__1SDV_20230506T055033_20230506T055100_048409_05D2A0_F5DE"
}
} ]
The timestamp in the product file is always correct since this is configured by ESA. Let's read it from there
| (metadata["number_of_lines"], metadata["number_of_pixels"], 1) | ||
| data_mother = data[ZNAP_DATA_VAR_MOTHER] | ||
| data = data.drop_vars(ZNAP_DATA_VAR_MOTHER) | ||
| data = data.assign( |
There was a problem hiding this comment.
This should be generic, not just h2ph. I propose to move this outside the loop when we can be sure that data is fully formed, and we can simply add all the missing fields present in data as zero layers
Still need to be implemented:
h2ph). This can be done by having an input arg forfrom_snap_dataset