cam6_4_166: CLUBB update and clubb_intr improvements #1441

Merged
cacraigucar merged 39 commits into ESCOMP:cam_development from huebleruwm:cam_dev_clubb_nov2025 on Apr 21, 2026

@huebleruwm huebleruwm commented Nov 24, 2025

There are two parts here - getting a new version of clubb in, and enabling the descending grid mode in clubb_intr. Both of these goals were split up over many commits. Fixes #1411

New CLUBB

The first commits are dedicated to getting a new version of CLUBB in. Because of clubb_intr diverging between cam_development, which had significant changes to enable GPUization, and UWM's branch, which had redimensioning changes, the merging was done manually.

The first 6 commits here include changes that can each be matched with a specific CLUBB release hash:
3d40d8c0e03a298ae3925564bc4db2f0df5e9437 works with clubb hash 1632cf12
e67fc4719fa21f8e449b024a0e3b6df2d0a7f8cb works with clubb hash 673beb05
187d7b536c2f36968fc7f5e1b9d1167e430ad03f works with clubb hash dc302b95
e4b71220b33aeaddb0afc68c9103555edccb59eb works with clubb hash dc302b95
703aca60ed1e0b6b24f2cd890c3a4497041d25b8 works with clubb hash d5957b30
4d9b1b8a528ca532d964c1799e1860e96e068a12 works with clubb hash d5957b30
(to use the clubb hash, go to src/physics/clubb and run the git checkout there)
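
The pairing above can be kept as a small lookup so the matching clubb hash is checked out for each CAM commit. This is a minimal Python sketch: the dictionary name and the helper `clubb_checkout_command` are illustrative and not part of CAM, and the helper only builds the command rather than running it.

```python
# CAM commit -> compatible clubb hash, copied from the list above.
CLUBB_HASH_FOR_CAM_COMMIT = {
    "3d40d8c0e03a298ae3925564bc4db2f0df5e9437": "1632cf12",
    "e67fc4719fa21f8e449b024a0e3b6df2d0a7f8cb": "673beb05",
    "187d7b536c2f36968fc7f5e1b9d1167e430ad03f": "dc302b95",
    "e4b71220b33aeaddb0afc68c9103555edccb59eb": "dc302b95",
    "703aca60ed1e0b6b24f2cd890c3a4497041d25b8": "d5957b30",
    "4d9b1b8a528ca532d964c1799e1860e96e068a12": "d5957b30",
}


def clubb_checkout_command(cam_commit, clubb_dir="src/physics/clubb"):
    """Return the git command (as an argument list) that checks out the
    clubb hash paired with cam_commit. Hypothetical helper; nothing is run."""
    clubb_hash = CLUBB_HASH_FOR_CAM_COMMIT[cam_commit]
    # `git -C <dir>` runs git as if it had been started in <dir>.
    return ["git", "-C", clubb_dir, "checkout", clubb_hash]
```

The returned list could be passed to `subprocess.run` from the CAM root directory, assuming the clubb submodule has already been populated there.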

These commits all have to do with just getting a new version of clubb in, so we need to ensure that at least 4d9b1b8a528ca532d964c1799e1860e96e068a12 is working correctly. The later commits have more complicated changes to clubb_intr, so if we find any problems with this branch, we should test commit 4d9b1b8a528ca532d964c1799e1860e96e068a12 as a next step.

clubb_intr improvements

The next commits after getting new clubb in are increasingly ambitious changes aimed at simplifying clubb_intr and reducing its runtime cost.

e60848b4ec4df90a3060ffd7f664fab42e847509 introduces a parameter, clubb_grid_dir, in clubb_intr that controls the direction of the grid used when calling advance_clubb_core. When compiling with -O0 and setting l_test_grid_generalization = .true. in src/clubb/src/CLUBB_core/model_flag.F90, the results are BFB in either grid direction.

dddff494966bf2bf4341fa4a7526b2f8b0f3d16e separates the flipping code from the copying code (copying cam sized arrays to clubb sized arrays). Should all be BFB.

055e53f70741531a58d0f7da788b824c76fef087 pushes flipping code inward until it's directly around the call to advance_clubb_core, and the way of controlling the grid direction has been changed to a flag, l_ascending_grid. This should all be BFB as well, and I tested with clubb_cloudtop_cooling, clubb_rainevap_turb, and do_clubb_mf all true to ensure BFBness in either ascending or descending mode. One caveat - the clubb_mf code assumes an ascending grid, so before calling it we flip the arrays, then flip the outputs to descending.
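
The clubb_mf caveat amounts to wrapping an ascending-only routine in a pair of flips. Below is a minimal Python sketch of that pattern. The toy `ascending_only_kernel` merely stands in for clubb_mf and is not the real routine, and all names are illustrative. Because reversing an array does no arithmetic, the wrapped call reproduces the ascending result exactly.

```python
def flip(column):
    """Reverse the vertical ordering of a column (pure reindexing)."""
    return column[::-1]


def ascending_only_kernel(profile):
    """Toy stand-in for a routine that assumes index 0 is the lowest level:
    difference between each level and the one above it, zero at the top."""
    n = len(profile)
    return [profile[k + 1] - profile[k] if k + 1 < n else 0.0 for k in range(n)]


def call_on_descending(profile_desc):
    """Flip a descending column to ascending, run the kernel, and flip the
    output back so the caller keeps working in descending order."""
    out_asc = ascending_only_kernel(flip(profile_desc))
    return flip(out_asc)
```

The same arithmetic is performed on the same values in both paths, which is why this kind of wrapper is expected to leave results bit-for-bit unchanged.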

2fb2ba9bd6b1ec5e5c60039f90d0c1020663d0c9 is a pretty safe intermediate commit, mainly moving stuff around in preparation for redimensioning.

bded8a561131e4dbbccad293f14226e5e8c0e856 makes some very safe redimensioning changes.

fce8e1b1b5e8d3e93232c2129be7671fae79db23 is the big one that redimensions most pbuf arrays, and uses them instead of the clubb sized local ones. This allows us to avoid the vast majority of the data copying, and delete the local versions of the clubb arrays.

The rest are safe and easy things mainly, or commits to make GPU code work.

Testing

I plan to run an ECT test to compare the cam_development head I started with to the head of this branch. Answers are expected to change slightly due to the ghost point removal in clubb, so I think it unlikely that this passes, but if it does that would be fantastic, and might be the only testing we really need.

If that first ECT test doesn't pass, then I will assume that the difference is due to the ghost point removal (or other small bit changing commits in clubb over the past year), and rely on @adamrher to check if the results are good.

The biggest concern is if the answer changes are acceptable, and the only real differences expected are from the new version of CLUBB. If the answers from this branch look bad, we should go back to commit 4d9b1b8a528ca532d964c1799e1860e96e068a12 and check the answers from that, since it only includes the new version of CLUBB, and NO unnecessary clubb_intr changes. If the answers still look bad in that commit, then we have a few more we can step back through to try to figure out which commit introduces the differences. If the hypothetical problem is present in the first commit (3d40d8c0e03a298ae3925564bc4db2f0df5e9437), then the problem is harder, because that includes (pretty much only) the removal of the ghost point in clubb, which is expected to change answers, but hopefully not significantly.
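
The fallback plan above is a walk backwards through the CLUBB-update commits until the answers look acceptable. Here is a sketch in Python, assuming a hypothetical `answers_look_good` check supplied by whoever evaluates the run (for example, a scientific review of the output). The function and its name are illustrative only.

```python
def first_bad_commit(commits_oldest_first, answers_look_good):
    """Step back from the newest commit and return the oldest commit whose
    answers still look bad, or None if the newest commit is already good.
    Assumes that once a commit looks good, all older commits do too."""
    first_bad = None
    for commit in reversed(commits_oldest_first):
        if answers_look_good(commit):
            break
        first_bad = commit
    return first_bad
```

If this returns the first commit in the list (3d40d8c0...), that corresponds to the harder case described above, where the difference comes from the ghost point removal itself.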

Again, if that first ECT test fails, then I can still run another ECT test between the version where clubb is up to date (4d9b1b8a528ca532d964c1799e1860e96e068a12) and the head of this branch. The changes between that commit and the head may be slightly bit changing when not compiled with -O0, but definitely shouldn't be answer changing. If this ECT test fails, then I've made a mistake in the changes meant to avoid the flipping/copying, and I'll have to step back through and figure out what bad thing I did.

Some other tests I've run along the way help confirm at least some of these changes:

  • I've ensured BFBness between ascending and descending modes since they were introduced in commit e60848b4ec4df90a3060ffd7f664fab42e847509
  • The ERP test passes
  • Commit 4d9b1b8a528ca532d964c1799e1860e96e068a12 should be working on the GPU, but later commits are untested

Next steps

I left a number of things messy for now, such as comments and gptl timers. Once we confirm these changes, I'd like to go through and make some of those nicer as a final step.

Performance

In addition to the clubb_intr changes improving performance, we should use this as an opportunity to try some flags that should be significantly faster:

  • clubb_penta_solve_method = 2 uses our custom pentadiagonal matrix solvers, which should be significantly faster than lapack and should pass an ECT test
  • clubb_tridiag_solve_method = 2 uses our custom tridiagonal solver, which should also be faster and pass an ECT test
  • clubb_fill_holes_type = 4 uses a different hole filling algorithm that in theory is just all around better and faster than our current one, and I suspect it will pass ECT, but I have yet to test it
  • clubb_l_call_pdf_closure_twice = .false. will avoid a pdf_closure call (which is significant in cost) and reduce the memory footprint, and I think it will have minimal effect (based on my visual analysis of what that affects in the code), but it is the most likely to break the ECT test

Diagnostic field Caveat

From an email from @huebleruwm:

Also, just in case this comes up, if you run with settings that cause top_lev to be greater than 1 (e.g. by changing trop_cloud_top_press from 1.D2 to 1.D3), then there are another handful of diagnostic differences (RTM_CLUBB, RTP2_CLUBB, THLM_CLUBB, UP2_CLUBB, WP2_CLUBB, THLP2_CLUBB, UM_CLUBB, VM_CLUBB), but only above top_lev. Previously, these variables were either calculated or left uninitialized above top_lev. Since CLUBB is not supposed to be active up there, they should be (and now are) zero.
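
The behaviour described in the email can be pictured with a single column. This minimal Python sketch assumes CAM's convention that level 1 is the model top, so that levels 1 to top_lev-1 lie above top_lev. `zero_above_top_lev` is a hypothetical helper, not code from clubb_intr.

```python
def zero_above_top_lev(column, top_lev):
    """Return a copy of column with every level above top_lev set to zero.
    column[0] is the model top (level 1); top_lev is 1-based."""
    n_above = top_lev - 1
    return [0.0] * n_above + list(column[n_above:])
```

With top_lev = 1 nothing lies above the active region and the column is returned unchanged, which matches the statement that the diagnostic differences only appear when top_lev is greater than 1.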

@cacraigucar cacraigucar marked this pull request as draft November 24, 2025 20:36
@cacraigucar
Collaborator

@huebleruwm - After reading through this text, I moved this PR to draft. Once it is ready for the SEs to review and process it for bringing into CAM, please move it out of draft.

…grid, and flipping sections have been consolidated to directly around the time stepping loop. Next is to push them inward until the only clubb grid calculations are done inside advance_clubb_core.
…s descending BFB (except wp3_ta but that's internal and doesn't matter) - this is even true with clubb_cloudtop_cooling, clubb_rainevap_turb, and do_clubb_mf.
…ending mode, even though there is no notion of ascending in it. There must be some bug with the flipper perhaps? Otherwise an internal interaction between clubb and silhs somehow, it's very upsetting, but I think it's time to give up and come back to it someday.
…oning yet. All changes should be BFB, but a handful of field outputs are different above top_lev because I made it zero everything above top_lev. Among the fields that differ: some (RTM_CLUBB RTP2_CLUBB THLM_CLUBB UP2_CLUBB WP2_CLUBB) were being initialized to a tolerance above top_lev, and others (THLP2_CLUBB RTP2_CLUBB UM_CLUBB VM_CLUBB) were never initialized or set above top_lev, so we were outputting random garbage.
…ootprint and data copying steps in clubb_intr, mainly by switching pbuf variables (which are mostly clubb_inouts) to have nzm or nzt dimensions, allowing them to be passed into clubb directly, making the copying/flipping step unnecessary. This was tested with a top_lev > 1, so it should be well tested. There were some above top_lev interactions found, which seem erroneous, so I've marked them with a TODO and a little explanation.
@huebleruwm huebleruwm force-pushed the cam_dev_clubb_nov2025 branch from ed11ffc to b2b3232 Compare November 26, 2025 10:11
@huebleruwm
Author

@adamrher I have done a number of ECT tests and it was pretty much as expected. I made a baseline from ESCOMP/cam_development commit 4c480c8e8c7e3138d8d5cb3118f635c71886ecf7 (which has cam6_4_128 as the last tag). Commit 4d9b1b8a528ca532d964c1799e1860e96e068a12 from this branch (which just brings in the newest version of clubb) fails the ECT test when compared to the cam_development baseline. So we will need some help confirming the changes as acceptable.

To check the more "dangerous" changes to clubb_intr, I made another baseline from commit 4d9b1b8a528ca532d964c1799e1860e96e068a12 (with new clubb), then compared that to the head of this branch, and that passed the ECT test. I also made sure to run with settings that have top_lev = 3 so that we can test the redimensioning changes, which I did by setting trop_cloud_top_press to 1.D3 in the namelist instead of 1.D2.

Other things I tested:

  • When running with -O0 and l_test_grid_generalization = .true. (clubb internal parameter), then the code is BFB between the ascending and descending modes in clubb_intr (changed by setting l_ascending_grid)
  • Also when running with -O0, the code is BFB between 4d9b1b8a528ca532d964c1799e1860e96e068a12 and the head of this branch, outside of the few output variables that differ above top_lev; see the comment from this commit for more detail.
  • The ERP test passes
  • The GPU code runs to completion. I haven't checked the results with an ECT test yet, but that's the only thing left on the list to do, and any further changes to make that work (if needed) won't affect CPU results.

So I think the clubb_intr changes are pretty well tested and confirmed. But the changes from new clubb don't pass the ECT test, and we'll need to confirm them as acceptable.

Note

Also, I found a number of examples where clubb_intr is interacting above top_lev, and I'm not sure exactly what we should do about that. For now I've left the errors in place for BFBness, but I've made little code comments explaining the issue (all contain a TODO). There's also a TODO with how we're setting concld_pbuf, which seems suspicious, but not a clear bug. @adamrher could you take a look at these "TODO" comments and see if you have any insight please. I added what I think the "correct" lines should be, and what the errors were that I left in for BFBness.

@huebleruwm
Author

I ran some ECT tests with the performance options:

  • clubb_tridiag_solve_method = 2 passes the ECT test and should be significantly faster
  • clubb_penta_solve_method = 2 passes the ECT test and should be significantly faster, but requires clubb_release commit bba14c86a748da7993af72c056a9f6505cf43f1b or newer, since I just fixed a bug in it
  • clubb_fill_holes_type = 4 FAILS the ECT test, which was expected since it fills holes in the same conservative way, but over completely different ranges, making it answer changing. This should also be decently faster, so it would be nice to test if possible.

@adamrher
Collaborator

adamrher commented Dec 5, 2025

@huebleruwm I checked out this PR branch, and while the clubb_intr.F90 changes seem to be there, the .gitmodules are pointing to the clubb externals currently on cam_development:

[submodule "clubb"]
        path = src/physics/clubb
        url = https://github.com/larson-group/clubb_release
        fxrequired = AlwaysRequired
        fxsparse = ../.clubb_sparse_checkout
        fxtag = clubb_4ncar_20240605_73d60f6_gpufixes_posinf
        fxDONOTUSEurl = https://github.com/larson-group/clubb_release

This should be pointing to a different tag / hash containing support for descending / ascending options, no?

@huebleruwm
Author

> This should be pointing to a different tag / hash containing support for descending / ascending options, no?

Yes. In hindsight, it would've made much more sense to update that with the right hash each commit, rather than just including it in the commit comment.

I've just been going to src/physics/clubb and running git checkout master to get the newest version. Or git checkout hash when testing the first handful of commits:
cam 3d40d8c0e03a298ae3925564bc4db2f0df5e9437 works with clubb hash 1632cf12
cam e67fc4719fa21f8e449b024a0e3b6df2d0a7f8cb works with clubb hash 673beb05
cam 187d7b536c2f36968fc7f5e1b9d1167e430ad03f works with clubb hash dc302b95
cam e4b71220b33aeaddb0afc68c9103555edccb59eb works with clubb hash dc302b95
cam 703aca60ed1e0b6b24f2cd890c3a4497041d25b8 works with clubb hash d5957b30 or newer
cam 4d9b1b8a528ca532d964c1799e1860e96e068a12 works with clubb hash d5957b30 or newer

@cacraigucar cacraigucar added the answer changing label Feb 11, 2026
Collaborator

@adamrher adamrher left a comment

@huebleruwm I made some more progress on the code review, but I'll have to finish it up another day. I really like how you've added the _pbuf modifier to all the pbuf vars. It will make this code a lot easier to understand going forward!

Comment thread src/physics/cam/clubb_intr.F90
Comment thread src/physics/cam/clubb_intr.F90
Comment thread src/physics/cam/clubb_intr.F90 Outdated
Collaborator

@adamrher adamrher left a comment

Publishing my morning review now so it doesn't get lost. Will pick up with a separate review this afternoon.

Comment thread src/physics/cam/clubb_intr.F90
Comment thread src/physics/cam/clubb_intr.F90
Comment thread src/physics/cam/clubb_intr.F90 Outdated
Comment thread src/physics/cam/clubb_intr.F90 Outdated
Comment thread src/physics/cam/clubb_intr.F90 Outdated
Comment thread src/physics/cam/clubb_intr.F90 Outdated
Comment thread src/physics/cam/clubb_intr.F90 Outdated
Comment thread src/physics/cam/clubb_intr.F90 Outdated
Collaborator

@adamrher adamrher left a comment

@huebleruwm I've finished my review of your PR. I have several comments that should be addressed. I do think you found a bug in our code for top_lev > 1 in the energy fixer and the applied tendencies, so thanks for catching that!

Comment thread src/physics/cam/clubb_intr.F90 Outdated
Comment thread src/physics/cam/clubb_intr.F90 Outdated
Comment thread src/physics/cam/clubb_intr.F90 Outdated
Comment thread src/physics/cam/clubb_intr.F90 Outdated
Comment thread src/physics/cam/clubb_intr.F90 Outdated
Comment thread src/physics/cam/clubb_intr.F90 Outdated
Comment thread src/physics/cam/clubb_intr.F90 Outdated
Comment thread src/physics/cam/clubb_intr.F90 Outdated
Comment thread src/physics/cam/clubb_intr.F90 Outdated
Comment thread src/physics/cam/clubb_intr.F90 Outdated
@huebleruwm
Author

huebleruwm commented Apr 14, 2026

@adamrher I think I've addressed all the PR comments now and made the changes over the last 2 commits. They were all pretty small changes - moving things around, renaming stuff, comment tweaks, etc. I ran the usual ECT tests on CPU and GPU and both are passing.

Unless I've missed something, all that's left to do is deal with the p_sfc issue.

Comment thread src/physics/cam/clubb_intr.F90 Outdated
Comment thread src/physics/cam/clubb_intr.F90
@cacraigucar
Collaborator

The regression tests are still running, but one of the jobs is failing with what I believe is a valid failure (and not a machine hiccup). The failing test is a SILHS test.

See: /glade/derecho/scratch/cacraig/aux_cam_intel_20260420135315/SMS_Ln9.f19_f19_mt232.F2000climo.derecho_intel.cam-silhs.GC.aux_cam_intel_20260420135315/run

The error messages are:

dec1462.hsn.de.hpc.ucar.edu 127:  Error:  unrecognized variable in vars_zm:  wprtp_sicl
dec1462.hsn.de.hpc.ucar.edu 127:  Error:  unrecognized variable in vars_zm:  wpthlp_sicl
dec1462.hsn.de.hpc.ucar.edu 127:  Error:  unrecognized variable in vars_zm:  shear_sqd

The traceback is:

dec1462.hsn.de.hpc.ucar.edu 3: cesm.exe           0000000002BD3AA6  shr_abort_abort           110  shr_abort_mod.F90
dec1462.hsn.de.hpc.ucar.edu 3: cesm.exe           00000000014226C4  stats_init_clubb         5575  clubb_intr.F90
dec1462.hsn.de.hpc.ucar.edu 3: cesm.exe           0000000001398067  clubb_ini_cam            1866  clubb_intr.F90
dec1462.hsn.de.hpc.ucar.edu 3: cesm.exe           000000000097BCA2  phys_init                 939  physpkg.F90
dec1462.hsn.de.hpc.ucar.edu 3: cesm.exe           0000000000784344  cam_init                  196  cam_comp.F90
dec1462.hsn.de.hpc.ucar.edu 3: cesm.exe           00000000007671FC  initializerealize         642  atm_comp_nuopc.F90

@adamrher
Collaborator

adamrher commented Apr 20, 2026

> The regression tests are still running, but one of the jobs is failing with what I believe is a valid failure (and not a machine hiccup). The failing test is a SILHS test.

All I can tell is that the unrecognized variables e.g., wprtp_sicl were removed from the new clubb external in this PR, specifically clubb/src/CLUBB_core/stats_zm_module.F90. @huebleruwm is it possible we aren't calling the new stats_zm code correctly in cam/subcol_SILHS.F90?

@huebleruwm
Author

> The regression tests are still running, but one of the jobs is failing with what I believe is a valid failure (and not a machine hiccup). The failing test is a SILHS test.

> All I can tell is that the unrecognized variables e.g., wprtp_sicl were removed from the new clubb external in this PR, specifically clubb/src/CLUBB_core/stats_zm_module.F90. @huebleruwm is it possible we aren't calling the new stats_zm code correctly in cam/subcol_SILHS.F90?

It sounds like those were removed from this new version of clubb. I found references to them still in the cam_development branch in these two files:

  • cime_config/testdefs/testmods_dirs/cam/outfrq9s_mg3/user_nl_cam
  • cime_config/testdefs/testmods_dirs/cam/silhs/user_nl_cam

So my guess is that some test uses these files to try to set up the stats variables, but causes an error since the ones with _sicl in the name aren't defined internally. I'm pretty sure we could fix it by just removing the _sicl terms from the stats lists defined in those files.
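
The proposed fix is to drop the names that the new clubb no longer defines from the stats lists. Here is a sketch in Python using the three names from the error messages. The helper `drop_unrecognized` and any list passed to it are made up for illustration and do not reproduce the contents of either user_nl_cam file.

```python
# Names reported as unrecognized in vars_zm by the failing SILHS test.
UNRECOGNIZED_VARS_ZM = {"wprtp_sicl", "wpthlp_sicl", "shear_sqd"}


def drop_unrecognized(vars_zm, unrecognized=UNRECOGNIZED_VARS_ZM):
    """Return vars_zm without the unrecognized names, preserving order."""
    return [name for name in vars_zm if name not in unrecognized]
```

Note that the error log also lists shear_sqd, which does not carry the _sicl suffix, so filtering on the reported names rather than on the suffix covers all three.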

@cacraigucar
Collaborator

@huebleruwm - Thank you for pointing out those two files. I will remove the three variables it is complaining about and run the test again.

@cacraigucar
Collaborator

@huebleruwm - that fix allowed the SILHS test to pass - thank you!

All of the regression tests passed (with answer changes for runs which use CLUBB as expected)

Awaiting okay from the CESM group later today for the answer changes

@cacraigucar cacraigucar merged commit 1d8a28b into ESCOMP:cam_development Apr 21, 2026
2 checks passed
@cacraigucar cacraigucar changed the title CLUBB update and clubb_intr improvements cam6_4_166: CLUBB update and clubb_intr improvements Apr 21, 2026