Skip to content

Minor updates to OffloadPC#5164

Open
dsroberts wants to merge 12 commits into
firedrakeproject:mainfrom
dsroberts:dsroberts/minor-offload-fixes
Open

Minor updates to OffloadPC#5164
dsroberts wants to merge 12 commits into
firedrakeproject:mainfrom
dsroberts:dsroberts/minor-offload-fixes

Conversation

@dsroberts

Copy link
Copy Markdown
Contributor

Description

Two minor updates to OffloadPC based on continued local testing. The first change is preventing certain types of of PETSc matrix from being offloaded. When implicit matrices are converted to aijcusparse matrices, PETSc constructs a dense matrix row-by-row then converts that to a sparse matrix. Since PETSc can transparently manage host-device transfers when using non-aijcusparse matrix types on cuda vectors, it is far more efficient for PETSc to offload as needed than to do a full implicit to dense to sparse conversion. I'm not sure this is a complete list of implicit matrix types, but these are the types I run into with our testing.

The second change is in the construction of the device matrices. If the out kwarg is not specified or set to None in Mat.convert(), the conversion happens in place meaning e.g. both P and P_device are now the same aijcusparse matrix. This was probably not picked up earlier as an aijcusparse matrix is an aij matrix with a managed device buffer under the hood, so silently converting an aij matrix probably does not affect any other part of a job, but it might cause issues with other matrix types.

There is a check to make sure the silent conversion hasn't happened, but I don't know how to test for the implicit -> dense -> sparse conversion. The right thing to do I think would be to check that A is A_dev if A.getType() is one of the 'do not offload' types, but I don't know how to retrieve that information out of nested PCs.

@dsroberts dsroberts added the ci:cuda Run the test suite with a CUDA-enabled build label Jun 11, 2026
Comment thread firedrake/preconditioners/offload.py Outdated
Co-authored-by: Josh Hope-Collins <jhc.jss@gmail.com>
Comment thread firedrake/preconditioners/offload.py Outdated
Comment thread firedrake/preconditioners/offload.py Outdated
Comment thread tests/firedrake/offload/test_poisson_offloading_pc.py Outdated
Comment thread firedrake/preconditioners/offload.py
dsroberts and others added 2 commits June 25, 2026 15:48
Add comments describing the no-offload matrix types. Add a test to
ensure value-only copying is happening correctly. Add start of test to
ensure A is not being offloaded if it is a no-offload type matrix.
@dsroberts

dsroberts commented Jun 25, 2026

Copy link
Copy Markdown
Contributor Author

I've added another fix into this PR that copies P and A to P_dev and A_dev respectively by values only in OffloadPC.update(). We found that in parallel some configurations were diverging after a few timesteps. I think this was due to PETSc seeing the copied P_dev as a new matrix and performing its own assembly, and when it did that it lost the connectivity information into the halos. I have adapted the 04-burgers.ipynb demo into a test that runs until it reaches a steady state. When this test is run on 2 processes, the ("cg", "gamg") variant fails in 3 timesteps if a full copy of P is performed. With the value-only copy the solution reaches a steady state and the test passes.

Comment thread firedrake/preconditioners/offload.py Outdated
if A_dev.handle != P_dev.handle:
A.copy(A_dev)
# Perform a value-only copy
P.copy(P_dev, True)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think making it a kwarg would be more readable

Comment thread tests/firedrake/offload/test_advection_offload.py Outdated
Comment thread tests/firedrake/offload/test_advection_offload.py Outdated
Comment thread tests/firedrake/offload/test_advection_offload.py Outdated
# Zero pressure at outlow at x = 1
bc2 = DirichletBC(W[1], 0.0, 2)

bcs = bc0 + [bc1, bc2]

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not critical, but this seems a bit weird to have bc0 be a list and bc1 and bc2 not be.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copied straight from here

Comment on lines +79 to +83
# ksp0 = solver.snes.ksp.pc.getFieldSplitSchurGetSubKSP()[0]
# sub_pc = ksp0.pc.???
# offload_python_context = sub_pc.getPythonContext()
# assert offload_python_context.pc.P.type == "seqaijcusparse"
# assert offload_python_context.pc.A.type == "python"

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

remove?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm afraid I've not explained the issue well. This test returns the same values regardless of whether A is offloaded or not. What we're testing for here is whether the A operator has been offloaded at all. It makes no difference at this scale, but when you begin to scale up (~100k DOF) the matfree -> dense -> sparse conversion can take minutes, as opposed to microseconds for an aij -> aijcusparse conversion, so its critically important that we make sure this isn't happening.

This is where I need help. ksp0 = solver.snes.ksp.pc.getFieldSplitSchurGetSubKSP()[0].pc gives you the firedrake.AssembledPC, what I don't know is how to get the to the underlying firedrake.OffloadPC. Once we have that, we can get the python context and then extract the A and P operators from that. If you know what

sub_pc = ksp0.pc.???

should be in order to do that, that would be very helpful.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah I see now. I also don't know how to traverse the solver stack programmatically like that. @JHopeCollins do you know?

I am on leave currently but could dig into this next week if needed.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

firedrake.AssembledPC itself just creates a PETSc.PC, and you have specified that PC to be python type with the OffloadPC as the python context.
So looking at the solver parameters we need to walk down the solver stack in this order (where you had the first few steps already):

solver -> snes -> ksp -> pcfieldsplit -> A00 ksp -> pc -> python assembled pc context -> pc -> python offload pc context -> pc -> operators

you can do something like:

solver.snes.ksp.pc.getFieldSplitSchurGetSubKSP()[0].pc.getPythonContext().pc.getPythonContext().pc.getOperators()

Or, more readably, using your ksp0:

assembled_ctx = ksp0.pc.getPythonContext()
offload_ctx = assembled_ctx.pc.getPythonContext()
A, P = offload_ctx.pc.getOperators()

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, I was missing getPythonContext()! I was going through every get... function in petsc4py.PC trying to figure out which one would get the 'subPC' or something along those lines. Thank you @JHopeCollins.

@JHopeCollins JHopeCollins Jun 25, 2026

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No problem!

For future reference, a python context is no longer a petsc4py thing, its a user-defined class with a bunch of callbacks (here Firedrake counts as a user), e.g. AssembledPC, OffloadPC etc, which are just (non-petsc4py) python classes.

So once you hit AssembledPC you're back to looking at Firedrake code not petsc4py, until of course you hit another petsc4py object.

Comment thread tests/firedrake/offload/test_no_offload_A.py Outdated
dsroberts and others added 5 commits June 25, 2026 20:35
Co-authored-by: Connor Ward <c.ward20@imperial.ac.uk>
Co-authored-by: Connor Ward <c.ward20@imperial.ac.uk>
Co-authored-by: Connor Ward <c.ward20@imperial.ac.uk>
Co-authored-by: Connor Ward <c.ward20@imperial.ac.uk>
Comment on lines +61 to +65
"ksp_monitor_true_residual": None,
"ksp_converged_reason": None,
"ksp_type": "cg",
"ksp_rtol": 1e-5,
"ksp_max_it": 1000,

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this really take this many iterations? Seems like too many for the test suite

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, sorry, left over from G-ADOPT. Will reduce to 50

}
},
"fieldsplit_1": {
"pc_type": "none",

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this the right thing to do here? Isn't this just solving a zero matrix because PETSc will default to building the Schur complement from a11, which for this form is just 0?

If it isn't doing that, then how many iterations does this take? Doesn't the ksp_type here default to gmres?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I took this from from the stokes_mini regression test, I only cared what was going on with fieldsplit_0 so left the fieldsplit_1 settings unchanged. The ksp_type does default to gmres and I see

     Linear firedrake_3_fieldsplit_1_ solve converged due to CONVERGED_RTOL iterations 130

when run with ksp_converged_reason: None added to the fieldsplit_1 solver options. Setting it back to 'fieldsplit_schur_fact_type': 'diag', as per stokes_mini, the iteration count drops to 125 and it fails the assertion at the end. Come to think of it, the assertion at the end doesn't matter, all we need to know is that A was not offloaded, so I'll just end the test after those asserts.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you don't care about the solution then you can add ksp_type: preonly to this split

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

ci:cuda Run the test suite with a CUDA-enabled build

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants