Skip to content

Use cmdstan config and decouple stanfit objects from RunSet#851

Draft
amas0 wants to merge 9 commits intostan-dev:developfrom
amas0:use-save-cmdstan-config
Draft

Use cmdstan config and decouple stanfit objects from RunSet#851
amas0 wants to merge 9 commits intostan-dev:developfrom
amas0:use-save-cmdstan-config

Conversation

@amas0
Copy link
Copy Markdown
Collaborator

@amas0 amas0 commented Apr 12, 2026

Draft PR for initial progress on config parsing for #785 and implementing ideas discussed in #848.

@WardBrian I'd like you to take a look at this initial draft which implements pydantic-based parsing of the config json files and uses that to refactor CmdStanPathfinder to be built entirely from output files (and decouples it from the runset).

The changes I'd like you to look at are in stanfit/pathfinder.py and stanfit/metadata.py -- the others files are mostly just temporary patches to have the library in a functional state.

The CmdStanPathfinder class has been rewritten as a dataclass that defines a from_files(...) method as a constructor to build it from the output files. I also prefer the clarity that this gives to what attributes the class has.

Some areas where I'd appreciate thoughts:

  • I don't include any reference to the runset in the object itself. I think there are some arguments for having it as an optional attribute, but I opted for a clean break here.
  • I dropped from the repr any reference to the method_args, building directly from files made it seem that was no longer desirable.
  • I include stdout_file as an optional attribute. With no reference to the runset, it still seems like something that might be nice to have? But the stdout info is more like process diagnostic than an essential component of the stanfit?
  • save_csvfiles probably needs to be called something like save_output_files, but I haven't changed that yet.
  • In a world where we no longer need to parse any config out of the stan csv, this makes the InferenceMetadata object more or less only scan the header line for stan variables, I think. So it may be worth a think as to what that should look like?

If the shape of looks to be in the right direction, I'll proceed with doing this for the other stanfit objects. Following that, we should be able to really clean up what we have going on in stanfit/__init__.py and should be able to drop all the code that we have for parsing out the headers of the stan csvs.

Copy link
Copy Markdown
Member

@WardBrian WardBrian left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A few comments/questions, but overall I like the look of the code!

Responding to your comments in the PR:

  • Runset being left out of the result entirely makes sense. We will have to see how this feels when we tackle something with more moving parts like the MCMC class, but for now I agree.
  • Repr changes make sense.
  • Stdout file is good. We may want to also do the same thing with diagnostic files and profiling files, which gets us closer to just including the whole Runset again... maybe we have a small dataclass that is like a cut-down runset and stores the relevant files, can be re-used across all of them?
  • I agree, it makes sense to replace csv with output across the board, both in the save functions and the from_csv function. This would make sense going forward anyway, c.f. stan-dev/design-docs#58
  • Yep, totally agree that InferenceMetadata might want to take a hike when all is said and done. The header could be stored in each class, or in the cut-down-Runset idea ("FileInfo"?)

model_name: str
csv_file: str
config: PathfinderConfig
config_file: str | None = None
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this really optional? Doesn't look like it in from_files

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's optional in the sense that if you pass a config in directly, you don't need to reference the file itself. Outside of simplifying some testing, I'm not sure if it will come up much? The intended entrypoint in practice is always going to be the from_files.

metadata = InferenceMetadata.from_csv(csv_file)
return cls(
metadata=metadata,
csv_file=str(csv_file),
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The super-correct way to convert a str|os.PathLike to a str is os.fspath()

metadata=metadata,
csv_file=str(csv_file),
model_name=stan_config.model_name,
config=stan_config.method_config,
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's obviously nice that we don't have any downstream isinstance checks, but it is a bit of a bummer that we end up throwing away the extra information that is present in the StanConfig (e.g. model name, etc). Is there some trick to keep it all around, while maintaining that this is a Pathfinder-specific config?

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point. We could just store references to both stan_config and config? The top level StanConfig would still contain the inference config within it, but it would just be the same object. Storing both would keep the top level info and still let us have config be validated that the method is correct. Keeps the type checker happy and doesn't drop any info. Thoughts?

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That seems fine. I was wondering if pydantic had a more elegant way, but it's not like doing that actually costs any extra memory or anything

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We could do something fancy like make StanConfig a generic, which pydantic has good support for:

from typing import Generic, TypeVar

AnyMethodConfig = Annotated[                                                                                                                                               
    SampleConfig | OptimizeConfig | PathfinderConfig
    | LaplaceConfig | VariationalConfig | GeneratedQuantitiesConfig,                                                                                                       
    Discriminator("method"),                                                                                                                                               
]
                                                                                                                                                                             
MethodT = TypeVar("MethodT", bound=BaseModel, default=AnyMethodConfig)                                                                                                     
   
class StanConfig(BaseModel, Generic[MethodT]):                                                                                                                             
    model_config = ConfigDict(extra="allow")                                                                                                                               
   
    model_name: str                                                                                                                                                        
    stan_major_version: str
    stan_minor_version: str
    stan_patch_version: str

    method_config: MethodT

Then for example we could have:

@dataclass
class CmdStanPathfinder:
   # other fields ...
   config: StanConfig[PathfinderConfig]

And parsing like StanConfig[PathfinderConfig].model_validate(jsons) would validate it against the Pathfinder instance of the config. This would result in only one reference to the config but the type checker would know that it has the pathfinder configuration fields within method_config.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That seems like a good representation of what's actually happening under the hood

f'found {len(csvfiles)}'
)
csv_file = csvfiles[0]
config_file = os.path.splitext(csv_file)[0] + '_config.json'
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I assume this is temporary but it's worth putting a comment saying so, since this is kind of a nasty assumption

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah definitely temporary. I was planning on doing away with this function entirely once the full changes are in.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants