Read 64-bit multi-page SAS catalog files and skip format default values #370

Open
hpoettker wants to merge 3 commits into WizardMac:dev from hpoettker:sas-formats


@hpoettker hpoettker commented Apr 29, 2026

Summary

This PR addresses three issues regarding the reading of SAS catalog files:

  • it fixes XLSR records beyond the first page of 64-bit files not being found because of a wrong offset,
  • it skips the "informats" in the file, which currently lead to parsing errors,
  • it skips the default values of custom formats even when the default value is not written last.

I've tested the PR locally with 64-bit files on Linux and 32-bit files on Windows.

Offset for XLSR record on later pages

The current implementation always uses the page offset 16 when looking for XLSR records on later pages. But the offset 16 is only correct for 32-bit files. For 64-bit files, it is 32.

As a result, XLSR records on later pages are currently missed in 64-bit files, and any formats referenced from those records are missed as well.
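
The offset rule can be sketched as follows. This is only an illustration, not the code in the PR: the helper names `xlsr_offset_on_page` and `page_has_xlsr` are hypothetical, and the sketch assumes that a record is recognized by the four ASCII bytes `XLSR` at the page offset.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: offset of the XLSR record within a later page.
 * The values 16 (32-bit) and 32 (64-bit) are taken from the description above. */
static size_t xlsr_offset_on_page(int is_64bit) {
    return is_64bit ? 32 : 16;
}

/* Returns 1 if the page holds the "XLSR" marker at the expected offset.
 * Assumption: the record starts with the four ASCII bytes "XLSR". */
static int page_has_xlsr(const unsigned char *page, size_t page_size, int is_64bit) {
    size_t offset = xlsr_offset_on_page(is_64bit);
    /* Reject pages that are too short to hold the marker at this offset. */
    if (page_size < 4 || offset > page_size - 4)
        return 0;
    return memcmp(page + offset, "XLSR", 4) == 0;
}
```

With a fixed offset of 16, a 64-bit page whose marker sits at offset 32 would never match, which is the failure mode described above.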

Informats in catalog files

The current implementation is only prepared for formats, which map from numbers to either other numbers or strings. This leads to parsing errors when "informats" are encountered, which also map from strings.

The PR proposes to just skip informats, which can be identified in the catalog file by names starting with @.
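
A minimal sketch of that check, assuming the entry name is available as a NUL-terminated string. The function name `is_informat_name` is hypothetical and not part of ReadStat:

```c
#include <stddef.h>

/* Hypothetical helper: per the description above, informats can be
 * identified by a name that starts with '@'. */
static int is_informat_name(const char *name) {
    return name != NULL && name[0] == '@';
}
```

A caller would test each catalog entry name with this predicate and move on to the next entry when it returns 1, instead of trying to parse the entry as a format.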

I might contribute the code for the informat parsing in a follow-up PR. But I think that would require a discussion beforehand on how to integrate informats into the existing API. They are not value labels that should be used for outward presentation but rather mappings that should be applied to input data to derive an internal representation. I don't know whether such a concept exists for SPSS or Stata files.

Default values in custom formats

The PR fixes an issue that occurs when reading custom formats from SAS catalog files whose default value is not saved as the last value in the physical catalog file.

Currently, ReadStat skips the default value correctly in a format created like this:

proc format;
  value myfmt
    1 = 'Yes'
    2 = 'No'
    other = 'Unknown';
run;

because SAS writes the labels in the order in which they are encountered.

But for a format created like this:

proc format;
  value myfmt
    other = 'Unknown'
    1 = 'Yes'
    2 = 'No';
run;

which logically creates the same format, ReadStat currently reads the format to map

  • from 1 to Unknown,
  • from 2 to Yes,
  • and skips the mapping to No.
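
The misalignment can be illustrated with a simplified model. It is a sketch under stated assumptions, not the actual parsing code: labels are modeled as an array of strings in physical file order, and the position of the default label (`default_index`) is assumed to be known already. How the PR locates the default value in the file is not shown here, and the function name is hypothetical.

```c
#include <stddef.h>

/* Hypothetical model: `labels` holds all labels in physical file order,
 * one of which (at default_index) belongs to the `other` clause.
 * Copies every non-default label to `out` and returns how many were copied. */
static size_t collect_labels_skipping_default(const char *const *labels,
        size_t n_labels, size_t default_index,
        const char **out, size_t out_cap) {
    size_t n_out = 0;
    for (size_t i = 0; i < n_labels; i++) {
        if (i == default_index)
            continue;           /* skip the default wherever it is stored */
        if (n_out == out_cap)
            break;              /* never write past the output array */
        out[n_out++] = labels[i];
    }
    return n_out;
}
```

For the physical order `Unknown`, `Yes`, `No` with `default_index` 0, the output is `Yes`, `No`, which pairs correctly with the keys 1 and 2. Dropping only the last label instead yields `Unknown`, `Yes`, which is the wrong mapping listed above.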

It would be nice to also expose the default value through the API. But that would require a discussion on the API change beforehand, as the current handlers for value labels do not accept default values as far as I can tell. I don't know whether the concept of default value labels exists for SPSS or Stata files.


hpoettker commented Apr 29, 2026

Sorry for the spam from the failed fuzzing. I'll look into running the fuzzer locally and/or on push before opening PRs.

I probably introduced the problem by removing part of the bound on the counter i in pass 2 of sas7bcat_parse_value_labels, which was previously bounded by both label_count_used and label_count_capacity. With the PR, it is now bounded only by label_count_used, as that is larger by 1 than label_count_capacity for formats with a default value.

But I've added the missing checks to guard against buffer overflow now, and the fuzzing is successful.
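
The kind of guard meant here can be sketched generically. The record layout below (a 2-byte little-endian length followed by the payload) is purely hypothetical and is not the sas7bcat layout. The point is only that every read is checked against the remaining buffer, independently of the loop counter's bound.

```c
#include <stddef.h>

/* Walks up to max_records length-prefixed records in buf and returns how
 * many complete records were found. Hypothetical layout: 2-byte
 * little-endian length, then that many payload bytes. */
static size_t count_records_checked(const unsigned char *buf, size_t buf_len,
        size_t max_records) {
    size_t pos = 0;
    size_t i;
    for (i = 0; i < max_records; i++) {
        if (buf_len - pos < 2)
            break;              /* length field would overrun the buffer */
        size_t len = (size_t)buf[pos] | ((size_t)buf[pos + 1] << 8);
        pos += 2;
        if (len > buf_len - pos)
            break;              /* payload would overrun the buffer */
        pos += len;
    }
    return i;
}
```

Because `pos` never exceeds `buf_len`, the subtractions cannot wrap around, so a corrupted length or an overly large record count ends the loop instead of reading out of bounds.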
