Read 64-bit multi-page SAS catalog files and skip format default values#370
Open
hpoettker wants to merge 3 commits intoWizardMac:devfrom
Open
Read 64-bit multi-page SAS catalog files and skip format default values#370hpoettker wants to merge 3 commits intoWizardMac:devfrom
hpoettker wants to merge 3 commits intoWizardMac:devfrom
Conversation
Contributor
Author
|
Sorry for the spam from the failed fuzzing. I'll look into running the fuzzer locally and/or on push before opening PRs. I've probably opened up the problem by removing the bound on the counter But I've added the missing checks to guard against buffer overflow now, and the fuzzing is successful. |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
This PR addresses three issues regarding the reading of SAS catalog files:
XLSRrecords beyond the first page in 64 bit files are currently not found due to a wrong offset,I've tested the PR locally with 64 bit files on Linux and 32 bit files on Windows.
Offset for
XLSRrecord on later pagesThe current implementation always uses the page offset 16 when looking for
XLSRrecords on later pages. But the offset 16 is only correct for 32-bit files. For 64-bit files, it is 32.Currently, the
XLSRrecords on later pages are missed in 64-bit files, which leads to any formats that are referenced from those later records to be missed.Informats in catalog files
The current implementation is only prepared for formats, which map from numbers to either other numbers or strings. This leads to parsing errors when "informats" are encountered, which also map from strings.
The PR proposes to just skip informats, which can be identified in the catalog file by names starting with
@.I might contribute the code for the informat parsing in a follow-up PR. But I think that would require a discussion before-hand on how to integrate informats into the existing API. They are not value labels that should be used for outward presentation but rather mappings that should be used on input data to derive an internal representation. I don't know whether such a concept exists for SPSS or Stata files.
Default values in custom formats
The PR fixes an issue that occurs when reading custom formats from SAS catalog files whose default value is not saved as the last value in the physical catalog file.
Currently, ReadStat skips the default value correctly in a format created like this:
as SAS writes the labels in the order of encounter.
But for a format created like this:
which logically creates the same format, ReadStat currently reads the format to map
1toUnknown,2toYes,No.It would be nice to also expose the default value through the API. But that would require a discussion on the API change before-hand as the current handlers for value label do not accept default values as far as I can tell. I don't know whether the concept of default value labels exists for SPSS or Stata files.