Reduce seeks on file io #114

Merged: jcupitt merged 2 commits into ImagingDataCommons:main from rvause:main on Apr 28, 2026

@rvause
Contributor

rvause commented Apr 28, 2026

I noticed very slow read times when reading regions of DICOM files through openslide. The issue stayed mostly hidden until the code ran in an environment with a virtual filesystem (Kubernetes-managed containers), where every seek and read incurs a noticeable cost.

I tracked the issue down to libdicom's file IO behaviour. The helper tool included at the end of this description (dcm-visit.c) shows that the IO layer makes many more lseek calls than necessary.

I made two changes that work together to reduce the number of calls. dcm_io_seek_file now computes an absolute target and skips the seek entirely when the target lands within the buffered window. dcm_parse_encapsulated_frame now works in a single pass, using dcm_realloc to grow the allocated buffer.

Reading the first 50 frames in y-x order (rows in the outer loop, columns in the inner loop) and counting lseek calls with strace -e trace=lseek -f builddir/dcm-visit -n 50 file.dcm 2>&1 | grep 'lseek(' | wc -l gives 3428 calls.
Adding the -r flag to read in x-y order (columns in the outer loop, rows in the inner loop) gives 672386 calls for this particular file.

With this patch, both variants make the same 99 calls to read these 50 frames.

To demonstrate the overall improvement, vips tiffsave reads the entire full-resolution DICOM image. The timings below were run on an 8-core Linux VM using virtiofs.

Before:

vips tiffsave --vips-progress --tile CMU-1/DCM_0.dcm out.tiff
vips temp-1: 46000 x 32893 pixels, 8 threads, 128 x 128 tiles, 256 lines in buffer
vips temp-1: done in 320s

After:

vips tiffsave --vips-progress --tile CMU-1/DCM_0.dcm out.tiff
vips temp-1: 46000 x 32893 pixels, 8 threads, 128 x 128 tiles, 256 lines in buffer
vips temp-1: done in 20.6s

dcm-visit.c

#include <dicom/dicom.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

typedef int64_t i64;
typedef uint32_t u32;

static const char usage[] = "usage: dcm-visit [-hviwr] [-n FRAMES] FILE_PATH";
static const char TotalPixelMatrixColums[] = "TotalPixelMatrixColumns";
static const char TotalPixelMatrixRows[] = "TotalPixelMatrixRows";
static const char Columns[] = "Columns";
static const char Rows[] = "Rows";

static bool get_dataset_int(const DcmDataSet *ds, const char *tag_name,
                            i64 *out) {
  u32 tag = dcm_dict_tag_from_keyword(tag_name);
  DcmElement *el = dcm_dataset_get(NULL, ds, tag);
  return el && dcm_element_get_value_integer(NULL, el, 0, out);
}

int main(int argc, char *argv[]) {
  int c;
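  /* With the default of 0 the loop below stops after the first frame;
   * pass -n N to read N frames. */
  i64 n_frames = 0;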
  bool reverse = false;
  while ((c = dcm_getopt(argc, argv, "hviwn:r")) != -1) {
    switch (c) {
    case 'h':
      printf("%s\n", usage);
      return EXIT_SUCCESS;

    case 'v':
    case 'i':
      dcm_log_set_level(DCM_LOG_INFO);
      break;

    case 'w':
      dcm_log_set_level(DCM_LOG_WARNING);
      break;

    case 'n':
      n_frames = (i64)atoi(dcm_optarg);
      break;

    case 'r':
      reverse = true;
      break;

    case '#':
    default:
      return EXIT_FAILURE;
    }
  }

  DcmError *error = NULL;

  if (dcm_optind + 1 != argc) {
    fprintf(stderr, "%s\n", usage);
    return EXIT_FAILURE;
  }
  const char *input_file = argv[dcm_optind];

  dcm_log_info("Read filehandle '%s'", input_file);
  DcmFilehandle *filehandle =
      dcm_filehandle_create_from_file(&error, input_file);

  if (filehandle == NULL) {
    dcm_error_print(error);
    dcm_error_clear(&error);
    return EXIT_FAILURE;
  }

  const DcmDataSet *metadata =
      dcm_filehandle_get_metadata_subset(&error, filehandle);

  if (metadata == NULL) {
    dcm_error_print(error);
    dcm_error_clear(&error);
    dcm_filehandle_destroy(filehandle);
    return EXIT_FAILURE;
  }

  i64 frame_w, frame_h, width, height;
  if (!get_dataset_int(metadata, Rows, &frame_h) ||
      !get_dataset_int(metadata, Columns, &frame_w) ||
      !get_dataset_int(metadata, TotalPixelMatrixRows, &height) ||
      !get_dataset_int(metadata, TotalPixelMatrixColums, &width)) {
    fprintf(stderr, "Failed to determine image dimensions\n");
    dcm_filehandle_destroy(filehandle);
    return EXIT_FAILURE;
  }

  i64 tiles_x = (width / frame_w) + !!(width % frame_w);
  i64 tiles_y = (height / frame_h) + !!(height % frame_h);

  /* i64 values need a matching format; cast to long long for %lld */
  dcm_log_warning("tiles %lldx%lld", (long long)tiles_x, (long long)tiles_y);

  i64 max_a, max_b;

  if (!reverse) {
      max_a = tiles_y;
      max_b = tiles_x;
  } else {
      max_a = tiles_x;
      max_b = tiles_y;
  }

  u32 read_frames = 0;
  for (i64 a = 0; a < max_a; a++) {
    for (i64 b = 0; b < max_b; b++) {
      u32 num = 0;
      i64 x = reverse ? a : b;
      i64 y = reverse ? b : a;
      if (dcm_filehandle_get_frame_number(NULL, filehandle, x, y, &num)) {
        dcm_log_info("tile %lldx%lld = %u", (long long)x, (long long)y,
                     (unsigned)num);
        DcmFrame *frame = dcm_filehandle_read_frame(&error, filehandle, num);
        if (frame == NULL) {
          dcm_error_print(error);
          dcm_error_clear(&error);
          dcm_filehandle_destroy(filehandle);
          return EXIT_FAILURE;
        }
        dcm_frame_destroy(frame);
        read_frames++;
        if (read_frames >= n_frames) {
          goto end;
        }
      }
    }
  }
end:
  dcm_log_info("Cleaning up...");
  dcm_filehandle_destroy(filehandle);
  return EXIT_SUCCESS;
}

@jcupitt
Collaborator

jcupitt commented Apr 28, 2026

Hi @rvause, yes, libdicom assumes seek is (almost) free. I didn't realise it had such a significant cost on Kubernetes.

Thanks for the PR, I'll have a read.

Comment thread src/dicom-file.c
@jcupitt
Collaborator

jcupitt commented Apr 28, 2026

This is great @rvause, I feel embarrassed I missed such a large performance hole. I see a nice speedup on USB drives too. And your PR is very clean.

I usually try to avoid realloc, since some platforms have very poor implementations, but it's only called a few tens of thousands of times on my test images, which should be OK.
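As a rough illustration of how the number of realloc calls can be kept small when a buffer is grown in a single pass, here is a sketch that doubles the capacity whenever it runs out. It is not the code from this PR and makes no claim about how dcm_parse_encapsulated_frame sizes its buffer: `ByteBuf`, `bytebuf_append` and `bytebuf_free` are hypothetical names, plain `realloc` stands in for dcm_realloc, and the initial capacity of 64 bytes is an arbitrary assumption.

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical growable byte buffer, for illustration only.
 * This is NOT part of libdicom's API. */
typedef struct {
    uint8_t *data;
    size_t len;        /* bytes in use */
    size_t cap;        /* bytes allocated */
    unsigned reallocs; /* realloc() calls made, counted for the demo */
} ByteBuf;

/* Append n bytes, doubling the capacity when it is exhausted.
 * Returns 0 on success and -1 on failure; on failure the buffer
 * is left unchanged and still owned by the caller. */
int bytebuf_append(ByteBuf *b, const void *src, size_t n) {
    if (n > SIZE_MAX - b->len) return -1; /* length would overflow */
    size_t need = b->len + n;
    if (need > b->cap) {
        size_t cap = b->cap ? b->cap : 64;
        while (cap < need) {
            if (cap > SIZE_MAX / 2) { /* doubling would overflow */
                cap = need;
                break;
            }
            cap *= 2;
        }
        uint8_t *p = realloc(b->data, cap);
        if (p == NULL) return -1;
        b->data = p;
        b->cap = cap;
        b->reallocs++;
    }
    memcpy(b->data + b->len, src, n);
    b->len = need;
    return 0;
}

/* Release the storage and reset the buffer to its empty state. */
void bytebuf_free(ByteBuf *b) {
    free(b->data);
    memset(b, 0, sizeof *b);
}
```

With doubling, the number of realloc calls grows with the logarithm of the final size rather than with the number of appended pieces, which limits exposure to a slow realloc implementation.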

The only thing to add would be a line for the changelog, and a [rvause] credit. I could do this, but perhaps you'd like to?

@rvause
Contributor Author

rvause commented Apr 28, 2026

Thanks for taking the time to review this. I've added a changelog entry. Feel free to adjust the changelog entry if it could be better described.

@jcupitt
Collaborator

jcupitt commented Apr 28, 2026

That looks perfect! I'll merge.

You'll be pleased to hear that a new openslide release is coming up. I'll make a libdicom 1.3 with this change and some others, and it might be in users' hands in just a few weeks.

Thanks again for this nice work.

jcupitt merged commit 6266955 into ImagingDataCommons:main on Apr 28, 2026
6 checks passed
@rvause
Contributor Author

rvause commented Apr 28, 2026

Good to hear, will look out for that. I've been building against versions from main for both of these projects for a few months. The DICOM concatenation support has been tested at scale and is working well.

@opless

opless commented Apr 30, 2026

@rvause

Hi, I'm curious about the Kubernetes aspect of the issue.

Can you share which storage technology backed the volume the file was on?

I appreciate that some systems try to hide this from you, but it may help others avoid that particular filestore until the vendor of the volume driver fixes their horrible seek issue 😉
