Experiment: ResumableParser #994

Draft
byroot wants to merge 2 commits into ruby:master from byroot:resumable-parser

byroot (Member) commented Jun 5, 2026

Fix: #983

Numerous known issues TODO:

  • Performance: This new feature shouldn't significantly degrade classic JSON.parse. Right now twitter.json is 7-10% slower, which is not OK. We might need to duplicate the parsing loop.
  • object_start_cursor recorded in the frame becomes invalid if the buffer string is reallocated or spilled.
  • The buffer needs to be shrunk sometimes.
  • A lot more testing is needed.
  • It's unclear what to do with top-level numbers (and perhaps true/false/null).
  • The API is anything but final.
    • I'd like to be able to "pop" the value, so we don't needlessly keep a reference to it.
    • Then the methods need to be documented.
  • It would be worth trying to make json_parse_any exception-free.
    • Right now EOF errors have been eliminated in favor of returning false.
    • We could try to do the same with syntax errors.
    • But then we would need to rb_protect when calling back into Ruby or other unsafe APIs, so perhaps it's best to just accept it.
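
The object_start_cursor item is the usual hazard of keeping a raw pointer into a buffer that can grow. One possible way around it, sketched here in plain C rather than taken from the PR, is to record an offset from the buffer start and resolve it only when needed. The names growbuf, growbuf_append, frame_t and frame_object_start are hypothetical and exist only for this illustration.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical growable byte buffer (stands in for the parser's buffer string). */
typedef struct {
    char *data;
    size_t len;
    size_t cap;
} growbuf;

/* Hypothetical frame: it stores an offset, not a char * into the buffer. */
typedef struct {
    size_t object_start_offset;
} frame_t;

/* Append bytes, growing the buffer as needed. Returns 0 on success, -1 on
 * allocation failure. realloc may move the storage, so any raw pointer into
 * b->data taken before this call must be treated as invalid afterwards. */
static int growbuf_append(growbuf *b, const char *src, size_t n)
{
    if (b->len + n > b->cap) {
        size_t new_cap = b->cap ? b->cap : 16;
        while (new_cap < b->len + n) new_cap *= 2;
        char *p = realloc(b->data, new_cap);
        if (!p) return -1;
        b->data = p;
        b->cap = new_cap;
    }
    memcpy(b->data + b->len, src, n);
    b->len += n;
    return 0;
}

/* Resolve the frame's offset against the buffer's current storage. */
static const char *frame_object_start(const growbuf *b, const frame_t *f)
{
    assert(f->object_start_offset <= b->len);
    return b->data + f->object_start_offset;
}
```

The offset stays meaningful across reallocations as long as bytes before it are never dropped; shrinking the buffer from the front would additionally require adjusting the stored offsets.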

byroot force-pushed the resumable-parser branch 2 times, most recently from 163ad0a to 497df78, June 5, 2026 15:04
byroot added 2 commits June 5, 2026 17:35
Fix: ruby#983

Numerous known issues TODO:

  - `object_start_cursor` recorded in frame becomes invalid if the buffer string
    is reallocated or spilled.
  - The buffer needs to be shrunk sometimes.
  - A lot more testing is needed.
  - Unclear what to do with top-level numbers (and perhaps true/false/null)
  - API is anything but final
    - I'd like to be able to "pop" the value, so we don't uselessly keep a
      reference on it.
    - Then methods need to be documented.
  - It would be worth trying to make `json_parse_any` exception free.
    - Right now EOF errors have been eliminated in favor of returning
      `false`.
    - We could try to do the same with syntax errors.
    - But then we need to `rb_protect` when calling back into Ruby
      or other unsafe APIs, so perhaps it's best to just accept it.
byroot force-pushed the resumable-parser branch from 497df78 to 6162ba8, June 5, 2026 15:36
byroot added a commit that referenced this pull request Jun 5, 2026
Extracted from: #994

Modern compilers shouldn't have a problem computing `strlen` at
compile time and generating the same code.
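
The extracted commit itself is not shown here, so the following is only a sketch of the general idea, assuming the change swaps a sizeof-style literal length for a plain strlen call. The macro and function names are hypothetical. Both variants give the same length for a string literal, and optimizing compilers commonly fold strlen on a literal to a constant.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/* Two ways to get the length of a string literal.
 * sizeof counts the trailing NUL, hence the -1. */
#define LITERAL_LEN_SIZEOF(lit) (sizeof(lit) - 1)
#define LITERAL_LEN_STRLEN(lit) (strlen(lit))

/* Hypothetical helper: does [cursor, end) start with the expected bytes? */
static bool starts_with(const char *cursor, const char *end,
                        const char *expected, size_t expected_len)
{
    assert(cursor <= end);
    return (size_t)(end - cursor) >= expected_len
        && memcmp(cursor, expected, expected_len) == 0;
}

/* With a literal argument, strlen("...") is typically evaluated at compile
 * time when optimizations are on, so both macros are expected to compile
 * to equivalent code. */
#define STARTS_WITH_SIZEOF(cursor, end, lit) \
    starts_with((cursor), (end), (lit), LITERAL_LEN_SIZEOF(lit))
#define STARTS_WITH_STRLEN(cursor, end, lit) \
    starts_with((cursor), (end), (lit), LITERAL_LEN_STRLEN(lit))
```

Whether the generated code really is identical depends on the compiler and optimization level; comparing the assembly output is the way to confirm it for a given toolchain.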
matzbot pushed a commit to ruby/ruby that referenced this pull request Jun 5, 2026
Extracted from: ruby/json#994

Modern compilers shouldn't have a problem computing `strlen` at
compile time and generating the same code.

ruby/json@b07f74bd73
Comment on lines +229 to +231
if (*handle) {
    RB_OBJ_WRITTEN(*handle, Qundef, value);
}
Member Author

This seems to account for most of the performance regression on twitter.json.

I think that instead of making rvalue_stack WB protected, we could either not embed it or have a secondary, non-protected object whose only job is to mark it.

kou (Member) commented Jun 5, 2026

Great!

Here are some random notes:

A MessagePack-like API (splitting parsing into appending new data, parsing the buffer, and getting the parsed data) is good.

MessagePack also uses feed, but feed (which reads as the parser feeding data) may be a bit strange.

parser.consume(data), parser.refill(data) or something similar may be better. I used "consume" when I implemented a resumable parser in Apache Arrow ( apache/arrow#6804 ), but that one uses a callback-style API where "consume" both fills a buffer and parses it. With this API, "consume" may be a bit strange too.
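
To make the three-step split concrete (append data, parse the buffer, read the result), here is a toy sketch in plain C. It is not the PR's parser: it only tracks '[' and ']' nesting and ignores strings, numbers and errors, and every name in it (toy_parser, toy_feed, toy_parse, toy_value_length) is hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>
#include <string.h>

/* Toy resumable scanner: it only tracks '[' / ']' nesting. */
typedef struct {
    char *buf;      /* all bytes fed so far */
    size_t len;     /* number of valid bytes in buf */
    size_t cursor;  /* offset of the next unparsed byte */
    int depth;      /* current '[' nesting depth */
    bool done;      /* true once a top-level array has been closed */
} toy_parser;

/* Step 1: append a chunk. Returns false on allocation failure. */
static bool toy_feed(toy_parser *p, const char *chunk, size_t n)
{
    char *grown = realloc(p->buf, p->len + n + 1);
    if (!grown) return false;
    memcpy(grown + p->len, chunk, n);
    p->buf = grown;
    p->len += n;
    return true;
}

/* Step 2: parse as far as the buffer allows. Returns true when a complete
 * top-level array has been seen, false when more input is needed. */
static bool toy_parse(toy_parser *p)
{
    while (!p->done && p->cursor < p->len) {
        char c = p->buf[p->cursor++];
        if (c == '[') {
            p->depth++;
        } else if (c == ']' && p->depth > 0) {
            p->depth--;
            if (p->depth == 0) p->done = true;
        }
    }
    return p->done;
}

/* Step 3: read the result. In this toy it is just the number of bytes
 * consumed up to and including the closing ']'. */
static size_t toy_value_length(const toy_parser *p)
{
    assert(p->done);
    return p->cursor;
}
```

Running out of input is reported by toy_parse returning false rather than by an error, which mirrors the "EOF returns false" behavior described in the PR description.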

How about having #value return the data parsed so far even when #parse returns false, something like the following?

diff --git a/ext/json/ext/parser/parser.c b/ext/json/ext/parser/parser.c
index 749d594..f0731ac 100644
--- a/ext/json/ext/parser/parser.c
+++ b/ext/json/ext/parser/parser.c
@@ -2254,9 +2254,8 @@ static VALUE cResumableParser_parse(VALUE self)
 static VALUE cResumableParser_value(VALUE self)
 {
     JSON_ResumableParser *parser = cResumableParser_get(self);
-    json_frame *frame = json_frame_stack_peek(&parser->frames);
 
-    if (frame->phase == JSON_PHASE_DONE) {
+    if (parser->state.value_stack->head > 0) {
         return *rvalue_stack_peek(parser->state.value_stack, 1);
     } else {
         rb_raise(rb_eArgError, "no ready value"); // TODO: Figure out the best exception and message

@tompng shared a use case for a resumable parser:

  • It may be useful for processing generative AI API responses
  • The response is returned as a stream
  • An application wants to display the response before it is complete

For example:

  1. Response: {"message": "This is a response. (not completed)
  2. App: Show This is a response
  3. Response: More messages. (not completed)
  4. App: Append More messages. to its view
  5. ...

The above diff doesn't satisfy this use case, but we can use it for simpler cases such as [1,. We can get [1] before we have the rest of the data (something like , 2]).

If we also provide an API that returns the unprocessed data, something like the following, we may be able to cover the generative AI API response use case:

diff --git a/ext/json/ext/parser/parser.c b/ext/json/ext/parser/parser.c
index 749d594..d39ba11 100644
--- a/ext/json/ext/parser/parser.c
+++ b/ext/json/ext/parser/parser.c
@@ -2263,6 +2263,15 @@ static VALUE cResumableParser_value(VALUE self)
     }
 }
 
+static VALUE cResumableParser_rest(VALUE self)
+{
+    JSON_ResumableParser *parser = cResumableParser_get(self);
+
+    return rb_str_substr(parser->buffer,
+                         parser->state.cursor - parser->state.start,
+                         parser->state.end - parser->state.cursor);
+}
+
 void Init_parser(void)
 {
 #ifdef HAVE_RB_EXT_RACTOR_SAFE
@@ -2289,6 +2298,7 @@ void Init_parser(void)
     rb_define_method(cResumableParser, "feed", cResumableParser_feed, 1);
     rb_define_method(cResumableParser, "parse", cResumableParser_parse, 0);
     rb_define_method(cResumableParser, "value", cResumableParser_value, 0);
+    rb_define_method(cResumableParser, "rest", cResumableParser_rest, 0);
 
     CNaN = rb_const_get(mJSON, rb_intern("NaN"));
     rb_gc_register_mark_object(CNaN);
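
The two numeric arguments passed to rb_str_substr in the diff above are plain pointer differences. As a standalone sketch of that arithmetic, with no Ruby C API involved, the following uses hypothetical stand-ins (scan_state, span, unparsed_span) for the parser state. Where the real parser leaves its cursor in the middle of a string is an assumption here, not something the diff shows.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for the parser state's three pointers. */
typedef struct {
    const char *start;   /* first byte of the buffer */
    const char *cursor;  /* next byte to parse */
    const char *end;     /* one past the last byte */
} scan_state;

typedef struct {
    size_t offset;  /* where the unparsed tail begins, relative to start */
    size_t length;  /* how many unparsed bytes remain */
} span;

/* Same arithmetic as the offset and length arguments in the diff above. */
static span unparsed_span(const scan_state *s)
{
    assert(s->start <= s->cursor && s->cursor <= s->end);
    span r;
    r.offset = (size_t)(s->cursor - s->start);
    r.length = (size_t)(s->end - s->cursor);
    return r;
}
```

For the streaming example, if the buffer holds {"message": "This is and the cursor sits just after the opening quote of the string, the span covers the 7 bytes of This is.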

Development

Successfully merging this pull request may close these issues.

Add support for parsing chunked data