[FLINK-38902][docs] Add user instructions and usage documentation for FLIP-487 #28003

Open
RocMarshal wants to merge 1 commit into apache:master from RocMarshal:FLINK-38902

@RocMarshal
Contributor

What is the purpose of the change

[FLINK-38902][docs] Add user instructions and usage documentation for FLIP-487

Brief change log

[FLINK-38902][docs] Add user instructions and usage documentation for FLIP-487

Verifying this change

N/A

Does this pull request potentially affect one of the following parts:

  • Dependencies (does it add or upgrade a dependency): (yes / no)
  • The public API, i.e., is any changed class annotated with @Public(Evolving): (yes / no)
  • The serializers: (yes / no / don't know)
  • The runtime per-record code paths (performance sensitive): (yes / no / don't know)
  • Anything that affects deployment or recovery: JobManager (and its components), Checkpointing, Kubernetes/Yarn, ZooKeeper: (yes / no / don't know)
  • The S3 file system connector: (yes / no / don't know)

Documentation

  • Does this pull request introduce a new feature? (yes / no)
  • If yes, how is the feature documented? (not applicable / docs / JavaDocs / not documented)

Was generative AI tooling used to co-author this PR?
  • Yes (please specify the tool below)

@RocMarshal
Contributor Author

Hi @XComp @spuru9 @och5351, could you help take a look?
Any input is appreciated.

@flinkbot
Collaborator

flinkbot commented Apr 23, 2026

CI report:

Bot commands
The @flinkbot bot supports the following commands:
  • `@flinkbot run azure`: re-run the last Azure build

Contributor

@spuru9 left a comment

Thanks for the PR. Added a few comments.

nit: Other than these, there are a few trivial punctuation issues, which do not need fixing here.

10 comment threads on docs/content/docs/deployment/elastic_scaling.md (Outdated)
@github-actions bot added the `community-reviewed` label (PR has been reviewed by the community.) on Apr 23, 2026

Contributor

@spuru9 left a comment

Can you apply the changes to the Chinese doc as well?

Comment thread docs/content/docs/deployment/elastic_scaling.md Outdated
@RocMarshal force-pushed the FLINK-38902 branch 2 times, most recently from efce38d to 6d2a7cc, on April 23, 2026 09:23

Contributor

@spuru9 left a comment

Sorry, I missed a minor inconsistency.

The rest LGTM.

Comment thread docs/content/docs/deployment/elastic_scaling.md Outdated
Comment thread docs/content.zh/docs/deployment/elastic_scaling.md Outdated

Contributor

@spuru9 left a comment

LGTM, Thanks.

Additionally, the page supports the display of detailed rescale information as outlined below:
- The basic information of the target rescale
- Rescale UUID: The unique ID in a rescale consists of 32 hexadecimal characters
- Attempt ID: The number ID of a rescale attempts that occurred under the same resource requirements

Contributor

I am not sure what this sentence means. This is one ID, but it talks of "a rescale attempts"; is it one attempt or more?

Contributor Author

Thanks @davidradl

The detailed description of the rescaleIdInfo is here.

https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=334760525#FLIP495:SupportAdaptiveSchedulerrecordandquerytherescalehistory-RescaleID&ResourceRequirementsrequest

Do you think we should include the picture from the linked page in this doc?

- `History`
This sub-page displays abbreviated information for the most recent `n` rescale records based on configuration.
Additionally, the page supports the display of detailed rescale information as outlined below:
- The basic information of the target rescale

Contributor

Do we need to say "target rescale"? Can we just say "rescale", or are there other rescale types?

Contributor Author

updated

- The basic information of the target rescale
- Rescale UUID: The unique ID in a rescale consists of 32 hexadecimal characters
- Attempt ID: The number ID of a rescale attempts that occurred under the same resource requirements
- Requirements ID: The unique ID of resource requirements consists of 32 hexadecimal characters

Contributor

Is this a UUID like the other one?

- Requirements ID: The unique ID of resource requirements consists of 32 hexadecimal characters
- Trigger Cause: The reason that triggered the target rescale
- Terminal State: The end state of the target rescale
- Terminated Reason: The reason for the completion or termination of the target rescale

Contributor

The name "Terminated Reason" implies a termination, but the wording then talks of completion and termination as two different things, implying that termination is an abnormal end.

Contributor Author

updated

- Terminal State: The end state of the target rescale
- Terminated Reason: The reason for the completion or termination of the target rescale
- Start Time: The start time of the target rescale
- Duration: Duration from the start of the rescale to its completion or until now

Contributor

I assume it is always the Duration from the start of the rescale. If the rescale is ongoing I guess the end is now.

Contributor Author

updated

- Terminated Reason: The reason for the completion or termination of the target rescale
- Start Time: The start time of the target rescale
- Duration: Duration from the start of the rescale to its completion or until now
- End Time: The end time of the target rescale

Contributor

I assume this is the same as Duration unless the rescale is ongoing

Contributor Author

Concise.
I have also tried a new description.

- Start Time: The start time of the target rescale
- Duration: Duration from the start of the rescale to its completion or until now
- End Time: The end time of the target rescale
- The basic attributes and rescale change per `Job Vertex`

Contributor

Can we have a reference to Job Vertex?

Contributor Author

Thanks @davidradl
I have considered this issue before. To be honest, JobVertex is such a fundamental term that I didn't add any external reference links.

Of course, if you think it's required, I'd be happy to add it.

- Duration: Duration from the start of the rescale to its completion or until now
- End Time: The end time of the target rescale
- The basic attributes and rescale change per `Job Vertex`
- ID: The unique ID of target JobVertex consists of 32 hexadecimal characters

Contributor

I assume JobVertex is the same as Job Vertex; I would suggest we be consistent.

Contributor Author

Yes, +1 from me.

- Acquired Parallelism: The parallelism of target vertex after the current rescale
- Sufficient Parallelism: The minimal parallelism of target vertex to run
- Desired Parallelism: The desired parallelism of the target vertex
- The basic attributes and rescale change per `Slot Sharing Group`

Contributor

I would say UUID - and define that once

Contributor Author

updated.

- Sufficient Slot: The minimal number of slots to deploy tasks in the rescale
- Request Profile: The request resource profile of the slot sharing group in the rescale
- Acquired Profile: The acquired resource profile of the slot sharing group in the rescale
- The internal `Scheduler State History` of `Adaptive Scheduler` within a rescale

Contributor

Are the states documented?

Contributor Author

These states are an abstract representation of the scheduler's internal state; therefore, they were omitted to avoid introducing unnecessary ambiguity.

- State: The scheduler state name
- Enter Time: The time to enter the state
- Leave Time: The time to leave the state
- Duration: Time spent in the state (Leave Time − Enter Time)

Contributor

Does this include the current state, which will not have an end time?

Contributor Author

@RocMarshal, Apr 23, 2026

No, this doesn't include the current state, which does not have an end time yet.

This is determined by the collection mechanism of the adaptive scheduler's states. Simply put, by the time a state is added to a rescale, that state already includes both its start and end times.

- Enter Time: The time to enter the state
- Leave Time: The time to leave the state
- Duration: Time spent in the state (Leave Time − Enter Time)
- Exception: The exception information about current rescale during the state

Contributor

What does "during the state" mean?

Contributor Author

Maybe "during" -> "within"?

This sub-page displays the total number of rescale events that have occurred since the job was launched,
along with the respective counts of failures and successes.
Additionally, it provides statistical summaries of the rescale history,
such as rescale duration statistics categorized by rescale status, including `Min`, `Max`, `Avg`, and `Pnn` metrics.

Contributor

What is `Pnn` here?

Contributor Author

updated.
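
For reference, `Pnn` in the quoted text presumably denotes the nn-th percentile of rescale durations (for example P50 or P95). The thread does not spell this out, so that reading is an assumption. A minimal Python sketch of such a summary, using the nearest-rank percentile method and hypothetical helper names (`percentile`, `summarize`) that are not part of Flink:

```python
def percentile(durations_ms, nn):
    """Nearest-rank percentile: the smallest sample such that at least
    nn percent of all samples are less than or equal to it.
    `nn` is an integer in the range 1..100 (hypothetical helper, not a Flink API)."""
    if not durations_ms:
        raise ValueError("at least one duration is required")
    if not 1 <= nn <= 100:
        raise ValueError("nn must be between 1 and 100")
    ordered = sorted(durations_ms)
    # Integer ceiling of nn * n / 100, which avoids floating-point rounding.
    rank = (nn * len(ordered) + 99) // 100
    return ordered[rank - 1]


def summarize(durations_ms):
    """Min / Max / Avg / Pnn summary over a list of rescale durations."""
    return {
        "Min": min(durations_ms),
        "Max": max(durations_ms),
        "Avg": sum(durations_ms) / len(durations_ms),
        "P50": percentile(durations_ms, 50),
        "P95": percentile(durations_ms, 95),
    }


# Example with five made-up durations in milliseconds.
summary = summarize([400, 100, 1000, 200, 300])
print(summary["P50"])  # 300
```

Other percentile definitions (for example, linear interpolation) give slightly different values on small samples; which one the Flink UI uses is not stated in this thread.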

… FLIP-487

Co-authored-by: spuru9 <sinhapurushottam911@gmail.com>

4 participants