This repository was archived by the owner on Nov 24, 2025. It is now read-only.

Fix t3c timeouts and retries#7007

Open
rob05c wants to merge 1 commit into apache:master from rob05c:t3c-fix-timeouts

Conversation

@rob05c rob05c (Member) commented Aug 4, 2022

Fixes bugs in t3c in both timeouts and retries to Traffic Ops. The Traffic Ops client's config values weren't being set correctly, resulting in no retries and timeouts far longer than configured.

This also adds info logging, which should make it easier to see if timeouts or retries somehow get the wrong value in the future.

Which Traffic Control components are affected by this PR?

  • Traffic Control Cache Config (t3c, formerly ORT)

What is the best way to verify this PR?

Run t3c and verify config is generated and applied as expected. Then run t3c against a Traffic Ops instance set up or configured to not respond, or to initially respond with errors, and verify that failures are retried and timeouts behave as configured. You can also partially verify by observing the new timeout and retry info logs.

If this is a bugfix, which Traffic Control versions contained the bug?

  • 6.0.0
  • 6.1.0
  • 7.0.0 (RC0)
  • 7.0.0 (RC1)
  • 7.0.0 (RC2)

PR submission checklist

  • [x] This PR has tests (no tests: the bugs are in the applications themselves configuring the library, not in any library, so tests would require integration tests with an extensive ability to configure bad and slow replies in Traffic Ops, which don't exist and would be a large development effort)
  • [x] This PR has documentation (no docs: no interface change)
  • This PR has a CHANGELOG.md entry
  • This PR DOES NOT FIX A SERIOUS SECURITY VULNERABILITY (see the Apache Software Foundation's security guidelines for details)

@rob05c rob05c added medium impact impacts a significant portion of a CDN, or has the potential to do so cache-config Cache config generation labels Aug 4, 2022
@rob05c rob05c force-pushed the t3c-fix-timeouts branch 3 times, most recently from 97122ba to 5915ec8 Compare August 4, 2022 21:25
@ocket8888 ocket8888 added the bug something isn't working as intended label Aug 5, 2022
Comment thread on CHANGELOG.md (outdated)
@rob05c rob05c force-pushed the t3c-fix-timeouts branch from 5915ec8 to 3ca6b9f Compare August 5, 2022 17:24
@rawlinp rawlinp (Contributor) commented Aug 5, 2022

So, exponential backoff retrying is certainly better than non-exponential, but it's really important for there to be jitter in the sleep intervals as well. Otherwise, clients will still all retry at the same time. For example, if a particular sleep window is 1 minute long, the client should sleep some random amount between 0 and 1 minute before retrying. This prevents all retries from happening right at that 1 minute mark.

Also, since we're running t3c every 1 - 2 minutes already without retries, I think it would make sense for the longest sleep window to be something like 4 or 8 minutes. We should also provide the ability to cap the window to some maximum like 10 minutes.

@rob05c rob05c (Member, Author) commented Aug 5, 2022

Otherwise, clients will all retry at the same time still

t3c runs should already be dispersed. The ATC cache-config itself doesn't do that; users need to make their own scripts do it, but I think the Ansible in the repo has an example, and it's not difficult.

Also, since we're running t3c every 1 - 2 minutes already without retries, I think it would make sense for the longest sleep window to be something like 4 or 8 minutes. We should also provide the ability to cap the window to some maximum like 10 minutes.

Users can run t3c as often as they want; the ATC cache-config doesn't really mandate or recommend a frequency. I think we generally suggest somewhere between 1 and 15 minutes, depending on the user's scale and TO resources. But I'm not sure even that recommendation is documented anywhere.

But you make a good point: the backoff should be configurable. I agree, but I was trying to keep this PR small and was planning to do that in a subsequent PR. I can add it to this PR if you feel it's necessary.

@rawlinp rawlinp (Contributor) commented Aug 5, 2022

t3c runs should already be dispersed. The ATC cache-config itself doesn't do that, users need to make their script do that, but I think the Ansible in the repo has an example, and it's not difficult.

I know that the runs themselves are already dispersed, but I'm saying that in the situation where retries are necessary (e.g. TODB is overloaded), the existing dispersion is not necessarily going to be enough. By adding random jitter, it's like expanding the dispersion window dynamically, as opposed to having to edit the cron job frequency in the middle of an outage scenario.

For example, let's say the dispersion window is big enough to where only 100 out of 3000 clients total are making requests to TO at any given point in time, but they all encounter an error and need to retry. Since they all started at the same time, they're all going to retry at the same time (even though it is only 100 out of 3000), which is not ideal. If TODB couldn't handle the load of 100 concurrent requests for some reason, then it still wouldn't be able to handle these 100 concurrent retries.

However, by giving the clients random jitter within that sleep interval, as long as the maximum sleep interval is longer than the normal cron frequency, we'd actually be making the requests more dispersed in an outage situation, without having to change cron frequencies, which is what we want. That's why I think making the "default" retries include up to a final 4-8 minute sleep, along with random jitter, can really help us out.


Labels

  • bug: something isn't working as intended
  • cache-config: Cache config generation
  • medium impact: impacts a significant portion of a CDN, or has the potential to do so


3 participants