io: add connection backoff #3191
#include <fluent-bit/flb_http_client.h>

/* Increase backoff time of an upstream */
void flb_io_backoff_upstream(struct flb_upstream *u)
Maybe declare static instead of adding to header?
@@ -342,6 +393,10 @@ static FLB_INLINE ssize_t net_io_read_async(struct flb_coro *co,
int flb_io_net_write(struct flb_upstream_conn *u_conn, const void *data,
Maybe also handle backoff in flb_io_net_read?
edsiper left a comment:
Thanks for this contribution! Minor changes are requested.
    },

    {
     FLB_CONFIG_MAP_TIME, "net.initial_backoff", "0s",
Can you rename the property to something like net.backoff_init?
    },

    {
     FLB_CONFIG_MAP_TIME, "net.max_backoff", "0s",
same as before: net.backoff_max
    struct mk_list _head;

    /* Backoff state. */
    time_t next_attempt_time;
please prefix the variable with backoff_...
    /* Backoff state. */
    time_t next_attempt_time;
    int last_backoff_seconds;
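To make the role of the two state fields concrete, here is a minimal sketch of how such per-upstream state could gate connection attempts. It is not the code from this PR: the struct name and the three helper functions are hypothetical, and doubling the delay after each failure is an assumed growth rule. Only the field names and the idea of an initial and a maximum delay come from the diff and the review comments.

```c
#include <time.h>

/* Hypothetical stand-in for the backoff fields shown in the diff. */
struct backoff_state {
    time_t next_attempt_time;    /* earliest time a new connect may start */
    int last_backoff_seconds;    /* delay applied after the last failure, 0 = none */
};

/*
 * Record a failed connection attempt. The delay starts at init_seconds,
 * doubles on every further failure (assumed growth rule) and is capped
 * at max_seconds. init_seconds <= 0 means backoff is disabled.
 */
static void backoff_increase(struct backoff_state *s, time_t now,
                             int init_seconds, int max_seconds)
{
    int delay;

    if (init_seconds <= 0) {
        return;
    }
    if (max_seconds < init_seconds) {
        max_seconds = init_seconds;
    }

    if (s->last_backoff_seconds <= 0) {
        delay = init_seconds;
    }
    else if (s->last_backoff_seconds > max_seconds / 2) {
        delay = max_seconds;    /* cap reached, also avoids int overflow */
    }
    else {
        delay = s->last_backoff_seconds * 2;
    }

    s->last_backoff_seconds = delay;
    s->next_attempt_time = now + delay;
}

/* Clear the state after a successful connection. */
static void backoff_reset(struct backoff_state *s)
{
    s->next_attempt_time = 0;
    s->last_backoff_seconds = 0;
}

/* Seconds left before the next attempt is allowed, 0 if it may start now. */
static int backoff_remaining(const struct backoff_state *s, time_t now)
{
    if (s->next_attempt_time > now) {
        return (int) (s->next_attempt_time - now);
    }
    return 0;
}
```

A caller would check backoff_remaining() before opening a socket, skip the attempt while the result is non-zero, call backoff_increase() when a connect fails and backoff_reset() once it succeeds.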
Signed-off-by: Alexander Kabakaev <kabakaev@gmail.com>
@edsiper, thanks for quick review! The suggested changes are implemented. PTAL

@kabakaev can you pls fix the conflicts so we can do final review/merge?

Hi @kabakaev, can you please review the requested changes?

@kabakaev would you mind resolving the conflicts here?

@kabakaev as mentioned on the docs PR fluent/fluent-bit-docs#491, I closed this in favor of the new docs PR fluent/fluent-bit-docs#2590 which is waiting for this code PR to merge.

In normal operation, fluent-bit reuses TCP connections, hence new messages are flushed without sending a TCP SYN.
But if an output connection cannot be established, then each flb_io_net_write() call will trigger connection setup and will send a series of TCP SYN packets (one per thread?).
The actual issue is described in #3103.
We observed this issue when hundreds of fluent-bit agents tried to send logs via forward to a set of receiving fluent-bits, which were all down due to config error. The receiving FLB was hosted behind an openstack load balancer, a Linux stateful firewall and a traefik ingress controller.
Apart from high load, the flood of SYN packets may exhaust the connection tracking table, impacting the whole network infrastructure.
Fixes #3103.
This PR is inspired by the gRPC backoff implementation.
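For a rough feel of what such a scheme buys, the sketch below counts how many connection attempts fall into an outage when the delay between attempts grows by a multiplier up to a cap, which is the general shape of the gRPC connection backoff (gRPC additionally applies jitter, omitted here to keep the result deterministic). The function name and the multiplier parameter are assumptions for illustration and are not taken from this PR.

```c
/*
 * Count connection attempts made during an outage of outage_seconds.
 * The first attempt happens at t = 0. After each failure the caller
 * waits `delay` seconds, then the delay is multiplied by `multiplier`
 * and capped at max_delay. Returns -1 on invalid parameters.
 */
static int count_attempts(double outage_seconds, double init_delay,
                          double max_delay, double multiplier)
{
    double now = 0.0;
    double delay = init_delay;
    int attempts = 0;

    if (init_delay <= 0.0 || max_delay < init_delay || multiplier < 1.0) {
        return -1;
    }

    while (now < outage_seconds) {
        attempts++;              /* this attempt fails, the outage is ongoing */
        now += delay;            /* wait before trying again */
        delay *= multiplier;     /* grow the delay ... */
        if (delay > max_delay) {
            delay = max_delay;   /* ... up to the configured cap */
        }
    }
    return attempts;
}
```

With an initial delay of 1 second, a cap of 60 seconds and an assumed multiplier of 2, count_attempts(300.0, 1.0, 60.0, 2.0) yields 10 attempts over a 5-minute outage (at 0, 1, 3, 7, 15, 31, 63, 123, 183 and 243 seconds), while a fixed 1-second retry (multiplier 1) yields 300.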
Backoff is disabled by default.
If enabled, backoff will limit the number of TCP SYN packets during an output destination outage (raw data):

This chart shows the rate of TCP SYN packets. The data is collected by tcpdump as described in the How to test section below.
Testing
Example of backoff configuration is given below.
Valgrind output is uploaded to my gist.
Documentation
Documentation for this feature is submitted as docs PR fluent/fluent-bit-docs#491.
How to test
Compile this version:
Collect SYN packets without backoff
Simulate connection timeout and run tcpdump on a separate console:
and start fluent-bit without backoff settings:

    timeout -s SIGKILL 5m \
      bin/fluent-bit -vv \
      -i dummy -p 'rate=1000000' \
      -o forward://127.0.0.1:24224 -p 'retry_limit=1' \
      2>&1 | tee run5m_backoff0.log
    # Fluent Bit v1.8.0
    # ...
    # [2021/03/08 18:27:32] [ warn] [engine] failed to flush chunk '869680-1615224452.505695198.flb', retry in 6 seconds: task_id=189, input=dummy.0 > output=forward.0 (out_id=0)
    # Killed

Collect SYN packets with initial backoff of 1 second
Simulate connection timeout and run tcpdump on a separate console:
and start fluent-bit with backoff settings:

    timeout -s SIGKILL 5m \
      bin/fluent-bit -vv \
      -i dummy -p 'rate=1000000' \
      -o forward://127.0.0.1:24224 -p 'retry_limit=1' -p 'net.backoff_init=1' -p 'net.backoff_max=60' \
      2>&1 | tee run5m_backoff1.log
    # Fluent Bit v1.8.0
    # ...
    # [2021/03/08 18:27:31] [debug] [upstream] skipping connection to 127.0.0.1:24224 because of connection backoff for another 28 seconds
    # [2021/03/08 18:27:32] [ warn] [engine] failed to flush chunk '869680-1615224452.505695198.flb', retry in 6 seconds: task_id=189, input=dummy.0 > output=forward.0 (out_id=0)
    # Killed

Fluent Bit is licensed under Apache 2.0, by submitting this pull request I understand that this code will be released under the terms of that license.
Alexander Kabakaev alexander.kabakaev@daimler.com, Daimler TSS GmbH, imprint