Skip to content

io: add connection backoff#3191

Open
kabakaev wants to merge 1 commit into
fluent:masterfrom
c445:connection-backoff
Open

io: add connection backoff#3191
kabakaev wants to merge 1 commit into
fluent:masterfrom
c445:connection-backoff

Conversation

@kabakaev
Copy link
Copy Markdown
Contributor

@kabakaev kabakaev commented Mar 9, 2021

In normal operation, fluent-bit reuses TCP connections, hence new messages are flushed without sending a TCP SYN.
But if an output connection cannot be established, then each flb_io_net_write() call will trigger connection setup and will send a series of TCP SYN packets (one per thread?).

The actual issue is described in #3103.

We observed this issue when hundreds of fluent-bit agents tried to send logs via forward to a set of receiving fluent-bits, which were all down due to config error. The receiving FLB was hosted behind an openstack load balancer, a Linux stateful firewall and a traefik ingress controller.

Apart from high load, the flood of SYN packets may exhaust the connection tracking table, impacting the whole network infrastructure.

Fixes #3103.

This PR is inspired by GRPC backoff implementation.

Backoff is disabled by default.

If enabled, backoff will limit the number of TCP SYN packets during an output destination outage (raw data):
image

This chart shows rate of TCP SYN packets. The data is collected by tcpdump as described in How to test section below.

Testing

Example of backoff configuration is given below.

Valgrind output is uploaded to my gist.

Documentation

Documentation for this feature is submitted as docs PR491.

How to test

Compile this version:

cd build
cmake -DFLB_RELEASE=On \
          -DFLB_TRACE=On \
          -DFLB_JEMALLOC=On \
          -DFLB_TLS=On \
          -DFLB_SHARED_LIB=Off \
          -DFLB_EXAMPLES=Off \
          -DFLB_HTTP_SERVER=On \
          -DFLB_IN_SYSTEMD=On ../ \
&& make -j$(getconf _NPROCESSORS_ONLN)

Collect SYN packets without backoff

Simulate connection timeout and run tcpdump on a separate console:

iptables -I INPUT -p tcp --dport 24224 -j DROP
timeout 6m tcpdump -lnn -i any dst host 127.0.0.1 and dst port 24224 and tcp[tcpflags] == tcp-syn | tee run5m_backoff0.tcpdump
# 1802 packets captured
iptables -D INPUT -p tcp --dport 24224 -j DROP

and start fluent-bit without backoff settings:

timeout -s SIGKILL 5m \
  bin/fluent-bit -vv \
    -i dummy -p 'rate=1000000' \
    -o forward://127.0.0.1:24224 -p 'retry_limit=1' \
  2>&1 | tee run5m_backoff0.log
# Fluent Bit v1.8.0
# ...
# [2021/03/08 18:27:32] [ warn] [engine] failed to flush chunk '869680-1615224452.505695198.flb', retry in 6 seconds: task_id=189, input=dummy.0 > output=forward.0 (out_id=0)
# Killed

Collect SYN packets with initial backoff of 1 second

Simulate connection timeout and run tcpdump on a separate console:

iptables -I INPUT -p tcp --dport 24224 -j DROP
timeout 6m tcpdump -lnn -i any dst host 127.0.0.1 and dst port 24224 and tcp[tcpflags] == tcp-syn | tee run5m_backoff1.tcpdump
# 360 packets captured
iptables -D INPUT -p tcp --dport 24224 -j DROP

and start fluent-bit with backoff settings:

timeout -s SIGKILL 5m \
  bin/fluent-bit -vv \
    -i dummy -p 'rate=1000000' \
    -o forward://127.0.0.1:24224 -p 'retry_limit=1' -p 'net.backoff_init=1' -p 'net.backoff_max=60' \
  2>&1 | tee run5m_backoff1.log
# Fluent Bit v1.8.0
# ...
# [2021/03/08 18:27:31] [debug] [upstream] skipping connection to 127.0.0.1:24224 because of connection backoff for another 28 seconds
# [2021/03/08 18:27:32] [ warn] [engine] failed to flush chunk '869680-1615224452.505695198.flb', retry in 6 seconds: task_id=189, input=dummy.0 > output=forward.0 (out_id=0)
# Killed

Fluent Bit is licensed under Apache 2.0, by submitting this pull request I understand that this code will be released under the terms of that license.

Alexander Kabakaev alexander.kabakaev@daimler.com, Daimler TSS GmbH, imprint

Comment thread include/fluent-bit/flb_io.h Outdated
Comment thread src/flb_io.c Outdated
Comment thread src/flb_io.c Outdated
Comment thread src/flb_upstream.c Outdated
@kabakaev kabakaev force-pushed the connection-backoff branch 2 times, most recently from 29b0523 to f449406 Compare March 15, 2021 08:48
@kabakaev kabakaev changed the title [io] add connection backoff io: add connection backoff Mar 15, 2021
Comment thread src/flb_io.c Outdated
#include <fluent-bit/flb_http_client.h>

/* Increase backoff time of an upstream */
void flb_io_backoff_upstream(struct flb_upstream *u)
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe declare static instead of adding to header?

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(Same with flb_io_in_backoff)

Comment thread src/flb_io.c
@@ -342,6 +393,10 @@ static FLB_INLINE ssize_t net_io_read_async(struct flb_coro *co,
int flb_io_net_write(struct flb_upstream_conn *u_conn, const void *data,
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe also handle backoff in flb_io_net_read?

@kabakaev kabakaev force-pushed the connection-backoff branch from f449406 to 4358cea Compare March 16, 2021 14:14
Copy link
Copy Markdown
Member

@edsiper edsiper left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks for this contribution!, minor changes are requested.

Comment thread src/flb_upstream.c Outdated
},

{
FLB_CONFIG_MAP_TIME, "net.initial_backoff", "0s",
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  • can you rename the property to something like net.backoff_init ?

Comment thread src/flb_upstream.c Outdated
},

{
FLB_CONFIG_MAP_TIME, "net.max_backoff", "0s",
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same as before: net.backoff_max

Comment thread include/fluent-bit/flb_upstream.h Outdated
struct mk_list _head;

/* Backoff state. */
time_t next_attempt_time;
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

please prefix the variable with backoff_...

Comment thread include/fluent-bit/flb_upstream.h Outdated

/* Backoff state. */
time_t next_attempt_time;
int last_backoff_seconds;
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do a prefix: backoff...

@edsiper edsiper self-assigned this Mar 20, 2021
@edsiper edsiper added the waiting-for-user Waiting for more information, tests or requested changes label Mar 20, 2021
@kabakaev kabakaev force-pushed the connection-backoff branch from 4358cea to 4a5e951 Compare March 22, 2021 09:44
Signed-off-by: Alexander Kabakaev <kabakaev@gmail.com>
@kabakaev kabakaev force-pushed the connection-backoff branch from 4a5e951 to b58cfc0 Compare March 22, 2021 16:14
@kabakaev
Copy link
Copy Markdown
Contributor Author

kabakaev commented Apr 7, 2021

minor changes are requested.

@edsiper, thanks for quick review! The suggested changes are implemented. PTAL

@edsiper
Copy link
Copy Markdown
Member

edsiper commented Dec 12, 2021

@kabakaev can you pls fix the conflicts so we can do final review/merge ?

@lecaros
Copy link
Copy Markdown
Contributor

lecaros commented Feb 11, 2022

Hi @kabakaev, can you please review the requested changes?
Thanks!

@eschabell
Copy link
Copy Markdown
Contributor

@kabakaev would you mind resolving the conflicts here?

@eschabell
Copy link
Copy Markdown
Contributor

@kabakaev as mentioned on the docs PR fluent/fluent-bit-docs#491, I closed this in favor of the new docs PR fluent/fluent-bit-docs#2590 which is waiting for this code PR to merge.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

TCP SYN flood circuit breaker

5 participants