Few times a day I am having connection issues between nodes and clients are seeing occasional "Bad request" error. My swarm setup (aws) has following services: nginx
(global) and web
(replicated=2) and separate overlay network. In nginx.conf
I am using proxy_pass http://web:5000
to route requests to web service. Both services are running and marked as healthy and haven't been restarted while having these errors. Manager is separate node (30sec-manager1
).
Few times a day for few requests I am receiving an errors that nginx couldn't connect upstream and I always see 10.0.0.6
IP address mentioned:
Here are related nginx and docker logs. Both web services are replicated on 30sec-worker3
and 30sec-worker4
nodes.
Nginx log:
----------
2017/03/29 07:13:18 [error] 7#7: *44944 connect() failed (113: Host is unreachable) while connecting to upstream, client: 104.154.58.95, server: 30seconds.com, request: "GET / HTTP/1.1", upstream: "http://10.0.0.6:5000/", host: "30seconds.com"
Around same time from docker logs (journalctl -u docker.service)
on node 30sec-manager1:
---------------------------
Mar 29 07:12:50 30sec-manager1 docker[30365]: time="2017-03-29T07:12:50.736935344Z" level=warning msg="memberlist: Refuting a suspect message (from: 30sec-worker3-054c94d39b58)"
Mar 29 07:12:54 30sec-manager1 docker[30365]: time="2017-03-29T07:12:54.659229055Z" level=info msg="memberlist: Marking 30sec-worker3-054c94d39b58 as failed, suspect timeout reached"
Mar 29 07:12:54 30sec-manager1 docker[30365]: time="2017-03-29T07:12:54Z" level=info msg="Firewalld running: false"
Mar 29 07:12:54 30sec-manager1 docker[30365]: time="2017-03-29T07:12:54Z" level=info msg="Firewalld running: false"
Mar 29 07:12:54 30sec-manager1 docker[30365]: time="2017-03-29T07:12:54Z" level=info msg="Firewalld running: false"
Mar 29 07:12:54 30sec-manager1 docker[30365]: time="2017-03-29T07:12:54Z" level=info msg="Firewalld running: false"
Mar 29 07:12:54 30sec-manager1 docker[30365]: time="2017-03-29T07:12:54Z" level=info msg="Firewalld running: false"
Mar 29 07:12:54 30sec-manager1 docker[30365]: time="2017-03-29T07:12:54Z" level=info msg="Firewalld running: false"
Mar 29 07:13:10 30sec-manager1 docker[30365]: time="2017-03-29T07:13:10.302960985Z" level=info msg="memberlist: Suspect 30sec-worker3-054c94d39b58 has failed, no acks received"
Mar 29 07:13:11 30sec-manager1 docker[30365]: time="2017-03-29T07:13:11.055187819Z" level=warning msg="memberlist: Refuting a suspect message (from: 30sec-worker3-054c94d39b58)"
Mar 29 07:13:14 30sec-manager1 docker[30365]: time="2017-03-29T07:13:14Z" level=info msg="Firewalld running: false"
Mar 29 07:13:14 30sec-manager1 docker[30365]: time="2017-03-29T07:13:14Z" level=info msg="Firewalld running: false"
Mar 29 07:13:14 30sec-manager1 docker[30365]: time="2017-03-29T07:13:14Z" level=info msg="Firewalld running: false"
Mar 29 07:13:14 30sec-manager1 docker[30365]: time="2017-03-29T07:13:14Z" level=info msg="Firewalld running: false"
Mar 29 07:13:14 30sec-manager1 docker[30365]: time="2017-03-29T07:13:14Z" level=info msg="Firewalld running: false"
Mar 29 07:13:14 30sec-manager1 docker[30365]: time="2017-03-29T07:13:14Z" level=info msg="Firewalld running: false"
Mar 29 07:13:14 30sec-manager1 docker[30365]: time="2017-03-29T07:13:14Z" level=info msg="Firewalld running: false"
Mar 29 07:13:16 30sec-manager1 docker[30365]: time="2017-03-29T07:13:16Z" level=info msg="Firewalld running: false"
Mar 29 07:13:16 30sec-manager1 docker[30365]: time="2017-03-29T07:13:16Z" level=info msg="Firewalld running: false"
Mar 29 07:13:16 30sec-manager1 docker[30365]: time="2017-03-29T07:13:16Z" level=info msg="Firewalld running: false"
Mar 29 07:13:16 30sec-manager1 docker[30365]: time="2017-03-29T07:13:16Z" level=info msg="Firewalld running: false"
Mar 29 07:13:16 30sec-manager1 docker[30365]: time="2017-03-29T07:13:16Z" level=info msg="Firewalld running: false"
Mar 29 07:13:16 30sec-manager1 docker[30365]: time="2017-03-29T07:13:16Z" level=info msg="Firewalld running: false"
Mar 29 07:13:16 30sec-manager1 docker[30365]: time="2017-03-29T07:13:16Z" level=info msg="Firewalld running: false"
Mar 29 07:13:16 30sec-manager1 docker[30365]: time="2017-03-29T07:13:16Z" level=info msg="Firewalld running: false"
Mar 29 07:13:16 30sec-manager1 docker[30365]: time="2017-03-29T07:13:16Z" level=info msg="Firewalld running: false"
Mar 29 07:13:16 30sec-manager1 docker[30365]: time="2017-03-29T07:13:16Z" level=info msg="Firewalld running: false"
Mar 29 07:13:16 30sec-manager1 docker[30365]: time="2017-03-29T07:13:16Z" level=info msg="Firewalld running: false"
Mar 29 07:13:16 30sec-manager1 docker[30365]: time="2017-03-29T07:13:16Z" level=info msg="Firewalld running: false"
Mar 29 07:13:17 30sec-manager1 docker[30365]: time="2017-03-29T07:13:17Z" level=info msg="Firewalld running: false"
on node 30sec-worker3:
-------------------------
Mar 29 07:12:50 30sec-worker3 docker[30362]: time="2017-03-29T07:12:50.613402284Z" level=info msg="memberlist: Suspect 30sec-manager1-b1cbc10665cc has failed, no acks received"
Mar 29 07:12:55 30sec-worker3 docker[30362]: time="2017-03-29T07:12:55.614174704Z" level=warning msg="memberlist: Refuting a dead message (from: 30sec-worker4-4ca6b1dcaa42)"
Mar 29 07:13:09 30sec-worker3 docker[30362]: time="2017-03-29T07:13:09.613368306Z" level=info msg="memberlist: Suspect 30sec-worker4-4ca6b1dcaa42 has failed, no acks received"
Mar 29 07:13:10 30sec-worker3 docker[30362]: time="2017-03-29T07:13:10.613972658Z" level=info msg="memberlist: Suspect 30sec-manager1-b1cbc10665cc has failed, no acks received"
Mar 29 07:13:11 30sec-worker3 docker[30362]: time="2017-03-29T07:13:11.042788976Z" level=warning msg="memberlist: Refuting a suspect message (from: 30sec-worker4-4ca6b1dcaa42)"
Mar 29 07:13:14 30sec-worker3 docker[30362]: time="2017-03-29T07:13:14.613951134Z" level=info msg="memberlist: Marking 30sec-worker4-4ca6b1dcaa42 as failed, suspect timeout reached"
Mar 29 07:13:25 30sec-worker3 docker[30362]: time="2017-03-29T07:13:25.615128313Z" level=error msg="Bulk sync to node 30sec-manager1-b1cbc10665cc timed out"
on node 30sec-worker4:
-------------------------
Mar 29 07:12:49 30sec-worker4 docker[30376]: time="2017-03-29T07:12:49.658082975Z" level=info msg="memberlist: Suspect 30sec-worker3-054c94d39b58 has failed, no acks received"
Mar 29 07:12:54 30sec-worker4 docker[30376]: time="2017-03-29T07:12:54.658737367Z" level=info msg="memberlist: Marking 30sec-worker3-054c94d39b58 as failed, suspect timeout reached"
Mar 29 07:12:54 30sec-worker4 docker[30376]: time="2017-03-29T07:12:54Z" level=info msg="Firewalld running: false"
Mar 29 07:12:54 30sec-worker4 docker[30376]: time="2017-03-29T07:12:54Z" level=info msg="Firewalld running: false"
Mar 29 07:12:54 30sec-worker4 docker[30376]: time="2017-03-29T07:12:54Z" level=info msg="Firewalld running: false"
Mar 29 07:12:54 30sec-worker4 docker[30376]: time="2017-03-29T07:12:54Z" level=info msg="Firewalld running: false"
Mar 29 07:12:54 30sec-worker4 docker[30376]: time="2017-03-29T07:12:54Z" level=info msg="Firewalld running: false"
Mar 29 07:12:54 30sec-worker4 docker[30376]: time="2017-03-29T07:12:54Z" level=info msg="Firewalld running: false"
Mar 29 07:12:54 30sec-worker4 docker[30376]: time="2017-03-29T07:12:54Z" level=info msg="Firewalld running: false"
Mar 29 07:12:54 30sec-worker4 docker[30376]: time="2017-03-29T07:12:54Z" level=info msg="Firewalld running: false"
Mar 29 07:13:09 30sec-worker4 docker[30376]: time="2017-03-29T07:13:09.658056735Z" level=info msg="memberlist: Suspect 30sec-worker3-054c94d39b58 has failed, no acks received"
Mar 29 07:13:16 30sec-worker4 docker[30376]: time="2017-03-29T07:13:16.303689665Z" level=warning msg="memberlist: Refuting a suspect message (from: 30sec-worker4-4ca6b1dcaa42)"
Mar 29 07:13:16 30sec-worker4 docker[30376]: time="2017-03-29T07:13:16Z" level=info msg="Firewalld running: false"
Mar 29 07:13:16 30sec-worker4 docker[30376]: time="2017-03-29T07:13:16Z" level=info msg="Firewalld running: false"
Mar 29 07:13:16 30sec-worker4 docker[30376]: time="2017-03-29T07:13:16Z" level=info msg="Firewalld running: false"
Mar 29 07:13:16 30sec-worker4 docker[30376]: time="2017-03-29T07:13:16Z" level=info msg="Firewalld running: false"
Mar 29 07:13:16 30sec-worker4 docker[30376]: time="2017-03-29T07:13:16Z" level=info msg="Firewalld running: false"
syslog on 30sec-worker4:
--------------------------
Mar 29 07:12:49 30sec-worker4 docker[30376]: time="2017-03-29T07:12:49.658082975Z" level=info msg="memberlist: Suspect 30sec-worker3-054c94d39b58 has failed, no acks received"
Mar 29 07:12:54 30sec-worker4 docker[30376]: time="2017-03-29T07:12:54.658737367Z" level=info msg="memberlist: Marking 30sec-worker3-054c94d39b58 as failed, suspect timeout reached"
Mar 29 07:12:54 30sec-worker4 kernel: [645679.048975] IPVS: __ip_vs_del_service: enter
Mar 29 07:12:54 30sec-worker4 docker[30376]: time="2017-03-29T07:12:54Z" level=info msg="Firewalld running: false"
Mar 29 07:12:54 30sec-worker4 kernel: [645679.100691] IPVS: __ip_vs_del_service: enter
Mar 29 07:12:54 30sec-worker4 kernel: [645679.130069] IPVS: __ip_vs_del_service: enter
Mar 29 07:12:54 30sec-worker4 kernel: [645679.155859] IPVS: __ip_vs_del_service: enter
Mar 29 07:12:54 30sec-worker4 kernel: [645679.180461] IPVS: __ip_vs_del_service: enter
Mar 29 07:12:54 30sec-worker4 kernel: [645679.205707] IPVS: __ip_vs_del_service: enter
Mar 29 07:12:54 30sec-worker4 kernel: [645679.230326] IPVS: __ip_vs_del_service: enter
Mar 29 07:12:54 30sec-worker4 kernel: [645679.255597] IPVS: __ip_vs_del_service: enter
Mar 29 07:12:54 30sec-worker4 docker[30376]: message repeated 7 times: [ time="2017-03-29T07:12:54Z" level=info msg="Firewalld running: false"]
Mar 29 07:13:09 30sec-worker4 docker[30376]: time="2017-03-29T07:13:09.658056735Z" level=info msg="memberlist: Suspect 30sec-worker3-054c94d39b58 has failed, no acks received"
Mar 29 07:13:16 30sec-worker4 docker[30376]: time="2017-03-29T07:13:16.303689665Z" level=warning msg="memberlist: Refuting a suspect message (from: 30sec-worker4-4ca6b1dcaa42)"
Mar 29 07:13:16 30sec-worker4 docker[30376]: time="2017-03-29T07:13:16Z" level=info msg="Firewalld running: false"
Mar 29 07:13:16 30sec-worker4 docker[30376]: message repeated 7 times: [ time="2017-03-29T07:13:16Z" level=info msg="Firewalld running: false"]
I checked other cases when nginx can't find find upstream and always I find these 3x lines appear most at these times in docker logs in:
level=info msg="memberlist:Suspect 30sec-worker3-054c94d39b58 has failed, no acks received"
level=warning msg="memberlist: Refuting a suspect message (from: 30sec-worker3-054c94d39b58)"
level=warning msg="memberlist: Refuting a dead message (from: 30sec-worker3-054c94d39b58)
By searching other issues, found that these have similar errors, so it may be related:
https://github.com/docker/docker/issues/28843
https://github.com/docker/docker/issues/25325
Anything I should check or debug more to spot the problem or is it a bug?
Thank you.
Output of docker version
:
Client:
Version: 17.03.0-ce
API version: 1.26
Go version: go1.7.5
Git commit: 60ccb22
Built: Thu Feb 23 11:02:43 2017
OS/Arch: linux/amd64
Server:
Version: 17.03.0-ce
API version: 1.26 (minimum version 1.12)
Go version: go1.7.5
Git commit: 60ccb22
Built: Thu Feb 23 11:02:43 2017
OS/Arch: linux/amd64
Experimental: false
Output of docker info
:
Containers: 18
Running: 3
Paused: 0
Stopped: 15
Images: 16
Server Version: 17.03.0-ce
Storage Driver: aufs
Root Dir: /var/lib/docker/aufs
Backing Filesystem: extfs
Dirs: 83
Dirperm1 Supported: true
Logging Driver: json-file
Cgroup Driver: cgroupfs
Plugins:
Volume: local
Network: bridge host macvlan null overlay
Swarm: active
NodeID: ck99cyhgydt8y1zn8ik2xmcdv
Is Manager: true
ClusterID: in0q54eh74ljazrprt0vza3wj
Managers: 1
Nodes: 5
Orchestration:
Task History Retention Limit: 5
Raft:
Snapshot Interval: 10000
Number of Old Snapshots to Retain: 0
Heartbeat Tick: 1
Election Tick: 3
Dispatcher:
Heartbeat Period: 5 seconds
CA Configuration:
Expiry Duration: 3 months
Node Address: 172.31.31.146
Manager Addresses:
172.31.31.146:2377
Runtimes: runc
Default Runtime: runc
Init Binary: docker-init
containerd version: 977c511eda0925a723debdc94d09459af49d082a
runc version: a01dafd48bc1c7cc12bdb01206f9fea7dd6feb70
init version: 949e6fa
Security Options:
apparmor
seccomp
Profile: default
Kernel Version: 4.4.0-57-generic
Operating System: Ubuntu 16.04.1 LTS
OSType: linux
Architecture: x86_64
CPUs: 1
Total Memory: 990.6 MiB
Name: 30sec-manager1
ID: 5IIF:RONB:Y27Q:5MKX:ENEE:HZWM:XYBV:O6KN:BKL6:AEUK:2VKB:MO5P
Docker Root Dir: /var/lib/docker
Debug Mode (client): false
Debug Mode (server): false
Registry: https://index.docker.io/v1/
WARNING: No swap limit support
Labels:
provider=amazonec2
Experimental: false
Insecure Registries:
127.0.0.0/8
Live Restore Enabled: false
Additional environment details (AWS, VirtualBox, physical, etc.):
Amazon AWS (Manager - t2.micro, rest of nodes - t2.small)
docker-compose.yml (There are more services and nodes in setup, but I posted only involved ones)
version: "3"
services:
nginx:
image: 333435094895.dkr.ecr.us-east-1.amazonaws.com/swarm/nginx:latest
ports:
- 80:80
- 81:81
networks:
- thirtysec
depends_on:
- web
deploy:
mode: global
update_config:
delay: 2s
monitor: 2s
web:
image: 333435094895.dkr.ecr.us-east-1.amazonaws.com/swarm/os:latest
command: sh -c "python manage.py collectstatic --noinput && daphne thirtysec.asgi:channel_layer -b 0.0.0.0 -p 5000"
ports:
- 5000:5000
networks:
- thirtysec
deploy:
mode: replicated
replicas: 2
labels: [APP=THIRTYSEC]
update_config:
delay: 15s
monitor: 15s
placement:
constraints: [node.labels.aws_type == t2.small]
healthcheck:
test: goss -g deploy/swarm/checks/web-goss.yaml validate
interval: 2s
timeout: 3s
retries: 15
networks:
thirtysec:
web-goss.yaml
port:
tcp:5000:
listening: true
ip:
- 0.0.0.0
area/networking area/swarm version/17.03