High-Performance server for NATS, the cloud native messaging system.

Overview

NATS is a simple, secure and performant communications system for digital systems, services and devices. NATS is part of the Cloud Native Computing Foundation (CNCF). NATS has over 40 client language implementations, and its server can run on-premise, in the cloud, at the edge, and even on a Raspberry Pi. NATS can secure and simplify design and operation of modern distributed systems.

Documentation

Contact

  • Twitter: Follow us on Twitter!
  • Google Groups: Where you can ask questions
  • Slack: Click here to join. You can ask questions of our maintainers and of the rich and active community.

Contributing

If you are interested in contributing to NATS, read about our...

Security

Security Audit

A third-party security audit was performed by Cure53; you can see the full report here.

Reporting Security Vulnerabilities

If you've found a vulnerability or a potential vulnerability in the NATS server, please let us know at nats-security.

License

Unless otherwise noted, the NATS source files are distributed under the Apache Version 2.0 license found in the LICENSE file.

Comments
  • subscription count in subsz is wrong

    subscription count in subsz is wrong

    Since updating one of my brokers to 2.0.0 I noticed a slow increase in subscription counts. I also did a bunch of other updates at the same time (like moving to the newly renamed libraries), so in order to find the cause I dug in, and eventually concluded that the server is just counting things wrongly.

    [graph: subscription count over time]

    Ignoring the annoying popup, you can see a steady increase in subscriptions.

    The data below is from the following dependency, embedded in another Go process:

    github.com/nats-io/nats-server/v2 v2.0.1-0.20190701212751-a171864ae7df
    
    $ curl -s http://localhost:6165/varz|jq .subscriptions
    29256
    

    I then tried to verify this number and, assuming I have no bugs in the script below, I think the varz counter is off by a lot. Comparing snapshots of connz over time, I see no growth reflected there, neither in connection counts nor in subscriptions:

    $ curl "http://localhost:6165/connz?limit=200000&subs=1"|./countsubs.rb
    Connections: 3659
    Subscriptions: 25477
    

    I also captured connz output over time, at 15:17, at 15:56, and at 10:07 the next day:

    $ cat connz-1562685506.json|./countsubs.rb
    Connections: 3657
    Subscriptions: 25463
    $ cat connz-1562687791.json|./countsubs.rb
    Connections: 3658
    Subscriptions: 25463
    $ cat connz-1562687791.json|./countsubs.rb
    Connections: 3658
    Subscriptions: 25463
    

    Using the script here:

    require "json"
    
    data = JSON.parse(STDIN.read)
    puts "Connections: %d" % data["connections"].length
    
    count = 0
    
    # subscriptions_list is only present when connz is queried with subs=1
    data["connections"].each do |conn|
      count += conn["subscriptions_list"].length if conn["subscriptions_list"]
    end
    
    puts "Subscriptions: %d" % count
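
    The same check in Go, for anyone embedding the server (a minimal sketch; it assumes the connz JSON shape used by the Ruby script above, i.e. a subscriptions_list array per connection when subs=1 is passed):

    // countsubs.go: count connections and subscriptions from a /connz?subs=1 dump on stdin.
    package main
    
    import (
    	"encoding/json"
    	"fmt"
    	"os"
    )
    
    type connz struct {
    	Connections []struct {
    		SubscriptionsList []string `json:"subscriptions_list"`
    	} `json:"connections"`
    }
    
    func main() {
    	var cz connz
    	if err := json.NewDecoder(os.Stdin).Decode(&cz); err != nil {
    		panic(err)
    	}
    	total := 0
    	for _, c := range cz.Connections {
    		total += len(c.SubscriptionsList)
    	}
    	fmt.Printf("Connections: %d\nSubscriptions: %d\n", len(cz.Connections), total)
    }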
    
    opened by ripienaar 87
  • Performance issues with locks and sublist cache

    Performance issues with locks and sublist cache

    • [ ] Defect
    • [x] Feature Request or Change Proposal

    Feature Requests

    Use Case:

    We are using gnatsd 1.4.1 (compiled with Go 1.11.5). During benchmarking, we observed non-trivial latency (500 ms+, often seconds) at the gnatsd cluster.

    As there are no slow consumers (with the default 2-second threshold), yet the OS receive buffer filled up and the TCP window went to 0, it seems that the gnatsd server is somehow slow in its read loop. We are trying to slow down the sender for one connection, but we believe that gnatsd can also be improved. If you need more proof that the read loop is slow, we may be able to provide some tcpdump snippets and gnatsd tracing logs.

    We also observed some parser errors that happen rarely when gnatsd is under a high read load. The client is using cnats. However, we are not sure which component (cnats, the OS, or gnatsd) is at fault. Once we track it down, we may open another issue to address the problem.

    [8354] 2019/04/01 12:17:11.695815 [ERR] 10.228.255.129:44588 - cid:1253 - Client parser ERROR, state=0, i=302: proto='"\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00"...'
    

    By the way, since gnatsd can detect slow consumers, would it be possible for gnatsd to know when it has itself become a slow consumer (slow reader)? The only idea I have come up with is to adjust the OS buffers and let the upstream feel the back-pressure. If you have any suggestions, please let me know.

    Proposed Change:

    1. Improve locks. https://github.com/nats-io/gnatsd/compare/branch_1_4_0...azrle:enhance/processMsg_lock [images: comparison of read loops between high load and low load; sync blocking graph]

    2. Ability to adjust the sublist cache size or disable it. https://github.com/nats-io/gnatsd/compare/branch_1_4_0...azrle:feature/opts-sublist_cache_size Given our application's characteristics (it subs/unsubs very frequently and most subjects are used only once), the cache hit rate is under 0.5%, yet maintaining the sublist cache still costs gnatsd. Besides the locks for the cache, reduceCacheCount is noticeable: while other functions have fewer than 50 goroutines each, the number of goroutines in server.(*Sublist).reduceCacheCount can climb to nearly 18,000. (See the note after this list.)

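    For reference, an option along these lines did later land in the server: the sublist cache can be disabled entirely from the config. A hedged sketch (check the docs of your release for the exact spelling):

    disable_sublist_cache: true

    With the cache disabled, the cache locks and the reduceCacheCount goroutines described above go away, at the cost of a full sublist match on every message.
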
    Who Benefits From The Change(s)?

    Clients send messages to gnatsd heavily, and subscriptions change frequently. Under our test cases (with enough servers), the 99.9th-percentile latency drops from 1500 ms to 500 ms (still slow, though).

    I noticed that gnatsd v2 is coming and that the implementation changes a lot, but I am afraid we may not have time to wait for it to become production-ready.

    I sincerely hope the performance can be improved for v1.4.

    Thank you in advance!

    opened by azrle 59
  • Consumer stopped working after errPartialCache (nats-server oom-killed)

    Consumer stopped working after errPartialCache (nats-server oom-killed)

    Defect

    Make sure that these boxes are checked before submitting your issue -- thank you!

    • [x] Included nats-server -DV output
    • [ ] Included a [Minimal, Complete, and Verifiable example](https://stackoverflow.com/help/mcve)

    Versions of nats-server and affected client libraries used:

    # nats-server -DV
    [92] 2021/12/06 15:16:05.235349 [INF] Starting nats-server
    [92] 2021/12/06 15:16:05.235397 [INF]   Version:  2.6.6
    [92] 2021/12/06 15:16:05.235401 [INF]   Git:      [878afad]
    [92] 2021/12/06 15:16:05.235406 [DBG]   Go build: go1.16.10
    [92] 2021/12/06 15:16:05.235416 [INF]   Name:     NASX72BQAFBIH4QBLZ36RADTPKSO6LCKRDEAS37XRJ7SYZ53RYYOFHHS
    [92] 2021/12/06 15:16:05.235436 [INF]   ID:       NASX72BQAFBIH4QBLZ36RADTPKSO6LCKRDEAS37XRJ7SYZ53RYYOFHHS
    [92] 2021/12/06 15:16:05.235457 [DBG] Created system account: "$SYS"
    
    Image:         nats:2.6.6-alpine
        Limits:
          cpu:     200m
          memory:  256Mi
        Requests:
          cpu:      200m
          memory:   256Mi
    

    Go library:

    github.com/nats-io/nats.go v1.13.1-0.20211018182449-f2416a8b1483
    

    OS/Container environment:

    Server Version: version.Info{Major:"1", Minor:"22", GitVersion:"v1.22.4", GitCommit:"b695d79d4f967c403a96986f1750a35eb75e75f1", GitTreeState:"clean", BuildDate:"2021-11-17T15:42:41Z", GoVersion:"go1.16.10", Compiler:"gc", Platform:"linux/amd64"}
    
    CONTAINER-RUNTIME
    cri-o://1.21.4
    

    Steps or code to reproduce the issue:

    1. Start a NATS cluster (3 replicas) with JetStream enabled. JS config:
    jetstream {
      max_mem: 64Mi
      store_dir: /data
    
      max_file: 10Gi
    }
    
    
    2. Start pushing messages into the stream. Stream config:
    Configuration:
    
                 Subjects: widget-request-collector
         Acknowledgements: true
                Retention: File - WorkQueue
                 Replicas: 3
           Discard Policy: Old
         Duplicate Window: 2m0s
        Allows Msg Delete: true
             Allows Purge: true
           Allows Rollups: false
         Maximum Messages: unlimited
            Maximum Bytes: 1.9 GiB
              Maximum Age: 1d0h0m0s
     Maximum Message Size: unlimited
        Maximum Consumers: unlimited
    
    
    3. Shut down one of the NATS nodes for a while, and rate-limit the consumer (or shut it down) so that messages collect in file storage.
    4. Wait until the storage reaches its maximum capacity (1.9G).
    5. Bring the NATS server back up. (Do not bring up the consumer.)

    Expected result:

    The outdated node should become current.

    Actual result:

    The outdated node tries to become current and gets messages from the stream leader, but it reaches the memory limit and is OOM-killed. It restarts, and is OOM-killed again.

    Cluster Information:
    
                     Name: nats
                   Leader: promo-widget-collector-event-nats-2
                  Replica: promo-widget-collector-event-nats-1, outdated, OFFLINE, seen 2m8s ago, 13,634 operations behind
                  Replica: promo-widget-collector-event-nats-0, current, seen 0.00s ago
    
    State:
    
                 Messages: 2,695,412
                    Bytes: 1.9 GiB
                 FirstSeq: 3,957,219 @ 2021-12-06T14:04:00 UTC
                  LastSeq: 6,652,630 @ 2021-12-06T15:09:36 UTC
         Active Consumers: 1
    

    Crashed pod info:

        State:          Waiting                                                                                                                                                                                                                                                                                                                                                                                                              
          Reason:       CrashLoopBackOff                                                                                                                                                                                                                                                                                                                                                                                                     
        Last State:     Terminated                                                                                                                                                                                                                                                                                                                                                                                                           
          Reason:       OOMKilled                                                                                                                                                                                                                                                                                                                                                                                                            
          Exit Code:    137                                                                                                                                                                                                                                                                                                                                                                                                                  
          Started:      Mon, 06 Dec 2021 14:30:26 +0000                                                                                                                                                                                                                                                                                                                                                                                      
          Finished:     Mon, 06 Dec 2021 14:31:08 +0000                                                                                                                                                                                                                                                                                                                                                                                      
        Ready:          False                                                                                                                                                                                                                                                                                                                                                                                                                
        Restart Count:  3 
    

    Is it possible to configure memory limits for nats-server to keep it from consuming too much memory?
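
    (A note on the limits above: the jetstream max_mem setting bounds JetStream's memory-backed storage, not the server process as a whole, and this 2.6.6 image is built with go1.16, which predates the Go runtime's soft heap limit GOMEMLIMIT introduced in Go 1.19, so neither caps total process memory here.)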

    🐞 bug 
    opened by rino-pupkin 51
  • jetstream could not pull message after nats-server restart

    jetstream could not pull message after nats-server restart

    I was testing JetStream on nats-server v2.3.2. One sender and one receiver program had been running for quite a long time.

    This is what my stream looks like:

    	_, err = js.AddStream(&nats.StreamConfig{
    		Name:      streamName,
    		Subjects:  []string{streamSubjects},
    		Storage:   nats.FileStorage,
    		Replicas:  3,
    		Retention: nats.WorkQueuePolicy,
    		Discard:   nats.DiscardNew,
    		MaxMsgs:   -1,
    		MaxAge:    time.Hour * 24 * 365,
    	})
    

    This is how I create the consumer:

    	if _, err := js.AddConsumer(streamName, &nats.ConsumerConfig{
    		Durable:       durableName,
    		DeliverPolicy: nats.DeliverAllPolicy,
    		AckPolicy:     nats.AckExplicitPolicy,
    		ReplayPolicy:  nats.ReplayInstantPolicy,
    		FilterSubject: subjectName,
    		AckWait:       time.Second * 30,
    		MaxDeliver:    -1,
    		MaxAckPending: 1000,
    	}); err != nil && !strings.Contains(err.Error(), "already in use") {
    		log.Println("AddConsumer fail")
    		return
    	}
    

    This is what the subscriber looks like:

    	sub, err := js.PullSubscribe("ORDERS.created", durableName, nats.Bind("ORDERS", durableName))
    	if err != nil {
    		fmt.Println(" PullSubscribe:", err)
    		return
    	}
           msgs, err := sub.Fetch(1000, nats.MaxWait(10*time.Second))
    

    When I restarted my nats-server cluster nodes (upgrading to nats-server 2.3.3), the consumer could no longer pull messages, even after I restarted my consumer program. The Fetch call just returns "nats: timeout", but I'm sure there are lots of messages in the work queue. Only if I delete the consumer by calling js.DeleteConsumer(streamName, durableName) and recreate it can my program resume fetching messages (see the helper below). In fact, every time I restart the nats-server nodes, my consumer program encounters the same problem.
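
    The delete-and-recreate workaround as a small helper, using the names from the snippets above (a sketch of the workaround, not a fix):

    // recreateConsumer drops the durable that stopped delivering after the
    // server restart and recreates it with the same ConsumerConfig as earlier.
    func recreateConsumer(js nats.JetStreamContext, streamName, durableName string, cfg *nats.ConsumerConfig) error {
    	if err := js.DeleteConsumer(streamName, durableName); err != nil {
    		return err
    	}
    	_, err := js.AddConsumer(streamName, cfg)
    	return err
    }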

    There is another issue: after I restart the nats-server nodes and restart my program, it sometimes reports: "PullSubscribe: nats: JetStream system temporarily unavailable"

    I expect restarting nats-server nodes not to impact JetStream clients fetching messages.

    🐞 bug 
    opened by carr123 50
  • Client Auth API

    Client Auth API

    NATS seems perfect for our needs; however, having auth hard-coded at service start isn't very practical when we are adding and removing users while it's running.

    Implementing some Go code to handle this is one option; another is to use an external service for authorization, whether via HTTP basic auth or something else. Being able to set an authentication endpoint would be very handy, especially since we only allow a user to be logged in with one session.

    If this is possible now, please let me know; I couldn't find it in the docs anywhere.
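
    In the meantime, one pattern that works with the server as it stands is to treat the config file as the interface: regenerate the users block from an external store and hot-reload it, since nats-server re-reads its configuration on SIGHUP. A minimal, hypothetical sketch (the user store, file path, and PID below are stand-ins):

    // syncusers.go: regenerate the authorization block and hot-reload nats-server.
    package main
    
    import (
    	"fmt"
    	"os"
    	"syscall"
    )
    
    const natsPID = 12345 // stand-in; discover the real PID however suits your deployment
    
    type user struct{ Name, Pass string }
    
    func writeUsersConf(path string, users []user) error {
    	f, err := os.Create(path)
    	if err != nil {
    		return err
    	}
    	defer f.Close()
    	fmt.Fprintln(f, "authorization {")
    	fmt.Fprintln(f, "  users = [")
    	for _, u := range users {
    		fmt.Fprintf(f, "    {user: %q, password: %q},\n", u.Name, u.Pass)
    	}
    	fmt.Fprintln(f, "  ]")
    	fmt.Fprintln(f, "}")
    	return nil
    }
    
    func main() {
    	// In practice this slice would be fetched from your user database.
    	users := []user{{"alice", "s3cret"}, {"bob", "hunter2"}}
    	if err := writeUsersConf("users.conf", users); err != nil {
    		panic(err)
    	}
    	// nats-server re-reads its config file on SIGHUP without dropping connections.
    	if err := syscall.Kill(natsPID, syscall.SIGHUP); err != nil {
    		panic(err)
    	}
    }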

    Thanks!

    customer requested security 
    opened by qrpike 47
  • memory increase in clustered mode

    memory increase in clustered mode

    This is a follow on from https://github.com/nats-io/nats-server/issues/1065

    While looking into the above issue I noticed memory growth. We wanted to focus on one issue at a time, so with 1065 done I looked at the memory situation. The usage patterns and so forth are identical to 1065.

    [graph: broker memory usage over 12 hours]

    The graph above covers 12 hours. Now, as you know, I embed your broker into one of my apps and run a bunch of things in there. However, in order to isolate the problem I did a few things:

    1. The same version of everything, with the same usage pattern, on a single unclustered broker does not show memory growth
    2. With all the related features turned off in my code that embeds nats-server, I still see the growth when clustered
    3. I made my code respond to SIGQUIT by writing memory profiles on demand, so I can interrogate a running nats server (a sketch of the handler is below)
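
    The SIGQUIT handler from point 3 is small enough to show; a minimal sketch (file naming is arbitrary):

    package main
    
    import (
    	"fmt"
    	"os"
    	"os/signal"
    	"runtime/pprof"
    	"syscall"
    	"time"
    )
    
    // profileOnSIGQUIT writes a heap profile on each SIGQUIT, so a long-running
    // embedded nats-server can be inspected later with `go tool pprof`.
    func profileOnSIGQUIT() {
    	ch := make(chan os.Signal, 1)
    	signal.Notify(ch, syscall.SIGQUIT) // overrides Go's default stack-dump-and-exit
    	go func() {
    		for range ch {
    			f, err := os.Create(fmt.Sprintf("heap-%d.pprof", time.Now().Unix()))
    			if err != nil {
    				continue
    			}
    			pprof.WriteHeapProfile(f) // snapshot of live heap allocations
    			f.Close()
    		}
    	}()
    }
    
    func main() {
    	profileOnSIGQUIT()
    	// ... embed and run nats-server here ...
    	select {}
    }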

    The nats-server is github.com/nats-io/nats-server/v2 v2.0.3-0.20190723153225-9cf534bc5e97

    From the above memory dumps, comparing dumps taken 6 hours apart, I see:

    8am:

    (pprof) top10
    Showing nodes accounting for 161.44MB, 90.17% of 179.04MB total
    Dropped 66 nodes (cum <= 0.90MB)
    Showing top 10 nodes out of 51
          flat  flat%   sum%        cum   cum%
       73.82MB 41.23% 41.23%    73.82MB 41.23%  github.com/nats-io/nats-server/v2/server.(*client).queueOutbound
       29.18MB 16.30% 57.53%    29.68MB 16.58%  github.com/nats-io/nats-server/v2/server.(*Server).createClient
       19.60MB 10.95% 68.48%    19.60MB 10.95%  math/rand.NewSource
       15.08MB  8.42% 76.90%   140.30MB 78.37%  github.com/nats-io/nats-server/v2/server.(*client).readLoop
        6.50MB  3.63% 80.53%       12MB  6.70%  github.com/nats-io/nats-server/v2/server.(*client).processSub
        5.25MB  2.93% 83.46%    11.25MB  6.28%  github.com/nats-io/nats-server/v2/server.(*Sublist).Insert
        4.01MB  2.24% 85.70%    65.85MB 36.78%  github.com/nats-io/nats-server/v2/server.(*client).processInboundClientMsg
        3.50MB  1.95% 87.65%     3.50MB  1.95%  github.com/nats-io/nats-server/v2/server.newLevel
        2.50MB  1.40% 89.05%     2.50MB  1.40%  github.com/nats-io/nats-server/v2/server.newNode
           2MB  1.12% 90.17%        2MB  1.12%  github.com/nats-io/nats-server/v2/server.(*client).addSubToRouteTargets
    

    1pm:

    (pprof) top10
    Showing nodes accounting for 185.64MB, 90.87% of 204.29MB total
    Dropped 69 nodes (cum <= 1.02MB)
    Showing top 10 nodes out of 46
          flat  flat%   sum%        cum   cum%
       86.33MB 42.26% 42.26%    86.33MB 42.26%  github.com/nats-io/nats-server/v2/server.(*client).queueOutbound
       30.19MB 14.78% 57.04%    30.69MB 15.02%  github.com/nats-io/nats-server/v2/server.(*Server).createClient
       25.75MB 12.60% 69.64%   165.05MB 80.79%  github.com/nats-io/nats-server/v2/server.(*client).readLoop
       19.60MB  9.59% 79.24%    19.60MB  9.59%  math/rand.NewSource
        6.50MB  3.18% 82.42%    12.55MB  6.14%  github.com/nats-io/nats-server/v2/server.(*client).processSub
        5.25MB  2.57% 84.99%    11.25MB  5.51%  github.com/nats-io/nats-server/v2/server.(*Sublist).Insert
        4.02MB  1.97% 86.95%    73.70MB 36.08%  github.com/nats-io/nats-server/v2/server.(*client).processInboundClientMsg
        3.50MB  1.71% 88.67%     3.50MB  1.71%  github.com/nats-io/nats-server/v2/server.newLevel
        2.50MB  1.22% 89.89%     2.50MB  1.22%  github.com/nats-io/nats-server/v2/server.newNode
           2MB  0.98% 90.87%        2MB  0.98%  github.com/nats-io/nats-server/v2/server.(*client).addSubToRouteTargets
    
    opened by ripienaar 44
  • Suggest repair actions for JetStream cluster consumer NO quorum issue

    Suggest repair actions for JetStream cluster consumer NO quorum issue

    Environment

    • NATS version: 2.2.6 with JetStream enabled
    • Number of nodes in the cluster: 3
    • Deployed on OKD 3.11 via the nats Helm chart 0.8.0

    Event description

    • Getting JetStream stream info succeeds, but getting JetStream consumer info via natscli fails
    • [Pub] OK, [Sub] failed. NATS sub clients can't connect to the NATS cluster after 7/7 00:18
    • The cluster had been running for more than a month, and there were no errors until 7/7. It was confirmed that there were no network or hardware problems.
    • Logs and attempted actions are attached below; please suggest other repair actions. Thanks.

    NATS server logs

    nats instance 0

    [1] 2021/07/07 00:18:44.650787 [WRN] JetStream cluster stream '$G > MY-STREAM2' has NO quorum, stalled.
    [1] 2021/07/07 00:18:44.651098 [WRN] JetStream cluster consumer '$G > MY-STREAM2 > consumer5' has NO quorum, stalled.
    [1] 2021/07/07 00:18:47.433327 [INF] JetStream cluster new metadata leader
    [1] 2021/07/07 00:18:47.930284 [INF] JetStream cluster new consumer leader for '$G > MY-STREAM2 > consumer5'
    [1] 2021/07/07 00:18:51.306199 [WRN] JetStream cluster stream '$G > MY-STREAM' has NO quorum, stalled.
    [1] 2021/07/07 00:18:51.652389 [WRN] JetStream cluster consumer '$G > MY-STREAM > consumer3' has NO quorum, stalled.
    [1] 2021/07/07 00:18:56.555042 [INF] JetStream cluster new consumer leader for '$G > MY-STREAM > consumer2'
    [1] 2021/07/07 00:19:00.462077 [INF] JetStream cluster new consumer leader for '$G > MY-STREAM > consumer3'
    [1] 2021/07/07 00:19:00.870001 [WRN] Got stream sequence mismatch for '$G > MY-STREAM'
    [1] 2021/07/07 00:19:01.024537 [WRN] Resetting stream '$G > MY-STREAM'
    [1] 2021/07/07 00:19:01.292724 [INF] JetStream cluster new stream leader for '$G > MY-STREAM'
    

    nats instance 1

    [1] 2021/07/07 00:18:48.190309 [INF] JetStream cluster new stream leader for '$G > MY-STREAM2'
    [1] 2021/07/07 00:18:53.343597 [INF] JetStream cluster new metadata leader
    [1] 2021/07/07 00:18:56.820943 [INF] JetStream cluster new consumer leader for '$G > MY-STREAM2 > consumer5'
    [1] 2021/07/07 00:18:57.098682 [INF] JetStream cluster new consumer leader for '$G > MY-STREAM > consumer1'
    [1] 2021/07/07 00:18:57.572857 [INF] JetStream cluster new stream leader for '$G > MY-STREAM2'
    [1] 2021/07/07 00:18:57.679975 [INF] JetStream cluster new stream leader for '$G > MY-STREAM'
    [1] 2021/07/07 00:19:00.710121 [WRN] Got stream sequence mismatch for '$G > MY-STREAM'
    [1] 2021/07/07 00:19:00.909870 [WRN] Resetting stream '$G > MY-STREAM'
    [1] 2021/07/08 03:30:19.175389 [WRN] Did not receive all stream info results for "$G"
    

    nats instance 2

    [1] 2021/07/07 00:18:57.508614 [INF] JetStream cluster new consumer leader for '$G > MY-STREAM > consumer4'
    [1] 2021/07/07 00:19:00.710399 [WRN] Got stream sequence mismatch for '$G > MY-STREAM'
    [1] 2021/07/07 00:19:00.907675 [WRN] Resetting stream '$G > MY-STREAM'
    

    Tried Actions

    1. Try to execute "nats consumer cluster step-down" [Failed]
    nats consumer list MY-STREAM
    # Consumers for Stream MY-STREAM:
    
    #         consumer1
    #         consumer2
    #         consumer3
    #         consumer4
    
    nats consumer cluster step-down --trace 
    # 13:11:04 >>> $JS.API.STREAM.NAMES
    # {"offset":0}
    
    # 13:11:05 <<< $JS.API.STREAM.NAMES
    # {"type":"io.nats.jetstream.api.v1.stream_names_response","total":2,"offset":0,"limit":1024,"streams":["MY-STREAM","MY-STREAM2"]}
    
    # ? Select a Stream MY-STREAM
    # 13:11:13 >>> $JS.API.CONSUMER.NAMES.MY-STREAM
    # {"offset":0}
    
    # 13:11:13 <<< $JS.API.CONSUMER.NAMES.MY-STREAM
    # {"type":"io.nats.jetstream.api.v1.consumer_names_response","total":4,"offset":0,"limit":1024,"consumers":["consumer1","consumer2","consumer3","consumer4"]}
    
    # ? Select a Consumer consumer2
    # 13:11:16 >>> $JS.API.CONSUMER.INFO.MY-STREAM.consumer2
    
    
    # 13:11:21 <<< $JS.API.CONSUMER.INFO.MY-STREAM.consumer2: context deadline exceeded
    
    # nats.exe: error: context deadline exceeded, try --help
    
    2. Try to request the CONSUMER STEPDOWN API directly [Failed]
    nats req '$JS.API.CONSUMER.LEADER.STEPDOWN.MY-STREAM.consumer3' "" --trace
    
    # 05:20:43 Sending request on "$JS.API.CONSUMER.LEADER.STEPDOWN.MY-STREAM.consumer3"
    # nats: error: nats: timeout, try --help
    
    
    3. Try to restart the NATS server [Still failed to get consumer]
    kubectl rollout restart statefulset nats -n mynamespace
    
    nats con info --trace
    # 05:43:02 >>> $JS.API.STREAM.NAMES
    # {"offset":0}
    
    # 05:43:02 <<< $JS.API.STREAM.NAMES
    # {"type":"io.nats.jetstream.api.v1.stream_names_response","total":2,"offset":0,"limit":1024,"streams":["MY-STREAM","MY-STREAM2"]}
    
    # ? Select a Stream MY-STREAM
    # 05:43:03 >>> $JS.API.CONSUMER.NAMES.MY-STREAM
    # {"offset":0}
    
    # 05:43:03 <<< $JS.API.CONSUMER.NAMES.MY-STREAM
    # {"type":"io.nats.jetstream.api.v1.consumer_names_response","total":4,"offset":0,"limit":1024,"consumers":["consumer1","consumer2","consumer3","consumer4"]}
    
    # ? Select a Consumer consumer1
    # 05:43:05 >>> $JS.API.CONSUMER.INFO.MY-STREAM.consumer1
    
    
    # 05:43:05 <<< $JS.API.CONSUMER.INFO.MY-STREAM.consumer1
    # {"type":"io.nats.jetstream.api.v1.consumer_info_response","error":{"code":503,"description":"JetStream system temporarily unavailable"}}
    
    # nats: error: could not load Consumer MY-STREAM > consumer1: JetStream system temporarily unavailable
    

    The nats-0 server has a lot of JetStream WARN logs

    [1] 2021/07/08 05:40:33.345825 [WRN] JetStream cluster consumer '$G > MY-STREAM > consumer3' has NO quorum, stalled.
    [1] 2021/07/08 05:40:34.027116 [WRN] JetStream cluster consumer '$G > MY-STREAM > consumer2' has NO quorum, stalled.
    [1] 2021/07/08 05:40:34.542920 [WRN] JetStream cluster consumer '$G > MY-STREAM > consumer1' has NO quorum, stalled.
    [1] 2021/07/08 05:40:35.494354 [WRN] JetStream cluster consumer '$G > MY-STREAM > consumer4' has NO quorum, stalled.
    [1] 2021/07/08 05:40:55.586260 [WRN] JetStream cluster consumer '$G > MY-STREAM > consumer4' has NO quorum, stalled.
    [1] 2021/07/08 05:40:57.300211 [WRN] JetStream cluster consumer '$G > MY-STREAM > consumer1' has NO quorum, stalled.
    [1] 2021/07/08 05:40:58.005908 [WRN] JetStream cluster consumer '$G > MY-STREAM > consumer3' has NO quorum, stalled.
    [1] 2021/07/08 05:40:58.324828 [WRN] JetStream cluster consumer '$G > MY-STREAM > consumer2' has NO quorum, stalled.
    [1] 2021/07/08 05:41:16.664240 [WRN] JetStream cluster consumer '$G > MY-STREAM > consumer4' has NO quorum, stalled.
    [1] 2021/07/08 05:41:17.659280 [WRN] JetStream cluster consumer '$G > MY-STREAM > consumer1' has NO quorum, stalled.
    [1] 2021/07/08 05:41:20.245055 [WRN] JetStream cluster consumer '$G > MY-STREAM > consumer3' has NO quorum, stalled.
    

    The NATS stream report shows a failed status for nats-0 on MY-STREAM

    nats stream report
    
    Obtaining Stream stats
    
    +--------------------------------------------------------------------------------------------------------------------+
    |                                                   Stream Report                                                    |
    +-----------------------------+---------+-----------+----------+---------+------+---------+--------------------------+
    | Stream                      | Storage | Consumers | Messages | Bytes   | Lost | Deleted | Replicas                 |
    +-----------------------------+---------+-----------+----------+---------+------+---------+--------------------------+
    | MY-STREAM2                  | File    | 1         | 0        | 0 B     | 0    | 0       | nats-0, nats-1, nats-2*  |
    | MY-STREAM                   | File    | 0         | 500      | 3.9 MiB | 0    | 0       | nats-0!, nats-1, nats-2* |
    +-----------------------------+---------+-----------+----------+---------+------+---------+--------------------------+
    
    4. Try to remove the nats-0 peer for MY-STREAM [Failed]
    nats stream cluster peer-remove
    # ? Select a Stream MY-STREAM
    # ? Select a Peer nats-0
    # 06:16:31 Removing peer "nats-0"
    # nats: error: peer remap failed, try --help
    
    opened by phho 42
  • Service crossing accounts and leaf nodes can't send a message back to the requester.

    Service crossing accounts and leaf nodes can't send a message back to the requester.

    • [X] Defect
    • [ ] Feature Request or Change Proposal

    Defects

    Make sure that these boxes are checked before submitting your issue -- thank you!

    • [X] Included nats-server -DV output
    c1          | [1372] 2020/01/10 15:17:46.476336 [INF] Starting nats-server version 2.1.2
    c1          | [1372] 2020/01/10 15:17:46.476336 [DBG] Go build version go1.12.13
    c1          | [1372] 2020/01/10 15:17:46.476336 [INF] Git commit [679beda]
    c1          | [1372] 2020/01/10 15:17:46.476336 [WRN] Plaintext passwords detected, use nkeys or bcrypt.
    c1          | [1372] 2020/01/10 15:17:46.478337 [INF] Starting http monitor on 0.0.0.0:8222
    c1          | [1372] 2020/01/10 15:17:46.478337 [INF] Listening for leafnode connections on 0.0.0.0:7422
    c1          | [1372] 2020/01/10 15:17:46.478337 [DBG] Get non local IPs for "0.0.0.0"
    c1          | [1372] 2020/01/10 15:17:46.485338 [DBG]  ip=172.18.206.186
    c1          | [1372] 2020/01/10 15:17:46.488338 [INF] Listening for client connections on 0.0.0.0:4244
    c1          | [1372] 2020/01/10 15:17:46.488338 [INF] Server id is ND2MSDWDWTMJEX2V7TDS2O53Q5ZEY3W3ORS6T53HOM3PR5BBP6ZSYCA6
    c1          | [1372] 2020/01/10 15:17:46.488338 [INF] Server is ready
    c1          | [1372] 2020/01/10 15:17:46.488338 [DBG] Get non local IPs for "0.0.0.0"
    c1          | [1372] 2020/01/10 15:17:46.492338 [DBG]  ip=172.18.206.186
    c2          | [1372] 2020/01/10 15:17:48.537218 [INF] Starting nats-server version 2.1.2
    c2          | [1372] 2020/01/10 15:17:48.537218 [DBG] Go build version go1.12.13
    c2          | [1372] 2020/01/10 15:17:48.537218 [INF] Git commit [679beda]
    c2          | [1372] 2020/01/10 15:17:48.537218 [WRN] Plaintext passwords detected, use nkeys or bcrypt.
    c2          | [1372] 2020/01/10 15:17:48.539218 [INF] Starting http monitor on 0.0.0.0:8222
    c2          | [1372] 2020/01/10 15:17:48.539218 [INF] Listening for client connections on 0.0.0.0:4244
    c2          | [1372] 2020/01/10 15:17:48.539218 [INF] Server id is NCIHCZWAIQUH3OK624BMEV62WEEX6IEBKUFXAPRFRCE3GVEWRRNC5WBX
    c2          | [1372] 2020/01/10 15:17:48.539218 [INF] Server is ready
    c2          | [1372] 2020/01/10 15:17:48.539218 [DBG] Get non local IPs for "0.0.0.0"
    c2          | [1372] 2020/01/10 15:17:48.545215 [DBG]  ip=172.18.194.70
    c2          | [1372] 2020/01/10 15:17:48.556228 [DBG] Trying to connect as leafnode to remote server on "c1:7422" (172.18.206.186:7422)
    c1          | [1372] 2020/01/10 15:17:48.560110 [DBG] 172.18.194.70:49157 - lid:1 - Leafnode connection created
    c2          | [1372] 2020/01/10 15:17:48.560661 [DBG] 172.18.206.186:7422 - lid:1 - Remote leafnode connect msg sent
    c2          | [1372] 2020/01/10 15:17:48.560661 [DBG] 172.18.206.186:7422 - lid:1 - Leafnode connection created
    c2          | [1372] 2020/01/10 15:17:48.560661 [INF] Connected leafnode to "c1"
    c1          | [1372] 2020/01/10 15:17:48.561188 [TRC] 172.18.194.70:49157 - lid:1 - <<- [CONNECT {"tls_required":false,"name":"NCIHCZWAIQUH3OK624BMEV62WEEX6IEBKUFXAPRFRCE3GVEWRRNC5WBX"}]
    c1          | [1372] 2020/01/10 15:17:48.562131 [TRC] 172.18.194.70:49157 - lid:1 - ->> [LS+ test.service.1]
    c1          | [1372] 2020/01/10 15:17:48.562131 [TRC] 172.18.194.70:49157 - lid:1 - ->> [LS+ lds.qtioyTeG9dZPgE8uYM7rsy]
    c2          | [1372] 2020/01/10 15:17:48.561759 [TRC] 172.18.206.186:7422 - lid:1 - <<- [LS+ test.service.1]
    c2          | [1372] 2020/01/10 15:17:48.562839 [TRC] 172.18.206.186:7422 - lid:1 - <<- [LS+ lds.qtioyTeG9dZPgE8uYM7rsy]
    c1          | [1372] 2020/01/10 15:17:49.489505 [DBG] 10.35.68.24:62849 - cid:2 - Client connection created
    c1          | [1372] 2020/01/10 15:17:49.491212 [TRC] 10.35.68.24:62849 - cid:2 - <<- [CONNECT {"verbose":false,"pedantic":false,"user":"a","pass":"[REDACTED]","tls_required":false,"name":"NATS Sample Responder","lang":"go","version":"1.9.1","protocol":1,"echo":true}]
    c1          | [1372] 2020/01/10 15:17:49.491212 [TRC] 10.35.68.24:62849 - cid:2 - <<- [PING]
    c1          | [1372] 2020/01/10 15:17:49.491212 [TRC] 10.35.68.24:62849 - cid:2 - ->> [PONG]
    c1          | [1372] 2020/01/10 15:17:49.491563 [TRC] 10.35.68.24:62849 - cid:2 - <<- [SUB test.service.1 NATS-RPLY-22 1]
    c1          | [1372] 2020/01/10 15:17:49.491563 [TRC] 10.35.68.24:62849 - cid:2 - <<- [PING]
    c1          | [1372] 2020/01/10 15:17:49.491563 [TRC] 10.35.68.24:62849 - cid:2 - ->> [PONG]
    c2          | [1372] 2020/01/10 15:17:49.636028 [DBG] 172.18.206.186:7422 - lid:1 - LeafNode Ping Timer
    c2          | [1372] 2020/01/10 15:17:49.636282 [TRC] 172.18.206.186:7422 - lid:1 - ->> [PING]
    c1          | [1372] 2020/01/10 15:17:49.636909 [TRC] 172.18.194.70:49157 - lid:1 - <<- [PING]
    c1          | [1372] 2020/01/10 15:17:49.636909 [TRC] 172.18.194.70:49157 - lid:1 - ->> [PONG]
    c2          | [1372] 2020/01/10 15:17:49.637613 [TRC] 172.18.206.186:7422 - lid:1 - <<- [PONG]
    c1          | [1372] 2020/01/10 15:17:49.732680 [DBG] 172.18.194.70:49157 - lid:1 - LeafNode Ping Timer
    c1          | [1372] 2020/01/10 15:17:49.732680 [TRC] 172.18.194.70:49157 - lid:1 - ->> [PING]
    c2          | [1372] 2020/01/10 15:17:49.717524 [TRC] 172.18.206.186:7422 - lid:1 - <<- [PING]
    c2          | [1372] 2020/01/10 15:17:49.717524 [TRC] 172.18.206.186:7422 - lid:1 - ->> [PONG]
    c1          | [1372] 2020/01/10 15:17:49.732680 [TRC] 172.18.194.70:49157 - lid:1 - <<- [PONG]
    c1          | [1372] 2020/01/10 15:17:51.714580 [DBG] 10.35.68.24:62849 - cid:2 - Client Ping Timer
    c1          | [1372] 2020/01/10 15:17:51.714580 [TRC] 10.35.68.24:62849 - cid:2 - ->> [PING]
    c1          | [1372] 2020/01/10 15:17:51.714580 [TRC] 10.35.68.24:62849 - cid:2 - <<- [PONG]
    c1          | [1372] 2020/01/10 15:18:00.301474 [DBG] 10.35.68.24:62850 - cid:3 - Client connection created
    c1          | [1372] 2020/01/10 15:18:00.302611 [TRC] 10.35.68.24:62850 - cid:3 - <<- [CONNECT {"verbose":false,"pedantic":false,"user":"a","pass":"[REDACTED]","tls_required":false,"name":"NATS Sample Requestor","lang":"go","version":"1.9.1","protocol":1,"echo":true}]
    c1          | [1372] 2020/01/10 15:18:00.302866 [TRC] 10.35.68.24:62850 - cid:3 - <<- [PING]
    c1          | [1372] 2020/01/10 15:18:00.302866 [TRC] 10.35.68.24:62850 - cid:3 - ->> [PONG]
    c1          | [1372] 2020/01/10 15:18:00.302866 [TRC] 10.35.68.24:62850 - cid:3 - <<- [SUB _INBOX.W7P0kJjrbQVrbmzAqqk6V1.*  1]
    c1          | [1372] 2020/01/10 15:18:00.302866 [TRC] 10.35.68.24:62850 - cid:3 - <<- [PUB test.service.1 _INBOX.W7P0kJjrbQVrbmzAqqk6V1.9cn7513D 3]
    c1          | [1372] 2020/01/10 15:18:00.302866 [TRC] 10.35.68.24:62850 - cid:3 - <<- MSG_PAYLOAD: ["foo"]
    c1          | [1372] 2020/01/10 15:18:00.302866 [TRC] 10.35.68.24:62849 - cid:2 - ->> [PING]
    c1          | [1372] 2020/01/10 15:18:00.302866 [TRC] 10.35.68.24:62849 - cid:2 - ->> [MSG test.service.1 1 _INBOX.W7P0kJjrbQVrbmzAqqk6V1.9cn7513D 3]
    c1          | [1372] 2020/01/10 15:18:00.303903 [TRC] 10.35.68.24:62849 - cid:2 - <<- [PONG]
    c1          | [1372] 2020/01/10 15:18:00.304384 [TRC] 10.35.68.24:62849 - cid:2 - <<- [PUB _INBOX.W7P0kJjrbQVrbmzAqqk6V1.9cn7513D 13]
    c1          | [1372] 2020/01/10 15:18:00.304384 [TRC] 10.35.68.24:62849 - cid:2 - <<- MSG_PAYLOAD: ["response text"]
    c1          | [1372] 2020/01/10 15:18:00.304384 [TRC] 10.35.68.24:62850 - cid:3 - ->> [MSG _INBOX.W7P0kJjrbQVrbmzAqqk6V1.9cn7513D 1 13]
    c1          | [1372] 2020/01/10 15:18:00.305527 [DBG] 10.35.68.24:62850 - cid:3 - Client connection closed
    c1          | [1372] 2020/01/10 15:18:00.307546 [TRC] 10.35.68.24:62850 - cid:3 - <-> [DELSUB 1]
    c1          | [1372] 2020/01/10 15:18:03.175280 [DBG] 10.35.68.24:62865 - cid:4 - Client connection created
    c1          | [1372] 2020/01/10 15:18:03.176364 [TRC] 10.35.68.24:62865 - cid:4 - <<- [CONNECT {"verbose":false,"pedantic":false,"user":"b","pass":"[REDACTED]","tls_required":false,"name":"NATS Sample Requestor","lang":"go","version":"1.9.1","protocol":1,"echo":true}]
    c1          | [1372] 2020/01/10 15:18:03.176364 [TRC] 10.35.68.24:62865 - cid:4 - <<- [PING]
    c1          | [1372] 2020/01/10 15:18:03.176364 [TRC] 10.35.68.24:62865 - cid:4 - ->> [PONG]
    c1          | [1372] 2020/01/10 15:18:03.176364 [TRC] 10.35.68.24:62865 - cid:4 - <<- [SUB _INBOX.4ynIPqChOQMSroNEZqndLx.*  1]
    c1          | [1372] 2020/01/10 15:18:03.177312 [TRC] 172.18.194.70:49157 - lid:1 - ->> [LS+ _INBOX.4ynIPqChOQMSroNEZqndLx.*]
    c1          | [1372] 2020/01/10 15:18:03.177312 [TRC] 10.35.68.24:62865 - cid:4 - <<- [PUB test.service.1 _INBOX.4ynIPqChOQMSroNEZqndLx.HhaycK1D 3]
    c1          | [1372] 2020/01/10 15:18:03.177312 [TRC] 10.35.68.24:62865 - cid:4 - <<- MSG_PAYLOAD: ["foo"]
    c1          | [1372] 2020/01/10 15:18:03.177312 [TRC] 10.35.68.24:62849 - cid:2 - ->> [MSG test.service.1 1 _R_.ie4QZJ.5bq99K 3]
    c2          | [1372] 2020/01/10 15:18:03.177521 [TRC] 172.18.206.186:7422 - lid:1 - <<- [LS+ _INBOX.4ynIPqChOQMSroNEZqndLx.*]
    c1          | [1372] 2020/01/10 15:18:03.178465 [TRC] 10.35.68.24:62849 - cid:2 - <<- [PUB _R_.ie4QZJ.5bq99K 13]
    c1          | [1372] 2020/01/10 15:18:03.178465 [TRC] 10.35.68.24:62849 - cid:2 - <<- MSG_PAYLOAD: ["response text"]
    c1          | [1372] 2020/01/10 15:18:03.178530 [TRC] 10.35.68.24:62865 - cid:4 - ->> [MSG _INBOX.4ynIPqChOQMSroNEZqndLx.HhaycK1D 1 13]
    c1          | [1372] 2020/01/10 15:18:03.179615 [DBG] 10.35.68.24:62865 - cid:4 - Client connection closed
    c1          | [1372] 2020/01/10 15:18:03.180602 [TRC] 10.35.68.24:62865 - cid:4 - <-> [DELSUB 1]
    c1          | [1372] 2020/01/10 15:18:03.180602 [TRC] 172.18.194.70:49157 - lid:1 - ->> [LS- _INBOX.4ynIPqChOQMSroNEZqndLx.*]
    c2          | [1372] 2020/01/10 15:18:03.179385 [TRC] 172.18.206.186:7422 - lid:1 - <<- [LS- _INBOX.4ynIPqChOQMSroNEZqndLx.*]
    c2          | [1372] 2020/01/10 15:18:03.181119 [TRC] 172.18.206.186:7422 - lid:1 - <-> [DELSUB _INBOX.4ynIPqChOQMSroNEZqndLx.*]
    c2          | [1372] 2020/01/10 15:18:05.761225 [DBG] 10.35.68.24:62866 - cid:2 - Client connection created
    c2          | [1372] 2020/01/10 15:18:05.762291 [TRC] 10.35.68.24:62866 - cid:2 - <<- [CONNECT {"verbose":false,"pedantic":false,"user":"c","pass":"[REDACTED]","tls_required":false,"name":"NATS Sample Requestor","lang":"go","version":"1.9.1","protocol":1,"echo":true}]
    c2          | [1372] 2020/01/10 15:18:05.762524 [TRC] 10.35.68.24:62866 - cid:2 - <<- [PING]
    c2          | [1372] 2020/01/10 15:18:05.762524 [TRC] 10.35.68.24:62866 - cid:2 - ->> [PONG]
    c2          | [1372] 2020/01/10 15:18:05.762524 [TRC] 10.35.68.24:62866 - cid:2 - <<- [SUB _INBOX.TfzSpQyvrMigTw0TP7cMHt.*  1]
    c2          | [1372] 2020/01/10 15:18:05.762524 [TRC] 172.18.206.186:7422 - lid:1 - ->> [LS+ _INBOX.TfzSpQyvrMigTw0TP7cMHt.*]
    c2          | [1372] 2020/01/10 15:18:05.762524 [TRC] 10.35.68.24:62866 - cid:2 - <<- [PUB test.service.1 _INBOX.TfzSpQyvrMigTw0TP7cMHt.05sUrsio 3]
    c2          | [1372] 2020/01/10 15:18:05.762524 [TRC] 10.35.68.24:62866 - cid:2 - <<- MSG_PAYLOAD: ["foo"]
    c2          | [1372] 2020/01/10 15:18:05.762524 [TRC] 172.18.206.186:7422 - lid:1 - ->> [LMSG test.service.1 _INBOX.TfzSpQyvrMigTw0TP7cMHt.05sUrsio 3]
    c1          | [1372] 2020/01/10 15:18:05.763695 [TRC] 172.18.194.70:49157 - lid:1 - <<- [LS+ _INBOX.TfzSpQyvrMigTw0TP7cMHt.*]
    c1          | [1372] 2020/01/10 15:18:05.763912 [TRC] 172.18.194.70:49157 - lid:1 - <<- [LMSG test.service.1 _INBOX.TfzSpQyvrMigTw0TP7cMHt.05sUrsio 3]
    c1          | [1372] 2020/01/10 15:18:05.763912 [TRC] 172.18.194.70:49157 - lid:1 - <<- MSG_PAYLOAD: ["foo"]
    c1          | [1372] 2020/01/10 15:18:05.763912 [TRC] 10.35.68.24:62849 - cid:2 - ->> [MSG test.service.1 1 _R_.ie4QZJ.x9JGBo 3]
    c1          | [1372] 2020/01/10 15:18:05.763912 [TRC] 10.35.68.24:62849 - cid:2 - <<- [PUB _R_.ie4QZJ.x9JGBo 13]
    c1          | [1372] 2020/01/10 15:18:05.763912 [TRC] 10.35.68.24:62849 - cid:2 - <<- MSG_PAYLOAD: ["response text"]
    c1          | [1372] 2020/01/10 15:18:05.763912 [TRC] 172.18.194.70:49157 - lid:1 - ->> [LMSG _INBOX.TfzSpQyvrMigTw0TP7cMHt.05sUrsio 13]
    c2          | [1372] 2020/01/10 15:18:05.764649 [TRC] 172.18.206.186:7422 - lid:1 - <<- [LMSG _INBOX.TfzSpQyvrMigTw0TP7cMHt.05sUrsio 13]
    c2          | [1372] 2020/01/10 15:18:05.764649 [TRC] 172.18.206.186:7422 - lid:1 - <<- MSG_PAYLOAD: ["response text"]
    c2          | [1372] 2020/01/10 15:18:05.765070 [TRC] 10.35.68.24:62866 - cid:2 - ->> [MSG _INBOX.TfzSpQyvrMigTw0TP7cMHt.05sUrsio 1 13]
    c2          | [1372] 2020/01/10 15:18:05.766173 [DBG] 10.35.68.24:62866 - cid:2 - Client connection closed
    c2          | [1372] 2020/01/10 15:18:05.766411 [TRC] 10.35.68.24:62866 - cid:2 - <-> [DELSUB 1]
    c2          | [1372] 2020/01/10 15:18:05.766411 [TRC] 172.18.206.186:7422 - lid:1 - ->> [LS- _INBOX.TfzSpQyvrMigTw0TP7cMHt.*]
    c1          | [1372] 2020/01/10 15:18:05.766060 [TRC] 172.18.194.70:49157 - lid:1 - <<- [LS- _INBOX.TfzSpQyvrMigTw0TP7cMHt.*]
    c1          | [1372] 2020/01/10 15:18:05.766060 [TRC] 172.18.194.70:49157 - lid:1 - <-> [DELSUB _INBOX.TfzSpQyvrMigTw0TP7cMHt.*]
    c2          | [1372] 2020/01/10 15:18:07.378670 [DBG] 10.35.68.24:62867 - cid:3 - Client connection created
    c2          | [1372] 2020/01/10 15:18:07.378670 [TRC] 10.35.68.24:62867 - cid:3 - <<- [CONNECT {"verbose":false,"pedantic":false,"user":"d","pass":"[REDACTED]","tls_required":false,"name":"NATS Sample Requestor","lang":"go","version":"1.9.1","protocol":1,"echo":true}]
    c2          | [1372] 2020/01/10 15:18:07.378670 [TRC] 10.35.68.24:62867 - cid:3 - <<- [PING]
    c2          | [1372] 2020/01/10 15:18:07.379670 [TRC] 10.35.68.24:62867 - cid:3 - ->> [PONG]
    c2          | [1372] 2020/01/10 15:18:07.379746 [TRC] 10.35.68.24:62867 - cid:3 - <<- [SUB _INBOX.89dvNgB1mAb4aZo4PLaWJz.*  1]
    c2          | [1372] 2020/01/10 15:18:07.380243 [TRC] 10.35.68.24:62867 - cid:3 - <<- [PUB test.service.1 _INBOX.89dvNgB1mAb4aZo4PLaWJz.WyXS3UnR 3]
    c2          | [1372] 2020/01/10 15:18:07.380243 [TRC] 10.35.68.24:62867 - cid:3 - <<- MSG_PAYLOAD: ["foo"]
    c2          | [1372] 2020/01/10 15:18:07.380243 [TRC] 172.18.206.186:7422 - lid:1 - ->> [LMSG test.service.1 _R_.BBa91r.hQOCWj 3]
    c1          | [1372] 2020/01/10 15:18:07.380535 [TRC] 172.18.194.70:49157 - lid:1 - <<- [LMSG test.service.1 _R_.BBa91r.hQOCWj 3]
    c1          | [1372] 2020/01/10 15:18:07.380535 [TRC] 172.18.194.70:49157 - lid:1 - <<- MSG_PAYLOAD: ["foo"]
    c1          | [1372] 2020/01/10 15:18:07.380747 [TRC] 10.35.68.24:62849 - cid:2 - ->> [MSG test.service.1 1 _R_.ie4QZJ.tVfoKl 3]
    c1          | [1372] 2020/01/10 15:18:07.380747 [TRC] 10.35.68.24:62849 - cid:2 - <<- [PUB _R_.ie4QZJ.tVfoKl 13]
    c1          | [1372] 2020/01/10 15:18:07.380747 [TRC] 10.35.68.24:62849 - cid:2 - <<- MSG_PAYLOAD: ["response text"]
    c2          | [1372] 2020/01/10 15:18:09.386622 [DBG] 10.35.68.24:62867 - cid:3 - Client connection closed
    c2          | [1372] 2020/01/10 15:18:09.386891 [TRC] 10.35.68.24:62867 - cid:3 - <-> [DELSUB 1]
    Gracefully stopping... (press Ctrl+C again to force)
    Stopping c2   ... done
    Stopping c1   ... done
    
    • [x] Included a [Minimal, Complete, and Verifiable example](https://stackoverflow.com/help/mcve)

    Versions of nats-server and affected client libraries used:

    See logs. The Go examples are as of commit f66f9c02346dc33296576bf0ef4bd48520bf88c9.

    OS/Container environment:

    Windows nanoserver

    Steps or code to reproduce the issue:

    docker-compose.yml

    version: "3.2"
    
    services:
     cluster1: 
       image: nats:2.1.2-nanoserver
       container_name: c1
       command: -c C:\\mount\\c1 -DV
       ports: 
         - 80:8222
         - 4244:4244
       expose:
         - "7422"
       volumes:
         - .\:C:\mount\
       networks:
         - cluster1
       restart: always
     cluster2: 
       depends_on: 
         - cluster1
       image: nats:2.1.2-nanoserver
       container_name: c2
       command: -c C:\\mount\\c2 -DV
       ports: 
         - 81:8222
         - 4245:4244
       expose:
         - "7422"
       volumes:
         - .\:C:\mount\
       networks:
         - cluster1
       restart: always
     
    networks:
     cluster1:
    

    cluster 1 config:

    port: 4244
    monitor_port: 8222
    accounts: {
      A: {
        users:[{
          user: a
          password: a
        }]
        exports: [
          {service: test.service.>}
        ]
      },
      B: {
        users:[{
           user: b
            password: b
        }]
        imports: [
          {service: {account: A, subject: test.service.1}}
        ]
      }
    }
    
    leafnodes {
      port: 7422
      authorization {
        account: B
      }
    }
    

    cluster 2 config:

    port: 4244
    monitor_port: 8222
    accounts: {
      C: {
        users:[{
          user: c
          password: c
        }]
        exports: [
          {service: test.service.>}
        ]
      },
      D: {
        users:[{
           user: d
           password: d
        }]
        imports: [
          {service: {account: C, subject: test.service.1}}
        ]
      }
    }
    leafnodes {
      remotes: [
        {
          urls: [
            nats-leaf://c1:7422
          ]
          account: C
        }
      ]
    }
    

    Starting a nats-rply: start "cluster1 Account A service" nats-rply -s nats://a:a@127.0.0.1:4244 test.service.1 "response text"

    Sending request to account D: nats-req -s nats://d:d@127.0.0.1:4245 test.service.1 foo

    Expected result:

    Request is sent from account D on cluster 2 to service listening at test.service.1 on Account A on cluster 1, and the requester gets "response text" back.

    Actual result:

    The service listening at test.service.1 gets a request of 'foo', but no message is returned to the requester. Instead: "nats: timeout for request"

    opened by cjmang 41
  • Support WebSocket Connectivity

    Support WebSocket Connectivity

    Hi,

    At @gretaio, we need our signaling server to talk with web browsers, and to do this we set up a small proxy to gateway WebSocket to TCP so browsers can talk to NATS.
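
    A sketch of that kind of gateway, assuming the github.com/gorilla/websocket package (the listen address, path, and NATS port here are illustrative): each browser WebSocket is bridged byte-for-byte to a plain TCP connection to nats-server.

    // wsproxy.go: bridge WebSocket clients to a local nats-server TCP port.
    package main
    
    import (
    	"net"
    	"net/http"
    
    	"github.com/gorilla/websocket"
    )
    
    var upgrader = websocket.Upgrader{CheckOrigin: func(*http.Request) bool { return true }}
    
    func bridge(w http.ResponseWriter, r *http.Request) {
    	ws, err := upgrader.Upgrade(w, r, nil)
    	if err != nil {
    		return
    	}
    	defer ws.Close()
    
    	tcp, err := net.Dial("tcp", "localhost:4222") // nats-server client port
    	if err != nil {
    		return
    	}
    	defer tcp.Close()
    
    	// TCP -> WebSocket: frame raw NATS protocol bytes as binary messages.
    	go func() {
    		buf := make([]byte, 32*1024)
    		for {
    			n, err := tcp.Read(buf)
    			if n > 0 && ws.WriteMessage(websocket.BinaryMessage, buf[:n]) != nil {
    				return
    			}
    			if err != nil {
    				return
    			}
    		}
    	}()
    
    	// WebSocket -> TCP: unwrap frames and forward the payload.
    	for {
    		_, data, err := ws.ReadMessage()
    		if err != nil {
    			return
    		}
    		if _, err := tcp.Write(data); err != nil {
    			return
    		}
    	}
    }
    
    func main() {
    	http.HandleFunc("/nats", bridge)
    	http.ListenAndServe(":8080", nil)
    }

    (Native WebSocket support did later ship in the server itself via a websocket config block, which makes this kind of shim unnecessary on recent releases.)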

    I saw on the todo list that you plan on adding a WebSocket strategy, and that's something we would greatly appreciate, as it would basically halve the number of connections we need to keep open :+1:.

    So, would you be open to a PR for this?

    idea customer requested 
    opened by pldubouilh 40
  • logging system, syslog and abstraction improvements

    logging system, syslog and abstraction improvements

    This is a WIP; here is a little roadmap and some questions I have.

    • [x] Create server.Logger interface and add server.SetLogger method
    • [x] Modify all the existing calls to the new format
    • [x] Network syslog
    • [x] Add gnatsd options and hook them up to network syslog
    • [x] fix this race condition on server.SetLogger()
    • [x] Tests
    • [x] Transform error messages into real errors

    Questions:

    Related to: https://github.com/apcera/gnatsd/issues/7

    opened by mcuadros 39
  • Consumers stop receiving messages

    Consumers stop receiving messages

    Defect

    Versions of nats-server and affected client libraries used:

    Nats server version

    [83] 2021/09/04 18:51:12.239432 [INF] Starting nats-server
    [83] 2021/09/04 18:51:12.239488 [INF]   Version:  2.4.0
    [83] 2021/09/04 18:51:12.239494 [INF]   Git:      [219a7c98]
    [83] 2021/09/04 18:51:12.239496 [DBG]   Go build: go1.16.7
    [83] 2021/09/04 18:51:12.239517 [INF]   Name:     NBVE7O7DMRAZ63STC7Z644KHF5HJ6QQUGLZVGDIKEG32CFL2J6O2456M
    [83] 2021/09/04 18:51:12.239533 [INF]   ID:       NBVE7O7DMRAZ63STC7Z644KHF5HJ6QQUGLZVGDIKEG32CFL2J6O2456M
    [83] 2021/09/04 18:51:12.239605 [DBG] Created system account: "$SYS"
    

    Go client version: v1.12.0

    OS/Container environment:

    GKE Kubernetes, running a NATS JetStream HA cluster. Deployed via the nats Helm chart.

    Steps or code to reproduce the issue:

    Stream configuration:

    apiVersion: jetstream.nats.io/v1beta1
    kind: Stream
    metadata:
      name: agent
    spec:
      name: agent
      subjects: ["data.*"]
      storage: file
      maxAge: 1h
      replicas: 3
      retention: interest
    

    There are two consumers of this stream. Each runs as a queue subscriber in one of two services, each with 2 pod replicas. Note that I don't care if a message is not processed; this is why ack none is set.

    
    // 2 pods for service A.
    js.QueueSubscribe(
    	"data.received",
    	"service1_queue",
    	func(msg *nats.Msg) {},
    	nats.DeliverNew(),
    	nats.AckNone(),
    )
    
    // 2 pods for service B.
    s.js.QueueSubscribe(
    	"data.received",
    	"service2_queue",
    	func(msg *nats.Msg) {},
    	nats.DeliverNew(),
    	nats.AckNone(),
    )
    

    Expected result:

    Consumer receives messages.

    Actual result:

    Stream stats after a few days:

    agent                  │ File    │ 3         │ 28,258   │ 18 MiB  │ 0    │ 84      │ nats-js-0, nats-js-1*, nats-js-2
    

    Consumers stats:

    service1_queue │ Push │ None       │ 0.00s    │ 0           │ 0           │ 0           │ 60,756    │ nats-js-0, nats-js-1*, nats-js-2
    service2_queue │ Push │ None       │ 0.00s    │ 0           │ 0           │ 8,193 / 28% │ 60,843    │ nats-js-0, nats-js-1*, nats-js-2
    
    1. None of the nats-server pods' logs contain errors indicating any problem.
    2. The unprocessed message count for the second consumer stays the same and doesn't decrease.
    3. The only fix that helped was changing the second consumer's Raft leader with nats consumer cluster step-down. But after some time the problem comes back.
    4. There are active connections to the server. Checked with nats server report connections.

    /cc @kozlovic @derekcollison

    🐞 bug 
    opened by anjmao 38
  • Resolves #3682: leaf node fails to reconnect, due to ping messages being held off indefinitely

    Resolves #3682: leaf node fails to reconnect, due to ping messages being held off indefinitely

    This contribution is my original work and I license the work to the project under the Apache 2 license

    Changes proposed in this pull request:

    • don't set cp.last in client::flushClients()

    /cc @nats-io/core

    opened by sandykellagher 3
  • Leaf node fails to reconnect, due to ping messages being held off indefinitely

    Leaf node fails to reconnect, due to ping messages being held off indefinitely

    Defect

    We are using a LeafNode NATS server to connect to a cluster, and we see a strange effect which prevents the LeafNode from reconnecting properly in the event that its link to the cluster goes down.

    The LeafNode has two local clients which connect to it and generate traffic on a continuous basis. In short, this continuous client traffic results in the outbound pings from the LeafNode to the cluster being held off indefinitely, with the messages "Leaf Node Ping Timer" and "Delaying Ping due to client activity". And because the pings are held off, the LeafNode doesn't detect a stale connection, and hence doesn't close the connection and attempt to reconnect.

    I believe I understand the issue.

    In the NATS server's client.go::processPingTimer() there is a check to decide whether to delay sending an outgoing ping, which happens in two cases:

    • we recently (within specified pingInterval) received a data message or sub/unsub message from the remote end (Client activity)
    • we recently received a ping from the remote end (Remote ping)

    This makes perfect sense: incoming receive messages mean we still have a link and don't need to send a ping.

    However, the first test above is driven by the client.last field (the "last packet" time), which is set in two cases:

    • when the readLoop parsing determines we have received a new message or sub/unsub
    • in flushClients() routine, which is called when we have received some messages that need to be forwarded to other client connections

    But I believe that this second case is incorrect: we should only hold off pings when there has been receive traffic on the connection in question, and that isn't so in the second case. We received traffic on one connection, but we are resetting the c.last field on another connection, the one to which we are forwarding/sending the message.

    If I remove the line of code in flushClients() (around line 1113) that updates cp.last, then the Stale Client Connection detection fires fine in my testing.
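
    To make the mechanism concrete, here is a reduced illustration of the hold-off logic described above; this is not the server's actual code, just the shape of the bug:

    package main
    
    import "time"
    
    type conn struct {
    	last time.Time // "last packet" time for this connection
    }
    
    // onRead is the legitimate liveness signal: the peer sent us something.
    func (c *conn) onRead() { c.last = time.Now() }
    
    // onForward models the flushClients() case called out above: bumping
    // c.last when we merely *send* to this connection means a peer that only
    // receives forwarded traffic never looks stale.
    func (c *conn) onForward() { c.last = time.Now() }
    
    // shouldDelayPing holds off the keepalive ping after recent "activity",
    // which, with onForward in play, may be purely outbound.
    func (c *conn) shouldDelayPing(pingInterval time.Duration) bool {
    	return time.Since(c.last) < pingInterval
    }
    
    func main() {
    	c := &conn{last: time.Now()}
    	c.onForward()                          // outbound-only traffic still refreshes c.last...
    	_ = c.shouldDelayPing(2 * time.Minute) // ...so the ping keeps being delayed
    }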

    Versions of nats-server and affected client libraries used:

    The latest master NATS server as of 2/12/2022, e.g. git commit c4c87612

    OS/Container environment:

    Linux ARM64 - but not a factor

    Steps or code to reproduce the issue:

    Configure a leaf node server with local clients sending traffic that an external client is subscribed to, then break the connection to the external NATS cluster.

    Expected result:

    The leaf node server should automatically trigger a reconnect when the connection to the external NATS cluster is lost, with the detection time as configured by the ping_interval and ping_max parameters.

    Actual result:

    The leaf node server does not automatically reconnect. Instead, it remains in a zombie state indefinitely. To be more precise, it might recover when the network stack's TCP keepalive timeout expires (by default 2 hours), but that is much too long to be useful.

    🐞 bug 
    opened by sandykellagher 4
  • tag policies not honored in reassignment after peer remove

    tag policies not honored in reassignment after peer remove

    Resolves an issue in which tag policies (tag affinity and the tag uniqueness constraint) were not honored when selecting a new peer for a stream RG to replace one just removed.

    /cc @nats-io/core

    opened by tbeets 0
  • JetStream invalid file usage causing server crash when reaching max storage usage

    JetStream invalid file usage causing server crash when reaching max storage usage

    Defect

    Make sure that these boxes are checked before submitting your issue -- thank you!

    • [x] Included nats-server -DV output
    • [x] Included a [Minimal, Complete, and Verifiable example](https://stackoverflow.com/help/mcve)

    Versions of nats-server and affected client libraries used:

    2.9.8

    OS/Container environment:

    K8s - 2.9.8-alpine

    Steps or code to reproduce the issue:

    [screenshots: reported JetStream file usage vs. actual disk usage]

    Expected result:

    File usage should be around 1 GB, matching the hard disk usage.

    Actual result:

    File usage is around 10 GB while the hard disk usage is around 1 GB.

    🐞 bug 
    opened by tibrn 34
  • Nats memory usage limit

    Nats memory usage limit

    Hi, we are fond of NATS JetStream for the way it processes 100-1000 byte messages compared to its competitors. Our target is to capitalise on its potential for an OLTP use case (as IPC) under financial transaction processing requirements, and likewise to consider it for an enterprise logging architecture. What we noticed during an in-house benchmark using custom testing tools is that it translates a high-traffic workload into network IO aggressively. Our tool pumped around 1 million messages of 100 bytes each in 59 seconds, on a Windows Server 2019 machine with 6 virtual cores, 16 GB of RAM, and a 7200 RPM SATA disk. As a result, NATS JetStream consumed up to 7 Gbps of the 10 Gbps Ethernet bandwidth. Our NATS JetStream version is ??

    We found the following two configuration options for curbing/controlling NATS JetStream traffic on the Ethernet:

    • max_outstanding_catchup: This one is not working, if we have correctly assumed it applies to our problem (https://nats.io/blog/nats-server-29-release/).
    • GOMEMLIMIT: This variable is primarily used to curb memory in a containerised environment. We are not sure it will also limit network IO, especially in our case, which is outside a container (NATS server on Windows).
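
    From the 2.9 release notes linked above, our understanding is that max_outstanding_catchup only bounds the in-flight bytes a replica uses while catching up a stream, not steady-state traffic, and that GOMEMLIMIT is the Go runtime's soft heap limit (Go 1.19+); neither appears to be a network-IO throttle, so suggestions are welcome.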

    opened by KhurramShahzadODM 6