A distributed, fault-tolerant pipeline for observability data

Overview

Build Status GoDoc

Table of Contents

What Is Veneur?

Veneur (/vɛnˈʊr/, rhymes with “assure”) is a distributed, fault-tolerant pipeline for runtime data. It provides a server implementation of the DogStatsD protocol or SSF for aggregating metrics and sending them to downstream storage to one or more supported sinks. It can also act as a global aggregator for histograms, sets and counters.

More generically, Veneur is a convenient sink for various observability primitives with lots of outputs!

Use Case

Once you cross a threshold into dozens, hundreds or (gasp!) thousands of machines emitting metric data for an application, you've moved into that world where data about individual hosts is uninteresting except in aggregate form. Instead of paying to store tons of data points and then aggregating them later at read-time, Veneur can calculate global aggregates, like percentiles and forward those along to your time series database, etc.

Veneur is also a StatsD or DogStatsD protocol transport, forwarding the locally collected metrics over more reliable TCP implementations.

Here are some examples of why Stripe and other companies are using Veneur today:

  • reducing cost by pre-aggregating metrics such as timers into percentiles
  • creating a vendor-agnostic metric collection pipeline
  • consolidating disparate observability data (from trace spans to metrics, and more!)
  • improving efficiency over other metric aggregator implementations
  • improving reliability by building a more resilient forwarding system over single points of failure

See Also

We wanted percentiles, histograms and sets to be global. We wanted to unify our observability clients, be vendor agnostic and build automatic features like SLI measurement. Veneur helps us do all this and more!

Status

Veneur is currently handling all metrics for Stripe and is considered production ready. It is under active development and maintenance! Starting with v1.6, Veneur operates on a six-week release cycle, and all releases are tagged in git. If you'd like to contribute, see CONTRIBUTING!

Building Veneur requires Go 1.11 or later.

Features

Vendor And Backend Agnostic

Veneur has many sinks such that your data can be sent one or more vendors, TSDBs or tracing stores!

Modern Metrics Format (Or Others!)

Unify metrics, spans and logs via the Sensor Sensibility Format. Also works with DogStatsD, StatsD and Prometheus.

Global Aggregation

If configured to do so, Veneur can selectively aggregate global metrics to be cumulative across all instances that report to a central Veneur, allowing global percentile calculation, global counters or global sets.

For example, say you emit a timer foo.bar.call_duration_ms from 20 hosts that are configured to forward to a central Veneur. You'll see the following:

  • Metrics that have been "globalized"
    • foo.bar.call_duration_ms.50percentile: the p50 across all hosts, by tag
    • foo.bar.call_duration_ms.90percentile: the p90 across all hosts, by tag
    • foo.bar.call_duration_ms.95percentile: the p95 across all hosts, by tag
    • foo.bar.call_duration_ms.99percentile: the p99 across all hosts, by tag
  • Metrics that remain host-local
    • foo.bar.call_duration_ms.avg: by-host tagged average
    • foo.bar.call_duration_ms.count: by-host tagged count which (when summed) shows the total count of times this metric was emitted
    • foo.bar.call_duration_ms.max: by-host tagged maximum value
    • foo.bar.call_duration_ms.median: by-host tagged median value
    • foo.bar.call_duration_ms.min: by-host tagged minimum value
    • foo.bar.call_duration_ms.sum: by-host tagged sum value representing the total time

Clients can choose to override this behavior by including the tag veneurlocalonly.

Approximate Histograms

Because Veneur is built to handle lots and lots of data, it uses approximate histograms. We have our own implementation of Dunning's t-digest, which has bounded memory consumption and reduced error at extreme quantiles. Metrics are consistently routed to the same worker to distribute load and to be added to the same histogram.

Datadog's DogStatsD — and StatsD — uses an exact histogram which retains all samples and is reset every flush period. This means that there is a loss of precision when using Veneur, but the resulting percentile values are meant to be more representative of a global view.

Datadog Distributions

Because Veneur already handles "global" histograms, any DogStatsD packets received with type dDatadog's distribution type — will be considered a histogram and therefore compatible with all sinks. Veneur does not send any metrics to Datadog typed as a Datadog-native distribution.

Approximate Sets

Veneur uses HyperLogLogs for approximate unique sets. These are a very efficient unique counter with fixed memory consumption.

Global Counters

Via an optional magic tag Veneur will forward counters to a global host for accumulation. This feature was primarily developed to control tag cardinality. Some counters are valuable but do not require per-host tagging.

Concepts

  • Global metrics are those that benefit from being aggregated for chunks — or all — of your infrastructure. These are histograms (including the percentiles generated by timers) and sets.
  • Metrics that are sent to another Veneur instance for aggregation are said to be "forwarded". This terminology helps to decipher configuration and metric options below.
  • Flushed, in Veneur, means metrics or spans processed by a sink.

By Metric Type Behavior

To clarify how each metric type behaves in Veneur, please use the following:

  • Counters: Locally accrued, flushed to sinks (see magic tags for global version)
  • Gauges: Locally accrued, flushed to sinks (see magic tags for global version)
  • Histograms: Locally accrued, count, max and min flushed to sinks, percentiles forwarded to forward_address for global aggregation when set.
  • Timers: Locally accrued, count, max and min flushed to sinks, percentiles forwarded to forward_address for global aggregation when set.
  • Sets: Locally accrued, forwarded to forward_address for sinks aggregation when set.

Expiration

Veneur expires all metrics on each flush. If a metric is no longer being sent (or is sent sparsely) Veneur will not send it as zeros! This was chosen because the combination of the approximation's features and the additional hysteresis imposed by retaining these approximations over time was deemed more complex than desirable.

Other Notes

  • Veneur aligns its flush timing with the local clock. For the default interval of 10s Veneur will generally emit metrics at 00, 10, 20, 30, … seconds after the minute.
  • Veneur will delay it's first metric emission to align the clock as stated above. This may result in a brief quiet period on a restart at worst < interval seconds long.

Usage

veneur -f example.yaml

See example.yaml for a sample config. Be sure to set the appropriate *_api_key!

Setup

Here we'll document some explanations of setup choices you may make when using Veneur.

Clients

Veneur is capable of ingesting:

  • DogStatsD including events and service checks
  • SSF
  • StatsD as a subset of DogStatsD, but this may cause trouble depending on where you store your metrics.

To use clients with Veneur you need only configure your client of choice to the proper host and port combination. This port should match one of:

  • statsd_listen_addresses for UDP- and TCP-based clients
  • ssf_listen_addresses for SSF-based clients using UDP or UNIX domain sockets.
  • grpc_listen_addresses for both SSF and dogstatsd based clients using GRPC (over TCP).

Einhorn Usage

When you upgrade Veneur (deploy, stop, start with new binary) there will be a brief period where Veneur will not be able to handle HTTP requests. At Stripe we use Einhorn as a shared socket manager to bridge the gap until Veneur is ready to handle HTTP requests again.

You'll need to consult Einhorn's documentation for installation, setup and usage. But once you've done that you can tell Veneur to use Einhorn by setting http_address to [email protected]. This informs goji/bind to use its Einhorn handling code to bind to the file descriptor for HTTP.

Forwarding

Veneur instances can be configured to forward their global metrics to another Veneur instance. You can use this feature to get the best of both worlds: metrics that benefit from global aggregation can be passed up to a single global Veneur, but other metrics can be published locally with host-scoped information. Note: Forwarding adds an additional delay to metric availability corresponding to the value of the interval configuration option, as the local veneur will flush it to its configured upstream, which will then flush any recieved metrics when its interval expires.

If a local instance receives a histogram or set, it will publish the local parts of that metric (the count, min and max) directly to sinks, but instead of publishing percentiles, it will package the entire histogram and send it to the global instance. The global instance will aggregate all the histograms together and publish their percentiles to sinks.

Note that the global instance can also receive metrics over UDP. It will publish a count, min and max for the samples that were sent directly to it, but not counting any samples from other Veneur instances (this ensures that things don't get double-counted). You can even chain multiple levels of forwarding together if you want. This might be useful if, for example, your global Veneur is under too much load. The root of the tree will be the Veneur instance that has an empty forward_address. (Do not tell a Veneur instance to forward metrics to itself. We don't support that and it doesn't really make sense in the first place.)

With respect to the tags configuration option, the tags that will be added are those of the Veneur that actually publishes to a sink. If a local instance forwards its histograms and sets to a global instance, the local instance's tags will not be attached to the forwarded structures. It will still use its own tags for the other metrics it publishes, but the percentiles will get extra tags only from the global instance.

Proxy

To improve availability, you can leverage veneur-proxy in conjunction with Consul service discovery.

The proxy can be configured to query the Consul API for instances of a service using consul_forward_service_name. Each healthy instance is then entered in to a hash ring. When choosing which host to forward to, Veneur will use a combination of metric name and tags to consistently choose the same host for forwarding.

See more documentation for Proxy Veneur.

Static Configuration

For static configuration you need one Veneur, which we'll call the global instance, and one or more other Veneurs, which we'll call local instances. The local instances should have their forward_address configured to the global instance's http_address. The global instance should have an empty forward_address (ie just don't set it). You can then report metrics to any Veneur's statsd_listen_addresses as usual.

Magic Tag

If you want a metric to be strictly host-local, you can tell Veneur not to forward it by including a veneurlocalonly tag in the metric packet, eg foo:1|h|#veneurlocalonly. This tag will not actually appear in storage; Veneur removes it.

Global Counters And Gauges

Relatedly, if you want to forward a counter or gauge to the global Veneur instance to reduce tag cardinality, you can tell Veneur to flush it to the global instance by including a veneurglobalonly tag in the metric's packet. This veneurglobalonly tag is stripped and will not be passed on to sinks.

Note: For global counters to report correctly, the local and global Veneur instances should be configured to have the same flush interval.

Note: Global gauges are "random write wins" since they are merged in a non-deterministic order at the global Veneur.

Routing metrics

Veneur supports specifying that metrics should only be routed to a specific metric sink, with the veneursinkonly:<sink_name> tag. The <sink_name> value can be any configured metric sink. Currently, that's datadog, kafka, signalfx, prometheus. It's possible to specify multiple sink destination tags on a metric, which will cause the metric to be routed to each sink specified.

Configuration

Veneur expects to have a config file supplied via -f PATH. The included example.yaml explains all the options!

The config file can be validated using a pair of flags:

  • -validate-config: checks that the config file specified via -f is valid YAML, and has correct datatypes for all fields.
  • -validate-config-strict: checks the above, and also that there are no unknown fields.

Configuration via Environment Variables

Veneur and veneur-proxy each allow configuration via environment variables using envconfig. Options provided via environment variables take precedent over those in config. This allows stuff like:

VENEUR_DEBUG=true veneur -f someconfig.yml

Note: The environment variables used for configuration map to the field names in config.go, capitalized, with the prefix VENEUR_. For example, the environment variable equivalent of datadog_api_hostname is VENEUR_DATADOGAPIHOSTNAME.

You may specify configurations that are arrays by separating them with a comma, for example VENEUR_AGGREGATES="min,max"

Monitoring

Here are the important things to monitor with Veneur:

At Local Node

When running as a local instance, you will be primarily concerned with the following metrics:

  • veneur.flush*.error_total as a count of errors when flushing metrics. This should rarely happen. Occasional errors are fine, but sustained is bad.

Forwarding

If you are forwarding metrics to central Veneur, you'll want to monitor these:

  • veneur.forward.error_total and the cause tag. This should pretty much never happen and definitely not be sustained.
  • veneur.forward.duration_ns and veneur.forward.duration_ns.count. These metrics track the per-host time spent performing a forward. The time should be minimal!

At Global Node

When forwarding you'll want to also monitor the global nodes you're using for aggregation:

  • veneur.import.request_error_total and the cause tag. This should pretty much never happen and definitely not be sustained.
  • veneur.import.response_duration_ns and veneur.import.response_duration_ns.count to monitor duration and number of received forwards. This should not fail and not take very long. How long it takes will depend on how many metrics you're forwarding.
  • And the same veneur.flush.* metrics from the "At Local Node" section.

Metrics

Veneur will emit metrics to the stats_address configured above in DogStatsD form. Those metrics are:

  • veneur.sink.metric_flush_total_duration_ns.* - Duration of flushes per-sink, tagged by sink.
  • veneur.packet.error_total - Number of packets that Veneur could not parse due to some sort of formatting error by the client. Tagged by packet_type and reason.
  • veneur.forward.post_metrics_total - Indicates how many metrics are being forwarded in a given POST request. A "metric", in this context, refers to a unique combination of name, tags and metric type.
  • veneur.*.content_length_bytes.* - The number of bytes in a single POST body. Remember that Veneur POSTs large sets of metrics in multiple separate bodies in parallel. Uses a histogram, so there are multiple metrics generated depending on your local DogStatsD config.
  • veneur.forward.duration_ns - Same as flush.duration_ns, but for forwarding requests.
  • veneur.flush.error_total - Number of errors received POSTing via sinks.
  • veneur.forward.error_total - Number of errors received POSTing to an upstream Veneur. See also import.request_error_total below.
  • veneur.gc.number - Number of completed GC cycles.
  • veneur.gc.pause_total_ns - Total seconds of STW GC since the program started.
  • veneur.mem.heap_alloc_bytes - Total number of reachable and unreachable but uncollected heap objects in bytes.
  • veneur.worker.metrics_processed_total - Total number of metric packets processed between flushes by workers, tagged by worker. This helps you find hot spots where a single worker is handling a lot of metrics. The sum across all workers should be approximately proportional to the number of packets received.
  • veneur.worker.metrics_flushed_total - Total number of metrics flushed at each flush time, tagged by metric_type. A "metric", in this context, refers to a unique combination of name, tags and metric type. You can use this metric to detect when your clients are introducing new instrumentation, or when you acquire new clients.
  • veneur.worker.metrics_imported_total - Total number of metrics received via the importing endpoint. A "metric", in this context, refers to a unique combination of name, tags, type and originating host. This metric indicates how much of a Veneur instance's load is coming from imports.
  • veneur.import.response_duration_ns - Time spent responding to import HTTP requests. This metric is broken into part tags for request (time spent blocking the client) and merge (time spent sending metrics to workers).
  • veneur.import.request_error_total - A counter for the number of import requests that have errored out. You can use this for monitoring and alerting when imports fail.
  • veneur.listen.received_per_protocol_total - A counter for the number of metrics/spans/etc. received by direct listening on global Veneur instances. This can be used to observe metrics that were received from direct emits as opposed to imports. Tagged by protocol.

Error Handling

In addition to logging, Veneur will dutifully send any errors it generates to a Sentry instance. This will occur if you set the sentry_dsn configuration option. Not setting the option will disable Sentry reporting.

Performance

Processing packets quickly is the name of the game.

Benchmarks

The common use case for Veneur is as an aggregator and host-local replacement for DogStatsD, therefore processing UDP fast is no longer the priority. That said, we were processing > 60k packets/second in production before shifting to the current local aggregation method. This outperformed both the Datadog-provided DogStatsD and StatsD in our infrastructure.

SO_REUSEPORT

As other implementations have observed, there's a limit to how many UDP packets a single kernel thread can consume before it starts to fall over. Veneur supports the SO_REUSEPORT socket option on Linux, allowing multiple threads to share the UDP socket with kernel-space balancing between them. If you've tried throwing more cores at Veneur and it's just not going fast enough, this feature can probably help by allowing more of those cores to work on the socket (which is Veneur's hottest code path by far). Note that this is only supported on Linux (right now). We have not added support for other platforms, like darwin and BSDs.

TCP connections

Veneur supports reading the statsd protocol from TCP connections. This is mostly to support TLS encryption and authentication, but might be useful on its own. Since TCP is a continuous stream of bytes, this requires each stat to be terminated by a new line character ('\n'). Most statsd clients only add new lines between stats within a single UDP packet, and omit the final trailing new line. This means you will likely need to modify your client to use this feature.

TLS encryption and authentication

If you specify the tls_key and tls_certificate options, Veneur will only accept TLS connections on its TCP port. This allows the metrics sent to Veneur to be encrypted.

If you specify the tls_authority_certificate option, Veneur will require clients to present a client certificate, signed by this authority. This ensures that only authenticated clients can connect.

You can generate your own set of keys using openssl:

# Generate the authority key and certificate (2048-bit RSA signed using SHA-256)
openssl genrsa -out cakey.pem 2048
openssl req -new -x509 -sha256 -key cakey.pem -out cacert.pem -days 1095 -subj "/O=Example Inc/CN=Example Certificate Authority"

# Generate the server key and certificate, signed by the authority
openssl genrsa -out serverkey.pem 2048
openssl req -new -sha256 -key serverkey.pem -out serverkey.csr -days 1095 -subj "/O=Example Inc/CN=veneur.example.com"
openssl x509 -sha256 -req -in serverkey.csr -CA cacert.pem -CAkey cakey.pem -CAcreateserial -out servercert.pem -days 1095

# Generate a client key and certificate, signed by the authority
openssl genrsa -out clientkey.pem 2048
openssl req -new -sha256 -key clientkey.pem -out clientkey.csr -days 1095 -subj "/O=Example Inc/CN=Veneur client key"
openssl x509 -req -in clientkey.csr -CA cacert.pem -CAkey cakey.pem -CAcreateserial -out clientcert.pem -days 1095

Set statsd_listen_addresses, tls_key, tls_certificate, and tls_authority_certificate:

statsd_listen_addresses:
  - "tcp://localhost:8129"
tls_certificate: |
  -----BEGIN CERTIFICATE-----
  MIIC8TCCAdkCCQDc2V7P5nCDLjANBgkqhkiG9w0BAQsFADBAMRUwEwYDVQQKEwxC
  ...
  -----END CERTIFICATE-----
tls_key: |
    -----BEGIN RSA PRIVATE KEY-----
  MIIEpAIBAAKCAQEA7Sntp4BpEYGzgwQR8byGK99YOIV2z88HHtPDwdvSP0j5ZKdg
  ...
  -----END RSA PRIVATE KEY-----
tls_authority_certificate: |
  -----BEGIN CERTIFICATE-----
  ...
  -----END CERTIFICATE-----

Performance implications of TLS

Establishing a TLS connection is fairly expensive, so you should reuse connections as much as possible. RSA keys are also far more expensive than using ECDH keys. Using localhost on a machine with one CPU, Veneur was able to establish ~700 connections/second using ECDH prime256v1 keys, but only ~110 connections/second using RSA 2048-bit keys. According to the Go profiling for a Veneur instance using TLS with RSA keys, approximately 25% of the CPU time was in the TLS handshake, and 13% was decrypting data.

Name

The veneur is a person acting as superintendent of the chase and especially of hounds in French medieval venery and being an important officer of the royal household. In other words, it is the master of dogs. :)

Comments
  • Drop statsd client from veneur's internal metrics reporting

    Drop statsd client from veneur's internal metrics reporting

    Summary

    This PR reworks veneur such that its metrics get reported not through statsd, spewing packets at its own UDP listening port, but sending them through its own *trace.Client (typically internal), by attaching them to an SSF span.

    Motivation

    I'd like us to cause as little network traffic as possible, reduce syscalls, gain more clarity around what gets reported how.

    Test plan

    I made sure the tests still work, but then most of them don't exercise the path where we actually send metrics through the trace backend.

    To try it out, roll this to our QA and see:

    • [x] do metrics still get reported in the first place? (They should be!)
    • [x] how are dashboards affected on the hosts that this is running on?
    • [x] how is CPU / memory usage affected by this, possibly syscall volume too?

    Rollout/monitoring/revert plan

    This will likely affect dashboards - if you roll this out, you will notice a severe drop in UDP packets.

    There is an ordering requirement for rolling it out, too:

    1. Roll out local veneur servers
    2. Roll out global veneur servers
    3. Roll out veneur proxies

    Proxies must come last because they report SSF metrics to a local veneur server, and that server must know to expect them/report them correctly - so the versions must match up

    approved 
    opened by asf-stripe 26
  • Added a more performant Hyperloglog.

    Added a more performant Hyperloglog.

    Summary

    Changed the current implementation of the Hyperloglog by Clark Duvall to the Axiom one.

    Motivation

    Based on the intent on https://news.ycombinator.com/item?id=14636699 I took a stab at using the axiomhq implementation to reduce resource usage.

    Test plan

    I adapted and used the existing samplers_test.go test case.

    Rollout/monitoring/revert plan

    A rollout should be fairly easy as changing the import and changing a few method calls.

    approved release-1.7 
    opened by martinpinto 23
  • Support forwarding over gRPC

    Support forwarding over gRPC

    Summary

    As requested by @cory-stripe in #390, this adds a gRPC service that duplicates the functionality of /import, allowing forwarding over gRPC instead of HTTP. I implemented this for both veneur and veneur-proxy. For both services it should be possible to run both the HTTP and the gRPC servers simultaneously, allowing for a gradual switch of local Veneur's over time. For Veneur's that are forwarding metrics, I added a configuration option that controls weather forwarding happens over gRPC or HTTP.

    I tried to break this change down into somewhat descriptive and self-contained commits, which I think might be more useful then my essay in #390 😄 Let me know if you'd like more description and I'm happy to add it.

    Sorry for making this so gigantic! I'm happy to break this down into smaller PRs if you'd like. One division that I could see working would be 3 PRs consisting of the following, although this is pretty much contained in the commits I made:

    • Implementing the veneur gRPC server
    • Implementing the veneur-proxy gRPC server
    • Adding forwarding support over gRPC

    Motivation

    We (@quantcast) are very motivated to contribute changes that fix issue #155 so that we can heavily reduce the cardinality of many of our metrics that are submitted to Datadog. Hopefully this puts us on a path where we can get that merged!

    Test plan

    I added a pretty good number of tests to verify the new endpoint. This includes the following:

    • Tests for the new gRPC proxy server
    • Tests for the gRPC import server
    • Tests of the new flushing behavior over gRPC
    • Tests for the Worker's ingesting metrics from gRPC
    • An end-to-end test that follows the chain of forwarding from a local Veneur to being flushed by a global instance.

    I've also created a small development setup that I'm currently testing these changes on, and it's looking good so far. If all goes well we will probably try to canary this to our production proxies and global Veneur's in the next week.

    TODOs

    • [x] Add tests for the new functions added to the various samplers

    Rollout/monitoring/revert plan

    This can be deployed to local Veneur's, global Veneur's, and proxies in any order, and should cause no behavior change until gRPC configuration options are edited.

    Obviously before enabling gRPC on local Veneur's you will want to have enabled the gRPC servers on both the proxies and global Veneur's 😄

    CC: @stripe/observability

    opened by noahgoldman 19
  • Accept statsd from authenticated TLS connections

    Accept statsd from authenticated TLS connections

    Summary

    By default, Veneur (and statsd, and dogstatsd) accepts stats from any client that can send UDP packets to the correct port. For slightly bizarre reasons, it is useful to only accept data from authenticated clients. An "easy" way to do this securely is to use an encrypted TLS connection, and require the client to present a certificate signed by a trusted authority. This reuses the existing authentication mechanism in TLS, which is hopefully secure.

    This adds configuration options to read the statsd protocol from a TLS connection. It will only accept connections with a client certificate, signed by the configured authority.

    Motivation

    See Issue #117

    Test plan

    Currently we (Bluecore Inc) are running this change as an experiment for getting metrics out of App Engine. We will be ramping up our usage of it over the next couple weeks.

    There are no unit tests here yet: If this seems reasonable, I'll attempt to add some for the TCP connection reader, and possibly for the TLS listen socket setup.

    Rollout/monitoring/revert plan

    This adds tls.connects and tls.disconnects counters to monitor connections with TLS. Maybe this should be a gauge, but then you might not see if clients are in reconnect loops.

    approved 
    opened by evanj 19
  • Add a Kafka sink

    Add a Kafka sink

    Summary

    Add a sink for sinking metrics or spans to a Kafka topic

    Motivation

    From @parabuzzle,

    I want to be able to sink my metrics to a Kafka topic for later consuming and stream processing.

    Me too! I've made lots of changes to his original PR. Partly because the interface changed underneath him, but also because we wanted to include span!

    This PR adds a large pile of config option — sorry not sorry, Kafka has lots of options! — and sets up an asynchronous Kafka producer that writes to the supplied topic(s). Users can configure their producer to "flush" based on messages, bytes or durations and even configure spans to be one of JSON or Protobuf.

    Test plan

    Tests added and local tests confirm things work!

    opened by parabuzzle 18
  • Align tick with interval boundaries for bucketing.

    Align tick with interval boundaries for bucketing.

    Summary

    Align Veneur's internal flush ticker with the interval for nice bucketing.

    Motivation

    We got an issue from @hermanbanken in #296 that pointed our Veneur's choice to "tick" outside of a convenient boundary was causing some bucketing issues. This isn't the first time we've heard this, but it's the first time someone helped us understand why.

    To that end, this patch changes veneur to delay the start of it's ticker by <= interval so as to align with nice, round clock buckets.

    Test plan

    Test incoming, in writing this PR text I realized how to test it. :)

    r? @stripe/observability

    approved 
    opened by cory-stripe 16
  • Veneur panic

    Veneur panic

    Our veneur's v1.3.1 started crashing sporadically past thurday. Yesterday, I compiled from master, and updated all, but the crash seams to persist.

    The problem is that it's ocasional and I can't seam to be able to debug ingested metrics.

    Thank you for any help.

    LOG:
    time="2018-11-27T13:38:11Z" level=info msg="Completed flush to Datadog" metrics=2403
    fatal error: fault
    unexpected fault address 0x10f9000
    	/go/src/github.com/stripe/veneur/worker.go:259 +0x3ee fp=0xc000ac7fa8 sp=0xc000ac7c28 pc=0x10f854e
    github.com/stripe/veneur.(*Worker).Work(0xc000362fc0)
    	/go/src/github.com/stripe/veneur/worker.go:311 +0x9fa fp=0xc000ac7c28 sp=0xc000ac7a90 pc=0x10f8ffa
    github.com/stripe/veneur.(*Worker).ProcessMetric(0xc000362fc0, 0xc000ac7dc8)
    	/usr/local/go/src/runtime/signal_unix.go:397 +0x275 fp=0xc000ac7a90 sp=0xc000ac7a40 pc=0x441aa5
    runtime.sigpanic()
    	/usr/local/go/src/runtime/panic.go:608 +0x72 fp=0xc000ac7a40 sp=0xc000ac7a10 pc=0x42bf72
    runtime.throw(0x13cc01a, 0x5)
    goroutine 134 [running]:
    
    [signal SIGSEGV: segmentation violation code=0x2 addr=0x10f9000 pc=0x10f8ffa]
    net.(*conn).Read(0xc00000c050, 0xc0004e8000, 0x2000, 0x2000, 0x0, 0x0, 0x0)
    	/usr/local/go/src/net/fd_unix.go:202 +0x4f
    net.(*netFD).Read(0xc000401180, 0xc0004e8000, 0x2000, 0x2000, 0x408a7b, 0xc00002e000, 0x125a980)
    	/usr/local/go/src/internal/poll/fd_unix.go:169 +0x179
    internal/poll.(*FD).Read(0xc000401180, 0xc0004e8000, 0x2000, 0x2000, 0x0, 0x0, 0x0)
    	/usr/local/go/src/internal/poll/fd_poll_runtime.go:90 +0x3d
    internal/poll.(*pollDesc).waitRead(0xc000401198, 0xc0004e8000, 0x2000, 0x2000)
    	/usr/local/go/src/internal/poll/fd_poll_runtime.go:85 +0x9a
    internal/poll.(*pollDesc).wait(0xc000401198, 0x72, 0xffffffffffffff00, 0x15b5280, 0x206e558)
    	/usr/local/go/src/runtime/netpoll.go:173 +0x66
    internal/poll.runtime_pollWait(0x7f131c6a9950, 0x72, 0xc0006c2870)
    goroutine 36 [IO wait]:
    
    	/go/src/github.com/stripe/veneur/cmd/veneur/main.go:94 +0x2eb
    main.main()
    	/go/src/github.com/stripe/veneur/server.go:1074 +0x6f
    github.com/stripe/veneur.(*Server).Serve(0xc0005dc600)
    goroutine 1 [chan receive, 244 minutes]:
    
    	/go/src/github.com/stripe/veneur/server.go:273 +0x95c
    created by github.com/stripe/veneur.NewFromConfig
    	/usr/local/go/src/runtime/asm_amd64.s:1333 +0x1 fp=0xc000ac7fd8 sp=0xc000ac7fd0 pc=0x45bab1
    runtime.goexit()
    	/go/src/github.com/stripe/veneur/server.go:277 +0x51 fp=0xc000ac7fd0 sp=0xc000ac7fa8 pc=0x10fe351
    github.com/stripe/veneur.NewFromConfig.func1(0xc0005dc600, 0xc000362fc0)
    
    
    CONFIG:
    statsd_listen_addresses:
     - udp://0.0.0.0:8125
    tls_key: ""
    tls_certificate: ""
    tls_authority_certificate: ""
    forward_address: ""
    forward_use_grpc: false
    interval: "10s"
    synchronize_with_interval: false
    stats_address: "localhost:8125"
    http_address: "0.0.0.0:8127"
    grpc_address: "0.0.0.0:8128"
    indicator_span_timer_name: ""
    percentiles:
      - 0.99
      - 0.95
    aggregates:
      - "min"
      - "max"
      - "median"
      - "avg"
      - "count"
      - "sum"
    num_workers: 96
    num_readers: 4
    num_span_workers: 10
    span_channel_capacity: 100
    metric_max_length: 4096
    trace_max_length_bytes: 16384
    read_buffer_size_bytes: 4194304
    debug: false
    debug_ingested_spans: false
    debug_flushed_metrics: false
    mutex_profile_fraction: 0
    block_profile_rate: 0
    sentry_dsn: ""
    enable_profiling: false
    datadog_api_hostname: https://app.datadoghq.com
    datadog_api_key: "DATADOG_KEY"
    datadog_flush_max_per_body: 25000
    datadog_trace_api_address: ""
    datadog_span_buffer_size: 16384
    aws_access_key_id: ""
    aws_secret_access_key: ""
    aws_region: ""
    aws_s3_bucket: ""
    flush_file: ""
    

    ADDITIONAL INFO: They receive metrics via UDP at a rate no lower than 1000/s. We healthcheck via TCP/8127 /healthcheck and the servers are replaced on failure. (AWS ASG)

    opened by ruilapa 13
  • Add a gRPC

    Add a gRPC "import" server

    This PR contains a subset of the changes in #407, which enables full gRPC support

    Summary

    This adds a gRPC "import" server, which receives metrics and outputs them to a set of generic MetricIngesters. This should be used to receive metrics forwarded from another Veneur.

    importsrv package

    The importsrv package defines a gRPC server that implements the forwardrpc.Forward service. It receives batches of metrics, and sends them on to a set of sinks based on their key. To facilitate easy testing of this RPC, I made the server accept an interface MetricIngester, which is what is used to forward the metrics. Worker will later implement this interface, allowing for them to be passed directly into this server.

    Optimizations for gRPC imports

    From profiling global Veneurs in our production environment, we see that the garbage collector is generally dominating the CPU usage. Digging in a little more, it looks like the process to hash a metric to a destination worker in the "importsrv" package is actually doing a ton of allocations (about 20% of total allocated objects).

    To optimize this, I pulled in an external fnv hash implementation github.com/segmentio/fasthash that handles strings with zero allocations and doesn't use interface types. I also removed usages of "MetricKey", as it does a lot of string manipulations that perform a ton of allocations, in favor of just writing strings directly into the hash.

    The benchmarks show nice improvements in the number of allocations. There is a 2x reduction for a small number of metrics (10 to be specific) and a 6x reduction for 100 metrics, which is closer to our average of around 180 currently.

    While this seems like a nice improvement to me, it does require the additional dependency. I put these changes in a separate commit (fc6634a), and I'm happy to revert it if the performance gain doesn't seem worth it!

    Motivation

    Like I also said in #439, we (@quantcast) would like to contribute gRPC support to eventually enable global histogram aggregations (#390). We are also already running this.

    Test plan

    I wrote a number of tests for the new package. Very similar code is also currently running in our production environment and handling almost our entire metric volume without issues.

    Rollout/monitoring/revert plan

    As with #439, all of the new code is (so far) unused, and so there should be no functional changes and any deployment should be a no-op!

    CC: @sdboyer-stripe @cory-stripe @stripe/observability

    approved 
    opened by noahgoldman 12
  • Use shared http.Client for signalfx sink

    Use shared http.Client for signalfx sink

    Summary

    basically see commits

    r? @aditya-stripe

    Motivation

    the sfx sink should use the same http client object as the rest of veneur, so it can share keepalive connections

    Test plan

    been baking in qa for a while

    Rollout/monitoring/revert plan

    as normal

    approved 
    opened by sjung-stripe 12
  • Extract and set host header, return X-Powered-By

    Extract and set host header, return X-Powered-By

    Summary

    In order to use Veneur behind a webserver or a proxy, some changes need to be applied. Veneur modifies the endpoint to avoid dns-caching and not preserving the original host header, this breaks Veneur if it sits behind a proxy that is listening to more than one site.

    Motivation

    Support usage of veneur behind another webserver / proxy (Nginx, Apache, etc...) Adding visibility once it reached the veneur instance using X-Powered-By

    Test plan

    I wrote a test to verify the new extractHostPort works as expected and that I didn't break any other tests.

    opened by ortz 12
  • sentry.go: sentryHook.Fire: Don't hang for Fatal logs without Sentry

    sentry.go: sentryHook.Fire: Don't hang for Fatal logs without Sentry

    Summary

    If s.c is nil, s.c.Capture returns a nil chan. Reading from it blocks forever. This means if someone logs at logrus Fatal or Panic levels without Sentry configured, the caller will hang instead of causing the process. This is the same bug as commit d04711e9c2, just in a different place.

    It turns out the unit tests were depending on this behaviour, since they were reusing ports and hitting this error. With this fix, they were crashing. PR #134 fixes that problem.

    Motivation

    I was debugging something and found that Veneur was hanging, not crashing. This was the cause.

    Test plan

    So far, I've only run the unit tests.

    opened by evanj 12
  • Add an example of comma-separated `-tag` in veneur-emit

    Add an example of comma-separated `-tag` in veneur-emit

    Summary

    Add an example of setting multiple tags with veneur-emit.

    Motivation

    Clarity

    Test plan

    N/a docs change only.

    Rollout/monitoring/revert plan

    N/A

    opened by jkz-stripe 0
  • Veneur forarding to Datadog - avg higher than max, avg missing data

    Veneur forarding to Datadog - avg higher than max, avg missing data

    Hello

    We are using veneur v 14.2.0 to forward metrics to datadog. Our topology looks like this:

    Application sends statsd metrics -> to veneur agent running on ECS instances -> veneur proxy -> veneur global -> datadog sink
    

    We use the go-lang datadog statsd libray to send stats (github.com/DataDog/datadog-go/statsd). For this particular problem metric seen in screenshots we use the Statsd.Histogram(...) api to send the timing data.

    We are seeing weird issues with AVG for certain metrics. We see that avg is higher than both max and p95. We also see the avg data point missing for significant periods. Attaching a screenshot that show this: image

    Our configuration files are as below:

    Veneur client

    startup command: ./local/veneurclient -f local/veneurclient.yaml env: none veneurclient.yaml config file:

    ---
    # == COLLECTION ==
    
    statsd_listen_addresses:
     - udp://0.0.0.0:8125
    
    # == BEHAVIOR ==
    
    forward_address: "http://veneur-proxy.service.consul:18127"
    interval: "10s"
    stats_address: "localhost:8125"
    http_address: "0.0.0.0:8127"
    
    # == METRICS CONFIGURATION ==
    
    # Defaults to the os.Hostname()!
    hostname: ""
    # If true and hostname is "" or absent, don't add the host tag
    omit_empty_hostname: true
    
    tags: ["service:veneur-local"]
    
    # Set to floating point values that you'd like to output percentiles for from
    # histograms.
    percentiles: [0.95]
    aggregates: ["max","avg","count"]
    
    # == PERFORMANCE ==
    num_workers: 6
    num_readers: 3
    
    # == LIMITS ==
    
    # How big of a buffer to allocate for incoming metrics. Metrics longer than this
    # will be truncated!
    metric_max_length: 131072
    
    # How big of a buffer to allocate for incoming traces.
    trace_max_length_bytes: 16384
    
    # The size of the buffer we'll use to buffer socket reads. Tune this if you
    # you think Veneur needs more room to keep up with all packets.
    read_buffer_size_bytes: 26214400
    
    # == DIAGNOSTICS ==
    
    # Enables Go profiling
    enable_profiling: true
    
    # == SINKS ==
    # == Datadog ==
    # Datadog can be a sink for metrics, events, service checks and trace spans.
    
    # Hostname to send Datadog data to.
    datadog_api_hostname: "https://app.datadoghq.com"
    
    # API key for acessing Datadog
    datadog_api_key: blahyidda
    
    # How many metrics to include in the body of each POST to Datadog. Veneur
    # will post multiple times in parallel if the limit is exceeded.
    flush_max_per_body: 25000
    
    

    Veneur proxy

    startup command: local/veneur-proxy -f local/veneur-proxy.yaml env:

            VENEUR_PROXY_FORWARDADDRESS           = "http://veneur-global.service.consul:18127"
            VENEUR_PROXY_HTTPADDRESS              = "0.0.0.0:18127"
            VENEUR_PROXY_STATSADDRESS             = "127.0.0.1:18125"
            VENEUR_PROXY_STATSDLISTENADDRESSES    = "udp://0.0.0.0:18125"
            VENEUR_PROXY_TAGS                     = "service:veneur-proxy"
    

    veneur-proxy.yaml config file:

    ---
    debug: true
    enable_profiling: false
    http_address: "0.0.0.0:18127"
    
    # How often to refresh from Consul's healthy nodes
    consul_refresh_interval: "30s"
    
    # This field is deprecated - use ssf_destination_address instead!
    stats_address: "localhost:18125"
    
    # The address to which to send SSF spans and metrics - this is the
    # same format as on the veneur server's `ssf_listen_addresses`.
    ssf_destination_address: "udp://localhost:8126"
    
    ### FORWARDING
    # Use a static host for forwarding
    forward_address: "http://veneur.example.com"
    # Or use a consul service for consistent forwarding.
    consul_forward_service_name: "forwardServiceName"
    
    # Maximum time that forwarding each batch of metrics can take;
    # note that forwarding to multiple global veneur servers happens in
    # parallel, so every forwarding operation is expected to complete
    # within this time.
    forward_timeout: 10s
    
    ### TRACING
    # The address on which we will listen for trace data
    trace_address: "127.0.0.1:18128"
    # Use a static host to send traces to
    trace_api_address: "http://localhost:7777"
    # Ose us a consul service for sending all spans belonging to the same parent
    # trace to a consistent host
    consul_trace_service_name: "traceServiceName"
    
    sentry_dsn: ""
    

    Veneur global

    startup command: local/veneur -f local/veneur-global.yaml env:

            VENEUR_AGGREGATES             = "max,avg,count"
            VENEUR_DATADOGAPIHOSTNAME     = "https://app.datadoghq.com"
            VENEUR_DATADOGAPIKEY          = "blahyidda"
            VENEUR_DATADOGTRACEAPIADDRESS = ""
            VENEUR_HOSTNAME               = ""
            VENEUR_HTTPADDRESS            = "0.0.0.0:18127"
            VENEUR_NUMWORKERS             = "3"
            VENEUR_OMITEMPTYHOSTNAME      = "false"
            VENEUR_PERCENTILES            = "0.95"
            VENEUR_STATSDLISTENADDRESSES  = "udp://0.0.0.0:18125"
            VENEUR_TAGS  
    

    veneur-global.yaml config file:

    ---
    # == COLLECTION ==
    
    # The addresses on which to listen for statsd metrics. These are
    # formatted as URLs, with schemes corresponding to valid "network"
    # arguments on https://golang.org/pkg/net/#Listen. Currently, only udp
    # and tcp (including IPv4 and 6-only) schemes are supported.
    # This option supersedes the "udp_address" and "tcp_address" options.
    statsd_listen_addresses:
     - udp://localhost:18126
     - tcp://localhost:18126
    
    # The addresses on which to listen for SSF data. As with
    # statsd_listen_addresses, these are formatted as URLs, with schemes
    # corresponding to valid "network" arguments on
    # https://golang.org/pkg/net/#Listen. Currently, only UDP and Unix
    # domain sockets are supported.
    # Note: SSF sockets are required to ingest trace data.
    # This option supersedes the "ssf_address" option.
    ssf_listen_addresses:
      - udp://localhost:18128
      - unix:///tmp/veneur-ssf.sock
    
    # TLS
    # These are only useful in conjunction with TCP listening sockets
    
    # TLS server private key and certificate for encryption (specify both)
    # These are the key/certificate contents, not a file path
    tls_key: ""
    tls_certificate: ""
    
    # Authority certificate: requires clients to be authenticated
    tls_authority_certificate: ""
    
    
    
    # == BEHAVIOR ==
    
    # Use a static host for forwarding
    #forward_address: "http://veneur.example.com"
    forward_address: ""
    
    # How often to flush. When flushing to Datadog, changing this
    # value when you've already emitted metrics will break your time
    # series data.
    interval: "10s"
    
    # Veneur can "sychronize" it's flushes with the system clock, flushing at even
    # intervals i.e. 0, 10, 20… to align with the `interval`. This is disabled by
    # default for now, as it can cause thundering herds in large installations.
    synchronize_with_interval: false
    
    # Veneur emits its own metrics; this configures where we send them. It's ok
    # to point veneur at itself for metrics consumption!
    stats_address: "localhost:18126"
    
    # The address on which to listen for HTTP imports and/or healthchecks.
    # http_address: "[email protected]"
    http_address: "0.0.0.0:18127"
    
    # The name of timer metrics that "indicator" spans should be tracked
    # under. If this is unset, veneur doesn't report an additional timer
    # metric for indicator spans.
    indicator_span_timer_name: "indicator_span.duration_ms"
    
    # == METRICS CONFIGURATION ==
    
    # Defaults to the os.Hostname()!
    hostname: ""
    
    # If true and hostname is "" or absent, don't add the host tag
    omit_empty_hostname: false
    
    # Tags supplied here will be added to all metrics ingested by this instance.
    # Example:
    # tags:
    #  - "foo:bar"
    #  - "baz:quz"
    tags:
      - ""
    
    # Set to floating point values that you'd like to output percentiles for from
    # histograms.
    percentiles:
      - 0.5
      - 0.75
      - 0.99
    
    # Aggregations you'd like to putput for histograms. Possible values can be any
    # or all of:
    # - `min`: the minimum value in the histogram during the flush period
    # - `max`: the maximum value in the histogram during the flush period
    # - `median`: the median value in the histogram during the flush period
    # - `avg`: the average value in the histogram during the flush period
    # - `count`: the number of values added to the histogram during the flush period
    # - `sum`: the sum of all values added to the histogram during the flush period
    # - `hmean`: the harmonic mean of the all the values added to the histogram during the flush period
    aggregates:
     - "min"
     - "max"
     - "count"
    
    
    
    # == PERFORMANCE ==
    
    # Adjusts the number of workers Veneur will distribute aggregation across.
    # More decreases contention but has diminishing returns.
    num_workers: 96
    
    # Numbers larger than 1 will enable the use of SO_REUSEPORT, make sure
    # this is supported on your platform!
    num_readers: 1
    
    
    
    # == LIMITS ==
    
    # How big of a buffer to allocate for incoming metrics. Metrics longer than this
    # will be truncated!
    metric_max_length: 4096
    
    # How big of a buffer to allocate for incoming traces.
    trace_max_length_bytes: 16384
    
    # The number of SSF packets that can be processed
    # per flush interval
    ssf_buffer_size: 16384
    
    # The size of the buffer we'll use to buffer socket reads. Tune this if you
    # you think Veneur needs more room to keep up with all packets.
    read_buffer_size_bytes: 2097152
    
    # How many metrics to include in the body of each POST to Datadog. Veneur
    # will post multiple times in parallel if the limit is exceeded.
    flush_max_per_body: 25000
    
    
    
    # == DIAGNOSTICS ==
    
    # Sets the log level to DEBUG
    debug: true
    
    # Providing a Sentry DSN here will send internal exceptions to Sentry
    sentry_dsn: ""
    
    # Enables Go profiling
    enable_profiling: false
    
    
    
    # == SINKS ==
    
    # == Datadog ==
    # Datadog can be a sink for metrics, events, service checks and trace spans.
    
    # Hostname to send Datadog data to.
    datadog_api_hostname: https://app.datadoghq.com
    
    # API key for acessing Datadog
    datadog_api_key: "farts"
    
    # Hostname to send Datadog trace data to.
    datadog_trace_api_address: ""
    
    # == SignalFx ==
    # SignalFx can be a sink for metrics.
    signalfx_api_key: ""
    
    # Where to send metrics
    signalfx_endpoint_base: "https://ingest.signalfx.com"
    
    # The tag we'll add to each metric that contains the hostname we came from
    signalfx_hostname_tag: "host"
    
    # == LightStep ==
    # LightStep can be a sink for trace spans.
    
    # If present, lightstep will be enabled as a tracing sink
    # and this access token will be used
    # Access token for accessing LightStep
    trace_lightstep_access_token: ""
    
    # Host to send trace data to
    trace_lightstep_collector_host: ""
    
    # How often LightStep should reconnect to collectors. If your workload is
    # imbalanced — some veneur instances see more spans than others — then you may
    # want to reconnect more often.
    trace_lightstep_reconnect_period: "5m"
    
    # The LightStep client has internal throttling to prevent you overwhelming
    # things. Anything that exceeds this many spans in the reporting period
    # — which is a minimum of 500ms and maxmium 2.5s at the time of this writing
    # — will be dropped. In other words, you can only submit this many spans per
    # flush! If left at zero, veneur will set the maximum to the size of
    # `ssf_buffer_size`.
    trace_lightstep_maximum_spans: 0
    
    # Multiple clients can be used to load-balance spans cross multiple collectors,
    # improving span indexing success rates.
    # If missing (or set to zero), it will default
    # to a minimum of one client
    trace_lightstep_num_clients: 1
    
    # == Kafka ==
    
    # Comma-delimited list of brokers suitable for Sarama's [NewAsyncProducer](https://godoc.org/github.com/Shopify/sarama#NewAsyncProducer)
    # in the form hostname:port, such as localhost:9092
    kafka_broker: ""
    
    # Name of the topic we'll be publishing checks to
    kafka_check_topic: "veneur_checks"
    
    # Name of the topic we'll be publishing events to
    kafka_event_topic: "veneur_events"
    
    # Name of the topic we'll be publishing metrics to
    kafka_metric_topic: ""
    
    # Name of the topic we'll be publishing spans to
    kafka_span_topic: "veneur_spans"
    
    # Name of a tag to hash on for sampling; if empty, spans are sampled based off
    # of traceID
    kafka_span_sample_tag: ""
    
    # Sample rate in percent (as an integer)
    # This should ideally be a floating point number, but at the time this was
    # written, gojson interpreted whole-number floats in yaml as integers.
    kafka_span_sample_rate_percent: 100
    
    kafka_metric_buffer_bytes: 0
    
    kafka_metric_buffer_messages: 0
    
    #kafka_metric_buffer_frequency: ""
    
    kafka_span_serialization_format: "protobuf"
    
    # The type of partitioner to use.
    kafka_partitioner: "hash"
    
    # What type of acks to require for metrics? One of none, local or all.
    kafka_metric_require_acks: "all"
    
    # What type of acks to require for span? One of none, local or all.
    kafka_span_require_acks: "all"
    
    kafka_span_buffer_bytes: 0
    
    kafka_span_buffer_mesages: 0
    
    #kafka_span_buffer_frequency: ""
    
    # The number of retries before giving up.
    kafka_retry_max: 0
    
    # == PLUGINS ==
    
    # == S3 Output ==
    # Include these if you want to archive data to S3
    aws_access_key_id: ""
    aws_secret_access_key: ""
    aws_region: ""
    aws_s3_bucket: ""
    
    # == LocalFile Output ==
    # Include this if you want to archive data to a local file (which should then be rotated/cleaned)
    flush_file: ""
    
    opened by aniaptebsft 0
  • enriching dropped metrics with tags from metric

    enriching dropped metrics with tags from metric

    Summary

    Enriching dropped metrics with tags from the metric we're dropping

    Motivation

    Allows better observability over dropped metrics

    Test plan

    Unit tested

    Rollout/monitoring/revert plan

    opened by andresgalindo-stripe 0
  • Improve multi-arch build + documentation

    Improve multi-arch build + documentation

    Summary

    In order to make #947 easier, this change makes use golang native cross-compilation making image building when build for both AMD64 and ARM64 around 5-6x faster.

    Motivation

    I'm also depending on issue #947 resolved, and I'm at your disposal for any help that I can provide.

    Test plan

    I'm still conducting internal testing of this.

    Rollout/monitoring/revert plan

    This should have no impact on AMD64 images, but ARM64 still requires more testing, but I don't think that should be a blocker on release.

    The command used to build the images needs to be updated, but could not find it anywhere in code, so I'm assuming you are doing this manually on release, in that case please use the updated command in the public-docker-images/README.md file but replacing --output=type=docker with --push.

    opened by rafaelgaspar 3
A library for performing data pipeline / ETL tasks in Go.

Ratchet A library for performing data pipeline / ETL tasks in Go. The Go programming language's simplicity, execution speed, and concurrency support m

Daily Burn 385 Jan 19, 2022
churro is a cloud-native Extract-Transform-Load (ETL) application designed to build, scale, and manage data pipeline applications.

Churro - ETL for Kubernetes churro is a cloud-native Extract-Transform-Load (ETL) application designed to build, scale, and manage data pipeline appli

churrodata 13 Mar 10, 2022
Prometheus Common Data Exporter can parse JSON, XML, yaml or other format data from various sources (such as HTTP response message, local file, TCP response message and UDP response message) into Prometheus metric data.

Prometheus Common Data Exporter Prometheus Common Data Exporter 用于将多种来源(如http响应报文、本地文件、TCP响应报文、UDP响应报文)的Json、xml、yaml或其它格式的数据,解析为Prometheus metric数据。

null 7 May 18, 2022
Dud is a lightweight tool for versioning data alongside source code and building data pipelines.

Dud Website | Install | Getting Started | Source Code Dud is a lightweight tool for versioning data alongside source code and building data pipelines.

Kevin Hanselman 119 Nov 22, 2022
CUE is an open source data constraint language which aims to simplify tasks involving defining and using data.

CUE is an open source data constraint language which aims to simplify tasks involving defining and using data.

null 3.2k Nov 30, 2022
xyr is a very lightweight, simple and powerful data ETL platform that helps you to query available data sources using SQL.

xyr [WIP] xyr is a very lightweight, simple and powerful data ETL platform that helps you to query available data sources using SQL. Supported Drivers

Mohammed Al Ashaal 56 Jul 14, 2022
Fast, efficient, and scalable distributed map/reduce system, DAG execution, in memory or on disk, written in pure Go, runs standalone or distributedly.

Gleam Gleam is a high performance and efficient distributed execution system, and also simple, generic, flexible and easy to customize. Gleam is built

Chris Lu 3.1k Nov 27, 2022
DEPRECATED: Data collection and processing made easy.

This project is deprecated. Please see this email for more details. Heka Data Acquisition and Processing Made Easy Heka is a tool for collecting and c

Mozilla Services 3.4k Nov 30, 2022
Open source framework for processing, monitoring, and alerting on time series data

Kapacitor Open source framework for processing, monitoring, and alerting on time series data Installation Kapacitor has two binaries: kapacitor – a CL

InfluxData 2.2k Dec 2, 2022
Kanzi is a modern, modular, expendable and efficient lossless data compressor implemented in Go.

kanzi Kanzi is a modern, modular, expendable and efficient lossless data compressor implemented in Go. modern: state-of-the-art algorithms are impleme

null 423 Nov 6, 2022
Data syncing in golang for ClickHouse.

ClickHouse Data Synchromesh Data syncing in golang for ClickHouse. based on go-zero ARCH A typical data warehouse architecture design of data sync Aut

好未来技术 876 Dec 4, 2022
sq is a command line tool that provides jq-style access to structured data sources such as SQL databases, or document formats like CSV or Excel.

sq: swiss-army knife for data sq is a command line tool that provides jq-style access to structured data sources such as SQL databases, or document fo

Neil O'Toole 398 Nov 25, 2022
Machine is a library for creating data workflows.

Machine is a library for creating data workflows. These workflows can be either very concise or quite complex, even allowing for cycles for flows that need retry or self healing mechanisms.

whitaker-io 121 Nov 23, 2022
Stream data into Google BigQuery concurrently using InsertAll() or BQ Storage.

bqwriter A Go package to write data into Google BigQuery concurrently with a high throughput. By default the InsertAll() API is used (REST API under t

null 11 Nov 9, 2022
Dev Lake is the one-stop solution that integrates, analyzes, and visualizes software development data

Dev Lake is the one-stop solution that integrates, analyzes, and visualizes software development data throughout the software development life cycle (SDLC) for engineering teams.

Merico 69 Nov 29, 2022
Distributed, fault-tolerant key-value storage written in go.

A simple, distributed, fault-tolerant key-value storage inspired by Redis. It uses Raft protocotol as consensus algorithm. It supports the following data structures: String, Bitmap, Map, List.

Igor German 360 Aug 4, 2022
Dkron - Distributed, fault tolerant job scheduling system https://dkron.io

Dkron - Distributed, fault tolerant job scheduling system for cloud native environments Website: http://dkron.io/ Dkron is a distributed cron service,

Distributed Works 3.4k Dec 5, 2022
A distributed fault-tolerant order book matching engine

Go - Between A distributed fault-tolerant order book matching engine. Features Limit orders Market orders Order book depth Calculate market price for

Daniel Gatis 1 Dec 24, 2021
Easy to use Raft library to make your app distributed, highly available and fault-tolerant

An easy to use customizable library to make your Go application Distributed, Highly available, Fault Tolerant etc... using Hashicorp's Raft library wh

Richard Bertok 60 Nov 16, 2022