DEPRECATED: Data collection and processing made easy.

This project is deprecated. Please see this email for more details.

Heka

Data Acquisition and Processing Made Easy

Heka is a tool for collecting and collating data from a number of different sources, performing "in-flight" processing of collected data, and delivering the results to any number of destinations for further analysis.

Heka is written in Go, but Heka plugins can be written in either Go or Lua. The easiest way to compile Heka is by sourcing (see below) the build script in the root directory of the project, which will set up a Go environment, verify the prerequisites, and install all required dependencies. The build process also provides a mechanism for easily integrating external plugin packages into the generated hekad. For more details and additional installation options see Installing.

WARNING: YOU MUST SOURCE THE BUILD SCRIPT (i.e. source build.sh) TO BUILD HEKA. Setting up the Go build environment requires changes to the shell environment; if you simply execute the script (i.e. ./build.sh), these changes will not be made.
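The reason sourcing matters is general process behaviour rather than anything specific to Heka: a child process receives a copy of its parent's environment, so any changes it makes disappear when it exits. The Python sketch below is only an analogy. The variable name `HEKA_DEMO_VAR` is hypothetical and is not something the build script sets. Executing `./build.sh` corresponds to the child process here, whereas `source build.sh` runs the script's commands inside the current shell.

```python
import os
import subprocess
import sys

# Hypothetical variable name, used only for this illustration.
VAR = "HEKA_DEMO_VAR"


def child_sets_variable() -> str:
    """Run a child process that sets VAR in its own environment and prints it."""
    code = (
        "import os; "
        f"os.environ['{VAR}'] = 'set-in-child'; "
        f"print(os.environ['{VAR}'])"
    )
    result = subprocess.run(
        [sys.executable, "-c", code],
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout.strip()


def parent_sees_variable() -> bool:
    """Report whether VAR is visible in the current (parent) process."""
    return VAR in os.environ


if __name__ == "__main__":
    os.environ.pop(VAR, None)
    print("child saw:", child_sets_variable())
    print("parent sees it afterwards:", parent_sees_variable())
```

The child observes the value it set, but the parent's environment is unchanged afterwards, which is the same effect as running the build script as a separate process.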

Comments
  • (slightly) Improving deb packaging with init scripts and user creation

    As mentioned in PR #791, there were some wishes for upstart and systemd jobs as well as an init script, and there were also concerns that the install([...]) used in that PR would have included the files in all packaging. I've added a CMake module that I found here: https://github.com/sebknzl/cmake-debhelper/ which is licensed under the GPL (might that be of concern? It is only for the build system, so it should not go viral over the entire code base). This module is "needed" to execute the dh_* scripts during the setup of the project; those in turn are needed to be able to use the Debian packaging idiom of putting the init scripts into a debian/ folder.

    I also found a "bug" in the current deb-package setup. The listed debian dependency did not get assigned to the package, which I'll admit caused me some headache - I initially tried adding the init files in a similar fashion. Passing it in as a variable to the custom make entry seems to have made it work, though. But is it really needed to have libc6 above 2.13? It means that heka won't install on squeeze...

    In the long run, though, I would really recommend moving away from having CPack do the Debian packaging. I didn't do that now since it would have involved rearranging a significant portion of the build, but if it is something you want for the project, I think I could find the time to do that too. :)

    opened by mikn 26
  • Support new InfluxDB 0.9.x line protocol write API

    Well, it looks like they've decided to change the default format for the write API, per this PR: https://github.com/influxdb/influxdb/pull/2696. I'm logging this issue to track updating the Schema InfluxDB Write Encoder to support the new format, since the JSON format will eventually be deprecated and the line protocol provides better performance anyway.

    opened by acesaro 25
  • Output buffer never flushed on restart

    With the following output config:

    [ESJsonEncoder]
    type = "ESJsonEncoder"
    index = "logs-%{Type}-%{%Y.%m.%d}"
    es_index_from_timestamp = true
    type_name = "%{Type}"
    fields = ["Timestamp", "Logger", "Severity", "Payload", "Pid", "Hostname", "DynamicFields"]
      [ESJsonEncoder.field_mappings]
      Timestamp = '@timestamp'
      #Uuid = 'heka_uuid'
      #Type = 'type' # unnecessary, since it is already in type_name
      Logger = 'heka_logger'
      Severity = 'syslog_severity_code'
      Payload = 'message'
      #EnvVersion = 'heka_env_version' # unnecessary
      Pid = 'pid'
      Hostname = 'host'
    
    [ElasticSearchOutput]
    message_matcher = "Type != 'heka.all-report' && Type != 'heka.memstat'"
    encoder = "ESJsonEncoder"
    server = "http://localhost:9200"
    flush_interval = 50
      [ElasticSearchOutput.buffering]
      max_file_size = 268435456  # 256MiB
      max_buffer_size = 536870912  # 512MiB
      full_action = "shutdown"
    

    After a service heka restart, the queue is not processed. It grows until it is full.

    I've looked at the code but couldn't find a clue.

    I've seen that #1724 has not been merged into dev yet, but this is a different problem, isn't it?

    opened by sathieu 20
  • Add FileOutput file rotation

    We've had several requests for FileOutput to be able to do file rotation without an external rotation tool, in part because that means fewer tools, and in part because Heka needs to receive a HUP signal to notice that a file has been rotated out from under it, and the person running Heka doesn't always know when rotation has happened and when the HUP needs to be sent.

    opened by rafrombrc 17
  • Implementation of a multiline splitter

    We work with a lot of Java and Scala stacktraces and the other options I've tried for supporting them in Heka don't work as well as I'd like. This is an implementation of a regex-based MultilineSplitter which works great for our stacktraces. The implementation is that you define a regex to use as the delimiter and a regex used to match lines that should be joined together. It first splits the buffer using the delimiter and then checks each section against the multiline regex to see if it's a match. All lines that are contiguous and match the multiline regex are joined. Because of the multiline nature, it always keeps the delimiter on the EOL.

    This is going to be notably slower than the RegexSplitter because, first, it has to find many matches on the first pass rather than just the first one. (That's limited to 99 by default and is not currently configurable without a recompile.) Secondly, it will run a second regex on all those matches. Given Go's performance-oriented regex implementation and reasonable logging levels, it appears to be tolerable. It can, in the worst case, re-split the first lines in a very large buffer repeatedly.

    Here's an example configuration for the splitter:

    [multiline_splitter]
    type = "MultilineSplitter"
    multiline = '(\] FATAL )|(\A\s*.+Exception: .)|(at \S+\(\S+\))|(\A\s+... \d+ more)|(\A\s*Caused by:.)|(\A\s*Grave:)'
    delimiter = '\n'
    

    Given a broken Kafka installation, this generates something like the following when encoded with the ESLogstashV0Encoder:

    {
      "@fields": {
        "ContainerName": "boring_bohr",
        "ContainerID": "910f097243d6"
      },
      "@source_host": "docker1",
      "@uuid": "bda1cb47-8aa4-420e-9316-543364afd5fc",
      "@timestamp": "2016-01-24T14:22:17",
      "@type": "message",
      "@logger": "stdout",
      "@severity": 7,
      "@message": "java.net.ConnectException: Connection refused\n\tat sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)\n\tat sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:716)\n\tat org.apache.zookeeper.ClientCnxnSocketNIO.doTransport(ClientCnxnSocketNIO.java:361)\n\tat org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1081)\n",
      "@envversion": "",
      "@pid": 0
    }
    

    Note that this output is from a splitter-enabled Docker input plugin that I will prepare a separate PR for.
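The two-pass behaviour described in this pull request can be sketched roughly as follows. This is a Python illustration of the prose description only, not the Go code from the PR: the function name `split_multiline`, the sample log text, and the sample patterns are all hypothetical, and details such as the 99-match limit and the handling of partially read buffers are left out.

```python
import re
from typing import List


def split_multiline(buffer: str, delimiter: str, multiline: str) -> List[str]:
    """Split on `delimiter`, then join contiguous sections matching `multiline`."""
    delim_re = re.compile(delimiter)
    multi_re = re.compile(multiline)

    # First pass: cut after every delimiter match, keeping the delimiter
    # at the end of each section.
    sections = []
    start = 0
    for match in delim_re.finditer(buffer):
        if match.end() == match.start():
            continue  # ignore zero-width matches
        sections.append(buffer[start:match.end()])
        start = match.end()
    # Text after the last delimiter is an incomplete section and is not
    # returned here; a real splitter would keep it for the next read.

    # Second pass: join runs of contiguous sections that match `multiline`.
    records = []
    run = []
    for section in sections:
        if multi_re.search(section):
            run.append(section)
            continue
        if run:
            records.append("".join(run))
            run = []
        records.append(section)
    if run:
        records.append("".join(run))
    return records


if __name__ == "__main__":
    sample = (
        "INFO start\n"
        "java.lang.RuntimeException: boom\n"
        "\tat a.B.c(B.java:1)\n"
        "\tat a.D.e(D.java:2)\n"
        "INFO done\n"
    )
    pattern = r"(\A\S+Exception: )|(\A\s+at )"
    for record in split_multiline(sample, r"\n", pattern):
        print(repr(record))
```

With the sample input, the exception line and its two stack frames come out as a single record, while the two INFO lines stay separate.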

    opened by relistan 15
  • Raw syslog datagrams multidecoder example

    I am proposing to add an example to cover #1162 and the older #790.

    This is particularly interesting in the case of containers, if one does not want to run syslog/rsyslog inside the container but rather directly read /dev/log via heka.

    Although the best scenario is to let syslog/rsyslog poll /dev/log and write their own log (which is then fed to/read by heka), I prefer this approach, and I provide here a working example (tested with 0.9.0) that people should also be able to extend and build on easily.

    I am new to heka, so please tell me about possible improvements. Feedback for PR changes welcome as well :)

    NOTE: might require some refactoring of the name of sibling example.toml, if this is accepted at all

    1. Is there already a decoder that does this?
    2. If not, would this be better converted to a Lua decoder?

    I didn't write (2) as I wanted a simple message proxying feature, but for official inclusion it might be a valid option instead of a multidecoder example (also to properly parse PRI).

    opened by gdm85 14
  • New input plugin for Docker containers: DockerLogInput

    Solves: #1092. This is a follow-up on PR #1095.

    This PR implements DockerLogInput, an input plugin based on Logspout for sending logs from Docker containers into the heka pipeline.

    @rafrombrc Something like this? I renamed the plugin to DockerLogInput and wrote some basic docs for it.

    opened by carlanton 14
  • Can't install heka in Ubuntu13.04

    Hi, all!

    We use Ubuntu 13.04. We installed Go from source under /usr/local/go, and I set the path like this:

     export GOROOT=/usr/local/go
     export PATH=$PATH:$GOROOT/bin
    
    # go version
    go version go1.1.1 linux/amd64
    

    After installing Go, I tried to install heka and got an error:

    [email protected]:~/heka# ./build.sh
    CMake Error at /usr/share/cmake-2.8/Modules/FindPackageHandleStandardArgs.cmake:97 (message):
      Could NOT find Go (missing: GO_VERSION GO_PLATFORM GO_ARCH) (Required is at
      least version "1.1")
    Call Stack (most recent call first):
      /usr/share/cmake-2.8/Modules/FindPackageHandleStandardArgs.cmake:291 (_FPHSA_FAILURE_MESSAGE)
      cmake/FindGo.cmake:32 (find_package_handle_standard_args)
      CMakeLists.txt:16 (find_package)
    
    
    -- Configuring incomplete, errors occurred!
    make: *** No targets specified and no makefile found.  Stop.
    

    Can anyone tell me how to install heka the right way?

    opened by ghost 14
  • Update Schema InfluxDB Write Encoder to use InfluxDB 0.9.x+ line protocol

    This PR updates the Schema InfluxDB Write Encoder to format metrics into the line protocol instead of the JSON format that was originally proposed for the 0.9.0 release. This also makes sure that proper formatting is done for the various fields to escape spaces, commas and double quotes as defined by the line protocol. There is also a somewhat hacky implementation to overcome the fact that Lua converts float values of 0.0 into an integer based on its internal representation of numerical data types. I've also added a couple of new configuration items that provide more flexibility in the naming of measurements as they are sent to InfluxDB. There is now more emphasis on utilizing tags instead of the Graphite style "paths" to uniquely identify series, so the defaults have been blanked out to work with this recommendation more naturally.
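To make the escaping and the 0.0 issue concrete, here is a minimal Python sketch of formatting one line-protocol entry. It is not the encoder's Lua code: the function names are hypothetical, only float and string field values are handled, and only the characters named above (spaces, commas, double quotes) are escaped. The full protocol has further rules that this sketch omits.

```python
from typing import Dict, Optional, Union

FieldValue = Union[float, str]


def escape_name(text: str) -> str:
    """Backslash-escape commas and spaces in names, tag keys and tag values."""
    return text.replace(",", "\\,").replace(" ", "\\ ")


def format_field_value(value: FieldValue) -> str:
    """Quote strings (escaping double quotes); keep floats as floats."""
    if isinstance(value, str):
        return '"' + value.replace('"', '\\"') + '"'
    # repr() keeps the decimal point, so 0.0 is written as "0.0", not "0".
    return repr(float(value))


def to_line(
    measurement: str,
    tags: Dict[str, str],
    fields: Dict[str, FieldValue],
    timestamp: Optional[int] = None,
) -> str:
    """Build 'measurement,tag=value field=value timestamp' for one point."""
    head = escape_name(measurement)
    for key in sorted(tags):
        head += "," + escape_name(key) + "=" + escape_name(tags[key])
    body = ",".join(
        escape_name(key) + "=" + format_field_value(value)
        for key, value in sorted(fields.items())
    )
    line = head + " " + body
    if timestamp is not None:
        line += " " + str(timestamp)
    return line


if __name__ == "__main__":
    print(to_line("cpu load", {"host": "web 1"}, {"value": 0.0}, 1434055562000000000))
```

Tags carry the identifying information here (host), in line with the emphasis on tags over Graphite-style paths.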

    opened by acesaro 12
  • Can't install built heka deb package.

    I can run the generated hekad binary directly, but after running

    source build.sh
    make deb
    sudo dpkg -i heka_0.10.0_amd64.deb
    

    OS output:

    dpkg: dependency problems prevent configuration of heka:
     heka depends on libc6-amd64 (>= 2.15).
    

    However, if I install v0.10.0 directly from source, this dependency does not exist at all. Is something wrong?

    opened by YiuTerran 11
  • add example decoder for linux /proc/stat

    /proc/stat is a good source for a CPU utilization metric, whereas /proc/loadavg is kind of tricky to work with. /proc/stat can give 1-second resolution without any issues.

    However, getting the information is not so simple. You have to read /proc/stat twice and compute the delta of the values to gauge performance.

    This is my first time playing with Lua LPeg and I'm still very green to the heka project.

    This pull request is just to get some thoughts on how a good heka plugin could be created for the /proc/stat CPU metric.

    This solution is not ideal, but it's giving me good data.

    [stat_ProcessInput]
    type = "ProcessInput"
    decoder = "StatDecoder"
    ticker_interval = 3
    stdout = true
    stderr = false
        [stat_ProcessInput.command.0]
        bin = "/bin/sh"
        args = ["-c",'A=`head -1 /proc/stat`; sleep 1; B=`head -1 /proc/stat`; echo ${A}zzz${B}zzz;']
    
    {#  This would be best ...  but I dont see how I can get the diff of the previous read...
        [stat]
        type = "FilePollingInput"
        ticker_interval = 1
        file_path = "/proc/stat"
        decoder = "StatDecoder"
    #}
    
    [StatDecoder]
    type = "SandboxDecoder"
    filename = "lua_decoders/linux_stat.lua"
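The delta computation mentioned above (reading /proc/stat twice and comparing the values) can be sketched in Python as follows. This is not the Lua decoder from the pull request. The function names are hypothetical, and treating idle plus iowait as "not busy" is one common convention, assumed here.

```python
from typing import List


def parse_cpu_line(line: str) -> List[int]:
    """Parse the aggregate 'cpu' line of /proc/stat into integer counters."""
    parts = line.split()
    if not parts or parts[0] != "cpu":
        raise ValueError("expected the aggregate 'cpu' line from /proc/stat")
    # Keep at most user, nice, system, idle, iowait, irq, softirq, steal.
    # Guest time is already included in user time, so it is left out.
    return [int(value) for value in parts[1:9]]


def cpu_utilization(before: str, after: str) -> float:
    """Return CPU utilization in percent between two samples of the cpu line."""
    first = parse_cpu_line(before)
    second = parse_cpu_line(after)
    deltas = [b - a for a, b in zip(first, second)]
    total = sum(deltas)
    if total <= 0:
        return 0.0  # no time elapsed between the samples
    # Index 3 is idle, index 4 is iowait (absent on very old kernels).
    idle = deltas[3] + (deltas[4] if len(deltas) > 4 else 0)
    return 100.0 * (total - idle) / total


if __name__ == "__main__":
    sample_a = "cpu 100 0 100 800 0 0 0 0 0 0"
    sample_b = "cpu 150 0 150 900 0 0 0 0 0 0"
    print(cpu_utilization(sample_a, sample_b))
```

The counters are cumulative since boot, which is why a single read is not enough and the difference between two reads is what gets reported.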
    
    opened by steverweber 11
  • CODE_OF_CONDUCT.md file missing

    As of January 1 2019, Mozilla requires that all GitHub projects include this CODE_OF_CONDUCT.md file in the project root. The file has two parts:

    1. Required Text - All text under the headings Community Participation Guidelines and How to Report, are required, and should not be altered.
    2. Optional Text - The Project Specific Etiquette heading provides a space to speak more specifically about ways people can work effectively and inclusively together. Some examples of those can be found on the Firefox Debugger project, and Common Voice. (The optional part is commented out in the raw template file, and will not be visible until you modify and uncomment that part.)

    If you have any questions about this file, or Code of Conduct policies and procedures, please reach out to [email protected]

    (Message COC001)

    opened by Mozilla-GitHub-Standards 0
  • Kafka partition issue

    conf:

    [kafkaInputTest]
    type = "KafkaInput"
    topic = "jie"
    addrs = ["172.20.3.50:9092"]
    splitter = "KafkaSplitter"
    decoder = "ProtobufDecoder"

    [KafkaSplitter]
    type = "NullSplitter"
    use_message_bytes = true

    But error:

    2018/09/17 16:01:18 Input 'kafkaInputTest' error: kafka server: In the middle of a leadership election, there is currently no leader for this partition and hence it is unavailable for writes.

    opened by kadisyy 0
  • build failed on master branch

    [ 85%] Performing build step for 'lua_sandbox'
    Scanning dependencies of target lua-5_1_5
    [ 1%] Creating directories for 'lua-5_1_5'
    [ 2%] Performing download step (git clone) for 'lua-5_1_5'
    Cloning into 'lua-5_1_5'...
    remote: Repository not found.
    fatal: repository 'https://github.com/trink/lua.git/' not found

    opened by zdxie 3
  • MySQL Slow Query Log Decoder issue

    [hekad]
    maxprocs = 2
    #base_dir = "./base_dir"
    share_dir = "/usr/share/heka"
    #log_info_filename = "logs/info.log"
    #log_error_filename = "logs/error.log"
    #log_file_max_size = 64
    #log_file_max_backups = 7

    [Sync-1_5-SlowQuery]
    type = "LogstreamerInput"
    log_directory = "/data/soft/"
    file_match = 'mysqlslowq.log'
    parser_type = "regexp"
    delimiter = "\n(# [email protected]:)"
    delimiter_location = "start"
    decoder = "MySqlSlowQueryDecoder"

    [MySqlSlowQueryDecoder]
    type = "SandboxDecoder"
    filename = "lua_decoders/mysql_slow_query.lua"

    [MySqlSlowQueryDecoder.config]
    truncate_sql = 64

    [ESJsonEncoder]
    index = "%{Type}-%{%Y.%m.%d}"
    es_index_from_timestamp = true
    type_name = "%{Type}"
      [ESJsonEncoder.field_mappings]
      Timestamp = "@timestamp"
      Severity = "level"

    [output_file]
    type = "FileOutput"
    message_matcher = "TRUE"
    path = "/data/mysql-output.log"
    perm = "666"
    flush_count = 100
    flush_operator = "OR"
    encoder = "ESJsonEncoder"

    #######################################################################

    [[email protected] heka]# hekad -config="mysql.toml"

    2018/01/25 09:59:20 Pre-loading: [output_file]
    2018/01/25 09:59:20 Pre-loading: [Sync-1_5-SlowQuery]
    2018/01/25 09:59:20 Pre-loading: [MySqlSlowQueryDecoder]
    2018/01/25 09:59:20 Pre-loading: [ESJsonEncoder]
    2018/01/25 09:59:20 Pre-loading: [ProtobufDecoder]
    2018/01/25 09:59:20 Loading: [ProtobufDecoder]
    2018/01/25 09:59:20 Pre-loading: [ProtobufEncoder]
    2018/01/25 09:59:20 Loading: [ProtobufEncoder]
    2018/01/25 09:59:20 Pre-loading: [TokenSplitter]
    2018/01/25 09:59:20 Loading: [TokenSplitter]
    2018/01/25 09:59:20 Pre-loading: [HekaFramingSplitter]
    2018/01/25 09:59:20 Loading: [HekaFramingSplitter]
    2018/01/25 09:59:20 Pre-loading: [NullSplitter]
    2018/01/25 09:59:20 Loading: [NullSplitter]
    2018/01/25 09:59:20 Loading: [MySqlSlowQueryDecoder]
    2018/01/25 09:59:20 Loading: [ESJsonEncoder]
    2018/01/25 09:59:20 Loading: [Sync-1_5-SlowQuery]
    2018/01/25 09:59:20 unknown config setting for 'Sync-1_5-SlowQuery': parser_type
    2018/01/25 09:59:20 Loading: [output_file]
    2018/01/25 09:59:20 Error reading config: 1 errors loading plugins

    opened by ytc301 1
  • config max_message_size 10M in v0.10.0, heka didn't send data to ES when input message about 1.4M

    As the title says. Are any other config items needed?

    The following is part of my config:

    [hekad]
    maxprocs = 16
    base_dir = "/export/home/hekad"
    max_message_size = 10485760

    #attack log default
    [nginx_udp_551]
    type = "UdpInput"
    address = "172.18.182.162:551"
    decoder = "JsonDecoder"
    send_decode_failures = true
    log_decode_failures = true

    [JsonDecoder]
    type = "SandboxDecoder"
    filename = "lua_decoders/json.lua"

    [JsonDecoder.config]
    payload_keep = false
    map_fields = true
    Timestamp = "time_stamp"
    Type = "log_type"
    #type = "ngx_log"

    opened by chwma 1
Releases(v0.10.0)
Owner
Mozilla Services
see also http://blog.mozilla.com/services
Open source framework for processing, monitoring, and alerting on time series data

Kapacitor Open source framework for processing, monitoring, and alerting on time series data Installation Kapacitor has two binaries: kapacitor – a CL

InfluxData 2.2k Sep 21, 2022
Prometheus Common Data Exporter can parse JSON, XML, yaml or other format data from various sources (such as HTTP response message, local file, TCP response message and UDP response message) into Prometheus metric data.

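Prometheus Common Data Exporter Prometheus Common Data Exporter parses JSON, XML, YAML, or other format data from various sources (such as HTTP response messages, local files, TCP response messages, and UDP response messages) into Prometheus metric data.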

null 7 May 18, 2022
A stream processing API for Go (alpha)

A data stream processing API for Go (alpha) Automi is an API for processing streams of data using idiomatic Go. Using Automi, programs can process str

Vladimir Vivien 767 Sep 6, 2022
DataKit is collection agent for DataFlux.

DataKit DataKit is collection agent for DataFlux Build Dependencies apt-get install gcc-multilib: for building oracle input apt-get install tree: for

null 132 Sep 22, 2022
Go Collection Stream API, inspired in Java 8 Stream.

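GoStream gostream is a data stream processing library. It lets you declaratively transform, filter, sort, group, and collect data without worrying about the operational details. Changelog 2021-11-18 add ToSet() collector Roadmap Remove the go-linq dependency Get GoStream go get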

null 1 Jan 10, 2022
Dud is a lightweight tool for versioning data alongside source code and building data pipelines.

Dud Website | Install | Getting Started | Source Code Dud is a lightweight tool for versioning data alongside source code and building data pipelines.

Kevin Hanselman 111 Sep 2, 2022
CUE is an open source data constraint language which aims to simplify tasks involving defining and using data.

CUE is an open source data constraint language which aims to simplify tasks involving defining and using data.

null 3k Sep 23, 2022
xyr is a very lightweight, simple and powerful data ETL platform that helps you to query available data sources using SQL.

xyr [WIP] xyr is a very lightweight, simple and powerful data ETL platform that helps you to query available data sources using SQL. Supported Drivers

Mohammed Al Ashaal 56 Jul 14, 2022
Kanzi is a modern, modular, expendable and efficient lossless data compressor implemented in Go.

kanzi Kanzi is a modern, modular, expendable and efficient lossless data compressor implemented in Go. modern: state-of-the-art algorithms are impleme

null 422 Aug 15, 2022
churro is a cloud-native Extract-Transform-Load (ETL) application designed to build, scale, and manage data pipeline applications.

Churro - ETL for Kubernetes churro is a cloud-native Extract-Transform-Load (ETL) application designed to build, scale, and manage data pipeline appli

churrodata 13 Mar 10, 2022
Dev Lake is the one-stop solution that integrates, analyzes, and visualizes software development data

Dev Lake is the one-stop solution that integrates, analyzes, and visualizes software development data throughout the software development life cycle (SDLC) for engineering teams.

Merico 57 Sep 23, 2022
A library for performing data pipeline / ETL tasks in Go.

Ratchet A library for performing data pipeline / ETL tasks in Go. The Go programming language's simplicity, execution speed, and concurrency support m

Daily Burn 385 Jan 19, 2022
A distributed, fault-tolerant pipeline for observability data

Table of Contents What Is Veneur? Use Case See Also Status Features Vendor And Backend Agnostic Modern Metrics Format (Or Others!) Global Aggregation

Stripe 1.6k Sep 6, 2022
Data syncing in golang for ClickHouse.

ClickHouse Data Synchromesh Data syncing in golang for ClickHouse. based on go-zero ARCH A typical data warehouse architecture design of data sync Aut

好未来技术 847 Sep 16, 2022
sq is a command line tool that provides jq-style access to structured data sources such as SQL databases, or document formats like CSV or Excel.

sq: swiss-army knife for data sq is a command line tool that provides jq-style access to structured data sources such as SQL databases, or document fo

Neil O'Toole 388 Sep 25, 2022
Machine is a library for creating data workflows.

Machine is a library for creating data workflows. These workflows can be either very concise or quite complex, even allowing for cycles for flows that need retry or self healing mechanisms.

whitaker-io 116 Sep 27, 2022
Stream data into Google BigQuery concurrently using InsertAll() or BQ Storage.

bqwriter A Go package to write data into Google BigQuery concurrently with a high throughput. By default the InsertAll() API is used (REST API under t

null 11 Sep 1, 2022
Fast, efficient, and scalable distributed map/reduce system, DAG execution, in memory or on disk, written in pure Go, runs standalone or distributedly.

Gleam Gleam is a high performance and efficient distributed execution system, and also simple, generic, flexible and easy to customize. Gleam is built

Chris Lu 3.1k Sep 21, 2022