omniparser: a native Golang ETL streaming parser and transform library for CSV, JSON, XML, EDI, text, etc.

Overview


Omniparser is a native Golang ETL parser that ingests input data of various formats (CSV, txt, fixed length/width, XML, EDI/X12/EDIFACT, JSON, and custom formats) in streaming fashion and transforms data into desired JSON output based on a schema written in JSON.

Golang Version: 1.14
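A typical end-to-end flow looks like the sketch below (error handling abbreviated; the schema and input placeholder strings are stand-ins for your actual content):

```go
package main

import (
	"fmt"
	"io"
	"strings"

	"github.com/jf-tech/omniparser"
	"github.com/jf-tech/omniparser/transformctx"
)

func main() {
	// Create a schema from its JSON content, then a transform over an input stream.
	schema, err := omniparser.NewSchema("schema-name", strings.NewReader(`...your schema JSON...`))
	if err != nil {
		panic(err)
	}
	transform, err := schema.NewTransform("input-name", strings.NewReader(`...your input...`), &transformctx.Ctx{})
	if err != nil {
		panic(err)
	}
	// Output records stream out one at a time; the whole input is never held in memory.
	for {
		b, err := transform.Read()
		if err == io.EOF {
			break
		}
		if err != nil {
			panic(err)
		}
		fmt.Println(string(b))
	}
}
```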

Documentation

Docs:

References:

Examples:

In the example folders above you will find pairs of input files and their schema files; in the .snapshots sub-directories you'll find the corresponding output files.

Online Playground

Use https://omniparser.herokuapp.com/ (you may need to wait a few seconds for the heroku instance to wake up) to try out schemas and inputs, either your own or the existing samples, and see how ingestion and transformation work.

Why

  • No good ETL transform/parser library exists in Golang.
  • Even looking at Java and other languages, the choices aren't many, and all have limitations:
    • Smooks is dead, and its EDI parsing/transform is too heavyweight, requiring code-gen.
    • BeanIO can't deal with EDI input.
    • Jolt can't deal with anything other than JSON input.
    • JSONata supports only JSON -> JSON transforms.
  • Many parsers/transforms don't support streaming reads; they load the entire input into memory, which is not acceptable in some situations.

Requirements

  • Golang 1.14

Recent Major Feature Additions/Changes

  • Added Transform.RawRecord() for callers of omniparser to access the raw ingested record.
  • Deprecated custom_parse in favor of custom_func (custom_parse is still usable for backward compatibility; it is just removed from all public docs and samples).
  • Added NonValidatingReader EDI segment reader.
  • Added fixed-length file format support in omniv21 handler.
  • Added EDI file format support in omniv21 handler.
  • Major restructure/refactoring
    • Upgraded omni schema version to omni.2.1 due to a number of incompatible schema changes:
      • 'result_type' -> 'type'
      • 'ignore_error_and_return_empty_str' -> 'ignore_error'
      • 'keep_leading_trailing_space' -> 'no_trim'
    • Changed how custom functions are handled: previously the in and out params were always strings; now all types are supported for custom function in and out params.
    • Changed how custom functions are packaged for extensions: previously custom functions from all extensions were collected and passed to whichever extension was used; now only the custom functions included in a particular extension are used by that extension.
    • Deprecated/removed most of the custom functions in favor of using 'javascript'.
    • A number of package renamings.
  • Added CSV file format support in omniv2 handler.
  • Introduced IDR node cache for allocation recycling.
  • Introduced IDR for in-memory data representation.
  • Added trie-based, high-performance times.SmartParse.
  • Command line interface (one-off transform cmd or long-running http server mode).
  • JavaScript engine integration as a custom_func.
  • JSON stream parser.
  • Extensibility:
    • Ability to provide custom functions.
    • Ability to provide custom schema handler.
    • Ability to customize the built-in omniv2 schema handler's parsing code.
    • Ability to provide a new file format support to built-in omniv2 schema handler.

Comments
  • [feature] parsing multiple nested records under fixed-length parser

    Hi,

    Are there plans, or some documentation, around how to support parsing multiple nested objects with the fixed-length parser?

    i.e. a format like this one, which has repeating and nested elements: NWR under GRHNWR, and also SPU, SPT, SWT, SWR under NWR records.

    HDRPB123456789SAMPLE MEDIA MUSIC                        01.1000000003200123412340713
    GRHNWR0000102.100000003035
    
    NWR0000000000000000Song 1 - 1 Pub1/ 0 Wrt                                        00000000100001T000000000100000000            POP000000Y      ORI         TM (BULK)
    SPU000000000000000101Pub10    Publisher 10                         E 00000000000700000010              010012000340120009901200 N                                          OG
    SPT0000000000000002RMM            002090041800000I2136N001
    
    NWR0000000100000000Song 2 - 1 Pub2/ 0 Wrt // 1 SPU-AM                            00000000100002T000000000200000000            POP000000Y      ORI         TM (BULK)
    SPU000000010000000101Pub20    Publisher 20                         E 00000000000700000020              010012000340120009901200 N                                          OG
    SPU000000010000000201RMM      SAMPLE MEDIA MUSIC                   AM000000000005470437533745610       010000000340000009900000 N                            5594837       PG
    SPT0000000100000003RMM            002090041800000I2136N001
    
    NWR0000000300000000Song 4 - 1 Pub1/ 1 Wrt1                                       00000000100004T000000000400000000            POP000000Y      ORI         TM (BULK)
    SPU000000030000000101Pub10    Publisher 10                         E 00000000000700000010              010012000340120009901200 N                                          OG
    SPT0000000300000002RMM            002090041800000I2136N001
    SWR0000000300000003Wrt100   Writer 100                                   Controlled                     CA00000000000700000100010040000340400009904000 N
    SWT0000000300000004Wrt100   004180000000000I2136N001
    
    NWR0000000500000000Song 5 - 2 Pub1,2 / 2 Wrt - 1 new 1 old                       00000000100005T000000000500000000            POP000000Y      ORI         TM (BULK)
    SPU000000050000000101Pub10    Publisher 10                         E 00000000000700000010              010010000340100009901000 N                                          OG
    SPT0000000500000002RMM            002090041800000I2136N001
    SPU000000050000000301Pub50    Publisher 50                         E 00000000000700000050              010010000340100009901000 N                                          OG
    SPT0000000500000004RMM            002090041800000I2136N001
    SWR0000000500000005Wrt100   Writer 100                                   Controlled                     CA00000000000700000100010030000340300009903000 N
    SWT0000000500000006Wrt100   004180000000000I2136N001
    SWR0000000500000007Wrt500   Writer 500                                   Controlled                     CA00000000000700000500010030000340300009903000 N
    SWT0000000500000008Wrt500   004180000000000I2136N001
    
    GRT000010000000100000163
    TRL000010000000100000165
    
    opened by cavke 36
  • [EDI] Huge memory leak when parsing ~20MB EDI file

    Hi, we found a huge memory leak when parsing an EDI file over 20MB in size.

    • the EDI is a sample INVOIC type with millions of LIN items
    • attaching an SVG export of the --memprofile from a benchmark
    • it allocates over 5GB of memory for a single ~23MB EDI file

    Is there any alternative to calling ingester.Parse() for large files?

    opened by jurabek 10
  • Return unique hash for input

    So I'm busy ingesting shipments; they arrive as either CSV, JSON, XML or EDI.

    The interface I'm working on should take an array of shipments, divide it into individual shipments, hash those, and store the original input for success/audit/retry/failure tracking. This would make it easy to ingest 99/100 shipments and retry (after localizing and fixing the issue) the one shipment that's invalid for whatever reason.

    In order to decide whether something has been ingested correctly, I thought a solution could be hashing each 'unit' of input and storing the original input somewhere as well.

    Quite easy for csv

    Weird Python-and-Bash-esque pseudocode:

    for line in csv:
      process(line) && hash(line) && gzip(line) -> store result, hash, line in db
    

    It becomes less so for JSON and XML; even a marshal/unmarshal round trip is not 100% identical to the input.

    Even worse is EDI.

    So, even though I liked the idea of storing the original, it quickly becomes cumbersome. A decent alternative is hashing and storing the output of transform.Read().

    But that comes with several issues

    • I can change the output, and thus the hash, via the schema (not really an issue)
    • it's not original (but it is more consistent (all JSON)), so kind of a bug/feature
    • I don't get to see what I haven't told omniparser to see, e.g. new fields that might have been added

    None of these are a major issue, but they mean hashing a new representation of the input, not the input itself.

    I was wondering how hard it would be to hash the input of whatever generates the output. So: hash, data, err := transform.Read()

    Is your internal data stable enough that you could 'for loop' the IDR input through the sha256 encoder (it supports streaming) and return a stable/unchanging hash?

    As in, in theory ["a", "b", "c"] should return the same hash for a, b and c regardless of ordering.

    Also, I imagine being able to verify whether a file has been fully processed is interesting for more than one use case.

    opened by DGollings 8
  • [EDI] Handle segment compression

    Disclaimer: I only assume this is segment compression, as defined in the manual

    7.1 Exclusion of segments Conditional segments containing no data shall be omitted (including their segment tags).

    This is what I encountered in the schema, basically a mandatory/conditional sandwich.

    SG25 R 99
    43 NAD M 1
    44 LOC Orts 9 O
    
    SG25 R 99
    45 NAD M 1
    46 LOC O 9
        SG29 C 9
        47 RFF M 1
    
    SG25 O 99
    48 NAD M 1
    
    SG25 D 99
    49 NAD M 1
    
    SG25 D 99
    50 NAD M 1
    
    SG25 O 99
    51 NAD M 1
    
    SG25 M 99
    52 NAD M 1
        SG29 C 9
        53 RFF M 1
    
    SG25 D 99
    54 NAD M 1
    
    SG25 R 99
    55 NAD M 1
    
    SG25 R 99
    56 NAD M 1
        SG26 C 9
        57 CTA O 1
        58 COM O 9
    

    None of the conditional segments were present in the data I was trying to parse; I ended up fixing it using:

                        "name": "SG25-SENDER",
                        "min": 1,
                        "type": "segment_group",
                        "child_segments": [
                          {
                            "name": "NAD",
                            "min": 1,
                            "elements": [
                              { "name": "cityName", "index": 1 },
                              { "name": "provinceCode", "index": 2 },
                              { "name": "postalCode", "index": 3 },
                              { "name": "countryCode", "index": 4 }
                            ]
                          },
                          { "name": "LOC", "min": 0 }
                        ]
                      },
                      {
                        "name": "SG25-RECEIVER",
                        "min": 1,
                        "type": "segment_group",
                        "child_segments": [
                          { "name": "NAD", "min": 1 },
                          { "name": "LOC", "min": 0 },
                          {
                            "name": "SG29",
                            "min": 0,
                            "type": "segment_group",
                            "child_segments": [{ "name": "RFF", "min": 1 }]
                          }
                        ]
                      },
                      {
                        "name": "SG25-OTHERS",
                        "min": 0,
                        "max": 99,
                        "type": "segment_group",
                        "child_segments": [
                          {
                            "name": "SG26",
                            "min": 0,
                            "type": "segment_group",
                            "child_segments": [
                              { "name": "CTA", "min": 0 },
                              { "name": "COM", "min": 0, "max": -1 }
                            ]
                          },
                          { "name": "NAD", "min": 0, "max": -1 },
                          { "name": "LOC", "min": 0 },
                          {
                            "name": "SG29",
                            "min": 0,
                            "type": "segment_group",
                            "child_segments": [{ "name": "RFF", "min": 1 }]
                          }
                        ]
                      },
    

    The message I'm trying to parse

    NAD+CZ+46388514++Foo A/S+Foo 2+Foo++Foo+DK'
    NAD+CN+46448510++NL01001 Foo Foo Foo:Foo+Foo 6+Foo++Foo+NL'
    CTA+CN+AS:NL01001 Foo'
    COM+0031765140344:TE'
    [email protected]:EM'
    NAD+LP+04900000250'
    

    Which basically means: grab the two explicit ones (luckily at the top), and do as you wish with the others in whatever order you encounter them. I'm not sure how I would have handled it if I did care about NAD+LP.

    Also, I had to use min/max of 1 instead of the specified 99, as the parser only considers the segment name NAD, not NAD+FIRSTVALUE, when 'collapsing' similar but not identical segments.

    Basically, the EDI specification has a lot of implicitness, which I think makes it quite hard to parse.

    EDI 
    opened by DGollings 6
  • supports a "segment_prefix" in the edi parser file declaration

    I'm working with a non-standard EDI format that includes a segment prefix. For example, a message might be:

    |HDR|1|2|3|
    |DAT|X|
    |EOF|
    

    where every segment begins with a pipe. I thought that I could get around this by making the segment delimiter include the next pipe (ie |\n|), but this doesn't catch the very first pipe.

    I propose including a new (optional) "segment_prefix" field in the file_declaration to catch segment prefixes.

    opened by samolds 5
  • Edi parsing failing, with error segment needs min occur 1, but only got 0

    opened by rohanbr 5
  • [EDI] Handle (or ignore?) line endings

    Consider the header

    UNA:+.? '

    This wouldn't be

        "component_delimiter": ":",
        "element_delimiter": "+",
        "segment_delimiter": "'",
        "release_character": "?",
    

    But instead is

        "component_delimiter": ":",
        "element_delimiter": "+",
        "segment_delimiter": "'\n", <--
        "release_character": "?",
    

    That is, when the file you're trying to read contains line endings, the segment delimiter has to absorb them. Maybe the safest/easiest option would be to remove all line endings first?

    EDI 
    opened by DGollings 5
  • Error in go get install latest version

    Hi,

    I am getting the following error while installing the latest version, v1.0.2, of omniparser:

    extensions/omniv21/fileformat/flatfile/fixedlength/.snapshots/TestReadAndMatchRowsBasedEnvelope-non-empty_buf;_no_read;_match;_create_IDR: malformed file path "extensions/omniv21/fileformat/flatfile/fixedlength/.snapshots/TestReadAndMatchRowsBasedEnvelope-non-empty_buf;_no_read;_match;_create_IDR": invalid char ';'

    Probably a file with an accidentally invalid name (containing ';') was checked in.

    opened by deokapil 4
  • JSON/XML to EDI conversion

    Could you please clarify whether JSON/XML data can be converted to EDI EDIFACT data? If feasible, could you please provide a small example? Thank you in advance.

    opened by sankethpb 4
  • Ignore blank rows CSV

    Hi,

    I encountered a CSV (converted from xlsx) where one line was all blank (,,,,,,).

    It errored with 'cannot convert "" to int'.

    Now I fixed it at the source, but in general it might be useful to add an 'ignore blank rows' option, or to be able to set a default for empty values (like you can in EDI), especially when casting (in this case, casting '' to 0).

    opened by DGollings 4
  • EDIFACT parser segment skip

    Could you please let me know if it's possible to skip a segment when it's not declared in the schema but is present in the input, and vice versa?

    Error generated:

    bad request: transform failed. err: input 'test-input' at segment no.8 (char[247,247]): segment 'details/NAD' needs min occur 1, but only got 0

    Input:

    UNA:+.? '
    UNB+UNOC:3+9999999999999:14+9999999999998:14+210419:1622+446047262+ORDERS'
    UNH+1+ORDERS:D:96A:UN:EAN008'
    BGM+220::9+6666666666+9'
    DTM+137:20210419:102'
    DTM+2:20210518:102'
    FTX+PUR+3++STORE ORDER:DR01'
    RFF+PUR+3++STORE ORDER PLEN:DR01'
    NAD+BY+9999999999999::9'

    Schema:

    {
      "parser_settings": { "version": "omni.2.1", "file_format_type": "edi" },
      "file_declaration": {
        "segment_delimiter": "'",
        "element_delimiter": "+",
        "component_delimiter": ":",
        "ignore_crlf": true,
        "segment_declarations": [
          {
            "name": "details", "is_target": true, "type": "segment_group", "min": 0, "max": -1,
            "child_segments": [
              { "name": "UNA", "elements": [ { "name": "random1", "index": 1 } ] },
              { "name": "UNB", "elements": [ { "name": "syntaxIdentifier", "index": 1 }, { "name": "buyerGln", "index": 2 }, { "name": "sellerGln", "index": 3 }, { "name": "docDate", "index": 4 }, { "name": "transferNumber", "index": 5 }, { "name": "documentType", "index": 6 } ] },
              { "name": "UNH", "elements": [ { "name": "documentType2", "index": 1 }, { "name": "fileFormatType", "index": 2 } ] },
              { "name": "BGM", "elements": [ { "name": "orderType", "index": 1 }, { "name": "orderNumber", "index": 2 }, { "name": "SignatureForOriginal", "index": 3 } ] },
              { "name": "DTM", "elements": [ { "name": "qualifierDocDate", "index": 1, "component_index": 1 }, { "name": "docDate", "index": 1, "component_index": 2 }, { "name": "formatDate", "index": 1, "component_index": 3 } ] },
              { "name": "DTM", "elements": [ { "name": "qualifierDeliveryDate", "index": 1, "component_index": 1 }, { "name": "deliveryDate", "index": 1, "component_index": 2 }, { "name": "deliveryformatDate", "index": 1, "component_index": 3 } ] },
              { "name": "FTX", "elements": [ { "name": "containPurchaseInformation", "index": 1 }, { "name": "defaultValue", "index": 2 }, { "name": "freeText1", "index": 4, "component_index": 1 }, { "name": "freeText2", "index": 4, "component_index": 2 } ] },
              { "name": "NAD", "elements": [ { "name": "partyQualifier1", "index": 1 }, { "name": "partyGln1", "index": 2, "component_index": 1 }, { "name": "partyIDcode1", "index": 2, "component_index": 3 } ] }
            ]
          }
        ]
      },
      "transform_declarations": {
        "FINAL_OUTPUT": { "object": {
          "una_elem1": { "xpath": "UNA/random1" },
          "header1": { "object": { "syntaxIdentifier": { "xpath": "UNB/syntaxIdentifier" }, "buyerGln": { "xpath": "UNB/buyerGln" }, "sellerGln": { "xpath": "UNB/sellerGln" }, "docDate": { "xpath": "UNB/docDate" }, "transferNumber": { "xpath": "UNB/transferNumber" }, "documentType": { "xpath": "UNB/documentType" } } },
          "heade2": { "object": { "documentType2": { "xpath": "UNH/documentType2" }, "fileFormatType": { "xpath": "UNH/fileFormatType" } } },
          "document": { "object": { "documentType2": { "xpath": "BGM/orderType" }, "orderNumber": { "xpath": "BGM/orderNumber" }, "SignatureForOriginal": { "xpath": "BGM/SignatureForOriginal" } } },
          "docDate": { "object": { "qualifierDocDate": { "xpath": "DTM/qualifierDocDate" }, "docDate": { "xpath": "DTM/docDate" }, "formatDate": { "xpath": "DTM/formatDate" } } },
          "deliveryDate": { "object": { "qualifierDeliveryDate": { "xpath": "DTM/qualifierDeliveryDate" }, "deliveryDate": { "xpath": "DTM/deliveryDate" }, "deliveryformatDate": { "xpath": "DTM/deliveryformatDate" } } },
          "freeText": { "object": { "containPurchaseInformation": { "xpath": "FTX/containPurchaseInformation" }, "defaultValue": { "xpath": "FTX/defaultValue" }, "freeText1": { "xpath": "FTX/freeText1" }, "freeText2": { "xpath": "FTX/freeText2" } } },
          "PartyInformation1": { "object": { "partyQualifier": { "xpath": "NAD/partyQualifier1" }, "partyGln": { "xpath": "NAD/partyGln1" }, "partyIDcode": { "xpath": "NAD/partyIDcode1" } } }
        } }
      }
    }

    opened by cnnelrib 3
Releases: v1.0.4
Owner: JF Technology