Gota: DataFrames and data wrangling in Go (Golang)

Gota: DataFrames, Series and Data Wrangling for Go

This is an implementation of DataFrames, Series and data wrangling methods for the Go programming language. The API is still in flux so use at your own risk.

DataFrame

The term DataFrame typically refers to a tabular dataset that can be viewed as a two-dimensional table. Often the columns of this dataset refer to a list of features, while the rows represent a number of measurements. As real-world data is not perfect, DataFrame supports missing measurements, or NaN elements.

Common examples of DataFrames can be found in Excel sheets, CSV files or SQL database tables, but this data can come in a variety of other formats, like a collection of JSON objects or XML files.

The utility of DataFrames lies in the ability to subset them, merge them, summarize the data for individual features or apply functions to entire rows or columns, all while keeping column type integrity.

Usage

Loading data

DataFrames can be constructed by passing Series to the dataframe.New constructor function:

df := dataframe.New(
	series.New([]string{"b", "a"}, series.String, "COL.1"),
	series.New([]int{1, 2}, series.Int, "COL.2"),
	series.New([]float64{3.0, 4.0}, series.Float, "COL.3"),
)

You can also load the data directly from other formats. The base loading function takes some records in the form [][]string and returns a new DataFrame from there:

df := dataframe.LoadRecords(
    [][]string{
        []string{"A", "B", "C", "D"},
        []string{"a", "4", "5.1", "true"},
        []string{"k", "5", "7.0", "true"},
        []string{"k", "4", "6.0", "true"},
        []string{"a", "2", "7.1", "false"},
    },
)

You can also create DataFrames by loading a slice of arbitrary structs:

type User struct {
	Name     string
	Age      int
	Accuracy float64
	ignored  bool // ignored since unexported
}
users := []User{
	{"Aram", 17, 0.2, true},
	{"Juan", 18, 0.8, true},
	{"Ana", 22, 0.5, true},
}
df := dataframe.LoadStructs(users)

By default, the column types are auto-detected, but this can be configured. For example, if we want the default type to be Float but columns A and D to be String and Bool respectively:

df := dataframe.LoadRecords(
    [][]string{
        []string{"A", "B", "C", "D"},
        []string{"a", "4", "5.1", "true"},
        []string{"k", "5", "7.0", "true"},
        []string{"k", "4", "6.0", "true"},
        []string{"a", "2", "7.1", "false"},
    },
    dataframe.DetectTypes(false),
    dataframe.DefaultType(series.Float),
    dataframe.WithTypes(map[string]series.Type{
        "A": series.String,
        "D": series.Bool,
    }),
)

Similarly, you can load the data stored in a []map[string]interface{}:

df := dataframe.LoadMaps(
    []map[string]interface{}{
        map[string]interface{}{
            "A": "a",
            "B": 1,
            "C": true,
            "D": 0,
        },
        map[string]interface{}{
            "A": "b",
            "B": 2,
            "C": true,
            "D": 0.5,
        },
    },
)

You can also pass an io.Reader to the ReadCSV and ReadJSON functions; they work as expected provided that the data is well formed:

csvStr := `
Country,Date,Age,Amount,Id
"United States",2012-02-01,50,112.1,01234
"United States",2012-02-01,32,321.31,54320
"United Kingdom",2012-02-01,17,18.2,12345
"United States",2012-02-01,32,321.31,54320
"United Kingdom",2012-02-01,NA,18.2,12345
"United States",2012-02-01,32,321.31,54320
"United States",2012-02-01,32,321.31,54320
Spain,2012-02-01,66,555.42,00241
`
df := dataframe.ReadCSV(strings.NewReader(csvStr))

jsonStr := `[{"COL.2":1,"COL.3":3},{"COL.1":5,"COL.2":2,"COL.3":2},{"COL.1":6,"COL.2":3,"COL.3":1}]`
df := dataframe.ReadJSON(strings.NewReader(jsonStr))

Subsetting

We can subset our DataFrames with the Subset method. For example, if we want the first and third rows, we can do the following:

sub := df.Subset([]int{0, 2})

Column selection

If we want to select specific columns instead of subsetting the rows, we can pass either column indices or column names to the Select method:

sel1 := df.Select([]int{0, 2})
sel2 := df.Select([]string{"A", "C"})

Updating values

In order to update the values of a DataFrame we can use the Set method:

df2 := df.Set(
    []int{0, 2},
    dataframe.LoadRecords(
        [][]string{
            []string{"A", "B", "C", "D"},
            []string{"b", "4", "6.0", "true"},
            []string{"c", "3", "6.0", "false"},
        },
    ),
)

Filtering

For more complex row subsetting we can use the Filter method. For example, if we want the rows where the column "A" is equal to "a" or column "B" is greater than 4:

fil := df.Filter(
    dataframe.F{Colname: "A", Comparator: series.Eq, Comparando: "a"},
    dataframe.F{Colname: "B", Comparator: series.Greater, Comparando: 4},
)
fil2 := fil.Filter(
    dataframe.F{Colname: "D", Comparator: series.Eq, Comparando: true},
)

Filters passed to a single Filter call are combined with OR, whereas chained Filter calls behave as AND.
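
The OR/AND behaviour described above can be pictured with a small plain-Go sketch. It does not use Gota and is not how the library is implemented; record and filterAny are hypothetical names, and the rows mirror the LoadRecords example.

```go
package main

import "fmt"

// record is a hypothetical stand-in for one row of the example DataFrame.
type record struct {
	A string
	B int
	C float64
	D bool
}

// filterAny keeps the rows for which at least one predicate holds (OR).
func filterAny(rows []record, preds ...func(record) bool) []record {
	var out []record
	for _, r := range rows {
		for _, p := range preds {
			if p(r) {
				out = append(out, r)
				break
			}
		}
	}
	return out
}

func main() {
	rows := []record{
		{"a", 4, 5.1, true},
		{"k", 5, 7.0, true},
		{"k", 4, 6.0, true},
		{"a", 2, 7.1, false},
	}
	// One call, two predicates: A == "a" OR B > 4.
	fil := filterAny(rows,
		func(r record) bool { return r.A == "a" },
		func(r record) bool { return r.B > 4 },
	)
	// A second, chained call narrows the result: (...) AND D == true.
	fil2 := filterAny(fil, func(r record) bool { return r.D })
	fmt.Println(len(fil), len(fil2)) // prints "3 2"
}
```

Chaining works as AND simply because the second call only ever sees rows that survived the first one.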

Arrange

With Arrange a DataFrame can be sorted by the given column names:

sorted := df.Arrange(
    dataframe.Sort("A"),    // Sort in ascending order
    dataframe.RevSort("B"), // Sort in descending order
)
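
As a rough picture of the ordering requested above, the following plain-Go sketch sorts by A ascending and breaks ties by B descending. It assumes that the first sort key is the primary one; row and arrange are hypothetical names and the code does not use Gota.

```go
package main

import (
	"fmt"
	"sort"
)

// row is a hypothetical stand-in for one DataFrame row with columns A and B.
type row struct {
	A string
	B int
}

// arrange returns a copy of rows sorted by A ascending,
// with ties broken by B descending.
func arrange(rows []row) []row {
	out := append([]row(nil), rows...)
	sort.SliceStable(out, func(i, j int) bool {
		if out[i].A != out[j].A {
			return out[i].A < out[j].A
		}
		return out[i].B > out[j].B
	})
	return out
}

func main() {
	rows := []row{{"a", 4}, {"k", 5}, {"k", 4}, {"a", 2}}
	for _, r := range arrange(rows) {
		fmt.Println(r.A, r.B)
	}
	// prints:
	// a 4
	// a 2
	// k 5
	// k 4
}
```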

Mutate

If we want to replace a column, or add a new one at the end, based on a given Series, we can use the Mutate method:

// Replace column C with a new one
mut := df.Mutate(
    series.New([]string{"a", "b", "c", "d"}, series.String, "C"),
)
// Add a new column E
mut2 := df.Mutate(
    series.New([]string{"a", "b", "c", "d"}, series.String, "E"),
)

Joins

Different Join operations are supported (InnerJoin, LeftJoin, RightJoin, CrossJoin). To use these methods you have to specify the keys to be used for joining the DataFrames:

df := dataframe.LoadRecords(
    [][]string{
        []string{"A", "B", "C", "D"},
        []string{"a", "4", "5.1", "true"},
        []string{"k", "5", "7.0", "true"},
        []string{"k", "4", "6.0", "true"},
        []string{"a", "2", "7.1", "false"},
    },
)
df2 := dataframe.LoadRecords(
    [][]string{
        []string{"A", "F", "D"},
        []string{"1", "1", "true"},
        []string{"4", "2", "false"},
        []string{"2", "8", "false"},
        []string{"5", "9", "false"},
    },
)
join := df.InnerJoin(df2, "D")
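
To see which rows an inner join on "D" pairs up in the example above, here is a plain-Go sketch that works on the key column alone. It uses a map-based lookup, which is not necessarily what Gota does internally; pair and innerJoinIndex are hypothetical names, and a real join also combines the remaining columns.

```go
package main

import "fmt"

// pair holds the row positions of one matched left/right combination.
type pair struct{ Left, Right int }

// innerJoinIndex returns every (left, right) pair of positions whose keys are
// equal. It indexes the right-hand keys in a map, then probes it per left row.
func innerJoinIndex(left, right []string) []pair {
	index := make(map[string][]int, len(right))
	for j, k := range right {
		index[k] = append(index[k], j)
	}
	var out []pair
	for i, k := range left {
		for _, j := range index[k] {
			out = append(out, pair{i, j})
		}
	}
	return out
}

func main() {
	// Values of column "D" in df and df2 from the example above.
	dfD := []string{"true", "true", "true", "false"}
	df2D := []string{"true", "false", "false", "false"}

	pairs := innerJoinIndex(dfD, df2D)
	fmt.Println(len(pairs)) // prints "6"
	fmt.Println(pairs[0])   // prints "{0 0}"
}
```

Three rows of df match the single "true" row of df2 and the one "false" row of df matches three rows of df2, hence six result rows.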

Function application

Functions can be applied to the rows or columns of a DataFrame, casting the types as necessary:

mean := func(s series.Series) series.Series {
    floats := s.Float()
    sum := 0.0
    for _, f := range floats {
        sum += f
    }
    return series.Floats(sum / float64(len(floats)))
}
df.Capply(mean)
df.Rapply(mean)
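
The difference between applying a function per column and per row can be sketched without Gota on a plain numeric table. This is only an illustration that assumes every cell is a float; colMeans and rowMeans are hypothetical helpers, not library functions.

```go
package main

import "fmt"

// mean returns the arithmetic mean of xs, or 0 for an empty slice.
func mean(xs []float64) float64 {
	if len(xs) == 0 {
		return 0
	}
	sum := 0.0
	for _, x := range xs {
		sum += x
	}
	return sum / float64(len(xs))
}

// colMeans applies mean to every column of a rectangular, row-major table.
func colMeans(table [][]float64) []float64 {
	if len(table) == 0 {
		return nil
	}
	out := make([]float64, len(table[0]))
	col := make([]float64, len(table))
	for j := range out {
		for i, row := range table {
			col[i] = row[j]
		}
		out[j] = mean(col)
	}
	return out
}

// rowMeans applies mean to every row of the table.
func rowMeans(table [][]float64) []float64 {
	out := make([]float64, len(table))
	for i, row := range table {
		out[i] = mean(row)
	}
	return out
}

func main() {
	table := [][]float64{{1, 3}, {2, 5}, {3, 7}}
	fmt.Println(colMeans(table)) // prints "[2 5]"
	fmt.Println(rowMeans(table)) // prints "[2 3.5 5]"
}
```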

Chaining operations

DataFrames support a number of methods for wrangling the data: filtering, subsetting, selecting columns, adding new columns or modifying existing ones. All these methods can be chained one after another, and at the end of the chain you can check whether any error occurred by inspecting the DataFrame Err field. If any of the methods in the chain returns an error, the remaining operations in the chain become no-ops.

a = a.Rename("Origin", "Country").
    Filter(dataframe.F{Colname: "Age", Comparator: series.Less, Comparando: 50}).
    Filter(dataframe.F{Colname: "Origin", Comparator: series.Eq, Comparando: "United States"}).
    Select([]string{"Id", "Origin", "Date"}).
    Subset([]int{1, 3})
if a.Err != nil {
    log.Fatal("Oh noes!")
}
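
The error-carrying chain can be illustrated with a miniature type of our own. This is a sketch of the general pattern only, not Gota's code; frame and keep are hypothetical names.

```go
package main

import (
	"errors"
	"fmt"
)

// frame is a hypothetical miniature of a chainable value that carries its own error.
type frame struct {
	rows []int
	Err  error
}

// keep returns the rows at the given positions. If an earlier step already
// failed it does nothing; if a position is out of range it records an error.
func (f frame) keep(idx ...int) frame {
	if f.Err != nil {
		return f // an earlier step failed: no-op
	}
	out := make([]int, 0, len(idx))
	for _, i := range idx {
		if i < 0 || i >= len(f.rows) {
			return frame{Err: errors.New("keep: index out of range")}
		}
		out = append(out, f.rows[i])
	}
	return frame{rows: out}
}

func main() {
	f := frame{rows: []int{10, 20, 30}}

	ok := f.keep(0, 2).keep(1)
	fmt.Println(ok.rows, ok.Err) // prints "[30] <nil>"

	bad := f.keep(5).keep(0) // the second call is skipped
	fmt.Println(bad.Err)     // prints "keep: index out of range"
}
```

The benefit of this design is that the caller needs a single error check at the end of the chain instead of one after every step.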

Print to console

fmt.Println(flights)

> [336776x20] DataFrame
> 
>     X0    year  month day   dep_time sched_dep_time dep_delay arr_time ...
>  0: 1     2013  1     1     517      515            2         830      ...
>  1: 2     2013  1     1     533      529            4         850      ...
>  2: 3     2013  1     1     542      540            2         923      ...
>  3: 4     2013  1     1     544      545            -1        1004     ...
>  4: 5     2013  1     1     554      600            -6        812      ...
>  5: 6     2013  1     1     554      558            -4        740      ...
>  6: 7     2013  1     1     555      600            -5        913      ...
>  7: 8     2013  1     1     557      600            -3        709      ...
>  8: 9     2013  1     1     557      600            -3        838      ...
>  9: 10    2013  1     1     558      600            -2        753      ...
>     ...   ...   ...   ...   ...      ...            ...       ...      ...
>     <int> <int> <int> <int> <int>    <int>          <int>     <int>    ...
> 
> Not Showing: sched_arr_time <int>, arr_delay <int>, carrier <string>, flight <int>,
> tailnum <string>, origin <string>, dest <string>, air_time <int>, distance <int>, hour <int>,
> minute <int>, time_hour <string>

Interfacing with gonum

A gonum/mat.Matrix or any object that implements the dataframe.Matrix interface can be loaded as a DataFrame by using the LoadMatrix() function. To convert a DataFrame to a mat.Matrix, a small wrapper type with the required methods is needed. Since a DataFrame already implements the Dims() (r, c int) method, only the At and T methods have to be implemented:

type matrix struct {
	dataframe.DataFrame
}

func (m matrix) At(i, j int) float64 {
	return m.Elem(i, j).Float()
}

func (m matrix) T() mat.Matrix {
	return mat.Transpose{m}
}

Series

Series are essentially vectors of elements of the same type with support for missing values. Series are the building blocks for DataFrame columns.

Four types are currently supported:

Int
Float
String
Bool

For more information about the API, make sure to check:

License

Copyright 2016 Alejandro Sanchez Brotons

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.

Comments
  • panic: runtime error: invalid memory address or nil pointer dereference

    When I apply an aggregation to a DataFrame, a panic like this occurs:

    panic: runtime error: invalid memory address or nil pointer dereference

    The full stack trace is as follows:

    panic: runtime error: invalid memory address or nil pointer dereference
    [signal 0xc0000005 code=0x0 addr=0x20 pc=0x38c064]
    
    goroutine 1 [running]:
    github.com/go-gota/gota/series.Series.Len(...)
            D:/ProgramFiles/goplus/pkg/mod/github.com/go-gota/[email protected]/series/series.go:560
    github.com/go-gota/gota/dataframe.Groups.Aggregation(0xc005d41b30, 0xc000270f00, 0x2, 0x2, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
            D:/ProgramFiles/goplus/pkg/mod/github.com/go-gota/[email protected]/dataframe/dataframe.go:504 +0x904
    

    My code looks like this:

    agg := gdf.Aggregation([]dataframe.AggregationType{dataframe.Aggregation_COUNT}, []string{"countn"})
    

    How can I solve it? Thank you.

    bug 
    opened by jameschuh 12
  • Combining filters with AND and user-defined filters

    This PR refines filtering of DataFrames:

    (1) support for combining filters with AND without having to chain the filters, which should be more performant as rows only need to be traversed once (df.FilterAggregation)
    (2) support for user-defined filters (series.CompFunc and func(el series.Element) bool)
    (3) test cases for both new features
    (4) updated README reflecting the additions

    It should be backwards compatible as df.Filter retains OR semantics.

    Cheers Christoph

    opened by chrstphlbr 10
  • Add DataFrame.Describe for reporting summary statistics

    This PR adds some summary statistics helper functions to the Series struct type and the DataFrame.Describe function to obtain summary statistics for the given DataFrame. This is intended to replicate the behavior of the pandas DataFrame.describe() function in Python.

    opened by danicat 8
  • Filter with In on Quoted String returns False

    I'm attempting to Filter on a String list that is quoted. My Filter is as follows:

    df = df.Filter(dataframe.F{ Colname: "XXX", Comparator: series.In, Comparando: "LA", })

    Here are some sample rows from the column I am filtering on. The strings can look like: "DC,FC,FS,FW,LA,LC,MG" "DC,FC,FS,FW,LA,LC,MG" "DC,FC,FS,FW,LA,LC,MG" "DC,FC,FS,FW,LA,LC,MG" "DC,FC,FS,FW,LA,LC,MG" "CC,DC,FR,FW,KH,MG,WD,WB" "IS,KH,MG,WD" "CC,FC,FS,SC" "IS,KH,MG,WD" "FC,LA,LC,UQ" "CC,CF,CS,FC,FS,KH,LA,LC,MG,WD,WB" "CC,FC,FS,SC" "DC,FR" "DC,FR" UNK UNK "DC,FR" FW

    This should return 7 rows: "DC,FC,FS,FW,LA,LC,MG" "DC,FC,FS,FW,LA,LC,MG" "DC,FC,FS,FW,LA,LC,MG" "DC,FC,FS,FW,LA,LC,MG" "DC,FC,FS,FW,LA,LC,MG" "FC,LA,LC,UQ" "CC,CF,CS,FC,FS,KH,LA,LC,MG,WD,WB"

    But when I run this I get 0 rows back. I think this could be due to the quoted strings, which I can't control since they come from a CSV file. Or is there a way to pass a regex or a wildcard as my comparando?

    opened by dp1140a 7
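
One possible explanation, offered as an assumption rather than a confirmed diagnosis, is that series.In tests whether the whole cell equals one of the supplied values instead of searching inside the cell. Under that assumption a custom predicate is needed. The plain-Go sketch below shows such a predicate; hasCode is a hypothetical helper, and it could presumably be wrapped in a user-defined filter of the func(el series.Element) bool kind mentioned in the filtering PR listed above.

```go
package main

import (
	"fmt"
	"strings"
)

// hasCode reports whether code is one of the comma-separated entries in cell.
// Surrounding double quotes and spaces are ignored.
func hasCode(cell, code string) bool {
	cell = strings.Trim(strings.TrimSpace(cell), `"`)
	for _, part := range strings.Split(cell, ",") {
		if strings.TrimSpace(part) == code {
			return true
		}
	}
	return false
}

func main() {
	cells := []string{
		`"DC,FC,FS,FW,LA,LC,MG"`,
		`"FC,LA,LC,UQ"`,
		`"CC,FC,FS,SC"`,
		"UNK",
		"FW",
	}
	for _, c := range cells {
		fmt.Println(c, hasCode(c, "LA"))
	}
}
```

Splitting on commas rather than using a substring search avoids false positives such as matching "LA" inside a longer code.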
  • How I can apply multiple values to a Dataframe Filter function.

    I have the following:

    df = df.Rename("Account", "AccountID").
        Filter(dataframe.F{"Event", "==", "bounced"}).
        Filter(dataframe.F{"Event", "==", "Sent"})

    I want to do it like this:

    df = df.Rename("Account", "AccountID").
        Filter(dataframe.F{"Event", "==", ["bounced","Sent"]})

    Is that possible?

    opened by arshpreetsingh 7
  • DataFrame ToMatrix function

    Could you, please, implement the dataframe toMatrix (mat.Matrix) function which is hinted in the readme?

    I am new to golang and am trying to replicate a Python pipeline (as part of transitioning to golang) which uses StandardScaler, but I get the mat.Matrix errors shown below.

    Note that I am using the golang "github.com/pa-m/sklearn/preprocessing" package.

    # command-line-arguments
    ./go_csv.go:18:10: m.columns undefined (type matrix has no field or method columns)
    ./go_csv.go:21:21: undefined: mat64
    ./go_csv.go:22:9: undefined: mat64
    ./go_csv.go:123:12: cannot use selDf1 (type dataframe.DataFrame) as type mat.Matrix in argument to scaler.Fit: dataframe.DataFrame does not implement mat.Matrix (missing At method)
    ./go_csv.go:125:27: cannot use selDf1 (type dataframe.DataFrame) as type mat.Matrix in argument to scaler.Transform: dataframe.DataFrame does not implement mat.Matrix (missing At method)

    opened by vitasiku 6
  • Support Append data (new row) to a Series and/or DataFrame

    I can see that func (s *Series) Append(values interface{}) can be used to append data to the end of the Series.

    How can I insert a new row at an arbitrary position?

    It would be great to be able to do it for a DataFrame too. Not just for a Series.

    opened by pjebs 6
  • Allowing type specification through a map rather than a variadic string argument would be more flexible

    Right now I have to either specify all types or no types at all. Specifying the types in a map[string]string (column name -> type name) would add the possibility to specify types only for the columns you want and fall back to auto typing for the other columns.

    It could possibly also shorten the code in ReadRecords by simply checking if the column name is in the map and, if not, falling back to findType.

    What do you think?

    enhancement 
    opened by tobgu 6
  • Problem installing on windows 10 x64 go 16

    I'm trying to install dataframes with go 16.6 on windows 10 x64 and got the following message. After that the go.mod file is not updated:

    go get: github.com/kniren/[email protected] updating to
            github.com/kniren/[email protected]: parsing go.mod:     
            module declares its path as: github.com/go-gota/gota
                    but was required as: github.com/kniren/gota
    
    opened by jcbritobr 4
  • feature: groupby and Aggregation

    I am a fan of Gota, I hope Gota will become better

    ADD Feature:

    1. Groupby
    2. Aggregation
    groups := df.GroupBy("key1", "key2") // Group by column "key1", and column "key2" 
    aggre := groups.Aggregation([]AggregationType{Aggregation_MAX, Aggregation_MIN}, []string{"values", "values2"}) // Maximum value in column "values",  Minimum value in column "values2"
    

    opened by wanglong001 4
  • New release tag

    v0.9.0 does not include fix of broken import.

    https://github.com/go-gota/gota/commit/7d8acfb8259f6135fb372d0966d9b6a23161c430#diff-80cec47501f12ea2f50aa0ff5c6bca95

    Could you please tag a newer release?

    opened by mattn 4
  • Is there a way to  transform DataFrame to user customer struct?

    I want something like json.Unmarshal, with code like this:

    type People struct {
    	Name string `dataframe:"name"`
    	Age  int    `dataframe:"age"`
    	Male bool   `dataframe:"male"`
    }
    df := dataframe.LoadRecords(
    	[][]string{
    		[]string{"name", "age", "male"},
    		[]string{"Cloris", "24", "false"},
    		[]string{"Ben", "55", "true"},
    	},
    	dataframe.DetectTypes(false),
    	dataframe.DefaultType(series.String),
    	dataframe.WithTypes(map[string]series.Type{
    		"age":  series.Int,
    		"male": series.Bool,
    	}),
    )
    var peoples []People
    df.Unmarshal(&peoples)
    

    and peoples would then contain:

    // peoples
    // People{Name:"Cloris", Age:24, Male:false}
    // People{Name:"Ben", Age:55, Male:true}
    
    opened by liracle 3
  • Gota installation error on Go. 1.18 MAC OS X

    I've tried to install Gota on MAC OS X 12.1 with Go 1.18 and the below error pops out:

    % go get -v https://github.com/go-gota/gota            
    go: malformed module path "https:/github.com/go-gota/gota": invalid char ':'
    

    I can install other Go packages with no issues. Is that invalid character related to OS?

    opened by alirezastack 1
  • Performance issue

    Hi,

    I am trying to perform a left join on two DataFrames (df1 = df1.LeftJoin(df2.Select([]string{"col1","col2","col3", "col4", "col5"}), "col1", "col2")), where df1 has around 50k rows and df2 has around 91k rows, but it takes around 19 minutes to complete. Is there any way to reduce the execution time of this operation, or am I doing it the wrong way?

    Any suggestions on this performance issue would be great.

    opened by susanth19 0
  • How to fill forward?

    Hello, I am used to using Pandas ffill function and trying to figure out how to do the same here. I have a dataframe that looks like this:

        Date       Stock GDP
     0: 2021-10-01 45301.73 24002.815
     1: 2021-10-08 45244.34 NaN
     2: 2021-10-15 45677.43 NaN
     3: 2021-10-22 47000.43 NaN
     4: 2021-10-29 47450.01 NaN
     5: 2021-11-05 48330.86 NaN
     6: 2021-11-12 48497.45 NaN
     7: 2021-11-19 48638.28 NaN
     8: 2021-11-26 48211.43 NaN
     9: 2021-12-03 47034.66 NaN
    

    The GDP data is only quarterly, so I'd like to fill those rows with the data from index 0. How can I achieve the same using gota?

    opened by crhuber 0
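
A forward fill can be sketched in plain Go on the float values of a column. This is a hedged workaround, not a Gota feature: forwardFill is a hypothetical helper and the sample numbers are illustrative. The input could come from a Series via Float(), as in the mean example in the README, and the result could be turned back into a column with series.New and Mutate.

```go
package main

import (
	"fmt"
	"math"
)

// forwardFill returns a copy of xs in which every NaN is replaced by the most
// recent non-NaN value before it. Leading NaNs are left unchanged.
func forwardFill(xs []float64) []float64 {
	out := make([]float64, len(xs))
	last := math.NaN()
	for i, x := range xs {
		if !math.IsNaN(x) {
			last = x
		}
		out[i] = last
	}
	return out
}

func main() {
	gdp := []float64{24002.815, math.NaN(), math.NaN(), 25000, math.NaN()}
	fmt.Println(forwardFill(gdp))
	// prints "[24002.815 24002.815 24002.815 25000 25000]"
}
```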
Releases(v0.12.0)