Run your MapReduce workloads as a single binary on a single machine with multiple CPUs and plenty of memory. On most cloud providers, many small machines cost about the same as one large machine of equivalent total capacity.

Overview

gomap

Usage

Import

Context to start using gomap:

import "github.com/in4it/gomap/pkg/context"

Utils and types (for conversions):

import (
  "github.com/in4it/gomap/pkg/utils"
  "github.com/in4it/gomap/pkg/types"
)

WordCount Example

package main

import (
	"fmt"
	"os"
	"strings"

	"github.com/in4it/gomap/pkg/context"
	"github.com/in4it/gomap/pkg/types"
	"github.com/in4it/gomap/pkg/utils"
)

// Print a wordcount of an input file
func main() {
	c := context.New()
	err := c.Read("testdata/sentences.txt").FlatMap(func(str types.RawInput) []types.RawOutput {
		return utils.StringArrayToRawOutput(strings.Split(string(str), " "))
	}).MapToKV(func(input types.RawInput) (types.RawOutput, types.RawOutput) {
		return utils.RawInputToRawOutput(input), utils.StringToRawOutput("1")
	}).ReduceByKey(func(a, b types.RawInput) types.RawOutput {
		return utils.IntToRawOutput(utils.RawInputToInt(a) + utils.RawInputToInt(b))
	}).Run().Print()
	
	if err != nil {
		fmt.Printf("Error: %s", err)
		os.Exit(1)
	}
}
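
To make the three steps of the pipeline concrete, here is a minimal plain-Go sketch of what FlatMap, MapToKV and ReduceByKey compute for a word count. It does not use gomap and is not how gomap is implemented; the helper names flatMap and reduceByKey are hypothetical, and it splits on any whitespace with strings.Fields rather than on a single space.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// flatMap splits every line into words and flattens the result
// into one slice (the FlatMap step).
func flatMap(lines []string) []string {
	var words []string
	for _, line := range lines {
		words = append(words, strings.Fields(line)...)
	}
	return words
}

// reduceByKey pairs every word with the value 1 (the MapToKV step)
// and sums the values of identical keys (the ReduceByKey step).
func reduceByKey(words []string) map[string]int {
	counts := make(map[string]int)
	for _, w := range words {
		counts[w] += 1
	}
	return counts
}

func main() {
	lines := []string{"the quick brown fox", "the lazy dog"}
	counts := reduceByKey(flatMap(lines))

	// Sort the keys so the output order is deterministic.
	keys := make([]string, 0, len(counts))
	for k := range counts {
		keys = append(keys, k)
	}
	sort.Strings(keys)
	for _, k := range keys {
		fmt.Printf("%s: %d\n", k, counts[k])
	}
}
```

In the gomap version, the same logic is expressed as chained calls, and the values travel between steps as types.RawInput and types.RawOutput instead of Go strings and ints.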

Parquet example

package main

import (
  "github.com/in4it/gomap/pkg/context"
  "github.com/in4it/gomap/pkg/utils"
  "github.com/in4it/gomap/pkg/types"
)

// define parquet schema
type ParquetLine struct {
  Word  string `parquet:"name=word, type=UTF8"`
  Count int64  `parquet:"name=count, type=INT64"` 
}

// Group the rows of a Parquet input by word
func main() {
	c := context.New()
	err := c.ReadParquet("s3://bucket/directory/", new(ParquetLine)).MapToKV(func(input types.RawInput) (types.RawOutput, types.RawOutput) {
		var line ParquetLine
		err := utils.RawDecode(input, &line)
		if err != nil {
			panic(err)
		}
		return utils.StringToRawOutput(line.Word), utils.RawEncode([]ParquetLine{line})
	}).ReduceByKey(func(a, b types.RawInput) types.RawOutput {
		var line1 []ParquetLine
		var line2 []ParquetLine
		err := utils.RawDecode(a, &line1)
		if err != nil {
			panic(err)
		}
		err = utils.RawDecode(b, &line2)
		if err != nil {
			panic(err)
		}
		return utils.RawEncode(append(line1, line2...))
	}).Run().Foreach(func(key, value types.RawOutput) {
		var lines []ParquetLine
		err := utils.RawDecode(value, &lines)
		if err != nil {
			panic(err)
		}
		//
		// you can now use string(key) and lines ([]ParquetLine)
		//
	})

	if err != nil {
		panic(err)
	}
}
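
The Parquet example uses ReduceByKey to group rows: each row is wrapped in a one-element slice, and slices that share a key are appended to each other. A minimal plain-Go sketch of that pattern, without gomap and without the encode/decode steps, could look like the following. The line type and groupByWord function are hypothetical stand-ins.

```go
package main

import "fmt"

// line mirrors the fields of the ParquetLine struct in the example.
type line struct {
	Word  string
	Count int64
}

// groupByWord wraps each record in a one-element slice (as MapToKV does
// in the example) and merges slices that share a key by appending them
// (as the ReduceByKey function does).
func groupByWord(records []line) map[string][]line {
	groups := make(map[string][]line)
	for _, r := range records {
		single := []line{r}
		groups[r.Word] = append(groups[r.Word], single...)
	}
	return groups
}

func main() {
	groups := groupByWord([]line{{"go", 1}, {"map", 2}, {"go", 3}})
	fmt.Println(len(groups["go"]), len(groups["map"])) // prints "2 1"
}
```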

Memory usage and spill to disk

If you don't want to keep the full data set in memory, you can specify a buffer limit. A buffer is kept between steps (Map, FlatMap, ReduceByKey, ...). By configuring a different writer for this buffer, you can control the memory usage.

Default writer (MemoryWriter)

	c := New()
	c.SetConfig(Config{
		bufferWriter: writers.NewMemoryWriter(),
	})

Memory and Disk Writer (MemoryAndDiskWriter)

	c := New()
	c.SetConfig(Config{
		// The argument is a size in bytes. Once the buffer exceeds 5 MB, it starts spilling to disk.
		bufferWriter: writers.NewMemoryAndDiskWriter(5 * 1024 * 1024),
	})
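
The idea behind a memory-and-disk writer can be sketched in plain Go as follows. This is an illustration under stated assumptions, not gomap's implementation: the spillBuffer type and its methods are hypothetical, and the sketch simply moves everything to a temporary file once a byte limit is exceeded.

```go
package main

import (
	"bytes"
	"fmt"
	"io"
	"os"
)

// spillBuffer is a hypothetical type, not part of gomap. It buffers data
// in memory until more than limit bytes would be held, then moves the
// data to a temporary file and keeps writing there.
type spillBuffer struct {
	limit int
	mem   bytes.Buffer
	file  *os.File // nil until the buffer has spilled
}

// Write appends p, spilling the in-memory contents to disk first if the
// limit would be exceeded.
func (b *spillBuffer) Write(p []byte) (int, error) {
	if b.file == nil && b.mem.Len()+len(p) > b.limit {
		f, err := os.CreateTemp("", "spill-*")
		if err != nil {
			return 0, err
		}
		// WriteTo drains the in-memory buffer into the file.
		if _, err := b.mem.WriteTo(f); err != nil {
			f.Close()
			os.Remove(f.Name())
			return 0, err
		}
		b.file = f
	}
	if b.file != nil {
		return b.file.Write(p)
	}
	return b.mem.Write(p)
}

// Spilled reports whether the data has been moved to disk.
func (b *spillBuffer) Spilled() bool { return b.file != nil }

// ReadAll returns everything written so far.
func (b *spillBuffer) ReadAll() ([]byte, error) {
	if b.file == nil {
		return b.mem.Bytes(), nil
	}
	if _, err := b.file.Seek(0, io.SeekStart); err != nil {
		return nil, err
	}
	return io.ReadAll(b.file)
}

// Close removes the temporary file, if one was created.
func (b *spillBuffer) Close() error {
	if b.file == nil {
		return nil
	}
	name := b.file.Name()
	err := b.file.Close()
	if rmErr := os.Remove(name); err == nil {
		err = rmErr
	}
	return err
}

func main() {
	buf := &spillBuffer{limit: 8} // tiny limit so the example spills
	defer buf.Close()
	for _, s := range []string{"hello", " ", "world"} {
		if _, err := buf.Write([]byte(s)); err != nil {
			fmt.Println("write failed:", err)
			return
		}
	}
	data, err := buf.ReadAll()
	if err != nil {
		fmt.Println("read failed:", err)
		return
	}
	fmt.Printf("spilled=%v data=%q\n", buf.Spilled(), data)
}
```

The trade-off is the usual one: a higher limit keeps more of the intermediate data in memory and avoids disk I/O, while a lower limit caps memory usage at the cost of slower steps.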

Currently implemented functions

Function       Description
Map            Transform a value
FlatMap        Transform a value into a slice and flatten the result
MapToKV        Transform a value into a key-value pair
ReduceByKey    Group unique keys and apply a reduce function
Foreach        Loop over the output of unique keys in a key-value result
Filter         Filter values
Print          Print output
Get            Get output values
GetKV          Get output keys and values

Current inputs

  • Text files (local & S3 using s3:// prefix)
  • Parquet (local & S3 using s3:// prefix)

Concurrency

When there are multiple input files, they are distributed across goroutines. On a machine with multiple cores, these goroutines can run in parallel.
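
As a rough illustration of this model, the sketch below processes each input in its own goroutine and merges the partial results afterwards. It is plain Go and not gomap's internal code; countWords and countAll are hypothetical names, and strings stand in for input files.

```go
package main

import (
	"fmt"
	"strings"
	"sync"
)

// countWords counts the words of a single input (a stand-in for one file).
func countWords(input string) map[string]int {
	counts := make(map[string]int)
	for _, w := range strings.Fields(input) {
		counts[w]++
	}
	return counts
}

// countAll processes every input in its own goroutine and merges the
// partial results once all goroutines have finished.
func countAll(inputs []string) map[string]int {
	// The channel is buffered so that no goroutine blocks on send.
	partials := make(chan map[string]int, len(inputs))
	var wg sync.WaitGroup
	for _, in := range inputs {
		wg.Add(1)
		go func(in string) {
			defer wg.Done()
			partials <- countWords(in)
		}(in)
	}
	wg.Wait()
	close(partials)

	total := make(map[string]int)
	for p := range partials {
		for w, n := range p {
			total[w] += n
		}
	}
	return total
}

func main() {
	total := countAll([]string{"a b a", "b c", "a"})
	fmt.Println(total["a"], total["b"], total["c"]) // prints "3 2 1"
}
```

Whether the goroutines actually run in parallel depends on the number of available cores (see runtime.GOMAXPROCS); on a single core they are interleaved instead.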

Run gomap on AWS

You can run gomap on AWS on a spot instance using the launcher.

Configuration

Example launch specification (if no AMI is supplied, the latest Ubuntu Bionic AMI is launched):

{
    "IamInstanceProfile": {
      "Arn": "arn:aws:iam::1234567890:instance-profile/gomap"
    },
    "InstanceType": "r4.large",
    "NetworkInterfaces": [
      {
        "DeviceIndex": 0,
        "Groups": ["sg-0123456789"],
        "SubnetId": "subnet-01234567890"
      }
    ]  
}

Note: the instance profile should have access to S3 and CloudWatch Logs.

Run

Download the wordcount and launch binaries from the release page, and run:

aws s3 cp wordcount-linux-amd64 s3://yourbucket/binaries/wordcount
./launch -launchSpecification launchspec.json -region eu-west-1 -cmd "./wordcount -input s3://yourbucket/inputfile.txt" -executable s3://yourbucket/binaries/wordcount