# Bigmachine

Bigmachine is a toolkit for building self-managing serverless applications in Go. Bigmachine provides an API that lets a driver process form an ad-hoc cluster of machines to which user code is transparently distributed.

User code is exposed through services, which are stateful Go objects associated with each machine. Services expose one or more Go methods that may be dispatched remotely. User services can call remote user services; the driver process may also make service calls.

Programs built using Bigmachine are agnostic to the underlying machine implementation, allowing distributed systems to be easily tested through an in-process implementation, or inspected during development using local Unix processes.

Bigmachine currently supports instantiating clusters of EC2 machines; other systems may be implemented with a relatively compact Go interface.

Help wanted!

# A walkthrough of a simple Bigmachine program

Command bigpi is a relatively silly use of cluster computing, but illustrative nonetheless. Bigpi estimates the value of $\pi$ by sampling $N$ random coordinates inside the unit square and counting how many of them, $C \le N$, fall inside the unit circle. Our estimate is then $\pi \approx 4C/N$.

This is inherently parallelizable: we can generate samples across a large number of nodes, and then when we're done, they can be summed up to produce our estimate of $\pi$.
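Concretely, the combining step is just arithmetic over the per-node hit counts. A minimal sketch (the counts and per-node sample size below are made up):

```go
package main

import (
	"fmt"
	"math/big"
)

// estimate combines per-node hit counts into a π estimate, computed
// exactly as 4*C/N with big.Rat and rendered to prec decimal places.
func estimate(counts []uint64, perNode uint64, prec int) string {
	var c, n uint64
	for _, m := range counts {
		c += m
		n += perNode
	}
	return big.NewRat(int64(4*c), int64(n)).FloatString(prec)
}

func main() {
	// Hypothetical counts from three nodes, 250000 samples each.
	fmt.Println(estimate([]uint64{196350, 196290, 196420}, 250000, 4)) // → 3.1417
}
```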

To do this in Bigmachine, we first define a service that samples some $n$ points and reports how many fell inside the unit circle.

```go
type circlePI struct{}

// Sample generates n points inside the unit square and reports
// how many of these fall inside the unit circle.
func (circlePI) Sample(ctx context.Context, n uint64, m *uint64) error {
	r := rand.New(rand.NewSource(rand.Int63()))
	for i := uint64(0); i < n; i++ {
		if i%1e7 == 0 {
			log.Printf("%d/%d", i, n)
		}
		x, y := r.Float64(), r.Float64()
		if (x-0.5)*(x-0.5)+(y-0.5)*(y-0.5) < 0.25 {
			*m++
		}
	}
	return nil
}
```


The only notable aspect of this code is the signature of Sample, which follows the schema below. Methods that follow this convention may be dispatched remotely by Bigmachine, as we shall see shortly.

```go
func (service) Name(ctx context.Context, arg argtype, reply *replytype) error
```


Next follows the program's func main. First, we do the regular kind of setup a main might: define some flags, parse them, set up logging. Afterwards, a driver must call driver.Start, which initializes Bigmachine and sets up the process so that it may be bootstrapped properly on remote nodes. (Package driver provides high-level facilities for configuring and bootstrapping Bigmachine; adventurous users may use the lower-level facilities in package bigmachine to accomplish the same.) driver.Start returns a *bigmachine.B, which can be used to start new machines.

```go
func main() {
	var (
		nsamples = flag.Int("n", 1e10, "number of samples to make")
		nmachine = flag.Int("nmach", 5, "number of machines to provision for the task")
	)
	flag.Parse()
	b := driver.Start()
	defer b.Shutdown()
	ctx := context.Background()
```


Next, we start a number of machines (as configured by flag nmach), wait for them to finish launching, and then distribute our sampling among them, using a simple "scatter-gather" RPC pattern. First, let's look at the code that starts the machines and waits for them to be ready.

```go
	// Start the desired number of machines,
	// each with the circlePI service.
	machines, err := b.Start(ctx, *nmachine, bigmachine.Services{
		"PI": circlePI{},
	})
	if err != nil {
		log.Fatal(err)
	}
	log.Print("waiting for machines to come online")
	for _, m := range machines {
		<-m.Wait(bigmachine.Running)
		if err := m.Err(); err != nil {
			log.Fatal(err)
		}
	}
```


Machines are started with (*B).Start, to which we provide the set of services that should be installed on each machine. (The service object provided is serialized and initialized on the remote machine, so it may include any desired parameters.) Start returns a slice of Machine instances representing each machine that was launched. Machines can be in a number of states. In this case, we keep it simple and just wait for them to enter the Running state, after which the underlying machines are fully bootstrapped and the services have been installed and initialized. At this point, all of the machines are ready to receive RPC calls.

The remainder of main distributes a portion of the total samples to each machine, waits for the calls to complete, and then prints the estimate with the precision warranted by the number of samples taken. Note that this code further subdivides the work by calling PI.Sample once for each processor available on the underlying machines, as reported by Machine.Maxprocs, which depends on the physical machine configuration.

```go
	// Number of samples per machine
	numPerMachine := uint64(*nsamples) / uint64(*nmachine)

	// Divide the total number of samples among all the processors on
	// each machine. Aggregate the counts and then report the estimate.
	var total uint64
	var cores int
	g, ctx := errgroup.WithContext(ctx)
	for _, m := range machines {
		m := m
		for i := 0; i < m.Maxprocs; i++ {
			cores++
			g.Go(func() error {
				var count uint64
				err := m.Call(ctx, "PI.Sample", numPerMachine/uint64(m.Maxprocs), &count)
				if err == nil {
					atomic.AddUint64(&total, count)
				}
				return err
			})
		}
	}
	log.Printf("distributing work among %d cores", cores)
	if err := g.Wait(); err != nil {
		log.Fatal(err)
	}
	log.Printf("total=%d nsamples=%d", total, *nsamples)
	var (
		pi   = big.NewRat(int64(4*total), int64(*nsamples))
		prec = int(math.Log(float64(*nsamples)) / math.Log(10))
	)
	fmt.Printf("π = %s\n", pi.FloatString(prec))
}
```
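One subtlety in the division above: both nsamples/nmachine and numPerMachine/Maxprocs are integer divisions, so a handful of samples can be silently dropped. If exact totals matter, a helper that splits a total into near-equal shares summing exactly to the total could be used instead (illustrative only, not part of bigpi):

```go
package main

import "fmt"

// split divides total into n near-equal shares that sum exactly to
// total: the first total%n shares each get one extra sample.
func split(total, n uint64) []uint64 {
	shares := make([]uint64, n)
	base, rem := total/n, total%n
	for i := range shares {
		shares[i] = base
		if uint64(i) < rem {
			shares[i]++
		}
	}
	return shares
}

func main() {
	fmt.Println(split(10, 3)) // → [4 3 3]
}
```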


We can now build and run our binary like an ordinary Go binary.

```
$ go build
$ ./bigpi
2019/10/01 16:31:20 waiting for machines to come online
2019/10/01 16:31:24 machine https://localhost:42409/ RUNNING
2019/10/01 16:31:24 machine https://localhost:44187/ RUNNING
2019/10/01 16:31:24 machine https://localhost:41618/ RUNNING
2019/10/01 16:31:24 machine https://localhost:41134/ RUNNING
2019/10/01 16:31:24 machine https://localhost:34078/ RUNNING
2019/10/01 16:31:24 all machines are ready
2019/10/01 16:31:24 distributing work among 5 cores
2019/10/01 16:32:05 total=7853881995 nsamples=10000000000
π = 3.1415527980
```


Here, Bigmachine distributed computation across logical machines, each corresponding to a single core on the host system. Each machine ran in its own Unix process (with its own address space), and RPC happened through mutually authenticated HTTP/2 connections.

Package driver provides some convenient flags that help configure the Bigmachine runtime. Using these, we can configure Bigmachine to launch machines into EC2 instead:

```
$ ./bigpi -bigm.system=ec2
2019/10/01 16:38:10 waiting for machines to come online
2019/10/01 16:38:43 machine https://ec2-54-244-211-104.us-west-2.compute.amazonaws.com/ RUNNING
2019/10/01 16:38:43 machine https://ec2-54-189-82-173.us-west-2.compute.amazonaws.com/ RUNNING
2019/10/01 16:38:43 machine https://ec2-34-221-143-119.us-west-2.compute.amazonaws.com/ RUNNING
...
2019/10/01 16:38:43 all machines are ready
2019/10/01 16:38:43 distributing work among 5 cores
2019/10/01 16:40:19 total=7853881995 nsamples=10000000000
π = 3.1415527980
```

Once the program is running, we can use standard Go tooling to examine its behavior. For example, expvars are aggregated across all of the machines managed by Bigmachine, and the various profiles (CPU, memory, contention, etc.) are available as merged profiles through /debug/bigmachine/pprof. For example, in the first version of bigpi, the CPU profile highlighted a problem: we were using the global rand.Float64, which requires a lock; the resulting contention was easily identifiable through the CPU profile:

```
$ go tool pprof localhost:3333/debug/bigmachine/pprof/profile
Fetching profile over HTTP from http://localhost:3333/debug/bigmachine/pprof/profile
Saved profile in /Users/marius/pprof/pprof.045821636.samples.cpu.001.pb.gz
File: 045821636
Type: cpu
Time: Mar 16, 2018 at 3:17pm (PDT)
Duration: 2.51mins, Total samples = 16.80mins (669.32%)
Entering interactive mode (type "help" for commands, "o" for options)
(pprof) top
Showing nodes accounting for 779.47s, 77.31% of 1008.18s total
Dropped 51 nodes (cum <= 5.04s)
Showing top 10 nodes out of 58
      flat  flat%   sum%        cum   cum%
   333.11s 33.04% 33.04%    333.11s 33.04%  runtime.procyield
   116.71s 11.58% 44.62%    469.55s 46.57%  runtime.lock
    76.35s  7.57% 52.19%    347.21s 34.44%  sync.(*Mutex).Lock
    65.79s  6.53% 58.72%     65.79s  6.53%  runtime.futex
    41.48s  4.11% 62.83%    202.05s 20.04%  sync.(*Mutex).Unlock
    34.10s  3.38% 66.21%    364.36s 36.14%  runtime.findrunnable
       33s  3.27% 69.49%        33s  3.27%  runtime.cansemacquire
    32.72s  3.25% 72.73%     51.01s  5.06%  runtime.runqgrab
    24.88s  2.47% 75.20%     57.72s  5.73%  runtime.unlock
    21.33s  2.12% 77.31%     21.33s  2.12%  math/rand.(*rngSource).Uint64
```


And after the fix, it looks much healthier:

```
$ go tool pprof localhost:3333/debug/bigmachine/pprof/profile
...
      flat  flat%   sum%        cum   cum%
    29.09s 35.29% 35.29%     82.43s   100%  main.circlePI.Sample
    22.95s 27.84% 63.12%     52.16s 63.27%  math/rand.(*Rand).Float64
    16.09s 19.52% 82.64%     16.09s 19.52%  math/rand.(*rngSource).Uint64
     9.05s 10.98% 93.62%     25.14s 30.49%  math/rand.(*rngSource).Int63
     4.07s  4.94% 98.56%     29.21s 35.43%  math/rand.(*Rand).Int63
     1.17s  1.42%   100%      1.17s  1.42%  math/rand.New
         0     0%   100%     82.43s   100%  github.com/grailbio/bigmachine/rpc.(*Server).ServeHTTP
         0     0%   100%     82.43s   100%  github.com/grailbio/bigmachine/rpc.(*Server).ServeHTTP.func2
         0     0%   100%     82.43s   100%  golang.org/x/net/http2.(*serverConn).runHandler
         0     0%   100%     82.43s   100%  net/http.(*ServeMux).ServeHTTP
```

# GOOS, GOARCH, and Bigmachine

When using Bigmachine's EC2 machine implementation, the process is bootstrapped onto remote EC2 instances. Currently, the only supported GOOS/GOARCH combination for these is linux/amd64, so the driver program must also be linux/amd64. However, Bigmachine also understands the fatbin format, so users can compile fat binaries using the gofat tool. For example, the above can be run from a macOS driver if the binary is built using gofat instead of 'go':

```
macOS $ GO111MODULE=on go get github.com/grailbio/base/cmd/gofat
go: finding github.com/grailbio/base/cmd/gofat latest
go: finding github.com/grailbio/base/cmd latest
macOS $ gofat build
macOS $ ./bigpi -bigm.system=ec2
...
```

• #### ec2system: change handling of default ec2boot binaries

Instead of hardcoding ec2boot in the config (which materializes to, e.g., $HOME/.bigslice/config), we set the default to an empty string, and fill in the default binary associated with the current release.

We also rewrite binaries with the official prefix to match the current version.

opened by mariusae 6
• #### ec2system: changes for environments other than GRAIL

This PR extends the configurability of ec2 instances to allow for configurations other than GRAIL's.

opened by cosnicolaou 3
• #### bigmachine: use ec2 ssh keys

EC2 instances, in some environments, need to have an ssh key pair specified when they are created so that the key appears in the instance metadata. This PR scans the set of scavenged ssh keys (e.g., those in the ssh agent) and those available in EC2, finds the first overlap, and uses that key. A future PR will allow for setting this key explicitly, since this heuristic may fail in setups where the user has multiple overlapping keys.

opened by cosnicolaou 3
• #### Migrate from CoreOS to Flatcar

opened by jcharum 3
• #### ec2system: improve error reporting

This change ensures that task.Tail captures all of the journalctl output since the bigmachine worker started rather than just the last 10 lines.

opened by cosnicolaou 3
• #### Value of a state machine or flow diagram

Please only consider this if you've received interest from others (and they support this). I'm mostly undertaking this as a pet project and don't wish to unduly burden you for my sake.

I've been away for 10 days and, returning to bigmachine, I'm having to rely upon extensive debugging statements in an ongoing attempt to try to grok the underlying mechanism of the solution. I still only have an admittedly loose grasp of this and am fumbling through.

The GCE implementation is able to create remote (containerized) nodes and, I believe, basic (HTTP non-TLS) RPC is working. I'm challenged debugging (particularly go routines) and because I don't have a good overall perspective.

Are there state diagrams for bigmachine? I think there are 2-3 different diagrams that are of interest:

1. The internal state diagram of an individual machine
2. The networking state diagram for the process by which the local node drives the remotes
opened by DazWilkin 3
• #### Add (System).Serve to simplify serving

Add a (System).Serve interface method to allow serving on a given listener. This allows us to get rid of the getFreeTCPPort method and mirrors the http.Listen and http.ListenAndServe methods.

I did this while I was trying to fix a non-deterministically failing test, and I think it makes the code a bit easier to understand, as it eliminates the somewhat odd listen-then-close thing that getFreeTCPPort did.

opened by jcharum 2

• #### Monitor and log spot actions on workers

opened by jcharum 2
• #### Prevent reboot of EC2 instances

Prevent reboot of EC2 instances. There are a few scenarios in which the instance may try to reboot:

• EC2 rebooting when encountering underlying hardware issues.
• EC2 scheduled maintenance.
• Services running that cause reboot, e.g. locksmithd. (This one is avoidable by using an appropriate AMI).

We prevent rebooting because bringing the instance back to a working state would add a lot of complexity, as we would need to restore both internal state and the state of defined services.

opened by jcharum 2
• #### exec: add some form of log.Flush.Sync mechanism

opened by cosnicolaou 0
• #### (*System) Shutdown signature && bigmachine.Machine unique IDs

The signature of Start is:

```go
(*System) Start(ctx context.Context, count int) ([]*bigmachine.Machine, error)
```

Whereas (its converse) Shutdown is:

```go
(*System) Shutdown()
```

It feels as though it would be more consistent if Shutdown's signature also included a context.Context and []*bigmachine.Machine, and returned an error.

Even then, bigmachine.Machine does not include a unique ID for the machine (beyond an IP address, which is often not usable as a key); would it make sense to add one?

I'm not retaining the list of machines created by Start in the GCE implementation, so when asked to Shutdown I must first enumerate all the instances that (I think) have been created (I do this by tag; I could potentially use IP) and then delete them.

opened by DazWilkin 5
• #### circlePI example problems

I'm attempting to use bigmachine with your circlePI example, but:

```
$ go run main.go
2019/10/04 10:07:44 waiting for machines to come online
2019/10/04 10:07:44 resetting http client https://localhost:46237/ while calling to Supervisor.Ping: temporary network error
2019/10/04 10:07:45 https://localhost:46237/ Supervisor.Ping: succeeded after 1 retries
2019/10/04 10:07:45 https://localhost:46237/: zip: not a valid zip file
2019/10/04 10:07:45 machine https://localhost:46237/ STOPPED
2019/10/04 10:07:45 zip: not a valid zip file
```


Will try digging into this myself but it's discouraging :-)

I was unable to find this example published in the repo. It would be useful as I could then determine more quickly whether this is my error.

Perhaps "Getting Started..."?

opened by DazWilkin 6
• #### Kubernetes System

I think it would be interesting to have a Kubernetes backed implementation.

This would provide a more generic solution than per Cloud implementations and could facilitate cross-Cloud deployments too.

opened by DazWilkin 9
• #### Backend for Azure

opened by mariusae 1
• #### Backend for GCP

opened by mariusae 3
• #### v0.5.8 (Jul 15, 2020)

• make log tailing work for local system
• accept func() io.Reader for RPC argument to support more ergonomic call retrying
• use loopback address for local communication (avoids OS X firewall warnings)
• update go.{mod,sum} for grailbio/base upgrade to v0.0.9
• #### v0.5.7 (Jun 24, 2020)

• reduce memory load by using temporary files to buffer files collected from workers
• make default Eventer nil
• close all connections when resetting clients, eliminating a collection leak
• add context to spot instance request errors
• prevent reboot of EC2 instances
• upgrade to github.com/grailbio/base v0.0.9
• #### v0.5.6 (Apr 7, 2020)

• start pulling machine logs earlier to capture more boot logging
• wait for final machine logs when shutting down
• log spot actions on machines
• separate timeouts for binary upload from execution
• allow for default AWS region to be specified

• #### v0.5.0 (Oct 7, 2019)

###### GRAIL
Source code created or maintained by GRAIL, Inc.