Compute cluster (HPC) job submission library for Go (#golang) based on the open DRMAA standard.

Overview

go-drmaa

GoDoc Apache V2 License Go Report Card

This is a job submission library for Go (#golang) which is compatible to the DRMAA standard. The Go library is a wrapper around the DRMAA C library implementation provided by many distributed resource managers (cluster schedulers).

The library was developed using Univa Grid Engine's libdrmaa.so. It was tested with Grid Engine, Torque, and SLURM, but it should work also other resource managers / cluster schedulers which provide libdrmaa.so.

The "gestatus" subpackage only works with Grid Engine (some values are only available on Univa Grid Engine).

The DRMAA (Distributed Resource Management Application API) standard is meanwhile available in version 2. DRMAA2 provides more functionality around cluster monitoring and job session management. DRMAA and DRMAA2 are not compatible hence it is expected that both libraries are co-existing for a while. The Go DRMAA2 can be found here.

Note: Univa Grid Engine 8.3.0 and later added new functions that allows you to submit a job on behalf of another user. This helps creating a DRMAA service (like a web portal) that submits jobs. This functionality is available in the UGE83_sudo branch: https://github.com/dgruber/drmaa/tree/UGE83_sudo The functions are: RunJobsAs(), RunBulkJobsAs(), and ControlAs()

Compilation

First download the package:

   export GOPATH=${GOPATH:-~/src/go}
   mkdir -p $GOPATH
   go get -d github.com/dgruber/drmaa
   cd $GOPATH/github.com/dgruber/drmaa

Next, we need to compile the code.

For Univa Grid Engine and original SGE:

   source /path/to/grid/engine/installation/default/settings.sh
   ./build.sh
   cd examples/simplesubmit
   go build
   export LD_LIBRARY_PATH=$SGE_ROOT/lib/lx-amd64
   ./simplesubmit

For Son of Grid Engine ("loveshack"):

   source /path/to/grid/engine/installation/default/settings.sh
   ./build.sh --sog
   cd examples/simplesubmit
   go build
   export LD_LIBRARY_PATH=$SGE_ROOT/lib/lx-amd64
   ./simplesubmit

For Torque:

If your Torque drmaa.h header file is not located under /usr/include/torque, you will have to modify the build.sh script before running it.

   ./build.sh --torque
   cd examples/simplesubmit
   go build
   ./simplesubmit

For SLURM and the updated SLURM C drmaa binding

   ./build.sh --slurm /usr/local

The example program submits a sleep job into the system and prints out detailed job information as soon as the job is started.

Short Introduction in Go DRMAA

Go DRMAA applications need to open a DRMAA session before the DRMAA calls can be executed. Opening a DRMAA session usually establishes a connection to the cluster scheduler (distributed resource manager). Hence if no more DRMAA calls are made the Exit() method of the session must be executed. This tears down the connection. When an application does not call the Exit() method this can leave a communication handle open on the cluster scheduler side (which can take a while to be removed automatically). It should be always avoided not to call Exit(). In Go the defer statement can be used but remember that the function is not executed when an os.Exit() call is made.

Creating a DRMAA session:

s, err := drmaa.MakeSession()

Usually jobs and job workflows are submitted within DRMAA applications. In order to submit a job first a job template needs to be allocated:

jt, errJT := s.AllocateJobTemplate()
if errJT != nil {
   fmt.Printf("Error during allocating a new job template: %s\n", errJT)
   return
}

Underneath a C job template is allocated which is out-of-scope of the Go system. Hence it must be ensured that the job template is deleted when it is not used anymore. Also here the Go defer statement is useful.

// prevent memory leaks by freeing the allocated C job template at the end
defer s.DeleteJobTemplate(&jt)

The job template contains the specification of the job, like the command to be executed and its parameters. Those can be set by the setter methods of the job.

// set the application to submit
jt.SetRemoteCommand("sleep")
// set the parameter (use SetArgs() when having more parameters)
jt.SetArg("1")

A job can be executed with the session RunJob() method. If the same command should be executed many times, running it as a job array would make sense. In Grid Engine each instance gets a task ID assigned which the job can see in the SGE_TASK_ID environment variable (which is set to unknown for normal jobs). This task ID can be used for finding the right data set the job (array job task) needs to process. Submitting an array job is done with the RunBulkJobs() method.

jobID, errSubmit := s.RunJob(&jt)

// submitting 1000 instances of the same job
jobIDs, errBulkSubmit := s.RunBulkJobs(&jt, 1, 1000, 1)

A job state can also be changed (suspended / resumed / put in hold / deleted):

errTerm := s.TerminateJob(jobID)

The JobInfo data structure contains the runtime information of the job, like exit status or the amount of used resources (memory / IO / etc.). The JobInfo data structure can be get with the Wait() method.

jinfo, errWait := s.Wait(jobID, drmaa.TimeoutWaitForever)

For more details please consult the documentation and the DRMAA standard specifications.

More examples can be found on my blog at http://www.gridengine.eu.

Issues
  • Request idiomatic use of error interface

    Request idiomatic use of error interface

    The choice to return _drmaa.Error instead of error is non-standard. It results in the following strange behaviour when a (_drmaa.Error)(nil) is assigned to an error interface.

    http://play.golang.org/p/hdLuS-gra8

    opened by jvlmdr 6
  • imports should use fully qualified repo

    imports should use fully qualified repo

    We've done this in our fork (can't really do a pull request for this kind of change). Suggest you replace all the import "drmaa" in your subdirectories with the fully qualified repo URL, e.g., import "github.com/dgruber/drmaa".

    opened by DocSavage 4
  • Support for Slurm

    Support for Slurm

    I did the following modification in order to support SLURM. It is based on the last stable release of PSNC DRMAA.

    Apparently the TORQUE binding have similar issues drmaa_get_num_attr_values(drmaa_attr_values_t* values, int *size) VS drmaa_get_num_attr_values(drmaa_attr_values_t* values, size_t *size). I have the same issue with drmaa_get_num_attr_names and applied the same treatment, but I do not have torque available.

    opened by aminiussi 1
  • Updates to support building on darwin

    Updates to support building on darwin

    This commit adds support for using the arch script from UGE to determine library paths and adds unistd.h on MacOS builds in order to provide the gid_t definition.

    opened by cameronbrunner 1
  • Wait() does not retrieve exit status

    Wait() does not retrieve exit status

    I found that Wait() was always returning a JobInfo with ExitStatus() giving 1, even when the job had exited with status 0. This seems to be fixed by changing line 651 in drmaa.go from drmaa_wifexited to drmaa_wexitstatus. The function drmaa_wifexited is already called on line 641.

    opened by jvlmdr 1
  • Support alternate DRMAA location

    Support alternate DRMAA location

    There was no way to specify an alternate drmaa location independently from the drmaa implementation selection.

    Note that there is room for improvement. For one, this option could be provided in addition of the implementation selection.

    As in:

    ./build.sh --drmaa <drmaa dir> --sge
    

    also, if short options are ok, bash builtin getopts could be used.

    opened by aminiussi 0
Releases(v1.0.0)
Owner
Daniel Gruber
Daniel Gruber
A standard library for microservices.

Go kit Go kit is a programming toolkit for building microservices (or elegant monoliths) in Go. We solve common problems in distributed systems and ap

Go kit 23.2k Jun 24, 2022
Raft library Raft is a protocol with which a cluster of nodes can maintain a replicated state machine.

Raft library Raft is a protocol with which a cluster of nodes can maintain a replicated state machine. The state machine is kept in sync through the u

Kalyan Akella 0 Oct 15, 2021
gathering distributed key-value datastores to become a cluster

go-ds-cluster gathering distributed key-value datastores to become a cluster About The Project This project is going to implement go-datastore in a fo

FileDrive Team 15 May 31, 2022
Go Open Source, Distributed, Simple and efficient Search Engine

Go Open Source, Distributed, Simple and efficient full text search engine.

ego 6.1k Jun 25, 2022
💡 A Distributed and High-Performance Monitoring System. The next generation of Open-Falcon

夜莺简介 夜莺是一套分布式高可用的运维监控系统,最大的特点是混合云支持,既可以支持传统物理机虚拟机的场景,也可以支持K8S容器的场景。同时,夜莺也不只是监控,还有一部分CMDB的能力、自动化运维的能力,很多公司都基于夜莺开发自己公司的运维平台。开源的这部分功能模块也是商业版本的一部分,所以可靠性有保

DiDi 4.8k Jun 30, 2022
CockroachDB - the open source, cloud-native distributed SQL database.

CockroachDB is a cloud-native distributed SQL database designed to build, scale, and manage modern, data-intensive applications. What is CockroachDB?

CockroachDB 25k Jun 26, 2022
Parallel Digital Universe - A decentralized identity-based social network

Parallel Digital Universe Golang implementation of PDU. What is PDU? Usage Development Contributing PDU PDU is a decentralized identity-based social n

PDU.PUB 39 Jun 16, 2022
The Go language implementation of gRPC. HTTP/2 based RPC

gRPC-Go The Go implementation of gRPC: A high performance, open source, general RPC framework that puts mobile and HTTP/2 first. For more information

grpc 16.3k Jun 24, 2022
Cross-platform grid-based user interface framework.

Gruid The gruid module provides packages for easily building grid-based applications in Go. The library abstracts rendering and input for different pl

null 64 Jun 22, 2022
A distributed system for embedding-based retrieval

Overview Vearch is a scalable distributed system for efficient similarity search of deep learning vectors. Architecture Data Model space, documents, v

vector search infrastructure for AI applications 1.4k Jun 30, 2022
An implementation of a distributed access-control server that is based on Google Zanzibar

An implementation of a distributed access-control server that is based on Google Zanzibar - "Google's Consistent, Global Authorization System".

authorizer.tech 61 May 16, 2022
Golimit is Uber ringpop based distributed and decentralized rate limiter

Golimit A Distributed Rate limiter Golimit is Uber ringpop based distributed and decentralized rate limiter. It is horizontally scalable and is based

Myntra 601 Jul 3, 2022
Distributed disk storage database based on Raft and Redis protocol.

IceFireDB Distributed disk storage system based on Raft and RESP protocol. High performance Distributed consistency Reliable LSM disk storage Cold and

IceFireDB 872 Jun 27, 2022
Golang client library for adding support for interacting and monitoring Celery workers, tasks and events.

Celeriac Golang client library for adding support for interacting and monitoring Celery workers and tasks. It provides functionality to place tasks on

Stefan von Cavallar 72 Jun 24, 2022
AppsFlyer 488 Jun 24, 2022
Simple, fast and scalable golang rpc library for high load

gorpc Simple, fast and scalable golang RPC library for high load and microservices. Gorpc provides the following features useful for highly loaded pro

Aliaksandr Valialkin 651 Jun 20, 2022
dht is used by anacrolix/torrent, and is intended for use as a library in other projects both torrent related and otherwise

dht Installation Install the library package with go get github.com/anacrolix/dht, or the provided cmds with go get github.com/anacrolix/dht/cmd/....

Matt Joiner 238 Jun 20, 2022
A feature complete and high performance multi-group Raft library in Go.

Dragonboat - A Multi-Group Raft library in Go / 中文版 News 2021-01-20 Dragonboat v3.3 has been released, please check CHANGELOG for all changes. 2020-03

lni 4.3k Jun 23, 2022