A fast data generator that's multi-table aware and supports multi-row DML.

Overview

datagen logo

Build Status Go Report Card

If you need to generate a lot of random data for your database tables but don't want to spend hours configuring a custom tool for the job, then datagen could work for you.

datagen takes its instructions from a configuration file. These configuration files can execute any number of SQL queries, taking advantage of multi-row DML for fast inserts and Go's text/template language is used to acheive this.

This project is still very early days, so some of it is still first-pass. If you don't like something, PRs are very welcome!

"modelgen saves sooo much time. coupled with datagen it just gives you a crazy head start"

"right now datagen and modelgen are god sends to me"

Credit

title, namef, namel, name, email, phone, postcode, address, street, city, county, state, currency, locale, country, country2, country3, ip4, ip6, and user-agent.

Installation

To build from source, either clone the repo, or use go get as follows (datagen will automatically be built by Go):

$ go get -u github.com/codingconcepts/datagen

Building

To build a version of datagen for Linux, MacOS, and Windows, call the build Makefile target with a version argument as follows:

$ make build VERSION=1.0.0-beta

If you'd prefer to download a binary, the releases page has all built version and can be found here.

Usage

See the examples directory for a CockroachDB example that works using the make example command. When running the executable, use the following syntax:

datagen -script script.sql --driver postgres --conn postgres://[email protected]:26257/sandbox?sslmode=disable

datagen accepts the following arguments:

Flag Description
-conn The full database connection string (enclosed in quotes)
-driver The name of the database driver to use [postgres, mysql]
-script The full path to the script file to use (enclosed in quotes)
-datefmt (optional) time.Time format string that determines the format of all database and template dates. Defaults to "2006-01-02"
-debug (optional) If set, the SQL generated will be written to stout. Note that ref, row, and each won't work.

Concepts

Object Description
Block A block of text within a configuration file that performs a series of operations against a database.
Script A script is a text file that contains a number of blocks.

Comments

datagen uses Go's text/template engine where possible but where it's not possible to use that, it parses and makes use of comments. The following comments provide instructions to datagen during block parsing.

Comment Description
-- REPEAT N Repeat the block that directly follows the comment N times. If this comment isn't provided, a block will be executed once. Consider this when using the ntimes function to insert a large amount of data. For example -- REPEAT 100 when used in conjunction with ntimes 1000 will result in 100,0000 rows being inserted using multi-row DML syntax as per the examples.
-- NAME Assigns a given name to the block that directly follows the comment, allowing specific rows from blocks to be referenced and not muddled with others. If this comment isn't provided, no distinction will be made between same-name columns from different tables, so issues will likely arise (e.g. owner.id and pet.id in the examples). Only omit this for single-block configurations.
-- EOF Causing block parsing to stop, essentially simulating the natural end-of-file. If this comment isn't provided, the parse will parse all blocks in the script.

Helper functions

ntimes

Expresses the number of multi-row DML statements that will be generated:

{{range $i, $e := ntimes 1 10 }}
	{{if $i}},{{end}}
	(
		...something
	)
{{end}}

ntimes the name of the function.
1 the minimum value.
10 (optional) the maximum value. If omitted, the number will be exactly equal to the minimum value.

The following script generates 5 entries into the one table and between 5 and 10 entries into the two table as a result of the combination of the -- REPEAT and ntimes configured:

-- REPEAT 1
-- NAME one
insert into "one" (
    "id",
    "name") values
{{range $i, $e := ntimes 5 }}
	{{if $i}},{{end}}
	(
		{{int 1 10000}},
		'{{name}}'
	)
{{end}}
returning "id";

-- REPEAT 1
-- NAME two
insert into "two" (
	"one_id") values
{{range $i, $e := ntimes 5 10 }}
	{{if $i}},{{end}}
	(
		'{{each "one" "id" $i}}'
	)
{{end}};

The ntimes and REPEAT values for table one's insert totalled 5, so you'll see 5 rows in table one:

id
1977
2875
6518
6877
9425

The ntimes and REPEAT values for table two's insert totalled 7 (ntimes generated 7 and we REPEATE once):

one_id count
1977 2
2875 1
6518 2
6877 1
9425 1

By increasing the REPEAT value to 2, we'll generate a total of 14 (ntimes is 7 multiplied by two this time):

one_id count
1977 3
2875 2
6518 3
6877 3
9425 3
string

Generates a random string between a given minimum and maximum length:

'{{string 5 10 "abcABC"}}'

string the name of the function.
5 the minimum string length including any prefix.
10 the maximum string length including any prefix.
"abcABC" (optional) the set of characters to select from.

Note that the apostrophes will wrap the string, turning it into a database string.

stringf

Generates a formatted string using placeholder syntax:

'{{stringf "%s.%[email protected]" 5 10 "abc" 10000 20000}}'

stringf the name of the function.
"%s.%[email protected]" the format string.
5 the minimum string length for the first string placeholder.
10 the minimum string length for the first string placeholder.
"abc" the characters to use for the first string placeholder (leave blank to use defaults).
10000 the minimum value for the integer placeholder.
20000 the minimum value for the integer placeholder.

Note that at present only the following verbs are supported:

  • %s - a string
  • %d - an integer
int

Generates a random 64 bit integer between a minimum and maximum value.

{{int 5 10}}

int the name of the function.
5 the minimum number to generate.
10 the maximum number to generate.

date

Generates a random date between two dates.

'{{date "2018-01-02" "now" "" }}'

date the name of the function.
"2018-01-02" the minimum date to generate.
"2019-01-02" the maximum date to generate.
"" the format to use for input dates, left blank to use the value specified by the -datefmt flag date. If overridden, both the minimum and maximum date arguments should be in the overridden format.

Note that "now" can be passed to both the minimum and maximum dates if required.

'{{date "now" "now" "2006-01-02 15:04:05" }}'

"2006-01-02 15:04:05" the date format you which to be generated

float

Generates a random 64 bit float between a minimum and maximum value.

{{float 1.2345678901 2.3456789012}}

float the name of the function.
1.2345678901 the minimum number to generate.
2.3456789012 the maximum number to generate.

uuid

Generates a random V4 UUID using Google's uuid package.

{{uuid}}

uuid the name of the function.

set

Selects a random value from a set of possible values.

'{{set "alice" 1 2.3"}}'

set the name of the function.
"alice"|"bob" etc. the available options to generate from.

wset

Selects a random value from a set of possible values using weighting.

'{{wset "a" 60 "b" 30 "c" 10}}'

wset the name of the function.
"a" the first option.
60 a weight of 60 for the first option.
"b" the second option.
30 a weight of 30 for the second option.
"c" the third option.
10 a weight of 10 for the first option.

Weights can be any number.

fset

Selects a random value from a set of possible values contained within a file and caches the enumeration so it's not reloaded from disk on every invocation.

'{{fset "./examples/types.txt"}}'

fset the name of the function.
"./examples/types.txt" the path to the file containing the options.

ref

References a random value from a previous block's returned values (cached in memory). For example, if you have two blocks, one named "owner" and another named "pet" and you insert a number of owners into the database, returning their IDs, then wish to assign pets to them, you can use the following syntax (assuming you've provided the value "owner" for the first block's -- NAME comment):

'{{ref "owner" "id"}}',

ref the name of the function.

row

References a random row from a previous block's returned values and caches it so that values from the same row can be used for other column insert values. For example, if you have two blocks, one named "owner" and another named "pet" and you insert a number of owners into the database, returning their IDs and names, you can use the following syntax to get the ID and name of a random row (assuming you've provided the value "owner" for the first block's -- NAME comment):

'{{row "owner" "id" $i}}',
'{{row "owner" "name" $i}}'

row the name of the function.
owner the name of the block whose data we're referencing.
id the name of the owner column we'd like.
$i the group identifier for this insert statement (ensures columns get taken from the same row).

each

Works in a simliar way to row but references sequential rows from a previous block's returned values, allowing all of a previous block's rows to have associated rows in a related table, provided the product of --REPEAT and ntimes is the same as the previous block's.

'{{each "owner" "id" $i}}',
'{{each "owner" "name" $i}}',

each the name of the function.
owner the name of the block whose data we're referencing.
id the name of the owner column we'd like.
$i the group identifier for this insert statement (ensures columns get taken from the same row).

{{range $i, $e := ntimes 1}}
	...something
{{end}}
adj

Generates a random adjective.

{{adj}}
noun

Generates a random noun.

{{noun}}
title

Generates a random title for a random gender.

{{title}}
namef

Generates a random first name for a random gender.

{{namef}}
namel

Generates a random last name.

{{namel}}
name

Generates a random full name for a random gender.

{{name}}
email

Generates a random email address.

{{email}}
phone

Generates a random phone number in E164 format.

{{phone}}
postcode

Generates a random postcode, taking a 2-letter country code.

{{postcode "GB"}}
address

Generates a random American address. It's possible to create addresses for other countries using the various other functions available.

{{address}}
street

Generates a random street name, taking a 2-letter country code.

{{street "GB"}}
city

Generates a random American city name.

{{city}}
county

Generates a random county/state name, taking a 2-letter country code. To generate a random state name without having to pass "US" as an argument, use the state function.

{{county "GB"}}
state

Generates a random American state name.

{{state}}
state2

Generates a random American state name in its 2-letter format.

{{state2}}
currency

Generates a random currency in ISO 4217 format.

{{currency}}
locale

Generates a random locale in BCP 47 format.

{{locale}}
country

Generates a random country name.

{{country}}
country2

Generates a random country name in ISO 3166-1 alpha-2 format.

{{country2}}
country3

Generates a random country name in ISO 3166-1 alpha-3 format.

{{country3}}
ip4

Generates a random v4 IP address.

{{ip4}}
ip6

Generates a random v6 IP address.

{{ip5}}
user-agent

Generates a random user agent to simulate an API client.

{{agent}}

Other database types:

MySQL

datagen -script mysql.sql --driver mysql --conn [email protected]/sandbox

With MySQL's lack of a returning clause, we instead select a random record from the person table when inserting pet records, which is less efficient but provides a workaround.

-- REPEAT 10
-- NAME pet
insert into `pet` (`pid`, `name`) values
{{range $i, $e := ntimes 100 }}
	{{if $i}},{{end}}
	(
		(select `id` from `person` order by rand() limit 1),
		'{{name}}'
	)
{{end}};

Todos

  • Refactor:

    • runner.ResetEach and remove that state. Create block runner and keep state within there. (Will require a change to the store as well).
  • Allow generated columns to be referenced by rows in the same insert.

  • Consider using ($1, $2) with arguments

  • Better handling of connection issues during run.

  • Integration tests.

  • Migrate to travis-ci.com and add coveralls support back in.

Issues
  • add date format usage example

    add date format usage example

    Adding description on how to use date format in sql statement instead of cli flag.

    My IDE added some code markdown formatting though... hope you don't mind. happy to remake PR with fewer changes.

    opened by txbrown 0
Owner
Coding Concepts
Coding Concepts
Re-usable component for subscribing to row-level changes on a postgres database

pgreplicate Re-usable component for subscribing to row-level changes on a postgres database Development In order to run tests the postgres database ne

Crewlinker 0 Dec 21, 2021
Multitenancy in Postgres with Go using Row Level Security (RLS)

tenancy A Go library for multitenancy in Postgres using Row Level Security (RLS). Usage Tenancy as a connection pool. By default, tenancy.Open() begin

Common Fate 7 Jul 7, 2022
Simple SQL table fuzzing

SQLfuzz Load random data into SQL tables for testing purposes. The tool can get the layout of the SQL table and fill it up with random data. Installat

Ferenc Fabian 147 Aug 6, 2022
Typescript type declaration to PostgreSQL CREATE TABLE converter

ts2psql NOTE: This is WIP. Details in this readme are ideal state. Current usage: go build && ./ts2psql (or go build && ts2psql if on Windows OS). A s

null 1 Jan 13, 2022
A minimal, single-table No-SQL database.

SimpleDB SimpleDB is a very basic No-SQL database format for long-term data storage in Golang. It is WIP, has a LOT of drawbacks, and definitely is no

Konnor Klashinsky 0 Jan 16, 2022
A tool to run queries in defined frequency and expose the count as prometheus metrics. Supports MongoDB and SQL

query2metric A tool to run db queries in defined frequency and expose the count as prometheus metrics. Why ? Product metrics play an important role in

S Santhosh Nagaraj 19 Jul 1, 2022
🏋️ dbbench is a simple database benchmarking tool which supports several databases and own scripts

dbbench Table of Contents Description Example Installation Supported Databases Usage Custom Scripts Troubeshooting Development Acknowledgements Descri

Simon Jürgensmeyer 72 Jul 25, 2022
Go package for sharding databases ( Supports every ORM or raw SQL )

Octillery Octillery is a Go package for sharding databases. It can use with every OR Mapping library ( xorm , gorp , gorm , dbr ...) implementing data

BlasTrain Co., Ltd. 167 Jul 26, 2022
Dumpling is a fast, easy-to-use tool written by Go for dumping data from the database(MySQL, TiDB...) to local/cloud(S3, GCP...) in multifarious formats(SQL, CSV...).

?? Dumpling Dumpling is a tool and a Go library for creating SQL dump from a MySQL-compatible database. It is intended to replace mysqldump and mydump

PingCAP 261 Jul 14, 2022
Create key value sqlite3 database from tabular data, fast.

Turn tabular data into a lookup table using sqlite3. This is a working PROTOTYPE with limitations, e.g. no customizations, the table definition is fixed, etc.

Martin Czygan 5 Apr 2, 2022
Make a sqlite3 database from tabular data, fast.

MAKTA make a database from tabular data Turn tabular data into a lookup table using sqlite3. This is a working PROTOTYPE with limitations, e.g. no cus

Martin Czygan 5 Apr 2, 2022
Additions to Go's database/sql for super fast performance and convenience.

gocraft/dbr (database records) gocraft/dbr provides additions to Go's database/sql for super fast performance and convenience. $ go get -u github.com/

null 1.7k Jul 28, 2022
Vectorized SQL for JSON at scale: fast, simple, schemaless

Vectorized SQL for JSON at scale: fast, simple, schemaless Sneller is a high-performance vectorized SQL engine for JSON that runs directly on object s

null 632 Aug 2, 2022
OctoSQL is a query tool that allows you to join, analyse and transform data from multiple databases and file formats using SQL.

OctoSQL OctoSQL is a query tool that allows you to join, analyse and transform data from multiple databases, streaming sources and file formats using

Jacob Martin 3.7k Jul 30, 2022
An observability database aims to ingest, analyze and store Metrics, Tracing and Logging data.

BanyanDB BanyanDB, as an observability database, aims to ingest, analyze and store Metrics, Tracing and Logging data. It's designed to handle observab

The Apache Software Foundation 111 Jul 30, 2022
RecordLite: a library (and executable) that declaratively maintains SQLite tables and views of semi-structured data

RecordLite RecordLite is a library (and executable) that declaratively maintains

François Saint-Jacques 22 May 29, 2022
Library for scanning data from a database into Go structs and more

scany Overview Go favors simplicity, and it's pretty common to work with a database via driver directly without any ORM. It provides great control and

Georgy Savva 621 Aug 1, 2022
Workshop: Building data-intensive apps with SingleStore, Redpanda, and Golang

This repo provides a starting point for building applications using SingleStore, Redpanda (by Vectorized), and the Go language. SingleStore is a scale-out relational database built for data-intensive workloads. Redpanda is a Kafka API compatible streaming platform for mission-critical workloads created by the team at Vectorized.

SingleStore Labs 13 Jun 22, 2022
Use SQL to instantly query resources, data sources and more from Terraform code. Open source CLI. No DB required.

Terraform Plugin for Steampipe Use SQL to query data from Terraform configuration files. Get started → Documentation: Table definitions & examples Com

Turbot 11 Jul 8, 2022