ESM: An Elasticsearch Migration Tool

Overview

ESM is a tool for cross-version Elasticsearch data migration.

Dec 3rd, 2020: [EN] Cross version Elasticsearch data migration with ESM

Features:

  • Cross-version migration
  • Override the index name
  • Copy index settings and mappings
  • HTTP basic auth
  • Dump an index to a local file
  • Load an index from a local file
  • HTTP proxy support
  • Sliced scroll (Elasticsearch 5.0+)
  • Run in the background
  • Generate test data by randomizing the source document IDs
  • Rename field names
  • Unify document type names
  • Specify which _source fields to return from the source
  • Filter the source with a query string query
  • Rename source fields during bulk indexing
  • Load generation with --repeat_times

ESM is fast!

A 3-node cluster (3 × c5d.4xlarge: 16 cores, 32 GB RAM, 10 Gbps network):

./esm -s https://localhost:8000 -d https://localhost:8000 -x logs1kw -y logs122 -m elastic:medcl123 -n elastic:medcl123 -w 40 --sliced_scroll_size=60 -b 5 --buffer_count=2000000  --regenerate_id
[12-19 06:31:20] [INF] [main.go:506,main] start data migration..
Scroll 10064570 / 10064570 [=================================================] 100.00% 55s
Bulk 10062602 / 10064570 [==================================================]  99.98% 55s
[12-19 06:32:15] [INF] [main.go:537,main] data migration finished.

It migrated 10,000,000 documents within a minute; the documents were Nginx logs generated from kibana_sample_data_logs.

Examples:

copy index index_name from 192.168.1.x to 192.168.1.y:9200

./bin/esm  -s http://192.168.1.x:9200   -d http://192.168.1.y:9200 -x index_name  -w=5 -b=10 -c 10000
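
After the copy finishes, a quick way to compare document counts on both sides is Elasticsearch's standard _cat/count API (plain curl, independent of esm; the hosts and index name mirror the example above):

curl -s http://192.168.1.x:9200/_cat/count/index_name
curl -s http://192.168.1.y:9200/_cat/count/index_name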

copy index src_index from 192.168.1.x to 192.168.1.y:9200 and save with dest_index

./bin/esm -s http://localhost:9200 -d http://localhost:9200 -x src_index -y dest_index -w=5 -b=100

copy with HTTP basic auth on the destination

./bin/esm -s http://localhost:9200 -x "src_index" -y "dest_index"  -d http://localhost:9201 -n admin:111111

copy settings and override shard size

./bin/esm -s http://localhost:9200 -x "src_index" -y "dest_index"  -d http://localhost:9201 -m admin:111111 -c 10000 --shards=50  --copy_settings

copy settings and mapping, recreate target index, add query to source fetch, refresh after migration

./bin/esm -s http://localhost:9200 -x "src_index" -q=query:phone -y "dest_index"  -d http://localhost:9201  -c 10000 --shards=5  --copy_settings --copy_mappings --force  --refresh

dump Elasticsearch documents into a local file

./bin/esm -s http://localhost:9200 -x "src_index"  -m admin:111111 -c 5000 -q=query:mixer  --refresh -o=dump.bin 

load data from a dump file and bulk-insert it into another Elasticsearch instance

./bin/esm -d http://localhost:9200 -y "dest_index"   -n admin:111111 -c 5000 -b 5 --refresh -i=dump.bin
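
The --input_file_type option (documented below) also accepts json_line, json_array, and log_line input. A hypothetical sketch loading a newline-delimited JSON file; docs.ndjson is a placeholder file name, the other flags are as documented:

./bin/esm -d http://localhost:9200 -y "dest_index" -c 5000 -b 5 --refresh --input_file_type=json_line -i docs.ndjson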

use an HTTP proxy for the destination connection

 ./bin/esm -d http://123345.ap-northeast-1.aws.found.io:9200 -y "dest_index"   -n admin:111111  -c 5000 -b 1 --refresh  -i dump.bin  --dest_proxy=http://127.0.0.1:9743

use sliced scroll (available in Elasticsearch 5.0+) to speed up scrolling, and update the shard number

 ./bin/esm -s=http://192.168.3.206:9200 -d=http://localhost:9200 -n=elastic:changeme -f --copy_settings --copy_mappings -x=bestbuykaggle  --sliced_scroll_size=5 --shards=50 --refresh

migrate 5.x to 6.x and unify all the types to doc

./esm -s http://source_es:9200 -x "source_index*"  -u "doc" -w 10 -b 10 -t "10m" -d https://target_es:9200 -m elastic:passwd -n elastic:passwd -c 5000

to migrate to version 7.x, you may need to rename _type to _doc

./esm -s http://localhost:9201 -x "source" -y "target"  -d https://localhost:9200 --rename="_type:type,age:myage"  -u "_doc"

filter the migration with a range query

./esm -s https://192.168.3.98:9200 -m elastic:password -o json.out -x kibana_sample_data_ecommerce -q "order_date:[2020-02-01T21:59:02+00:00 TO 2020-03-01T21:59:02+00:00]"

range query on a keyword field, with escaping

./esm -s https://192.168.3.98:9200 -m test:123 -o 1.txt -x test1  -q "@timestamp.keyword:[\"2021-01-17 03:41:20\" TO \"2021-03-17 03:41:20\"]"

generate test data: if input.json contains 10 documents, the following command ingests 100 documents, which is handy for testing

./bin/esm -i input.json -d  http://localhost:9201 -y target-index1  --regenerate_id  --repeat_times=10 

select source fields

 ./bin/esm -s http://localhost:9201 -x my_index -o dump.json --fields=author,title

rename fields during bulk indexing

./bin/esm -i dump.json -d  http://localhost:9201 -y target-index41  --rename=title:newtitle

use buffer_count to control the memory used by ESM, and the compress flag to control gzip compression of network traffic

./esm -s https://localhost:8000 -d https://localhost:8000 -x logs1kw -y logs122 -m elastic:medcl123 -n elastic:medcl123 --regenerate_id -w 20 --sliced_scroll_size=60 -b 5 --buffer_count=1000000 --compress false 

Download

https://github.com/medcl/esm/releases

Compile:

If the downloaded release does not fit your environment, you can compile it yourself. Go is required.

make build

  • go version >= 1.7
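
A minimal build sketch, assuming a local Go toolchain and the repository behind the releases link above; the binary path follows the examples in this README:

git clone https://github.com/medcl/esm.git
cd esm
make build
./bin/esm --help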

Options

Usage:
  esm [OPTIONS]

Application Options:
  -s, --source=                    source elasticsearch instance, ie: http://localhost:9200
  -q, --query=                     query against source elasticsearch instance, filter data before migrate, ie: name:medcl
  -d, --dest=                      destination elasticsearch instance, ie: http://localhost:9201
  -m, --source_auth=               basic auth of source elasticsearch instance, ie: user:pass
  -n, --dest_auth=                 basic auth of target elasticsearch instance, ie: user:pass
  -c, --count=                     number of documents at a time: ie "size" in the scroll request (10000)
      --buffer_count=              number of buffered documents in memory (100000)
  -w, --workers=                   concurrency number for bulk workers (1)
  -b, --bulk_size=                 bulk size in MB (5)
  -t, --time=                      scroll time (1m)
      --sliced_scroll_size=        size of sliced scroll, to make it work, the size should be > 1 (1)
  -f, --force                      delete destination index before copying
  -a, --all                        copy indexes starting with . and _
      --copy_settings              copy index settings from source
      --copy_mappings              copy index mappings from source
      --shards=                    set a number of shards on newly created indexes
  -x, --src_indexes=               index names to copy, supports regex and a comma-separated list (_all)
  -y, --dest_index=                index name to save to; only one index name is allowed; the original index name is used if not specified
  -u, --type_override=             override type name
      --green                      wait for both hosts cluster status to be green before dump. otherwise yellow is okay
  -v, --log=                       set the log level, options: trace, debug, info, warn, error (INFO)
  -o, --output_file=               output documents of source index into local file
  -i, --input_file=                indexing from local dump file
      --input_file_type=           the data type of input file, options: dump, json_line, json_array, log_line (dump)
      --source_proxy=              set proxy to source http connections, ie: http://127.0.0.1:8080
      --dest_proxy=                set proxy to target http connections, ie: http://127.0.0.1:8080
      --refresh                    refresh after migration finished
      --fields=                    filter source fields, comma separated, ie: col1,col2,col3,...
      --rename=                    rename source fields, comma separated, ie: _type:type, name:myname
  -l, --logstash_endpoint=         target logstash tcp endpoint, ie: 127.0.0.1:5055
      --secured_logstash_endpoint  target logstash tcp endpoint was secured by TLS
      --repeat_times=              repeat the source data N times in the destination output; use together with regenerate_id to amplify the data size
  -r, --regenerate_id              regenerate document IDs; this overrides the existing document IDs from the data source
      --compress                   use gzip to compress traffic
  -p, --sleep=                     sleep N seconds after finished a bulk request (-1)

Help Options:
  -h, --help                       Show this help message
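
A few of the options above have no example in this README; a hypothetical invocation combining some of them, with flags exactly as listed above and placeholder hosts and index names:

./bin/esm -s http://localhost:9200 -d http://localhost:9201 -x src_index -y dest_index --green --log=debug -p 1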


FAQ

  • Scroll ID too long: update elasticsearch.yml on the source cluster.

    http.max_header_size: 16k
    http.max_initial_line_length: 8k

Versions

From To
1.x 1.x
1.x 2.x
1.x 5.x
1.x 6.x
1.x 7.x
2.x 1.x
2.x 2.x
2.x 5.x
2.x 6.x
2.x 7.x
5.x 1.x
5.x 2.x
5.x 5.x
5.x 6.x
5.x 7.x
6.x 1.x
6.x 2.x
6.x 5.x
6.x 6.x
6.x 7.x
7.x 1.x
7.x 2.x
7.x 5.x
7.x 6.x
7.x 7.x
Issues
  • The -q query has no effect.

    Querying the source directly with GET source-index/_search?q=@timestamp:[2018-03-08T00:00:00 TO 2018-03-08T02:00:00] returns: { "took": 1, "timed_out": false, "_shards": { "total": 5, "successful": 5, "failed": 0 }, "hits": { "total": 54653, "max_score": 1, "hits": .....

    ./esm -s http://192.168.0.21:9900 -x "source-index" -q=@timestamp:[2018-03-08T00:00:00 TO 2018-03-08T02:00:00] -y "dest-index" -d http://192.168.0.185:19400 --sliced_scroll_size=5 resulted in: Scroll 334089 / 334089 [=========================================] 100.00% 1m17s Bulk 334062 / 334089 [===========================================] 99.99% 1m53s [03-26 11:08:04] [INF] [main.go:410,main] data migration finished.

    It still performed a full copy.

    opened by chenyg0911 11
  • ESM migration loses precision for long-type fields longer than 17 digits.

    When migrating data with ESM, long-type fields with more than 17 digits lose precision. It looks like a bug in the code where the fetched value is converted from string to long.

    opened by auzn1025 10
  • Why can't this tool be invoked in the background?

    Scheduling it as a crontab job, or calling it from a bash or Python script, fails to start the tool in the background; it only runs when the command is entered manually in the foreground.

    opened by godjob2014 9
  • -q query can't find documents from source

    My field updated_at is mapped as:

    "updated_at": { "type": "date", "format": "yyyy-MM-dd HH:mm:ss" }

    When I use -q updated_at:"[2020-06-01T00:00:00 TO 2020-06-29T23:59:59]" it prints the error: [07-01 16:54:57] [ERR] [main.go:161,main] can't find documents from source.

    But matching documents do exist:

    "_id": "35227656", "_score": 0, "_source": { "updated_at": "2020-06-29 14:58:38" }

    When I use -q updated_at:"[2020-06-01 00:00:00 TO 2020-06-29 23:59:59]" it prints the error: "reason":"Failed to parse query [updated_at:[2020-06-01 00:00:00 TO 2020-06-29 23:59:59]]"

    Why?

    opened by yangtingtools 6
  • [10-21 16:51:45] [ERR] [v0.go:66,Bulk] data is empty, skip

    Running esm -s http://xxxx:9200 -x "base" -d http://XXXX:8200 -c 10000 -w=5 -b=50 --shards=5 --copy_settings --copy_mappings --force reports the error [10-21 16:51:45] [ERR] [v0.go:66,Bulk] data is empty, skip. Is this normal?

    opened by Try-001 5
  • Cannot run in the background; it only works in a terminal.

    Each run only works normally inside a terminal; when run in the background it panics:

    panic: Can't get terminal settings: inappropriate ioctl for device

    goroutine 1 [running]: main.main() /Users/medcl/BTSync/github/elasticsearch-migration/main.go:174 +0x4b7b

    How to reproduce, using this shell script:

    for i in $datelist; do
      for x in $(curl -s ${es1}/_cat/indices | awk '{print $3}' | grep ${i}); do
        /opt/bin/linux64/esm -s ${es1} -d ${es2} -w 5 -b 40 -c 10000 -x $x -f --copy_mappings >/dev/null
        if [ "$?" -eq 0 ]; then
          es1_doc_num=$(curl -s ${es1}/_cat/indices/$x | awk '{print $(NF-3)}')
          es2_doc_num=$(curl -s ${es2}/_cat/indices/$x | awk '{print $(NF-3)}')
          doc_diff=$(expr ${es1_doc_num} - ${es2_doc_num})
          echo "$x migration is done, the ${x}_doc_diff is $doc_diff"
        else
          echo "$x migration error"
        fi
      done
    done

    Running this script with & in the background triggers the panic above, and running it from crontab fails the same way. Thanks.

    opened by lkad 5
  • --copy_mapping is incomplete?

    Source version: 5.6.16, destination version: 5.6.16

    ./bin/linux64/esm -s http://127.0.0.1:9202 -d http://127.0.0.2:9200 -x _all --copy_settings --copy_mappings --shards=4 --refresh

    Source settings: "number_of_replicas": "1"; destination settings: "number_of_replicas": "0"

    I found the problem with this setting in my case; I don't know whether other settings are affected.

    opened by queenns 3
  • An HTTP line is larger than 4096 bytes: is it because the index content is long?

    Tested versions: ES 6.5.4 -> ES 7.2.0. Commands tested:

    ./esm -s http://10.27.69.118:9200 -d http://10.81.176.31:9200 -a -w=5 -b=10 -c=10000
    ./esm -s http://10.27.69.118:9200 -d http://10.81.176.31:9200 -a -w=5 -b=10 -c=1000
    ./esm -s http://10.27.69.118:9200 -d http://10.81.176.31:9200 -a -w=5 -b=5 -c=1000
    ./esm -s http://10.27.69.118:9200 -d http://10.81.176.31:9200 -a -w=5 -b=5 -c=100

    [08-21 09:08:31] [ERR] [scroll.go:49,Next] {"error":{"root_cause":[{"type":"too_long_frame_exception","reason":"An HTTP line is larger than 4096 bytes."}],"type":"too_long_frame_exception","reason":"An HTTP line is larger than 4096 bytes."},"status":400}

    What does this error mean, and how can it be fixed? Should --bulk_size be adjusted?

    opened by wajika 3
  • Dumping to a local file does not work.

    This command does not work for me:

    sudo ./esm -s http://x.x.x.x:9200 -x "xxxx" -b 5 -c 5000 -w 5 --refresh -o x.x.x.x:9200.json

    What could be the reason?

    opened by ryrnnnlaleaaa 3
  • With 420 million documents in total, it exited automatically after syncing only 10 million.

    $ ./esm -s http://localhost:9200 -d http://localhost:9204 -x my-data-2021-39 -w=10 -c 8888 -m name:pw -n name:pw my-data-2021-39
    [09-26 18:27:01] [INF] [main.go:474,main] start data migration..
    Scroll 10016776 / 423557337 [===>----------------------------------------] 2.36% 1h11m3s
    Bulk 10008837 / 423557337 [===>-----------------------------------------] 2.36% 1h11m3s
    [09-26 19:38:04] [INF] [main.go:505,main] data migration finished.

    opened by kukakia 0
  • Shortly after starting, it reports "data is empty, skip".

    When reindexing a large amount of data across clusters, it reports "data is empty, skip" shortly after starting. Is this caused by a timeout?

    Source ES version: 6.5; target ES version: 7.10.1; ESM version: v0.5.0

    opened by xiaoshi2013 0
  • Some packages in the latest source code can no longer be imported.

    The packages infini.sh/framework/lib/fasthttp and infini.sh/framework/core/util can no longer be imported.

    opened by ismlsmile 1
  • After dumping an index and importing it again, the settings and mappings are lost.

    After dumping an index and importing it again, the settings and mappings are lost. How can the original index's settings and mappings be preserved when dumping to a local file or importing? If they can be preserved, this would be useful for routine backups.

    opened by gitjacky 0
Releases: v0.5.0