A modern text indexing library for go

Overview

bleve

modern text indexing in go - blevesearch.com

Features

  • Index any go data structure (including JSON)
  • Intelligent defaults backed up by powerful configuration
  • Supported field types:
    • Text, Numeric, Date
  • Supported query types:
    • Term, Phrase, Match, Match Phrase, Prefix
    • Conjunction, Disjunction, Boolean
    • Numeric Range, Date Range
    • Simple query syntax for human entry
  • tf-idf Scoring
  • Search result match highlighting
  • Supports Aggregating Facets:
    • Terms Facet
    • Numeric Range Facet
    • Date Range Facet
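
The tf-idf scoring listed above weights a term by how often it occurs in a document and by how rare it is across the collection. The following is a minimal sketch in plain Go, assuming the textbook formula tf × ln(N/df); bleve's own scorer may differ in detail, and the names `tfidf` and `docs` are hypothetical, chosen only for illustration.

```go
package main

import (
	"fmt"
	"math"
	"strings"
)

// tfidf returns the textbook tf-idf weight of term in docs[i]:
// (occurrences of term in that document) * ln(N / documents containing term).
// It returns 0 when the term appears in no document.
// This is an illustrative assumption, not bleve's scorer.
func tfidf(term string, i int, docs []string) float64 {
	tf, df := 0, 0
	for j, d := range docs {
		n := 0
		for _, w := range strings.Fields(strings.ToLower(d)) {
			if w == term {
				n++
			}
		}
		if n > 0 {
			df++
		}
		if j == i {
			tf = n
		}
	}
	if df == 0 {
		return 0
	}
	return float64(tf) * math.Log(float64(len(docs))/float64(df))
}

func main() {
	docs := []string{
		"bleve indexing is easy",
		"indexing text in go",
		"go is fun",
	}
	// "bleve" occurs in one of three documents and "indexing" in two,
	// so "bleve" receives the higher weight in the first document.
	fmt.Printf("bleve:    %.3f\n", tfidf("bleve", 0, docs))
	fmt.Printf("indexing: %.3f\n", tfidf("indexing", 0, docs))
}
```

Rare terms dominate the score, which is why a query for a distinctive word ranks its few matching documents above documents that only share common words.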

Discussion

Discuss usage and development of bleve in the Google group.

Indexing

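// Assumes: import "github.com/blevesearch/bleve/v2"

// A document can be any Go value; here, an anonymous struct.
message := struct {
	Id   string
	From string
	Body string
}{
	Id:   "example",
	From: "[email protected]",
	Body: "bleve indexing is easy",
}

// Create a new on-disk index using the default mapping.
mapping := bleve.NewIndexMapping()
index, err := bleve.New("example.bleve", mapping)
if err != nil {
	panic(err)
}
// Index the document under its ID and check the returned error.
if err := index.Index(message.Id, message); err != nil {
	panic(err)
}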

Querying

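index, err := bleve.Open("example.bleve")
if err != nil {
	panic(err)
}
query := bleve.NewQueryStringQuery("bleve")
searchRequest := bleve.NewSearchRequest(query)
searchResult, err := index.Search(searchRequest)
fmt.Println(searchResult, err) // print the hits, or the error if the search failed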

License

Apache License Version 2.0

Issues
  • The return type of Document(id string) in the Index interface has changed

    The return value type of the Document(id string) method in the Index interface of v1.0.10 is *document.Document. The return value type of the Document(id string) method in the Index interface of v2.3.1 is index.Document. I would like to ask how to obtain *document.Document information in v2.3.1.

    opened by lh15200218 1
  • Unicode segmenter perf experiment

    In this PR, as an experiment I’ve swapped out the Unicode segmenter in Bleve with this one. I am achieving around a 2x improvement in throughput on the Unicode tokenizer, from ~23 MB/s → ~46 MB/s on my MacBook Pro (Intel).

    Current
    BenchmarkTokenizeEnglishText-4   	    8455	    120171 ns/op	  22.55 MB/s	   49144 B/op	      11 allocs/op
    
    New
    BenchmarkTokenizeEnglishText-4   	   21267	     58150 ns/op	  46.60 MB/s	   49144 B/op	      11 allocs/op
    

    The tests are passing, and I did some work in there to ensure compatibility on token types.

    You’ll find tokenizerbench.sh at the root of the project for repro steps.

    Shameless plug on my part, hopefully of interest.

    opened by clipperhouse 2
  • Keyword search does not work

    I am studying the beer-search example. After I changed the values of some keyword-mapped fields in the files under the data directory to another language, such as Chinese, keyword search (term search and match search) stopped working. The documentation says "The Keyword Analyzer does not perform any analysis on the input text", so keyword search should be independent of language. Why does this happen?

    opened by wolf1860 2
  • Is there a way to keep the last search "cursor"

    First of all: bleve feels like a very high quality lib! Great job guys!

    I am wondering if there is a way to keep the effort from the first search and carry it over to the next call. I am using bleve to sort all the data by a date field with bleve.NewMatchAllQuery(). The request takes about 1 min. It is fired with bleve.NewSearchRequestOptions(query, size, from, false). So far I don't see a way to improve performance when iterating over a huge number of entries.

    Am I missing something?

    opened by artvel 9
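
One general technique for this kind of deep paging is keyset ("search after") pagination: instead of asking for an offset, each request passes the sort key of the last hit it received, so earlier pages do not have to be collected and skipped again. The sketch below illustrates the idea on a sorted in-memory slice using only the standard library. The names `entry`, `less` and `pageAfter` are hypothetical and are not bleve API; whether the approach helps in practice depends on the index supporting such a request.

```go
package main

import (
	"fmt"
	"sort"
)

// entry is a hypothetical record with a sortable date key and a unique ID.
type entry struct {
	Date string // ISO dates such as "2022-03-22" sort lexicographically
	ID   string
}

// less orders entries by date, then by ID to break ties.
func less(a, b entry) bool {
	if a.Date != b.Date {
		return a.Date < b.Date
	}
	return a.ID < b.ID
}

// pageAfter returns up to size entries that sort strictly after the given key.
// entries must already be sorted with less. The zero entry starts at the beginning.
func pageAfter(entries []entry, after entry, size int) []entry {
	start := sort.Search(len(entries), func(i int) bool {
		return less(after, entries[i])
	})
	end := start + size
	if end > len(entries) {
		end = len(entries)
	}
	return entries[start:end]
}

func main() {
	entries := []entry{
		{"2022-01-01", "a"}, {"2022-01-02", "b"}, {"2022-01-02", "c"},
		{"2022-01-03", "d"}, {"2022-01-04", "e"},
	}
	var cursor entry // the zero value sorts before every real entry
	for {
		page := pageAfter(entries, cursor, 2)
		if len(page) == 0 {
			break
		}
		fmt.Println(page)
		cursor = page[len(page)-1] // remember only the last key, not an offset
	}
}
```

The tie-breaker on ID matters: without a unique final sort key, entries that share a date could be skipped or repeated at page boundaries.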
  • Non-valid UTF-8 sequence in explanation

    It looks like search.Explanation outputs invalid UTF-8 sequences. This seems to happen because the raw termMatch.ID is used, which is an internal ID backed by a []byte.

    https://github.com/blevesearch/bleve/blob/ae28975038cb25655da968e3f043210749ba382b/search/scorer/scorer_term.go#L157

    Would it be possible to escape the identifier, e.g. with hex encoding, to avoid inserting invalid UTF-8 sequences?

    opened by nopper 1
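
The escaping requested in the last issue above can be sketched with the standard library alone. This is only an illustration of the idea, not the change made in bleve: `printableID` is a hypothetical helper, and the "0x" prefix is an assumed convention for marking encoded values.

```go
package main

import (
	"encoding/hex"
	"fmt"
	"unicode/utf8"
)

// printableID returns id unchanged when it is valid UTF-8, and a
// hex-encoded form prefixed with "0x" otherwise, so the result can
// always be embedded in a UTF-8 string.
func printableID(id []byte) string {
	if utf8.Valid(id) {
		return string(id)
	}
	return "0x" + hex.EncodeToString(id)
}

func main() {
	fmt.Println(printableID([]byte("doc-1")))          // valid UTF-8, kept as-is
	fmt.Println(printableID([]byte{0xff, 0xfe, 0x01})) // invalid UTF-8, hex-encoded
}
```

A design caveat: a valid UTF-8 identifier that itself starts with "0x" cannot be told apart from an encoded one, so a real implementation might prefer to encode every internal ID unconditionally.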
Releases(v2.3.2)
  • v2.3.2(Mar 22, 2022)

    Minor changes
    • Adding configurable default threshold for field TFR cache (https://github.com/blevesearch/bleve/pull/1666)
    • Forked certain third party dependencies (https://github.com/blevesearch/bleve/pull/1667)
  • v2.3.1(Feb 24, 2022)

    Bug Fixes
    • Fix for potential file handle leaks in merger. (https://github.com/blevesearch/bleve/pull/1652)
    • Place a nil guard within TermFacets' Terms() API (https://github.com/blevesearch/bleve/pull/1654)
    • Upgrade zapx versions (https://github.com/blevesearch/bleve/pull/1655)
      • Place bounds check within memUvarintReader's ReadUvarint (https://github.com/blevesearch/zapx/pull/107)
  • v2.3.0(Dec 16, 2021)

    Enhancements
    • Upgrade golang.org/x/text to v0.3.7 (https://github.com/blevesearch/bleve/pull/1645)
    • Optimize FacetsBuilder's UpdateVisitor - for some significant performance gains (https://github.com/blevesearch/bleve/pull/1405) (changes API)
    • Optimize TermFacets - for significant performance gains (https://github.com/blevesearch/bleve/pull/1404) (changes API)
    • Introduce support for a new document field type: IP that supports range queries (https://github.com/blevesearch/bleve/pull/1546)
    Bug Fixes
    • Fix breakage in highlighting when using the HTML character filter (https://github.com/blevesearch/bleve/pull/1641)
    • Fix issue in parsing query strings over numeric data with boost settings (https://github.com/blevesearch/bleve/pull/1639)
    • Address seg faults seen within zap; zap versions upgrade (https://github.com/blevesearch/zapx/pull/95, https://github.com/blevesearch/zapx/pull/96)
    • Fix out-of-bounds issue while highlighting without term locations (https://github.com/blevesearch/bleve/pull/1590)
  • v2.2.2(Oct 28, 2021)

    Enhancements
    • Adding support for Croatian (hr) (https://github.com/blevesearch/bleve/pull/1517)
      • Fixes to import paths (https://github.com/blevesearch/bleve/pull/1634)
  • v2.2.1(Oct 4, 2021)

  • v2.2.0(Sep 21, 2021)

    Enhancements
    • Upgrade to RoaringBitMap/[email protected] (https://github.com/blevesearch/bleve/pull/1626)
      • Involves upgrading zapx, scorch_segment_api, vellum versions
    Bug Fixes
    • Fix issue in read-only mode of scorch index (https://github.com/blevesearch/bleve/pull/1624)
  • v2.1.1(Sep 14, 2021)

    Enhancements
    • Introducing a new hierarchy token filter (https://github.com/blevesearch/bleve/pull/1570)
    Bug Fixes
    • Update version of roaring used by dependency: scorch_segment_api to match bleve (https://github.com/blevesearch/bleve/pull/1622)
    • Fix to firing error on panic in scorch async routines (https://github.com/blevesearch/bleve/pull/1566)
  • v2.1.0(Aug 9, 2021)

    Enhancements
    • Supporting backup (copy) of an online scorch index (https://github.com/blevesearch/bleve/pull/1605)
    Bug Fixes
    • Address a panic in the merger loop (https://github.com/blevesearch/bleve/pull/1613)
  • v2.0.7(Jul 28, 2021)

    Bug Fixes
    • Fix to HighTerm while sorting "missing" values (https://github.com/blevesearch/bleve/pull/1608)
    • Fix to handling empty field name (https://github.com/blevesearch/bleve/pull/1594)
  • v2.0.6(Jun 11, 2021)

    This release hopefully completes the cleanup of v2.x for issues caused by one of our dependencies moving a repository.

    • Roaring Bitmaps updated to v0.7.3
    • Update to vellum v1.0.5
    • Update to zapx v15.2.1 v14.2.1 v13.2.1 v12.2.1 v11.2.1

    This brings all the repositories to the same versions of the Roaring bitmaps and bitset libraries.

  • v2.0.5(Jun 6, 2021)

    Emergency Release

    Update to latest Roaring bitmaps (v0.7.1). This release is being made in an attempt to fix a completely broken repository. This has NOT had the usual testing, and users with backwards compatibility concerns should NOT upgrade at this time.

    We will update this message as we learn more.

  • v2.0.4(Jun 6, 2021)

    Emergency release to try and fix issues caused by the repository rename of one of our dependencies: #1592

    Bug fixes:

    • Update repository name #1592
    • Switch to newer vellum also addressing package rename issue #1596
    • Remove redundant code #1582

    BROKEN: this release still contains dependency issues

  • v2.0.3(Mar 26, 2021)

  • v2.0.2(Feb 12, 2021)

    Enhancements
    • Scorch supports new config setting for bolt_timeout (https://github.com/blevesearch/bleve/pull/1553)
    • Bleve now uses vellum repository in the blevesearch organization (https://github.com/blevesearch/bleve/pull/1556)
    Bug Fixes
    • Fix query memory estimation (https://github.com/blevesearch/bleve/pull/1554)
  • v2.0.1(Jan 14, 2021)

    Enhancements
    • Export an Analyze(..) API for scorch and upsidedown. While the API signatures are slightly different, they will allow users to analyze a document per the index's mapping (https://github.com/blevesearch/bleve/pull/1540)
    • Additional unit tests, formatting fixes and README updates.
  • v2.0.0(Jan 13, 2021)

    This release is a new major version because it contains breaking changes to the package API. For complete details of the contents and reasoning behind these changes, please see: https://github.com/blevesearch/bleve/issues/1495

    Highlights

    • Remove circular dependency between Bleve and Zap modules
    • Make Scorch and Zap v15 the default index/segment type when using the New() method
    • New option to disable freq/norm information for a field
    • Types corrected for MatchQueryOperatorOr and MatchQueryOperatorAnd (see #1410)

    Deprecated features (may be removed in the future)

    • upsidedown index format and all key/value adapters
    • HTTP sub-package
    • bleve command-line tool sub-commands: bulk, create, index, dump
    • config sub-package
  • v1.0.14(Dec 8, 2020)

    This version is identical to v1.0.13 except that the go.mod now refers to a tagged release (v1.0.0) of the blevex module.

    This was necessary as blevex master will soon be evolving to support the planned release of Bleve 2.0.0.

  • v1.0.13(Nov 20, 2020)

    Enhancements
    • Support for zap v15.0.2, a file-format-compatible change which improves performance (https://github.com/blevesearch/zap/pull/44)
    Bug Fixes
    • Fix analyzer lookup during search when the field name contains the . character (https://github.com/blevesearch/bleve/pull/1496)
    • Remove duplicated text in the help output of the command-line tool (https://github.com/blevesearch/bleve/pull/1489)
    • Fix an inconsistency between an index mapping's default value for DocValuesDynamic and the omitempty JSON struct tag (https://github.com/blevesearch/bleve/pull/1485)
  • v1.0.12(Oct 6, 2020)

    Enhancements
    • Support for new zap v15 file format (significant space savings) https://github.com/blevesearch/zap/pull/27
    • Make it possible to shutdown the analysis queue freeing goroutines https://github.com/blevesearch/bleve/pull/1414
    • New stat reporting number of deleted items and estimate of space used by deleted items https://github.com/blevesearch/bleve/pull/1470
    Bug Fixes
    • HTML highlighter now performs escaping prior to output (SECURITY) https://github.com/blevesearch/bleve/pull/1465
    • Fix crash in ASCII folding character filter https://github.com/blevesearch/bleve/pull/1434
    • Fix bleve not closing index when encountering an error after open, but before it is returned https://github.com/blevesearch/bleve/pull/1479
    • Fix test issue with Go v1.15 https://github.com/blevesearch/bleve/pull/1466
    • Fix a test which was no longer testing the intended behavior https://github.com/blevesearch/bleve/issues/1458
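
The SECURITY fix in this release concerns escaping text before it is wrapped in highlight markup. A minimal sketch of that ordering with the standard library follows. It is not bleve's highlighter: `highlight` is a hypothetical function, the `<mark>` tag is an assumed choice, and matching is a plain case-sensitive substring search.

```go
package main

import (
	"fmt"
	"html"
	"strings"
)

// highlight wraps every occurrence of term in <mark> tags. All text,
// including the matched term, is HTML-escaped before the tags are
// added, so markup in the indexed text cannot reach the output.
func highlight(text, term string) string {
	if term == "" {
		return html.EscapeString(text)
	}
	parts := strings.Split(text, term)
	for i, p := range parts {
		parts[i] = html.EscapeString(p)
	}
	return strings.Join(parts, "<mark>"+html.EscapeString(term)+"</mark>")
}

func main() {
	fmt.Println(highlight("<script>alert(1)</script> bleve rocks", "bleve"))
}
```

Escaping each fragment first and adding the tags afterwards keeps the highlighter's own markup intact while neutralizing anything that came from the document.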
  • v1.0.10(Aug 24, 2020)

    Enhancements
    • Geo search compute range code cleanup #1447
    • Remove disjunction unadorned avoidance heuristic #1446
    • Remove the tooManyClauses limitation when an unadorned disjunction optimization is possible #1426
    • Improve performance of scorch internal event handling by using atomic ops instead of mutex #1419
    • Improve error message for tooManyClauses to report the field name #1413
    • Allow advanced users to alter the sort function implementation (only used in MultiSearch or searchBefore) #1400
    Bug Fixes
    • Update roaring bitmaps and bbolt, previous versions possibly have "unsafe" issues with newer versions of Go #1422
    • Update to latest vellum, fixes performance issue in corner case see https://github.com/couchbase/vellum/issues/32
    • Fix ineffectual assignment in merge planner options #1450
    • Improve sort mode auto heuristic for detecting numeric terms #1435
    • Fix error handling in numeric range searcher #1445
    • Fix memory leak when performing unadorned conjunction/disjunction optimization #1438
    • Fix error handling in DocIDReader #1443
    • Fix file handle leak for corner case of the merger #1417
    • Fix accounting of TermFieldReader started/finished when the Reader is reset internally performing Advance backwards #1415
  • v1.0.9(Jun 8, 2020)

    Enhancements
    • Add support for zap v14, reduced disk usage over previous versions.
    Bug Fixes
    • Fixed bug in parsing of forceSegmentVersion scorch config when configuration map was parsed from JSON (float64 instead of int). See #1401
    • Fixed bug in the Builder which caused it to always use zap v11, regardless of override configuration being used. See #1406
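
The forceSegmentVersion bug above comes from a general property of Go's encoding/json: numbers decoded into an interface{} value arrive as float64, so a plain type assertion to int fails. The sketch below shows the pitfall and one way to accept both forms. `intFromConfig` is a hypothetical helper, not the code used in bleve.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// intFromConfig reads key from a config map, accepting both an int (set
// directly in Go code) and a float64 (how encoding/json decodes numbers
// into interface{} values).
func intFromConfig(cfg map[string]interface{}, key string) (int, error) {
	switch v := cfg[key].(type) {
	case int:
		return v, nil
	case float64:
		return int(v), nil
	default:
		return 0, fmt.Errorf("%s: expected a number, got %T", key, v)
	}
}

func main() {
	var cfg map[string]interface{}
	if err := json.Unmarshal([]byte(`{"forceSegmentVersion": 14}`), &cfg); err != nil {
		panic(err)
	}
	_, isInt := cfg["forceSegmentVersion"].(int)
	fmt.Println("decoded as int:", isInt) // false: JSON numbers arrive as float64
	fmt.Println(intFromConfig(cfg, "forceSegmentVersion"))
}
```

A stricter variant could reject float64 values with a fractional part instead of truncating them.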
  • v1.0.8(Jun 8, 2020)

    Enhancements
    • New ForceMerge API available on scorch index. See #1393
    • New Builder API introduced to allow write-only index building, into an optimized single segment. See #1282
    • Add support for zap v13, addresses an index size regression introduced in zap v12.
    • Changed MultiSearch comparison implementation, with possible performance improvement. See #1398
    Bug Fixes
    • Fix analyzer inheritance issue in Document Mapping. See #1391
  • v1.0.7(Jun 8, 2020)

    Bug Fixes
    • Avoid a panic during highlighting more reliably. See #1371
    • Improved regexp replace character filter, now supports references. See #1351
    • Updated to latest versions of zap moving to our fork of mmap-go (fixing file cleanup issue on Windows). See #1289
  • v1.0.5(Jun 8, 2020)

    The first usable release of the v1.0.x line; versions v1.0.0 through v1.0.4 were all unusable due to release issues.

    For full information about the v1.0.x release, see #1350

  • v0.8.1(Sep 23, 2019)

  • v0.8.0(Jul 30, 2019)

  • v0.7.0(Feb 27, 2018)

    Behavior Changes
    • This is the last release of Bleve to support Go 1.5 and Go 1.6
    Enhancements
    • Pure Go analyzers added for several languages: Russian, Danish, Finnish, Hungarian, Dutch, Norwegian, Romanian, Swedish, Turkish
    • New token filter UniqueTerm which removes duplicate terms from the token stream
    Bug Fixes
    • Many scorch improvements as it is still under heavy development
  • v0.6.0(Jan 6, 2018)

    Behavior Changes
    • Index Format changed to save space in the backindex (MUST REBUILD INDEX!!!)
    Enhancements
    • New experimental indexing scheme scorch
    • Adjustments to Searchers to avoid moving Term Freq Reader backwards
    • Numeric range queries filter against the term dictionary now
    • Add support for BleveType() alternative for type detection
    • Add support for mapping to recognize/use TextMarshaler interface
    • Command-line query tool supports -sortby option
    • New pure Go Spanish analyzer
    • New pure Go German analyzer
    • Topn collector switches approach (heap vs slice) based on size+skip
    • Add IndexAdvanced() to allow direct indexing of documents without mapping
    • Optimize heap collector Final() for large counts
    • New term range query
    • New geo bounding box and point distance queries
    • New geo point distance sorting
    • ForestDB K/V store has been removed
    • Reduce garbage created while processing facets
    • Experimental index scheme smolder has been removed
    • Improve query string compatibility with ES
    • New multi phrase query
    • Improve performance of regular expression and wildcard queries
    • Many garbage and allocation improvements from Steve Yen
    Bug Fixes
    • Fix tests to properly close/remove temp indexes
    • Fix race condition in TestIndexMetadataRaceBug198
    • Fix mapping bug where closestDocMapping selected the wrong mapping
    • Fix data race in doc id search
    • Fix token start/end/position values in camelCase tokenizer
    • Fix issue with numeric range queries in query string
    • Fix nil ptr panic when using new text marshaler support
    • Fix panic in term range search
    • Fix geo point distance search
    • Fix race condition in incorrectly shared state in MultiSearch
    • Fix edge ngram output in some corner cases

    NOTE: these release notes are not up to our normal standards because we waited far too long between releases. We will attempt to release more frequently and to annotate release notes more carefully.

  • v0.5.0(Sep 29, 2016)

    Behavior Changes
    • DumpAll(), DumpDoc(), DumpFields() methods removed (#429)
    • New method to create in-memory indexes (#452)
    Enhancements
    • New experimental indexing scheme smolder
    • All bleve utilities migrated to single bleve cmd (#430)
    • Index upside_down performance optimizations
    • Searcher performance optimizations
    Bug Fixes
    • Tokenizer regexp no longer produces empty tokens
    • Shingle filter now stateless, was producing bogus output (#431)
    • Fix panic in collector when skipping more than total hits (#453)
  • v0.4.0(Sep 12, 2016)

    Behavior Changes
    • POSSIBLE BREAKING CHANGE - Byte Array Converters removed - SEE #392
    • Firestorm indexing scheme removed
    • Build tags added around references to persistent storage, making it slightly more possible to use in Google App Engine
    Enhancements
    • Match query supports changing default operator AND/OR
    • Facet speedups
    • UpsideDown indexing scheme query perf improvements
    • Index API changes to make internal document identifiers opaque
    • Ability to sort results by indexed fields
    • Support read_only flag for boltdb indexes
    • Improved unit test code coverage to 74%
    • Removed nex query string lexer, replaced with custom lexer
    • New query string lexer supports escaping reserved characters
    • Minor text analysis perf tweaks
    Bug Fixes
    • Corrected Advance() method of regexp, prefix and fuzzy searchers #342
    • Indexing of primitives fixed inside map/struct #389
    • Moss handling of 0-length values
    • Docs updated for date range querying #382
Owner
bleve
modern text indexing for go - supported and sponsored by Couchbase