Weaviate is a cloud-native, modular, real-time vector search engine

Overview



Demo of Weaviate

Weaviate GraphQL demo on a news article dataset, covering the Transformers module, GraphQL usage, semantic search, _additional{} features, Q&A, and the Aggregate{} function. You can try the demo on this dataset in the GUI here: semantic search, Q&A, Aggregate.

Description

Weaviate is a cloud-native, real-time vector search engine (aka neural search engine or deep search engine). There are modules for specific use cases such as semantic search, plugins to integrate Weaviate in any application of your choice, and a console to visualize your data.

GraphQL - RESTful - vector search engine - vector database - neural search engine - semantic search - HNSW - deep search - machine learning - kNN

Features

Weaviate makes it easy to use state-of-the-art AI models while giving you the scalability, ease of use, safety and cost-effectiveness of a purpose-built vector database. Most notably:

  • Fast queries
    Weaviate typically performs a 10-nearest-neighbor (10-NN) search across millions of objects in considerably less than 100 ms.

  • Any media type with Weaviate Modules
    Use state-of-the-art AI model inference (e.g. Transformers) for text, images, etc. at search and query time to let Weaviate manage the process of vectorizing your data for you, or import your own vectors.

  • Combine vector and scalar search
    Weaviate allows for efficient combined vector and scalar searches, e.g. “articles related to the COVID-19 pandemic published within the past 7 days”. Weaviate stores both your objects and the vectors and makes sure the retrieval of both is always efficient. There is no need for third-party object storage.

  • Real-time and persistent
    Weaviate lets you search through your data even if it’s currently being imported or updated. In addition, every write is written to a Write-Ahead-Log (WAL), so writes are persisted immediately, even when a crash occurs.

  • Horizontal Scalability
    Scale Weaviate for your exact needs, e.g. High-Availability, maximum ingestion, largest possible dataset size, maximum queries per second, etc. (Currently under development, ETA Fall 2021)

  • Cost-Effectiveness
    Very large datasets do not need to be kept entirely in memory in Weaviate. At the same time available memory can be used to increase the speed of queries. This allows for a conscious speed/cost trade-off to suit every use case.

  • Graph-like connections between objects
    Make arbitrary connections between your objects in a graph-like fashion to resemble real-life connections between your data points. Traverse those connections using GraphQL.

Documentation

You can find detailed documentation in the developers section of our website or directly go to one of the docs using the links in the list below.

Additional reading

Examples

You can find code examples here

Support

Contributing

Issues
  • Vectorization mask for classes


    Currently all string/text values as well as the class name and property names are considered in the vectorization. However, not all property names and values have to be important for the context. Take the following meta class of a table as an example:

    {
        "class": "Column",
        "description": "",
        "properties": [
            {
                "cardinality": "atMostOne",
                "description": "",
                "dataType": ["int"],
                "keywords": [],
                "name": "index"
            },
            {
                "cardinality": "atMostOne",
                "description": "",
                "dataType": ["text"],
                "keywords": [],
                "name": "name"
            },
            {
                "cardinality": "atMostOne",
                "description": "",
                "dataType": ["string"],
                "keywords": [],
                "name": "dataType"
            }
        ]
    }
    

    In this case the vector would be created based on Column, index, name, data, type and the values of the properties. However, the class name and property names shift the vector into the context of tables, while the context should be based solely on the values of the name and dataType properties.

    Proposal:

    • If nothing is specified the vector gets created automatically based on all information in the class
    • The user can explicitly mask information away from the vectorization in the schema:
    {
        "class": "Column",
        "description": "",
        "vectorizeClassName": false,
        "properties": [
            {
                "cardinality": "atMostOne",
                "description": "",
                "dataType": ["int"],
                "keywords": [],
                "name": "index"
            },
            {
                "cardinality": "atMostOne",
                "description": "",
                "dataType": ["text"],
                "keywords": [],
                "name": "name",
                "vectorizePropertyName": false,
                "vectorizePropertyValue": true
            },
            {
                "cardinality": "atMostOne",
                "description": "",
                "dataType": ["string"],
                "keywords": [],
                "name": "dataType",
                "vectorizePropertyName": false,
                "vectorizePropertyValue": true
            }
        ]
    }
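
    A minimal sketch of how the proposed mask could be applied when collecting the text that goes into the vector. The three flag names come from the proposal above; the function name vectorization_corpus, the default of True for unset flags, and the rule that only string/text values are used are assumptions made for illustration, not Weaviate's actual implementation.

```python
from typing import Any


def vectorization_corpus(schema: dict[str, Any], obj: dict[str, Any]) -> list[str]:
    """Collect the text fragments that would feed the vectorizer.

    Hypothetical helper: every flag defaults to True, so an unmasked
    schema keeps the current "vectorize everything" behaviour.
    """
    parts: list[str] = []
    if schema.get("vectorizeClassName", True):
        parts.append(schema["class"])
    for prop in schema["properties"]:
        name = prop["name"]
        if prop.get("vectorizePropertyName", True):
            parts.append(name)
        # Only string/text values take part in the vectorization.
        is_text = prop["dataType"][0] in ("string", "text")
        if is_text and prop.get("vectorizePropertyValue", True) and name in obj:
            parts.append(str(obj[name]))
    return parts


masked_schema = {
    "class": "Column",
    "vectorizeClassName": False,
    "properties": [
        {"dataType": ["int"], "name": "index"},
        {"dataType": ["text"], "name": "name",
         "vectorizePropertyName": False, "vectorizePropertyValue": True},
        {"dataType": ["string"], "name": "dataType",
         "vectorizePropertyName": False, "vectorizePropertyValue": True},
    ],
}
column = {"index": 3, "name": "price", "dataType": "float"}
corpus = vectorization_corpus(masked_schema, column)
```

    Note that the property name index still ends up in the corpus here, because the proposed schema does not mask it.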
    
    opened by fefi42 30
  • Add _snippets in GraphQL and REST API


    When an object is indexed on a larger text item (e.g., a paragraph like in the news article demo) certain search terms can be found in sentences.

    The idea is to add the start and end position of the most important part of the text corpus to the __meta property as a potentialAnswer, which can be enabled or disabled by setting a distanceToAnswer. This can work for both explore filters and where filters.

    Idea

    I was searching for something on Wikipedia under the search term "Is herbalife a pyramid scheme?" and got this response.

    Because Google isn't giving the actual answer but a location for the answer, we should be able to calculate something similar.

    [Screenshot of the search response, 2020-06-05]

    Explore example

    {
      Get{
        Things{
          Article(
            explore: {
              concepts: ["I want a spare rib"],
              certainty: 0.7,
              moveAwayFrom: {
                concepts: ["bacon"],
                force: 0.45
              }
            }
          ){
            name
            __meta {
              potentialAnswer(
                distanceToAnswer: 0.5 # <== optional
              ) {
                start
                end
                property
                distanceToQuery
              }
            }
          }
        }
      }
    }
    

    Result, where start and end give the starting and ending position, and property names the property in which the answer (the most important part) can be found.

    {
        "data": {
            "Get": {
                "Things": {
                    "Article": [
                        {
                            "name": "Bacon ipsum dolor amet tri-tip hamburger leberkas short ribs chicken turkey sirloin tenderloin shoulder pig bresaola. Pastrami ham hock meatball rump ribeye cupim, capicola venison burgdoggen brisket meatloaf. Turducken t-bone landjaeger pork chop, bresaola pig prosciutto pastrami sausage pancetta capicola short ribs hamburger tail spare ribs. Jerky kevin doner cupim pork belly picanha, pancetta capicola pork loin alcatra corned beef shank. Bacon chislic landjaeger doner corned beef, hamburger beef ribs filet mignon turducken tri-tip andouille pastrami chuck pork loin capicola. Prosciutto shankle chislic, shoulder tri-tip turducken meatball ham pork loin fatback hamburger pork chop bacon pork belly. Kevin sausage salami spare ribs tenderloin t-bone meatball picanha flank jowl pork chop tail turducken tri-tip.",
                            "__meta": [ // <== array because multiple results are possible and/or multiple properties might be indexed
                                {
                                    "property": "name",
                                    "distanceToQuery": 0.0, // <== distance to the query
                                    "start": 26,// <== just a random example
                                    "end": 130  // <== just a random example
                                }
                            ]
                        }
                    ]
                }
            }
        },
        "errors": null
    }
    

    Where example

    {
      Get {
        Things {
          Article(where: {
                path: ["name"],
                operator: Like,
                valueString: "New *"
            }) {
            name
            __meta {
              potentialAnswer(
                distanceToAnswer: 0.5 # <== optional
              ) {
                start
                end
                property
                distanceToQuery
              }
            }
          }
        }
      }
    }
    

    Result, where start and end give the starting and ending position, and property names the property in which the answer (the most important part) can be found.

    {
        "data": {
            "Get": {
                "Things": {
                    "Article": [
                        {
                            "name": "Bacon ipsum dolor amet tri-tip hamburger leberkas short ribs chicken turkey sirloin tenderloin shoulder pig bresaola. Pastrami ham hock meatball rump ribeye cupim, capicola venison burgdoggen brisket meatloaf. Turducken t-bone landjaeger pork chop, bresaola pig prosciutto pastrami sausage pancetta capicola short ribs hamburger tail spare ribs. Jerky kevin doner cupim pork belly picanha, pancetta capicola pork loin alcatra corned beef shank. Bacon chislic landjaeger doner corned beef, hamburger beef ribs filet mignon turducken tri-tip andouille pastrami chuck pork loin capicola. Prosciutto shankle chislic, shoulder tri-tip turducken meatball ham pork loin fatback hamburger pork chop bacon pork belly. Kevin sausage salami spare ribs tenderloin t-bone meatball picanha flank jowl pork chop tail turducken tri-tip.",
                            "__meta": [ // <== array because multiple results are possible and/or multiple properties might be indexed
                                {
                                    "property": "name",
                                    "distanceToQuery": 0.0, // <== distance to the query
                                    "start": 26,// <== just a random example
                                    "end": 130  // <== just a random example
                                }
                            ]
                        }
                    ]
                }
            }
        },
        "errors": null
    }
    

    Suggested (first) implementation

    1. Results are returned like the current implementation.
    2. The vectorization of the query is used to find the closest match of a word in a sentence. *
    3. When the closest word is found, the start and end are set to the beginning and end of the sentence containing it.
    4. The distanceToAnswer is the minimal distance. If it is not set, no start and end points will be available. If multiple sentences make the mark, they will all be part of the array.

    *- there might be potential to also do this on groups of words or complete sentences.
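
    The steps above can be sketched roughly as follows. This is only an illustration under stated assumptions: embed is a toy letter-frequency stand-in for the real vectorizer, the sentence splitter is a naive regex, and the function names are hypothetical. distanceToAnswer is treated as the largest word-to-query distance a sentence may have and still be returned.

```python
import math
import re
from typing import Optional


def embed(text: str) -> list[float]:
    """Toy stand-in for a real vectorizer: a letter-frequency vector."""
    vec = [0.0] * 26
    for ch in text.lower():
        if "a" <= ch <= "z":
            vec[ord(ch) - ord("a")] += 1.0
    return vec


def cosine_distance(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 1.0
    return 1.0 - dot / (norm_a * norm_b)


def potential_answers(text: str, query: str, prop: str,
                      distance_to_answer: Optional[float]) -> list[dict]:
    """Return start/end offsets of sentences whose closest word is near the query."""
    if distance_to_answer is None:  # step 4: not set means no start/end points
        return []
    query_vec = embed(query)
    answers = []
    for match in re.finditer(r"[^.!?]+[.!?]?", text):
        sentence = match.group()
        words = re.findall(r"[A-Za-z]+", sentence)
        if not words:
            continue
        # Step 2: closest word in this sentence to the query vector.
        best = min(cosine_distance(query_vec, embed(w)) for w in words)
        if best <= distance_to_answer:
            # Step 3: report the boundaries of the enclosing sentence.
            leading = len(sentence) - len(sentence.lstrip())
            answers.append({
                "property": prop,
                "distanceToQuery": best,
                "start": match.start() + leading,
                "end": match.end(),
            })
    return answers


text = "Bacon is tasty. I want spare ribs. Salad is fine."
found = potential_answers(text, "ribs", "name", distance_to_answer=0.1)
```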

    Related

    #1136 #1139 #1155 #1156

    graphql Contextionary autoclosed _underscoreProp 
    opened by bobvanluijt 22
  • Add geotype to datatypes


    Todos

    • [x] design decisions
      • [x] name of the field
        • current proposals: geoCoordinate, geoLocation, geoPoint
          • my personal favorite being geoCoordinate
          • cc @laura-ham @bobvanluijt
      • [x] design of the where filter
        • @laura-ham suggestions?
    • [x] spike out happy path in janusgraph only
      • [x] index creation
      • [x] adding property
      • [x] searching by property within range
    • [x] add new data type on import, goal: an import with geoCoordinates field succeeds
      • [x] allow in schema creation
      • [x] allow in class instance creation
      • [x] validation?
        • [x] add basic validation
        • [x] refactor validateSchemaInBody (it's way too long and extremely difficult to read/extend)
      • [x] janus graph create vertex
    • [x] include on simple read queries
      • [x] Local Get
      • [x] Network Get
    • [x] filter by property
      • [x] Local Filters
        • [x] extract filter from graphql
        • [x] set required validations, so that required fields cannot be omitted
        • [x] apply filter in connectors (Janusgraph)
      • ~~Network Filters~~
        • nothing to do here, they use the same code as local filters
    • [x] deal with property in GetMeta and Aggregate
      • proposal for now to simply not support those fields there
      • long-term?
    • [x] rename according to latest decisions
      • [x] pluralize name geoCoordinate -> geoCoordinates
      • [x] restructure where filter
        • [x] WithinRange -> WithinGeoRange
        • [x] valueRange -> valueGeoRange
        • [x] wrap distance and geoCoordinates in separate objects
    • [x] Update docs (@laura-ham volunteered to help out here)
    • [x] e2e/acceptance test

    Original Content below: for most up-to-date summary see comments below

    Abstract Examples

    • A location in a 2-dimensional space, i.e. x=3, y=5
    • Coordinates that point to a location on a world map

    Research Questions

    • Are geo types always two-dimensional or can they also be more dimensional?
      • Part 1: Do we have use cases for more than 2d?
      • Part 2: Is there technical support for more than 2d?
    • Optimal way to query: e.g. "within 500m of coordinates x,y" mixes the concepts of metric distance and coordinates. What API do we want?

    Features we'd need

    • Import/update geo coordinates
    • use geo in where filters (see research question)
    • do we want to allow aggregation functions like Aggregate or GetMeta
      • if so, what should they look like?
      • if so, is there technical support in our current stack?
      • if so, right from the start or later?

    cc @bobvanluijt @laura-ham
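
    To make the "within 500m of coordinates x,y" question concrete, here is a small sketch of a range check on latitude/longitude pairs. It uses the haversine great-circle formula, which is an assumption chosen for illustration; the issue does not state how the connector computes distances. The function names are hypothetical, and only WithinGeoRange and geoCoordinates come from the todo list above.

```python
import math

EARTH_RADIUS_M = 6_371_000.0  # mean Earth radius in metres


def haversine_m(lat1: float, lon1: float, lat2: float, lon2: float) -> float:
    """Great-circle distance in metres between two (latitude, longitude) points."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    d_phi = phi2 - phi1
    d_lambda = math.radians(lon2 - lon1)
    a = (math.sin(d_phi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(d_lambda / 2) ** 2)
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))


def within_geo_range(point: tuple[float, float],
                     center: tuple[float, float],
                     max_distance_m: float) -> bool:
    """Rough equivalent of a WithinGeoRange check for one geoCoordinates value."""
    return haversine_m(*point, *center) <= max_distance_m


# Roughly 111 m north of the center: inside a 500 m range.
inside = within_geo_range((52.001, 4.0), (52.0, 4.0), 500)
# Roughly 1.1 km north of the center: outside a 500 m range.
outside = within_geo_range((52.01, 4.0), (52.0, 4.0), 500)
```

    Keeping the distance (in metres) and the coordinates (in degrees) as separate arguments mirrors the decision in the todo list to wrap distance and geoCoordinates in separate objects.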

    enhancement Core implementation graphql discussion documentation API design & UX data types 
    opened by etiennedi 22
  • Dropping 'things' or 'actions' from the filter.


    In the section on filters in the GraphQL documentation, the paths of the filters are prefixed with "things" and "actions".

    This is superfluous information. The class names cannot overlap in any case.

    Current query:

     Get(where:{
          operator: And,
          operands: [{
            path: ["Things", "Animal", "age"],
            operator: LessThan
            valueInt: 5
          }, {
            path: ["Things", "Animal", "inZoo", "Zoo", "name"],
            operator: Equal,
            valueString: "London Zoo"
          }]
        }) { ... }
    

    After removing the kind of class:

     Get(where:{
          operator: And,
          operands: [{
            path: ["Animal", "age"],
            operator: LessThan
            valueInt: 5
          }, {
            path: ["Animal", "inZoo", "Zoo", "name"],
            operator: Equal,
            valueString: "London Zoo"
          }]
        }) { ... }
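
    For clients that still build the old paths, the change amounts to dropping the leading kind segment. A small sketch, with the filter represented as a plain dictionary and with hypothetical helper names:

```python
from typing import Any

KINDS = {"things", "actions"}


def strip_kind(path: list[str]) -> list[str]:
    """Drop a leading "Things"/"Actions" segment from a filter path."""
    if path and path[0].lower() in KINDS:
        return list(path[1:])
    return list(path)


def strip_kind_from_filter(where: dict[str, Any]) -> dict[str, Any]:
    """Return a copy of a where filter with the kind removed from every path."""
    result = dict(where)
    if "path" in result:
        result["path"] = strip_kind(result["path"])
    if "operands" in result:
        result["operands"] = [strip_kind_from_filter(op) for op in result["operands"]]
    return result


old_filter = {
    "operator": "And",
    "operands": [
        {"path": ["Things", "Animal", "age"], "operator": "LessThan", "valueInt": 5},
        {"path": ["Things", "Animal", "inZoo", "Zoo", "name"],
         "operator": "Equal", "valueString": "London Zoo"},
    ],
}
new_filter = strip_kind_from_filter(old_filter)
```

    This assumes no class is itself named Things or Actions; such a class name would be stripped by mistake.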
    
    opened by moretea 19
  • (DOC) REST API authentication to a WCS cluster


    Hi,

    I'm trying to connect to an authentication enabled WCS cluster, with the REST API (for WooCommerce owners who want to use WCS hosting rather than installing Weaviate).

    Could you point me to the documentation? The link I received after the cluster was created returns a 404: https://www.semi.technology/developers/weaviate/v1.8.0/configuration/authentication

    I noticed the Enterprise token, but I do not think it is what I need. https://www.semi.technology/developers/weaviate/current/configuration/authentication.html is only about self setup I suspect.

    opened by eostis 18
  • Docker-compose DB fails: "Could not create SSTable component"

    Weaviate doesn't start with Docker-compose and stays in a failing loop:

    db_1        | ERROR 2019-02-27 15:30:53,471 [shard 0] sstable - Could not create SSTable component /var/lib/scylla/data/system_schema/keyspaces-abac5682dea631c5b535b3d6cffd0fb6/system_schema-keyspaces-ka-114-TOC.txt.tmp. Found exception: std::system_error (error system:2, No such file or directory)
    db_1        | ERROR 2019-02-27 15:30:53,471 [shard 0] database - failed to write sstable /var/lib/scylla/data/system_schema/keyspaces-abac5682dea631c5b535b3d6cffd0fb6/system_schema-keyspaces-ka-114-Data.db: std::system_error (error system:2, No such file or directory)
    db_1        | WARN  2019-02-27 15:30:53,471 [shard 0] sstable - Unable to delete /var/lib/scylla/data/system_schema/keyspaces-abac5682dea631c5b535b3d6cffd0fb6/system_schema-keyspaces-ka-114-TOC.txt because it doesn't exist.
    db_1        | ERROR 2019-02-27 15:31:03,471 [shard 0] sstable - Could not create SSTable component /var/lib/scylla/data/system_schema/keyspaces-abac5682dea631c5b535b3d6cffd0fb6/system_schema-keyspaces-ka-116-TOC.txt.tmp. Found exception: std::system_error (error system:2, No such file or directory)
    db_1        | ERROR 2019-02-27 15:31:03,472 [shard 0] database - failed to write sstable /var/lib/scylla/data/system_schema/keyspaces-abac5682dea631c5b535b3d6cffd0fb6/system_schema-keyspaces-ka-116-Data.db: std::system_error (error system:2, No such file or directory)
    db_1        | WARN  2019-02-27 15:31:03,472 [shard 0] sstable - Unable to delete /var/lib/scylla/data/system_schema/keyspaces-abac5682dea631c5b535b3d6cffd0fb6/system_schema-keyspaces-ka-116-TOC.txt because it doesn't exist.
    db_1        | ERROR 2019-02-27 15:31:13,472 [shard 0] sstable - Could not create SSTable component /var/lib/scylla/data/system_schema/keyspaces-abac5682dea631c5b535b3d6cffd0fb6/system_schema-keyspaces-ka-118-TOC.txt.tmp. Found exception: std::system_error (error system:2, No such file or directory)
    db_1        | ERROR 2019-02-27 15:31:13,510 [shard 0] database - failed to write sstable /var/lib/scylla/data/system_schema/keyspaces-abac5682dea631c5b535b3d6cffd0fb6/system_schema-keyspaces-ka-118-Data.db: std::system_error (error system:2, No such file or directory)
    db_1        | WARN  2019-02-27 15:31:13,510 [shard 0] sstable - Unable to delete /var/lib/scylla/data/system_schema/keyspaces-abac5682dea631c5b535b3d6cffd0fb6/system_schema-keyspaces-ka-118-TOC.txt because it doesn't exist.
    
    bug docker 
    opened by bobvanluijt 18
  • SUGGESTION: Dump vectors


    Some data scientists might want to leverage the vectorization mechanism in Weaviate to train new models. The result would be similar to /things and /things/{UUID} but with a focus on a matrix to download all objects.

    RESTful URL suggestion: /c11y/vectors and /c11y/vectors/{UUID}.

    c11y/vectors/{UUID}

    returns:

    {
        "type": "thing",
        "vector": [
            0,
            0,
            0,
            0,
            //etc
        ]
    }
    

    c11y/vectors?page=0

    returns:

    {
        "result": {
            "UUID_1": {
                "type": "thing",
                "vector": [ //Array is same object as c11y/vectors/{UUID}
                    0,
                    0,
                    0,
                    0,
                    //etc
                ]
            },
            "UUID_2": {
                "type": "thing",
                "vector": [ //Array is same object as c11y/vectors/{UUID}
                    0,
                    0,
                    0,
                    0,
                    //etc
                ]
            }
        },
        "pages": 100 // total pages
    }
    
    wontfix 
    opened by bobvanluijt 17
  • Inconsistent http error responses


    Issue

    Our error responses are inconsistent. I would propose a uniform approach so our users always know which attribute to access when looking for an error message.

    Background

    I listed some reproducible examples below, using Postman against a local Weaviate with a dummy config and sending the specified queries to the /graphql endpoint (note: this issue is not endpoint-specific). The actual requests/responses are not of interest here; my point is about the response type.

    1. Unprocessable query

    This query returns a 400 (bad request) in the header and a JSON Body

    Request body:

    []
    

    To: /graphql

    Response:

    {
        "code": 400,
        "message": "parsing body body from \"\" failed, because json: cannot unmarshal object into Go value of type models.GraphQLQuery"
    }
    

    2. Empty query

    Only returns a 422 (unprocessable entity) code in the header

    Request body:

    {}
    

    To: /graphql

    Response:

    {}
    

    3. Valid GQL query asking for attribute of a class that has no instances in the database

    Returns a nested JSON error in the body and a 200 in the header. Note that this request returns a 200 because the request itself is valid.

    Request body:

    {"query":"{\n  Local {\n    Get {\n      Things {\n        TestThing {\n          uuid\n        }\n      }\n    }\n  }\n}\n","variables":null,"operationName":null}
    

    To: /graphql

    Response:

    {
        "data": {
            "Local": null
        },
        "errors": [
            {
                "locations": [
                    {
                        "column": 3,
                        "line": 2
                    }
                ],
                "message": "runtime error: invalid memory address or nil pointer dereference",
                "path": [
                    "Local"
                ]
            }
        ]
    }
    

    Three queries resulting in an error, three different types of output.
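
    As a sketch of what a uniform shape could look like, the helper below folds all three responses into one envelope. The envelope itself (a top-level errors list of objects with a message key) is only one possible answer to the questions below, and the function name is hypothetical.

```python
from typing import Any


def normalize_error(status: int, body: dict[str, Any]) -> dict[str, Any]:
    """Fold the three observed error shapes into one {"errors": [...]} envelope."""
    messages: list[str] = []
    # Shape 1: {"code": 400, "message": "..."}
    if isinstance(body.get("message"), str):
        messages.append(body["message"])
    # Shape 3: GraphQL-style {"data": ..., "errors": [{"message": "..."}]}
    for err in body.get("errors") or []:
        if isinstance(err, dict) and "message" in err:
            messages.append(str(err["message"]))
    # Shape 2: empty body, the error is only visible in the status code
    if not messages and status >= 400:
        messages.append(f"request failed with HTTP status {status}")
    return {"errors": [{"message": m} for m in messages]}


case_1 = normalize_error(400, {"code": 400, "message": "parsing body failed"})
case_2 = normalize_error(422, {})
case_3 = normalize_error(200, {
    "data": {"Local": None},
    "errors": [{"message": "runtime error: invalid memory address or nil pointer dereference",
                "path": ["Local"]}],
})
```

    With this shape a client always reads errors[i]["message"], regardless of which layer produced the failure.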

    Questions

    • Do we want to standardize error output?
    • Assuming standardisation is preferred, which return value do we want to use?
    • Any of the options in the example?
    • Something else entirely?

    cc: @moretea @bobvanluijt

    question Core implementation discussion 
    opened by JeroenStravers1 17
  • Suggestion: auto cut-off Explore search results


    Problem & current behaviour

    If the Explore filter in a GraphQL Get query is used, it is unclear how to set and control the certainty (or distance) parameter. This parameter controls which results are returned, but with the current design the user does not know what value to set it to in order to get optimal results.

    You don't want to see 'bad' results amongst the results, but you don't know beforehand where the cut off point of the certainty is. Additionally, we observed that the user prefers not to see any results if there are no good results at all.

    Proposed solution

    Automatically find a cut-off threshold for which results to show. This threshold can be calculated by e.g. an elbow in distances between each result and the query, or between a cluster of results and the query.
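
    One simple reading of the "elbow" idea is to cut at the largest gap between consecutive distances. The sketch below does exactly that; the function name and the min_gap parameter are hypothetical, and it uses absolute rather than relative gaps, which the second question below leaves open.

```python
def auto_cutoff(distances: list[float], min_gap: float = 0.1) -> list[float]:
    """Keep the results that come before the largest jump in distance.

    If no jump between neighbouring results reaches min_gap, there is no
    clear elbow and all results are returned.
    """
    ordered = sorted(distances)
    if len(ordered) < 2:
        return ordered
    gaps = [later - earlier for earlier, later in zip(ordered, ordered[1:])]
    biggest = max(range(len(gaps)), key=gaps.__getitem__)
    if gaps[biggest] < min_gap:
        return ordered
    return ordered[: biggest + 1]


kept = auto_cutoff([0.6, 0.1, 0.15, 0.12, 0.62])
```

    This does not yet cover the case where all results are bad; that would need an additional absolute threshold on the first distance.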


    Questions

    1. Should the user have the option to set whether they want to enable this auto function on their query?
    2. What value to set the (relative) cut-off point to? (=how big should the gap between the points or clusters relatively be?)
    graphql Developer Experience API design & UX DX autoclosed 
    opened by laura-ham 16
  • Meta information in Get{} results


    When Weaviate's semantic features are used to provide results to the end-user, we might be able to improve the UX by explaining why Weaviate arrived at these results. This could be done by introducing a __meta "property" that contains additional information per class.

    Available keys

    • semanticPath.concept (aka shortest semantic path) = A concept as defined in the docs or an entity in the form of a beacon.
    • semanticPath.distanceToQuery = The distance to the Explore{} filter.
    • semanticPath.distanceToParent = If there is a parent, the semantic distance.
    • semanticPath.distanceToNext = If there is a next item in the path, the distance.
    • distanceToQuery = The distance to the Explore{} filter.
    • distanceToParent = If there is a parent, the semantic distance.

    Example Query

    A query might look like this (also includes the proposed values)

    Note: the results serve only as examples.

    {
      Get {
        Things {
          Article(
            explore: {
              concepts: ["beatles"]
            }
            limit: 1
          ) {
            title
            InPublication {
              ... on Publication {
                name
                __meta {
                  semanticPath {
                    concept
                    distanceToQuery
                    distanceToParent
                    distanceToNext
                  }
                  distanceToQuery
                  distanceToParent
                }
              }
            }
            __meta {
              semanticPath {
                concept
                distanceToQuery
                distanceToParent
                distanceToNext
              }
              distanceToQuery
              distanceToParent
            }
          }
        }
      }
    }
    

    Example result

    Proposed result based on the above query.

    {
      "data": {
        "Get": {
          "Things": {
            "Article": [
              {
                "InPublication": [
                  {
                    "name": "The Guardian",
                    "__meta": {
                        "semanticPath": [{
                            "concept": "foo",
                            "distanceToQuery": 0.4, // the Explore{} filter
                            "distanceToParent": 0.8, // In this case the parent Article
                            "distanceToNext": null // there is no next item
                        }, {
                            "concept": "weaviate://localhost/things/6fe82690-bef8-3be4-9e22-7f1374e27fed", // a concept can be a word or a thing/action
                            "distanceToQuery": 0.6, // the Explore{} filter
                            "distanceToParent": 0.45, // In this case the parent Article
                            "distanceToNext": 0.2 // in this case the concept "foo"
                        }],
                        "distanceToQuery": 0.3, // the Explore{} filter
                        "distanceToParent": 0.2 // In this case the parent Article
                    }
                  }
                ],
                "title": "Opinion | John Lennon Told Them ‘Girls Don’t Play Guitar.’ He Was So Wrong.",
                "__meta": {
                    "semanticPath": [{
                        "concept": "baz",
                        "distanceToQuery": 0.4, // the Explore{} filter
                        "distanceToParent": null, // there is no parent
                        "distanceToNext": null // there is no next item
                    }, {
                        "concept": "weaviate://localhost/things/6fe82690-bef8-3be4-9e22-7f1374e27fed", // a concept can be a word or a thing/action
                        "distanceToQuery": 0.6, // the Explore{} filter
                        "distanceToParent": null, // there is no parent
                        "distanceToNext": 0.2 // in this case the concept "baz"
                    }],
                    "distanceToQuery": 0.3, // the Explore{} filter
                    "distanceToParent": null // there is no parent
                }
              }
            ]
          }
        }
      },
      "errors": null
    }
    
    enhancement graphql Developer Experience API design & UX 
    opened by bobvanluijt 16
  • API Proposal for Classification Feature


    API for #948

    Three API use cases need to be addressed:

    1. Start a classification, see status of ongoing classification
    2. View classified items
    3. View meta data about a thing which was part of a classification

    Here are proposals for all three:

    1. Trigger Classification, see Status

    POST /v1/classifications 
    {
      "class": "Company",
      "classifyProperties": ["materialGroupOne", "materialGroupTwo"],
      "basedOnProperties": ["description"],
      "type": "knn", // optional, can default to knn
      "k": 3 // optional, can default to something reasonable
    }
    
    returns GET response + Header 'Location: /v1/classifications/<id>'
    
    GET /v1/classifications/<classificationID>
    {
      "id": "abcdefgh",
      "class": "Company",
      "classifyProperties": ["materialGroupOne", "materialGroupTwo"],
      "basedOnProperties": ["description"],
      "status": "running",
      "started": <timestamp>,
      "completed": <timestamp> || null,
      "type": "knn",
      "k": 3
    }
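
    For readers unfamiliar with the "knn" type and the k parameter, the sketch below shows a generic k-nearest-neighbour majority vote over already-labelled vectors. It illustrates the general technique only; the function name, the Euclidean distance and the data layout are assumptions, not the classification code proposed in this issue.

```python
import math
from collections import Counter


def knn_classify(target: list[float],
                 labelled: list[tuple[list[float], str]],
                 k: int = 3) -> str:
    """Return the majority label among the k labelled vectors closest to target."""
    nearest = sorted(labelled, key=lambda item: math.dist(target, item[0]))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Toy vectors standing in for the vectorized "description" property.
training = [
    ([0.0, 0.0], "metals"),
    ([0.1, 0.1], "metals"),
    ([0.2, 0.0], "metals"),
    ([1.0, 1.0], "plastics"),
    ([0.9, 1.1], "plastics"),
]
label = knn_classify([0.15, 0.05], training, k=3)
```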
    

    2. View classified items

    No new APIs required, as classifications are just cross-refs. Any existing measures to view cross-refs (resolved in GQL Get or as beacons in REST GET /v1/things/<id>) can be used.

    3. View metadata about a thing which received a ref from a classification

    Regular GET /v1/things/<id> without meta parameter

    Unchanged, same as always:

    GET /v1/things/<id>
    {
      "class": "Company",
      "id": "<id>",
      "schema": {
        "description": "Foo bar baz bazzar",
        "materialGroupOne": [
          {
             "beacon": "..."
          }
        ],
       "materialGroupTwo": [
          {
             "beacon": "..."
          }
        ]
      }
    }
    

    Get with meta parameter GET /v1/things/<id>?meta=true

    GET /v1/things/<id>?meta=true
    {
      "class": "Company",
      "id": "<id>",
      "schema": {
        "description": "Foo bar baz bazzar",
        "materialGroupOne": [
          {
             "beacon": "...",
             "meta": null  <----- indicates this has not been classified, but set by a user
          }
        ],
       "materialGroupTwo": [
          {
             "beacon": "...",
             "meta": {
                "classification": {
                  "distanceWinning": 0.5,
                  "distanceLosing": 0.6
                }
              }
          }
        ]
      },
      "meta": {
        "classification": {
          "id": "<classificationId>",
          "completed": <timestamp>,
          "classifiedFields": ["materialGroupTwo"],
          "scope": ["materialGroupOne", "materialGroupTwo"],  <----- note that the scope was to classify both fields, but since the user had already set materialGroupOne the actual classified fields (see one row up) contain only a single field (materialGroupTwo)
          "basedOn": ["description"]
        }
      }
    }
    
    API design & UX Classifications (ML) 
    opened by etiennedi 16
  • TD-148: extend REST-API with a new endpoint to add a cross reference to a data object of a specific class

    The new endpoint POST /objects/{className}/{id}/references/{propertyName} deprecates endpoint POST /objects/{id}/references/{propertyName}. The old endpoint might change the wrong object because uuids are not unique among classes.
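
A minimal sketch of how a client could build a request for the class-scoped endpoint, using only the Python standard library. The `/v1` prefix, the `weaviate://localhost/<id>` beacon form in the body, and the helper name `build_add_reference_request` are assumptions for illustration and are not taken from the issue. The request is only constructed, not sent.

```python
import json
from urllib.parse import quote
from urllib.request import Request


def build_add_reference_request(base_url, class_name, object_id, property_name, target_id):
    """Build (but do not send) a POST request that adds a cross reference.

    The class name is part of the path, so the source object is identified
    by (class, id) rather than by id alone.
    """
    path = "/v1/objects/{}/{}/references/{}".format(
        quote(class_name, safe=""),
        quote(object_id, safe=""),
        quote(property_name, safe=""),
    )
    # Assumed body shape: a single beacon pointing at the target object.
    body = json.dumps({"beacon": "weaviate://localhost/" + target_id}).encode("utf-8")
    return Request(
        base_url.rstrip("/") + path,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
```

Because the class name is part of the path, two objects that share a UUID in different classes can no longer be confused.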

    opened by redouan-rhazouani 1
  • Non deterministic results

    I noticed that repeating the same query returns different results.

    First query: image

    Same results, but with different certainty values: image

    Different results: image

    opened by eostis 0
  • Create a script or function to re-index existing data while switching/updating to a new model

    When switching to a different t2v-transformers model, do I have to re-index the current data that is already in Weaviate? Or does Weaviate replace the vectors automatically? I am wondering what happens to, or what to do with, existing data (which probably has vectors based on the old model).

    I would consider it a valid use-case as models get better over time and one would probably switch to a newer model then. Probably need to re-index the whole db from scratch then.

    Can a command be created to facilitate such a use case?

    opened by tacohiddink 0
  • Aggregate min/max with filters does not work on array of float

    Below is a query with 3 aggregates. The first aggregate, "stats_filters_array_float_ko", is wrong. The other 2 are fine. It seems that aggregates with a filter on an array of floats are wrong.

    {
      search: Get {
        Text2vectransformers1(
          where: {path: ["wpsolr_type"], operator: Equal, valueString: "product"}
        ) {
          wpsolr_title
          wpsolr_pid
          wpsolr__price_f
        }
      }
      stats_filters_array_float_ko: Aggregate {
        Text2vectransformers1(
          where: {path: ["wpsolr_type"], operator: Equal, valueString: "product"}
        ) {
          wpsolr__price_f {
            minimum
            maximum
          }
        }
      }
      stats_no_filters_array_float_ok: Aggregate {
        Text2vectransformers1 {
          wpsolr__price_f {
            minimum
            maximum
          }
        }
      }
      stats_filters_simple_integer_ok: Aggregate {
        Text2vectransformers1(
          where: {path: ["wpsolr_type"], operator: Equal, valueString: "product"}
        ) {
          wpsolr_pid {
            minimum
            maximum
          }
        }
      }
    }
    
    

    Results of the 3 aggregates:

    {
      "data": {
        "search": {
          "Text2vectransformers1": [
            {
              "wpsolr__price_f": [
                10
              ],
              "wpsolr_pid": 44,
              "wpsolr_title": "product 1"
            },
            {
              "wpsolr__price_f": [
                20
              ],
              "wpsolr_pid": 46,
              "wpsolr_title": "product 2"
            }
          ]
        },
        "stats_filters_array_float_ko": {
          "Text2vectransformers1": [
            {
              "wpsolr__price_f": {
                "maximum": 5e-324,
                "minimum": 1.7976931348623157e+308
              }
            }
          ]
        },
        "stats_filters_simple_integer_ok": {
          "Text2vectransformers1": [
            {
              "wpsolr_pid": {
                "maximum": 46,
                "minimum": 44
              }
            }
          ]
        },
        "stats_no_filters_array_float_ok": {
          "Text2vectransformers1": [
            {
              "wpsolr__price_f": {
                "maximum": 20,
                "minimum": 10
              }
            }
          ]
        }
      }
    }
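
The two suspicious values are not arbitrary: 1.7976931348623157e+308 is the largest finite float64 and 5e-324 is the smallest positive one. One possible explanation (an assumption, not confirmed anywhere in this report) is a min/max accumulator that starts from these sentinels and never receives a value. The sketch below reproduces that symptom; `aggregate_min_max` is a hypothetical helper, not Weaviate code.

```python
import sys

# float64 extremes, matching the values shown in the broken aggregate.
LARGEST_FLOAT = sys.float_info.max                   # 1.7976931348623157e+308
SMALLEST_NONZERO_FLOAT = float.fromhex("0x1p-1074")  # 5e-324


def aggregate_min_max(values):
    """Min/max accumulator seeded with sentinel values (hypothetical)."""
    minimum = LARGEST_FLOAT
    maximum = SMALLEST_NONZERO_FLOAT
    for value in values:
        minimum = min(minimum, value)
        maximum = max(maximum, value)
    return {"minimum": minimum, "maximum": maximum}
```

With `[10, 20]` the helper returns the expected 10 and 20. With an empty input it returns the sentinels unchanged, which is the pattern seen in the filtered array aggregate.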
    
    
    opened by eostis 0
Releases(v1.13.2)
  • v1.13.2(May 20, 2022)

    Breaking Changes

    none

    New Features (Preview)

    • L2 distance by @etiennedi in https://github.com/semi-technologies/weaviate/pull/1953 This is in preview/experimental state and will be fully supported in v1.14.0

    Fixes

    • Bugfix: Patching object without vector by @parkerduckworth in https://github.com/semi-technologies/weaviate/pull/1941
    • WVT-91: node values not changed on flatten operation by @aliszka in https://github.com/semi-technologies/weaviate/pull/1948
    • Panic on filtered vector search with flat-search-cutoff by @antas-marcin in https://github.com/semi-technologies/weaviate/pull/1945
    • Fix ReadDeleteNode method in deserializer by @antas-marcin in https://github.com/semi-technologies/weaviate/pull/1950
    • Fix for multi shard unlimited vector search by @antas-marcin in https://github.com/semi-technologies/weaviate/pull/1955
    • WVT-31: missing return on equal keys for setTombstone by @aliszka in https://github.com/semi-technologies/weaviate/pull/1954

    Full Changelog: https://github.com/semi-technologies/weaviate/compare/v1.13.1...v1.13.2

    Source code(tar.gz)
    Source code(zip)
  • v1.13.1(May 3, 2022)

    Breaking Changes

    none

    New Features

    none

    Fixes

    • Fix HNSW Delete performance degradation on concurrent deletes in #1942 by @etiennedi

    Full Changelog: https://github.com/semi-technologies/weaviate/compare/v1.13.0...v1.13.1

  • v1.13.0(May 3, 2022)

    Breaking Changes

    none

    New Features

    • Faceted Search / Aggregate + near<Media> @antas-marcin, @parkerduckworth in https://github.com/semi-technologies/weaviate/pull/1790 This release allows combining a vector search (using nearVector, nearObject, nearText, etc.) with an Aggregation. This allows for faceted vector search. In order for such an Aggregation to work the vector search needs to be limiting the space somehow. This can either happen by specifying an explicit limit or by specifying a desired target certainty/distance.

    • Sorting by @antas-marcin in https://github.com/semi-technologies/weaviate/pull/1886, https://github.com/semi-technologies/weaviate/pull/1924 This release adds the ability to sort search results. Sorting does not currently make use of a columnar storage mechanism specified for this property and instead needs to read parts of the affected objects from disk. This has a performance impact on very large datasets, and an improved solution is expected to follow later on.

    • Support filtering by timestamp by @parkerduckworth in https://github.com/semi-technologies/weaviate/pull/1930 Prior to this release timestamps such as creationTimeUnix and lastUpdateTimeUnix could not be used in filters. This release adds the ability to optionally include those fields in the inverted index. If included, they can be used in filters using a special underscore (_) notation. E.g. path: ["_creationTimeUnix"].

    • Batch Delete by Filter by @antas-marcin in https://github.com/semi-technologies/weaviate/pull/1935 This release adds a new /v1/batch endpoint which allows for deleting all objects that match a specific filter.

    • Support DPR transformers models in text2vec-transformers @aliszka in https://github.com/semi-technologies/weaviate/pull/1911 These models differ from "regular" transformers models in that they use two separate models for encoding the query and the passage instead of using the same model for both. This two-model configuration is now supported.
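
The timestamp filter and the batch delete described above can be combined, for example to remove everything created before a cutoff. The following is a rough sketch under several assumptions: that the endpoint is `DELETE /v1/batch/objects`, that the body uses `match`, `dryRun`, and `output` keys, and that timestamps are Unix milliseconds passed as `valueString`. The class name `Article` and all helper names are hypothetical. The request is built but not sent.

```python
import json
from datetime import datetime, timedelta, timezone
from urllib.request import Request

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def to_unix_millis(moment):
    """Convert a timezone-aware datetime to integer Unix milliseconds."""
    return (moment - EPOCH) // timedelta(milliseconds=1)


def created_before_filter(cutoff):
    """Where filter on the underscore-prefixed creation timestamp."""
    return {
        "path": ["_creationTimeUnix"],
        "operator": "LessThan",
        "valueString": str(to_unix_millis(cutoff)),
    }


def build_batch_delete_request(base_url, class_name, where, dry_run=True):
    """Build a batch delete request; dry_run defaults to True for safety."""
    body = {
        "match": {"class": class_name, "where": where},
        "dryRun": dry_run,
        "output": "verbose",
    }
    return Request(
        base_url.rstrip("/") + "/v1/batch/objects",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="DELETE",
    )
```

Defaulting to a dry run is a deliberate choice in this sketch, since a filter that matches more than intended would otherwise delete data irreversibly.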

    Fixes

    • Reduce allocations for map compaction: Overall 264 GB -> 126 GB by @etiennedi in https://github.com/semi-technologies/weaviate/pull/1909
    • Fix flaky aggregate acceptance test by @parkerduckworth in https://github.com/semi-technologies/weaviate/pull/1925
    • Updated DocumentationHref to redirect to correct links by @Asmit2952 in https://github.com/semi-technologies/weaviate/pull/1923
    • Grammatically updated readme file by @omarzain27 in https://github.com/semi-technologies/weaviate/pull/1858

    New Contributors

    • @Asmit2952 made their first contribution in https://github.com/semi-technologies/weaviate/pull/1923
    • @omarzain27 made their first contribution in https://github.com/semi-technologies/weaviate/pull/1858

    Full Changelog: https://github.com/semi-technologies/weaviate/compare/v1.12.2...v1.13.0

  • v1.12.2(Apr 13, 2022)

    Breaking Changes

    none

    New Features

    none

    Bug Fixes

    • Bugfix for #1903 (LSM crash recovery journey for "Map" type) by @etiennedi in https://github.com/semi-technologies/weaviate/pull/1904 Prior to this release a crash or other unexpected interruption may have left the inverted index in an unrecoverable state. It would return a panic on startup after a crash.

    • Fix limiting unlimited vector search by @parkerduckworth in https://github.com/semi-technologies/weaviate/pull/1906 A new bug was introduced in v1.12.0 where - if both a limit and certainty were set on a vector search - the limit may have been ignored in some cases. This release fixes this and increases test coverage around this area to prevent further issues.

    • Modules: init dependencies logic panics on specific module init order by @antas-marcin in https://github.com/semi-technologies/weaviate/pull/1902 Prior to this release, starting Weaviate with modules that have dependencies on other modules resulted in startup errors in some (rare) cases. This fix makes dependencies between modules more explicit and solves the startup issues.

    • gh-1900 add nil-check on findBestEntryPoint by @etiennedi in https://github.com/semi-technologies/weaviate/pull/1910 A missing nil-pointer check in the HNSW delete logic may have returned errors in rare cases.

    Full Changelog: https://github.com/semi-technologies/weaviate/compare/v1.12.1...v1.12.2

  • v1.12.1(Apr 7, 2022)

    Breaking Changes

    none

    New Features

    none

    Fixes

    • Index out of range panic #1897 This release fixes an issue that was introduced in v1.12.0 if upgrading from a v1.11.0 or prior. See #1897 for details. If you have run into this issue with v1.12.0, use v1.12.1 instead. If you have imported from scratch into v1.12.0 you should not have been affected by this issue, but upgrading to v1.12.1 is still recommended.
  • v1.12.0(Apr 5, 2022)

    Important: This release may introduce a new bug if you are upgrading from v1.11.0. Please use v1.12.1 instead where this bug has been fixed.

    Breaking Changes

    none

    New Features

    • Index full string field by @aliszka in https://github.com/semi-technologies/weaviate/pull/1862, #1821 This new feature allows turning off tokenization for string fields, so that instead of splitting and indexing at the word boundary, the whole field is indexed. This allows for matching a string including spaces, and avoiding undesired partial string matching, such as returning "light grey" when the search was for "grey".

    • Make Inverted Index stopword lists fully configurable by @parkerduckworth in https://github.com/semi-technologies/weaviate/pull/1870 This feature introduces a fully configurable stopword list to all inverted-index features. This is in anticipation of BM25 support (and mixed BM25/dense vector search) coming soon, but the feature also applies to exact matches on the inverted index.

    • Unlimited vector search by Certainty by @parkerduckworth in https://github.com/semi-technologies/weaviate/pull/1883 Prior to this feature, a vector search with a specified certainty might have cut off too early if the internal limit was hit first. For example, if the search returned exactly 100 results, but the last result was still within the desired certainty range, there was a chance that there would have been more matches that were not returned. This is especially critical when doing a vector-search-based aggregation (coming soon). This feature allows returning all certainty matches, no matter how many. A global maximum can be configured to prevent a query that matches the whole DB from provoking an OOM situation, which would be a potential attack vector.

    • Shard API (Mark shard(s) as read-only) by @parkerduckworth in https://github.com/semi-technologies/weaviate/pull/1860 This new feature exposes the status of the individual shards over the API and allows for marking a shard as ready that was previously marked as read-only. When a shard is marked read-only all read queries can continue but write queries are prohibited.

    • Feature/periodically scan disk by @parkerduckworth in https://github.com/semi-technologies/weaviate/pull/1861 This new feature is the first to make use of the new shard-status API. There are two new configurable thresholds for disk pressure. If the disk usage exceeds a certain percentage (e.g. 80%) a warning is printed. If the disk pressure continues to rise and a second threshold (e.g. 90%) is crossed, all shards on that particular node will be automatically marked read-only.
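
The two-threshold behavior of the disk scan can be sketched as a small decision function. This is an illustration only: the function names are hypothetical, the 80% and 90% defaults are just the example values from the text, and the real scan interval and configuration keys are not shown here.

```python
import shutil


def disk_pressure_action(used_bytes, total_bytes, warn_pct=80.0, readonly_pct=90.0):
    """Return 'ok', 'warn', or 'read-only' for the given disk usage."""
    used_pct = 100.0 * used_bytes / total_bytes
    if used_pct > readonly_pct:
        # Second threshold crossed: shards on this node become read-only.
        return "read-only"
    if used_pct > warn_pct:
        # First threshold crossed: only a warning is printed.
        return "warn"
    return "ok"


def check_path(path):
    """Evaluate the thresholds against the real disk that holds `path`."""
    usage = shutil.disk_usage(path)
    return disk_pressure_action(usage.used, usage.total)
```

Calling `check_path(".")` periodically would mirror the described scan. A shard that was marked read-only this way can later be set back to ready through the shard API.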

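The difference between word-level and whole-field indexing of string fields, described in the first feature above, can be illustrated with a toy matcher. This is a conceptual sketch, not Weaviate's inverted index. The mode names `"word"` and `"field"` are assumptions, and real tokenization rules may differ in detail.

```python
def matches(stored, query, tokenization):
    """Toy equality match against a stored string under two indexing modes."""
    if tokenization == "word":
        # The field is split at whitespace; every query token must be present.
        tokens = stored.split()
        return all(token in tokens for token in query.split())
    if tokenization == "field":
        # The whole field is indexed as a single token.
        return stored == query
    raise ValueError("unknown tokenization: " + tokenization)
```

Under `"word"`, a search for "grey" also hits "light grey". Under `"field"`, only the exact value matches, and a value that contains spaces can be matched as a whole.
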
    Fixes

    • Improve import performance on many-core machines by @etiennedi in https://github.com/semi-technologies/weaviate/pull/1879 tl;dr: With this improvement, we have been able to see 20% faster imports on machines with many cores (e.g. 60 cores) while reducing memory spikes. The long version: Please see #1879 for what changes were made internally. Mainly limiting import workers to the number of available CPU cores and reducing the necessity of locking by copying more memory to a local thread.

    • Fix HNSW commit log issue where the index would be too large after restart or crash by @etiennedi in https://github.com/semi-technologies/weaviate/pull/1871, #1868 The compaction process for the HNSW index commit logs was losing some information, leading to a situation where the links inside the HNSW graph were appended indefinitely instead of being replaced. This led to massive index sizes after restarts that degraded performance and led to unnecessarily large memory usage. This fix makes sure that all information is propagated correctly and indices are identical whether initially built in memory or rebuilt from commit logs that were individually compacted.

    • Fix broken dynamic ef calculation by @etiennedi in https://github.com/semi-technologies/weaviate/pull/1880, #1878 Version v1.9.0 introduced more control over setting ef at runtime. However, it did not work as expected. This commit fixes the values. Those who have never touched the ef setting and were using small limits, will see an improvement in vector search quality with all default ef parameters due to this fix. If you had manually set ef already, this fix has no effect on you.

    • The following internal/non-user-facing fixes were made either to improve reliability or to improve the DX for Weaviate contributors (fix flaky tests, etc):

      • (internal) Upgrade and fix golangci lint errors by @aliszka in https://github.com/semi-technologies/weaviate/pull/1864
      • (internal) Build tag regex pattern update by @parkerduckworth in https://github.com/semi-technologies/weaviate/pull/1869
      • (internal) gh-1872 clean up disk segment list properly on shutdown by @etiennedi in https://github.com/semi-technologies/weaviate/pull/1873
      • (internal) WEAVIATE-62 Remove obsolete hnsw files by @etiennedi in https://github.com/semi-technologies/weaviate/pull/1875
      • (internal) gh-1868 fix broken locks in commit logger, make hnsw clean up after themselves by @etiennedi in https://github.com/semi-technologies/weaviate/pull/1877
      • (internal) gh-1884 WEAVIATE-70 fix flaky multi-shard integration test by @etiennedi in https://github.com/semi-technologies/weaviate/pull/1885

    New Contributors

    • @aliszka made their first contribution in https://github.com/semi-technologies/weaviate/pull/1864

    Full Changelog: https://github.com/semi-technologies/weaviate/compare/v1.11.0...v1.12.0

  • v1.11.0(Mar 14, 2022)

    Changes

    Breaking Changes

    none

    New Features

    • Open AI module: provide API key at query time by @antas-marcin in https://github.com/semi-technologies/weaviate/pull/1817 For untrusted server environments, it is not advised to store the third-party API key along with the setup. Instead it should be provided at runtime which is now possible in this release.
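
As a rough illustration of supplying the key per request instead of storing it with the setup, the sketch below attaches it as an HTTP header on a GraphQL call. The header name `X-OpenAI-Api-Key`, the `/v1/graphql` path, and the helper name are assumptions that should be checked against the module documentation. The request is built but not sent.

```python
import json
from urllib.request import Request


def build_graphql_request(base_url, query, openai_api_key):
    """Build a GraphQL POST request that carries the API key per request."""
    body = json.dumps({"query": query}).encode("utf-8")
    headers = {
        "Content-Type": "application/json",
        # Assumed header name; the key never has to live on the server.
        "X-OpenAI-Api-Key": openai_api_key,
    }
    return Request(
        base_url.rstrip("/") + "/v1/graphql",
        data=body,
        headers=headers,
        method="POST",
    )
```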

    Fixes

    • Improve way objects are counted on disk #1811 Prior to this release a meta { count } in Aggregate would require reading every object from disk. With this change the net additions of each segment are calculated when initializing the segment, and count only has to sum up each segment's value, which is orders of magnitude faster.

    • Internally version index/shard changes more precisely #1833 This release introduces a new internal shard/index versioning system that will allow introducing breaking changes in a non-breaking fashion. For example, new indexes created with v1.11.0 will store the keys of the Map type in the LSM store in an always-sorted fashion for additional performance (see below). Indexes built prior to v1.11.0 will still work with this version as they will simply be sorted at read-time.

    • Delete using filter leaves objects searchable by @antas-marcin in https://github.com/semi-technologies/weaviate/pull/1845 #1836 This release fixes an issue where duplicate IDs across classes led to issues on delete. Since DELETE /v1/objects/{id} would previously only delete the first object found, there could be a situation where the specified UUID still existed on another class. With this fix, every object with the specified ID - regardless of class - will be deleted. As part of this investigation we have found out that the current DELETE API is suboptimal: It does not take the class name as a parameter and therefore prevents classes from acting as real namespaces. We will deprecate this API in a future release and will add an alternative as part of the same release.

    • HNSW index fails if the initial insert has doc id > 24999 by @antas-marcin in https://github.com/semi-technologies/weaviate/pull/1851 #1848 With the introduction of importing objects without a vector and then adding a vector later in one of the previous releases, a new buggy situation was created: When more than 25,000 objects without a vector were imported, the next import with a vector would fail as the initial size of the HNSW index was smaller than the doc id and there was no check for the initial insert. This release fixes this.

    • Fix segfault by copying memory by @etiennedi in https://github.com/semi-technologies/weaviate/pull/1843 #1837 Prior to this release, it was possible to run into a SEGFAULT when importing objects and querying the inverted index concurrently. The cause for this was memory that was shared longer than a lock was held in the LSM store. Therefore an LSM compaction could remove old segments while the memory of that segment was still in use, thus leading to a SEGFAULT. Since it is not reasonable to hold such a lock for excessively long times, this was fixed by copying the respective memory on read which effectively makes it read-only and thus thread-safe.

    • Fix create/update timestamp issues in GraphQL and REST by @parkerduckworth in https://github.com/semi-technologies/weaviate/pull/1847 #1844 There were two issues: (1) the create timestamp would be overwritten on an update, (2) neither the create, nor the update timestamp could be retrieved using GraphQL. This release fixes both.

    • Fix LSM net additions count and add WAL threshold by @parkerduckworth in https://github.com/semi-technologies/weaviate/pull/1855 #1830 Prior to this release very frequent object updates, such as during a POST /v1/batch/references request, would lead to very frequent flushing of the memtable, leading to a lot of small segments. Initializing and merging those unnecessarily small segments cost a lot of time later on during large imports. This was caused by the memtable size assuming each write was an addition. Thus any replaced value would count into the flush size counter, leading to a premature flush. This release fixes this behavior by considering the net additions of a write. In addition, a new threshold is introduced to make sure that update-only requests, which don't increase the memtable size counter, do not lead to excessively large WALs.

    • Improve how inverted index is stored on disk #1832 For performance reasons keys of the LSM store type Map will now be stored in an always-sorted fashion. This allows for faster merging and scoring at runtime. This change only affects new indices built from this version on and is non-breaking to older versions due to #1833.

    • Sort memtable KV pairs on read by @etiennedi in https://github.com/semi-technologies/weaviate/pull/1853 #1852 Fixes an unreleased performance regression that would have been introduced by #1832

    • Fix nil-pointer in segment cursor by @etiennedi in https://github.com/semi-technologies/weaviate/pull/1859 #1850 Prior to this fix, it was possible to run into errors when listing objects while importing them. The cause for this bug was an incorrectly placed lock that was obtained slightly too late, thus allowing for the possibility of a race to occur. This fixes this by placing the lock correctly.

    • Fix typos/documentation

      • Fix deprecation log field typo by @parkerduckworth in https://github.com/semi-technologies/weaviate/pull/1823
      • Restore build tags that travisbot removed by @parkerduckworth in https://github.com/semi-technologies/weaviate/pull/1857
      • Fixed typo in error message by @illagrenan in https://github.com/semi-technologies/weaviate/pull/1818

    New Contributors

    • @illagrenan made their first contribution in https://github.com/semi-technologies/weaviate/pull/1818
    • @parkerduckworth made their first contribution in https://github.com/semi-technologies/weaviate/pull/1823

    Full Changelog: https://github.com/semi-technologies/weaviate/compare/v1.10.1...v1.10.2

  • v1.10.1(Feb 1, 2022)

    Breaking Changes

    none

    New Features

    none

    Fixes

    • Fixes a bug where grouping in Get {} would error #1810
      • group was returning an error during a type: merge operation due to lack of information about the object's class which is now mandatory to pass due to updates in the module system introduced in v1.9.x
      • group with type:merge was returning an error when there was a reference requested in a GraphQL query. If there was a match and a reference existed the implementation of the group made incorrect assumptions about the internal type of the id field
  • v1.10.0(Jan 27, 2022)

    Breaking Changes

    none

    New Features

    • Open AI module This module gives you a very convenient way to integrate OpenAI embeddings into Weaviate. The module will act as a vectorizer for both importing documents and vectorizing queries. You can choose from any of the supported Open AI models and at inference time Weaviate will send requests to OpenAI. Requires a valid OpenAI API Key.

      See the Weaviate Open AI Module docs page for full usage instructions.

    • QnA rerank based on answer quality #1779 Prior to this version, QnA would always extract the answer from the top 1 result as determined by the semantic search as part of ask: { question: "foo?" }. Now, multiple answer "candidates" can be taken from the top n results. If one of the lower down results has a better qna-specific score than a previous result, this result is pushed up in the results. To enable, set ask: { rerank: true }. Note that this feature also removes the limitation that answer extraction would only happen on the first result (#1791). Be aware that a high limit value will lead to a high number of QnA inference calls.

    • HNSW EF boundaries & better control #1789 This feature gives you more control over the HNSW query time ef parameter and prevents it from dropping too low. Prior to this feature result quality could degrade on requests with low limits if ef was set to -1 which is the default and means "let Weaviate pick". You can now set a lower boundary (dynamicEfMin) which defaults to 100 and an upper boundary (dynamicEfMax) which defaults to 500. You can also alter the factor (dynamicEfFactor) which defaults to 8 and controls how ef is automatically derived from limit.

    • HEAD /v1/objects/{id} #1784 You can now send a HEAD request instead of a GET request to /v1/objects/{id} to efficiently check if an object exists without having to load the entire object from disk and unmarshal all its properties. The response has no body and returns either 204 when the object exists or 404 when it doesn't.

    • Allow adding objects without a vector #1800 Prior to this version a class would either skip vector indexing and never have a vector or it would allow indexing, but then a vector was required. Now you can import objects without a vector and later update them to include a vector.

    • Manually overwrite vector - even when vectorizer module is present #1801 Prior to this version, you would either use a vectorizer or none. But if you decided to go for a vectorizer you could not manually influence the vector. With this new feature, you can now override it. You still have to make sure that your vector is compatible with both its form (same dims), as well as semantics (i.e. matching vector space).
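
The dynamic ef boundaries described above can be read as a clamp around `limit * factor`. The sketch below uses the stated defaults (100, 500, 8). The final step, which keeps ef from falling below the requested limit, is an assumption added here because HNSW cannot return more results than ef, and it is not stated in the release note. `dynamic_ef` is a hypothetical helper.

```python
def dynamic_ef(limit, ef_min=100, ef_max=500, factor=8):
    """Derive a query-time ef from the requested limit."""
    ef = limit * factor
    # Keep ef inside the configured boundaries.
    ef = max(ef_min, min(ef, ef_max))
    # Assumption: ef must never be smaller than the number of results wanted.
    return max(ef, limit)
```

With the defaults, a limit of 5 yields ef = 100 instead of 40, which is the low-limit case where result quality previously degraded.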

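The HEAD-based existence check described above might be used from a client as sketched below, with only the Python standard library. The helper names and the base URL are illustrative assumptions. The status handling follows the text: 204 means the object exists and 404 means it does not.

```python
from urllib.error import HTTPError
from urllib.request import Request, urlopen


def build_exists_request(base_url, object_id):
    """Build a HEAD request for /v1/objects/{id}."""
    return Request(base_url.rstrip("/") + "/v1/objects/" + object_id, method="HEAD")


def exists_from_status(status):
    """Map the documented status codes to a boolean."""
    if status == 204:
        return True
    if status == 404:
        return False
    raise ValueError("unexpected status: %d" % status)


def object_exists(base_url, object_id):
    """Send the HEAD request; urllib raises HTTPError for a 404."""
    try:
        with urlopen(build_exists_request(base_url, object_id)) as response:
            return exists_from_status(response.status)
    except HTTPError as error:
        return exists_from_status(error.code)
```
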
    Fixes

    • Fix a bug where incorrect module defaults would prevent updating HNSW settings #1799

    • Fix unnecessarily strict class name restrictions #1786 The only restrictions are now that a class name starts with an uppercase letter (to distinguish it in ref types from primitive props, which are all lowercase) and that it is GraphQL-compatible. So a class name such as A_super_awesome_cl4ss_9001 is now valid.
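
The relaxed class name rule can be approximated with one regular expression. It assumes that "GraphQL-compatible" means the GraphQL name grammar (letters, digits, and underscores, not starting with a digit). The exact validation in Weaviate may differ, and `is_valid_class_name` is a hypothetical helper.

```python
import re

# Uppercase first letter, then any GraphQL name characters.
_CLASS_NAME = re.compile(r"[A-Z][_0-9A-Za-z]*")


def is_valid_class_name(name):
    """Check the two restrictions: uppercase start and GraphQL-compatible."""
    return _CLASS_NAME.fullmatch(name) is not None
```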

  • v1.9.1(Jan 19, 2022)

    Breaking Changes

    none

    New Features

    none

    Fixes

    • Allow running "conflicting" modules in the same setup (#1744)

      Prior to this release, Weaviate would not start up if multiple modules would try to provide the same search operators, such as nearText. For example text2vec-contextionary and text2vec-transformers could not run in the same setup. The reason for this was that Explore{} which would search across classes would not be able to handle incompatible vector spaces. This release makes sure that the provided search operator belongs to the configured vectorizer. In turn, cross-class searching across incompatible vector spaces such as using Explore {} will be deactivated if conflicting modules are present.

    • Grouping by ref prop leads to error (#1778)

      Thanks to Alex Cannan for discovering this

    • Delete fails on multi-node setup (#1780)

      Thanks to ayoub louati for discovering this.

    • Bug: Querying Date attributes fail when sharding (#1775)

      Thanks to @zoltan-fedor for discovering this

    • where filter with 2 Anded Like clauses and a nearText filter causes weaviate to panic. (#1772)

      Thanks to @StefanBogdan for discovering this

    • PATCH (merge) fails on multi-node cluster (#1781)

      Thanks to @zoltan-fedor for discovering this

    • Bug: Chained filter finds results when it shouldn't since v1.8.0 (#1770)

      Thanks to Pranav Pawar for discovering this

    • Bug: (another) potential data race in compaction logic (#1762)

    • [Bug] Limit removes viable results when data has been previously deleted (#1765)

      Thanks to @ywchan2005 for your help in investigating this

  • v1.9.1-rc.0(Dec 13, 2021)

    Breaking Changes

    none

    New Features

    none

    Fixes

    • Allow running "conflicting" modules in the same setup (#1744)

      Prior to this release, Weaviate would not start up if multiple modules would try to provide the same search operators, such as nearText. For example text2vec-contextionary and text2vec-transformers could not run in the same setup. The reason for this was that Explore{} which would search across classes would not be able to handle incompatible vector spaces. This release makes sure that the provided search operator belongs to the configured vectorizer. In turn, cross-class searching across incompatible vector spaces such as using Explore {} will be deactivated if conflicting modules are present.
  • v1.9.0(Dec 10, 2021)

    Breaking Changes

    none

    New Features

    • First Multi-modal module: CLIP Module (#1756, #1766)

      This release introduces the multi2vec-clip module, a module that allows for multi-modal vectorization within a single vector space. A class can have image or text fields or both. Similarly, the module provides both a nearText and a nearImage search and allows for various search combinations, such as text-search on image-only content and various other combinations.

      How to use

      The following is a valid payload for a class that vectorizes both images and text fields:

      {
          "class": "ClipExample",
          "moduleConfig": {
              "multi2vec-clip": {
                  "imageFields": [
                      "image"
                  ],
                  "textFields": [
                      "name"
                  ],
                  "weights": {
                    "textFields": [0.7],
                    "imageFields": [0.3]
                  }
              }
          },
          "vectorIndexType": "hnsw",
          "vectorizer": "multi2vec-clip",
          "properties": [
            {
              "dataType": [
                "string"
              ],
              "name": "name"
            },
            {
              "dataType": [
                  "blob"
              ],
              "name": "image"
            }
          ]
        }
      

      Note that:

      • imageFields and textFields in moduleConfig.multi2vec-clip do not both need to be set. However, at least one of the two must be set.
      • weights in moduleConfig.multi2vec-clip is optional. If only a single property exists, it takes all the weight. If multiple properties exist and no weights are specified, the properties are equal-weighted.
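
The weighting rules in the notes above can be written down as a small function. This is a sketch of the described behavior only, not the module's implementation. `resolve_weights` is a hypothetical helper, and it assumes explicit weights are listed in the same order as the fields.

```python
def resolve_weights(image_fields, text_fields, weights=None):
    """Return a {property name: weight} mapping for a multi2vec-clip class."""
    names = list(image_fields) + list(text_fields)
    if not names:
        raise ValueError("at least one of imageFields or textFields must be set")
    if weights is None:
        # One property takes all the weight; several share it equally.
        return {name: 1.0 / len(names) for name in names}
    resolved = dict(zip(image_fields, weights.get("imageFields", [])))
    resolved.update(zip(text_fields, weights.get("textFields", [])))
    return resolved
```

For the payload shown earlier, this gives 0.3 for `image` and 0.7 for `name`.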

      You can then import data objects for the class as usual. Fill the text or string fields with text and/or fill the blob fields with a base64-encoded image.
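
A minimal sketch of preparing such an object in Python: the image bytes are base64-encoded into the blob field and the text goes into the string field. The payload shape (`class` plus `properties`) and the helper name are assumptions for illustration. The property names match the `ClipExample` class above.

```python
import base64


def build_clip_object(name, image_bytes):
    """Build an object payload for the ClipExample class."""
    return {
        "class": "ClipExample",
        "properties": {
            "name": name,
            # blob fields carry the image as base64-encoded text
            "image": base64.b64encode(image_bytes).decode("ascii"),
        },
    }
```

In practice `image_bytes` would come from reading an image file in binary mode, for example `open(path, "rb").read()`.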

      Limitations

      • As of v1.9.0 the module requires explicit creation of a class. If you rely on auto-schema to create the class for you, it will be missing the required configuration about which fields should be vectorized. This will be addressed in a future release.

    Fixes

    • fix an error where deleting a class with geoCoordinates could lead to a panic due to missing cleanup (#1730)
    • fix an issue where an error in a module would not be forwarded to the user (#1754)
    • fix an issue where a class could not be deleted on some file systems (e.g. AWS EFS) (#1757)
  • v1.8.0(Nov 30, 2021)

    Migration Notice

    Version v1.8.0 introduces multi-shard indices and horizontal scaling. As a result the dataset needs to be migrated. This migration is performed automatically - without user interaction - when first starting up with Weaviate version v1.8.0. However, it cannot be reversed. We, therefore, recommend carefully reading the following migration notes and making a case-by-case decision about the best upgrade path for your needs.

    Why is a data migration necessary?

    Prior to v1.8.0, Weaviate did not support multi-shard indices. The feature was already planned, so data was already contained in a single shard with a fixed name. A migration is necessary to move the data from a single fixed shard into a multi-shard setup. The number of shards is not changed. When you run v1.8.0 on a dataset, the following steps happen automatically:

    • Weaviate discovers the missing sharding configuration for your classes and fills it with the default values
    • When shards start up and do not exist on disk, but a shard with a fixed name from v1.7.x exists, Weaviate automatically recognizes that a migration is necessary and moves the data on disk
    • When Weaviate is up and running the data has been migrated.

    Important Notice: As part of the migration Weaviate will assign the shard to the (only) node available in the cluster. You need to make sure that this node has a stable hostname. If you run on Kubernetes, hostnames are stable (e.g. weaviate-0 for the first node). However with docker-compose hostnames default to the id of the container. If you remove your containers (e.g. docker-compose down) and start them up again, the hostname will have changed. This will lead to errors where Weaviate mentions that it cannot find the node that the shard belongs to. The node sending the error message is the node that owns the shard itself, but it cannot recognize it, since its own name has changed.

    To remedy this, you can set a stable hostname before starting up with v1.8.0 by setting the env var CLUSTER_HOSTNAME=node1. The actual name does not matter, as long as it's stable.

    If you forgot to set a stable hostname and are now running into the error mentioned above, you can still explicitly set the hostname that was used before, which you can derive from the error message.

    Example:

    If you see the error message "shard Knuw6a360eCY: resolve node name \"5b6030dbf9ea\" to host", you can make Weaviate usable again, by setting 5b6030dbf9ea as the host name: CLUSTER_HOSTNAME=5b6030dbf9ea.
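As a small illustration of deriving the hostname from the error message, the following Python sketch extracts the quoted node name. The helper name hostname_from_error is hypothetical, and the regular expression assumes the message has the same shape as the example above; it is not part of Weaviate.

```python
import re
from typing import Optional


def hostname_from_error(message: str) -> Optional[str]:
    """Return the node name quoted in a 'resolve node name' error, if present.

    Accepts both plain quotes and backslash-escaped quotes, since log output
    may show either form.
    """
    match = re.search(r'resolve node name \\?"([^"\\]+)\\?"', message)
    return match.group(1) if match else None


error = 'shard Knuw6a360eCY: resolve node name "5b6030dbf9ea" to host'
name = hostname_from_error(error)
if name is not None:
    # Value to pass to Weaviate as an environment variable.
    print(f"CLUSTER_HOSTNAME={name}")
```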

    Should you upgrade or reimport?

    Please note that besides new features, v1.8.0 also contains a large collection of bugfixes. Some of those bugs also affected how the HNSW index was written to disk. As a result, it cannot be ruled out that the index on disk is of lower quality than a freshly built index in v1.8.0. If you can import using a script or similar, we generally recommend starting with a fresh v1.8.0 setup and reimporting instead of migrating.

    Is downgrading possible after upgrading?

    Note that the data migration which happens at the first startup of v1.8.0 is not automatically reversible. If you plan on downgrading to v1.7.x again after upgrading, you must explicitly create a backup of the state prior to upgrading.

    Changelog

    Breaking Changes

    none, however see migration notice above

    New Features

    • Horizontal Scalability (#1599, #1600, #1601, #1623, #1622, #1653, #1654, #1655, #1658, #1672, #1667, #1679, #1695)

      The big one! Too big for a small release notes page. Instead, we have written extensive documentation on all things around Horizontal Scalability.

      Please see:

    • Improvements for Filtered Vector Search (#1728, #1732)

      See benchmarks here. The Improvements namely consist of three parts:

    • Pagination #1627

      Starting with this release search results can now be paged. This feature is available on:

      • List requests (GET /v1/objects and GraphQL Get { Class { } })
      • Vector Searches (GraphQL near<Media>) and Filter Searches (GraphQL where: {})

      Usage

      To use pagination, one new parameter is introduced (offset) which works in conjunction with the existing limit parameter. For example, to list the first ten results, set limit: 10. Then, to "display the second page of 10", set offset: 10, limit:10 and so on. E.g. to show the 9th page of 10 results, set offset:80, limit:10 to effectively display results 81-90.

      To do so in REST, simply append the two parameters as URL params, e.g. GET /v1/objects?limit=25&offset=75

      To do so in GraphQL, simply add the two parameters to the class, e.g. { Get { MyClassName(limit:25, offset: 75) { ... } } }
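The translation from a page number to offset and limit can be sketched in Python as follows. The helper names (page_params, rest_path, graphql_query) and the queried field are assumptions made for illustration; only the limit and offset parameters and the /v1/objects path come from the notes above.

```python
from urllib.parse import urlencode


def page_params(page: int, page_size: int) -> dict:
    """Translate a 1-based page number into limit/offset values."""
    if page < 1 or page_size < 1:
        raise ValueError("page and page_size must be >= 1")
    return {"limit": page_size, "offset": (page - 1) * page_size}


def rest_path(page: int, page_size: int) -> str:
    """Build the REST path for listing objects on the given page."""
    return "/v1/objects?" + urlencode(page_params(page, page_size))


def graphql_query(class_name: str, page: int, page_size: int,
                  fields: str = "_additional { id }") -> str:
    """Build a GraphQL Get query for the given page (fields is illustrative)."""
    p = page_params(page, page_size)
    return (f"{{ Get {{ {class_name}(limit: {p['limit']}, "
            f"offset: {p['offset']}) {{ {fields} }} }} }}")


print(page_params(9, 10))
print(rest_path(4, 25))
print(graphql_query("MyClassName", 4, 25))
```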

      Performance and Resource Considerations & Limitations

      The pagination implementation is an offset-based implementation, not a cursor-based implementation. This has the following implications:

      • Each further page costs more to retrieve than the previous one. Effectively, when searching for results 91-100, Weaviate internally retrieves 100 search results and discards the first 90 before serving the remainder to the user. This effect is amplified in a multi-shard setup, where each shard retrieves 100 results, which are then aggregated and ultimately cut off. So in a 10-shard setup, asking for results 91-100 means Weaviate effectively has to retrieve 1,000 results (100 per shard) and discard 990 of them before serving. This means high page numbers lead to longer response times and more load on the machine/cluster.
      • Due to the increasing cost of each page outlined above, there is a limit to how many objects can be retrieved using pagination. By default, setting the sum of offset and limit higher than 10,000 leads to an error. If you must retrieve more than 10,000 objects, you can increase this limit by setting the environment variable QUERY_MAXIMUM_RESULTS=<desired-value>. Warning: Setting this to arbitrarily high values can make the memory consumption of a single query explode, and single queries can slow down the entire cluster. We recommend setting this value to the lowest possible value that does not interfere with your users' expectations.
      • The pagination setup is not stateful. If the database state has changed between retrieving two pages, there is no guarantee that your pages cover all results. If no writes happened, pagination can be used to retrieve all possible results within the maximum limit. This means asking for a single page of 10,000 objects leads to the same overall results as asking for 100 pages of 100 results.
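The cost argument above can be made concrete with a rough Python model. This is only a sketch of the arithmetic described in these notes, not Weaviate's implementation; the function name pagination_cost is hypothetical and max_results mirrors the default QUERY_MAXIMUM_RESULTS of 10,000.

```python
def pagination_cost(offset: int, limit: int, shards: int = 1,
                    max_results: int = 10_000) -> dict:
    """Estimate how many results are fetched and discarded for one page.

    Every shard has to produce offset + limit candidates, because the final
    cut-off can only be applied after the per-shard results are merged.
    """
    if offset + limit > max_results:
        raise ValueError(
            f"offset + limit = {offset + limit} exceeds the maximum of {max_results}"
        )
    per_shard = offset + limit
    retrieved = per_shard * shards
    return {
        "per_shard": per_shard,
        "retrieved": retrieved,
        "discarded": retrieved - limit,
    }


# Results 91-100 in a 10-shard setup: 1000 retrieved, 990 discarded.
print(pagination_cost(offset=90, limit=10, shards=10))
```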

    Fixes

    • General Performance Improvements around Memory Allocations (#1620)

      Thanks to @cdpierse for his contributions to this issue
    • Fix behavior that could lead to a crashloop after an unexpected shutdown or crash:

      • Crashloops after unexpected shutdowns #1697 #1698 #1703
      • HNSW integrity compromised after restarts #1701 #1705
      • Improve ingesting WAL at crash recovery startup #1713
      • Fix an issue where parsing the WAL would lead to the creation of another WAL, thus increasing the effort for recovery if it were to fail again. #1716
      • Fix an issue where a failure during memtable flush may have led to an unparsable disk segment #1725
      • Ignore zero-length disk segments files. Previously they could block startup. #1726
    • Fix panic on filters #1750

      Fixes an issue where invalid combinations of prop types and filter types could lead to panics
    • Other fixes

      • Filter by ID (introduced in 1.7.2) #1708
      • Use Feature Projection in text2vec-transformers module #1572
  • v1.8.0-rc.3(Nov 5, 2021)

    New features compared to the previous rc version:

    • Pagination #1627

      Starting with this release search results can now be paged. This feature is available on:

      • List requests (GET /v1/objects and GraphQL Get { Class { } })
      • Vector Searches (GraphQL near<Media>) and Filter Searches (GraphQL where: {})

      Usage

      To use pagination, one new parameter is introduced (offset) which works in conjunction with the existing limit parameter. For example, to list the first ten results, set limit: 10. Then, to "display the second page of 10", set offset: 10, limit:10 and so on. E.g. to show the 9th page of 10 results, set offset:80, limit:10 to effectively display results 81-90.

      To do so in REST, simply append the two parameters as URL params, e.g. GET /v1/objects?limit=25&offset=75

      To do so in GraphQL, simply add the two parameters to the class, e.g. { Get { MyClassName(limit:25, offset: 75) { ... } } }

      Performance and Resource Considerations & Limitations

      The pagination implementation is an offset-based implementation, not a cursor-based implementation. This has the following implications:

      • Each further page costs more to retrieve than the previous one. Effectively, when searching for results 91-100, Weaviate internally retrieves 100 search results and discards the first 90 before serving the remainder to the user. This effect is amplified in a multi-shard setup, where each shard retrieves 100 results, which are then aggregated and ultimately cut off. So in a 10-shard setup, asking for results 91-100 means Weaviate effectively has to retrieve 1,000 results (100 per shard) and discard 990 of them before serving. This means high page numbers lead to longer response times and more load on the machine/cluster.
      • Due to the increasing cost of each page outlined above, there is a limit to how many objects can be retrieved using pagination. By default, setting the sum of offset and limit higher than 10,000 leads to an error. If you must retrieve more than 10,000 objects, you can increase this limit by setting the environment variable QUERY_MAXIMUM_RESULTS=<desired-value>. Warning: Setting this to arbitrarily high values can make the memory consumption of a single query explode, and single queries can slow down the entire cluster. We recommend setting this value to the lowest possible value that does not interfere with your users' expectations.
      • The pagination setup is not stateful. If the database state has changed between retrieving two pages, there is no guarantee that your pages cover all results. If no writes happened, pagination can be used to retrieve all possible results within the maximum limit. This means asking for a single page of 10,000 objects leads to the same overall results as asking for 100 pages of 100 results.
    • Filtered Vector Search Flat Search Cutoff #1728 #1729

      As outlined in this article, you can now configure a switch to a flat search when a filtered HNSW search would become too expensive due to the restrictiveness of the filter.

  • v1.8.0-rc.2(Oct 18, 2021)

    This pre-release adds more fixes with regards to crash recovery:

    • #1713 Improve ingesting WAL at crash recovery startup
    • #1716 Fix an issue where parsing the WAL would lead to the creation of another WAL, thus increasing the effort for recovery if it were to fail again.
    • #1725 Fix an issue where a failure during memtable flush may have led to an unparsable disk segment
    • #1726 Ignore zero-length disk segments files. Previously they could block startup.
  • v1.8.0-rc.1(Oct 13, 2021)

    In addition to the previous RC-release this release fixes the following issues or adds the following functionality:

    • Crashloops after unexpected shutdowns #1697 #1698 #1703
    • HNSW integrity compromised after restarts #1701 #1705
    • Filter by ID (introduced in 1.7.2) #1708
    • Use Feature Projection in text2vec-transformers module #1572

    More detailed release notes will follow for the GA release

  • v1.8.0-rc.0(Sep 30, 2021)

    This pre-release introduces the ability to scale Weaviate horizontally and shard classes across multiple nodes in a cluster. Detailed Release Notes will follow at a later point.

    A newly created class will default its shard count to the size of the cluster.

  • v1.7.2(Sep 28, 2021)


    Docker image/tag: semitechnologies/weaviate:1.7.2

    Breaking Changes

    none

    New Features

    none

    Fixes

    • Make property name validations less strict (#1562)

      The property name rules were unnecessarily strict for historic reasons which no longer apply. Valid property names now include any GraphQL-valid characters. (Note that creating a property also makes it queryable via GraphQL, so it needs to be GraphQL-compliant.) The validation rule is /[_A-Za-z][_0-9A-Za-z]*/. In addition, the following names are restricted: id, _id, _additional, meta. Properties starting with an underscore (_) are allowed, but there is no guarantee that they will not be forbidden in the future as new internal properties are introduced. New internal properties will always start with an underscore, so if you prefix all your properties with a specific sequence (for example wp_ for a WordPress-Plugin), you will avoid conflicts with future internal properties.
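For client-side checks before creating a property, the rule above can be mirrored in a few lines of Python. This is a sketch based only on the pattern and restricted names listed here; the function name is hypothetical and the server remains the authority on what is accepted.

```python
import re

# Pattern and restricted names as stated in the release note.
PROPERTY_NAME = re.compile(r"[_A-Za-z][_0-9A-Za-z]*")
RESTRICTED_NAMES = {"id", "_id", "_additional", "meta"}


def is_valid_property_name(name: str) -> bool:
    """Return True if the whole name matches the pattern and is not restricted."""
    return PROPERTY_NAME.fullmatch(name) is not None and name not in RESTRICTED_NAMES


for candidate in ["wp_title", "_custom", "9lives", "my-prop", "id"]:
    print(candidate, is_valid_property_name(candidate))
```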

    • Issue aggregating some array properties (#1686)

      This fixes an issue where aggregating over some array data types would lead to an error

    • Add missing array data types (#1691)

      This fix provides the previously missing array data types boolean[] and date[].

  • v1.7.1(Sep 17, 2021)

    Docker image/tag: semitechnologies/weaviate:1.7.1

    Breaking Changes

    none

    New Features

    none

    Fixes

    • Fixes an issue where the text2vec-contextionary would consider a schema invalid that only consists of array types if the class name is not being vectorized. (#1673)
  • v1.7.0(Sep 1, 2021)

    Features

    • Array Datatypes (#1611)

      Starting with this release, primitive object properties are no longer limited to individual values, but can also include lists of primitives. Array types can be stored, filtered and aggregated in the same way as other primitives.

      Auto-schema will automatically recognize lists of string/text and number/int. You can also explicitly specify lists in the schema by using the following data types string[], text[], int[], number[]. A type that is assigned to be an array, must always stay an array, even if it only contains a single element.
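A minimal Python sketch of what this looks like on the client side: an explicit class definition using array data types, and a helper that keeps array properties as lists even for a single element. The class name Article, the property names, and the helper as_array are made up for illustration.

```python
import json

# Hypothetical class definition using the array data types listed above.
article_class = {
    "class": "Article",
    "properties": [
        {"name": "tags", "dataType": ["string[]"]},
        {"name": "ratings", "dataType": ["int[]"]},
    ],
}


def as_array(value):
    """Wrap a single value in a list; lists and tuples are returned as lists."""
    return list(value) if isinstance(value, (list, tuple)) else [value]


# A single tag is still sent as a one-element array.
obj = {"tags": as_array("news"), "ratings": as_array([4, 5])}
print(json.dumps(obj))
```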

    • New Module: text-spellcheck - Check and auto-correct misspelled search terms (#1606)

      Use the new spellchecker module to verify user-provided search queries (in existing nearText or ask functions) are spelled correctly and even suggest alternative, correct spellings. Spell-checking happens at query time.

      There are two ways to use this module:

      1. It provides a new _additional property which can be used to check (but not alter) the provided queries. The following query:
       {
         Get {
           Post(nearText:{
             concepts: "missspelled text"
           }) {
             content
             _additional{
               spellCheck{
                 changes{
                   corrected
                   original
                 }
                 didYouMean
                 location
                 originalText
               }
             }
           }
         }
       }
      

      will produce results, similar to the following:

         "_additional": {
           "spellCheck": [
             {
               "changes": [
                 {
                   "corrected": "misspelled",
                   "original": "missspelled"
                 }
               ],
               "didYouMean": "misspelled text",
               "location": "nearText.concepts[0]",
               "originalText": "missspelled text"
             }
           ]
         },
         "content": "..."
       },
      
      2. It extends existing text2vec-modules with an autoCorrect flag, which can be used to correct a misspelled query in the background.
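On the client side, the spellCheck result from the first option can be turned into "did you mean" suggestions. The Python sketch below works on a dict shaped like the sample response above; the function name suggested_queries is hypothetical.

```python
def suggested_queries(additional: dict) -> dict:
    """Map each checked location to (original text, suggestion).

    Only entries that actually contain changes are reported.
    """
    suggestions = {}
    for entry in additional.get("spellCheck", []):
        if entry.get("changes"):
            suggestions[entry["location"]] = (
                entry["originalText"],
                entry["didYouMean"],
            )
    return suggestions


# Same structure as the sample response above.
sample_additional = {
    "spellCheck": [
        {
            "changes": [{"corrected": "misspelled", "original": "missspelled"}],
            "didYouMean": "misspelled text",
            "location": "nearText.concepts[0]",
            "originalText": "missspelled text",
        }
    ]
}

for location, (original, suggestion) in suggested_queries(sample_additional).items():
    print(f"{location}: did you mean '{suggestion}' instead of '{original}'?")
```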
    • New Module ner-transformers - Extract entities from Weaviate using transformers (#1632)

      Use transformer-based models to extract entities from your existing Weaviate objects on the fly. Entity Extraction happens at query time. Note that for maximum performance, transformer-based models should run with GPUs. CPUs can be used, but the throughput will be lower.

      To make use of the module's capabilities, simply extend your query with the following new _additional property:

      {
        Get {
          Post {
            content
            _additional {
              tokens(
                properties: ["content"],    # is required
                limit: 10,                  # optional, int
                certainty: 0.8              # optional, float
              ) {
                certainty
                endPosition
                entity
                property
                startPosition
                word
              }
            }
          }
        }
      }
      
      

      It will return results similar to the following:

       "_additional": {
         "tokens": [
           {
             "property": "content",
             "entity": "PER",
             "certainty": 0.9894614815711975,
             "word": "Sarah",
             "startPosition": 11,
             "endPosition": 16
           },
           {
             "property": "content",
             "entity": "LOC",
             "certainty": 0.7529033422470093,
             "word": "London",
             "startPosition": 31,
             "endPosition": 37
           }
         ]
       }
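The startPosition and endPosition values can be used to slice the original property text. The Python sketch below groups the extracted words by entity type. The content string is an assumed example chosen so that the positions in the sample response line up; the helper name entities_by_type is hypothetical.

```python
from collections import defaultdict

# Assumed property value; positions 11-16 and 31-37 match the sample tokens.
content = "My name is Sarah and I live in London"

tokens = [
    {"property": "content", "entity": "PER", "certainty": 0.9894614815711975,
     "word": "Sarah", "startPosition": 11, "endPosition": 16},
    {"property": "content", "entity": "LOC", "certainty": 0.7529033422470093,
     "word": "London", "startPosition": 31, "endPosition": 37},
]


def entities_by_type(text: str, tokens: list, min_certainty: float = 0.0) -> dict:
    """Group extracted spans by entity label, dropping low-certainty tokens."""
    grouped = defaultdict(list)
    for token in tokens:
        if token["certainty"] < min_certainty:
            continue
        span = text[token["startPosition"]:token["endPosition"]]
        grouped[token["entity"]].append(span)
    return dict(grouped)


print(entities_by_type(content, tokens))
print(entities_by_type(content, tokens, min_certainty=0.8))
```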
      

    Fixes

    • Aggregation can get stuck when aggregating number datatypes (#1660)
  • v1.6.0(Aug 11, 2021)

    Docker image/tag: semitechnologies/weaviate:1.6.0

    Breaking Changes

    none

    New Features

    • Zero Shot Classification (#1603)

      This release adds a new classification type zeroshot that works with any vectorizer or custom vectors. It picks the label objects that have the lowest distance to the source objects. The link is made using cross-references, similar to existing classifications in Weaviate.

      To start a zeroshot classification, use "type": "zeroshot" in your POST /v1/classifications request and specify the properties you want classified as usual using "classifyProperties": [...].

      As zero shot involves no training data, you cannot set trainingSetWhere filters, but can filter both source ("sourceWhere") and label objects ("targetWhere") directly.
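The core idea, picking the label with the lowest distance to the source object, can be sketched in a few lines of Python. This is a conceptual illustration only: it assumes cosine distance and uses toy two-dimensional vectors, and none of the names below belong to Weaviate's API.

```python
import math


def cosine_distance(a, b):
    """1 - cosine similarity of two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def zero_shot_label(source_vector, label_vectors):
    """Return the name of the label whose vector is closest to the source."""
    return min(
        label_vectors,
        key=lambda name: cosine_distance(source_vector, label_vectors[name]),
    )


# Toy label objects; in practice the vectors come from a vectorizer
# or are supplied as custom vectors.
labels = {"sports": [1.0, 0.0], "politics": [0.0, 1.0]}
print(zero_shot_label([0.9, 0.2], labels))
```

No training set is involved: the only inputs are the source vector and the label vectors, which is why a trainingSetWhere filter has no meaning here.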

    Fixes

    • Fix nil-pointer panic on updated/deleted HNSW entrypoint (#1650)

      In concurrent update/import scenarios such as during classifications, this bug could lead to a nil pointer dereference panic. This release fixes this issue.
  • v1.5.2(Aug 10, 2021)

    Docker image/tag: semitechnologies/weaviate:1.5.2

    Breaking Changes

    none

    New Features

    • Fix possible data races (short write) (#1643)

      This release fixes various possible data races that could in the worst case lead to an unrecoverable error "short write". The possibility for those races was introduced in v1.5.0 and we highly recommend anyone running on the v1.5.x timeline to upgrade to v1.5.2 immediately.

    Fixes

    none

  • v1.5.1(Jul 29, 2021)


    Docker image/tag: semitechnologies/weaviate:1.5.1

    Breaking Changes

    none

    New Features

    none

    Fixes

    • Crashloop after unexpected crash in HNSW commit log (#1635)

      If Weaviate was killed (e.g. OOMKill) while writing the commit log, the log could no longer be parsed after the next restart, thus ending up in a crashloop. This fix resolves the issue. Note that no data will be lost on such a crash: The partially written commit log entry has not yet been acknowledged to the user, so no write guarantees have been given yet. It is therefore safe to discard.

    • Chained Like operator not working (#1638)

      Prior to this fix, when chaining Like operators in where filters where each valueString or valueText contained a wildcard (*), typically only the first operator's results were reflected. This fix makes sure that the chaining (And or Or) is reflected correctly. This bug did not affect other operators (e.g. Equal, GreaterThan, etc.) and only affected those Like queries where a wildcard was used.
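To illustrate what a correctly chained filter should return, here is a small local evaluator in Python for And/Or combinations of Like operands. It is a simplified model for reasoning about expected results, not Weaviate's filter engine: it matches the whole property value, handles only the * wildcard, is case-sensitive, and uses made-up property names.

```python
import re


def like_to_regex(pattern: str):
    """Compile a Like pattern where '*' matches zero or more characters."""
    parts = [".*" if ch == "*" else re.escape(ch) for ch in pattern]
    return re.compile("".join(parts), re.DOTALL)


def matches(where: dict, obj: dict) -> bool:
    """Evaluate a where filter made of And, Or and Like against one object."""
    operator = where["operator"]
    if operator == "And":
        return all(matches(operand, obj) for operand in where["operands"])
    if operator == "Or":
        return any(matches(operand, obj) for operand in where["operands"])
    if operator == "Like":
        value = obj[where["path"][0]]
        pattern = where.get("valueString", where.get("valueText"))
        return like_to_regex(pattern).fullmatch(value) is not None
    raise ValueError(f"unsupported operator: {operator}")


operands = [
    {"path": ["name"], "operator": "Like", "valueString": "wea*"},
    {"path": ["description"], "operator": "Like", "valueText": "*search*"},
]
obj = {"name": "weaviate", "description": "a database"}

# Both operands must be reflected: And fails, Or succeeds.
print(matches({"operator": "And", "operands": operands}, obj))
print(matches({"operator": "Or", "operands": operands}, obj))
```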

    • Fix potential data race in Auto Schema features (#1636)

      This fix corrects faulty synchronization in the auto schema feature which in extreme cases could lead to a data race.

  • v1.5.0(Jul 13, 2021)

    Docker image/tag: semitechnologies/weaviate:1.5.0

    Breaking Changes

    none

    WARNING: This release does not contain any API-level breaking changes, however, it changes the entire storage mechanism inside Weaviate. As a result, an in-place update is not possible. When upgrading from previous versions, a new setup needs to be created and all data reimported. Prior backups are not compatible with this version.

    New Features

    • LSM-Tree based Storage (#1523, #1569, #1570)

      Previous releases of Weaviate used a B+Tree based storage mechanism. This was not fast enough to keep up with the high write speed requirements of a large-scale import. This release completely rewrites the storage layer of Weaviate to use a custom LSM-tree approach. This leads to considerably faster import times, often more than 100% faster than the previous version.

    • Auto-Schema Feature (#1539)

      Import data objects without creating a schema prior to import. The classes will be created automatically; they can still be adjusted manually. Weaviate will guess the property type based on the first time it sees a property. The defaults can be configured using the environment variables outlined in #1539. The feature is on by default, but entirely non-breaking. You can still create an explicit schema at will.

    Fixes

    • Improve Aggregation Queries (#1616)

      Reduces the number of allocations required for some aggregation queries, speeding them up and reducing the number of timeouts encountered during aggregations.
  • v1.5.0-rc.3(Jul 8, 2021)

    Changes to the previous pre-release

    • fixes an issue where weaviate could crash with a segfault if an Aggregation was started in parallel with an LSM tree compaction, e.g. during periods of importing (#1617)
    • makes aggregations more allocation-efficient, thus reducing the time it takes and need for GC on larger datasets (#1616)
  • v1.5.0-rc.2(Jul 6, 2021)

    Changes to the previous release candidate:

    • Fix WAL error if it ends abruptly (#1612)
    • Don't panic when batch size approaches HNSW index growth interval (#1612)
    • Auto-schema feature (#1539)
  • v1.5.0-rc.1(Jun 29, 2021)

  • v1.5.0-rc.0(Jun 23, 2021)

    A pre-release containing the upcoming LSM Store Change: Object and Index storage is no longer done using a B+Tree approach (bolt/bbolt), but uses a custom LSM tree approach. This speeds up imports by over 100% depending on the use case.

    A reimport is required, there is no upgrade path from 1.4.x other than creating a new setup and reimporting.

  • v1.4.1(Jun 15, 2021)

    Docker image/tag: semitechnologies/weaviate:1.4.1

    Breaking Changes

    none

    New Features

    none

    Fixes

    • Fix issue where some files weren't stored in the correct location (#1596)

      This fixes an issue where the files containing the schema and classification state were not written in the correct location that is configured as PERSISTENCE_DATA_PATH=. This deviation could lead to issues restoring from backups.

      If you had any workarounds in place to deal with the broken paths, this release might break those workarounds as they are no longer required.

  • v1.4.0(Jun 9, 2021)

    Breaking Changes

    none

    New Features

    • New Image Module img2vec-neural (#1525)

      A new vectorizer module which allows vectorizing images using Neural Networks. At the time of the release resnet50 is officially supported, but other models will be added at a later point.

      This module makes use of the new blob data type (see below).

      When adding a class with vectorizer type img2vec-neural the configuration must contain information about which field holds the image. This can be achieved with the following config:

        "moduleConfig": {
          "img2vec-neural": {
            "imageFields": [
              "image"
            ]
          }
        }
      

      If multiple fields are specified, the module will vectorize them separately and use the mean vector of the individual vectors.
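Taking the mean vector of several per-field vectors amounts to an element-wise average. A minimal Python sketch, with a hypothetical function name and toy input vectors:

```python
def mean_vector(vectors):
    """Element-wise mean of equally sized vectors."""
    if not vectors:
        raise ValueError("at least one vector is required")
    dimensions = len(vectors[0])
    if any(len(v) != dimensions for v in vectors):
        raise ValueError("all vectors must have the same number of dimensions")
    return [sum(column) / len(vectors) for column in zip(*vectors)]


# Two image fields, each vectorized separately (toy 2-d vectors).
print(mean_vector([[1.0, 2.0], [3.0, 4.0]]))
```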

      When adding data, make sure that the specified fields are filled with valid image data (e.g. jpg, png, etc.). The blob type itself (see below) requires that all blobs are base64 encoded. To obtain the base64-encoded value of an image, you can run the following command or use the helper methods in the Weaviate clients:

      cat my_image.png | base64
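If you prepare objects in a script, the same encoding is available in the Python standard library. The sketch below uses hypothetical helper names; unlike some base64 command-line tools, b64encode does not insert line breaks, so the result is a single string.

```python
import base64
from pathlib import Path


def encode_image_bytes(data: bytes) -> str:
    """Return the base64 representation of raw image bytes as a str."""
    return base64.b64encode(data).decode("ascii")


def encode_image_file(path: str) -> str:
    """Read an image file and return its base64 representation."""
    return encode_image_bytes(Path(path).read_bytes())


# The first bytes of a JPEG file encode to the familiar "/9j/" prefix.
jpeg_header = b"\xff\xd8\xff\xe0"
print(encode_image_bytes(jpeg_header))
```

The returned string can be placed directly in the blob property of the object, for example {"image": encode_image_file("my_image.png")}.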
      

      At search time you can use the "standard" vector-search options, such as nearVector and nearObject. But in addition, you can also vectorize a new image at search time and search by using the image's vector. To do so, you can use the nearImage search option like so:

      {
        Get {
          MyImage(nearImage: {
            image: "/9j/4AAQSkZJRgABAgE..."
            certainty: 0.7
          }) {
            image
          }
        }
      }
      

    Note that at the moment, both a TensorFlow/Keras and a PyTorch-based implementation exist for the model inference containers. Each has distinct advantages over the other:

    • resnet50 (pytorch)

      • ✅ supports both amd64 and arm64
      • ✅ supports cuda
      • ❌ does not support multi-threaded inference
    • resnet50 (keras)

      • ⚠️ supports amd64, but not arm64
      • ❌ does not support cuda at this time
      • ✅ supports multi-threaded inference
    • Add Hardware acceleration for amd64 CPUs (Intel, AMD) (#1559, #1577)

      If Weaviate is running on an AVX2-compatible CPU, AVX2 instructions are used to calculate the dot product more efficiently. If AVX2 is not present or Weaviate is running on other architectures, the native Go implementation is used. This speeds up both vector imports and queries.

    • Support arm64 technology for entire Weaviate stack

      Starting with this release, the entire Weaviate stack can run natively on arm64 hardware, such as Apple M1-based Macs. This eliminates the need for emulation and ensures the performance is similar to amd64-based hardware. Currently the following components are supported: Weaviate Core, text2vec-contextionary, text2vec-transformers, qna-transformers, img2vec-neural (only pytorch-based models). All latest Docker images have been pushed as Multi-Architecture images, so no changes are required. A standard docker-compose up -d will automatically use the arm64-based images on an arm64-based machine, such as M1 Macs.

    • Set ef at search time (#1542, #1564)

      You can now explicitly configure the search time ef parameter for the HNSW vector index. The higher ef is chosen, the more accurate, but also slower a search becomes. This helps in the recall/performance trade-off that is possible with HNSW.

      If you omit setting this field it will default to -1 which means "Let Weaviate pick the right ef value". This is the same behavior that was present prior to this release. In this case Weaviate will pick ef according to this formula.

      To overwrite the default ef value, you can set it (together with efConstruction and maxConnections) in the vectorIndexConfig, for example:

      "vectorIndexConfig": {
        "skip": false,
        "ef": 100,
        "efConstruction": 128,
        "maxConnections": 64
      }
      

      Note that while many parts of a schema class are immutable by design, updating ef (only search-time ef - not efConstruction) is explicitly supported. (#1564)

    • Introduce new dataType blob (#1566, #1567)

      A new primitive data type blob was introduced. Any binary data can be stored. The data must be base64 encoded for safe transfer via REST.

      Note that blob data types are never indexed in the inverted index. This means using blobs in whereFilters is not possible and there is no valueBlob field accordingly.

    • Skip vector-indexing a class (#1580)

      There are situations where it doesn't make sense to vectorize a class. For example if the class is just meant as glue between two other classes (consisting only of references) or if the class contains mostly duplicate elements (Note that importing duplicate vectors into HNSW is very expensive as the algorithm uses a check whether a candidate's distance is higher than the worst candidate's distance for an early exit condition. With (mostly) identical vectors, this early exit condition is never met, leading to an exhaustive search on each import or query).

      In this case, you can now skip indexing a vector altogether. To do so, set the following options on a schema class:

      "vectorIndexConfig": {
        "skip": true
      }
      

      The newly introduced "skip" field defaults to false, so this is a backward-compatible change. If not set, classes will be indexed normally.

      Note that the creation of a vector through a module is decoupled from storing the vector in Weaviate. So, simply skipping the indexing does not skip the generation of a vector if a vectorizer other than none is configured on the class (for example through a global default). It is therefore recommended to always set: "vectorizer": "none" explicitly when skipping the vector indexing.

      If vector indexing is skipped, but a vectorizer is configured (or a vector is provided manually) a warning is logged on each import.
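The recommendation above can be checked before a class is created. The following Python sketch is a hypothetical client-side helper, not part of Weaviate; the default_vectorizer argument stands in for whatever global default your setup uses, and the class names are made up.

```python
def skip_warnings(class_def: dict, default_vectorizer: str = "none") -> list:
    """Warn when vector indexing is skipped but a vectorizer would still run."""
    skip = class_def.get("vectorIndexConfig", {}).get("skip", False)
    vectorizer = class_def.get("vectorizer", default_vectorizer)
    if skip and vectorizer != "none":
        return [
            f"class {class_def.get('class')!r} skips vector indexing but still "
            f"uses vectorizer {vectorizer!r}; set \"vectorizer\": \"none\""
        ]
    return []


glue_class = {
    "class": "AuthorOfArticle",  # hypothetical class consisting only of references
    "vectorizer": "none",
    "vectorIndexConfig": {"skip": True},
}
print(skip_warnings(glue_class))
```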

    Fixes

    • Various Performance Fixes around the HNSW Vector Index (#1559)

      Various fixes and changes were introduced in the HNSW vector index to improve the query and import times related to the vector index. Most notably:

      • Switch Binary-Tree based Priority-Queue to Custom Heap-Based Priority Queue optimized for fewer allocations
      • Use improved neighbor connection heuristic ("Heuristic 2" from HNSW paper)
      • Introduce reusable visited list pools (inspired by hnswlib C++ implementation)
      • Reduce overhead of the Vector Cache, use multi-sharded locking (The vector cache keeps vectors in memory if memory is available and only reads them from disk when it's full. The size is configurable; it defaults to 2M vectors, which for 768-dimensional float32 vectors corresponds to ~6GB)
      • Cosine-Distance calculations are now actually the dot-product of normalized vectors. This is the same thing, but we only have to normalize them once and each subsequent calculation (of which there are plenty in HNSW indexing) is slightly cheaper.
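The equivalence behind the last point can be checked with a short Python sketch: normalizing each vector once and then taking plain dot products gives the same cosine distance as computing the norms on every comparison. The function names are illustrative and this is not Weaviate's Go implementation.

```python
import math


def dot(a, b):
    """Plain dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def normalize(v):
    """Scale a vector to unit length (done once, e.g. at import time)."""
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]


def cosine_distance(a, b):
    """Reference version that recomputes both norms on every call."""
    return 1.0 - dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


a, b = [3.0, 4.0], [4.0, 3.0]
a_unit, b_unit = normalize(a), normalize(b)

# With pre-normalized vectors, each comparison is a single dot product.
fast = 1.0 - dot(a_unit, b_unit)
print(math.isclose(fast, cosine_distance(a, b)))
```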
    • Replace satori/uuid with google/uuid (#1571)

      A CVE was reported against the original repo, which is now unmaintained. It was thus replaced with a maintained one. Thanks to external contributor @simplechris for this fix.

    • Make property order consistent when vectorizing (#1576)

      When vectorizing text with models where the order of sentences is important, e.g. transformers, the resulting vector was not deterministic prior to this release. This is because the order of iterating over the text/string properties was not guaranteed. With this fix, properties are now always fed to the vectorizer in alphabetical order, meaning two identical objects will receive identical vectors - also with the text2vec-transformers module.

      Thanks to first-time contributor @StefanBogdan for this fix.
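
The underlying issue is easy to reproduce in Go, where the iteration order of a map is not specified and varies between runs. The sketch below shows one way to make the input to a vectorizer deterministic by sorting property names first. The function name CorpusFromProperties and the sample properties are hypothetical and not taken from the Weaviate code base.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// CorpusFromProperties joins the values of text properties in alphabetical
// order of their property names, so identical objects always produce the
// same input string regardless of map iteration order.
func CorpusFromProperties(props map[string]string) string {
	names := make([]string, 0, len(props))
	for name := range props {
		names = append(names, name)
	}
	sort.Strings(names)

	parts := make([]string, 0, len(names))
	for _, name := range names {
		parts = append(parts, props[name])
	}
	return strings.Join(parts, " ")
}

func main() {
	props := map[string]string{
		"title":   "Vector search",
		"summary": "A short overview.",
		"body":    "HNSW builds a layered graph.",
	}
	// Order is body, summary, title on every run.
	fmt.Println(CorpusFromProperties(props))
	// prints "HNSW builds a layered graph. A short overview. Vector search"
}
```

Iterating over the map directly would feed the sentences in a different order from run to run, which matters for order-sensitive models such as transformers.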

    • Fix issues around PATCH API when using custom vectors (#1591)

      The PATCH API (merge, partial update) assumed that a vectorizer would always be present, which could lead to errors when user-specified vectors were used. This release fixes this behavior and makes sure that all cases which work with vectorizers also work with custom vectors. Note that the PUT and POST APIs were not affected by this bug.

    • Detect schema settings that will most likely lead to duplicate vectors and print warning (#1582)

      Because (mostly) identical vectors can slow down HNSW indexing (see above), a warning is printed if a vectorizer config will most likely result in duplicate vectors. For example, if only the class name is vectorized but no properties are, the vectors are certain to be identical.

    • Fix missing schema validation on transformers module (#1583)

      Prior to this release, the module-specific schema-validation (e.g. "Are there any fields which a vector can be derived from?") was missing for the text2vec-transformers module. This fix adds the validation rules.

Owner
SeMI Technologies
SeMI Technologies creates database software like the Weaviate vector search engine
Real-time Map displays real-time positions of public transport vehicles in Helsinki

Real-time Map Real-time Map displays real-time positions of public transport vehicles in Helsinki. It's a showcase for Proto.Actor - an ultra-fast dis

ASYNKRON 27 Jun 8, 2022
Vald. A Highly Scalable Distributed Vector Search Engine

Vald is a highly scalable distributed fast approximate nearest neighbor dense vector search engine.

Vector Data as a Service 946 Jun 28, 2022
An open source embedding vector similarity search engine powered by Faiss, NMSLIB and Annoy

Click to take a quick look at our demos! Image search Chatbots Chemical structure search Milvus is an open-source vector database built to power AI ap

The Milvus Project 11.1k Jun 26, 2022
Substation is a cloud native toolkit for building modular ingest, transform, and load (ITL) data pipelines

Substation Substation is a cloud native data pipeline toolkit. What is Substation? Substation is a modular ingest, transform, load (ITL) application f

Brex 19 May 29, 2022
An open-source, distributed, cloud-native CD (Continuous Delivery) product designed for developers

Developer-oriented Continuous Delivery Product English | Simplified Chinese Table of Contents Zadig Table of Contents What is Zadig Quick start How to use? How to

null 0 Oct 19, 2021
A LoRaWAN nodes' and network simulator that works with a real LoRaWAN environment (such as Chirpstack) and equipped with a web interface for real-time interaction.

LWN Simulator A LoRaWAN nodes' simulator to simulate a LoRaWAN Network. Table of Contents General Info Requirements Installation General Info LWN Simu

ARSLab 24 May 17, 2022
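Provides an API for cloud services such as Aliyun, AWS, Google Cloud, Tencent Cloud, Huawei Cloud, and so on

cloud-fitter (cloud adapter) Communicate with public and private clouds conveniently through a single set of APIs. Integration plan: in internal preparation, to be opened later; feel free to get in touch if you have requirements. Developer community Developer community documentation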

null 23 May 8, 2022
A GPU-powered real-time analytics storage and query engine.

AresDB AresDB is a GPU-powered real-time analytics storage and query engine. It features low query latency, high data freshness and highly efficient i

Uber Open Source 2.9k Jun 29, 2022
A real-time `VWAP` (volume-weighted average price) calculation engine

VWAP Overview The goal of this project is to create a real-time VWAP (volume-weighted average price) calculation engine. For this was used the coinbas

Sayi Polia 0 Feb 11, 2022
Fast specialized time-series database for IoT, real-time internet connected devices and AI analytics.

unitdb Unitdb is blazing fast specialized time-series database for microservices, IoT, and realtime internet connected devices. As Unitdb satisfy the

Saffat Technologies 91 Jun 9, 2022
Phalanx is a cloud-native full-text search and indexing server written in Go built on top of Bluge that provides endpoints through gRPC and traditional RESTful API.

Phalanx Phalanx is a cloud-native full-text search and indexing server written in Go built on top of Bluge that provides endpoints through gRPC and tr

Minoru Osuka 232 Jun 30, 2022
MatrixOne is a planet scale, cloud-edge native big data engine crafted for heterogeneous workloads.

What is MatrixOne? MatrixOne is a planet scale, cloud-edge native big data engine crafted for heterogeneous workloads. It provides an end-to-end data

Matrix Origin 996 Jun 29, 2022
CudeX: a cloud native intelligent operation and maintenance engine that provides service measurement, index quantification

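Introduction CudgX is an AIOps intelligent operations and maintenance engine for the cloud-native era, released by Galaxy-Future. Through multi-dimensional, high-volume data collection and machine-learning training and analysis across all kinds of services, it turns each service into metrics and digit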

Galaxy-Future 63 Jun 16, 2022
RadonDB is an open source, cloud-native MySQL database for building global, scalable cloud services

OverView RadonDB is an open source, Cloud-native MySQL database for unlimited scalability and performance. What is RadonDB? RadonDB is a cloud-native

RadonDB 1.7k Jun 28, 2022
Cloudpods is a cloud-native open source unified multi/hybrid-cloud platform developed with Golang

Cloudpods is a cloud-native open source unified multi/hybrid-cloud platform developed with Golang, i.e. Cloudpods is a cloud on clouds. Cloudpods is able to manage not only on-premise KVM/baremetals, but also resources from many cloud accounts across many cloud providers. It hides the differences between the underlying cloud providers and exposes one set of APIs that allows programmatic interaction with these many clouds.

null 1 Jan 11, 2022
Time-Based One-Time Password (TOTP) and HMAC-Based One-Time Password (HOTP) library for Go.

otpgo HMAC-Based and Time-Based One-Time Password (HOTP and TOTP) library for Go. Implements RFC 4226 and RFC 6238. Contents Supported Operations Read

Jose Torres 35 Jun 23, 2022
Time struct in Go that uses 4 bytes of memory vs the 24 bytes of time.Time

A tiny time object in Go. Tinytime uses 4 bytes of memory vs the 24 bytes of a standard time.Time{}

Lane Wagner 57 May 2, 2022
Publish Your GIS Data(Vector Data) to PostGIS and Geoserver

GISManager Publish Your GIS Data(Vector Data) to PostGIS and Geoserver How to install: go get -v github.com/hishamkaram/gismanager Usage: testdata fol

Hisham waleed karam 45 Jun 16, 2022
Rasterx is an SVG 2.0 path compliant rasterizer that can use either the golang vector or a derivative of the freetype anti-aliaser.

rasterx Rasterx is a golang rasterizer that implements path stroking functions capable of SVG 2.0 compliant 'arc' joins and explicit loop closing. Pat

Steven R Wiley 97 Jun 11, 2022
Cairo in Go: vector to SVG, PDF, EPS, raster, HTML Canvas, etc.

Canvas is a common vector drawing target that can output SVG, PDF, EPS, raster images (PNG, JPG, GIF, ...), HTML Canvas through WASM, and OpenGL. It h

Taco de Wolff 986 Jun 22, 2022
Static bit vector structures in Go

teivah/bitvector Overview A bit vector is an array data structure that compactly stores bits. This library is based on 5 static different data structu

Teiva Harsanyi 72 Jan 24, 2022
GoVector is a vector clock logging library written in Go.

GoVector is a vector clock logging library written in Go. The vector clock algorithm is used to order events in distributed systems in the absence of a centralized clock. GoVector implements the vector clock algorithm and provides feature-rich logging and encoding infrastructure.

Distributed clocks 163 Jun 18, 2022
Red team tool that emulates the SolarWinds CI compromise attack vector.

SolarSploit Sample malicious program that emulates the SolarWinds attack vector. Listen for processes that use the go compiler Wait for a syscall to o

TestifySec 18 Feb 11, 2022
A Go package converting a monochrome 1-bit bitmap image into a set of vector paths.

go-bmppath Overview Package bmppath converts a monochrome 1-bit bitmap image into a set of vector paths. Note that this package is by no means a sophi

tunabay 2 Mar 22, 2022
IntSet - Integer based Set based on a bit-vector

IntSet - Integer based Set based on a bit-vector Every integer that is stored will be converted to a bit in a word in which its located. The words are

Jakob Möller 0 Feb 2, 2022
Generate vector tiles for the entire planet on relatively low spec hardware.

Sequentially Generate Planet Mbtiles Sequentially generate and merge an entire planet.mbtiles vector tileset on low memory/power devices for free. com

Jack Bizzell 18 Jun 24, 2022
stratus is a cross-cloud identity broker that allows workloads with an identity issued by one cloud provider to exchange this identity for a workload identity issued by another cloud provider.

stratus stratus is a cross-cloud identity broker that allows workloads with an identity issued by one cloud provider to exchange this identity for a w

robert lestak 1 Dec 26, 2021
Cloud-Z gathers information and performs benchmarks on cloud instances in multiple cloud providers.

Cloud-Z Cloud-Z gathers information and performs benchmarks on cloud instances in multiple cloud providers. Cloud type, instance id, and type CPU infor

CloudSnorkel 16 Jun 8, 2022