An Introduction to Bleve, a Full-Text Search Database Written in Go

TL;DR Bleve is an open-source full-text search database. It is written in Go, it is very fast, and has a small memory footprint. This post focuses on the basics: setting up document mappings, setting up analyzers, and performing basic searches.

Why Bleve at Wireless Registry

At Wireless Registry, we maintain a mapping from wireless names to their associated semantic atoms. Our main registry is in Solr, which has good support for replication and sharding, an out-of-the-box admin interface, and good write throughput. There is, however, one key issue we have run up against: we have to handle query bursts coming from our graph analytics cluster. One solution is to scale out the Solr cluster. It is an easy solution, but it still introduces high latency into our graph computations.

Enter Bleve Search! Bleve is an open source Information Retrieval database written in Go.

[Image: Bleve logo]

In principle, it can be used as a drop-in replacement for Solr and Elasticsearch. Interestingly, it does not have its own storage layer; instead, it exposes a generic KV interface and supports many KV stores, such as boltdb, leveldb, and goleveldb. Adding your preferred KV store is as easy as writing a few wrappers. Most importantly, Bleve is fast and has a small memory footprint (especially compared to JVM products). Given this, we baked Bleve straight into our graph processing nodes. A node instructs the Bleve agent to load sets of Solr documents, and then queries the Bleve index locally. This solution has allowed us to scale out our analysis rapidly and cheaply.

Core Concepts

Before jumping into the code, we need to clarify a few key concepts. This is not meant to be an exhaustive description, but rather a subjective take on the essentials. For in-depth information, see the documentation and the godocs.

  1. Bleve indexes documents. A document is a set of fields. Each field is typed as Text, Boolean, Numeric, or DateTime. A document may contain sub-documents.
  2. Each field is assigned an analyzer. An analyzer takes the field’s value and produces a set of tokens, each of which is put into the index. The index contains all the tokens extracted from the input documents, per field, and each token entry points back to the documents that contain that token. This structure is often referred to as an inverted index (see the minimal sketch just after this list).
  3. Documents are retrieved by searching the index. A search query specifies which tokens the matching documents must contain, either exactly or approximately (i.e. tokens similar to the query’s tokens).
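
To make the second point concrete, here is a minimal conceptual sketch of an inverted index in plain Go. It only illustrates the token-to-documents direction of the lookup; Bleve’s real index lives in a KV store and is considerably more sophisticated.

// Conceptual sketch only, not Bleve's actual storage format: an inverted
// index maps each token to the IDs of the documents containing it.
type InvertedIndex map[string][]string

// Add records that the document identified by docID produced these tokens.
func (idx InvertedIndex) Add(docID string, tokens []string) {
  for _, t := range tokens {
    idx[t] = append(idx[t], docID)
  }
}

// Lookup returns the IDs of all documents that contain the given token.
func (idx InvertedIndex) Lookup(token string) []string {
  return idx[token]
}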

To keep this post simple, we will only focus on text fields. If you are versed in Go, have a look at the analysis and search packages to get a more in-depth understanding of these concepts.

Document Schema

We will use the following struct as our running example:

type SimpleDoc struct {
  Id    string // bleve's Index method takes a string document ID
  Title string
  Body  string
}

func (d *SimpleDoc) Type() string {
  return "simple_doc"
}

The Type() function returns a string that Bleve uses to pick the matching document mapping, and hence the appropriate analyzers, during indexing.

Now, we need to define a set of analyzers that will be applied over the Title and Body fields. Recall that the analyzer is a function:

analyzer: Value -> 2^Tokens

In short, an analyzer takes some value and produces a set of tokens, which is a subset of Tokens (i.e. the set of all possible token values). The simplest analyzer is an identity function that returns the original value as a single token. It is defined as follows:

import (
  "github.com/blevesearch/bleve/analysis/analyzers/custom_analyzer"
  "github.com/blevesearch/bleve/analysis/tokenizers/single_token"
)

func SingleTermAnalyzer() map[string]interface{} {
  return map[string]interface{}{
    "type":          custom_analyzer.Name,
    "char_filters":  []string{},
    "tokenizer":     single_token.Name,
    "token_filters": []string{},
  }
}

Let us step through the code. We define analyzers using map[string]interface{} values. In particular, the type property says that this is a custom (i.e. user-defined) analyzer. (The analysis package has a set of predefined analyzers.) The char_filters property is an empty array, meaning that no preprocessing character filters are applied to the field’s value. The tokenizer property is set to the single_token tokenizer (identified by its Name property). Finally, the token_filters property is set to an empty array, since we do not want to further decompose the generated token.

Now, we can set up a document schema (mapping) for our struct as follows:

indexMapping := bleve.NewIndexMapping()
indexMapping.AddCustomAnalyzer("single_term", SingleTermAnalyzer())
docMapping := bleve.NewDocumentMapping()
indexMapping.AddDocumentMapping("simple_doc", docMapping)
rawFieldMapping := bleve.NewTextFieldMapping()
rawFieldMapping.Analyzer = "single_term"
docMapping.AddFieldMappingsAt("Title", rawFieldMapping)
docMapping.AddFieldMappingsAt("Body", rawFieldMapping)
index, err := bleve.New("", indexMapping) // in-memory index

We first create a new indexMapping and register a custom analyzer dubbed single_term. Into this mapping we then add a document mapping for the type simple_doc (our fancy struct). We can now associate that analyzer with the Title and Body fields (via the rawFieldMapping variable). Finally, we open an index given our mapping. Note that the empty string parameter to the New function indicates that the index is stored in memory. If a non-empty string is supplied, it is taken as the file path where Bleve will store the index.
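
As a hedged aside, the same call can be made disk-backed instead of in-memory. The path "example.bleve" below is purely illustrative, and the ErrorIndexPathExists sentinel is assumed from the Bleve version used here; adjust to your version as needed.

// Sketch of a disk-backed index instead of the in-memory call above.
// "example.bleve" is an illustrative path, not anything the post prescribes.
index, err := bleve.New("example.bleve", indexMapping)
if err == bleve.ErrorIndexPathExists {
  // The index already exists on disk, so reopen it instead of creating it.
  index, err = bleve.Open("example.bleve")
}
if err != nil {
  // handle the error
}
defer index.Close()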

Indexing and Searching

Having set up our document mapping and having instantiated the index, we can finally index and search documents.

doc := &SimpleDoc{"1", "Top Gun", "A great film."}
index.Index(doc.Id, doc)
query := bleve.NewMatchQuery("Top Gun").SetField("Title")
req := bleve.NewSearchRequest(query)
req.Fields = []string{"*"}
results, err := index.Search(req)

Indexing requires that each document have a unique ID across the whole index. Searching is a slightly more complex affair. First, we must select which type of query to construct. In the example above we use a Match query, which first analyzes the query string to obtain tokens and then searches for documents that contain at least one of those tokens. Note that the query string is analyzed with the analyzer assigned to the field the query targets. A simpler query (compared to Match) is a Term query, which does not analyze the query string but uses it directly as the search token. A more complex query is a Phrase query, which requires tokens to appear in a fixed order. We can also combine queries using conjunctions and disjunctions to form complex queries. Crucially, with every issued query we must specify which document fields should be returned in the results. In the example above, the * symbol indicates that we want all fields retrieved.
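
To make those alternatives a little more concrete, here is a hedged sketch of a Term query and a disjunction, following the chainable query API used above (constructor signatures vary between Bleve versions):

// Term query: the query string is not analyzed, so it must equal an indexed
// token exactly. With the single_term analyzer, the whole Title is one token.
termQuery := bleve.NewTermQuery("Top Gun").SetField("Title")

// Disjunction: a document matches if it satisfies at least one sub-query.
// Here the Term query alone is enough for our document to be returned.
bodyQuery := bleve.NewMatchQuery("great film").SetField("Body")
eitherQuery := bleve.NewDisjunctionQuery([]bleve.Query{termQuery, bodyQuery})
altReq := bleve.NewSearchRequest(eitherQuery)
altResults, err := index.Search(altReq)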

for _, res := range results.Hits {
  fmt.Printf("ID: %v. Score %v.\n", res.ID, res.Score)
  
  for k, v := range res.Fields {
    fmt.Printf("Field %v. Value %v.\n", k, v)
  }
}

Each hit is a DocumentMatch struct that describes a search hit (result). The most important fields are ID, Score, and the Fields map defined as map[string]interface{}.
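
The surrounding SearchResult also carries useful aggregates (field names as in the Bleve version used in this post), for example:

// Total number of matching documents and the time the search took.
fmt.Printf("Matched %v documents in %v.\n", results.Total, results.Took)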

A Few Words on Scoring

Scoring is arguably the most important part of document search. The basic problem is: given a potentially large number of search hits, how are the resulting documents sorted by relevancy to the query? Bleve, much like Lucene, uses tf-idf scoring. At its heart is a cosine similarity between the query’s token vector and each document’s token vector, with each token weighted by its relevancy.
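
As a rough, textbook-style sketch of that idea (and explicitly not Bleve’s actual scoring code), a token’s weight combines its term frequency with its inverse document frequency:

import "math"

// Textbook tf-idf weight of a token in a document: tokens that are frequent
// in the document but rare across the index get the highest weight.
// tf: occurrences of the token in the document.
// df: number of documents in the index containing the token.
// n:  total number of documents in the index.
func tfidf(tf, df, n float64) float64 {
  return tf * math.Log(n/df)
}

The query and each document are then treated as vectors of such weights, and a document’s score is the cosine of the angle between the two vectors.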

In practice, tf-idf has been proven to work, though there are other techniques one can use. Further exploration of the tf-idf scoring, as implemented in Bleve, is left for a future post.

More Complex Analyzers

So far, our constructed index with the simple analyzer is more or less a “glorified” map[string]interface{} data structure. The power of Bleve really reveals itself when we start using more complex analyzers. Consider the following:

// Additional imports assumed here, following the older Bleve layout used
// earlier in this post (adjust the paths to match your Bleve version):
import (
  "github.com/blevesearch/bleve/analysis/token_filters/lower_case_filter"
  "github.com/blevesearch/bleve/analysis/token_filters/ngram_filter"
  "github.com/blevesearch/bleve/analysis/tokenizers/whitespace_tokenizer"
)

func NgramAnalyzer() map[string]interface{} {
  return map[string]interface{}{
    "type":         custom_analyzer.Name,
    "char_filters": []string{},
    "tokenizer":    whitespace_tokenizer.Name,
    "token_filters": []string{
      lower_case_filter.Name,
      "ngram_filter",
    },
  }
}

func NgramTokenFilter() map[string]interface{} {
  return map[string]interface{}{
    "type": ngram_filter.Name,
    "min":  float64(3),
    "max":  float64(4),
  }
}

The given analyzer produces the set of lowercase ngrams of size 3 to 4 for every word in the given text field. For example, the text field “A HORSE JUMPED OVER A TREE” would produce the following ngrams: “hor”, “ors”, “rse”, “hors”, “orse”, “jum”, “ump”, “mpe”, “ped”, “jump”, “umpe”, “mped”, “ove”, “ver”, “over”, “tre”, “ree”, “tree”. First, notice the whitespace tokenizer, which splits the field’s value on whitespace characters. Second, each produced token is lowercased by the stated lower_case_filter. Finally, each resulting token is split into ngrams, which become the final set of tokens that gets indexed. The code above assumes that the ngram_filter string has been registered with the indexMapping. We do this with the following code:

indexMapping.AddCustomTokenFilter("ngram_filter", NgramTokenFilter())
indexMapping.AddCustomAnalyzer("ngram_analyzer", NgramAnalyzer())
ngramFieldMapping := bleve.NewTextFieldMapping()
ngramFieldMapping.Analyzer = "ngram_analyzer"
docMapping.AddFieldMappingsAt("Title", ngramFieldMapping)

Ngrams are extremely useful, for example, in searches where we expect some words to be misspelled.
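
As a hedged illustration (assuming the document is re-indexed after the Title field is switched to the ngram mapping above), a query with a typo can still find the document, because the Match query is analyzed with the same ngram analyzer and only needs to share some tokens with the indexed text:

// "Gnu" is a typo for "Gun"; its only ngram ("gnu") matches nothing, but the
// correctly spelled "Top" still yields the shared ngram "top", so the Match
// query (a disjunction over the query's tokens) finds the document anyway.
typoQuery := bleve.NewMatchQuery("Top Gnu").SetField("Title")
typoReq := bleve.NewSearchRequest(typoQuery)
typoResults, err := index.Search(typoReq)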

There may often be tokens that we do not want to index, like common words that do not add weight to the results (e.g. “a”, “an”, “the” in English). These stop words are simply another token filter that can be added to an analyzer. Consider the following:

// Additional imports assumed here (again following the older Bleve layout):
import (
  "github.com/blevesearch/bleve/analysis/token_filters/stop_tokens_filter"
  "github.com/blevesearch/bleve/analysis/token_map"
)

func NgramAnalyzer() map[string]interface{} {
  return map[string]interface{}{
    "type":         custom_analyzer.Name,
    "char_filters": []string{},
    "tokenizer":    whitespace_tokenizer.Name,
    "token_filters": []string{
      lower_case_filter.Name,
      "stop_words_filter",
      "ngram_filter",
    },
  }
}

func StopWordsTokenMap() map[string]interface{} {
  return map[string]interface{}{
    "type": token_map.Name,
    "tokens": []interface{}{
      "stop",
      "word",
    },
  }
}

func StopWordsTokenFilter() map[string]interface{} {
  return map[string]interface{}{
    "type":           stop_tokens_filter.Name,
    "stop_token_map": "stop_words_map",
  }
}

We have added a new stop_words_filter and want to use the stop words defined in StopWordsTokenMap(). With this map, the tokens “stop” and “word” are dropped before the ngram filter runs. The following code sets all of this up:

indexMapping.AddCustomTokenMap("stop_words_map", StopWordsTokenMap())
indexMapping.AddCustomTokenFilter("stop_words_filter", StopWordsTokenFilter())

This is just a taste of the power that different analyzers and token filters can give you. The full list of tokenizers, filters, and analyzers is in the analysis package. The code is very easy to follow.

Next Steps

Hopefully, this post sets up enough scaffolding to start experimenting with different features. To that end, you can explore the different search queries (which we only touched on), replace our simple Match query with other query types, and see how the results change.

This post also completely ignores facets, which are an important feature because they can aggregate information about the search results. They are straightforward to experiment with and use, and they give Bleve an extra edge.
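
As a small, hedged taste of that feature (the facet below on the Title field and the facet name are purely illustrative, reusing the search request from the earlier example), a terms facet can be attached to a request like this:

// Ask for the 10 most frequent Title terms among the matching documents.
req.AddFacet("title_terms", bleve.NewFacetRequest("Title", 10))
results, err := index.Search(req)
// results.Facets["title_terms"] then lists each term with its document count.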
