Indexing with Elasticsearch

1 Requirements

This feature requires an existing installation of Elasticsearch 5 or Elasticsearch 6. Please consult the Elasticsearch documentation for details on installation, configuration and maintenance of Elasticsearch.

The API of the Elasticsearch instance must be reachable from the CMS server, but should not be directly reachable from clients.

For indexing file contents, the Ingest Attachment Plugin is also required.

Please use matching versions of Elasticsearch and the Ingest Attachment Plugin. When installing e.g. Elasticsearch v6.2.3, you must install the Ingest Attachment Plugin v6.2.3.
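The version match can be checked before the feature is activated. The following is a minimal sketch, assuming the server version is read from the JSON returned by `GET /` and the plugin list from `GET /_cat/plugins?format=json` on the Elasticsearch instance. The function name and the sample data are hypothetical and only illustrate the comparison.

```python
def attachment_plugin_matches(root_info, plugins):
    """Return True if an ingest-attachment plugin with the server's version is listed.

    root_info: parsed JSON of `GET /` (assumed to contain version.number)
    plugins:   parsed JSON of `GET /_cat/plugins?format=json` (assumed to be a
               list of objects with `component` and `version` entries)
    """
    es_version = root_info["version"]["number"]
    return any(
        plugin.get("component") == "ingest-attachment"
        and plugin.get("version") == es_version
        for plugin in plugins
    )


# Hypothetical sample data, shaped like the responses described above.
root_info = {"version": {"number": "6.2.3"}}
plugins = [{"name": "node-1", "component": "ingest-attachment", "version": "6.2.3"}]

print(attachment_plugin_matches(root_info, plugins))
```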

2 Installation and Configuration of Elasticsearch

The Elasticsearch setup documentation explains how to set up Elasticsearch and get it running.

Please consult the documentation for the exact version of Elasticsearch you are using.

3 Configuration of the CMS

Indexing and searching must be activated and configured in the CMS:

/Node/etc/conf.d/*.conf


// activate indexing with elasticsearch
$FEATURE["elasticsearch"] = true;

// url for connecting to elasticsearch instance
$ELASTICSEARCH["url"] = "http://localhost:9200/";

// prefix for indices in elasticsearch
$ELASTICSEARCH["index"] = "genticscms";

// number of threads for indexing
$ELASTICSEARCH["threads"] = 10;

3.1 Indexing file contents

For files (but not images), the CMS also puts the file content into the Elasticsearch index to make it searchable. For some file types this is not useful and can exhaust resources in Elasticsearch (for example, causing out-of-memory errors). Indexing of file contents can therefore be controlled with blacklists and whitelists of mimetypes. It is also possible to restrict the number of indexed characters per file (the default limit in Elasticsearch is 100000 characters).

/Node/etc/conf.d/*.conf



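// blacklist: mimetypes whose contents are never indexed
$ELASTICSEARCH["content"]["blacklist"] = array(
    "application/zip",
    "audio/.*",
    "video/.*"
);

// whitelist: mimetypes whose contents may be indexed (unless blacklisted)
$ELASTICSEARCH["content"]["whitelist"] = ".*";

// restrict the number of indexed characters per file
$ELASTICSEARCH["content"]["indexedChars"] = 10000;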

blacklist and whitelist can each be a single regular expression or an array of regular expressions.
The default blacklist contains application/zip, audio/.* and video/.* (as in the example above). The default whitelist is empty, which allows all mimetypes that are not blacklisted.
The blacklist takes precedence over the whitelist: if a mimetype matches one of the blacklist patterns, the content is not indexed, even if the mimetype also matches one of the whitelist patterns.
indexedChars is empty by default, in which case the Elasticsearch default limit of 100000 characters applies.
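The precedence rules can be illustrated with a short sketch. This is not the CMS implementation. It assumes that each pattern must match the whole mimetype, which the text above does not specify, and the function names are hypothetical.

```python
import re

# Default blacklist as described above; the default whitelist is empty.
DEFAULT_BLACKLIST = ("application/zip", "audio/.*", "video/.*")


def as_list(patterns):
    """Accept a single regular expression or a sequence of them."""
    if patterns is None:
        return []
    if isinstance(patterns, str):
        return [patterns]
    return list(patterns)


def should_index_content(mimetype, blacklist=DEFAULT_BLACKLIST, whitelist=None):
    """Decide whether file content of the given mimetype would be indexed.

    Assumption: a pattern has to match the complete mimetype (re.fullmatch).
    """
    # The blacklist wins, even if the whitelist matches as well.
    if any(re.fullmatch(p, mimetype) for p in as_list(blacklist)):
        return False
    allowed = as_list(whitelist)
    # An empty whitelist allows everything that is not blacklisted.
    if not allowed:
        return True
    return any(re.fullmatch(p, mimetype) for p in allowed)


print(should_index_content("application/pdf"))
print(should_index_content("video/mp4", whitelist=".*"))
```

With the defaults, a PDF is indexed, while a video is rejected even when the whitelist is set to ".*".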

3.2 Configuring word decompounder for German

For a better search experience, it is advisable to use a word decompounder for the analysis of German content, e.g. the hyphenation decompounder token filter.

The following steps are necessary to accomplish this:

3.2.1 Step 1 – Hyphenation pattern file

The hyphenation pattern file can be downloaded from the OFFO Sourceforge project.

Only the file offo-hyphenation/hyph/de_DR.xml from the zip archive is needed. It should be placed under analysis/de_DR.xml in the config directory of the Elasticsearch installation.

3.2.2 Step 2 – German dictionary file

The German dictionary file german-dictionary.txt was generated from the igerman98 project.

The wordlist must be placed under analysis/german-dictionary.txt in the config directory of the Elasticsearch installation.

3.2.3 Step 3 – Configuration of index settings

The index settings for the index genticscms_page_de can be changed like this:

/Node/etc/conf.d/*.conf


// change the index settings for German pages to use a word decompounder
$ELASTICSEARCH["settings"]["page_de"] = array(
    "analysis" => array(
        "analyzer" => array(
            "content_analyzer" => array(
                "type" => "custom",
                "char_filter" => array("html_strip"),
                "tokenizer" => "standard",
                "filter" => array("lowercase", "german_stop", "german_decompounder", "german_snowball")
            ),
            "search_analyzer" => array(
                "type" => "custom",
                "char_filter" => array("html_strip"),
                "tokenizer" => "standard",
                "filter" => array("lowercase", "german_stop", "german_decompounder", "german_snowball")
            )
        ),
        "filter" => array(
            "german_stop" => array(
                "type" => "stop",
                "stopwords" => "_german_"
            ),
            "german_decompounder" => array(
                "type" => "hyphenation_decompounder",
                "word_list_path" => "analysis/german-dictionary.txt",
                "hyphenation_patterns_path" => "analysis/de_DR.xml",
                "min_subword_size" => 4
            ),
            "german_snowball" => array(
                "type" => "snowball",
                "language" => "German"
            )
        )
    )
);

The index settings contain two analyzers, named content_analyzer and search_analyzer. The two analyzers can be configured identically or differently, depending on the desired indexing and search behaviour.

If the configuration of an already existing index is modified, the index must be dropped and rebuilt using the Search Index Maintenance.

3.2.4 `content_analyzer`

The content_analyzer will be used for analysis of the indexed content.

3.2.5 `search_analyzer`

The search_analyzer will be used for analysis of the search terms.
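To see which tokens an analyzer produces, the Elasticsearch `_analyze` API can be called on the index. The sketch below only builds such a request. It assumes the URL and index name from the examples above, and the helper name is hypothetical. Sending the request requires access to the Elasticsearch API, which is normally only available from the CMS server.

```python
import json
import urllib.request


def build_analyze_request(es_url, index, analyzer, text):
    """Build a POST request for the _analyze API of the given index."""
    body = json.dumps({"analyzer": analyzer, "text": text}).encode("utf-8")
    return urllib.request.Request(
        url=f"{es_url.rstrip('/')}/{index}/_analyze",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_analyze_request(
    "http://localhost:9200/",   # $ELASTICSEARCH["url"] from the example above
    "genticscms_page_de",       # index for German pages
    "content_analyzer",         # or "search_analyzer"
    "Weißkopfseeadler",
)
print(request.get_method(), request.full_url)

# To send it against a running instance:
# with urllib.request.urlopen(request) as response:
#     print(json.load(response))
```

Running the same text through both analyzers is a quick way to check whether indexed content and search terms end up as comparable tokens.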

4 Usage

When the CMS is started (or when the configuration is reloaded), the CMS will check for existence of the required indices and will create missing indices:

Type	Index name
Page (no language)	genticscms_page
Page (language with code “xx”)	genticscms_page_xx
Folder	genticscms_folder
Image	genticscms_image
File	genticscms_file

For every language that is activated for at least one Node, the matching index genticscms_page_[code] will be created. If additional languages are activated for Nodes, the missing indices will be created automatically, and if languages are deactivated, superfluous indices will be dropped.

The pipeline for ingesting attachments will also be created, if necessary.

If at least one index is created, an automatic full reindex run is triggered.
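The naming scheme from the table can be sketched as follows. This is an illustration only: the function names are hypothetical, and the prefix corresponds to the value of $ELASTICSEARCH["index"].

```python
def required_indices(prefix, language_codes):
    """List the index names expected for a prefix and the activated languages."""
    names = [f"{prefix}_page"]  # pages without language
    names += [f"{prefix}_page_{code}" for code in sorted(set(language_codes))]
    names += [f"{prefix}_folder", f"{prefix}_image", f"{prefix}_file"]
    return names


def missing_indices(required, existing):
    """Return the required index names that do not exist yet."""
    return [name for name in required if name not in existing]


required = required_indices("genticscms", ["de", "en"])
print(required)
print(missing_indices(required, {"genticscms_page", "genticscms_folder"}))
```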

5 Search Index Maintenance

The Content.Admin in the old UI contains a new entry Search Index Maintenance, which provides an overview of the required indices and their status.

The list indicates indices that are not fully functional, either because they do not exist, have incorrect settings or mappings, or do not contain the expected number of documents.

The action Rebuild index will update an incorrect mapping, will remove superfluous documents from the index and will reindex every object.

The action Drop and rebuild index will drop, recreate and fill the index.

6 REST API

POST requests to the REST endpoint /CNPortletapp/rest/elastic will be forwarded to the search API endpoint of Elasticsearch for all applicable indices.

For example, the request


POST http://[cmshost]/CNPortletapp/rest/elastic/_search?sid=[SID]
{
	"_source": false,
	"query": {
		"query_string": {
			"query": "Test*"
		}
	}
}

will be forwarded to


POST http://localhost:9200/genticscms_page,genticscms_page_de,genticscms_page_en,genticscms_file,genticscms_image,genticscms_folder/_search?sid=[SID]
{
	"_source": false,
	"query": {
		"query_string": {
			"query": "Test*"
		}
	}
}

The response will be extended to contain the CMS objects as attribute _object of the returned hits.
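A client could use the endpoint roughly as sketched below. The host name, the SID value and the shape of the sample response are hypothetical. In particular, the structure of the _object attribute is not specified here, so the sample only shows where it is found.

```python
import json
import urllib.parse
import urllib.request


def build_search_request(cms_host, sid, query):
    """Build the POST request for the CMS search endpoint shown above."""
    body = json.dumps(
        {"_source": False, "query": {"query_string": {"query": query}}}
    ).encode("utf-8")
    url = (
        f"http://{cms_host}/CNPortletapp/rest/elastic/_search?"
        + urllib.parse.urlencode({"sid": sid})
    )
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def cms_objects(response):
    """Collect the _object attribute of every hit in a parsed search response."""
    hits = response.get("hits", {}).get("hits", [])
    return [hit["_object"] for hit in hits if "_object" in hit]


request = build_search_request("cms.example.com", "abc", "Test*")
print(request.full_url)

# Hypothetical response, reduced to the parts used here.
sample_response = {
    "hits": {
        "hits": [
            {"_index": "genticscms_page_en", "_id": "42",
             "_object": {"id": 42, "name": "Test page"}},
        ]
    }
}
print(cms_objects(sample_response))
```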

7 Mapping

7.1 Page

CMS property	Index attribute	Index type
page.id	id	integer
page.folder.node.id	nodeId	integer
page.folder.id	folderId	integer
page.name	name	text
page.description	description	text
content	content	text
page.creationdate.timestamp	created	date (epoch_second)
page.creator.id	creatorId	integer
page.editdate.timestamp	edited	date (epoch_second)
page.editor.id	editorId	integer
page.publishdate.timestamp	published	date (epoch_second)
page.publisher.id	publisherId	integer
page.template.id	templateId	integer
page.language.code	languageCode	keyword
page.nice_url	niceUrl	text
page.alternate_urls	alternateUrls	text
Node/Channel IDs, where page is online	online	integer
Modification status	modified	boolean
Queue status	queued	boolean
Planned status	planned	boolean
Planned publish date	publishAt	date (epoch_second)
Planned offline date	offlineAt	date (epoch_second)
Queued publish date	queuedPublishAt	date (epoch_second)
Queued offline date	queuedOfflineAt	date (epoch_second)
System creation date	systemCreationDate	date (epoch_second)
Custom creation date	customCreationDate	date (epoch_second)
System edit date	systemEditDate	date (epoch_second)
Custom edit date	customEditDate	date (epoch_second)
Deleted status	deleted	boolean

The attributes edited and created contain the custom dates if they are set, and fall back to the system dates otherwise. The attributes systemCreationDate and systemEditDate always contain the system (real) dates, while customCreationDate and customEditDate contain the custom dates (and are empty if none are set).
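The fallback can be expressed as a small sketch. It assumes that an unset custom date is represented as None (the actual representation in the index is not specified here), and the function names are hypothetical. All values are epoch seconds, as in the mapping table.

```python
from datetime import datetime, timezone


def indexed_dates(system_creation, system_edit, custom_creation=None, custom_edit=None):
    """Derive the date attributes of a page from system and custom dates."""
    return {
        # custom date if set, otherwise the system date
        "created": custom_creation if custom_creation is not None else system_creation,
        "edited": custom_edit if custom_edit is not None else system_edit,
        # always the real dates
        "systemCreationDate": system_creation,
        "systemEditDate": system_edit,
        # empty (None) if no custom date was set
        "customCreationDate": custom_creation,
        "customEditDate": custom_edit,
    }


def to_datetime(epoch_second):
    """Convert an epoch_second value into a timezone-aware UTC datetime."""
    return datetime.fromtimestamp(epoch_second, tz=timezone.utc)


dates = indexed_dates(1500000000, 1500003600, custom_edit=1600000000)
print(dates["created"], dates["edited"])
print(to_datetime(dates["edited"]).isoformat())
```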

7.2 Folder

CMS property	Index attribute	Index type
folder.id	id	integer
folder.node.id	nodeId	integer
folder.mother	folderId	integer
folder.name	name	text
folder.description	description	text
folder.creationdate.timestamp	created	date (epoch_second)
folder.creator.id	creatorId	integer
folder.editdate.timestamp	edited	date (epoch_second)
folder.editor.id	editorId	integer
folder.creationdate.timestamp	systemCreationDate	date (epoch_second)
folder.editdate.timestamp	systemEditDate	date (epoch_second)
Deleted status	deleted	boolean

7.3 Image

CMS property	Index attribute	Index type
image.id	id	integer
image.folder.node.id	nodeId	integer
Node/Channel IDs, where image is online	online	integer
image.folder.id	folderId	integer
image.name	name	text
image.description	description	text
image.nice_url	niceUrl	text
image.alternate_urls	alternateUrls	text
image.createdate.timestamp	created	date (epoch_second)
image.creator.id	creatorId	integer
image.editdate.timestamp	edited	date (epoch_second)
image.editor.id	editorId	integer
image.type	mimetype	text
image.createdate.timestamp	systemCreationDate	date (epoch_second)
image.editdate.timestamp	systemEditDate	date (epoch_second)
Deleted status	deleted	boolean

7.4 File

CMS property	Index attribute	Index type
file.id	id	integer
file.folder.node.id	nodeId	integer
Node/Channel IDs, where file is online	online	integer
file.folder.id	folderId	integer
file.name	name	text
file.description	description	text
file.nice_url	niceUrl	text
file.alternate_urls	alternateUrls	text
file.createdate.timestamp	created	date (epoch_second)
file.creator.id	creatorId	integer
file.editdate.timestamp	edited	date (epoch_second)
file.editor.id	editorId	integer
file.type	mimetype	text
binarycontent	content	text
file.createdate.timestamp	systemCreationDate	date (epoch_second)
file.editdate.timestamp	systemEditDate	date (epoch_second)
Deleted status	deleted	boolean

8 New UI

If the feature is activated, the new UI will automatically use the search index for searching objects. Additionally, there will be an “advanced search” panel below the main search bar, which allows further filtering of results by various parameters.

9 Differences between regular and advanced search

While the regular search simply finds content that contains the exact characters entered by the editor (anywhere in the text, even mid-word), search with Elasticsearch works completely differently.

When content is indexed in Elasticsearch, it is analyzed with the configured content_analyzer. The analyzer will transform the content into a set of tokens which are put into the index.

While searching, the search term entered by the user is analyzed with the configured search_analyzer (which is by default identical to the content_analyzer). The analyzer will also transform the search term into a set of tokens and Elasticsearch will find matching documents by comparing the search tokens with the tokens in the index.

This approach has many advantages over the simple database search used by the regular search:

Documents are scored to determine how well they match the search terms. Documents with a better score are sorted first.
Analyzers can use language-specific “stemming” when generating the tokens (e.g. the German words “Häuser”, “Hauses” and “Haus” will all generate the token “haus”). This improves the search experience, because searching for “Haus” will also find documents containing “Häuser” or “Hauses”.
The German analyzer can be configured with a decompounder, which creates tokens for valid “subwords”. E.g. the German word “Weißkopfseeadler” will create the tokens “weiss”, “kopf”, “see” and “adler”. So when searching for “Adler”, documents containing “Weißkopfseeadler” will be found.
In general, token-based search in Elasticsearch is much faster than searching with wildcards in a database.
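The idea of token-based matching with a decompounder can be illustrated with a toy example. This is not how Elasticsearch implements the hyphenation decompounder: the sketch ignores hyphenation patterns, stop words and stemming, and simply looks up every substring in a tiny hypothetical dictionary, with a minimum subword length chosen so that “see” is found.

```python
def toy_decompound(word, dictionary, min_subword_size=3):
    """Return all dictionary words contained in the (normalized) word."""
    normalized = word.lower().replace("ß", "ss")
    tokens = []
    for start in range(len(normalized)):
        for end in range(start + min_subword_size, len(normalized) + 1):
            candidate = normalized[start:end]
            if candidate in dictionary:
                tokens.append(candidate)
    return tokens


def matches(search_term, indexed_tokens):
    """A search term matches if its lowercased form equals an indexed token."""
    return search_term.lower() in indexed_tokens


# Hypothetical miniature dictionary, standing in for german-dictionary.txt.
dictionary = {"weiss", "kopf", "see", "adler"}

tokens = toy_decompound("Weißkopfseeadler", dictionary)
print(tokens)
print(matches("Adler", tokens))
```

Because matching compares whole tokens instead of scanning text for substrings, the lookup can be served from the index, which is where the speed advantage over wildcard searches comes from.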
