Search index: Neo4j to Elasticsearch
Neo4j-to-elasticsearch is a Neo4j plugin that enables automatic synchronization between Neo4j and Elasticsearch. This means that all changes to Neo4j are automatically propagated to Elasticsearch.
The neo4j-to-elasticsearch plugin is not compatible with Neo4j v4.x.
Follow these steps to install the plugin:
- Download the GraphAware framework JAR
  - Choose a version A.B.C.x in which A.B.C matches your Neo4j version and x is 44 or later
- Download the neo4j-to-elasticsearch JAR
  - Choose a version A.B.C.x.y in which A.B.C matches your Neo4j version and x.y is 44.8 or later
- Add the following lines to the beginning of your Neo4j configuration file (neo4j/conf/neo4j.conf):
  com.graphaware.runtime.enabled=true
  com.graphaware.module.ES.1=com.graphaware.module.es.ElasticSearchModuleBootstrapper
  com.graphaware.module.ES.uri=HOST_OF_YOUR_ELASTICSEARCH_SERVER
  com.graphaware.module.ES.port=PORT_OF_YOUR_ELASTICSEARCH_SERVER
  com.graphaware.module.ES.mapping=AdvancedMapping
  com.graphaware.module.ES.keyProperty=
  com.graphaware.module.ES.retryOnError=true
  com.graphaware.module.ES.asyncIndexation=true
  com.graphaware.module.ES.initializeUntil=2000000000000
  # Set "relationship" to "(false)" to disable relationship (edge) indexation.
  # Disabling relationship indexation is recommended if you have a lot of relationships and don't need to search them.
  com.graphaware.module.ES.relationship=
  com.graphaware.runtime.stats.disabled=true
  com.graphaware.server.stats.disabled=true
- Restart Neo4j
- Once Neo4j has finished indexing the data, remove the following line (and only this line) from
Please note that indexation could fail if your data uses different data types for the same property key, for example if a property representing a date uses ISO strings in some nodes and timestamps in others. If you encounter this issue, please get in touch.
The initializeUntil specification is used to trigger the indexation of existing Neo4j data in Elasticsearch. It is meant to be set to a number slightly higher than the value a Java call to System.currentTimeMillis() would normally return when the module is started. Thus, the database will be (re-)indexed only once, and not with every subsequent restart.
In other words, re-indexing will happen if
System.currentTimeMillis() < com.graphaware.module.ES.initializeUntil.
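As a minimal sketch of how such a value could be computed, the Python snippet below takes the current epoch time in milliseconds (the same quantity as the Java clock call described above) and adds a margin. The five-minute margin and the function names are assumptions chosen for illustration, not values prescribed by the plugin.

```python
import time

# Hypothetical margin: how far past "now" the threshold is placed.
DEFAULT_MARGIN_MS = 5 * 60 * 1000


def initialize_until(now_ms=None, margin_ms=DEFAULT_MARGIN_MS):
    """Return a timestamp slightly in the future, in epoch milliseconds."""
    if now_ms is None:
        now_ms = int(time.time() * 1000)
    return now_ms + margin_ms


def will_reindex(now_ms, initialize_until_ms):
    """Mirror the documented rule: re-index only while now < initializeUntil."""
    return now_ms < initialize_until_ms


if __name__ == "__main__":
    # Print a line that could be pasted into neo4j.conf.
    print(f"com.graphaware.module.ES.initializeUntil={initialize_until()}")
```

With a value computed this way, a restart that happens after the margin has elapsed no longer satisfies the condition, so no further re-indexing is triggered.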
Once the neo4j-to-elasticsearch plugin is installed, you need to change the relevant data-source configuration to use neo2es as its search index vendor. You can either use the Web user-interface or edit the configuration file located at linkurious/data/config/production.json to set the index.vendor property to the value neo2es.
For smaller indexation tasks, neo4j-to-elasticsearch can be used straight out of the box. Larger indexes are trickier. If you find that indexation is failing to complete, or that search is unusually slow, you will need to settle for partial indexation in order to keep Elasticsearch usable.
Partial indexation can be configured in your Neo4j configuration file (neo4j/conf/neo4j.conf) by specifying a subset of your graph to index. You can do this by selecting which types of nodes and relationships to keep and which properties to index on these nodes.
(For a detailed list of configuration options, please consult the official neo4j-to-elasticsearch documentation, available in GraphAware's GitHub repo. This guide will focus on a few use cases that should be adaptable to a wide variety of graph models.)
Partial indexation is handled by four options which can be added to your Neo4j configuration file:
com.graphaware.module.ES.node and com.graphaware.module.ES.relationship control which nodes and relationships to index.
com.graphaware.module.ES.node.property and com.graphaware.module.ES.relationship.property control which properties of these nodes and relationships to index.
Each of these options is followed by one or more parameters. Parameters are boolean expressions. They can be chained together using the standard logical operators && (AND) and || (OR). The most basic parameters are (true) and (false). They tell Elasticsearch either to index everything (the default behavior) or to index nothing. Note that the parentheses are required in order to force Elasticsearch to ignore these nodes while loading your database into the index.
To reiterate, if we add the line
com.graphaware.module.ES.relationship=(false)
to our configuration file, Elasticsearch will not index any of the relationships in our database.
The next step up is parameter functions. These allow for more complex inclusion and exclusion rules. For both nodes and relationships, the following two functions are available:
getProperty('propName', 'defaultValue'): Returns a property value, allowing you to compare it to another value using standard comparison operators.
hasProperty('propName'): Returns true or false depending on whether a node or relationship contains a certain property.
Additionally, there are functions specific to nodes and to relationships. For nodes, those that are useful for partial indexation are:
getDegree(): Returns the degree (number of relationships) of a node. Can be used with comparison operators.
hasLabel('labelName'): Returns true or false depending on whether the specified label is present on a node.
And for relationships, they are:
isType('typeName'): Returns true or false depending on whether the specified type is present on a relationship.
isOutgoing(): Returns true or false depending on whether a relationship is outgoing.
isIncoming(): Returns true or false depending on whether a relationship is incoming.
(You can find a full list of functions in GraphAware's inclusion policies documentation, maintained in their GitHub repo.)
Say we have a due diligence database containing people, banks, account numbers, addresses, telephone numbers, and email addresses. It's a large database -- several hundred million nodes -- so fully indexing the graph would be a lengthy process, and may ultimately be unnecessary if we are only interested in full-text search on a restricted set of node and relationship properties.
The first question we should ask is which data we are actually interested in. Maybe as part of our hypothesis about the data, we want to focus on a certain subset of connections that we believe form patterns of interest.
Let's say that we want to focus on identifying information only -- we think that there are cases where this information is shared by multiple individuals, for instance, and we're interested in analyzing them. The nodes of interest to us will therefore be those nodes which represent pure identifiers and not entities themselves -- addresses, telephone numbers, account numbers, and email addresses. We can tell Elasticsearch to index these and only these by adding the following line to our Neo4j configuration file:
com.graphaware.module.ES.node=hasLabel('Address') || hasLabel('Telephone') || hasLabel('Email') || hasLabel('Account')
(Alternatively, the exclusion form
com.graphaware.module.ES.node=!hasLabel('Person') && !hasLabel('Bank')
will also work.)
And since we aren't including people or banks, we also want to focus on the relationships relevant to our nodes of interest:
com.graphaware.module.ES.relationship=isType('HAS_ADDRESS') || isType('HAS_PHONE') || isType('HAS_EMAIL') || isType('HAS_ACCOUNT')
If we want to be even more specific, we can select only those properties which are relevant to our inquiry by adding com.graphaware.module.ES.node.property to our config. We follow this with a list of keys, or node property names (separated by || (OR) operators), that we want to include in our index:
com.graphaware.module.ES.node.property=key == 'address1' || key == 'city' || key == 'state' || key == 'number' || key == 'email' || key == 'accountNumber'
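When the lists of labels, types, and keys grow, writing these expressions by hand becomes error-prone. As one possible approach, the Python helper below assembles the three configuration lines from plain lists. The helper names are hypothetical; only the option names and the expression syntax are taken from the examples above. The sketch does not escape quotes, so it assumes names without apostrophes.

```python
def any_of(template, values):
    """Build one clause per value and join them with the || (OR) operator."""
    return " || ".join(template.format(v) for v in values)


def build_config(labels, rel_types, keys):
    """Return neo4j.conf lines for a partial-indexation setup."""
    prefix = "com.graphaware.module.ES"
    return [
        f"{prefix}.node=" + any_of("hasLabel('{}')", labels),
        f"{prefix}.relationship=" + any_of("isType('{}')", rel_types),
        f"{prefix}.node.property=" + any_of("key == '{}'", keys),
    ]


lines = build_config(
    labels=["Address", "Telephone", "Email", "Account"],
    rel_types=["HAS_ADDRESS", "HAS_PHONE", "HAS_EMAIL", "HAS_ACCOUNT"],
    keys=["address1", "city", "state", "number", "email", "accountNumber"],
)
print("\n".join(lines))
```

Run as-is, the script prints the same three lines used in this example, ready to be pasted into neo4j/conf/neo4j.conf.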
And if we only want to index nodes and NOT relationships, we can disable relationship indexation completely:
com.graphaware.module.ES.relationship=(false)
Since relationships are often more numerous than nodes, excluding them from our index can significantly reduce its storage footprint.
What if we finish our initial analysis and conclude that we need to know more about the financial institutions in our graph? We want to add banks to our index, but we don't need to search everything that is stored on them. Furthermore, we're only interested in banks in a certain region -- Europe, say. We can add the right banks back by modifying our original directive to read:
com.graphaware.module.ES.node=hasLabel('Address') || hasLabel('Telephone') || hasLabel('Email') || hasLabel('Account') || (hasLabel('Bank') && getProperty('bankRegion', 'None') == 'Europe')
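To see which nodes an expression like this would select, the following Python sketch re-creates its logic on a few in-memory sample nodes. It is only a model for reasoning about the expression: the Node class, its methods, and the sample data are hypothetical and are not part of the plugin, which evaluates the expression itself inside Neo4j.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """Hypothetical stand-in for a graph node (not a plugin class)."""
    labels: set
    properties: dict = field(default_factory=dict)

    def has_label(self, label):
        return label in self.labels

    def get_property(self, key, default):
        return self.properties.get(key, default)


IDENTIFIER_LABELS = ("Address", "Telephone", "Email", "Account")


def include_node(node):
    """Python equivalent of the node expression shown above."""
    is_identifier = any(node.has_label(label) for label in IDENTIFIER_LABELS)
    is_european_bank = (
        node.has_label("Bank")
        and node.get_property("bankRegion", "None") == "Europe"
    )
    return is_identifier or is_european_bank


nodes = {
    "email": Node({"Email"}, {"email": "a@example.com"}),
    "eu_bank": Node({"Bank"}, {"bankRegion": "Europe"}),
    "us_bank": Node({"Bank"}, {"bankRegion": "North America"}),
    "bank_without_region": Node({"Bank"}),
    "person": Node({"Person"}, {"name": "Alice"}),
}

indexed = sorted(name for name, node in nodes.items() if include_node(node))
print(indexed)  # ['email', 'eu_bank']
```

Note that the bank without a bankRegion property falls back to the default value 'None' and is therefore excluded, just like the bank in another region.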
And we can add a few bank properties like this:
com.graphaware.module.ES.node.property=key == 'address1' || key == 'city' || key == 'state' || key == 'number' || key == 'email' || key == 'accountNumber' || key == 'bankName' || key == 'bankIdentifier' || key == 'bankRegion'
Keep in mind that these keys will be indexed for every node on which they appear. If banks have a state property, for example, it will be added to the index as well, because the state key is already on the list. It's worth remembering this when constructing your data model. Namespace collisions can prove computationally costly.
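The effect can be illustrated with a small Python model of the key filter. The sample nodes and the filter function are hypothetical; the sketch only mirrors the rule that a listed key is indexed wherever it appears.

```python
# Keys listed in the node.property expression above.
INDEXED_KEYS = {
    "address1", "city", "state", "number", "email", "accountNumber",
    "bankName", "bankIdentifier", "bankRegion",
}


def indexed_properties(properties):
    """Keep only the properties whose key is on the inclusion list."""
    return {k: v for k, v in properties.items() if k in INDEXED_KEYS}


address = {"address1": "1 Main St", "city": "Springfield", "state": "IL"}
bank = {
    "bankName": "Example Bank",
    "bankRegion": "Europe",
    "state": "active",         # different meaning, same key as on addresses
    "swiftCode": "EXAMPLEXX",  # not listed, so not indexed
}

print(indexed_properties(address))
print(indexed_properties(bank))
```

In this model the bank's state value ends up in the index next to the address states, even though it was never the reason the key was listed.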
Neo4j-to-Elasticsearch, in combination with partial indexation strategies, can be a very efficient way of handling index synchronization for large graphs. Even if your graph is small enough to index in full, it's worth considering to what extent it may grow. By examining your problem space and choosing only a subset of the information available to you for your traversal needs, you may save yourself future headaches and maximize the efficiency of your graph.