Review: Neo4j supercharges graph analytics

When it comes to tracking relationships, Neo4j is faster, more flexible, and more scalable than relational databases


Neo4j is both the original graph database and the continued leader in the graph database market. Designed to store entities and relationships, and optimized to perform graph operations such as traversals, clustering, and shortest-path calculations, Neo4j shines at exploring data that consists of huge numbers of many-to-many relationships.

To understand why graph databases are different and sometimes desirable, we need to go back in time. Early databases were little more than indexed file systems for sequential records (ISAM) with fixed field layouts. Soon databases started expressing hierarchical relationships, such as departments belonging to divisions. Then they captured networks using sets with 1:n relationships; you could traverse the sets programmatically. The standard for the network database model was first issued by CODASYL in 1969.

When relational databases were first introduced, in the early 1970s, they were roughly half as fast as CODASYL databases, because of the overhead of the SQL query processor, especially when joining related tables. Fortunately, computer hardware was becoming faster: Moore’s Law observed that the number of transistors on an integrated circuit was doubling roughly every two years. That observation held for decades.

Relational database limitations

Relational databases are still going strong, and still need powerful server hardware. For really big data under heavy use, however, relational queries tend to slow down, mostly because of large join tables, contention for indexes, and complicated join logic.

Relational databases are not well suited to capturing ad hoc relationships that are not consistent across all records: You wind up with sparsely populated rows and far too many indexes, both of which degrade database performance. Remember, the relational schema is fixed, so every record in a given table contains every field, whether or not the field is populated.

Standard, non-graph NoSQL databases — whether key-value, document-oriented, or column-oriented — typically store sets of disconnected values, documents, or columns. To connect them, you can embed an aggregate’s identifier inside a record belonging to another aggregate, but this isn’t very efficient. While there are many excellent use cases for all three kinds of NoSQL databases, connections aren’t their forte.

Enter the graph database

That brings us, finally, to graph databases and Neo4j. In 1999 Emil Eifrem and his colleagues at Neo Technology (now Neo4j, Inc.), needing to perform ad hoc analysis and frustrated by the limitations of relational databases, figured out a way to implement the 300-year-old mathematical graph model in a database with nodes as vertices and relationships as edges.

The Neo engineers created a labeled property graph in which nodes contain properties. Properties are key-value pairs, so the properties used in a given class of node may vary from one node to another. Nodes may have one or more labels. Relationships are named and directed (they always have a start and end node), and like nodes, relationships can also contain properties.
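To make the labeled property graph model concrete, here is a minimal Python sketch. It is not Neo4j's API or storage format; the class names, the SUPPLIES relationship type, and all property values are invented for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    labels: set                                   # a node may carry one or more labels
    properties: dict = field(default_factory=dict)  # free-form key-value pairs


@dataclass
class Relationship:
    rel_type: str                                 # relationships are named...
    start: Node                                   # ...and directed: a start node
    end: Node                                     # and an end node
    properties: dict = field(default_factory=dict)


# Two Product nodes with different property keys (values are made up).
chocolade = Node({"Product"}, {"productName": "Chocolade", "unitPrice": 10.0})
mystery = Node({"Product", "Discontinued"}, {"productName": "Mystery Box"})
supplier = Node({"Supplier"}, {"companyName": "Example Sweets"})

# A directed, named relationship that carries its own property.
supplies = Relationship("SUPPLIES", supplier, chocolade, {"since": 2001})
```

Note that the second product simply omits unitPrice; nothing forces every Product node to carry the same keys, which is the point of the paragraph above.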

Neo Technology also created native graph storage and native graph processing using index-free adjacency rather than relying on a SQL back end. Neo4j complies with the ACID properties of transactional databases, has cluster support, and does runtime failover.
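The idea behind index-free adjacency can be sketched in a few lines of Python. This is only an illustration of the concept, not Neo4j's actual storage engine; the Node class, the KNOWS relationship type, and the names are assumptions made for the example.

```python
class Node:
    def __init__(self, name):
        self.name = name
        self.outgoing = []  # direct references to (rel_type, neighbor) pairs

    def relate(self, rel_type, other):
        self.outgoing.append((rel_type, other))


def neighbors(node, rel_type):
    """Follow stored references. The cost depends on this node's degree,
    not on the total number of nodes in the graph."""
    return [other for kind, other in node.outgoing if kind == rel_type]


alice, bob, carol = Node("Alice"), Node("Bob"), Node("Carol")
alice.relate("KNOWS", bob)
bob.relate("KNOWS", carol)

# A two-hop traversal is two rounds of pointer chasing; no index or join table is consulted.
friends_of_friends = [
    fof.name
    for friend in neighbors(alice, "KNOWS")
    for fof in neighbors(friend, "KNOWS")
]
```

In a relational design the same two-hop question would typically require two joins through a table of KNOWS pairs.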

After 18 years of development, Neo4j is a mature graph database platform that you can run on Windows, MacOS, and Linux, in Docker containers, in VMs, and in clusters. Neo4j can handle very large graphs, even in its open source edition, and unlimited graph sizes in its enterprise edition.

Graph database scalability

Does an arbitrarily large graph make sense? Graph databases don’t suffer from the same scaling issues as relational databases (one of which happens when queries use complex joins of large tables), so a very large graph database is still likely to perform well, at least once the relationships have been created. In the Neo4j Enterprise edition you can add as many cluster nodes as you need for performance purposes; the open source Community edition is limited to one server.

One computationally expensive operation is matching the related items in disjoint nodes, the rough equivalent of constructing foreign key constraints in a relational database. But in a graph database that cost is only incurred when you’re building the relationships (for example during data import), not when you’re using them. If you try to do this kind of match in the Neo4j Desktop, you’ll get a warning that says “This query builds a cartesian product between disconnected patterns. This may produce a large amount of data and slow down query processing.” You can still perform the operation, however — the warning is intended for when that wasn’t really what you meant.
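A small Python sketch shows why the cartesian product is the expensive part. The employee and order records are toy data, and the dictionary stands in, loosely, for an index on the matching property; none of this is Neo4j code.

```python
from itertools import product

employees = [{"employeeID": i} for i in range(1, 4)]
orders = [{"orderID": 100 + i, "employeeID": (i % 3) + 1} for i in range(6)]

# Naive match: compare every employee with every order (3 x 6 = 18 pairs).
pairs_examined = 0
sold_naive = []
for e, o in product(employees, orders):
    pairs_examined += 1
    if e["employeeID"] == o["employeeID"]:
        sold_naive.append((e["employeeID"], o["orderID"]))

# Indexed match: look up each order's employee directly, one lookup per order.
by_id = {e["employeeID"]: e for e in employees}
sold_indexed = [(by_id[o["employeeID"]]["employeeID"], o["orderID"]) for o in orders]
```

Both approaches produce the same six relationships, but the naive version examines 18 pairs, and that count grows as the product of the two collection sizes. Either way, the work is done once; later queries follow the stored relationships.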

Neo4j installation and learning

To learn Neo4j, you should download and install the Neo4j Desktop and try some or all of the online Neo4j sandboxes, for example the Paradise Papers sandbox shown below. The Neo4j Desktop download includes the Neo4j Enterprise Edition for Developers, with a perpetual license. The sandboxes include data, interactive guides with example queries, and sample code. They expire three days after creation unless you extend them.

The Neo4j Desktop download has scripts to create a small movie database and to import the Microsoft Northwind sample database. There are sandboxes with data from the Panama Papers and Paradise Papers, the U.S. Congress, and others — including your own Twitter social graph, extracted from your account. You’ll learn a lot from going through all the samples and guides, although you’ll eventually want to read the documentation, especially for the Cypher query language.

In addition, the Neo4j Desktop has options to install the APOC (Awesome Procedures On Cypher) and graph algorithms libraries. APOC consists of about 300 Cypher procedures for various purposes, and the graph algorithms library provides efficiently implemented, parallel versions of common graph algorithms for Neo4j 3.x, exposed as Cypher procedures.

Figure: Neo4j sandboxes (image credit: IDG)

Neo4j sandboxes are cloud container instances preloaded with Neo4j and (optionally) graph databases of interest along with tutorial scripts. In the figure above, I have created an instance with the Paradise Papers data set.

Neo4j data import

As shown in the guide to importing data and ETL, Neo4j can import tables into nodes from CSV files, create indexes on the nodes, create uniqueness constraints, and construct relationships using the Cypher query language. If you wish, you can eliminate the intermediate CSV files by selecting tables or views from your relational database using embedded SQL over its database driver, and adding the rows to the Neo4j database using embedded, parameterized Cypher statements. A beta component of Neo4j Enterprise, Graph ETL, provides a GUI for the data import and schema mapping process.
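The import steps (load rows as nodes, enforce uniqueness, then build relationships) can be mimicked in plain Python. This sketch does not use Neo4j or Cypher; the CSV contents, the load_nodes helper, and the PART_OF relationship name are assumptions chosen for illustration.

```python
import csv
import io

# Toy CSV data standing in for exported relational tables.
products_csv = "productID,productName,categoryID\n1,Chai,1\n2,Chocolade,3\n"
categories_csv = "categoryID,categoryName\n1,Beverages\n3,Confections\n"


def load_nodes(text, label, key):
    """Read CSV rows into nodes, enforcing a uniqueness constraint on `key`.
    The returned dict doubles as an index on that key."""
    index = {}
    for row in csv.DictReader(io.StringIO(text)):
        if row[key] in index:
            raise ValueError(f"duplicate {label}.{key}: {row[key]}")
        index[row[key]] = {"label": label, **row}
    return index


products = load_nodes(products_csv, "Product", "productID")
categories = load_nodes(categories_csv, "Category", "categoryID")

# Build PART_OF relationships once, using categoryID as the lookup key.
part_of = [
    (p["productName"], categories[p["categoryID"]]["categoryName"])
    for p in products.values()
]
```

Once the relationships exist, the categoryID column on the product is redundant; in a graph, the relationship itself carries that information.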

Cypher Query Language (for SQL programmers)

Cypher looks partly familiar and partly strange to an experienced SQL database programmer (see the SQL to Cypher guide). Some examples of syntax that are the same in both languages: WHERE, ORDER BY, LIMIT, AND, and comparisons such as p.unitPrice > 10. (Cypher’s SKIP corresponds to OFFSET in most SQL dialects.) The syntaxes that are different in Cypher have to do with graphs, patterns, and relationships, all the aspects unique to graph databases.

For example, node patterns are expressed in parentheses: (variable:Label). Properties, as key-value pairs, go in curly brackets: (item:Product {productName:"Chocolade"}). For those SQL mavens playing along at home, yes, that example comes right out of the Northwind database.

Relationship patterns are expressed as arrows, which may be annotated in square brackets with a variable and a relationship type: (x)-[someRel:REL_TYPE]->(y).

The rough equivalent of the SQL SELECT statement in Cypher is the MATCH statement, and the RETURN clause defines the result. Remember, you’re matching patterns. So

SELECT p.*
FROM products as p;

becomes

MATCH (p:Product)
RETURN p;

As I mentioned earlier, WHERE clauses are similar in both languages. However, you can use some shortcuts in Cypher that are not available in SQL. For example,

MATCH (p:Product)
WHERE p.productName = "Chocolade"
RETURN p.productName, p.unitPrice;

can also be expressed as

MATCH (p:Product {productName:"Chocolade"})
RETURN p.productName, p.unitPrice;

There are some minor differences in WHERE clauses between SQL and Cypher. For example, the LIKE expression using % as a wildcard in SQL becomes STARTS WITH, CONTAINS, or ENDS WITH in Cypher. You can also use regular expressions to the same end in Cypher, for example p.productName =~ "C.*".
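For readers who think in neither SQL nor Cypher, the three string predicates map onto familiar Python operations. This is an analogy rather than Cypher itself; the product names are sample values, and the use of re.fullmatch reflects my understanding that the Cypher =~ operator must match the entire string.

```python
import re

names = ["Chai", "Chang", "Chocolade", "Ikura", "Tofu"]

# SQL: LIKE 'Ch%'      Cypher: STARTS WITH "Ch"
starts_with = [n for n in names if n.startswith("Ch")]

# SQL: LIKE '%ocol%'   Cypher: CONTAINS "ocol"
contains = [n for n in names if "ocol" in n]

# SQL: LIKE '%u'       Cypher: ENDS WITH "u"
ends_with = [n for n in names if n.endswith("u")]

# Cypher: =~ "C.*"  (whole-string match, so it behaves like LIKE 'C%')
regex = [n for n in names if re.fullmatch(r"C.*", n)]
```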

Like SQL, Cypher supports inner and outer joins, but the notation becomes MATCH with a relationship pattern for inner joins, and OPTIONAL MATCH for outer joins. Much of the scut work needed to define n:m joins with intermediate join tables in relational schemas goes away in Cypher, because the graph schema is explicit about relationships. Aggregates are simpler, too. For example, the SQL query to find the top-selling employees

-- ORDER is a reserved word in SQL, so the table name is quoted
SELECT e.EmployeeID, count(*) AS Count
FROM Employee AS e
JOIN "Order" AS o ON (o.EmployeeID = e.EmployeeID)
GROUP BY e.EmployeeID
ORDER BY Count DESC LIMIT 10;

becomes

MATCH (:Order)<-[:SOLD]-(e:Employee)
RETURN e.name, count(*) AS cnt
ORDER BY cnt DESC LIMIT 10

in Cypher. The GROUP BY clause is not needed in Cypher; it’s implied by the count aggregate, with the non-aggregated expression in the RETURN clause (here e.name) serving as the grouping key. The JOIN clause on EmployeeID values isn’t needed in this particular case because the SOLD relationship pattern has already captured its intention.
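The implicit grouping may be easier to see in a few lines of Python. In this sketch each tuple stands for one SOLD relationship; the names and order numbers are invented, and Counter plays the role of count(*) grouped by the non-aggregated value.

```python
from collections import Counter

# Each tuple stands for one (:Employee)-[:SOLD]->(:Order) relationship.
sold = [
    ("Nancy", 10248),
    ("Andrew", 10249),
    ("Nancy", 10250),
    ("Janet", 10251),
    ("Nancy", 10252),
    ("Andrew", 10253),
]

# RETURN e.name, count(*) AS cnt: the name is the grouping key.
counts = Counter(name for name, _order in sold)

# ORDER BY cnt DESC LIMIT 10
top = counts.most_common(10)
```

There is no separate grouping step to write: naming the key and asking for a count is enough, which mirrors how the Cypher query reads.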

All this takes some getting used to, but you’ll probably find that you like it. In programming, simpler is almost always better.

Neo4j graph analytics and graph algorithms

Graph analytics and graph algorithms help you to understand the organization and dynamics of complex systems. These can be applied globally to discover the overall nature of networks and model the behavior of intricate systems, and locally — possibly in real time — to provide a focused view of relationships between specific data points, as shown in the figure below.

Neo4j provides five path-finding and traversal algorithms including parallel depth-first and breadth-first searches, four centrality algorithms including PageRank, and six clustering algorithms including Louvain Modularity. Louvain Modularity is often used for fraud ring detection.

Figure: Neo4j shortest paths from Wilbur Ross, Jr. to Anthony Grant Blumberg (image credit: IDG)

This figure shows a graph algorithm (allShortestPaths) in action on the Paradise Papers data in Neo4j Desktop. Here we show the shortest connections between Wilbur Ross, Jr., the U.S. Secretary of Commerce, and Anthony Blumberg, then CEO of ConvergEx Group LLC, a broker-dealer. All paths between them go through Appleby Trust, an offshore legal services provider operating in Bermuda, the Cayman Islands, and other tax havens. Blumberg was cited by the SEC for illegal practices.
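What an all-shortest-paths query computes can be sketched with a breadth-first search in Python. This is a textbook approach, not Neo4j's implementation, and the five-node graph with placeholder names A through E is invented for the example.

```python
from collections import deque


def all_shortest_paths(graph, start, goal):
    """Return every minimum-length path from start to goal in an
    undirected graph given as {node: set_of_neighbors}."""
    dist = {start: 0}
    parents = {start: []}          # predecessors that lie on some shortest path
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for nxt in graph[node]:
            if nxt not in dist:                    # first time reached
                dist[nxt] = dist[node] + 1
                parents[nxt] = [node]
                queue.append(nxt)
            elif dist[nxt] == dist[node] + 1:      # another equally short route
                parents[nxt].append(node)
    if goal not in dist:
        return []

    def build(node):
        if node == start:
            return [[start]]
        return [path + [node] for p in parents[node] for path in build(p)]

    return sorted(build(goal))


graph = {
    "A": {"B", "C"},
    "B": {"A", "D"},
    "C": {"A", "D"},
    "D": {"B", "C", "E"},
    "E": {"D"},
}
paths = all_shortest_paths(graph, "A", "E")
```

Here both shortest paths from A to E pass through D, much as every path in the figure passes through a single intermediary.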

Neo4j performance and scalability

While benchmarking Neo4j in a meaningful way is not really possible for me as a reviewer, the company provided several metrics based on its own tests and on customer experience. For example, Neo4j Inc. has compared the performance of the Union-Find and PageRank algorithms in Neo4j and Apache Spark GraphX. The data set contained 1.47 billion relationships and 41.65 million nodes extracted from Twitter. Neo4j outperformed GraphX by roughly a factor of two on Union-Find and roughly a factor of four on PageRank, using clusters of 128 CPUs.
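For readers unfamiliar with Union-Find, it is a way of grouping nodes into connected components. The Python sketch below is a generic single-threaded version with path halving, written for illustration; it says nothing about how Neo4j or GraphX implement or parallelize the algorithm, and the node names are placeholders.

```python
def connected_components(nodes, relationships):
    """Group nodes into connected components with a union-find structure."""
    parent = {n: n for n in nodes}

    def find(x):
        # Walk to the root, shortening the path along the way (path halving).
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in relationships:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a   # union: merge the two components

    groups = {}
    for n in nodes:
        groups.setdefault(find(n), []).append(n)
    return sorted(sorted(g) for g in groups.values())


components = connected_components(
    ["a", "b", "c", "d", "e"],
    [("a", "b"), ("b", "c"), ("d", "e")],
)
```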

In a customer deployment, Neo4j replaced an Oracle RAC cluster to calculate optimum room pricing for Marriott Hotels and demonstrated 10 times the transaction rate on half the hardware. The Neo4j system at Marriott can perform 300 million pricing operations per day.

Every node in a Neo4j high availability cluster contains the database and a cluster management component, and the cluster can be accessed through a load balancer. The full graph is replicated to each instance of the cluster, and the read capacity of each HA cluster increases linearly with the number of server instances. Neo4j can commit tens of thousands of writes per second while maintaining fully ACID transactions.

In a Neo4j causal cluster, a new Neo4j Enterprise feature, a core cluster of read-write servers is combined with one or more asynchronously updated clusters of read replicas. Any application is guaranteed causal consistency, meaning that it is guaranteed to read at least its own writes, even when hardware and networks fail. The read replicas in a causal cluster may be geographically distributed to improve query performance for users near the replicas.

Neo4j use cases

Neo4j has been used successfully for fraud detection, real-time recommendations, master data management, and network and IT operations. It has also been used for investigative journalism, to analyze both the Panama Papers and the Paradise Papers.

In the fraud detection area, a graph database can quickly reveal abnormal situations, such as a single IP address using many credit card numbers belonging to multiple people. For e-commerce, it helps a great deal if such fraud detection can be done in real time.
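The pattern described above (one IP address, many cards, several cardholders) amounts to a fan-out check. Here is a rough Python sketch of that check; the transactions, the reserved documentation IP addresses, and the threshold of three cards are all hypothetical.

```python
from collections import defaultdict

# (ip_address, card_number, cardholder) tuples; all values are made up.
transactions = [
    ("203.0.113.7", "card-1", "Ann"),
    ("203.0.113.7", "card-2", "Raj"),
    ("203.0.113.7", "card-3", "Li"),
    ("198.51.100.4", "card-4", "Sam"),
    ("198.51.100.4", "card-4", "Sam"),
]

cards_by_ip = defaultdict(set)
holders_by_ip = defaultdict(set)
for ip, card, holder in transactions:
    cards_by_ip[ip].add(card)
    holders_by_ip[ip].add(holder)

THRESHOLD = 3  # hypothetical cutoff for "many" cards

# Flag IPs that used many distinct cards belonging to more than one person.
suspicious = sorted(
    ip
    for ip in cards_by_ip
    if len(cards_by_ip[ip]) >= THRESHOLD and len(holders_by_ip[ip]) > 1
)
```

In a graph database the same question is a short pattern match from an IP node out to card nodes and on to their owners, with no batch regrouping of transaction rows.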

Neo4j has been used by Walmart to suggest products to customers based on their preferences, in real time. eBay has used Neo4j for a real-time courier/package routing solution, and reported it “to be literally thousands of times faster than our prior MySQL solution, with queries that require 10 to 100 times less code.”

Powerful connections

Neo4j is both the original graph database and the continued leader in the graph database market. After working with it and looking at some of its case studies, I can see why it continues to attract both open source users and paid enterprise customers.

In its latest Enterprise incarnation, Neo4j has scalability and survivability that rival those of CockroachDB, although that isn’t true of the open source version of Neo4j, which doesn’t cluster. As a native graph database with ad hoc properties, Neo4j can explicitly express relationships between entities and capture a variety of information for different nodes without creating sparse rows or a multitude of join tables. That makes Neo4j vastly more efficient than SQL or NoSQL databases for tasks that look at networks of related items, such as fraud detection.

One of the human costs of replacing a SQL database with Neo4j is education: learning the Cypher query language, the two libraries, and graph database design. (A similar statement could be made about most NoSQL databases.) While there is quite a bit of carry-over from SQL to Cypher, especially in WHERE clauses, the Cypher MATCH statement is quite different from a SQL SELECT statement because it acts on graph patterns rather than tables.

Whether Neo4j will be an improvement on your existing relational system will depend very much on how “graph-y” your problems and data sets are. If your table rows tend to be sparsely populated, and your queries tend to involve heavily nested joins, then a graph database is probably right for you, and Neo4j would be an excellent choice.

Cost: Community Edition: Free open source. Enterprise Edition: Free for development and startups. Per-machine Enterprise subscription licensing has two tiers, four cores and 24 cores; expandable bundles typically start with three-machine clusters and include support services.

Platform: Windows, MacOS, Linux (Debian and Red Hat), and Docker.

At a Glance
  • Neo4j is vastly more efficient than SQL or NoSQL databases for tasks that look at networks of related items, but the graph model and Cypher query language will require learning.

    Pros

    • Native graph storage and native graph engine
    • Supports ACID properties
    • Has cluster support and runtime failover
    • Better performance than relational databases for “graph-y” applications
    • Open source version available

    Cons

    • Cypher query language is not exactly SQL and takes some learning
    • Graph database design is different from relational database design
    • Open source engine does not support clustering

Copyright © 2018 IDG Communications, Inc.