Is it faster for your application to write large amounts of data to Neo4j in one large, batched query, or in many, much smaller ones?

One official Neo4j blog post on query tuning prefers the former, saying you should “batch your writes” at the application layer using one large UNWIND query.

On the other hand, that same blog post also says “a number of small optimized queries always run faster than one long, un-optimized query”. It advises devs to “avoid long cypher queries (30-40 lines)”. What does this mean though? It’s very possible to write fast queries that are 30 lines long and slow queries that are 2 lines long. How much is “a number”? Why 30 lines and not 20?

I thought the explanations weren’t written as clearly as they could have been, so it’s time for an experiment: let’s compare loading large amounts of data into Neo4j (1) in a batched way using UNWIND versus (2) with many smaller queries, and find out for ourselves. If you don’t feel like reading through the rest of this, I’ll tell you my conclusion up front: in this experiment, I found that UNWIND was at least 900 times faster than running multiple smaller transactions. Further, it scales linearly with the number of items you throw at it: it takes about 1 second to load 70,000 items and about 100 seconds to load 7,000,000 items.

Experiment setup

Let’s make some fake data. For starters, let’s create some objects to represent 70,000 AWS principal to S3 bucket pairs.

from random import randint

def make_fake_data(items_to_make):
    # Build a list of fake AWS principal to S3 bucket mappings.
    res = []
    for _ in range(items_to_make):
        res.append({
            'principal_arn': f'arn:aws:iam::1234:role/{randint(0, 2400)}',
            'resource_arn': f'arn:aws:s3:::{randint(0, 4000)}'
        })
    return res

# 70,000 fake items in the list
full_data = make_fake_data(70000)

Note that this code sets an upper bound of 2400 unique AWS roles and 4000 unique S3 buckets.
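One more piece of setup: the snippets below assume a neo4j_session object already exists. Here’s a minimal sketch of how one might be created with the Python driver used in this era of code; the Bolt URI and credentials are placeholders, not values from the original experiment.

from neo4j import GraphDatabase

# Placeholder connection details; point these at your own Neo4j instance.
driver = GraphDatabase.driver('bolt://localhost:7687', auth=('neo4j', 'password'))
neo4j_session = driver.session()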

Approach 1 - UNWIND

The code

def unwind():
    # {Mapping} is the parameter placeholder syntax from older (pre-4.0) Cypher;
    # newer Neo4j versions write this as $Mapping.
    query = """
        UNWIND {Mapping} as mapping
        MERGE (principal:AWSPrincipal{arn:mapping.principal_arn})
        MERGE (resource:S3Bucket{arn:mapping.resource_arn})
        MERGE (principal)-[r:CAN_READ]->(resource)
        """
    print(f"UNWINDing {len(full_data)} items")
    neo4j_session.run(
        query,
        Mapping=full_data,
    )

This batches up the data in our Python application and writes it all to the graph in a single query.
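To make the UNWIND semantics a bit more concrete, here is a toy query, separate from the benchmark, that uses the same pre-4.0 parameter style to unwind a two-element list into one row per element:

toy = neo4j_session.run(
    'UNWIND {Items} as item RETURN item.principal_arn AS principal',
    Items=[
        {'principal_arn': 'arn:aws:iam::1234:role/0'},
        {'principal_arn': 'arn:aws:iam::1234:role/1'},
    ],
)
print([record['principal'] for record in toy])
# ['arn:aws:iam::1234:role/0', 'arn:aws:iam::1234:role/1']

In the real query, each of those rows then feeds the three MERGE clauses, all inside one request to the database. Now let’s set up our timer and run the experiment!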

Results

# Start with an empty DB
neo4j_session.run('match (n) detach delete n')

# Run the experiment
unwind_time = timeit.timeit(unwind, number=1)
print(f"--> unwind = {unwind_time} seconds")

Let’s run the code:

➜  python unwind_vs_each.py
UNWINDing 70000 items
--> unwind = 1.086337526 seconds

1 second! Pretty good! Let’s see how a non-batched, multiple-smaller-queries approach does.

Approach 2 - Multiple smaller transactions

The code

def each():
    query = """
        MERGE (principal:AWSPrincipal{arn:{principal_arn}})
        MERGE (resource:S3Bucket{arn:{resource_arn}})
        MERGE (principal)-[r:CAN_READ]->(resource)
        """

    print(f"Performing for-each on {len(full_data)} items")
    # Each call to session.run() here is its own auto-commit transaction.
    for d in full_data:
        neo4j_session.run(
            query,
            principal_arn=d['principal_arn'],
            resource_arn=d['resource_arn'],
        )

The key difference here is that the Python code iterates through each item and loads each one as a separate transaction.

It was hard for me to decide on a hypothesis because

  • Approach 2 could be slower because there is a lot of overhead in setting up and tearing down each individual transaction (see the back-of-envelope sketch after this list).

or

  • Approach 2 could be faster because that Neo4j blog post says many smaller queries are faster than large ones.
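
To put a rough number on that first point, here is a back-of-envelope sketch; the per-transaction overhead figure is an assumption for illustration, not something I measured:

# Assumed (not measured) cost of one auto-commit transaction: a network round
# trip plus commit overhead, call it ~10 ms.
items = 70_000
assumed_overhead_per_txn_s = 0.010
print(f"~{items * assumed_overhead_per_txn_s:.0f} seconds of overhead alone")  # ~700 seconds

If that guess is anywhere near right, the per-transaction overhead alone would dwarf approach 1’s total runtime.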

Let’s find out the answer.

Experiment results

This involves the same setup as last time but with a different function:

# Start with an empty DB
neo4j_session.run('match (n) detach delete n')

# Run the experiment
foreach_time = timeit.timeit(each, number=1)
print(f"--> for-each result = {foreach_time} seconds")

Result

I ran the above and it did not finish. I let it run for at least 15 minutes before I got sleepy and went to bed. When I woke up and checked, it had crashed, and I have no idea how far along it got or how long it took to reach that point:

Performing for-each on 70000 items
Traceback (most recent call last):
  File "/Users/alex/.virtualenvs/env7/lib/python3.7/site-packages/neobolt/direct.py", line 408, in _send
    self.socket.sendall(data)
  File "/usr/local/opt/python/Frameworks/Python.framework/Versions/3.7/lib/python3.7/ssl.py", line 1034, in sendall
    v = self.send(byte_view[count:])
  File "/usr/local/opt/python/Frameworks/Python.framework/Versions/3.7/lib/python3.7/ssl.py", line 1003, in send
    return self._sslobj.write(data)
BrokenPipeError: [Errno 32] Broken pipe

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "unwind_vs_each.py", line 82, in <module>
    each_time = timeit.timeit(each, number=1)
  File "/usr/local/opt/python/Frameworks/Python.framework/Versions/3.7/lib/python3.7/timeit.py", line 232, in timeit
    return Timer(stmt, setup, timer, globals).timeit(number)
  File "/usr/local/opt/python/Frameworks/Python.framework/Versions/3.7/lib/python3.7/timeit.py", line 176, in timeit
    timing = self.inner(it, self.timer)
  File "<timeit-src>", line 6, in inner
  File "unwind_vs_each.py", line 59, in each
    aws_update_tag='1595116803'
  File "/Users/alex/.virtualenvs/env7/lib/python3.7/site-packages/neo4j/__init__.py", line 502, in run
    self._connection.send()
  File "/Users/alex/.virtualenvs/env7/lib/python3.7/site-packages/neobolt/direct.py", line 388, in send
    self._send()
  File "/Users/alex/.virtualenvs/env7/lib/python3.7/site-packages/neobolt/direct.py", line 414, in _send
    self.server.address))
neobolt.exceptions.ServiceUnavailable: Failed to write to defunct connection Address(host='localhost', port=7687) (Address(host='127.0.0.1', port=7687))

Interesting! This is the same error observed in Cartography issue #170, which we’ve had a hard time reproducing. I would have thought that multiple smaller writes would avoid weird socket problems like this because there’s less data to write in each interaction, but anyway, this is a topic for another day.

So for this experiment, I observed my code running for at least 15 minutes, which already makes it at least 900 times slower than approach 1’s UNWIND method: 15 minutes is 900 seconds, versus the roughly 1 second UNWIND took.

Let’s try 7,000,000 items

Ok, we’ve already learned that batching data at the application layer with UNWIND is way faster than individual transactions, but can it handle loading millions of items? I’ll first need to adjust the fake data generator to allow up to 240,000 unique AWS roles and 400,000 unique S3 buckets.

def make_fake_data(items_to_make):
    res = []
    for _ in range(items_to_make):
        res.append({
            'principal_arn': f'arn:aws:iam::1234:role/{randint(0, 240000)}',
            'resource_arn': f'arn:aws:s3:::{randint(0, 400000)}'
        })
    return res

# 7,000,000 fake items in the list
full_data = make_fake_data(7000000)

Now let’s run the experiment again…

➜  Desktop python unwind_vs_each.py
UNWINDing 7000000 items
--> unwind = 107.39662291500001 seconds

Under 2 minutes to process 7 million items!
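
One caveat, since it isn’t something the timed run above explores: sending all 7,000,000 items as a single parameter means the whole payload travels in one request. If you wanted to cap the payload size, one option is to reuse the exact same UNWIND query but feed it fixed-size slices of the list. The helper name and chunk size below are my own choices, not part of the experiment.

def unwind_in_chunks(data, chunk_size=100000):
    # Same UNWIND query as before, sent one slice at a time so no single
    # request has to carry the entire list.
    query = """
        UNWIND {Mapping} as mapping
        MERGE (principal:AWSPrincipal{arn:mapping.principal_arn})
        MERGE (resource:S3Bucket{arn:mapping.resource_arn})
        MERGE (principal)-[r:CAN_READ]->(resource)
        """
    for i in range(0, len(data), chunk_size):
        chunk = data[i:i + chunk_size]
        print(f"UNWINDing items {i} through {i + len(chunk)}")
        neo4j_session.run(query, Mapping=chunk)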

Conclusion

So there you have it: batching writes at the application layer with UNWIND is much faster than running one query per item. In this experiment it was at least 900 times faster, and it scaled to 7 million items in under 2 minutes.