The optimal Lazy List size

As you know by now, MicroStream has the concept of a Lazy object reference. The data within such a lazy object is not loaded into memory when the StorageManager starts. It is a proxy that reads the data from storage only when it is accessed.
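To make the proxy idea concrete, here is a minimal sketch of a load-on-first-access holder. The `LazySketch` class and its loader are hypothetical stand-ins for illustration only; this is not MicroStream's `Lazy` implementation, which also handles persistence.

```java
import java.util.List;
import java.util.function.Supplier;

// Hypothetical stand-in for a lazy reference; NOT MicroStream's Lazy class.
final class LazySketch<T> {

    private final Supplier<T> loader; // assumed to read the data from storage
    private T value;                  // stays null until the first access
    private boolean loaded;
    private int loads;                // counts how often the loader was called

    LazySketch(Supplier<T> loader) {
        this.loader = loader;
    }

    // Loads the data on first access and returns the cached value afterwards.
    T get() {
        if (!loaded) {
            value = loader.get();
            loaded = true;
            loads++;
        }
        return value;
    }

    boolean isLoaded() {
        return loaded;
    }

    int loadCount() {
        return loads;
    }

    public static void main(String[] args) {
        LazySketch<List<Integer>> numbers = new LazySketch<>(() -> List.of(1, 2, 3));
        System.out.println(numbers.isLoaded());  // nothing has been loaded yet
        System.out.println(numbers.get());       // first access triggers the load
        System.out.println(numbers.loadCount()); // later calls reuse the cached value
    }
}
```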

The question now is: if we have a large list of data, what is the ideal size of the individual Lazy Lists? This blog tries to answer that question and offers some considerations for your project.

List Size matters

If we have a large list of data, we need to make some segmentation decisions. I think it is obvious that neither loading the list all at once nor loading each item individually is a viable solution.

Loading the list all at once will of course load far too much data into memory. You probably only need a part of the entire list to respond to the user request. And when your list is huge, it might not even fit into the JVM heap unless you make the heap extremely large.

Loading each item separately looks good from a memory usage point of view, but each Lazy reference also takes up some memory. Millions of Lazy references, all loaded when the StorageManager is created, also add up to a significant amount of memory.

So we need something between 1 and the entire list of several million. But is there an ideal size?

In code, we will have something like

Map<EntityDiscrimination, Lazy<List<Entity>>> data

where we segment our data, held in instances of the Entity class, based on some grouping represented by the EntityDiscrimination key of the Map. This allows us to access a subset of our data corresponding to some criteria in the EntityDiscrimination.
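As a rough illustration of this segmentation, the sketch below groups items into buckets by a key function. `SegmentationSketch` and `segment` are hypothetical names, and plain lists are used so the example runs without MicroStream; in a real application each bucket would be wrapped in a Lazy reference (for example via `Lazy.Reference(list)`) to get the Map shape shown above.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

// Hypothetical sketch: same shape as Map<EntityDiscrimination, Lazy<List<Entity>>>,
// but with plain lists instead of Lazy references so it runs on its own.
final class SegmentationSketch {

    // Puts every item into the bucket chosen by the discriminator function.
    static <K, E> Map<K, List<E>> segment(List<E> items, Function<E, K> discriminator) {
        Map<K, List<E>> segments = new LinkedHashMap<>();
        for (E item : items) {
            segments.computeIfAbsent(discriminator.apply(item), key -> new ArrayList<>())
                    .add(item);
        }
        return segments;
    }

    public static void main(String[] args) {
        List<Integer> numbers = List.of(1, 2, 3, 4, 5, 6, 7);
        // The remainder modulo 3 plays the role of the EntityDiscrimination key.
        Map<Integer, List<Integer>> byRemainder = segment(numbers, n -> n % 3);
        System.out.println(byRemainder);
    }
}
```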

Testing it out

What better way is there than testing a scenario in which we vary the number of items in a list? We carried out the following experiment: we have a list of 10 million numbers for which we need to calculate the average value.

In this test, we access all data, which is probably not your use case. But it gives us some insight into the performance impact of the Lazy List size for exactly the same set of data.

We timed the cases with 5 elements per List, then 10, 50, and so on up to 10 million, at which point all items sit in a single Lazy List.
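The shape of that experiment can be sketched as follows. This is not the benchmark code used for the measurements; `ChunkedAverageSketch` is a hypothetical, storage-free stand-in that uses a much smaller data set and plain lists, and only shows how the same numbers are split into lists of a given size and then averaged.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of the test setup, without MicroStream or timing code.
final class ChunkedAverageSketch {

    // Splits the numbers into consecutive lists of at most chunkSize elements.
    static List<List<Integer>> chunk(List<Integer> numbers, int chunkSize) {
        if (chunkSize <= 0) {
            throw new IllegalArgumentException("chunkSize must be positive");
        }
        List<List<Integer>> chunks = new ArrayList<>();
        for (int from = 0; from < numbers.size(); from += chunkSize) {
            int to = Math.min(from + chunkSize, numbers.size());
            chunks.add(new ArrayList<>(numbers.subList(from, to)));
        }
        return chunks;
    }

    // Visits every chunk, as the test does, and returns the overall average.
    static double average(List<List<Integer>> chunks) {
        long sum = 0;
        long count = 0;
        for (List<Integer> chunk : chunks) {
            for (int value : chunk) {
                sum += value;
            }
            count += chunk.size();
        }
        return count == 0 ? 0.0 : (double) sum / count;
    }

    public static void main(String[] args) {
        List<Integer> numbers = new ArrayList<>();
        for (int i = 1; i <= 1_000; i++) {
            numbers.add(i);
        }
        for (int size : new int[] {5, 10, 50, 500, 1_000}) {
            List<List<Integer>> chunks = chunk(numbers, size);
            System.out.println(size + " per list: " + chunks.size()
                    + " lists, average " + average(chunks));
        }
    }
}
```

The average is the same for every list size; only the number of lists that must be visited changes, which is the variable the test measures.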

We measured both the time required to start the StorageManager and the time to access the data within the Lazy List(s). I’m not showing the actual values, only the graph, since the numbers themselves don’t really matter, only the trend in the results.

Having a lot of small lists is not performant, and that is no surprise. When our program accesses a Lazy List, the list must be loaded into memory, which means scanning the data storage for the required data. Doing that a million times instead of a thousand times makes a noticeable performance difference.

From the graph, we also see that Lists of 500 items or more are the most efficient. There is a very slight indication that very large lists become a bit less efficient again, but that is difficult to prove with the current setup.

Choose a Large list?

As indicated earlier, larger lists are more memory efficient since we have fewer Lazy instances, which also take up some memory. But we also mentioned that loading a large list into memory might be problematic, because it uses a lot of memory and you probably don’t need all the data for a user request.

So there is some kind of optimal value that will be application dependent.

The following graph shows a similar scenario, but here we processed a single List of 5, 10, and so on up to 10 million items in one Lazy reference.

The results show a nice linear relationship between the list size and the time required to handle it. So the loading algorithm within MicroStream is of order O(n), which is not too bad.

Therefore, it is no surprise that a single Lazy reference with fewer items loads faster.

Choose the optimal size

So what are the criteria to choose the optimal size?

First, you must make a segmentation based on an application requirement. That is, make a grouping that makes sense for your scenario. For example, group all the ‘active’ orders per customer so that you do not need to load all orders of that customer.

This might mean that you need different ‘indexes’ for the same data set. You can have different maps that are holding the orders, and you access the one that gives you the data in the most efficient way. And remember, MicroStream works with references so that the same order is only loaded once if accessed through different indexes.
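A minimal sketch of two such indexes over the same orders is shown below. `OrderIndexSketch`, `Order`, and the field names are hypothetical, and plain in-memory maps are used; the sketch only demonstrates that both indexes point at the same object instance, not how MicroStream persists or loads it.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: two in-memory indexes that share the same Order instances.
final class OrderIndexSketch {

    // Hypothetical order entity with just the two fields used for indexing.
    static final class Order {
        final String customerId;
        final String status;

        Order(String customerId, String status) {
            this.customerId = customerId;
            this.status = status;
        }
    }

    final Map<String, List<Order>> byCustomer = new HashMap<>();
    final Map<String, List<Order>> byStatus = new HashMap<>();

    // Registers the same Order reference in both indexes; nothing is copied.
    void add(Order order) {
        byCustomer.computeIfAbsent(order.customerId, key -> new ArrayList<>()).add(order);
        byStatus.computeIfAbsent(order.status, key -> new ArrayList<>()).add(order);
    }

    public static void main(String[] args) {
        OrderIndexSketch index = new OrderIndexSketch();
        index.add(new Order("customer-1", "active"));
        index.add(new Order("customer-1", "closed"));

        Order viaCustomer = index.byCustomer.get("customer-1").get(0);
        Order viaStatus = index.byStatus.get("active").get(0);
        // Both lookups return the very same object, not two copies.
        System.out.println(viaCustomer == viaStatus);
    }
}
```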

Smaller lists load faster, but do not make them too small, because the memory consumed by the Lazy references grows. An average size of 500 to 1000 is probably the most efficient in a wide range of scenarios.
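To get a feel for this trade-off, the following sketch counts how many lists, and therefore Lazy references, a data set needs for a given list size. `LazyCountSketch` is a hypothetical helper doing plain arithmetic; it does not estimate the memory a Lazy reference actually uses.

```java
// Hypothetical helper: how many Lazy references does a given list size imply?
final class LazyCountSketch {

    // One Lazy reference per list, so this is a ceiling division.
    static long lazyReferenceCount(long totalItems, long listSize) {
        if (listSize <= 0) {
            throw new IllegalArgumentException("listSize must be positive");
        }
        return (totalItems + listSize - 1) / listSize;
    }

    public static void main(String[] args) {
        long totalItems = 10_000_000L; // same data set size as in the test
        for (long listSize : new long[] {5, 500, 1_000, 10_000_000}) {
            System.out.println(listSize + " items per list: "
                    + lazyReferenceCount(totalItems, listSize) + " Lazy references");
        }
    }
}
```

For 10 million items, lists of 5 need 2,000,000 Lazy references, while lists of 500 need 20,000 and lists of 1000 need 10,000.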

You can use Apache Lucene, for example, to efficiently define the Map value for the index you are using when it is not based on simple values like Customer Id and Order status.

Conclusion

The Lazy option of MicroStream allows you to load only the data that is needed, and not everything when the StorageManager starts. However, using a lot of Lazy instances takes up memory and should also be avoided. Larger blocks have the drawback that they load more slowly and probably read more data than needed. Some tests indicate that a List of 500 to 1000 values is ideal, but this might be different for your data.
