Some Data Model considerations for searching and indexing with MicroStream

Avatar

In this blog, we go a bit deeper into some techniques to handle a large set of data with MicroStream.
As you know, the idea of MicroStream is that you use the Java objects loaded into the JVM memory as a database. You can quickly get to the object you want by using pure java techniques like the Stream API and getters. There is no need for any specific query language and no retrieval of the data from an external system, including the latency overhead and conversion requirements.

But when you have a lot of data, you probably cannot hold all your data in memory and you need to make use of the Lazy object of MicroStream to load some of your data on demand into memory.

I’ll describe a few options for searching and indexing when you have a lot of data in the next sections.

Lazy concept

The Lazy concept of MicroStream makes it possible to keep some data at the data storage during start-up. When the StorageManager is created, it scans the data storage that you have configured and loads the data into the memory of the JVM. When it encounters an instance of the Lazy Object (class of one.microstream.reference.Lazy), it does not load the actual data that are referenced by the Lazy instance. The Lazy object acts as a proxy and only when we ask the proxy for its data, through the .get() method of Lazy, the data is loaded into the memory from the data storage.

private Lazy<List<String>> someData = Lazy.Reference(new ArrayList<>());

 

The above definition defines the list of String as lazy loaded and thus not available after the start of the StorageManager. Only when we perform the call someData.get(), we have the list of String available. And when we do not longer need the data in memory, a call to someData.clear() makes sure the data can be removed from memory by the garbage collector. Also without the explicit call to .clear(), data will be cleaned up when there is a need for more free memory blocks within the JVM since Lazy is implemented using the WeakReference class of Java.

Use case with a lot of data

In this blog, we assume the following set of data, which is only a part of the entire data model of course.

We have a large list of customers, each having an address. And for the customers, we keep track of their orders and the products that are ordered.

When we have like 7 million customers, each having 10 or more orders, the data set for these objects alone already become rather large and will not be easy to fit into the JVM memory unless we assign rather large heap sizes.

Store orders

Storing the orders is actually the easiest problem that we need to solve in this case. And it can be seen as an example of using indexes with MicroStream.

private Map<Long, Lazy<List<Order>>> orders;

The indexing solution for accessing the orders of a certain customer very efficiently can be done using a Map variable like orders.

The key value of the map is the unique identification of your customer, like a number, and the value is a Lazy list of Orders. This means that actually none of the orders are loaded when the StorageManager is started.

When we need to have access to the orders of a specific customer, we access the Java map and request the orders through the Lazy reference.

orders.get(customerId).get();

This technique makes that we do not need to load several millions of orders, only those that are required to perform the current request.

Searching

When we need to search through the entire list of customers, based on some letters from their names, things become more complicated. If we can have the data in memory, some simple Stream API statement can give us the result. But that requires probably too much RAM to process. We can use the Lazy object from MicroStream and we will look also at an alternative like a Lucene index and the CQEngine library.

Putting our 7 million customers on a List and searching through it can be as easy as

data.stream()
.filter(c -> c.getLastName().startsWith(letters))
.collect(Collectors.toList());

 

Some data around this solution
– In our example, the 7 million items take 440 Mb of the memory (according to a heap dump)
– The search is done in 224 milliseconds.

This is of course really fast but since everything is in memory, it cannot be scaled since we need more and more memory when the list grows.

Lazy-loaded MicroStream data

We can use the Lazy concept that we already used for loading only the Orders of a certain customer. But now load the customer data in ‘chunks’. By splitting up the list by the first letter of the customer, we only need to load a smaller amount of data into memory. As a consequence, we lose some performance as we need to load the data on demand when we search for customers with a certain name.

private final Map<Character, Lazy<List<Customer>>> customers = new HashMap<>();

And the search becomes

List<Customer> customers = root.getCustomers(firstLetter);

List<Customer> result = customers.stream().filter(c -> c.getLastName().startsWith(letters))
.collect(Collectors.toList());

So, if we search for all names starting with ‘Sch’, we can load all customers that have ‘S’, and then loop over this list with a Stream.
Due to the loading time we have now, the required time is just under 4 seconds on my machine.
We can slightly improve this by not using the first letter as the key of the map, but the first 2 letters. The actual lists are smaller and are loaded slightly faster into memory by the Lazy loading mechanism of MicroStream.

CQEngine

There exist other frameworks that help you search in large collections. One of them is the Collection Query Engine which promises you faster results than the standard Java API. I tried out version 3.6 where you can either store the data in memory, or somewhere through a Persister. Since we already decided that the memory itself would not be an option, otherwise we could go for the Stream API solution, we opted for a DiskPersistence.

These are the statements to generate the collection. It resulted in a file that is 700 Mb in size.

IndexedCollection<Customer> data = new ConcurrentIndexedCollection<>(persistence);
data.addIndex(HashIndex.onAttribute(LAST_NAME));

data.addAll(customers);

And later on, we can perform our query

Query<Customer> query = QueryFactory.startsWith(LAST_NAME, letters);
List<Customer> found = data.retrieve(query).stream().collect(Collectors.toList());

This search took a bit more than 10 seconds. So performance is less than the MicroStream solution.

Lucene Index

Another popular framework is Apache Lucene, an open-source search engine solution. It is capable of indexing Java objects so that you can later search for some text within the properties you have specified. It can return the id of the Customer object for example that matches our search criteria.

This is how you can create a Lucene Index for a Java Object.

Index.DocumentPopulator<Customer> documentPopulator = (document, customer) -> {
  document.add(new StoredField("id", customer.getId()));
  document.add(new TextField("lastName", customer.getLastName(), Field.Store.YES));
};

Index.EntityMatcher<Customer> entityMatcher = LuceneWithLazy::createCustomer;

Index<Customer> index = new Index<>(Customer.class, documentPopulator, entityMatcher);

 

You define the DocumentPopulator that feeds the data into the search engine. Here we specify that we can search on the field lastName and that we also store the id of the Customer object so that we can identify the found objects uniquely later on.
The EntityMatcher instance creates a Customer object again for the matched entries. In this case, we have only the id and the lastName of the Customer. We could add more fields of the customer object to the document, but where do we stop? Should we then also add the address info to it because we want to show it on the search result screen?

More on that in a moment, let us first have a look at how we can perform a search with Lucene.

WildcardQuery query = new WildcardQuery(new Term("lastName", "*" + letters + "*"));
Map<String, List<Long>> temp = index.search(query, Integer.MAX_VALUE)
   .stream()
   .collect(Collectors.groupingBy(c -> String.valueOf(c.getLastName().charAt(0))
      , Collectors.mapping(Customer::getId, Collectors.toList())));

 

The WildcardQuery allows us to search where the letters appear in the names of the customer. We look now for letters that appear everywhere, not just at the start of the name.
If we have enough with the customer id, then this Lucene query is fast to give us the info that we need. In our case, it can give us the list of matching customers in about 1.5 seconds.

But with the Customers id alone, we don’t have enough info to show to the customer. We need all customer fields and the address info so that the end user can decide which customer he intended. This means that we need to look into the Lazy loaded chunks maintained by MicroStream to get our info.
That is the reason why the example creates a Map where the key is the first letter of the matching last names and the value is the list of customer ids. Based on that first letter we can load only the chunk, or chunks, we are interested in.

Depending on how many chunks we need to load, the additional time to get our customer info for the search screen takes 4 seconds additionally (only 1 chunk to load) or 100 seconds (all chunks are needed). Of course, we can implement the restrictions that we abort the search when we need to load too many chunks and ask the user to refine his search criteria, like typing more letters that need to be matched.

To improve the performance of this search by Lucene and lookup through Lazy objects, the Next Generation MicroStream implementation has a new architecture for storing the data. This cell architecture can load Lazy referenced data much faster.

Conclusion

When your dataset becomes really large, and you can’t fit all objects into memory anymore, the Lazy concept of MicroStream helps you to load the data that you need for a specific request. Using a Java Map is similar to an index on the database, you only load the corresponding data like the Orders for a certain customer.
Using the Map structure can also help you to split a very large list like Customer data. Searching for customers where their name starts with certain letters, becomes then again efficient as w don’t need to load and unload each chunk of the large collection. When this is not possible, a Lucene index can help us limit the number of chunks, Lazy collections, that we need to load. The next-generation MicroStream implementation has a more efficient way of loading these Lazy collections also.

 

Total
0
Shares
Leave a Reply

Your email address will not be published. Required fields are marked *

Previous Post

MicroStream with Helidon MP

Next Post
MicroStream joins the Eclipse Foundation

MicroStream is now an Eclipse Foundation Member

Related Posts
Secured By miniOrange