Skip to content
  • Products
    • MicroStream for Java
    • MicroStream for Android
    • MicroStream Serialization
  • Community
    • GitHub
    • Discussion
    • Contribute
    • Issues
    • StackOverflow
  • Resources
    • Get Started
    • Docs
    • Examples
    • Videos
    • Blog
    • Release & Update Plan
  • Services
    • Support
    • Training
  • Company
    • About us
    • Contact
    • Imprint
  • Subscribe to our Newsletter
  • User icon Sign in

Cookie

We use cookies to make it easier to use and to further improve our service.

If you have given us your consent, you can revoke it at any time in the data protection declaration.

Use all cookies Manage cookies Reject cookies

Cookie

?
?
?
Use all cookiessave settingsReject cookiesView Privacy Policy

Necessary

Necessary cookies help make a website usable by enabling basic functions such as page navigation and access to secure areas of the website. The website cannot function properly without these cookies.

Name Anbieter Zweck Ablauf Typ
c3kie_removeCookieWindow microstream.one Saves the consent status of the user whether the cookie window should be displayed. 1 Jahr HTML Local Storage
c3kie_googleAnalytics microstream.one Saves the consent status of the user as to whether Google Analytics is allowed to run. 1 Jahr HTML Local Storage
c3kie_googleAnalytics microstream.one Saves the consent status of the user as to whether Google Analytics is allowed to run. 1 Jahr HTML Local Storage
c3kie_tagManager microstream.one Saves the consent status of the user as to whether Google Tag Manager is allowed to run. 1 Jahr HTML Local Storage
c3kie_facebook microstream.one Saves the consent status of the user as to whether Facebook is allowed to run. 1 Jahr HTML Local Storage
c3kie_matomo microstream.one Saves the consent status of the user as to whether Matomo is allowed to run. 1 Jahr HTML Local Storage
c3kie_youtube microstream.one Saves the consent status of the user as to whether YouTube is allowed to run. 1 Jahr HTML Local Storage

Statistics

Statistics cookies help website owners understand how visitors interact with websites by collecting and reporting information anonymously.

Name Anbieter Zweck Ablauf Typ
_dc_gtm_ Google Used by Google Analytics to limit the request rate. 1 Jahr HTTP Cookie
_gid_ Google Registers a unique ID that is used to generate statistical data on how the visitor uses the website. 2 Jahre HTTP Cookie
_gcl_au Google Used to send data to Google Analytics about the device and visitor behavior. Captures the visitor across devices and marketing channels. Session Pixel Tracker
_gat_ Google Used to store a unique user ID. 1 Tag HTTP Cookie
_gat_gtag_UA_ Google Used to store a unique user ID. 1 Tag HTTP Cookie
yt-player-headers-readable YouTube Used to determine the optimal video quality based on the visitor's device and network settings. Persistent HTML Local Storage

Marketing

Marketing cookies are used to track visitors across websites. The intent is to show ads that are relevant and engaging to the individual user, making them more valuable to publishers and third party advertising providers.

Name Anbieter Zweck Ablauf Typ
VISITOR_INFO1_LIVE YouTube Tries to estimate the range of users on pages with built-in YouTube videos. 179 Tage HTTP Cookie
YSC YouTube Registers a unique ID to keep statistics on which videos from YouTube the user has seen. Session HTTP Cookie
yt.innertube::nextId YouTube Registers a unique ID to keep statistics on which videos from YouTube the user has seen. Persistent HTML Local Storage
yt.innertube::requests YouTube Registers a unique ID to keep statistics on which videos from YouTube the user has seen. Persistent HTML Local Storage
ytidb::LAST_RESULT_ENTRY_KEY YouTube Saves the user's video player settings with embedded YouTube video. Persistent HTML Local Storage
yt-remote-cast-available YouTube Saves the user's video player settings with embedded YouTube video. Session HTML Local Storage
yt-remote-cast-installed YouTube Saves the user's video player settings with embedded YouTube video. Session HTML Local Storage
yt-remote-connected-devices YouTube Saves the user's video player settings with embedded YouTube video. Persistent HTML Local Storage
yt-remote-device-id YouTube Saves the user's video player settings with embedded YouTube video. Persistent HTML Local Storage
yt-remote-fast-check-period YouTube Saves the user's video player settings with embedded YouTube video. Session HTML Local Storage
yt-remote-session-app YouTube Saves the user's video player settings with embedded YouTube video. Session HTML Local Storage
yt-remote-session-name YouTube Saves the user's video player settings with embedded YouTube video. Session HTML Local Storage

Some Data Model considerations for searching and indexing with MicroStream

Dec122022
Blog

In this blog, we go a bit deeper into some techniques to handle a large set of data with MicroStream.
As you know, the idea of MicroStream is that you use the Java objects loaded into the JVM memory as a database. You can quickly get to the object you want by using pure java techniques like the Stream API and getters. There is no need for any specific query language and no retrieval of the data from an external system, including the latency overhead and conversion requirements.

But when you have a lot of data, you probably cannot hold all your data in memory and you need to make use of the Lazy object of MicroStream to load some of your data on demand into memory.

I’ll describe a few options for searching and indexing when you have a lot of data in the next sections.

Lazy concept

The Lazy concept of MicroStream makes it possible to keep some data at the data storage during start-up. When the StorageManager is created, it scans the data storage that you have configured and loads the data into the memory of the JVM. When it encounters an instance of the Lazy Object (class of one.microstream.reference.Lazy), it does not load the actual data that are referenced by the Lazy instance. The Lazy object acts as a proxy and only when we ask the proxy for its data, through the .get() method of Lazy, the data is loaded into the memory from the data storage.

private Lazy<List<String>> someData = Lazy.Reference(new ArrayList<>());

 

The above definition defines the list of String as lazy loaded and thus not available after the start of the StorageManager. Only when we perform the call someData.get(), we have the list of String available. And when we do not longer need the data in memory, a call to someData.clear() makes sure the data can be removed from memory by the garbage collector. Also without the explicit call to .clear(), data will be cleaned up when there is a need for more free memory blocks within the JVM since Lazy is implemented using the WeakReference class of Java.

Use case with a lot of data

In this blog, we assume the following set of data, which is only a part of the entire data model of course.

We have a large list of customers, each having an address. And for the customers, we keep track of their orders and the products that are ordered.

When we have like 7 million customers, each having 10 or more orders, the data set for these objects alone already become rather large and will not be easy to fit into the JVM memory unless we assign rather large heap sizes.

Store orders

Storing the orders is actually the easiest problem that we need to solve in this case. And it can be seen as an example of using indexes with MicroStream.

private Map<Long, Lazy<List<Order>>> orders;

The indexing solution for accessing the orders of a certain customer very efficiently can be done using a Map variable like orders.

The key value of the map is the unique identification of your customer, like a number, and the value is a Lazy list of Orders. This means that actually none of the orders are loaded when the StorageManager is started.

When we need to have access to the orders of a specific customer, we access the Java map and request the orders through the Lazy reference.

orders.get(customerId).get();

This technique makes that we do not need to load several millions of orders, only those that are required to perform the current request.

Searching

When we need to search through the entire list of customers, based on some letters from their names, things become more complicated. If we can have the data in memory, some simple Stream API statement can give us the result. But that requires probably too much RAM to process. We can use the Lazy object from MicroStream and we will look also at an alternative like a Lucene index and the CQEngine library.

Putting our 7 million customers on a List and searching through it can be as easy as

data.stream()
.filter(c -> c.getLastName().startsWith(letters))
.collect(Collectors.toList());

 

Some data around this solution
– In our example, the 7 million items take 440 Mb of the memory (according to a heap dump)
– The search is done in 224 milliseconds.

This is of course really fast but since everything is in memory, it cannot be scaled since we need more and more memory when the list grows.

Lazy-loaded MicroStream data

We can use the Lazy concept that we already used for loading only the Orders of a certain customer. But now load the customer data in ‘chunks’. By splitting up the list by the first letter of the customer, we only need to load a smaller amount of data into memory. As a consequence, we lose some performance as we need to load the data on demand when we search for customers with a certain name.

private final Map<Character, Lazy<List<Customer>>> customers = new HashMap<>();

And the search becomes

List<Customer> customers = root.getCustomers(firstLetter);

List<Customer> result = customers.stream().filter(c -> c.getLastName().startsWith(letters))
.collect(Collectors.toList());

So, if we search for all names starting with ‘Sch’, we can load all customers that have ‘S’, and then loop over this list with a Stream.
Due to the loading time we have now, the required time is just under 4 seconds on my machine.
We can slightly improve this by not using the first letter as the key of the map, but the first 2 letters. The actual lists are smaller and are loaded slightly faster into memory by the Lazy loading mechanism of MicroStream.

CQEngine

There exist other frameworks that help you search in large collections. One of them is the Collection Query Engine which promises you faster results than the standard Java API. I tried out version 3.6 where you can either store the data in memory, or somewhere through a Persister. Since we already decided that the memory itself would not be an option, otherwise we could go for the Stream API solution, we opted for a DiskPersistence.

These are the statements to generate the collection. It resulted in a file that is 700 Mb in size.

IndexedCollection<Customer> data = new ConcurrentIndexedCollection<>(persistence);
data.addIndex(HashIndex.onAttribute(LAST_NAME));

data.addAll(customers);

And later on, we can perform our query

Query<Customer> query = QueryFactory.startsWith(LAST_NAME, letters);
List<Customer> found = data.retrieve(query).stream().collect(Collectors.toList());

This search took a bit more than 10 seconds. So performance is less than the MicroStream solution.

Lucene Index

Another popular framework is Apache Lucene, an open-source search engine solution. It is capable of indexing Java objects so that you can later search for some text within the properties you have specified. It can return the id of the Customer object for example that matches our search criteria.

This is how you can create a Lucene Index for a Java Object.

Index.DocumentPopulator<Customer> documentPopulator = (document, customer) -> {
  document.add(new StoredField("id", customer.getId()));
  document.add(new TextField("lastName", customer.getLastName(), Field.Store.YES));
};

Index.EntityMatcher<Customer> entityMatcher = LuceneWithLazy::createCustomer;

Index<Customer> index = new Index<>(Customer.class, documentPopulator, entityMatcher);

 

You define the DocumentPopulator that feeds the data into the search engine. Here we specify that we can search on the field lastName and that we also store the id of the Customer object so that we can identify the found objects uniquely later on.
The EntityMatcher instance creates a Customer object again for the matched entries. In this case, we have only the id and the lastName of the Customer. We could add more fields of the customer object to the document, but where do we stop? Should we then also add the address info to it because we want to show it on the search result screen?

More on that in a moment, let us first have a look at how we can perform a search with Lucene.

WildcardQuery query = new WildcardQuery(new Term("lastName", "*" + letters + "*"));
Map<String, List<Long>> temp = index.search(query, Integer.MAX_VALUE)
   .stream()
   .collect(Collectors.groupingBy(c -> String.valueOf(c.getLastName().charAt(0))
      , Collectors.mapping(Customer::getId, Collectors.toList())));

 

The WildcardQuery allows us to search where the letters appear in the names of the customer. We look now for letters that appear everywhere, not just at the start of the name.
If we have enough with the customer id, then this Lucene query is fast to give us the info that we need. In our case, it can give us the list of matching customers in about 1.5 seconds.

But with the Customers id alone, we don’t have enough info to show to the customer. We need all customer fields and the address info so that the end user can decide which customer he intended. This means that we need to look into the Lazy loaded chunks maintained by MicroStream to get our info.
That is the reason why the example creates a Map where the key is the first letter of the matching last names and the value is the list of customer ids. Based on that first letter we can load only the chunk, or chunks, we are interested in.

Depending on how many chunks we need to load, the additional time to get our customer info for the search screen takes 4 seconds additionally (only 1 chunk to load) or 100 seconds (all chunks are needed). Of course, we can implement the restrictions that we abort the search when we need to load too many chunks and ask the user to refine his search criteria, like typing more letters that need to be matched.

To improve the performance of this search by Lucene and lookup through Lazy objects, the Next Generation MicroStream implementation has a new architecture for storing the data. This cell architecture can load Lazy referenced data much faster.

Conclusion

When your dataset becomes really large, and you can’t fit all objects into memory anymore, the Lazy concept of MicroStream helps you to load the data that you need for a specific request. Using a Java Map is similar to an index on the database, you only load the corresponding data like the Orders for a certain customer.
Using the Map structure can also help you to split a very large list like Customer data. Searching for customers where their name starts with certain letters, becomes then again efficient as w don’t need to load and unload each chunk of the large collection. When this is not possible, a Lucene index can help us limit the number of chunks, Lazy collections, that we need to load. The next-generation MicroStream implementation has a more efficient way of loading these Lazy collections also.

 

Category: BlogDecember 12, 2022Leave a comment
Tags: Data PersistenceHow to

Author: Rudy De Busscher

Post navigation

PreviousPrevious post:MicroStream with Helidon MPNextNext post:MicroStream is now an Eclipse Foundation Member

Related Posts

A peek into upcoming version 8.0
January 26, 2023
MicroStream as Spring Cache provider
January 5, 2023
MicroStream joins the Eclipse Foundation
MicroStream is now an Eclipse Foundation Member
December 21, 2022
MicroStream with Helidon MP
November 17, 2022
Store State of a Background process with MicroStream
October 31, 2022
Updates to the Spring Boot integration in version 7.1
October 14, 2022

Leave a Reply Cancel reply

Your email address will not be published. Required fields are marked *

Post comment

Search
Categories
  • Blog(40)
  • Conferences(13)
  • General(3)
  • News(18)
  • Releases(12)
Related Posts
  • A peek into upcoming version 8.0
    January 26, 2023
  • MicroStream as Spring Cache provider
    January 5, 2023
  • MicroStream joins the Eclipse Foundation
    MicroStream is now an Eclipse Foundation Member
    December 21, 2022
  • Some Data Model considerations for searching and indexing with MicroStream
    December 12, 2022
  • MicroStream with Helidon MP
    November 17, 2022
  • Store State of a Background process with MicroStream
    October 31, 2022
  • Updates to the Spring Boot integration in version 7.1
    October 14, 2022
  • Using Cloud storage platforms with MicroStream
    September 29, 2022
  • MicroStream Best-Practice
    September 29, 2022
  • Release of MicroStream version 7.1
    September 16, 2022
Tags
Android AWS Cache Cloud connector Cluster Communication Conference Data Persistence Discussion event feature framework free fundamentals Germany GitHub Hackathon Heldion How to InfoQ Integration Java Java-native Java 17 JAX JCON JVM Con logging LTS Micronaut MicroProfile Microservices MicroStream MicroStream 2 MicroStream 4 MicroStream 6 New Features online Open Liberty Open Source Oracle release Serialization Spring Boot Training
MicroStream

Store Java Object Graphs natively, relieved of heavy-weight DBMS Dependencies. Create ultra-fast In- Memory Database Applications & Microservices with Pure Java. The Pure Java Paradigm Shift in Database Development.

Upcoming Event
No news available.

Platforms

  • MicroStream for Java
  • MicroStream for Android

Community

  • GitHub
  • Discussion
  • Contribute
  • Issues
  • StackOverflow
  • MeetUp

Resources

  • Get Started
  • Docs
  • Examples
  • Videos
  • Blog
  • Release plan

Services

  • Support
  • Training

Company

  • About us
  • Contact

Stay Connected

  • Twitter
  • LinkedIn
  • YouTube
  • GitHub
  • StackOverflow
  • MeetUp

Get the latest MicroStream news:

Subscribe

© 2023 MicroStream Software. All rights reserved.
  • Imprint
  • Terms of use
  • Privacy
  • Legal notice

Cookie

We use cookies to make it easier to use and to further improve our service.

If you have given us your consent, you can revoke it at any time in the data protection declaration.

Use all cookies Manage cookies Reject cookies

Cookie

?
?
?
Use all cookiessave settingsReject cookiesView Privacy Policy

Necessary

Necessary cookies help make a website usable by enabling basic functions such as page navigation and access to secure areas of the website. The website cannot function properly without these cookies.

Name Anbieter Zweck Ablauf Typ
c3kie_removeCookieWindow microstream.one Saves the consent status of the user whether the cookie window should be displayed. 1 Jahr HTML Local Storage
c3kie_googleAnalytics microstream.one Saves the consent status of the user as to whether Google Analytics is allowed to run. 1 Jahr HTML Local Storage
c3kie_googleAnalytics microstream.one Saves the consent status of the user as to whether Google Analytics is allowed to run. 1 Jahr HTML Local Storage
c3kie_tagManager microstream.one Saves the consent status of the user as to whether Google Tag Manager is allowed to run. 1 Jahr HTML Local Storage
c3kie_facebook microstream.one Saves the consent status of the user as to whether Facebook is allowed to run. 1 Jahr HTML Local Storage
c3kie_matomo microstream.one Saves the consent status of the user as to whether Matomo is allowed to run. 1 Jahr HTML Local Storage
c3kie_youtube microstream.one Saves the consent status of the user as to whether YouTube is allowed to run. 1 Jahr HTML Local Storage

Statistics

Statistics cookies help website owners understand how visitors interact with websites by collecting and reporting information anonymously.

Name Anbieter Zweck Ablauf Typ
_dc_gtm_ Google Used by Google Analytics to limit the request rate. 1 Jahr HTTP Cookie
_gid_ Google Registers a unique ID that is used to generate statistical data on how the visitor uses the website. 2 Jahre HTTP Cookie
_gcl_au Google Used to send data to Google Analytics about the device and visitor behavior. Captures the visitor across devices and marketing channels. Session Pixel Tracker
_gat_ Google Used to store a unique user ID. 1 Tag HTTP Cookie
_gat_gtag_UA_ Google Used to store a unique user ID. 1 Tag HTTP Cookie
yt-player-headers-readable YouTube Used to determine the optimal video quality based on the visitor's device and network settings. Persistent HTML Local Storage

Marketing

Marketing cookies are used to track visitors across websites. The intent is to show ads that are relevant and engaging to the individual user, making them more valuable to publishers and third party advertising providers.

Name Anbieter Zweck Ablauf Typ
VISITOR_INFO1_LIVE YouTube Tries to estimate the range of users on pages with built-in YouTube videos. 179 Tage HTTP Cookie
YSC YouTube Registers a unique ID to keep statistics on which videos from YouTube the user has seen. Session HTTP Cookie
yt.innertube::nextId YouTube Registers a unique ID to keep statistics on which videos from YouTube the user has seen. Persistent HTML Local Storage
yt.innertube::requests YouTube Registers a unique ID to keep statistics on which videos from YouTube the user has seen. Persistent HTML Local Storage
ytidb::LAST_RESULT_ENTRY_KEY YouTube Saves the user's video player settings with embedded YouTube video. Persistent HTML Local Storage
yt-remote-cast-available YouTube Saves the user's video player settings with embedded YouTube video. Session HTML Local Storage
yt-remote-cast-installed YouTube Saves the user's video player settings with embedded YouTube video. Session HTML Local Storage
yt-remote-connected-devices YouTube Saves the user's video player settings with embedded YouTube video. Persistent HTML Local Storage
yt-remote-device-id YouTube Saves the user's video player settings with embedded YouTube video. Persistent HTML Local Storage
yt-remote-fast-check-period YouTube Saves the user's video player settings with embedded YouTube video. Session HTML Local Storage
yt-remote-session-app YouTube Saves the user's video player settings with embedded YouTube video. Session HTML Local Storage
yt-remote-session-name YouTube Saves the user's video player settings with embedded YouTube video. Session HTML Local Storage