Some Data Model considerations for searching and indexing with MicroStream

In this blog, we go a bit deeper into some techniques to handle a large set of data with MicroStream.
As you know, the idea of MicroStream is that you use the Java objects loaded into the JVM memory as a database. You can quickly get to the object you want by using pure Java techniques like the Stream API and getters. There is no need for a specific query language and no retrieval of data from an external system, with its latency overhead and conversion requirements.

But when you have a lot of data, you probably cannot hold it all in memory, and you need to make use of the Lazy object of MicroStream to load some of your data into memory on demand.

In the next sections, I’ll describe a few options for searching and indexing when you have a lot of data.

Lazy concept

The Lazy concept of MicroStream makes it possible to leave some data in the data storage during start-up. When the StorageManager is created, it scans the data storage that you have configured and loads the data into the memory of the JVM. When it encounters an instance of Lazy (the class one.microstream.reference.Lazy), it does not load the actual data referenced by the Lazy instance. The Lazy object acts as a proxy, and only when we ask the proxy for its data, through the .get() method of Lazy, is the data loaded into memory from the data storage.

private Lazy<List<String>> someData = Lazy.Reference(new ArrayList<>());

The above definition marks the list of Strings as lazily loaded and thus not in memory after the start of the StorageManager. Only when we call someData.get() is the list of Strings available. And when we no longer need the data in memory, a call to someData.clear() makes sure the data can be removed from memory by the garbage collector. Even without an explicit call to .clear(), the data will be cleaned up when the JVM needs more free memory, since Lazy is implemented using Java’s WeakReference class.
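A minimal usage sketch of this lifecycle, assuming a running StorageManager instance called storageManager:

// Access triggers loading from the data storage when the list is not in memory yet.
List<String> values = someData.get();
values.add("new entry");

// Store the modified list; MicroStream persists the changed object graph.
storageManager.store(values);

// Optionally release the in-memory data; the next .get() reloads it from the storage.
someData.clear();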

Use case with a lot of data

In this blog, we assume the following set of data, which is of course only part of the entire data model.

We have a large list of customers, each having an address. And for the customers, we keep track of their orders and the products that are ordered.

When we have some 7 million customers, each having 10 or more orders, the data set for these objects alone already becomes rather large and will not easily fit into the JVM memory unless we assign rather large heap sizes.
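To keep the examples concrete, here is a minimal sketch of the entities assumed in this blog (the field names are illustrative; the real model contains more fields):

public class Customer {
    private long id;
    private String lastName;
    private Address address;

    public long getId() { return id; }
    public String getLastName() { return lastName; }
    public Address getAddress() { return address; }
}

public class Order {
    private long id;
    private List<Product> products;
}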

Store orders

Storing the orders is actually the easiest problem to solve in this case, and it can serve as an example of using indexes with MicroStream.

private Map<Long, Lazy<List<Order>>> orders;

An efficient index for accessing the orders of a certain customer can be built with a Map variable like the orders field above.

The key of the map is the unique identification of your customer, like a number, and the value is a Lazy list of Orders. This means that none of the orders are actually loaded when the StorageManager is started.

When we need access to the orders of a specific customer, we look up the entry in the Java map and request the orders through the Lazy reference.

orders.get(customerId).get();

This technique means that we do not need to load several million orders, only those required to perform the current request.
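Adding a new order for a customer follows the same pattern. A sketch, assuming the same storageManager and an Order instance newOrder:

List<Order> customerOrders = orders
        .computeIfAbsent(customerId, id -> Lazy.Reference(new ArrayList<>()))
        .get();
customerOrders.add(newOrder);

// Store the modified list; the map itself only needs to be stored
// when a new key was added.
storageManager.store(customerOrders);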

Searching

When we need to search through the entire list of customers based on some letters from their names, things become more complicated. If we can keep all the data in memory, a simple Stream API statement can give us the result, but that probably requires too much RAM. We can use the Lazy object from MicroStream, and we will also look at alternatives: a Lucene index and the CQEngine library.

Putting our 7 million customers in a List and searching through it can be as easy as

data.stream()
    .filter(c -> c.getLastName().startsWith(letters))
    .collect(Collectors.toList());

Some data around this solution:
– In our example, the 7 million items take 440 MB of memory (according to a heap dump).
– The search is done in 224 milliseconds.

This is of course really fast, but since everything is in memory, it does not scale: we need more and more memory as the list grows.

Lazy-loaded MicroStream data

We can reuse the Lazy concept that we already used for loading only the Orders of a certain customer, but now load the customer data in ‘chunks’. By splitting up the list by the first letter of the customer’s last name, we only need to load a smaller amount of data into memory. As a consequence, we lose some performance, as we need to load the data on demand when we search for customers with a certain name.

private final Map<Character, Lazy<List<Customer>>> customers = new HashMap<>();

And the search becomes

List<Customer> customers = root.getCustomers(firstLetter);

List<Customer> result = customers.stream()
    .filter(c -> c.getLastName().startsWith(letters))
    .collect(Collectors.toList());

So, if we search for all names starting with ‘Sch’, we load all customers whose last name starts with ‘S’ and then filter this list with a Stream.
Due to the loading time we have now, the required time is just under 4 seconds on my machine.
We can slightly improve this by using not the first letter but the first 2 letters as the key of the map. The individual lists are then smaller and are loaded slightly faster into memory by the Lazy loading mechanism of MicroStream.
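The getCustomers method used in the search above is a small helper around the map. A possible implementation (the uppercase normalization and the empty-list fallback are assumptions):

public List<Customer> getCustomers(char firstLetter) {
    Lazy<List<Customer>> chunk = customers.get(Character.toUpperCase(firstLetter));
    // .get() on the Lazy reference loads the chunk from the data storage when needed.
    return chunk == null ? Collections.emptyList() : chunk.get();
}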

CQEngine

Other frameworks exist that help you search in large collections. One of them is CQEngine (Collection Query Engine), which promises faster results than the standard Java API. I tried out version 3.6, where you can either keep the data in memory or store it elsewhere through a Persistence implementation. Since we already decided that keeping everything in memory is not an option (otherwise we could go for the Stream API solution), we opted for DiskPersistence.
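The snippets below reference a LAST_NAME attribute and a persistence instance. A sketch of how these could be defined with CQEngine’s API (the primary-key attribute and the file name are assumptions):

import com.googlecode.cqengine.attribute.Attribute;
import com.googlecode.cqengine.attribute.SimpleAttribute;
import com.googlecode.cqengine.persistence.disk.DiskPersistence;
import com.googlecode.cqengine.query.option.QueryOptions;
import java.io.File;

// Attribute that exposes the last name for indexing and querying.
static final Attribute<Customer, String> LAST_NAME =
        new SimpleAttribute<Customer, String>("lastName") {
            public String getValue(Customer customer, QueryOptions options) {
                return customer.getLastName();
            }
        };

// DiskPersistence requires a unique primary-key attribute to store objects on disk.
static final SimpleAttribute<Customer, Long> CUSTOMER_ID =
        new SimpleAttribute<Customer, Long>("id") {
            public Long getValue(Customer customer, QueryOptions options) {
                return customer.getId();
            }
        };

DiskPersistence<Customer, Long> persistence =
        DiskPersistence.onPrimaryKeyInFile(CUSTOMER_ID, new File("customers.dat"));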

These are the statements to generate the collection. It resulted in a file of 700 MB.

IndexedCollection<Customer> data = new ConcurrentIndexedCollection<>(persistence);
data.addIndex(HashIndex.onAttribute(LAST_NAME));

data.addAll(customers);

And later on, we can perform our query

Query<Customer> query = QueryFactory.startsWith(LAST_NAME, letters);
List<Customer> found = data.retrieve(query).stream().collect(Collectors.toList());

This search took a bit more than 10 seconds, so performance is worse than the MicroStream solution.

Lucene Index

Another popular framework is Apache Lucene, an open-source search engine solution. It is capable of indexing Java objects so that you can later search for text within the properties you have specified. It can, for example, return the ids of the Customer objects that match our search criteria.

This is how you can create a Lucene index for a Java object.

Index.DocumentPopulator<Customer> documentPopulator = (document, customer) -> {
  document.add(new StoredField("id", customer.getId()));
  document.add(new TextField("lastName", customer.getLastName(), Field.Store.YES));
};

Index.EntityMatcher<Customer> entityMatcher = LuceneWithLazy::createCustomer;

Index<Customer> index = new Index<>(Customer.class, documentPopulator, entityMatcher);

You define the DocumentPopulator that feeds the data into the search engine. Here we specify that we can search on the field lastName and that we also store the id of the Customer object, so that we can uniquely identify the found objects later on.
The EntityMatcher instance recreates a Customer object for the matched entries. In this case, we only have the id and the lastName of the Customer. We could add more fields of the customer object to the document, but where do we stop? Should we also add the address info because we want to show it on the search result screen?

More on that in a moment; let us first have a look at how we can perform a search with Lucene.

WildcardQuery query = new WildcardQuery(new Term("lastName", "*" + letters + "*"));
Map<String, List<Long>> temp = index.search(query, Integer.MAX_VALUE)
   .stream()
   .collect(Collectors.groupingBy(c -> String.valueOf(c.getLastName().charAt(0))
      , Collectors.mapping(Customer::getId, Collectors.toList())));

The WildcardQuery allows us to search for the letters anywhere in the name of the customer, not just at the start.
If the customer id is all we need, this Lucene query gives us that info quickly: in our case, the list of matching customers in about 1.5 seconds.

But with the customer id alone, we don’t have enough info to show on the screen. We need all customer fields and the address info so that the end user can decide which customer they intended. This means that we need to look into the Lazy-loaded chunks maintained by MicroStream to get our info.
That is the reason why the example creates a Map where the key is the first letter of the matching last names and the value is the list of customer ids. Based on that first letter, we can load only the chunk, or chunks, we are interested in.
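With that map, a sketch of the lookup: for each first letter, load the corresponding chunk and pick out the matching customers by id (root.getCustomers is the Lazy chunk helper shown earlier).

List<Customer> result = new ArrayList<>();
temp.forEach((firstLetter, ids) -> {
    // Loads only the chunk for this first letter through the Lazy reference.
    List<Customer> chunk = root.getCustomers(firstLetter.charAt(0));
    Set<Long> wanted = new HashSet<>(ids);
    chunk.stream()
         .filter(c -> wanted.contains(c.getId()))
         .forEach(result::add);
});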

Depending on how many chunks we need to load, getting the customer info for the search screen takes an additional 4 seconds (only 1 chunk to load) up to 100 seconds (all chunks are needed). Of course, we can implement a restriction that aborts the search when too many chunks would need to be loaded and asks the user to refine the search criteria, for example by typing more letters that need to be matched.
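Such a restriction can be as simple as checking the number of chunks before loading anything (the threshold of 3 chunks is an arbitrary assumption):

// Abort when the search would require loading too many chunks into memory.
if (temp.size() > 3) {
    throw new IllegalStateException("Too many results, please refine your search criteria");
}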

To improve the performance of this search through Lucene and the lookup through Lazy objects, the next-generation MicroStream implementation has a new architecture for storing the data. This cell architecture can load Lazy-referenced data much faster.

Conclusion

When your dataset becomes really large and you can’t fit all objects into memory anymore, the Lazy concept of MicroStream helps you load only the data that you need for a specific request. Using a Java Map is similar to an index in a database: you only load the corresponding data, like the Orders for a certain customer.
The Map structure can also help you split up a very large list, like the Customer data. Searching for customers whose name starts with certain letters then becomes efficient again, as we don’t need to load and unload every chunk of the large collection. When this is not possible, a Lucene index can help us limit the number of chunks (Lazy collections) that we need to load. The next-generation MicroStream implementation also has a more efficient way of loading these Lazy collections.
