In this article, I’ll focus on how batch indexing is implemented in Hibernate
Search via the package
org.hibernate.search.batchindexing. These details can
be found on GitHub in the Hibernate/hibernate-search repository, under folder
impl vs spi
Under this folder, there are 2 folders, called impl and
spi. So, what are
the differences between them?
This package is the current implementation of mass indexing. And it is the core of this article.
In this package, there are 2 interfaces available for implementing our own mass indexer. The interface
MassIndexerWithTenant can be used for class assignment for a tenant in multi-tenant architectures. And the interface
MassIndexerFactory contains methods that can be used to create a mass indexer.
Different classes in the current implementation
There are 12 classes in the package:
makes sure that several different
BatchIndexingWorkspace instances can be started concurrently, sharing the same batch-backend and index writers.
This runnable will prepare a pipeline for batch indexing of entities, managing the life cycle of several thread pools.
Value holder for the services needed by the mass indexer to wrap operations in transactions.
MassIndexer implementation used when none is specified in the configuration.
Common parent of all Runnable implementations for the batch indexing: share the code for handling runtime exceptions.
SessionAwareRunnable consumes entity identifiers and produces the corresponding
AddLuceneWork instances, which are forwarded to the index writing backend. It finishes when the queue it is consuming from signals that there are no more identifiers.
This Runnable feeds the indexing queue with the identifiers of all the entities to be indexed. This step in the indexing process is not parallel (it should be done by one thread per type), so that a single transaction is used to define the group of entities to be indexed. The produced identifiers are put in the destination queue grouped in list instances: the reason is to load them in batches in the next step and to reduce contention on the queue.
Prepares and configures a
BatchIndexingWorkspace to start rebuilding the indexes for all entity instances in the database. The type of these entities is either all indexed entities or a subset, always including all subtypes.
Wrap the execution of a
Runnable in a JTA transaction if necessary: if the existing Hibernate Core transaction strategy requires a TransactionManager or no JTA transaction is already started. Unfortunately at this time we need to have access to
Implements a blocking queue capable of storing a poison token to signal consumer threads that the task is finished.
A very simple implementation of
MassIndexerProgressMonitor which uses the logger at INFO level to output speed statistics.
An interface to run
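To make the poison-token idea mentioned in the list above more concrete, here is a minimal sketch in plain Java. It is not the Hibernate Search class: `PoisonQueue`, `producerStopping` and `runDemo` are hypothetical names I chose for illustration, and I assume a single producer and an unbounded `LinkedBlockingQueue`.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

public class PoisonQueueSketch {

    /** Hypothetical queue illustrating the poison-token idea; not the Hibernate Search class. */
    static final class PoisonQueue<T> {
        private static final Object POISON = new Object();
        private final BlockingQueue<Object> queue = new LinkedBlockingQueue<>();

        void put(T item) throws InterruptedException {
            queue.put(item);
        }

        /** Called by the producer once it has nothing more to publish. */
        void producerStopping() throws InterruptedException {
            queue.put(POISON);
        }

        /** Blocks until an item is available; returns null once the poison token is reached. */
        @SuppressWarnings("unchecked")
        T take() throws InterruptedException {
            Object next = queue.take();
            if (next == POISON) {
                queue.put(POISON); // put the token back so other consumers also stop
                return null;
            }
            return (T) next;
        }
    }

    /** Publishes identifiers 1..count, then lets a single consumer thread drain the queue. */
    static List<Long> runDemo(int count) {
        PoisonQueue<Long> queue = new PoisonQueue<>();
        List<Long> consumed = new ArrayList<>();
        Thread consumer = new Thread(() -> {
            try {
                Long id;
                while ((id = queue.take()) != null) {
                    consumed.add(id);
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        consumer.start();
        try {
            for (long id = 1; id <= count; id++) {
                queue.put(id);
            }
            queue.producerStopping();
            consumer.join(); // join() makes the consumer's writes visible here
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return consumed;
    }

    public static void main(String[] args) {
        System.out.println(runDemo(5));
    }
}
```

Putting the token back after reading it is one simple way to let several consumers share the same stop signal; with several producers, a counter of still-active producers would also be needed.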
This Runnable feeds the indexing queue with the identifiers of
all the entities to be indexed. Its core method is
In this function, the monitor is first updated by a simple row-count operation
(TODO: this is useful for my own implementation using JSR 352!). Then, the
results are fetched in chunks and loaded into a destination list (TODO:
this list should have another, more meaningful name). Once a chunk is complete, the list is put into
the queue. Constructing this class requires 9 parameters.
is the target queue where the identifiers will be sent once the production is finished.
is the Hibernate session factory used to load entities.
is the batch size, which defines the number of entities to process per query.
is the class type of the indexed entity. It is used for loading the correct entity. I think it would be better to rename it
indexedClazz. But well, it is not up to me …
is the monitor for the whole batch indexing process. (Is this class missing from the project on GitHub?)
is the limit on the number of results returned from the Hibernate session factory. Set to 0 if there’s no limit.
is the error handler in case of exceptions.
is the fetch size used when loading the identifiers. The
MassIndexer uses a forward-only scrollable result to iterate over the primary keys to be loaded, but MySQL’s JDBC driver will load all values in memory. To avoid this “optimization”, set it to
is the tenant identifier. I cannot understand why it is needed: the entity information is already contained in the producer-consumer queue. (I need more time to figure it out.)
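The batching behaviour described above (fetch identifiers in chunks, group them in a list, push each list to the queue) can be sketched as follows. This is only my simplified reading, not the actual Hibernate Search code: a plain `Iterator<Long>` stands in for the scrollable result, and the names `produceIdentifiers`, `batchSize`, `objectsLimit` and `demo` are hypothetical.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

public class IdentifierBatchingSketch {

    /**
     * Groups identifiers into lists of at most batchSize and puts each list on the queue.
     * objectsLimit == 0 means "no limit", following the convention described in the article.
     */
    static void produceIdentifiers(Iterator<Long> ids, int batchSize, long objectsLimit,
            BlockingQueue<List<Long>> destination) throws InterruptedException {
        List<Long> batch = new ArrayList<>(batchSize);
        long produced = 0;
        while (ids.hasNext() && (objectsLimit == 0 || produced < objectsLimit)) {
            batch.add(ids.next());
            produced++;
            if (batch.size() == batchSize) {
                destination.put(batch);            // hand over a full chunk
                batch = new ArrayList<>(batchSize); // start a fresh list
            }
        }
        if (!batch.isEmpty()) {
            destination.put(batch); // last, partially filled chunk
        }
    }

    /** Runs the producer over identifiers 1..total and returns the queued chunks in order. */
    static List<List<Long>> demo(int total, int batchSize, long objectsLimit) {
        List<Long> ids = new ArrayList<>();
        for (long id = 1; id <= total; id++) {
            ids.add(id);
        }
        BlockingQueue<List<Long>> queue = new LinkedBlockingQueue<>();
        try {
            produceIdentifiers(ids.iterator(), batchSize, objectsLimit, queue);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return new ArrayList<>(queue);
    }

    public static void main(String[] args) {
        System.out.println(demo(7, 3, 0)); // prints [[1, 2, 3], [4, 5, 6], [7]]
    }
}
```

Queuing lists instead of single identifiers means the consumers touch the queue once per chunk, which is the contention argument made in the class description.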
Creates a blocking queue to track the progress of the different identifier producers. (For the moment, it is only used for stopping tasks.)
Prepares and configures a
BatchIndexingWorkspace to start rebuilding the
indexes for all entity instances in the database. The type of these entities
is either all indexed entities or a subset, always including all subtypes.