Syncing More Datasets Simultaneously

The DataImporter runs scheduled tasks that sync data between a data source and Elasticsearch for indirect datasets.
Each dataset sync runs in one thread, which executes until all data required by the configuration is synced between the data source and Elasticsearch. How long this thread runs depends on the amount of data to be synced.
However, if many datasets with a huge number of records need to be synced, multiple threads are executed in parallel. Each thread occupies one CPU core; hence, if the number of running threads equals the number of CPU cores, CPU utilization reaches 100%.
To resolve this issue, a configuration property has been added to the dataimporterproperties.properties file that caps the share of CPU cores used for scheduler threads. This is currently set at 33-40% of the cores; for example, assuming an 8-core CPU, bds.schedulerParalledThreads will be 3. As a result, at most 3 scheduler threads run in parallel, while the other syncs that are due wait in a queue. This way, the CPU is not utilized to 100%.
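The queueing behavior described above can be illustrated with a minimal Java sketch. It is not the DataImporter's actual implementation: the class name, the helper methods, the 0.4 fraction, and the round-down rule are all assumptions made for illustration. It shows only the general technique of a fixed-size thread pool, where at most a set number of tasks run at once and the rest wait in the pool's queue.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

public class SchedulerPoolSketch {

    // Hypothetical helper: derive a thread cap from a core count and a fraction.
    // Rounding down (with a minimum of 1) is an assumption, not the product's rule.
    public static int threadsFor(int cores, double fraction) {
        return Math.max(1, (int) Math.floor(cores * fraction));
    }

    // Submits one simulated sync task per dataset to a fixed-size pool and
    // returns the highest number of tasks observed running at the same time.
    public static int peakConcurrency(int maxParallel, int datasetCount) {
        ExecutorService pool = Executors.newFixedThreadPool(maxParallel);
        AtomicInteger running = new AtomicInteger();
        AtomicInteger peak = new AtomicInteger();

        for (int i = 0; i < datasetCount; i++) {
            pool.submit(() -> {
                int now = running.incrementAndGet();
                peak.accumulateAndGet(now, Math::max);
                try {
                    Thread.sleep(50); // stand-in for the real sync work
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                } finally {
                    running.decrementAndGet();
                }
            });
        }

        // Stop accepting new tasks, then wait for the queued ones to finish.
        pool.shutdown();
        try {
            pool.awaitTermination(30, TimeUnit.SECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return peak.get();
    }

    public static void main(String[] args) {
        int cap = threadsFor(8, 0.4); // 8 cores at an assumed 40% gives 3
        int peak = peakConcurrency(cap, 10);
        System.out.println("Thread cap: " + cap);
        System.out.println("Peak parallel syncs: " + peak);
    }
}
```

With 10 simulated datasets and a cap of 3, the observed peak never exceeds 3; the remaining tasks stay queued until a pool thread becomes free.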
The limitation of this property is that only the specified number of datasets can be synced in parallel.
If more datasets need to be synced simultaneously, the DataImporter can be installed on multiple nodes. The DataImporter can then be started as one or more exclusive DataImporter instances with 33% of CPU cores dedicated to scheduling jobs, while the other DataImporter instances have 100% of CPU cores dedicated to scheduling jobs.