Parallel Job Executions in Talend

Topic: Data Integration

Since Talend is a Java code generator, we can run jobs and subjobs in multiple threads to reduce a job's runtime. Talend Data Integration offers three parallel execution techniques that are good to have at your disposal. Note: these techniques are only available out of the box in Talend's commercial versions and need manual implementation in the Open Studio versions.

Multithreading

If you have multiple subjobs that each load a large amount of data into a database and that do not depend on each other, Talend will run them sequentially, i.e. each subjob waits for the previous one to complete before it starts. This can take a long time, depending on how many subjobs you have. If you enable multithreading, all unconnected subjobs run in parallel (simultaneously), and depending on your hardware and how many threads are created, this can speed up the process considerably.
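
To make the idea concrete, here is a minimal sketch in plain Java of what running independent subjobs side by side looks like. It is not the code Talend generates; the class name `IndependentSubjobs`, the `subjob` helper and the row counts are all hypothetical stand-ins for real database loads.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

class IndependentSubjobs {

    // Hypothetical stand-in for a subjob: pretends to write rowCount rows
    // and returns how many rows it wrote.
    static Callable<Integer> subjob(int rowCount) {
        return () -> {
            int written = 0;
            for (int i = 0; i < rowCount; i++) {
                written++; // a real subjob would insert a row into the database here
            }
            return written;
        };
    }

    // Submits every subjob to a fixed-size thread pool at once,
    // waits for all of them, and returns the total number of rows written.
    static int runInParallel(List<Callable<Integer>> subjobs, int threads) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            int total = 0;
            for (Future<Integer> result : pool.invokeAll(subjobs)) {
                total += result.get();
            }
            return total;
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) throws Exception {
        List<Callable<Integer>> subjobs = new ArrayList<>();
        subjobs.add(subjob(1000));
        subjobs.add(subjob(2000));
        subjobs.add(subjob(3000));
        System.out.println("Rows written: " + runInParallel(subjobs, 3));
    }
}
```

The pool size plays the role of the thread count mentioned above: with fewer threads than subjobs, some subjobs simply wait for a free thread.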

tParallelize component

Previously we talked about how to run multiple unconnected, independent subjobs in parallel. But what if a subjob depends on previous subjobs and needs to wait until one or more of them have finished? The built-in tParallelize component handles this case. By connecting the subjobs to a tParallelize component, we can choose which subjobs run in parallel and which ones need to wait for synchronization.
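
The synchronization idea can be sketched in plain Java as follows. This is only an analogy for what tParallelize coordinates, not its implementation; the subjob names `loadCustomers`, `loadOrders` and `buildReport` are hypothetical.

```java
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.CopyOnWriteArrayList;

class SynchronizedSubjobs {

    // Runs two independent subjobs in parallel, waits for both to finish,
    // then runs a third subjob that depends on them.
    // Returns the subjob names in the order they completed.
    static List<String> run() {
        List<String> finished = new CopyOnWriteArrayList<>();

        // Independent subjobs: started together, no ordering between them.
        CompletableFuture<Void> loadCustomers =
                CompletableFuture.runAsync(() -> finished.add("loadCustomers"));
        CompletableFuture<Void> loadOrders =
                CompletableFuture.runAsync(() -> finished.add("loadOrders"));

        // Synchronization point: the dependent subjob starts only after
        // both parallel subjobs have completed.
        CompletableFuture.allOf(loadCustomers, loadOrders)
                .thenRun(() -> finished.add("buildReport"))
                .join();

        return finished;
    }

    public static void main(String[] args) {
        System.out.println(run());
    }
}
```

The first two entries may appear in either order, but the dependent subjob is always last, which is the guarantee a synchronization step is meant to give.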

Automatic parallelization

Automatic parallelization applies to a single job or subjob. Imagine a job that generates millions of rows; that job alone can take a long time. By enabling automatic parallelization, we can split the job's execution across multiple threads. This is done in four steps: partition, collect, departition and recollect. So, when generating the millions of rows, different threads each handle part of the work, and the results are then collected and combined into a single output.
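
As a rough illustration of the split-and-merge pattern behind these steps, the sketch below partitions a list of rows, transforms each partition on its own thread, and merges the partial results back into one output. It is plain Java under simplifying assumptions, not what Talend generates: the class `PartitionedFlow`, the contiguous chunking and the "double each value" transformation are all hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

class PartitionedFlow {

    // Partition step: split the rows into at most `parts` contiguous chunks.
    static List<List<Integer>> partition(List<Integer> rows, int parts) {
        List<List<Integer>> chunks = new ArrayList<>();
        int chunkSize = (rows.size() + parts - 1) / parts; // ceiling division
        for (int start = 0; start < rows.size(); start += chunkSize) {
            chunks.add(rows.subList(start, Math.min(start + chunkSize, rows.size())));
        }
        return chunks;
    }

    // Processes each chunk on its own thread, then merges the partial
    // results into a single list that preserves the input order.
    static List<Integer> process(List<Integer> rows, int threads) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Callable<List<Integer>>> tasks = new ArrayList<>();
            for (List<Integer> chunk : partition(rows, threads)) {
                tasks.add(() -> {
                    List<Integer> transformed = new ArrayList<>();
                    for (int row : chunk) {
                        transformed.add(row * 2); // placeholder transformation
                    }
                    return transformed;
                });
            }
            // invokeAll returns futures in task order, so merging them
            // sequentially rebuilds one ordered output.
            List<Integer> merged = new ArrayList<>();
            for (Future<List<Integer>> partial : pool.invokeAll(tasks)) {
                merged.addAll(partial.get());
            }
            return merged;
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) throws Exception {
        System.out.println(process(List.of(1, 2, 3, 4, 5), 2));
    }
}
```

Contiguous chunks are used here only because they make the merge trivial; other partitioning schemes, such as hashing on a key column, follow the same overall shape.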