A Spark Streaming job is different to an ordinary Spark job. It runs 24/7 and never stops until you tell it too. Oozie is a really good tool for scheduling and orchestrating Spark jobs, but when it comes down to also making it work with Spark Streaming jobs, things gets a bit tricky.
Consider the following workflow which is an example taken from a disaster recovery process,
1. Run an ordinary spark job (i.e. a ETL process) to recovery the data from a backup.
2. Run some checks to make sure the data recovered is correct
3. Start the Spark Streaming job.
By default, when a job is submitted via spark-submit.sh, the submission processed is locked until the actual Spark job finishes. This is not ideal for Spark Streaming, because it means the workflow itself will never finish. And that’s not the only problem, Oozie itself consumes quite a bit of resources. From my experience, Oozie needs around 2 CPU Cores and 2G of RAM as a minimum to run any Spark Job (1 Core and 1G of RAM per process. Tt uses one process for Oozie itself, and another one for the submission of the Spark job).
Well, the good news is there is an option to tell spark-submit not to wait when it submits a job, which is
And when it’s set to false, the spark-submit submission process will exit and return 0 immediately as soon as the job is submitted, and of course, the Oozie job will exit as well.
Be careful not to blindly use this option everywhere. In the above disaster recovery example, only the Spark Streaming part should use this option, not the disaster recovery itself or you will end up starting the streaming job without the disaster recovery been done at all.
This option is not an obvious one, and I only came cross it when reading https://aws.amazon.com/blogs/big-data/submitting-user-applications-with-spark-submit/. Hope this helps the other having similar issues.