Databricks job cluster error: Could not reach driver of cluster cluster-id.

Question

Databricks job cluster error: Could not reach driver of cluster cluster-id.

González Ternero, David 0

We are experiencing an intermittent error in one of our jobs that always runs at the same time (11:00 PM). On some days, the job fails with the following error:

DriverError: Could not reach driver of cluster <cluster-id>

I have seen in other threads that this issue can sometimes be caused by binary mismatches between libraries installed on the cluster. However, in our case, we are not using any custom-installed libraries beyond the default runtime.

I have also come across similar issues related to the following error:

Failed to start repl ReplId-<id> com.databricks.backend.daemon.driver.PythonDriverLocal$PythonException: Unable to start python kernel for ReplId-<id>, kernel did not start within 80 seconds.

However, in our case, this trace (or any similar one) does not appear in our logs.

Instead, the logs repeatedly show the following errors:

ERROR Datastore: Exception thrown creating StoreManager. See the nested exception Error creating transactional connection factory org.datanucleus.exceptions.NucleusException: Error creating transactional connection factory Caused by: org.datanucleus.exceptions.NucleusException: Attempt to invoke the "HikariCP" plugin to create a ConnectionPool gave an error: Failed to initialize pool: Could not connect to address=(host=REDACTED)(port=REDACTED)(type=master) : Read timed out Caused by: com.zaxxer.hikari.pool.HikariPool$PoolInitializationException: Failed to initialize pool: Could not connect to address=(host=REDACTED)(port=REDACTED)(type=master) : Read timed out

Do you have any idea what might be causing this issue?

2 answers

Your answer

Answer 1

Manoj Kumar Boyini 9,410 Microsoft External Staff Moderator

Hi González Ternero, David,

The issue was caused by driver overload due to too many tasks running in parallel. Each parallel Python task spawns a Python REPL (ipykernel) on the driver, and under high load the REPLs could not start within the default 80‑second timeout. This made the driver unresponsive, resulting in the error “Could not reach driver of cluster”.

Resolution

Reduced job parallelism to limit concurrent REPL creation.
Increased the REPL startup timeout by setting: spark.databricks.driver.ipykernel.launchTimeoutSeconds = 300

After these changes, the job ran consistently without intermittent failures.

Hope this helps, Please let us know if you have any questions and concerns.

If this solution helped resolve your issue, please consider clicking ‘Accept Answer’ or giving it an upvote to help others find it easily.

González Ternero, David 0 Reputation points

2026-02-26T07:27:52.6733333+00:00

Hello, thanks for your answer.

That was our first idea, but after checking the driver metrics, it seems both CPU and memory are far under their limits. CPU utilization is under 10% and Memory is under 40%, so we dont think the driver overloaded at all.
Manoj Kumar Boyini 9,410 Reputation points Microsoft External Staff Moderator

2026-02-26T20:16:29.4366667+00:00
Hi González Ternero, David,

Could you please confirm whether the following two changes were applied on your side:

Reduced job parallelism (to limit the number of Python REPLs being created concurrently)

Increased the REPL startup timeout by setting: spark.databricks.driver.ipykernel.launchTimeoutSeconds = 300

These steps help rule out REPL‑initialization delays as a contributing factor.
Manoj Kumar Boyini 9,410 Reputation points Microsoft External Staff Moderator

2026-02-27T19:33:50.8633333+00:00

Hi González Ternero, David,

Just checking in to see if the information provided earlier was helpful. Please let us know if you need any additional details or assistance, and we’ll gladly continue supporting you.
González Ternero, David 0 Reputation points

2026-03-02T07:30:03.1833333+00:00

Hi, @Manoj Kumar Boyini ,
Thanks for the information.
Unfortunately, after applying both changes (reducing paralallelism and increasing REPL startup timeout) we are still facing the same issue
Manoj Kumar Boyini 9,410 Reputation points Microsoft External Staff Moderator

2026-03-03T18:03:03.04+00:00

Hi González Ternero, David,

You can raise a support ticket and our CSS team can take a look and diagnose with proper tool. Do you have a support plan.?
González Ternero, David 0 Reputation points

2026-03-05T07:38:57.4366667+00:00

Hello Manoj,
Seems like a maintenance for the hive metastore backend finally solved the problem.

Thanks for your help!

Answer 2

The error DriverError: Could not reach driver of cluster <cluster-id> can occur due to several reasons. Since you mentioned that you are not using any custom-installed libraries, the issue might be related to the driver being overloaded or the cluster malfunctioning.

Driver Overload: If the driver is overloaded, it may not be able to respond to requests. You can check the cluster metrics to determine if the driver is heavily loaded. Look for the Server load distribution visualization in the cluster metrics. If the driver is overloaded, it will be indicated by a red block in the visualization. To address this, consider increasing the size of the driver or reducing the concurrency of jobs running on the cluster.
Cluster Malfunctioning: Although rare, a malfunctioning cluster can also lead to this error. Restarting the cluster might resolve the issue. Additionally, you can check the logs for any suspicious entries, particularly in the Event log and Driver logs tabs.
Datastore Issues: The error message regarding the StoreManager could indicate a problem with the underlying data storage or configuration. Ensure that the data store is accessible and properly configured.

Since the job runs at the same time every day, consider if there are any scheduled tasks or resource contention occurring at that time that could impact the cluster's performance.

If the problem persists, you may want to reach out to Databricks support for further assistance, especially if you suspect it could be a deeper issue with the cluster setup or configuration.

References:

Share via

Databricks job cluster error: Could not reach driver of cluster cluster-id.

2 answers

Your answer