How to execute a Databricks job from Azure Data Factory
Hello everyone, let us continue exploring various use cases and how they can be efficiently addressed with Azure cloud services. Today we have a scenario where we need to execute an Azure Databricks job from Azure Data Factory. For instance, we have an Azure file share from our previous cases where we store some files. Now we need to build a pipeline that passes a file to Azure Databricks for further processing and returns the results to Azure Data Factory.
In this scenario, we need to begin by creating a Databricks cluster, and the policy should be set to “Shared Compute”.
In the next step, we need to create a Python script like the one below:
%python
%pip install azure-storage-file-share
%pip install azure-keyvault-secrets azure-identity
%restart_python
from azure.identity import DefaultAzureCredential
from azure.keyvault.secrets import SecretClient
# Read the service principal credentials from environment variables
credential = DefaultAzureCredential()
# Initialize the Key Vault client
secret_client = SecretClient(vault_url="https://***.vault.azure.net/", credential=credential)
secret = secret_client.get_secret("sas")
from azure.core.exceptions import (
    ResourceExistsError,
    ResourceNotFoundError
)
from azure.storage.fileshare import ShareFileClient
# Create a ShareFileClient from the SAS token
share_file_client = ShareFileClient(
    account_url="https://***.file.core.windows.net",
    share_name="test",
    file_path="***/avia.pdf",
    credential=secret.value
)
# Download the file content to local disk
with open("avia.pdf", "wb") as data:
    stream = share_file_client.download_file()
    data.write(stream.readall())
# Read the downloaded file content
with open("avia.pdf", "rb") as f:
    content = f.read()
# ***Doing some processing logic ***
print(content)
# Return the result to the calling pipeline activity
dbutils.notebook.exit(content)
In this example, we use a service principal to access Azure Key Vault and obtain a SAS token for connecting to the Azure file share. Next, we create a ShareFileClient to access the file share. In the final part, we download the file content, perform some processing, and send the results back to the pipeline activity via dbutils.notebook.exit.
When using a service principal, you need to expose the tenant ID, client ID, and client secret as environment variables on the cluster so that DefaultAzureCredential can pick them up.
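As a quick sanity check, you can verify from the notebook that the variables DefaultAzureCredential expects are actually present (a minimal sketch; these are the standard variable names read by the Azure Identity library, set under the cluster's Advanced options > Spark > Environment variables):
import os
# DefaultAzureCredential's EnvironmentCredential reads these variables;
# this check only confirms they are set on the cluster.
required = ["AZURE_TENANT_ID", "AZURE_CLIENT_ID", "AZURE_CLIENT_SECRET"]
missing = [name for name in required if not os.environ.get(name)]
if missing:
    raise RuntimeError(f"Missing environment variables: {', '.join(missing)}")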
P.S. You should never hard-code secrets or store them in plain text. Use the Secrets API 2.0 to manage secrets via the Databricks CLI, and use the Secrets utility (dbutils.secrets) to reference them in notebooks and jobs.
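For example, if you create an Azure Key Vault-backed secret scope, you could read the same SAS token through dbutils.secrets instead of calling Key Vault directly (a minimal sketch; the scope name "kv-scope" is hypothetical and must match a scope you have created):
# Read the SAS token from a secret scope; "kv-scope" is a placeholder name.
sas_token = dbutils.secrets.get(scope="kv-scope", key="sas")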
Let’s switch to Azure Data Factory.
In this example, we add a Databricks Notebook activity that references our Python script.
To do so, we need to create a Databricks linked service that uses a managed identity to connect to our workspace and cluster; the Data Factory's managed identity must be granted access to the Databricks workspace.
The last step is to save the execution results in a pipeline variable, for example with a Set Variable activity and an expression like @activity('Notebook1').output.runOutput (where Notebook1 is the name of your Notebook activity).
So, that is pretty much it. Happy coding.