Managing Azure Databricks with Terraform
Learn how to deploy and manage Azure Databricks using Terraform
Azure Databricks is a fast, easy, and collaborative Apache Spark-based analytics platform. This guide shows you how to manage Databricks using Terraform.
Prerequisites
- Azure subscription
- Terraform installed
- Azure CLI installed
- Basic understanding of Apache Spark and data analytics
Project Structure
```
.
├── main.tf              # Main Terraform configuration file
├── variables.tf         # Variable definitions
├── outputs.tf           # Output definitions
├── terraform.tfvars     # Variable values
└── modules/
    └── databricks/
        ├── main.tf      # Databricks-specific configurations
        ├── variables.tf # Module variables
        ├── clusters.tf  # Cluster configurations
        └── outputs.tf   # Module outputs
```
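Before the examples below, it helps to pin the two providers the guide relies on. A minimal sketch (the version constraints are illustrative):

```hcl
terraform {
  required_providers {
    azurerm = {
      source  = "hashicorp/azurerm"
      version = "~> 3.0"
    }
    databricks = {
      source  = "databricks/databricks"
      version = "~> 1.0"
    }
  }
}

provider "azurerm" {
  # The features block is required, even when empty
  features {}
}
```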
Basic Configuration
Here’s a basic example of setting up Azure Databricks:
```hcl
resource "azurerm_resource_group" "databricks_rg" {
  name     = "databricks-resources"
  location = "eastus"
}

resource "azurerm_databricks_workspace" "workspace" {
  name                = "databricks-workspace"
  resource_group_name = azurerm_resource_group.databricks_rg.name
  location            = azurerm_resource_group.databricks_rg.location
  sku                 = "premium"

  custom_parameters {
    no_public_ip = true
  }

  tags = {
    Environment = "Production"
  }
}
```
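An output makes the deployment easier to consume; for example, exposing the workspace URL (the `workspace_url` attribute returns the host without a scheme):

```hcl
output "databricks_workspace_url" {
  description = "URL of the Databricks workspace"
  value       = "https://${azurerm_databricks_workspace.workspace.workspace_url}"
}
```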
Advanced Features
Network Configuration
Configure virtual network injection:
```hcl
resource "azurerm_virtual_network" "databricks_vnet" {
  name                = "databricks-vnet"
  location            = azurerm_resource_group.databricks_rg.location
  resource_group_name = azurerm_resource_group.databricks_rg.name
  address_space       = ["10.0.0.0/16"]
}

resource "azurerm_subnet" "public" {
  name                 = "public-subnet"
  resource_group_name  = azurerm_resource_group.databricks_rg.name
  virtual_network_name = azurerm_virtual_network.databricks_vnet.name
  address_prefixes     = ["10.0.1.0/24"]

  delegation {
    name = "databricks-delegation"

    service_delegation {
      name = "Microsoft.Databricks/workspaces"
      actions = [
        "Microsoft.Network/virtualNetworks/subnets/join/action",
        "Microsoft.Network/virtualNetworks/subnets/prepareNetworkPolicies/action",
        "Microsoft.Network/virtualNetworks/subnets/unprepareNetworkPolicies/action"
      ]
    }
  }
}

resource "azurerm_subnet" "private" {
  name                 = "private-subnet"
  resource_group_name  = azurerm_resource_group.databricks_rg.name
  virtual_network_name = azurerm_virtual_network.databricks_vnet.name
  address_prefixes     = ["10.0.2.0/24"]

  delegation {
    name = "databricks-delegation"

    service_delegation {
      name = "Microsoft.Databricks/workspaces"
      actions = [
        "Microsoft.Network/virtualNetworks/subnets/join/action",
        "Microsoft.Network/virtualNetworks/subnets/prepareNetworkPolicies/action",
        "Microsoft.Network/virtualNetworks/subnets/unprepareNetworkPolicies/action"
      ]
    }
  }
}

resource "azurerm_databricks_workspace" "secure_workspace" {
  name                = "secure-databricks"
  resource_group_name = azurerm_resource_group.databricks_rg.name
  location            = azurerm_resource_group.databricks_rg.location
  sku                 = "premium"

  custom_parameters {
    no_public_ip        = true
    virtual_network_id  = azurerm_virtual_network.databricks_vnet.id
    public_subnet_name  = azurerm_subnet.public.name
    private_subnet_name = azurerm_subnet.private.name

    # These require azurerm_subnet_network_security_group_association
    # resources to be defined for both subnets
    public_subnet_network_security_group_association_id  = azurerm_subnet_network_security_group_association.public.id
    private_subnet_network_security_group_association_id = azurerm_subnet_network_security_group_association.private.id
  }
}
```
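The secure workspace references NSG association IDs, so both delegated subnets must be associated with a network security group. A sketch, assuming the `azurerm_network_security_group.databricks_nsg` resource defined in the Security Considerations section below:

```hcl
resource "azurerm_subnet_network_security_group_association" "public" {
  subnet_id                 = azurerm_subnet.public.id
  network_security_group_id = azurerm_network_security_group.databricks_nsg.id
}

resource "azurerm_subnet_network_security_group_association" "private" {
  subnet_id                 = azurerm_subnet.private.id
  network_security_group_id = azurerm_network_security_group.databricks_nsg.id
}
```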
Cluster Configuration
Using the Databricks provider:
```hcl
provider "databricks" {
  azure_workspace_resource_id = azurerm_databricks_workspace.workspace.id
}

resource "databricks_cluster" "shared_autoscaling" {
  cluster_name            = "shared-autoscaling"
  spark_version           = "9.1.x-scala2.12"
  node_type_id            = "Standard_DS3_v2"
  autotermination_minutes = 20

  autoscale {
    min_workers = 1
    max_workers = 10
  }

  spark_conf = {
    "spark.databricks.cluster.profile"       = "serverless"
    "spark.databricks.repl.allowedLanguages" = "python,sql"
  }

  custom_tags = {
    # The "serverless" cluster profile requires this tag
    "ResourceClass" = "Serverless"
    "Department"    = "Data Science"
  }
}
```
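Rather than hard-coding the runtime version and node type, the Databricks provider offers data sources that resolve them at plan time; a sketch:

```hcl
# Smallest node type with a local disk available in this workspace
data "databricks_node_type" "smallest" {
  local_disk = true
}

# Latest long-term-support Databricks runtime
data "databricks_spark_version" "latest_lts" {
  long_term_support = true
}
```

These can then replace the literals above, e.g. `spark_version = data.databricks_spark_version.latest_lts.id` and `node_type_id = data.databricks_node_type.smallest.id`.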
Best Practices
- Use Infrastructure as Code for consistent deployments
- Implement proper monitoring and logging
- Use managed identities for enhanced security
- Configure auto-scaling appropriately
- Implement proper backup and disaster recovery
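For the first point, storing Terraform state remotely keeps deployments consistent across a team. A minimal `azurerm` backend sketch (the resource group, storage account, and container names are placeholders you must create beforehand):

```hcl
terraform {
  backend "azurerm" {
    resource_group_name  = "tfstate-rg"        # hypothetical
    storage_account_name = "tfstatestorage001" # hypothetical; must be globally unique
    container_name       = "tfstate"
    key                  = "databricks.terraform.tfstate"
  }
}
```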
Security Considerations
Network Security
Configure network security groups:
```hcl
resource "azurerm_network_security_group" "databricks_nsg" {
  name                = "databricks-nsg"
  location            = azurerm_resource_group.databricks_rg.location
  resource_group_name = azurerm_resource_group.databricks_rg.name

  security_rule {
    name                       = "AllowDBWebAccess"
    priority                   = 100
    direction                  = "Inbound"
    access                     = "Allow"
    protocol                   = "Tcp"
    source_port_range          = "*"
    destination_port_range     = "443"
    source_address_prefix      = "AzureCloud"
    destination_address_prefix = "*"
  }
}
```
Key Vault Integration
```hcl
data "azurerm_client_config" "current" {}

resource "azurerm_key_vault" "databricks_kv" {
  name                     = "databricks-kv" # must be globally unique
  location                 = azurerm_resource_group.databricks_rg.location
  resource_group_name      = azurerm_resource_group.databricks_rg.name
  tenant_id                = data.azurerm_client_config.current.tenant_id
  sku_name                 = "standard"
  purge_protection_enabled = true
}

resource "azurerm_key_vault_access_policy" "databricks" {
  key_vault_id = azurerm_key_vault.databricks_kv.id
  tenant_id    = data.azurerm_client_config.current.tenant_id

  # object_id must be a security principal's object ID, not a resource or
  # resource group ID; granting the current client is shown as an example
  object_id = data.azurerm_client_config.current.object_id

  secret_permissions = [
    "Get", "List"
  ]
}
```
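With the vault in place, the Databricks provider can expose it to notebooks as a Key Vault-backed secret scope. A sketch (creating this scope requires the provider to authenticate with Microsoft Entra ID credentials rather than a personal access token; the scope name is illustrative):

```hcl
resource "databricks_secret_scope" "kv" {
  name = "keyvault-backed"

  keyvault_metadata {
    resource_id = azurerm_key_vault.databricks_kv.id
    dns_name    = azurerm_key_vault.databricks_kv.vault_uri
  }
}
```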
Monitoring and Logging
Configure diagnostics settings:
```hcl
resource "azurerm_monitor_diagnostic_setting" "databricks_diagnostics" {
  name                       = "databricks-diagnostics"
  target_resource_id         = azurerm_databricks_workspace.workspace.id
  log_analytics_workspace_id = azurerm_log_analytics_workspace.workspace.id

  # Recent azurerm versions use enabled_log in place of the deprecated log block
  enabled_log {
    category = "dbfs"
  }

  enabled_log {
    category = "clusters"
  }

  enabled_log {
    category = "accounts"
  }

  metric {
    category = "AllMetrics"
    enabled  = true
  }
}
```
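The diagnostic setting above references a Log Analytics workspace; a minimal definition to pair with it (the name and retention period are illustrative):

```hcl
resource "azurerm_log_analytics_workspace" "workspace" {
  name                = "databricks-logs"
  location            = azurerm_resource_group.databricks_rg.location
  resource_group_name = azurerm_resource_group.databricks_rg.name
  sku                 = "PerGB2018"
  retention_in_days   = 30
}
```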
Cost Management
Configure budgets and alerts:
```hcl
resource "azurerm_consumption_budget_resource_group" "databricks_budget" {
  name              = "databricks-budget"
  resource_group_id = azurerm_resource_group.databricks_rg.id
  amount            = 5000
  time_grain        = "Monthly"

  # time_period is required; start_date must be the first day of a month
  time_period {
    start_date = "2024-01-01T00:00:00Z" # adjust to your rollout date
  }

  notification {
    enabled   = true
    threshold = 90.0
    operator  = "GreaterThan"

    contact_emails = [
      "admin@example.com"
    ]
  }
}
```
Integration Examples
Integration with Azure Data Factory:
```hcl
resource "azurerm_data_factory_linked_service_azure_databricks" "linked_databricks" {
  name            = "linked-databricks"
  data_factory_id = azurerm_data_factory.adf.id
  description     = "Databricks linked service"

  # workspace_url has no scheme; adb_domain expects a full https:// URL
  adb_domain                 = "https://${azurerm_databricks_workspace.workspace.workspace_url}"
  msi_work_space_resource_id = azurerm_databricks_workspace.workspace.id
  existing_cluster_id        = databricks_cluster.shared_autoscaling.id
}
```
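The linked service references a Data Factory instance and authenticates via managed identity, so the factory needs a system-assigned identity. A sketch (the factory name is illustrative and must be globally unique):

```hcl
resource "azurerm_data_factory" "adf" {
  name                = "databricks-adf-example" # must be globally unique
  location            = azurerm_resource_group.databricks_rg.location
  resource_group_name = azurerm_resource_group.databricks_rg.name

  identity {
    type = "SystemAssigned"
  }
}
```

The factory's identity also needs an appropriate role assignment (e.g. Contributor) on the Databricks workspace for MSI authentication to succeed.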
Integration with Azure Machine Learning: the azurerm provider does not expose a dedicated resource for attaching a Databricks workspace as AML compute, so one approach is to deploy the underlying ARM compute type through the `azapi` provider. A sketch (the API version and the `databricks_pat` variable are assumptions to verify against your environment):

```hcl
resource "azapi_resource" "databricks_compute" {
  type      = "Microsoft.MachineLearningServices/workspaces/computes@2023-04-01"
  name      = "databricks-compute"
  parent_id = azurerm_machine_learning_workspace.aml.id
  location  = azurerm_resource_group.databricks_rg.location

  body = jsonencode({
    properties = {
      computeType = "Databricks"
      resourceId  = azurerm_databricks_workspace.workspace.id
      properties = {
        # Hypothetical variable holding a Databricks personal access token
        databricksAccessToken = var.databricks_pat
      }
    }
  })
}
```
Conclusion
Terraform lets you manage Azure Databricks as Infrastructure as Code, from workspaces and networking to clusters, monitoring, and cost controls. By following the practices and configurations above, you can build secure, scalable analytics environments in Azure.