Managing Azure Databricks with Terraform

Learn how to deploy and manage Azure Databricks using Terraform

Azure Databricks is a fast, easy, and collaborative Apache Spark-based analytics platform. This guide shows how to provision and manage Azure Databricks workspaces, networking, clusters, and related integrations with Terraform.

Prerequisites

  • Azure subscription
  • Terraform installed
  • Azure CLI installed
  • Basic understanding of Apache Spark and data analytics
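
With the prerequisites in place, a minimal provider setup covers the examples below; the version constraints are illustrative and can be adjusted:

terraform {
  required_providers {
    azurerm = {
      source  = "hashicorp/azurerm"
      version = "~> 3.0"
    }
    databricks = {
      source  = "databricks/databricks"
      version = "~> 1.0"
    }
  }
}

provider "azurerm" {
  features {}
}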

Project Structure

.
├── main.tf                   # Main Terraform configuration file
├── variables.tf              # Variable definitions
├── outputs.tf               # Output definitions
├── terraform.tfvars         # Variable values
└── modules/
    └── databricks/
        ├── main.tf          # Databricks specific configurations
        ├── variables.tf      # Module variables
        ├── clusters.tf      # Cluster configurations
        └── outputs.tf       # Module outputs
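
The root main.tf would then call the module along these lines (the variable names are illustrative and not part of the configuration shown later):

module "databricks" {
  source              = "./modules/databricks"
  resource_group_name = var.resource_group_name # hypothetical root-level variable
  location            = var.location            # hypothetical root-level variable
}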

Basic Configuration

Here’s a basic example of setting up Azure Databricks:

resource "azurerm_resource_group" "databricks_rg" {
  name     = "databricks-resources"
  location = "eastus"
}

resource "azurerm_databricks_workspace" "workspace" {
  name                = "databricks-workspace"
  resource_group_name = azurerm_resource_group.databricks_rg.name
  location            = azurerm_resource_group.databricks_rg.location
  sku                 = "premium"

  custom_parameters {
    no_public_ip = true
  }

  tags = {
    Environment = "Production"
  }
}
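
An output makes the workspace URL easy to retrieve after terraform apply; workspace_url is returned without a scheme, so the https:// prefix is added here:

output "databricks_workspace_url" {
  value = "https://${azurerm_databricks_workspace.workspace.workspace_url}"
}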

Advanced Features

Network Configuration

Configure virtual network injection:

resource "azurerm_virtual_network" "databricks_vnet" {
  name                = "databricks-vnet"
  location            = azurerm_resource_group.databricks_rg.location
  resource_group_name = azurerm_resource_group.databricks_rg.name
  address_space       = ["10.0.0.0/16"]
}

resource "azurerm_subnet" "public" {
  name                 = "public-subnet"
  resource_group_name  = azurerm_resource_group.databricks_rg.name
  virtual_network_name = azurerm_virtual_network.databricks_vnet.name
  address_prefixes     = ["10.0.1.0/24"]

  delegation {
    name = "databricks-delegation"
    service_delegation {
      name = "Microsoft.Databricks/workspaces"
      actions = [
        "Microsoft.Network/virtualNetworks/subnets/join/action",
        "Microsoft.Network/virtualNetworks/subnets/prepareNetworkPolicies/action",
        "Microsoft.Network/virtualNetworks/subnets/unprepareNetworkPolicies/action"
      ]
    }
  }
}

resource "azurerm_subnet" "private" {
  name                 = "private-subnet"
  resource_group_name  = azurerm_resource_group.databricks_rg.name
  virtual_network_name = azurerm_virtual_network.databricks_vnet.name
  address_prefixes     = ["10.0.2.0/24"]

  delegation {
    name = "databricks-delegation"
    service_delegation {
      name = "Microsoft.Databricks/workspaces"
      actions = [
        "Microsoft.Network/virtualNetworks/subnets/join/action",
        "Microsoft.Network/virtualNetworks/subnets/prepareNetworkPolicies/action",
        "Microsoft.Network/virtualNetworks/subnets/unprepareNetworkPolicies/action"
      ]
    }
  }
}

resource "azurerm_databricks_workspace" "secure_workspace" {
  name                = "secure-databricks"
  resource_group_name = azurerm_resource_group.databricks_rg.name
  location            = azurerm_resource_group.databricks_rg.location
  sku                 = "premium"

  custom_parameters {
    no_public_ip                                         = true
    virtual_network_id                                   = azurerm_virtual_network.databricks_vnet.id
    public_subnet_name                                   = azurerm_subnet.public.name
    private_subnet_name                                  = azurerm_subnet.private.name
    public_subnet_network_security_group_association_id  = azurerm_subnet_network_security_group_association.public.id
    private_subnet_network_security_group_association_id = azurerm_subnet_network_security_group_association.private.id
  }
}

Cluster Configuration

Using the Databricks provider:

provider "databricks" {
  azure_workspace_resource_id = azurerm_databricks_workspace.workspace.id
}

resource "databricks_cluster" "shared_autoscaling" {
  cluster_name            = "shared-autoscaling"
  spark_version           = "9.1.x-scala2.12"
  node_type_id            = "Standard_DS3_v2"
  autotermination_minutes = 20

  autoscale {
    min_workers = 1
    max_workers = 10
  }

  spark_conf = {
    "spark.databricks.cluster.profile" : "serverless"
    "spark.databricks.repl.allowedLanguages" : "python,sql"
  }

  custom_tags = {
    "Department" = "Data Science"
  }
}
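
Instead of hard-coding the runtime and node type, the Databricks provider exposes data sources that resolve them at plan time; a sketch:

data "databricks_spark_version" "latest_lts" {
  long_term_support = true
}

data "databricks_node_type" "smallest" {
  local_disk = true
}

# These can replace the hard-coded values above:
#   spark_version = data.databricks_spark_version.latest_lts.id
#   node_type_id  = data.databricks_node_type.smallest.id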

Best Practices

  1. Use Infrastructure as Code for consistent deployments
  2. Implement proper monitoring and logging
  3. Use managed identities for enhanced security
  4. Configure auto-scaling appropriately
  5. Implement proper backup and disaster recovery

Security Considerations

Network Security

Configure network security groups:

resource "azurerm_network_security_group" "databricks_nsg" {
  name                = "databricks-nsg"
  location            = azurerm_resource_group.databricks_rg.location
  resource_group_name = azurerm_resource_group.databricks_rg.name

  security_rule {
    # Allows inbound HTTPS from the Azure Databricks control plane (service tag)
    name                       = "AllowDBWebAccess"
    priority                   = 100
    direction                  = "Inbound"
    access                     = "Allow"
    protocol                   = "Tcp"
    source_port_range          = "*"
    destination_port_range     = "443"
    source_address_prefix      = "AzureDatabricks"
    destination_address_prefix = "*"
  }
}
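
The secure_workspace example above references NSG associations for the public and private subnets; they can be declared against this NSG:

resource "azurerm_subnet_network_security_group_association" "public" {
  subnet_id                 = azurerm_subnet.public.id
  network_security_group_id = azurerm_network_security_group.databricks_nsg.id
}

resource "azurerm_subnet_network_security_group_association" "private" {
  subnet_id                 = azurerm_subnet.private.id
  network_security_group_id = azurerm_network_security_group.databricks_nsg.id
}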

Key Vault Integration

resource "azurerm_key_vault" "databricks_kv" {
  name                = "databricks-kv"
  location            = azurerm_resource_group.databricks_rg.location
  resource_group_name = azurerm_resource_group.databricks_rg.name
  tenant_id           = data.azurerm_client_config.current.tenant_id
  sku_name            = "standard"

  purge_protection_enabled = true
}

resource "azurerm_key_vault_access_policy" "databricks" {
  key_vault_id = azurerm_key_vault.databricks_kv.id
  tenant_id    = data.azurerm_client_config.current.tenant_id
  object_id    = azurerm_databricks_workspace.workspace.managed_resource_group_id

  secret_permissions = [
    "Get", "List"
  ]
}
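
With the access policy in place, the vault can back a Databricks secret scope; a sketch using the Databricks provider (the scope name is illustrative):

resource "databricks_secret_scope" "kv_backed" {
  name = "keyvault-backed"

  keyvault_metadata {
    resource_id = azurerm_key_vault.databricks_kv.id
    dns_name    = azurerm_key_vault.databricks_kv.vault_uri
  }
}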

Monitoring and Logging

Configure diagnostics settings:

resource "azurerm_monitor_diagnostic_setting" "databricks_diagnostics" {
  name                       = "databricks-diagnostics"
  target_resource_id         = azurerm_databricks_workspace.workspace.id
  log_analytics_workspace_id = azurerm_log_analytics_workspace.workspace.id

  log {
    category = "dbfs"
    enabled  = true
  }

  log {
    category = "clusters"
    enabled  = true
  }

  log {
    category = "accounts"
    enabled  = true
  }

  metric {
    category = "AllMetrics"
    enabled  = true
  }
}
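
The diagnostic setting above sends logs to a Log Analytics workspace; a minimal definition (name and retention are illustrative):

resource "azurerm_log_analytics_workspace" "workspace" {
  name                = "databricks-logs"
  location            = azurerm_resource_group.databricks_rg.location
  resource_group_name = azurerm_resource_group.databricks_rg.name
  sku                 = "PerGB2018"
  retention_in_days   = 30
}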

Cost Management

Configure budgets and alerts:

resource "azurerm_consumption_budget_resource_group" "databricks_budget" {
  name              = "databricks-budget"
  resource_group_id = azurerm_resource_group.databricks_rg.id

  amount     = 5000
  time_grain = "Monthly"

  # A time_period block with a start_date is required; the date below is illustrative
  time_period {
    start_date = "2024-01-01T00:00:00Z"
  }

  notification {
    enabled   = true
    threshold = 90.0
    operator  = "GreaterThan"

    contact_emails = [
      "admin@example.com"
    ]
  }
}

Integration Examples

Integration with Azure Data Factory:

resource "azurerm_data_factory_linked_service_azure_databricks" "linked_databricks" {
  name                = "linked-databricks"
  data_factory_id    = azurerm_data_factory.adf.id
  description        = "Databricks linked service"
  adb_domain        = azurerm_databricks_workspace.workspace.workspace_url

  msi_work_space_resource_id = azurerm_databricks_workspace.workspace.id
  existing_cluster_id       = databricks_cluster.shared_autoscaling.id
}
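
The linked service assumes a Data Factory (referenced as azurerm_data_factory.adf) with a system-assigned identity, which the MSI authentication relies on; a minimal sketch:

resource "azurerm_data_factory" "adf" {
  name                = "databricks-adf" # must be globally unique
  location            = azurerm_resource_group.databricks_rg.location
  resource_group_name = azurerm_resource_group.databricks_rg.name

  identity {
    type = "SystemAssigned"
  }
}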

Integration with Azure Machine Learning:

resource "azurerm_machine_learning_compute" "databricks_compute" {
  name                          = "databricks-compute"
  location                      = azurerm_resource_group.databricks_rg.location
  machine_learning_workspace_id = azurerm_machine_learning_workspace.aml.id
  
  databricks {
    workspace_url = azurerm_databricks_workspace.workspace.workspace_url
  }

  identity {
    type = "SystemAssigned"
  }
}

Conclusion

Managing Azure Databricks with Terraform brings your data analytics platform under Infrastructure as Code. By applying the configurations and best practices above, you can build secure, scalable, and reproducible analytics environments in Azure.