Managing Azure Machine Learning with Terraform

Learn how to deploy and manage Azure Machine Learning workspaces using Terraform

Managing Azure Machine Learning with Terraform

Azure Machine Learning provides a cloud-based environment for training, deploying, and managing ML models. This guide shows you how to manage Azure Machine Learning resources using Terraform.

Video Tutorial

Prerequisites

  • Azure subscription
  • Terraform installed
  • Azure CLI installed
  • Basic understanding of Machine Learning concepts

Project Structure

.
├── main.tf                   # Main Terraform configuration file
├── variables.tf              # Variable definitions
├── outputs.tf               # Output definitions
├── terraform.tfvars         # Variable values
└── modules/
    └── machine-learning/
        ├── main.tf          # Machine Learning specific configurations
        ├── variables.tf      # Module variables
        ├── compute.tf       # Compute cluster configurations
        └── outputs.tf       # Module outputs

Basic Configuration

Here’s a basic example of setting up an Azure Machine Learning workspace:

resource "azurerm_resource_group" "ml_rg" {
  name     = "ml-resources"
  location = "eastus"
}

resource "azurerm_application_insights" "ml_insights" {
  name                = "ml-insights"
  location           = azurerm_resource_group.ml_rg.location
  resource_group_name = azurerm_resource_group.ml_rg.name
  application_type    = "web"
}

resource "azurerm_key_vault" "ml_vault" {
  name                = "mlkeyvault"
  location           = azurerm_resource_group.ml_rg.location
  resource_group_name = azurerm_resource_group.ml_rg.name
  tenant_id          = data.azurerm_client_config.current.tenant_id
  sku_name           = "standard"
}

resource "azurerm_storage_account" "ml_storage" {
  name                     = "mlstorage"
  location                = azurerm_resource_group.ml_rg.location
  resource_group_name      = azurerm_resource_group.ml_rg.name
  account_tier             = "Standard"
  account_replication_type = "LRS"
}

resource "azurerm_machine_learning_workspace" "ml_workspace" {
  name                    = "mlworkspace"
  location               = azurerm_resource_group.ml_rg.location
  resource_group_name     = azurerm_resource_group.ml_rg.name
  application_insights_id = azurerm_application_insights.ml_insights.id
  key_vault_id           = azurerm_key_vault.ml_vault.id
  storage_account_id     = azurerm_storage_account.ml_storage.id

  identity {
    type = "SystemAssigned"
  }
}

Advanced Features

Compute Clusters

Create compute clusters for training:

resource "azurerm_machine_learning_compute_cluster" "compute_cluster" {
  name                          = "cpu-cluster"
  location                     = azurerm_resource_group.ml_rg.location
  machine_learning_workspace_id = azurerm_machine_learning_workspace.ml_workspace.id
  vm_priority                  = "LowPriority"
  vm_size                      = "Standard_DS2_v2"

  scale_settings {
    min_node_count                       = 0
    max_node_count                       = 4
    scale_down_nodes_after_idle_duration = "PT30M"
  }

  identity {
    type = "SystemAssigned"
  }
}

Compute Instances

Create development compute instances:

resource "azurerm_machine_learning_compute_instance" "compute_instance" {
  name                          = "dev-instance"
  location                     = azurerm_resource_group.ml_rg.location
  machine_learning_workspace_id = azurerm_machine_learning_workspace.ml_workspace.id
  virtual_machine_size         = "Standard_DS3_v2"

  identity {
    type = "SystemAssigned"
  }
}

Best Practices

  1. Use Infrastructure as Code for consistent deployments
  2. Implement proper monitoring and logging
  3. Use managed identities for enhanced security
  4. Configure auto-scaling appropriately
  5. Implement proper backup and disaster recovery

Security Considerations

  1. Use Azure Key Vault for secrets management
  2. Implement network isolation using private endpoints
  3. Use managed identities instead of service principals
  4. Enable Azure Monitor for monitoring and alerting
  5. Regularly audit access and permissions

Network Configuration

Configure private endpoints for enhanced security:

resource "azurerm_virtual_network" "ml_vnet" {
  name                = "ml-vnet"
  resource_group_name = azurerm_resource_group.ml_rg.name
  location           = azurerm_resource_group.ml_rg.location
  address_space      = ["10.0.0.0/16"]
}

resource "azurerm_subnet" "ml_subnet" {
  name                                           = "ml-subnet"
  resource_group_name                            = azurerm_resource_group.ml_rg.name
  virtual_network_name                           = azurerm_virtual_network.ml_vnet.name
  address_prefixes                               = ["10.0.1.0/24"]
  enforce_private_link_endpoint_network_policies = true
}

resource "azurerm_private_endpoint" "ml_pe" {
  name                = "ml-pe"
  location           = azurerm_resource_group.ml_rg.location
  resource_group_name = azurerm_resource_group.ml_rg.name
  subnet_id          = azurerm_subnet.ml_subnet.id

  private_service_connection {
    name                           = "ml-privateserviceconnection"
    private_connection_resource_id = azurerm_machine_learning_workspace.ml_workspace.id
    subresource_names             = ["amlworkspace"]
    is_manual_connection          = false
  }
}

Monitoring and Logging

Configure diagnostics settings:

resource "azurerm_monitor_diagnostic_setting" "ml_diagnostics" {
  name                       = "ml-diagnostics"
  target_resource_id        = azurerm_machine_learning_workspace.ml_workspace.id
  log_analytics_workspace_id = azurerm_log_analytics_workspace.workspace.id

  log {
    category = "AmlComputeClusterEvent"
    enabled  = true
  }

  log {
    category = "AmlComputeClusterNodeEvent"
    enabled  = true
  }

  metric {
    category = "AllMetrics"
    enabled  = true
  }
}

Cost Management

Implement cost management features:

resource "azurerm_monitor_action_group" "cost_alert" {
  name                = "cost-alert"
  resource_group_name = azurerm_resource_group.ml_rg.name
  short_name         = "cost"

  email_receiver {
    name          = "admin"
    email_address = "admin@example.com"
  }
}

resource "azurerm_consumption_budget_resource_group" "ml_budget" {
  name              = "ml-budget"
  resource_group_id = azurerm_resource_group.ml_rg.id

  amount     = 1000
  time_grain = "Monthly"

  notification {
    enabled   = true
    threshold = 90.0
    operator  = "GreaterThan"

    contact_emails = [
      "admin@example.com"
    ]
  }
}

Conclusion

Azure Machine Learning with Terraform provides a powerful way to manage ML resources in Azure. By following these best practices and configurations, you can create secure and scalable machine learning environments in the cloud.