Microsoft Azure Machine Learning fundamentals


What is machine learning?
Machine learning is a technique that uses mathematics and statistics to create a model that can predict unknown values.


For example, suppose Adventure Works Cycles is a business that rents cycles in a city.

Adventure Works cycle rental location, on a cloudy day in January

The business could use historic data to train a model that predicts daily rental demand in order to make sure sufficient staff and cycles are available.

To do this, Adventure Works could create a machine learning model that takes information about a specific day (the day of week, the anticipated weather conditions, and so on) as an input, and predicts the expected number of rentals as an output.

Mathematically, you can think of machine learning as a way of defining a function (let's call it f) that operates on one or more features of something (which we'll call x) to calculate a predicted label (y) - like this:

f(x) = y

In this bicycle rental example, the details about a given day (day of the week, weather, and so on) are the features (x), the number of rentals for that day is the label (y), and the function (f) that calculates the number of rentals based on the information about the day is encapsulated in a machine learning model.

The specific operation that the f function performs on x to calculate y depends on a number of factors, including the type of model you're trying to create and the specific algorithm used to train the model. Additionally, in most cases the data used to train the machine learning model requires some pre-processing before model training can be performed.
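
To make this concrete, here is a minimal sketch of the f(x) = y idea using scikit-learn. All of the feature values and rental counts are hypothetical; this illustrates the concept rather than the model Adventure Works would actually build.

Python
# Minimal sketch of f(x) = y: learn a function from day features to rentals.
# All data here is hypothetical, for illustration only.
from sklearn.linear_model import LinearRegression

# Each row of X holds the features for one day: [day_of_week, temperature, humidity]
X = [[0, 9.8, 0.65],
     [1, 12.1, 0.80],
     [2, 15.4, 0.43],
     [3, 11.0, 0.55]]
y = [410, 385, 545, 470]   # the known labels: rentals on those days

f = LinearRegression().fit(X, y)      # training learns the function f
print(f.predict([[4, 13.5, 0.60]]))   # predict rentals for a new day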


Azure Machine Learning

Training and deploying an effective machine learning model involves a lot of work, much of it time-consuming and resource-intensive. 

Azure Machine Learning is a cloud-based service that helps simplify some of the tasks and reduce the time it takes to prepare data, train a model, and deploy a predictive service. 


Create an Azure Machine Learning workspace

Data scientists expend a lot of effort exploring and pre-processing data, and trying various model-training algorithms to produce accurate models. This work is time-consuming, and it often makes inefficient use of expensive compute hardware.

Azure Machine Learning is a cloud-based platform for building and operating machine learning solutions in Azure. It includes a wide range of features and capabilities that help data scientists prepare data, train models, publish predictive services, and monitor their usage. Most importantly, it helps data scientists increase their efficiency by automating many of the time-consuming tasks associated with training models; and it enables them to use cloud-based compute resources that scale effectively to handle large volumes of data while incurring costs only when actually used.

Create an Azure Machine Learning workspace

To use Azure Machine Learning, you create a workspace in your Azure subscription. You can then use this workspace to manage data, compute resources, code, models, and other artifacts related to your machine learning workloads.

Follow these steps to create a workspace:

  1. Sign in to the Azure portal using your Microsoft credentials.
  2. Select +Create a resource, search for Machine Learning, and create a new Machine Learning resource with the following settings:
    • Workspace Name: A unique name of your choice
    • Subscription: Your Azure subscription
    • Resource group: Create a new resource group with a unique name
    • Location: Choose any available location
    • Workspace edition: Enterprise
  3. Wait for your workspace to be created (it can take a few minutes). Then go to it in the portal.
  4. On the Overview page for your workspace, launch Azure Machine Learning studio (or open a new browser tab and navigate to https://ml.azure.com), and sign in to Azure Machine Learning studio using your Microsoft account.
  5. In Azure Machine Learning studio, toggle the ☰ icon at the top left to view the various pages in the interface. You can use these pages to manage the resources in your workspace.

 Important

If you intend to use an Azure Machine Learning workspace that you created previously using the Basic edition, upgrade it to Enterprise edition to make the automated machine learning interface available.

You can manage your workspace using the Azure portal, but for data scientists and Machine Learning operations engineers, Azure Machine Learning studio provides a more focused user interface for managing workspace resources.
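
If you prefer to script workspace creation instead of using the portal, the Azure Machine Learning SDK for Python (v1, the azureml-core package) provides a Workspace.create method. The following is a minimal sketch; the workspace name, resource group, location, and subscription ID are placeholders you'd replace with your own values.

Python
from azureml.core import Workspace

# All values below are placeholders - substitute your own.
ws = Workspace.create(name='my-workspace',
                      subscription_id='<your-subscription-id>',
                      resource_group='my-resource-group',
                      create_resource_group=True,
                      location='eastus')

ws.write_config()   # saves connection details so later scripts can reconnect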



Create compute resources

After you have created an Azure Machine Learning workspace, you can use it to manage the various assets and resources you need to create machine learning solutions. At its core, Azure Machine Learning is a platform for training and managing machine learning models, for which you need compute on which to run the training process.


Azure compute is an on-demand computing service for running cloud-based applications. It provides computing resources like multi-core processors and supercomputers via virtual machines and containers. It also provides serverless computing to run apps without requiring infrastructure setup or configuration.

Create compute targets

Compute targets are cloud-based resources on which you can run model training and data exploration processes.

  1. In Azure Machine Learning studio, view the Compute page (under Manage). This is where you manage the compute targets for your data science activities. There are four kinds of compute resource you can create:
    • Compute Instances: Development workstations that data scientists can use to work with data and models.
    • Compute Clusters: Scalable clusters of virtual machines for on-demand processing of experiment code.
    • Inference Clusters: Deployment targets for predictive services that use your trained models.
    • Attached Compute: Links to existing Azure compute resources, such as Virtual Machines or Azure Databricks clusters.
  2. On the Compute Instances tab, add a new compute instance with the following settings. You'll use this as a workstation from which to test your model:
    • Compute name: enter a unique name
    • Virtual Machine type: CPU
    • Virtual Machine size: Standard_DS2_v2
  3. While the compute instance is being created, switch to the Compute Clusters tab, and add a new compute cluster with the following settings. You'll use this to train a machine learning model:
    • Compute name: enter a unique name
    • Virtual Machine size: Standard_DS2_v2
    • Virtual Machine priority: Dedicated
    • Minimum number of nodes: 2
    • Maximum number of nodes: 2
    • Idle seconds before scale down: 120

 Note

In a production environment, you'd typically set the minimum number of nodes value to 0 so that compute is only started when it is needed. However, compute can take a while to start, so to reduce the amount of time you spend waiting for it in this module, you've initialized it with two permanently running nodes.

If you decide not to complete this module, be sure to stop your compute instance and edit the compute cluster to reset the minimum number of nodes to 0 in order to avoid leaving your compute running and incurring unnecessary charges to your Azure subscription. Alternatively, if you're finished exploring Azure Machine Learning, delete the entire resource group in your Azure subscription.

The compute targets will take some time to be created. You can move onto the next unit while you wait.
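
For reference, the same compute cluster could also be provisioned from code with the v1 SDK. This sketch assumes you've saved a workspace config file (for example with ws.write_config()) and uses a placeholder cluster name.

Python
from azureml.core import Workspace
from azureml.core.compute import AmlCompute, ComputeTarget

ws = Workspace.from_config()   # reads the saved workspace configuration

# Mirrors the portal settings above; 'my-cluster' is a placeholder name.
config = AmlCompute.provisioning_configuration(vm_size='Standard_DS2_v2',
                                               min_nodes=2,
                                               max_nodes=2,
                                               idle_seconds_before_scaledown=120)

cluster = ComputeTarget.create(ws, 'my-cluster', config)
cluster.wait_for_completion(show_output=True)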




Explore data

Machine learning models must be trained with existing data. In this case, you'll use a dataset of historical bicycle rental details to train a model that predicts the number of bicycle rentals that should be expected on a given day, based on seasonal and meteorological features.

Create a dataset

In Azure Machine Learning, data for model training and other operations is usually encapsulated in an object called a dataset.
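
As an aside, a dataset can also be created and registered from code with the v1 SDK. This is a sketch only; the web file URL is a placeholder for the location given in this exercise.

Python
from azureml.core import Workspace, Dataset

ws = Workspace.from_config()

# Placeholder URL - use the web file location provided for this exercise.
ds = Dataset.Tabular.from_delimited_files(path='https://<web-file-url>/bike-data.csv')
ds = ds.register(workspace=ws, name='bike-rentals',
                 description='Historical bicycle rental data')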

  1. In Azure Machine Learning studio, view the Datasets page (under Assets), and create a new dataset from web files with the following settings:
    • Basic Info:
      • Name: bike-rentals (this dataset name is used later when you configure the automated ML run)
    • Settings and preview:
      • File format: Delimited
      • Delimiter: Comma
      • Encoding: UTF-8
      • Column headers: Use headers from first file
      • Skip rows: None
    • Schema:
      • Include all columns other than Path
      • Review the automatically detected types
    • Confirm details:
      • Do not profile the dataset after creation
  2. After the dataset has been created, open it and view the Explore page to see a sample of the data. This data contains historical features and labels for bike rentals.

Citation: This data is derived from Capital Bikeshare and is used in accordance with the published data license agreement.




Train a machine learning model

Azure Machine Learning includes an automated machine learning capability that leverages the scalability of cloud compute to automatically try multiple pre-processing techniques and model-training algorithms in parallel to find the best performing model for your data.

Run an automated machine learning experiment

In Azure Machine Learning, operations that you run are called experiments. Follow the steps below to run an experiment that uses automated machine learning to train a regression model that predicts bicycle rentals.

  1. In Azure Machine Learning studio, view the Automated ML page (under Author).

  2. Create a new Automated ML run with the following settings (a code-based equivalent of this configuration is sketched after these steps for reference):

    • Select dataset:
      • Dataset: bike-rentals
    • Configure run:
      • New experiment name: mslearn-bike-rental
      • Target column: rentals (this is the label the model will be trained to predict)
      • Training compute target: the compute cluster you created previously
    • Task type and settings:
      • Task type: Regression (the model will predict a numeric value)
      • Additional configuration settings:
        • Primary metric: Select Normalized root mean square error (more about this metric later!)
        • Explain best model: Selected - this option causes automated machine learning to calculate feature importance for the best model; making it possible to determine the influence of each feature on the predicted label.
        • Blocked algorithms: Block all algorithms other than RandomForest and LightGBM - normally you'd want to try as many as possible, but doing so can take a long time!
        • Exit criterion:
          • Training job time (hours): 0.25 - this causes the experiment to end after a maximum of 15 minutes.
          • Metric score threshold: 0.08 - this causes the experiment to end if a model achieves a normalized root mean square error metric score of 0.08 or less.
      • Featurization settings:
        • Enable featurization: Selected - this causes Azure Machine Learning to automatically preprocess the features before training.
  3. When you finish submitting the automated ML run details, it will start automatically. Wait for the run status to change from Preparing to Running (this may take five minutes or so, as the cluster nodes need to be initialized before training can begin - now might be a good time for a coffee break!). You may need to select ↻ Refresh periodically.

  4. When the run status changes to Running, view the Models tab and observe as each possible combination of training algorithm and pre-processing steps is tried and the performance of the resulting model is evaluated. The page will automatically refresh periodically, but you can also select ↻ Refresh.

  5. Wait for the experiment to finish. It may take a few minutes.
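
For reference, a roughly equivalent run can be configured in code with the azureml-train-automl package (v1 SDK). Parameter names have varied across SDK versions, so treat this as a sketch rather than a definitive configuration; the cluster name is a placeholder.

Python
from azureml.core import Workspace, Experiment, Dataset
from azureml.train.automl import AutoMLConfig

ws = Workspace.from_config()
train_data = Dataset.get_by_name(ws, 'bike-rentals')

automl_config = AutoMLConfig(task='regression',
                             primary_metric='normalized_root_mean_squared_error',
                             training_data=train_data,
                             label_column_name='rentals',
                             compute_target=ws.compute_targets['my-cluster'],
                             experiment_timeout_hours=0.25,
                             experiment_exit_score=0.08,
                             allowed_models=['RandomForest', 'LightGBM'],
                             model_explainability=True)

run = Experiment(ws, 'mslearn-bike-rental').submit(automl_config)
run.wait_for_completion(show_output=True)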

Review the best model

After the experiment has finished, you can review the best performing model that was generated (note that in this case, we used exit criteria to stop the experiment - so the "best" model found by the experiment may not be the best possible model, just the best one found within the time allowed for this exercise!).

  1. On the Details tab of the automated machine learning run, note the best model summary.

  2. Select the Algorithm name for the best model to view its details.

    The best model is identified based on the evaluation metric you specified (Normalized root mean square error). To calculate this metric, the training process used some of the data to train the model, and applied a technique called cross-validation to iteratively test the trained model with data it wasn't trained with, comparing the predicted value with the actual known value. The difference between the predicted and actual value (known as the residual) indicates the amount of error in the model, and this particular performance metric is calculated by squaring the errors across all of the test cases, finding the mean of these squares, taking the square root, and then normalizing the result. The smaller this value is, the more accurately the model is predicting; a short computational sketch of this metric appears after these steps.

  3. Next to the Normalized root mean square error value, select View all other metrics to see values of other possible evaluation metrics for a regression model.

  4. Select the Metrics tab and select the residuals and predicted_true charts if they are not already selected. Then review the charts, which show the performance of the model by comparing the predicted values against the true values, and by showing the residuals (differences between predicted and actual values) as a histogram.

    The Predicted vs. True chart should show a diagonal trend in which the predicted value correlates closely to the true value. A dotted line shows how a perfect model should perform, and the closer the line for your model's average predicted value is to this, the better its performance. A histogram below the line chart shows the distribution of true values.

    Predicted vs True chart

    The Residual Histogram shows the frequency of residual value ranges. Residuals represent variance between predicted and true values that can't be explained by the model - in other words, errors; so what you should hope to see is that the most frequently occurring residual values are clustered around 0 (in other words, most of the errors are small), with fewer errors at the extreme ends of the scale.

    Residuals histogram

  5. Select the Explanations tab, and view the Global Importance chart. This shows how much each feature in the dataset influences the label prediction, like this:

    Global importance chart
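
To make the metric concrete, here is a small NumPy sketch that computes RMSE and one common normalization (dividing by the range of the true values). The exact normalization automated ML applies may differ, and the numbers are hypothetical.

Python
import numpy as np

y_true = np.array([450, 380, 520, 610, 490])   # hypothetical actual rentals
y_pred = np.array([430, 410, 500, 590, 470])   # hypothetical predictions

residuals = y_pred - y_true                     # per-day errors
rmse = np.sqrt(np.mean(residuals ** 2))         # root mean squared error
nrmse = rmse / (y_true.max() - y_true.min())    # normalize by the label range
print(rmse, nrmse)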

 





Deploy a model as a service

After you've used automated machine learning to train some models, you can deploy the best performing model as a service for client applications to use.

Deploy a predictive service

In Azure Machine Learning, you can deploy a service to Azure Container Instances (ACI) or to an Azure Kubernetes Service (AKS) cluster.

Azure Container Instances is a service that enables a developer to deploy containers on the Microsoft Azure public cloud without having to provision or manage any underlying infrastructure.

Containers are executable units of software in which application code is packaged along with its libraries and dependencies.


The main benefits of Azure Container Instances (ACI) are:
  • Run containers without managing servers.
  • Increase agility with containers on demand.
  • Deploy containers to the cloud with unprecedented simplicity and speed—with a single command.
  • Secure applications with hypervisor isolation.



Azure Kubernetes Service (AKS) is a robust and cost-effective container orchestration service that helps you deploy and manage containerized applications, automatically assigning additional resources without the overhead of managing extra servers.

Kubernetes is an open-source container orchestration tool designed to automate deploying, scaling, and operating containerized applications. Kubernetes was born from Google's 15 years of experience running production workloads. It is designed to scale from tens to thousands or even millions of containers.

For production scenarios, an AKS deployment is recommended, for which you must create an inference cluster compute target. In this exercise, you'll use an ACI service, which is a suitable deployment target for testing, and does not require you to create an inference cluster.
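
Behind the studio's Deploy button, the v1 SDK expresses an ACI deployment roughly as follows. This is a sketch only: the model name and entry script are placeholders for artifacts that an automated ML run can produce, and a full deployment typically also needs an environment definition.

Python
from azureml.core import Workspace, Model
from azureml.core.model import InferenceConfig
from azureml.core.webservice import AciWebservice

ws = Workspace.from_config()
model = Model(ws, 'my-best-model')   # placeholder: a registered model

# Placeholder scoring script (and, typically, an environment definition).
inference_config = InferenceConfig(entry_script='score.py')
aci_config = AciWebservice.deploy_configuration(cpu_cores=1, memory_gb=1,
                                                auth_enabled=True)

service = Model.deploy(ws, 'predict-rentals', [model],
                       inference_config, aci_config)
service.wait_for_deployment(show_output=True)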

  1. In Azure Machine Learning studio, on the Automated ML page, select the run for your automated machine learning experiment and view the Details tab.
  2. Select the algorithm name for the best model. Then, on the Model tab, use the Deploy button to deploy the model with the following settings:
    • Name: predict-rentals
    • Description: Predict cycle rentals
    • Compute type: ACI
    • Enable authentication: Selected
  3. Wait for the deployment to start - this may take a few seconds. Then, in the Model summary section, observe the Deploy status for the predict-rentals service, which should be Running. Wait for this status to change to Successful. You may need to select ↻ Refresh periodically.
  4. In Azure Machine Learning studio, view the Endpoints page and select the predict-rentals real-time endpoint. Then select the Consume tab and note the following information there. You need this information to connect to your deployed service from a client application.
    • The REST endpoint for your service
    • The Primary Key for your service
  5. Note that you can use the ⧉ link next to these values to copy them to the clipboard.

Test the deployed service

Now that you've deployed a service, you can test it using some simple code.

  1. With the Consume page for the predict-rentals service open in your browser, open a new browser tab and open a second instance of Azure Machine Learning studio. Then in the new tab, view the Notebooks page. On the Notebooks page, create a new file with the following settings:

    • File name: bike_test.ipynb
    • File type: Notebook
    • Overwrite if already exists: Selected
    • Select target directory: Select the folder with your user name under User files
  2. When the new notebook has been created, ensure that the compute instance you created previously is selected in the Compute box, and that it has a status of Running.

  3. Edit the notebook inline, and in the cell that has been created in the notebook, paste the following code:

    Python
    endpoint = 'YOUR_ENDPOINT' #Replace with your endpoint
    key = 'YOUR_KEY' #Replace with your key
    
    import json
    import requests
    
    #An array of features based on five-day weather forecast
    x = [[1,1,2022,1,0,6,0,2,0.344167,0.363625,0.805833,0.160446],
        [2,1,2022,1,0,0,0,2,0.363478,0.353739,0.696087,0.248539],
        [3,1,2022,1,0,1,1,1,0.196364,0.189405,0.437273,0.248309],
        [4,1,2022,1,0,2,1,1,0.2,0.212122,0.590435,0.160296],
        [5,1,2022,1,0,3,1,1,0.226957,0.22927,0.436957,0.1869]]
    
    #Convert the array to JSON format
    input_json = json.dumps({"data": x})
    
    #Set the content type and authentication for the request
    headers = {"Content-Type":"application/json",
            "Authorization":"Bearer " + key}
    
    #Send the request
    response = requests.post(endpoint, input_json, headers=headers)
    
    #If we got a valid response, display the predictions
    if response.status_code == 200:
        y = json.loads(response.json())
        print("Predictions:")
        for i in range(len(x)):
            print (" Day: {}. Predicted rentals: {}".format(i+1, max(0, round(y["result"][i]))))
    else:
        print(response)
    

     Note

    Don't worry too much about the details of the code. It just defines features for a five-day period using hypothetical weather forecast data, and uses the predict-rentals service you created to predict cycle rentals for those five days.

  4. Switch to the browser tab containing the Consume page for the predict-rentals service, and copy the REST endpoint for your service. Then switch back to the tab containing the notebook and paste the endpoint into the code, replacing YOUR_ENDPOINT.

  5. Switch to the browser tab containing the Consume page for the predict-rentals service, and copy the Primary Key for your service. Then switch back to the tab containing the notebook and paste the key into the code, replacing YOUR_KEY.

  6. Save the notebook. Then use the ▷ button next to the cell to run the code.

  7. Verify that a predicted number of rentals for each day in the five-day period is returned.






Create an Azure Machine Learning workspace

Azure Machine Learning is a cloud-based platform for building and operating machine learning solutions in Azure. It includes a wide range of features and capabilities that help data scientists prepare data, train models, publish predictive services, and monitor their usage. One of these features is a visual interface called designer, that you can use to train, test, and deploy machine learning models without writing any code.

Create an Azure Machine Learning workspace

To use Azure Machine Learning, you create a workspace in your Azure subscription. You can then use this workspace to manage data, compute resources, code, models, and other artifacts related to your machine learning workloads.

If you do not already have one, follow these steps to create a workspace:

  1. Sign in to the Azure portal using your Microsoft credentials.
  2. Select +Create a resource, search for Machine Learning, and create a new Machine Learning resource with the following settings:
    • Workspace Name: A unique name of your choice
    • Subscription: Your Azure subscription
    • Resource group: Create a new resource group with a unique name
    • Location: Choose any available location
    • Workspace edition: Enterprise
  3. Wait for your workspace to be created (it can take a few minutes). Then go to it in the portal.
  4. On the Overview page for your workspace, launch Azure Machine Learning studio (or open a new browser tab and navigate to https://ml.azure.com), and sign in to Azure Machine Learning studio using your Microsoft account.
  5. In Azure Machine Learning studio, toggle the ☰ icon at the top left to view the various pages in the interface. You can use these pages to manage the resources in your workspace.

You can manage your workspace using the Azure portal, but for data scientists and Machine Learning operations engineers, Azure Machine Learning studio provides a more focused user interface for managing workspace resources.

 Important

If you intend to use an Azure Machine Learning workspace that you created previously using the Basic edition, upgrade it to Enterprise edition to make the designer interface available.




Create compute resources

To train and deploy models using Azure Machine Learning designer, you need compute on which to run the training process, test the model, and host the model in a deployed service.

Create compute targets

Compute targets are cloud-based resources on which you can run model training and data exploration processes.

  1. In Azure Machine Learning studio, view the Compute page (under Manage). This is where you manage the compute targets for your data science activities. There are four kinds of compute resource you can create:
    • Compute Instances: Development workstations that data scientists can use to work with data and models.
    • Compute Clusters: Scalable clusters of virtual machines for on-demand processing of experiment code.
    • Inference Clusters: Deployment targets for predictive services that use your trained models.
    • Attached Compute: Links to existing Azure compute resources, such as Virtual Machines or Azure Databricks clusters.
  2. On the Compute Instances tab, add a new compute instance with the following settings. You'll use this to test your model:
    • Compute name: enter a unique name
    • Virtual Machine type: CPU
    • Virtual Machine size: Standard_DS2_v2
  3. While the compute instance is being created, switch to the Compute Clusters tab, and add a new compute cluster with the following settings. You'll use this to train a machine learning model:
    • Compute name: enter a unique name
    • Virtual Machine size: Standard_DS2_v2
    • Virtual Machine priority: Dedicated
    • Minimum number of nodes: 2
    • Maximum number of nodes: 2
    • Idle seconds before scale down: 120
  4. While the compute cluster is being created, switch to the Inference Clusters tab, and add a new cluster with the following settings. You'll use this to deploy your model as a service.
    • Compute name: enter a unique name
    • Kubernetes Service: Create new
    • Region: Select a different region than the one used for your workspace
    • Virtual Machine size: Standard_DS2_v2 (Use the filter to find this in the list)
    • Cluster purpose: Dev-test
    • Number of nodes: 2
    • Network configuration: Basic
    • Enable SSL configuration: Unselected
  5. Verify that the inference cluster is in the Creating state - it will take a while to be created, so leave it for now.

 Note

In a production environment, you'd typically set the minimum number of nodes value to 0 so that compute is only started when it is needed. However, compute can take a while to start, so to reduce the amount of time you spend waiting for it in this module, you've initialized it with two permanently running nodes.

If you decide not to complete this module, be sure to stop your compute instance, edit the compute cluster to reset the minimum number of nodes to 0, and delete the inference cluster in order to avoid leaving your compute running and incurring unnecessary charges to your Azure subscription. Alternatively, if you're finished exploring Azure Machine Learning, delete the entire resource group in your Azure subscription.

The compute targets will take some time to be created. You can move onto the next unit while you wait.


Explore data

To train a regression model, you need a dataset that includes historical features (characteristics of the entity for which you want to make a prediction) and known label values (the numeric value that you want to train a model to predict).

Create a pipeline

To use the Azure Machine Learning designer, you create a pipeline that you will use to train a machine learning model. This pipeline starts with the dataset from which you want to train the model.

  1. In Azure Machine Learning studio, view the Designer page (under Author), and select + to create a new pipeline.
  2. In the Settings pane, change the default pipeline name (Pipeline-Created-on-date) to Auto Price Training (if the Settings pane is not visible, select the ⚙ icon next to the pipeline name at the top).
  3. Observe that you need to specify a compute target on which to run the pipeline. In the Settings pane, use Select compute target to select the compute cluster you created previously.

Add and explore a dataset

In this module, you'll train a regression model that predicts the price of an automobile based on its characteristics. Azure Machine Learning includes a sample dataset that you can use for this model.

  1. On the left side of the designer, select the Datasets (⌕) tab, and drag the Automobile price data (Raw) dataset from the Samples section onto the canvas.
  2. Right-click (Ctrl+click on a Mac) the Automobile price data (Raw) dataset on the canvas, and on the Visualize menu, select Dataset output.
  3. Review the schema of the data, noting that you can see the distributions of the various columns as histograms.
  4. Scroll to the right of the dataset until you see the Price column. This is the label your model will predict.
  5. Select the column header for the price column and view the details that are displayed in the pane to the right. These include various statistics for the column values, and a histogram showing the distribution of the column values.
  6. Scroll back to the left and select the normalized-losses column header. Then review the statistics for this column, noting that there are quite a few missing values. This will limit its usefulness in predicting the price label, so you might want to exclude it from training.
  7. View the statistics for the bore, stroke, and horsepower columns, noting the number of missing values. These columns have significantly fewer missing values than normalized-losses, so they may still be useful in predicting price if you exclude the rows where the values are missing from training.
  8. Compare the values in the stroke, peak-rpm, and city-mpg columns. These are all measured in different scales, and it's possible that the larger values for peak-rpm might bias the training algorithm and create an over-dependency on this column compared to columns with lower values, such as stroke. Typically, data scientists mitigate this possible bias by normalizing the numeric columns so they're on similar scales.
  9. Close the Automobile price data (Raw) result visualization window so that you can see the dataset on the canvas like this:

The Automobile price data (Raw) dataset on the designer canvas

Add data transformations

You typically apply data transformations to prepare the data for modeling. In the case of the automobile price data, you'll add transformations to address the issues you identified when exploring the data.

  1. In the pane on the left, view the Modules (⊞) tab and expand the Data Transformation section, which contains a wide range of modules you can use to transform data before model training.
  2. Drag a Select Columns in Dataset module to the canvas, below the Automobile price data (Raw) module. Then connect the output at the bottom of the Automobile price data (Raw) module to the input at the top of the Select Columns in Dataset module, like this:

The Automobile price data (Raw) dataset connected to the Select Columns in Dataset module

  3. Select the Select Columns in Dataset module, and in its Settings pane on the right, select Edit column. Then in the Select columns window, select By name and use the + links to add all columns other than normalized-losses, like this:

all columns other than normalized-losses

In the rest of this exercise, you're going to create a pipeline that looks like this:

Automobile price data (Raw) dataset with Select Columns in Dataset, Clean Missing Data, and Normalize Data modules

Follow the remaining steps, using the image above for reference as you add and configure the required modules.

  4. Drag a Clean Missing Data module from the Data Transformations section, and place it under the Select Columns in Dataset module. Then connect the output from the Select Columns in Dataset module to the input of the Clean Missing Data module.
  5. Select the Clean Missing Data module, and in the settings pane on the right, click Edit column. Then in the Select columns window, select With rules, in the Include list select Column names, and in the box of column names enter bore, stroke, and horsepower (making sure you match the spelling and capitalization exactly), like this:

bore, stroke, and horsepower columns are selected

  6. With the Clean Missing Data module still selected, in the settings pane, set the following configuration settings:
    • Minimum missing value ratio: 0.0
    • Maximum missing value ratio: 1.0
    • Cleaning mode: Remove entire row
  7. Drag a Normalize Data module to the canvas, below the Clean Missing Data module. Then connect the left-most output from the Clean Missing Data module to the input of the Normalize Data module.
  8. Select the Normalize Data module and view its settings, noting that it requires you to specify the transformation method and the columns to be transformed. Then, set the transformation to MinMax and edit the columns by applying a rule to include the following Column names (ensuring you match the spelling, capitalization, and hyphenation exactly; a pandas equivalent of these transformations is sketched after the column list):
    • symboling
    • wheel-base
    • length
    • width
    • height
    • curb-weight
    • engine-size
    • bore
    • stroke
    • compression-ratio
    • horsepower
    • peak-rpm
    • city-mpg
    • highway-mpg

all numeric columns other than price are selected
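
Outside the designer, the same three transformations could be sketched in pandas. The file path is a placeholder; in the exercise the data comes from the built-in sample dataset.

Python
import pandas as pd

# Placeholder path - stands in for the Automobile price data (Raw) sample.
df = pd.read_csv('automobile-price-data.csv')

df = df.drop(columns=['normalized-losses'])               # Select Columns in Dataset
df = df.dropna(subset=['bore', 'stroke', 'horsepower'])   # Clean Missing Data

numeric_cols = ['symboling', 'wheel-base', 'length', 'width', 'height',
                'curb-weight', 'engine-size', 'bore', 'stroke',
                'compression-ratio', 'horsepower', 'peak-rpm',
                'city-mpg', 'highway-mpg']

# Normalize Data: MinMax scaling of each column to the range 0-1
df[numeric_cols] = (df[numeric_cols] - df[numeric_cols].min()) / \
                   (df[numeric_cols].max() - df[numeric_cols].min())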

Run the pipeline

To apply your data transformations, you need to run the pipeline as an experiment.

  1. Ensure your pipeline looks similar to this:

Automobile price data (Raw) dataset with Select Columns in Dataset, Clean Missing Data, and Normalize Data modules

  2. Select Submit, and run the pipeline as a new experiment named auto-price-training on your compute cluster.
  3. Wait for the run to finish. This may take 5 minutes or more. When the run has completed, the modules should look like this:

Automobile price data (Raw) dataset with Select Columns in Dataset, Clean Missing Data, and Normalize Data modules in completed state

View the transformed data

The dataset is now prepared for model training.

  1. Select the completed Normalize Data module, and in its Settings pane on the right, on the Outputs + logs tab, select the Visualize icon for the Transformed dataset.
  2. View the data, noting that the normalized-losses column has been removed, all rows contain data for bore, stroke, and horsepower, and the numeric columns you selected have been normalized to a common scale.
  3. Close the normalized data result visualization.






Create and run a training pipeline

After you've used data transformations to prepare the data, you can use it to train a machine learning model.

Add training modules

It's common practice to train the model using a subset of the data, while holding back some data with which to test the trained model. This enables you to compare the labels that the model predicts with the actual known labels in the original dataset.

In this exercise, you're going to extend the Auto Price Training pipeline as shown here:

split data, then train with linear regression and score

Follow the steps below, using the image above for reference as you add and configure the required modules.

  1. Open the Auto Price Training pipeline you created in the previous unit if it's not already open.

  2. In the pane on the left, on the Modules tab, in the Data Transformations section, drag a Split Data module onto the canvas under the Normalize Data module. Then connect the Transformed Dataset (left) output of the Normalize Data module to the input of the Split Data module.

  3. Select the Split Data module, and configure its settings as follows (a scikit-learn equivalent of this split-train-score flow is sketched after these steps):

    • Splitting mode: Split Rows
    • Fraction of rows in the first output dataset: 0.7
    • Random seed: 123
    • Stratified split: False
  4. Expand the Model Training section in the pane on the left, and drag a Train Model module to the canvas, under the Split Data module. Then connect the Results dataset1 (left) output of the Split Data module to the Dataset (right) input of the Train Model module.

  5. The model we're training will predict the price value, so select the Train Model module and modify its settings to set the Label column to price (matching the case and spelling exactly!)

  6. The price label the model will predict is a numeric value, so we need to train the model using a regression algorithm. Expand the Machine Learning Algorithms section, and under Regression, drag a Linear Regression module to the canvas, to the left of the Split Data module and above the Train Model module. Then connect its output to the Untrained model (left) input of the Train Model module.

 Note

There are multiple algorithms you can use to train a regression model. For help choosing one, take a look at the Machine Learning Algorithm Cheat Sheet for Azure Machine Learning designer .

  7. To test the trained model, we need to use it to score the validation dataset we held back when we split the original data - in other words, predict labels for the features in the validation dataset. Expand the Model Scoring & Evaluation section and drag a Score Model module to the canvas, below the Train Model module. Then connect the output of the Train Model module to the Trained model (left) input of the Score Model module; and drag the Results dataset2 (right) output of the Split Data module to the Dataset (right) input of the Score Model module.
  8. Ensure your pipeline looks like this:

split data, then train with linear regression and score
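
For comparison, here is the split-train-score flow sketched in scikit-learn. The toy arrays stand in for the prepared feature columns and the price label from the transformed data; the settings mirror the Split Data module above.

Python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression

# Toy stand-ins: in practice these would be the prepared feature columns (X)
# and the price label values (y) from the transformed automobile data.
X = np.random.rand(20, 5)
y = np.random.rand(20) * 10000

X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=0.7, random_state=123)    # mirrors the Split Data settings

model = LinearRegression().fit(X_train, y_train)   # Train Model
scored_labels = model.predict(X_test)              # Score Model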

Run the training pipeline

Now you're ready to run the training pipeline and train the model.

  1. Select Submit, and run the pipeline using the existing experiment named auto-price-training.
  2. Wait for the experiment run to complete. This may take 5 minutes or more.
  3. When the experiment run has completed, select the Score Model module and in the settings pane, on the Outputs + logs tab, under Data outputs in the Scored dataset section, use the Visualize icon to view the results.
  4. Scroll to the right, and note that next to the price column (which contains the known true values of the label) there is a new column named Scored labels, which contains the predicted label values.
  5. Close the Score Model result visualization window.

The model is predicting values for the price label, but how reliable are its predictions? To assess that, you need to evaluate the model.






Evaluate a regression model

To evaluate a regression model, you could simply compare the predicted labels to the actual labels in the validation dataset that you held back during training, but this is an imprecise process and doesn't provide a simple metric that you can use to compare the performance of multiple models.

Add an Evaluate Model module

  1. Open the Auto Price Training pipeline you created in the previous unit if it's not already open.
  2. In the pane on the left, on the Modules tab, in the Model Scoring & Evaluation section, drag an Evaluate Model module to the canvas, under the Score Model module, and connect the output of the Score Model module to the Scored dataset (left) input of the Evaluate Model module.
  3. Ensure your pipeline looks like this:

Evaluate Model module added to Score Model module

  4. Select Submit, and run the pipeline using the existing experiment named auto-price-training.
  5. Wait for the experiment run to complete.
  6. When the experiment run has completed, select the Evaluate Model module and in the settings pane, on the Outputs + logs tab, under Data outputs in the Evaluation results section, use the Visualize icon to view the results. These include the following regression performance metrics (sketched in code after these steps):
    • Mean Absolute Error (MAE): The average difference between predicted values and true values. This value is based on the same units as the label, in this case dollars. The lower this value is, the better the model is predicting.
    • Root Mean Squared Error (RMSE): For this metric, the mean difference between predicted and true values is squared, and then the square root is calculated. The result is a metric based on the same unit as the label (dollars). When compared to the MAE (above), a larger difference indicates greater variance in the individual errors (for example, with some errors being very small, while others are large). If the MAE and RMSE are approximately the same, then all individual errors are of a similar magnitude.
    • Relative Squared Error (RSE): A relative metric between 0 and 1 based on the square of the differences between predicted and true values. The closer to 0 this metric is, the better the model is performing. Because this metric is relative, it can be used to compare models where the labels are in different units.
    • Relative Absolute Error (RAE): A relative metric between 0 and 1 based on the absolute differences between predicted and true values. The closer to 0 this metric is, the better the model is performing. Like RSE, this metric can be used to compare models where the labels are in different units.
    • Coefficient of Determination (R2): This metric is more commonly referred to as R-Squared, and summarizes how much of the variance between predicted and true values is explained by the model. The closer to 1 this value is, the better the model is performing.
  7. Close the Evaluate Model result visualization window.
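
For reference, the same metrics can be sketched with scikit-learn and NumPy, reusing y_test and scored_labels from the split-train-score sketch earlier:

Python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# y_test and scored_labels as produced by the earlier sketch.
mae = mean_absolute_error(y_test, scored_labels)
rmse = np.sqrt(mean_squared_error(y_test, scored_labels))

# Relative metrics compare the model's errors to a naive mean predictor.
rse = (np.sum((y_test - scored_labels) ** 2)
       / np.sum((y_test - np.mean(y_test)) ** 2))
rae = (np.sum(np.abs(y_test - scored_labels))
       / np.sum(np.abs(y_test - np.mean(y_test))))

r2 = r2_score(y_test, scored_labels)
print(mae, rmse, rse, rae, r2)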

You can try a different regression algorithm and compare the results by connecting the same outputs from the Split Data module to a second Train model module (with a different algorithm) and a second Score Model module; and then connecting the outputs of both Score Model modules to the same Evaluate Model module for a side-by-side comparison.

When you've identified a model with evaluation metrics that meet your needs, you can prepare to use that model with new data.






Create an inference pipeline

After creating and running a pipeline to train the model, you need a second pipeline that performs the same data transformations for new data, and then uses the trained model to inference (in other words, predict) label values based on its features. This will form the basis for a predictive service that you can publish for applications to use.

Create and run an inference pipeline

  1. In Azure Machine Learning studio, click the Designer page to view all of the pipelines you have created. Then open the Auto Price Training pipeline you created previously.

  2. In the Create inference pipeline drop-down list, click Real-time inference pipeline. After a few seconds, a new version of your pipeline named Auto Price Training-real time inference will be opened.

    If the pipeline does not include Web Service Input and Web Service Output modules, go back to the Designer page and then re-open the Auto Price Training-real time inference pipeline.

  3. Rename the new pipeline to Predict Auto Price, and then review the new pipeline. It contains a web service input for new data to be submitted, and a web service output to return results. Some of the transformations and training steps have been encapsulated in this pipeline so that the statistics from your training data will be used to normalize any new data values, and the trained model will be used to score the new data.

    You are going to make the following changes to the inference pipeline:

An inference pipeline with changes indicated

  • Replace the Automobile price data (Raw) dataset with an Enter Data Manually module that does not include the label column (price)

  • Modify the Select Columns in Dataset module to remove any reference to the (now absent) price column.

  • Remove the Evaluate Model module.

  • Insert an Execute Python Script module before the web service output to return only the predicted label.

    Follow the remaining steps below, using the image and information above for reference as you modify the pipeline.

  4. The inference pipeline assumes that new data will match the schema of the original training data, so the Automobile price data (Raw) dataset from the training pipeline is included. However, this input data includes the price label that the model predicts, which doesn't make sense to include in new car data for which a price prediction has not yet been made. Delete this module and replace it with an Enter Data Manually module from the Data Input and Output section of the Modules tab, containing the following CSV data, which includes feature values without labels for three cars (copy and paste the entire block of text):

    CSV
    symboling,normalized-losses,make,fuel-type,aspiration,num-of-doors,body-style,drive-wheels,engine-location,wheel-base,length,width,height,curb-weight,engine-type,num-of-cylinders,engine-size,fuel-system,bore,stroke,compression-ratio,horsepower,peak-rpm,city-mpg,highway-mpg
    3,NaN,alfa-romero,gas,std,two,convertible,rwd,front,88.6,168.8,64.1,48.8,2548,dohc,four,130,mpfi,3.47,2.68,9,111,5000,21,27
    3,NaN,alfa-romero,gas,std,two,convertible,rwd,front,88.6,168.8,64.1,48.8,2548,dohc,four,130,mpfi,3.47,2.68,9,111,5000,21,27
    1,NaN,alfa-romero,gas,std,two,hatchback,rwd,front,94.5,171.2,65.5,52.4,2823,ohcv,six,152,mpfi,2.68,3.47,9,154,5000,19,26
    
  5. Connect the new Enter Data Manually module to the same dataset input of the Select Columns in Dataset module as the Web Service Input.

  6. Now that you've changed the schema of the incoming data to exclude the price field, you need to remove any explicit uses of this field in the remaining modules. Select the Select Columns in Dataset module and then in the settings pane, edit the columns to remove the price field.

  7. The inference pipeline includes the Evaluate Model module, which is not useful when predicting from new data, so delete this module.

  8. The output from the Score Model module includes all of the input features as well as the predicted label. To modify the output to include only the prediction:

    • Delete the connection between the Score Model module and the Web Service Output.

    • Add an Execute Python Script module from the Python Language section, replacing all of the default Python script with the following code (which selects only the Scored Labels column and renames it to predicted_price):

      Python
      import pandas as pd
      
      def azureml_main(dataframe1 = None, dataframe2 = None):
      
          # Keep only the predicted label column (copy to avoid mutating a view)
          scored_results = dataframe1[['Scored Labels']].copy()
      
          # Rename the column to a more meaningful name for the service output
          scored_results.rename(columns={'Scored Labels':'predicted_price'},
                                inplace=True)
          return scored_results
      
    • Connect the output from the Score Model module to the Dataset1 (left-most) input of the Execute Python Script, and connect the output of the Execute Python Script module to the Web Service Output.

  9. Verify that your pipeline looks similar to the following:

A visual inference pipeline

  10. Submit the pipeline as a new experiment named predict-auto-price on your compute cluster. This may take a while!
  11. When the pipeline has completed, select the Execute Python Script module, and in the settings pane, on the Output + logs tab, visualize the Result dataset to see the predicted prices for the three cars in the input data.
  12. Close the visualization window.

Your inference pipeline predicts prices for cars based on their features. Now you're ready to publish the pipeline so that client applications can use it.





Deploy a predictive service

After you've created and tested an inference pipeline for real-time inferencing, you can publish it as a service for client applications to use.

To publish a real-time inference pipeline as a service, you must deploy it to an Azure Kubernetes Service (AKS) cluster. In this exercise, you'll use the AKS inference cluster you created previously in this module.
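
In code (v1 SDK), the main difference from an ACI deployment is the deployment configuration and target. A sketch, assuming ws is your workspace object and using a placeholder cluster name:

Python
from azureml.core.compute import AksCompute
from azureml.core.webservice import AksWebservice

# 'my-inference-cluster' is a placeholder for the AKS cluster created earlier.
aks_target = AksCompute(ws, 'my-inference-cluster')
aks_config = AksWebservice.deploy_configuration(cpu_cores=1, memory_gb=1,
                                                auth_enabled=True)

# In this exercise the pipeline endpoint is deployed from the designer UI;
# Model.deploy(..., deployment_config=aks_config, deployment_target=aks_target)
# is the code-based equivalent for deploying a registered model.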

Deploy a service

  1. View the Predict Auto Price inference pipeline you created in the previous unit.
  2. At the top right, select Deploy, and set up a new real-time endpoint named predict-auto-price on the inference cluster you created previously.
  3. Wait for the web service to be deployed - this can take several minutes. The deployment status is shown at the top left of the designer interface.

Test the service

Now you can test your deployed service from a client application - in this case, you'll use code in a notebook to simulate a client application.

  1. On the Endpoints page, open the predict-auto-price real-time endpoint.

  2. When the predict-auto-price endpoint opens, view the Consume tab and note the following information there. You need this to connect to your deployed service from a client application.

    • The REST endpoint for your service
    • The Primary Key for your service
  3. Observe that you can use the ⧉ link next to these values to copy them to the clipboard.

  4. With the Consume page for the predict-auto-price service open in your browser, open a new browser tab and open a second instance of Azure Machine Learning studio. Then in the new tab, view the Notebooks page.

  5. On the Notebooks page, create a new file with the following settings:

    • File name: auto_test.ipynb
    • File type: Notebook
    • Overwrite if already exists: Selected
    • Select target directorySelect the folder with your user name under User files
  6. When the new notebook has been created, ensure that the compute instance you created previously is selected in the Compute box, and that it has a status of Running.

  7. Edit the notebook inline, and in the cell that has been created in the notebook, paste the following code:

    Python
    endpoint = 'YOUR_ENDPOINT' #Replace with your endpoint
    key = 'YOUR_KEY' #Replace with your key
    
    import urllib.request
    import json
    
    # Prepare the input data
    data = {
        "Inputs": {
            "WebServiceInput0":
            [
                {
                        'symboling': 3,
                        'normalized-losses': None,
                        'make': "alfa-romero",
                        'fuel-type': "gas",
                        'aspiration': "std",
                        'num-of-doors': "two",
                        'body-style': "convertible",
                        'drive-wheels': "rwd",
                        'engine-location': "front",
                        'wheel-base': 88.6,
                        'length': 168.8,
                        'width': 64.1,
                        'height': 48.8,
                        'curb-weight': 2548,
                        'engine-type': "dohc",
                        'num-of-cylinders': "four",
                        'engine-size': 130,
                        'fuel-system': "mpfi",
                        'bore': 3.47,
                        'stroke': 2.68,
                        'compression-ratio': 9,
                        'horsepower': 111,
                        'peak-rpm': 5000,
                        'city-mpg': 21,
                        'highway-mpg': 27,
                },
            ],
        },
        "GlobalParameters":  {
        }
    }
    body = str.encode(json.dumps(data))
    headers = {'Content-Type':'application/json', 'Authorization':('Bearer '+ key)}
    req = urllib.request.Request(endpoint, body, headers)
    
    try:
        response = urllib.request.urlopen(req)
        result = response.read()
        json_result = json.loads(result)
        y = json_result["Results"]["WebServiceOutput0"][0]["predicted_price"]
        print('Predicted price: {:.2f}'.format(y))
    
    except urllib.error.HTTPError as error:
        print("The request failed with status code: " + str(error.code))
    
        # Print the headers to help debug the error
        print(error.info())
        print(json.loads(error.read().decode("utf8", 'ignore')))
    

     Note

    Don't worry too much about the details of the code. It just submits details of a car and uses the predict-auto-price service you created to get a predicted price.

  8. Switch to the browser tab containing the Consume page for the predict-auto-price service, and copy the REST endpoint for your service. Then switch back to the tab containing the notebook and paste the endpoint into the code, replacing YOUR_ENDPOINT.

  9. Switch to the browser tab containing the Consume page for the predict-auto-price service, and copy the Primary Key for your service. Then switch back to the tab containing the notebook and paste the key into the code, replacing YOUR_KEY.

  10. Save the notebook. Then use the ▷ button next to the cell to run the code.

  11. Verify that a predicted price is returned.




Reset resources

The web service you created is hosted in an Azure Kubernetes Service (AKS) cluster. If you don't intend to experiment with it further, you should delete the endpoint and cluster to avoid accruing unnecessary Azure charges. You should also stop the compute instance until you need it again.

  1. In Azure Machine Learning studio, on the Endpoints tab, select the predict-auto-price endpoint. Then select Delete (🗑) and confirm that you want to delete the endpoint.
  2. On the Compute page, on the Compute Instances tab, select your compute instance and then select Stop.
  3. On the Compute page, on the Compute clusters tab, open your compute cluster and select Edit. Then set the Minimum number of nodes setting to 0 and select Update.
  4. On the Compute page, on the Inference clusters tab, select your inference cluster and select Delete, and confirm you want to delete the cluster.

If you have finished exploring Azure Machine Learning, you can delete the resource group containing your Azure Machine Learning workspace from your Azure subscription:

  1. In the Azure portal, on the Resource groups page, open the resource group you specified when creating your Azure Machine Learning workspace.
  2. Click Delete resource group, type the resource group name to confirm you want to delete it, and select Delete.



Create a classification model with Azure Machine Learning designer

Introduction

Classification is a form of machine learning that is used to predict which category, or class, an item belongs to. For example, a health clinic might use the characteristics of a patient (such as age, weight, blood pressure, and so on) to predict whether the patient is at risk of diabetes. In this case, the characteristics of the patient are the features, and the label is a classification of either 0 or 1, representing non-diabetic or diabetic.

Patients with clinical data, classified as diabetic and non-diabetic

Classification is an example of a supervised machine learning technique in which you train a model using data that includes both the features and known values for the label, so that the model learns to fit the feature combinations to the label. Then, after training has been completed, you can use the trained model to predict labels for new items for which the label is unknown.
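
As a concrete (if simplified) illustration of supervised classification, the following scikit-learn sketch trains on entirely hypothetical clinical data:

Python
# Minimal classification sketch - all data is hypothetical.
# Features per patient: [age, weight_kg, systolic_bp]; label: 1 = diabetic, 0 = non-diabetic.
from sklearn.linear_model import LogisticRegression

X = [[50, 95, 140], [31, 61, 118], [62, 88, 150], [24, 55, 110]]
y = [1, 0, 1, 0]   # known labels used for training

model = LogisticRegression().fit(X, y)
print(model.predict([[45, 90, 135]]))   # predict the class for a new patient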

You can use Microsoft Azure Machine Learning designer to create classification models by using a drag and drop visual interface, without needing to write any code.

In this module, you'll learn how to:

  • Use Azure Machine Learning designer to train a classification model.
  • Use a classification model for inferencing.
  • Deploy a classification model as a service.

To complete this module, you'll need a Microsoft Azure subscription. If you don't already have one, you can sign up for a free trial at https://azure.microsoft.com .



Create an Azure Machine Learning workspace

Azure Machine Learning is a cloud-based platform for building and operating machine learning solutions in Azure. It includes a wide range of features and capabilities that help data scientists prepare data, train models, publish predictive services, and monitor their usage. One of these features is a visual interface called designer, that you can use to train, test, and deploy machine learning models without writing any code.

Create an Azure Machine Learning workspace

To use Azure Machine Learning, you create a workspace in your Azure subscription. You can then use this workspace to manage data, compute resources, code, models, and other artifacts related to your machine learning workloads.

If you do not already have one, follow these steps to create a workspace:

  1. Sign in to the Azure portal using your Microsoft credentials.
  2. Select +Create a resource, search for Machine Learning, and create a new Machine Learning resource with the following settings:
    • Workspace Name: A unique name of your choice
    • Subscription: Your Azure subscription
    • Resource group: Create a new resource group with a unique name
    • Location: Choose any available location
    • Workspace edition: Enterprise
  3. Wait for your workspace to be created (it can take a few minutes). Then go to it in the portal.
  4. On the Overview page for your workspace, launch Azure Machine Learning studio (or open a new browser tab and navigate to https://ml.azure.com), and sign in to Azure Machine Learning studio using your Microsoft account.
  5. In Azure Machine Learning studio, toggle the ☰ icon at the top left to view the various pages in the interface. You can use these pages to manage the resources in your workspace.

You can manage your workspace using the Azure portal, but for data scientists and Machine Learning operations engineers, Azure Machine Learning studio provides a more focused user interface for managing workspace resources.

 Important

If you intend to use an Azure Machine Learning workspace that you created previously using the Basic edition, upgrade it to Enterprise edition to make the designer interface available.



Create compute resources

To train and deploy models using Azure Machine Learning designer, you need compute on which to run the training process, test the model, and host the model in a deployed service.

Create compute targets

Compute targets are cloud-based resources on which you can run model training and data exploration processes.

  1. In Azure Machine Learning studio, view the Compute page (under Manage). This is where you manage the compute targets for your data science activities. There are four kinds of compute resource you can create:
    • Compute Instances: Development workstations that data scientists can use to work with data and models.
    • Compute Clusters: Scalable clusters of virtual machines for on-demand processing of experiment code.
    • Inference Clusters: Deployment targets for predictive services that use your trained models.
    • Attached Compute: Links to existing Azure compute resources, such as Virtual Machines or Azure Databricks clusters.
  2. On the Compute Instances tab, add a new compute instance with the following settings. You'll use this to test your model:
    • Compute name: Enter a unique name
    • Virtual Machine type: CPU
    • Virtual Machine size: Standard_DS2_v2
  3. While the compute instance is being created, switch to the Compute Clusters tab, and add a new compute cluster with the following settings. You'll use this to train a machine learning model:
    • Compute name: Enter a unique name
    • Virtual Machine size: Standard_DS2_v2
    • Virtual Machine priority: Dedicated
    • Minimum number of nodes: 2
    • Maximum number of nodes: 2
    • Idle seconds before scale down: 120
  4. While the compute cluster is being created, switch to the Inference Clusters tab, and add a new cluster with the following settings. You'll use this to deploy your model as a service.
    • Compute name: Enter a unique name
    • Kubernetes Service: Create new
    • Region: Select a different region than the one used for your workspace
    • Virtual Machine size: Standard_DS2_v2 (Use the filter to find this in the list)
    • Cluster purpose: Dev-test
    • Number of nodes: 2
    • Network configuration: Basic
    • Enable SSL configuration: Unselected
  5. Verify that the inference cluster is in the Creating state - it will take a while to be created, so leave it for now.

 Note

In a production environment, you'd typically set the minimum number of nodes value to 0 so that compute is only started when it is needed. However, compute can take a while to start, so to reduce the amount of time you spend waiting for it in this module, you've initialized it with two permanently running nodes.
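
For reference, the same kind of compute cluster can be provisioned in code. The following is a hedged sketch using the azureml-core Python SDK (v1), assuming it's installed and a workspace config.json is available; the cluster name aml-cluster is just an example. Note that it sets the minimum number of nodes to 0, per the production guidance above.

Python
from azureml.core import Workspace
from azureml.core.compute import AmlCompute, ComputeTarget

ws = Workspace.from_config()

# Define a cluster with the same settings used in the exercise,
# except min_nodes=0 so the compute scales to zero when idle
config = AmlCompute.provisioning_configuration(vm_size='Standard_DS2_v2',
                                               vm_priority='dedicated',
                                               min_nodes=0,
                                               max_nodes=2,
                                               idle_seconds_before_scaledown=120)
cluster = ComputeTarget.create(ws, 'aml-cluster', config)
cluster.wait_for_completion(show_output=True)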

If you decide not to complete this module, be sure to stop your compute instance, edit the compute cluster to reset the minimum number of nodes to 0, and delete the inference cluster in order to avoid leaving your compute running and incurring unnecessary charges to your Azure subscription. Alternatively, if you're finished exploring Azure Machine Learning, delete the entire resource group in your Azure subscription.

The compute targets will take some time to be created. You can move on to the next unit while you wait.



Create a dataset

In Azure Machine Learning, data for model training and other operations is usually encapsulated in an object called a dataset.
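
For reference, a dataset can also be created and registered in code. Here's a hedged sketch using the azureml-core Python SDK (v1), assuming it's installed and a workspace config.json is available; the source URL is a placeholder for the web file used in the exercise.

Python
from azureml.core import Dataset, Workspace

ws = Workspace.from_config()

# Create a tabular dataset from a delimited file at a web URL (placeholder),
# then register it in the workspace so it appears on the Datasets page
dataset = Dataset.Tabular.from_delimited_files(path='https://example.com/diabetes.csv')
dataset = dataset.register(workspace=ws, name='diabetes-data',
                           description='diabetes data for classification')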

  1. In Azure Machine Learning studio, view the Datasets page. Datasets represent specific data files or tables that you plan to work with in Azure ML.
  2. Create a dataset from web files, using the following settings:
    • Basic Info:
      • Name: diabetes-data
    • Settings and preview:
      • File format: Delimited
      • Delimiter: Comma
      • Encoding: UTF-8
      • Column headers: Use headers from first file
      • Skip rows: None
    • Schema:
      • Include all columns other than Path
      • Review the automatically detected types
    • Confirm details:
      • Do not profile the dataset after creation
  3. After the dataset has been created, open it and view the Explore page to see a sample of the data. This data represents details from patients who have been tested for diabetes.

Create a pipeline

To get started with Azure Machine Learning designer, first you must create a pipeline and add the dataset you want to work with.

  1. In Azure Machine Learning studio for your workspace, view the Designer page and select + to create a new pipeline.
  2. In the Settings pane, change the default pipeline name (Pipeline-Created-on-date) to Diabetes Training (if the Settings pane is not visible, select the ⚙ icon next to the pipeline name at the top).
  3. Note that you need to specify a compute target on which to run the pipeline. In the Settings pane, click Select compute target and select the compute cluster you created previously.
  4. On the left side of the designer, select the Datasets (⌕) tab, expand the Datasets section, and drag the diabetes-data dataset you created in the previous exercise onto the canvas.
  5. Right-click (Ctrl+click on a Mac) the diabetes-data dataset on the canvas, and on the Visualize menu, select Dataset output.
  6. Review the schema of the data, noting that you can see the distributions of the various columns as histograms.
  7. Scroll to the right and select the column heading for the Diabetic column, and note that it contains two values 0 and 1. These values represent the two possible classes for the label that your model will predict, with a value of 0 meaning that the patient does not have diabetes, and a value of 1 meaning that the patient is diabetic.
  8. Scroll back to the left and review the other columns, which represent the features that will be used to predict the label. Note that most of these columns are numeric, but each feature is on its own scale. For example, Age values range from 21 to 77, while DiabetesPedigree values range from 0.078 to 2.3016. When training a machine learning model, it is sometimes possible for larger values to dominate the resulting predictive function, reducing the influence of features that are on a smaller scale. Typically, data scientists mitigate this possible bias by normalizing the numeric columns so they're on similar scales (a short code sketch of what this looks like follows these steps).
  9. Close the diabetes-data result visualization window so that you can see the dataset on the canvas like this:

The diabetes-data dataset on the designer canvas
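
As promised, here's a short sketch of what min-max normalization does, assuming Python with pandas (the sample values echo the Age and DiabetesPedigree ranges noted above). The Normalize Data module you'll add next applies the same transformation without code.

Python
import pandas as pd

df = pd.DataFrame({'Age': [21, 43, 77],
                   'DiabetesPedigree': [0.078, 1.35, 2.3016]})

# MinMax scaling: (value - min) / (max - min) maps each column onto the 0-1 range
normalized = (df - df.min()) / (df.max() - df.min())
print(normalized)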

Add Transformations

Before you can train a model, you typically need to apply some preprocessing transformations to the data.

  1. In the pane on the left, view the Modules (⊞) tab and expand the Data Transformation section, which contains a wide range of modules you can use to transform data before model training.
  2. Drag a Normalize Data module to the canvas, below the diabetes-data dataset. Then connect the output from the bottom of the diabetes-data dataset to the input at the top of the Normalize Data module, like this:

A pipeline with the diabetes-data dataset connected to a Normalize Data module

  3. Select the Normalize Data module and view its settings, noting that it requires you to specify the transformation method and the columns to be transformed.
  4. Set the transformation to MinMax and edit the columns to include the following columns by name, as shown in the image:
    • Pregnancies
    • PlasmaGlucose
    • DiastolicBloodPressure
    • TricepsThickness
    • SerumInsulin
    • BMI
    • DiabetesPedigree
    • Age

columns selected for normalization

The data transformation is normalizing the numeric columns to put them on the same scale, which should help prevent columns with large values from dominating model training. You'd usually apply a whole range of pre-processing transformations like this to prepare your data for training, but we'll keep things simple in this exercise.

Run the pipeline

To apply your data transformations, you need to run the pipeline as an experiment.

  1. Ensure your pipeline looks similar to this:

diabetes-data dataset with Normalize Data module

  2. Select Submit, and run the pipeline as a new experiment named diabetes-training on your compute cluster.
  3. Wait for the run to finish - this may take a few minutes.

View the transformed data

The dataset is now prepared for model training.

  1. Select the completed Normalize Data module, and in its Settings pane on the right, on the Outputs + logs tab, select the Visualize icon for the Transformed dataset.
  2. View the data, noting that the numeric columns you selected have been normalized to a common scale.
  3. Close the normalized data result visualization.


Create and run a training pipeline

After you've used data transformations to prepare the data, you can use it to train a machine learning model.

Add training modules

It's common practice to train the model using a subset of the data, while holding back some data with which to test the trained model. This enables you to compare the labels that the model predicts with the actual known labels in the original dataset.
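
Before building this visually, it may help to see a rough code equivalent of the split-train-score flow, assuming Python with pandas and scikit-learn; diabetes.csv is a hypothetical local copy of the dataset. The designer modules you'll add below do the same work without code.

Python
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

df = pd.read_csv('diabetes.csv')
X = df.drop(columns=['PatientID', 'Diabetic'])   # features
y = df['Diabetic']                               # label

# Split Data: 70% of rows for training, 30% held back for scoring (seed 123)
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.7,
                                                    random_state=123)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)   # Train Model
scored_labels = model.predict(X_test)                             # Score Model: labels
scored_probabilities = model.predict_proba(X_test)[:, 1]          # probability of class 1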

In this exercise, you're going to extend the Diabetes Training pipeline as shown here:

split data, then train with logistic regression and score

Follow the steps below, using the image above for reference as you add and configure the required modules.

  1. Open the Diabetes Training pipeline you created in the previous unit if it's not already open.
  2. In the pane on the left, on the Modules tab, in the Data Transformations section, drag a Split Data module onto the canvas under the Normalize Data module. Then connect the Transformed Dataset (left) output of the Normalize Data module to the input of the Split Data module.
  3. Select the Split Data module, and configure its settings as follows:
    • Splitting mode: Split Rows
    • Fraction of rows in the first output dataset: 0.7
    • Random seed: 123
    • Stratified split: False
  4. Expand the Model Training section in the pane on the left, and drag a Train Model module to the canvas, under the Split Data module. Then connect the Result dataset1 (left) output of the Split Data module to the Dataset (right) input of the Train Model module.
  5. The model we're training will predict the Diabetic value, so select the Train Model module and modify its settings to set the Label column to Diabetic (matching the case and spelling exactly!)
  6. The Diabetic label the model will predict is a class (0 or 1), so we need to train the model using a classification algorithm. Specifically, there are two possible classes, so we need a binary classification algorithm. Expand the Machine Learning Algorithms section, and under Classification, drag a Two-Class Logistic Regression module to the canvas, to the left of the Split Data module and above the Train Model module. Then connect its output to the Untrained model (left) input of the Train Model module.

 Note

There are multiple algorithms you can use to train a classification model. For help choosing one, take a look at the Machine Learning Algorithm Cheat Sheet for Azure Machine Learning designer.

  7. To test the trained model, we need to use it to score the validation dataset we held back when we split the original data - in other words, predict labels for the features in the validation dataset. Expand the Model Scoring & Evaluation section and drag a Score Model module to the canvas, below the Train Model module. Then connect the output of the Train Model module to the Trained model (left) input of the Score Model module; and connect the Results dataset2 (right) output of the Split Data module to the Dataset (right) input of the Score Model module.
  8. Ensure your pipeline looks like this:

split data, then train with logistic regression and score

Run the training pipeline

Now you're ready to run the training pipeline and train the model.

  1. Select Submit, and run the pipeline using the existing experiment named diabetes-training.
  2. Wait for the experiment run to finish. This may take 5 minutes or more.
  3. When the experiment run has finished, select the Score Model module and in the settings pane, on the Outputs + Logs tab, under Data outputs in the Scored dataset section, use the Visualize icon to view the results.
  4. Scroll to the right, and note that next to the Diabetic column (which contains the known true values of the label) there is a new column named Scored Labels, which contains the predicted label values, and a Scored Probabilities column containing a probability value between 0 and 1. This indicates the probability of a positive prediction, so probabilities greater than 0.5 result in a predicted label of 1 (diabetic), while probabilities between 0 and 0.5 result in a predicted label of 0 (not diabetic).
  5. Close the Score Model result visualization window.

The model is predicting values for the Diabetic label, but how reliable are its predictions? To assess that, you need to evaluate the model.




Evaluate a classification model

The validation data you held back and used to score the model includes the known values for the label. So to validate the model, you can compare the true values for the label to the label values that were predicted when you scored the validation dataset. Based on this comparison, you can calculate various metrics that describe how well the model performs.

Add an Evaluate Model module

  1. Open the Diabetes Training pipeline you created in the previous unit if it's not already open.
  2. In the pane on the left, on the Modules tab, in the Model Scoring & Evaluation section, drag an Evaluate Model module to the canvas, under the Score Model module, and connect the output of the Score Model module to the Scored dataset (left) input of the Evaluate Model module.
  3. Ensure your pipeline looks like this:

Evaluate Model module added to Score Model module

  4. Select Submit, and run the pipeline using the existing experiment named diabetes-training.
  5. Wait for the experiment run to finish.
  6. When the experiment run has finished, select the Evaluate Model module and in the settings pane, on the Outputs + Logs tab, under Data outputs in the Evaluation results section, use the Visualize icon to view the performance metrics. These metrics can help data scientists assess how well the model predicts based on the validation data.
  7. View the confusion matrix for the model, which is a tabulation of the predicted and actual value counts for each possible class. For a binary classification model like this one, where you're predicting one of two possible values, the confusion matrix is a 2x2 grid showing the predicted and actual value counts for classes 0 and 1, similar to this:

A confusion matrix showing actual and predicted value counts for each class

The confusion matrix shows cases where both the predicted and actual values were 1 (known as true positives) at the top left, and cases where both the predicted and the actual values were 0 (true negatives) at the bottom right. The other cells show cases where the predicted and actual values differ (false positives and false negatives). The cells in the matrix are colored so that the more cases represented in the cell, the more intense the color - with the result that you can identify a model that predicts accurately for all classes by looking for a diagonal line of intensely colored cells from the top left to the bottom right (in other words, the cells where the predicted values match the actual values). For a multi-class classification model (where there are more than two possible classes), the same approach is used to tabulate each possible combination of actual and predicted value counts - so a model with three possible classes would result in a 3x3 matrix with a diagonal line of cells where the predicted and actual labels match.
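
For intuition, here's a hedged sketch of tabulating a confusion matrix in code, assuming Python with scikit-learn and some invented labels. (Note that scikit-learn places true negatives at the top left, whereas the designer's chart shows true positives there; the idea is the same.)

Python
from sklearn.metrics import confusion_matrix

y_true = [1, 0, 1, 1, 0, 0, 1, 0]   # actual labels
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]   # predicted labels

print(confusion_matrix(y_true, y_pred))
# [[3 1]   <- 3 true negatives, 1 false positive
#  [1 3]]  <- 1 false negative, 3 true positives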

  8. Review the metrics to the left of the confusion matrix, which include (a code sketch after these steps shows how each can be computed):

    • Accuracy: The ratio of correct predictions (true positives + true negatives) to the total number of predictions. In other words, what proportion of diabetes predictions did the model get right?
    • Precision: The fraction of positive cases correctly identified (the number of true positives divided by the number of true positives plus false positives). In other words, out of all the patients that the model predicted as having diabetes, how many are actually diabetic?
    • Recall: The fraction of the cases classified as positive that are actually positive (the number of true positives divided by the number of true positives plus false negatives). In other words, out of all the patients who actually have diabetes, how many did the model identify?
    • F1 Score: An overall metric that essentially combines precision and recall.
    • We'll return to AUC later.

    Of these metrics, accuracy is the most intuitive. However, you need to be careful about using simple accuracy as a measurement of how well a model works. Suppose that only 3% of the population is diabetic. You could create a model that always predicts 0 and it would be 97% accurate - just not very useful! For this reason, most data scientists use other metrics like precision and recall to assess classification model performance.

  9. Above the list of metrics, note that there's a Threshold slider. Remember that what a classification model predicts is the probability for each possible class. In the case of this binary classification model, the predicted probability for a positive (that is, diabetic) prediction is a value between 0 and 1. By default, a predicted probability for diabetes above 0.5 results in a class prediction of 1, while a prediction below this threshold means that there's a greater probability of the patient not having diabetes (remember that the probabilities for all classes add up to 1), so the predicted class would be 0. Try moving the threshold slider and observe the effect on the confusion matrix. If you move it all the way to the left (0), the Recall metric becomes 1, and if you move it all the way to the right (1), the Recall metric becomes 0.

  10. Look above the Threshold slider at the ROC curve (ROC stands for receiver operating characteristic, but most data scientists just call it a ROC curve). Another term for recall is True positive rate, and it has a corresponding metric named False positive rate, which measures the number of negative cases incorrectly identified as positive compared to the number of actual negative cases. Plotting these metrics against each other for every possible threshold value between 0 and 1 results in a curve. In an ideal model, the curve would go all the way up the left side and across the top, so that it covers the full area of the chart. The larger the area under the curve (which can be any value from 0 to 1), the better the model is performing - this is the AUC metric listed with the other metrics below. To get an idea of how this area represents the performance of the model, imagine a straight diagonal line from the bottom left to the top right of the ROC chart. This represents the expected performance if you just guessed or flipped a coin for each patient - you could expect to get around half of them right, and half of them wrong, so the area under the diagonal line represents an AUC of 0.5. If the AUC for your model is higher than this for a binary classification model, then the model performs better than a random guess.

  11. Close the Evaluate Model result visualization window.
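
As mentioned above, here's a hedged sketch of computing these metrics in code, assuming Python with scikit-learn and invented labels and probabilities. Moving the Threshold slider is equivalent to changing the threshold value below; AUC is computed from the probabilities directly, so it doesn't depend on the threshold.

Python
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, roc_auc_score)

y_true = [1, 0, 1, 1, 0, 0, 1, 0]                    # actual labels
y_scores = [0.9, 0.2, 0.4, 0.8, 0.1, 0.7, 0.6, 0.3]  # predicted probabilities

threshold = 0.5                                      # the default decision threshold
y_pred = [1 if p > threshold else 0 for p in y_scores]

print('Accuracy :', accuracy_score(y_true, y_pred))
print('Precision:', precision_score(y_true, y_pred))
print('Recall   :', recall_score(y_true, y_pred))
print('F1 score :', f1_score(y_true, y_pred))
print('AUC      :', roc_auc_score(y_true, y_scores))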

The performance of this model isn't all that great, partly because we performed only minimal feature engineering and pre-processing. You could try a different classification algorithm, such as Two-Class Decision Forest, and compare the results. You can connect the outputs of the Split Data module to multiple Train Model and Score Model modules, and you can connect a second Score Model module to the Evaluate Model module to see a side-by-side comparison. The point of the exercise is simply to introduce you to classification and the Azure Machine Learning designer interface, not to train a perfect model!


Create an inference pipeline

After creating and running a pipeline to train the model, you need a second pipeline that performs the same data transformations for new data, and then uses the trained model to infer (in other words, predict) label values based on its features. This pipeline will form the basis for a predictive service that you can publish for applications to use.
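
Conceptually, the inference pipeline does something like the following hedged sketch, assuming Python with pandas and a scaler and model saved with joblib during training (the file names are hypothetical). The key point is that the statistics fitted on the training data are re-applied to new data before scoring.

Python
import joblib
import pandas as pd

scaler = joblib.load('scaler.pkl')   # transformation fitted on the training data
model = joblib.load('model.pkl')     # the trained classification model

new_patients = pd.DataFrame([{'Pregnancies': 9, 'PlasmaGlucose': 104,
                              'DiastolicBloodPressure': 51, 'TricepsThickness': 7,
                              'SerumInsulin': 24, 'BMI': 27.37,
                              'DiabetesPedigree': 1.35, 'Age': 43}])

features = scaler.transform(new_patients)          # apply the training-time transformation
print(model.predict(features))                     # predicted label
print(model.predict_proba(features)[:, 1])         # probability of a positive prediction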

Create an inference pipeline

  1. In Azure Machine Learning studio, click the Designer page to view all of the pipelines you have created. Then open the Diabetes Training pipeline you created previously.

  2. In the Create inference pipeline drop-down list, click Real-time inference pipeline. After a few seconds, a new version of your pipeline named Diabetes Training-real time inference will be opened.

    If the pipeline does not include Web Service Input and Web Service Output modules, go back to the Designer page and then re-open the Diabetes Training-real time inference pipeline.

  3. Rename the new pipeline to Predict Diabetes, and then review the new pipeline. It contains a web service input for new data to be submitted, and a web service output to return results. Some of the transformations and training steps have been encapsulated in this pipeline so that the statistics from your training data will be used to normalize any new data values, and the trained model will be used to score the new data.

    You are going to make the following changes to the inference pipeline:

An inference pipeline with changes indicated

  • Replace the diabetes-data dataset with an Enter Data Manually module that does not include the label column (Diabetic).

  • Remove the Evaluate Model module.

  • Insert an Execute Python Script module before the web service output to return only the patient ID, predicted label value, and probability.

    Follow the remaining steps below, using the image and information above for reference as you modify the pipeline.

  4. The inference pipeline assumes that new data will match the schema of the original training data, so the diabetes-data dataset from the training pipeline is included. However, this input data includes the Diabetic label that the model predicts, which doesn't make sense to include in new patient data for which a diabetes prediction has not yet been made. Delete this module and replace it with an Enter Data Manually module from the Data Input and Output section on the Modules tab, containing the following CSV data, which includes feature values without labels for three new patient observations:

    CSV
    PatientID,Pregnancies,PlasmaGlucose,DiastolicBloodPressure,TricepsThickness,SerumInsulin,BMI,DiabetesPedigree,Age
    1882185,9,104,51,7,24,27.36983156,1.350472047,43
    1662484,6,73,61,35,24,18.74367404,1.074147566,75
    1228510,4,115,50,29,243,34.69215364,0.741159926,59
    
  5. Connect the new Enter Data Manually module to the same Dataset input of the Apply Transformation module as the Web Service Input.

  6. The inference pipeline includes the Evaluate Model module, which is not useful when predicting from new data, so delete this module.

  7. The output from the Score Model module includes all of the input features as well as the predicted label and probability score. To limit the output to only the prediction and probability:

    • Delete the connection between the Score Model module and the Web Service Output.

    • Add an Execute Python Script module from the Python Language section, replacing all of the default Python script with the following code (which selects only the PatientID, Scored Labels, and Scored Probabilities columns and renames them appropriately):

      Python
      import pandas as pd
      
      def azureml_main(dataframe1 = None, dataframe2 = None):
          # Keep only the ID, predicted label, and probability columns, and
          # rename the scored columns to friendlier names for the service output
          scored_results = dataframe1[['PatientID', 'Scored Labels', 'Scored Probabilities']].rename(
              columns={'Scored Labels': 'DiabetesPrediction',
                       'Scored Probabilities': 'Probability'})
          return scored_results
      
    • Connect the output from the Score Model module to the Dataset1 (left-most) input of the Execute Python Script module, and connect the output of the Execute Python Script module to the Web Service Output.

  8. Verify that your pipeline looks similar to the following:

A visual inference pipeline

  9. Run the pipeline as a new experiment named predict-diabetes on your compute cluster. This may take a while!
  10. When the pipeline has finished, select the Execute Python Script module, and in the settings pane, on the Output + Logs tab, visualize the Result dataset to see the predicted labels and probabilities for the three patient observations in the input data.

Your inference pipeline predicts whether or not patients are at risk for diabetes based on their features. Now you're ready to publish the pipeline so that client applications can use it.



Deploy a predictive service

After you've created and tested an inference pipeline for real-time inferencing, you can publish it as a service for client applications to use.

To publish a real-time inference pipeline as a service, you must deploy it to an Azure Kubernetes Service (AKS) cluster. In this exercise, you'll use the AKS inference cluster you created previously in this module.
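
For reference, deploying a registered model to AKS can also be done in code. Here's a hedged sketch using the azureml-core Python SDK (v1); the model name, cluster name, entry script, and environment file are all hypothetical - the designer's Deploy option generates the equivalent assets for you.

Python
from azureml.core import Environment, Model, Workspace
from azureml.core.compute import AksCompute
from azureml.core.model import InferenceConfig
from azureml.core.webservice import AksWebservice

ws = Workspace.from_config()
model = Model(ws, name='diabetes-model')              # a previously registered model
env = Environment.from_conda_specification('inference-env', 'environment.yml')

inference_config = InferenceConfig(entry_script='score.py', environment=env)
deployment_config = AksWebservice.deploy_configuration(cpu_cores=1, memory_gb=1)
aks_target = AksCompute(ws, 'inference-cluster')      # the AKS cluster created earlier

service = Model.deploy(ws, 'predict-diabetes', [model],
                       inference_config, deployment_config, aks_target)
service.wait_for_deployment(show_output=True)
print(service.scoring_uri)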

Deploy a service

  1. View the Predict Diabetes inference pipeline you created in the previous unit.
  2. At the top right, select Deploy, and set up a new real-time endpoint named predict-diabetes on the inference cluster you created previously.
  3. Wait for the web service to be deployed - this can take several minutes. The deployment status is shown at the top left of the designer interface.

Test the service

Now you can test your deployed service from a client application - in this case, you'll use the code in the cell below to simulate a client application.

  1. On the Endpoints page, open the predict-diabetes real-time endpoint.

  2. When the predict-diabetes endpoint opens, view the Consume tab and note the following information there. You need this to connect to your deployed service from a client application.

    • The REST endpoint for your service
    • The Primary Key for your service
  3. Note that you can use the ⧉ link next to these values to copy them to the clipboard.

  4. With the Consume page for the predict-diabetes service open in your browser, open a new browser tab and open a second instance of Azure Machine Learning studio. Then in the new tab, view the Notebooks page.

  5. On the Notebooks page, create a new file with the following settings:

    • File name: diabetes_test.ipynb
    • File type: Notebook
    • Overwrite if already exists: Selected
    • Select target directory: Select the folder with your user name under User files
  6. When the new notebook has been created, ensure that the compute instance you created previously is selected in the Compute box, and that it has a status of Running.

  7. Edit the notebook inline, and in the cell that has been created in the notebook, paste the following code:

    Python
    endpoint = 'YOUR_ENDPOINT' #Replace with your endpoint
    key = 'YOUR_KEY' #Replace with your key
    
    import urllib.request
    import json
    import os
    
    data = {
        "Inputs": {
            "WebServiceInput0":
            [
                {
                        'PatientID': 1882185,
                        'Pregnancies': 9,
                        'PlasmaGlucose': 104,
                        'DiastolicBloodPressure': 51,
                        'TricepsThickness': 7,
                        'SerumInsulin': 24,
                        'BMI': 27.36983156,
                        'DiabetesPedigree': 1.3504720469999998,
                        'Age': 43,
                },
            ],
        },
        "GlobalParameters":  {
        }
    }
    
    body = str.encode(json.dumps(data))
    
    
    headers = {'Content-Type':'application/json', 'Authorization':('Bearer '+ key)}
    
    req = urllib.request.Request(endpoint, body, headers)
    
    try:
        response = urllib.request.urlopen(req)
        result = response.read()
        json_result = json.loads(result)
        output = json_result["Results"]["WebServiceOutput0"][0]
        print('Patient: {}\nPrediction: {}\nProbability: {:.2f}'.format(output["PatientID"],
                                                                output["DiabetesPrediction"],
                                                                output["Probability"]))
    except urllib.error.HTTPError as error:
        print("The request failed with status code: " + str(error.code))
    
        # Print the headers to help debug
        print(error.info())
        print(json.loads(error.read().decode("utf8", 'ignore')))
    

     Note

    Don't worry too much about the details of the code. It just defines features for a patient, and uses the predict-diabetes service you created to predict a diabetes diagnosis.

  8. Switch to the browser tab containing the Consume page for the predict-diabetes service, and copy the REST endpoint for your service. Then switch back to the tab containing the notebook and paste the endpoint into the code, replacing YOUR_ENDPOINT.

  9. Switch to the browser tab containing the Consume page for the predict-diabetes service, and copy the Primary Key for your service. Then switch back to the tab containing the notebook and paste the key into the code, replacing YOUR_KEY.

  10. Save the notebook. Then use the ▷ button next to the cell to run the code.

  11. Verify that a predicted diabetes diagnosis is returned.
