Introduction to Machine Learning

An overview of features, targets, and the steps involved in training Machine Learning models, along with model selection and setting up a development environment for effective implementation.


The concept of Machine Learning (ML) is illustrated through an example of predicting car prices. Data, including features such as year and mileage, is used by the ML model to learn and identify patterns. The target variable, in this case, is the car’s price.

New data, which lacks the target variable, is then provided to the model to predict the price.

In summary, ML involves extracting patterns from data, which is categorized into two types: features (what we know about an object) and the target (what we want to predict). New feature values are fed into the model, which generates predictions based on the patterns it has learned.

This post is an overview of what I learned from the ML course by Alexey Grigorev (ML Zoomcamp). All images in this post are sourced from the course material, as may be images in other posts.
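The car-price flow above can be sketched in a few lines of NumPy. The cars and prices below are made-up numbers for illustration, and the "model" is a simple least-squares linear fit standing in for whatever model you might actually train:

```python
import numpy as np

# Features: [year, mileage]; target: price. Invented numbers for illustration.
X_train = np.array([[2015, 60_000],
                    [2018, 30_000],
                    [2020, 15_000],
                    [2012, 90_000]], dtype=float)
y_train = np.array([8_000, 14_000, 19_000, 5_000], dtype=float)

# Training: extract the pattern (here, a linear relationship) from the data
A = np.hstack([X_train, np.ones((len(X_train), 1))])  # add intercept column
coef, *_ = np.linalg.lstsq(A, y_train, rcond=None)

# New data has features but no target; the model predicts the price
x_new = np.array([2017, 40_000, 1], dtype=float)
predicted_price = float(x_new @ coef)
```

The prediction for the unseen car falls between the prices of the training examples it resembles, which is exactly the "patterns in, predictions out" idea described above.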

What is Machine Learning?

Machine Learning (ML) is explained as the process of training a model using features and target information to predict unknown object targets. In other words, ML is about extracting patterns from data, which includes features and targets.

Key Terms in ML

  - Features: what we know about an object; the model's input.
  - Target: what we want to predict; known during training, unknown for new data.
  - Model: the outcome of training, encapsulating the learned patterns.

Training and Using a Model

Training combines features and targets to produce a model; using the model means feeding it new feature values to obtain predictions for the unknown targets.

What Did I Learn?

Part 1: What is Machine Learning

Definition

Machine Learning (ML) is a process where models are trained using data to predict outcomes. The main components involved in ML are the features, the target, and the model itself.

How ML Works

During training, the model is shown examples that contain both the features and the target, and it adjusts itself to capture the relationship between them; prediction then applies that learned relationship to new feature values.

Key Components

  - Features: attributes describing each object (e.g. a car's year and mileage).
  - Target: the value to predict (e.g. the car's price).
  - Model: the artifact produced by training, which maps features to a predicted target.

Part 2: Machine Learning vs Rule-Based Systems

Rule-Based Systems

In a rule-based system, experts hand-write explicit conditions that map inputs to outcomes, for example flagging an email as spam if it contains certain keywords. As the data changes, the rules must be maintained and extended by hand, which quickly becomes hard to manage.

Machine Learning Approach

With ML, instead of encoding rules, we collect examples labeled with the outcome (spam or not spam) and let the model learn the patterns that distinguish them. Adapting to new data means retraining on fresh examples rather than rewriting rules.

Comparison

A rule-based system takes data and hand-written rules and produces outcomes; ML takes data and known outcomes and produces a model, which can then predict outcomes for new data.
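To make the rule-based vs. ML contrast concrete, here is a hedged sketch using the spam example; the keyword list, example messages, and the word-counting "learner" are all invented for illustration:

```python
from collections import Counter

# Rule-based: hand-written conditions (invented keywords)
def is_spam_rules(subject: str) -> bool:
    spam_words = ["free money", "winner", "urgent offer"]
    return any(word in subject.lower() for word in spam_words)

# ML-style: learn which words matter from labeled examples
def train_word_counts(examples):
    """Count how often each word appears in spam vs. non-spam subjects."""
    counts = {True: Counter(), False: Counter()}
    for subject, label in examples:
        counts[label].update(subject.lower().split())
    return counts

def is_spam_learned(subject: str, counts) -> bool:
    """Label as spam if its words were seen more often in spam examples."""
    words = subject.lower().split()
    spam_score = sum(counts[True][w] for w in words)
    ham_score = sum(counts[False][w] for w in words)
    return spam_score > ham_score

examples = [("free money now", True), ("winner winner", True),
            ("meeting at noon", False), ("lunch tomorrow", False)]
counts = train_word_counts(examples)
```

The rule-based version breaks as soon as spammers change wording; the learned version only needs new labeled examples.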

Part 3: Supervised Machine Learning Overview

Definition

In Supervised Machine Learning (SML), models learn from labeled data, with a feature matrix X (one row per object, one column per feature) and a target vector y (one label per object). Training finds a function g such that g(X) ≈ y.

Types of SML Problems

  - Regression: the target is a number (e.g. a car's price).
  - Classification: the target is a category (e.g. spam vs. not spam), either binary or multiclass.
  - Ranking: the target is an ordering (e.g. ordering items in a recommender system).

Part 4: CRISP-DM — Cross-Industry Standard Process for Data Mining

Overview

CRISP-DM is an iterative process model for data mining, consisting of six phases:

  1. Business Understanding: Identify the problem and requirements.
  2. Data Understanding: Analyze available data.
  3. Data Preparation: Clean and format data for modeling.
  4. Modeling: Train various models and select the best.
  5. Evaluation: Assess model performance against business goals.
  6. Deployment: Implement the model in a production environment.

The process may require revisiting previous steps based on feedback and evaluation results.

Part 5: Model Selection Process

Overview

Steps:

  1. Split the Dataset: Divide into training (60%), validation (20%), and test (20%) sets.
  2. Train the Models: Use the training dataset for training.
  3. Evaluate the Models: Assess model performance on the validation dataset.
  4. Select the Best Model: Choose the model with the best validation performance.
  5. Apply the Best Model: Test on the unseen test dataset.
  6. Compare Performance Metrics: Ensure the model generalizes well by comparing validation and test performance.
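The 60/20/20 split in step 1 can be sketched with NumPy by shuffling an index array so the three sets are disjoint; the dataset size and seed below are made up:

```python
import numpy as np

n = 10                               # made-up dataset size
idx = np.arange(n)
rng = np.random.default_rng(seed=42) # fixed seed for reproducibility
rng.shuffle(idx)

n_train = int(0.6 * n)               # 60% for training
n_val = int(0.2 * n)                 # 20% for validation

train_idx = idx[:n_train]
val_idx = idx[n_train:n_train + n_val]
test_idx = idx[n_train + n_val:]     # remaining 20% for the final test
```

Shuffling before slicing matters: if the data is ordered (say, by date or price), unshuffled slices would give the three sets different distributions.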

Multiple Comparison Problem (MCP)

The multiple comparison problem arises because, when many models are compared on the same validation set, one of them may look best simply by chance. To mitigate MCP, the held-out test set verifies that the selected model truly performs well, rather than relying solely on validation results.
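A small simulation illustrates the selection effect: many "models" that are all just guessing on binary labels. The best one on validation looks better than chance, but a fresh evaluation on test data reveals the luck. All sizes and seeds here are invented:

```python
import numpy as np

rng = np.random.default_rng(seed=0)
n_models, n_val, n_test = 50, 30, 30

y_val = rng.integers(0, 2, n_val)    # true binary labels (validation)
y_test = rng.integers(0, 2, n_test)  # true binary labels (test)

# Each "model" predicts at random, so its true accuracy is 50%
val_acc = [(rng.integers(0, 2, n_val) == y_val).mean() for _ in range(n_models)]
best = int(np.argmax(val_acc))

# Re-evaluate a random guesser (standing in for the chosen model) on test data
test_acc = (rng.integers(0, 2, n_test) == y_test).mean()

print(f"best validation accuracy: {max(val_acc):.2f}")  # inflated by selection
print(f"its test accuracy:        {test_acc:.2f}")      # near chance level
```

Picking the maximum over 50 guessers reliably produces an impressive-looking validation score; the test set is what exposes it.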

Part 6: Setting Up the Environment

Requirements

To prepare your environment, you'll need Python 3 with the usual scientific stack (NumPy, Pandas, Matplotlib, Scikit-Learn), Jupyter notebooks, Git, and Docker.

For a comprehensive guide on configuring your environment on an AWS EC2 instance running Ubuntu 22.04, refer to this video.

Make sure to adjust the instructions to clone the relevant repository instead of the MLOps one. These instructions can also be adapted for setting up a local Ubuntu environment.

Note for WSL

Most instructions from the video are applicable to Windows Subsystem for Linux (WSL) as well. For Docker, simply install Docker Desktop on Windows; it will automatically be used in WSL, so there’s no need to install docker.io.

Anaconda and Conda

It is recommended to use Anaconda or Miniconda: Anaconda ships Python together with the most common data science packages preinstalled, while Miniconda is a minimal installer that includes only conda and Python, letting you install just the packages you need.

Make sure to follow the installation instructions provided on their respective websites to set up your environment correctly.

Part 7: NumPy: A Comprehensive Overview

NumPy is a highly regarded library in Python that serves as a cornerstone for numerical computing. Its primary strength lies in its ability to facilitate the creation and manipulation of multi-dimensional arrays, along with providing a rich set of mathematical functions. This makes it an indispensable tool for a wide range of applications, including data analysis, scientific computing, and machine learning.

Creating Arrays

One of the key features of NumPy is its flexibility in creating arrays. Users can generate NumPy arrays in various ways: from Python lists with np.array; prefilled with np.zeros, np.ones, or np.full; and as numeric ranges with np.arange and np.linspace.
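A quick tour of the common creation routines:

```python
import numpy as np

a = np.array([1, 2, 3])     # from a Python list
z = np.zeros(5)             # five zeros
o = np.ones((2, 3))         # 2x3 array of ones
f = np.full(4, 7.0)         # four elements, all 7.0
r = np.arange(0, 10, 2)     # even numbers 0 through 8
l = np.linspace(0, 1, 5)    # 5 evenly spaced values from 0 to 1
```

Note the difference between np.arange (start, stop, step; stop excluded) and np.linspace (start, stop, count; stop included).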

Element-wise Operations

One of the most powerful features of NumPy is its support for element-wise operations. This capability allows users to perform mathematical operations on arrays without the need for explicit loops, greatly enhancing efficiency. This includes arithmetic operations (+, -, *, /) applied element by element, as well as comparisons (>, <, ==) that produce boolean arrays, which can in turn be used to filter data.
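For example, arithmetic, comparison, and boolean filtering all operate on whole arrays at once:

```python
import numpy as np

x = np.array([1, 2, 3, 4])
y = np.array([10, 20, 30, 40])

s = x + y           # element-wise addition
p = x * 2           # multiply every element by a scalar
mask = x > 2        # boolean array: which elements exceed 2
filtered = x[mask]  # keep only those elements
```

Each line replaces what would otherwise be an explicit Python loop, and runs in optimized C under the hood.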

Conclusion

Machine Learning is a powerful tool for extracting patterns from data, enabling predictions for unseen data. Understanding the fundamentals of ML, including features, targets, and the model training process, is essential for successfully applying ML techniques. By leveraging the capabilities of libraries like NumPy, practitioners can enhance their data analysis and machine learning workflows.