Supervised Learning with scikit-learn
Most of the things you need to cover in supervised learning, both theoretically and programmatically.
Machine Learning Key Concepts
Machine learning is the process whereby computers learn to make decisions from data without being explicitly programmed.
- Supervised learning is a machine learning approach where the model is trained on labeled data. The training data includes input-output pairs, where the output serves as a guide for the model to learn and make predictions. This approach is commonly used for tasks such as classification and regression.
- Unsupervised learning deals with unlabeled data, meaning the model has no predefined output labels. Instead, it aims to uncover hidden patterns, relationships, or structures within the data. Common applications of unsupervised learning include clustering, dimensionality reduction, and anomaly detection.
- Reinforcement learning is where an agent learns to make decisions by interacting with an environment. The agent receives feedback in the form of rewards or penalties based on the actions it takes, which helps it learn the optimal strategy or policy for achieving its goals.
Two Categories of Supervised Learning
- Classification is a way of predicting discrete outcomes or categories, such as determining whether an email is spam or not.
- Regression is a way of predicting continuous numeric values, such as estimating the price of a house based on its features.
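To make the contrast concrete, here is a minimal sketch of both tasks in scikit-learn; the toy numbers are made up purely for illustration:
# A classifier predicts discrete labels; a regressor predicts continuous values
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LinearRegression

X = [[1], [2], [3], [4]]  # one feature per sample (toy data)

clf = KNeighborsClassifier(n_neighbors=2)
clf.fit(X, [0, 0, 1, 1])      # discrete labels -> classification
print(clf.predict([[2.5]]))   # a class label, e.g. [0]

reg = LinearRegression()
reg.fit(X, [100.0, 200.0, 300.0, 400.0])  # continuous targets -> regression
print(reg.predict([[2.5]]))   # a numeric value, [250.]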
Classification
Classification can be divided into two main categories:
- Binary classification is where the outcome can take only two values, for example spam or not spam.
- Multiclass classification is where the outcome can take more than two values, for example a rating from 1 to 10, which could be 1, 4, 5, or any other value in that range.
scikit-learn Syntax
The code below is provided only to help you understand the syntax; it is not working code. As you can see, you need to import a model from the sklearn library, fit it with your labeled data, pass in the data you want to predict, and then display the predictions. It's very straightforward: all you need to do is find the appropriate model for your specific task. Additionally, you should evaluate the accuracy of the model and understand the concepts of underfitting and overfitting, which are covered at the end of this blog.
# Import Model from sklearn.module (in your case, find the right model)
# For example, you can use the KNeighborsClassifier model from sklearn.neighbors
from sklearn.module import Model
# Initialize the model
model = Model()
# Fit the labeled data using the fit method provided by the model
# X represents the features (predictor variables), i.e. the inputs
# y represents the target (dependent variable), i.e. the outcome
model.fit(X, y)
# X_new represents the records we want predictions for
X_new = [[1, 2],
         [4, 12]]
# Using the labeled data it saw in the fit method, the model predicts outputs for X_new
model.predict(X_new)
Using KNeighborsClassifier
The k-Nearest Neighbors (KNN) algorithm predicts the outcome for an unlabeled observation by looking at the labels of its nearest neighbors in the labeled training data.

# Import KNeighborsClassifier model
from sklearn.neighbors import KNeighborsClassifier
import pandas as pd
import numpy as np
# Example DataFrame (replace this with your actual data)
# One record looks like: account_length = 10 ; customer_service_calls = 1 ; churn = 0
# As you can see, churn is the dependent variable (outcome)
# 16 toy rows so that the later examples with up to 12 neighbors have enough training samples
data = {
    "account_length": [10, 20, 30, 40, 50, 60, 70, 80, 90, 100, 110, 120, 130, 140, 150, 160],
    "customer_service_calls": [1, 2, 3, 4, 5, 6, 1, 2, 3, 4, 5, 6, 1, 2, 3, 4],
    "churn": [0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1]  # 0: no churn, 1: churn
}
churn_df = pd.DataFrame(data)
# Define the target and features
y = churn_df["churn"].values
X = churn_df[["account_length", "customer_service_calls"]].values
# Create a KNN classifier with 6 neighbors
knn = KNeighborsClassifier(n_neighbors=6)
# Fit the classifier to the data
knn.fit(X, y)
# Unseen observations (records we want predictions for)
X_new = np.array([[12, 40], [1, 5]])
# Predict the labels for the X_new
y_pred = knn.predict(X_new)
# Print the predictions
print("Predictions: {}".format(y_pred))
Measuring model performance
In machine learning, accuracy is a measure we cannot ignore: a model is only as good as the predictions it makes on data it has not seen before.

To check model accuracy, we first split the dataset into two parts: training data and test data. We then fit the model on the training set and measure its accuracy on the test set.
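Under the hood, a classifier's score method is simply the share of correct predictions. Here is a minimal sketch of what it computes, assuming a fitted classifier knn and a held-out test set X_test, y_test (both are produced by the train_test_split call in the next code block):
import numpy as np
# Accuracy = number of correct predictions / total number of predictions
y_pred = knn.predict(X_test)
accuracy = np.mean(y_pred == y_test)  # same value as knn.score(X_test, y_test)
print("Accuracy: {}".format(accuracy))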
Model Complexity
The parameter k in the KNN algorithm determines the number of nearest neighbors used to classify a data point. It significantly affects the complexity and accuracy of the model.
A lower k makes the model more complex and sensitive to noise, which can lead to overfitting, while a higher k smooths the decision boundary and can lead to underfitting.
- Underfitting: the model is too simple to capture the underlying patterns in the data. It performs badly on the training data and even worse on the test data.
- Overfitting: the model is too complex and starts to memorize the training data. It performs well on the training data but fails on the test data.
Plotting accuracy to check model complexity
# Import required modules
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
import numpy as np
import matplotlib.pyplot as plt
# Separate features (X) and target (y)
X = churn_df.drop("churn", axis=1).values # Drop the target column 'churn' and keep only features
y = churn_df["churn"].values # Target variable
# Split the data into training and test sets
# test_size=0.2 means 20% of the data is for testing, and 80% is for training
# random_state ensures reproducibility of the split
# stratify=y ensures the split maintains the same proportion of target labels in both sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)
# Instantiate the KNN classifier with 5 neighbors
knn = KNeighborsClassifier(n_neighbors=5)
# Fit the KNN model to the training data
knn.fit(X_train, y_train)
# Calculate and print the accuracy of the model on the test data
# This evaluates the proportion of correct predictions on the test set
print(f"Test Accuracy with k=5: {knn.score(X_test, y_test)}")
# Create an array of neighbor values to test (1 through 12)
neighbors = np.arange(1, 13) # Neighbor values from 1 to 12
train_accuracies = {} # Dictionary to store training accuracies
test_accuracies = {} # Dictionary to store testing accuracies
# Loop through each value of neighbors
for neighbor in neighbors:
    # Instantiate a KNN model with the current number of neighbors
    knn = KNeighborsClassifier(n_neighbors=neighbor)
    # Fit the KNN model to the training data
    knn.fit(X_train, y_train)
    # Calculate and store the accuracy on the training data
    train_accuracies[neighbor] = knn.score(X_train, y_train)
    # Calculate and store the accuracy on the test data
    test_accuracies[neighbor] = knn.score(X_test, y_test)
# Print neighbors and corresponding accuracies for verification
print("Neighbors:", neighbors)
print("Training Accuracies:", train_accuracies)
print("Testing Accuracies:", test_accuracies)
# Add a title to the plot
plt.title("KNN: Varying Number of Neighbors")
# Plot training accuracies
plt.plot(list(train_accuracies.keys()), list(train_accuracies.values()), label="Training Accuracy", marker='o')
# Plot test accuracies
plt.plot(list(test_accuracies.keys()), list(test_accuracies.values()), label="Testing Accuracy", marker='o')
# Add legend to differentiate between training and testing accuracy
plt.legend()
# Add labels to the axes
plt.xlabel("Number of Neighbors")
plt.ylabel("Accuracy")
# Display the plot
plt.show()
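On the resulting plot, training accuracy is typically highest for small k and falls as k grows, while test accuracy rises to a peak and then drops off. A reasonable choice of k is where test accuracy peaks; using the test_accuracies dictionary built above, you can pick it programmatically:
# k with the highest test accuracy; ties resolve to the smallest such k
best_k = max(test_accuracies, key=test_accuracies.get)
print(f"Best k: {best_k} (test accuracy: {test_accuracies[best_k]})")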