ML Algorithm From Scratch

Python, scikit-learn, Pandas, NumPy, Jupyter Notebook

Description

This project implements three machine learning algorithms from scratch to predict student academic success. Each algorithm has a core implementation plus additional bonus features:

1. Decision Tree Learning (C4.5)

Core Implementation:

  • Uses gain ratio (information gain / split info) for feature selection
  • Handles both numeric and categorical features
  • Probability-based handling of missing values
  • Configurable tree depth and minimum samples per split and per leaf
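
The gain ratio criterion listed above can be illustrated with a minimal sketch for a single categorical feature. This is not the project's code: the function names entropy and gain_ratio are hypothetical, and numeric thresholds and missing-value weighting are left out for brevity.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum(
        (n / total) * math.log2(n / total) for n in Counter(labels).values()
    )


def gain_ratio(feature_values, labels):
    """Gain ratio of one categorical feature: information gain / split info."""
    total = len(labels)

    # Group the labels by the feature value they occur with.
    partitions = {}
    for value, label in zip(feature_values, labels):
        partitions.setdefault(value, []).append(label)

    # Information gain: entropy before the split minus weighted entropy after.
    weighted_entropy = sum(
        len(part) / total * entropy(part) for part in partitions.values()
    )
    info_gain = entropy(labels) - weighted_entropy

    # Split info: entropy of the partition sizes themselves.
    split_info = -sum(
        (len(part) / total) * math.log2(len(part) / total)
        for part in partitions.values()
    )

    # A feature with a single distinct value cannot split the data.
    return info_gain / split_info if split_info > 0 else 0.0
```

Dividing by split info penalizes features with many distinct values, which plain information gain tends to favor.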

Bonus Features:

  • Tree visualization using Graphviz with visualize_tree() method
  • Configurable top_n parameter to limit displayed tree depth for better readability
  • Multiple export formats supported (PNG, PDF, SVG)
  • Generated output available in outputs/dtl/tree_dtl_submission.png

2. Logistic Regression

Core Implementation:

  • Gradient ascent optimization
  • Binary and multiclass classification support
  • Sigmoid activation for binary, softmax for multiclass
  • L2 regularization support
  • Training history tracking (loss and parameters)
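
As a rough illustration of the binary case, the sketch below runs batch gradient ascent on the L2-penalized mean log-likelihood and records the loss and parameters at every step. It is a simplified pure-Python stand-in, not the project's implementation: the function name fit_binary_logistic and its parameters lr, n_iters, and l2 are assumptions, and the multiclass softmax path is omitted.

```python
import math


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def fit_binary_logistic(X, y, lr=0.1, n_iters=500, l2=0.0):
    """Fit weights and bias by gradient ascent; labels in y must be 0 or 1."""
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    loss_history, param_history = [], []
    eps = 1e-12  # guards against log(0)

    for _ in range(n_iters):
        # Predicted probability of class 1 for every sample.
        p = [sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b) for xi in X]

        # Record the loss (negative penalized log-likelihood) and parameters.
        log_lik = sum(
            yi * math.log(pi + eps) + (1 - yi) * math.log(1 - pi + eps)
            for yi, pi in zip(y, p)
        ) / n
        penalty = 0.5 * l2 * sum(wj * wj for wj in w)
        loss_history.append(-log_lik + penalty)
        param_history.append((list(w), b))

        # Gradient of the penalized mean log-likelihood (bias is not penalized).
        grad_w = [
            sum((yi - pi) * xi[j] for xi, yi, pi in zip(X, y, p)) / n - l2 * w[j]
            for j in range(d)
        ]
        grad_b = sum(yi - pi for yi, pi in zip(y, p)) / n

        # Ascent step: move along the gradient to increase the likelihood.
        w = [wj + lr * gj for wj, gj in zip(w, grad_w)]
        b += lr * grad_b

    return w, b, loss_history, param_history
```

Maximizing the log-likelihood by ascent is equivalent to minimizing the recorded loss by descent; only the sign convention differs.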

Bonus Features:

  • loss_history and param_history attributes record the loss and the model parameters during training
  • Infrastructure prepared for video generation showing loss function contour lines and parameter trajectory

3. Support Vector Machine (SVM)

Core Implementation:

  • Quadratic programming solver using CVXOPT
  • Support for linear and non-linear kernels (Linear, RBF)
  • One-vs-rest strategy for multiclass classification
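
To make the kernel options concrete, here is a small sketch of the two kernels and of the kernelized decision function that a solved dual problem yields. It deliberately skips the quadratic programming step: the function names, the gamma default, and the idea of passing alphas in by hand are illustrative assumptions, not the project's API.

```python
import math


def linear_kernel(x, z):
    """Plain dot product between two feature vectors."""
    return sum(a * b for a, b in zip(x, z))


def rbf_kernel(x, z, gamma=0.5):
    """Radial basis function kernel: exp(-gamma * ||x - z||^2)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-gamma * sq_dist)


def gram_matrix(X, kernel):
    """Pairwise kernel values K[i][j] = kernel(X[i], X[j]).

    The dual QP is built from this matrix, scaled by the label products.
    """
    return [[kernel(xi, xj) for xj in X] for xi in X]


def decision_function(x, support_vectors, alphas, labels, bias, kernel):
    """f(x) = sum_i alpha_i * y_i * K(sv_i, x) + bias; its sign is the class."""
    return bias + sum(
        alpha * y * kernel(sv, x)
        for sv, alpha, y in zip(support_vectors, alphas, labels)
    )
```

Only the kernel changes between the linear and RBF variants; the decision function keeps the same form.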

Bonus Features:

  • Fully implemented SVMVideo class for training process visualization showing decision boundary evolution, support vector identification, and margin visualization
  • Multiple multiclass strategies: One-vs-One (OvO) in addition to One-vs-All (OvA, also called one-vs-rest)
  • Two optimization backends: CVXOPT quadratic programming solver, with SciPy SLSQP optimization as an alternative
  • Both linear and RBF (Radial Basis Function) kernel support
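
The difference between the two multiclass strategies can be sketched as follows. The helper names and the class labels "A", "B", "C" are hypothetical, and the binary classifiers are assumed to be already trained, so only the combination step is shown.

```python
from collections import Counter
from itertools import combinations


def classifier_pairs(classes):
    """One-vs-One trains one binary classifier per unordered class pair."""
    return list(combinations(classes, 2))


def ova_predict(scores):
    """One-vs-All: pick the class whose 'class vs rest' score is largest.

    scores maps each class label to the decision value of its classifier.
    """
    return max(scores, key=scores.get)


def ovo_predict(pairwise_winners):
    """One-vs-One: majority vote over the winners of all pairwise classifiers."""
    return Counter(pairwise_winners).most_common(1)[0][0]
```

For k classes, OvA needs k binary classifiers while OvO needs k(k-1)/2, each trained on a smaller subset of the data.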

Dataset

The dataset contains demographic data, socio-economic factors, and academic performance information for students enrolled in various undergraduate programs. The goal is to predict the 'Target' class from the other provided features.

Key Features

  • From-scratch implementations: DTL, Logistic Regression, and SVM implemented without using scikit-learn's core algorithms
  • Scikit-learn compatibility: All models inherit from sklearn.base.BaseEstimator and sklearn.base.ClassifierMixin
  • Comprehensive preprocessing pipeline: Data cleaning, transformation, feature selection, and dimensionality reduction
  • Model persistence: Save and load models using pickle
  • Performance comparison: Compare from-scratch implementations with scikit-learn library implementations
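
Model persistence with pickle typically comes down to two small helpers like the ones below. This is a generic sketch: save_model and load_model are hypothetical names, and the project's own functions and file layout may differ.

```python
import pickle


def save_model(model, path):
    """Serialize a fitted model (any picklable object) to a file."""
    with open(path, "wb") as f:
        pickle.dump(model, f)


def load_model(path):
    """Load a model previously written by save_model.

    Only unpickle files from a trusted source: loading a pickle can
    execute arbitrary code.
    """
    with open(path, "rb") as f:
        return pickle.load(f)
```

A typical use would be save_model(clf, "model.pkl") after training and clf = load_model("model.pkl") before prediction, where clf and the file name are placeholders.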

Contributors

NIM       Name                          Contribution
13523035  M Rayhan Farukh               Support Vector Machine implementation
13523043  Najwa Kahani Fatima           Logistic Regression implementation
13523073  Alfian Hanif Fitria Yustanto  Exploratory Data Analysis, Data Cleaning
13523079  Nayla Zahira                  Preprocessing implementation
13523091  Carlo Angkisan                Decision Tree Learning (C4.5) implementation