ML Algorithm From Scratch
Python scikit-learn Pandas NumPy Jupyter Notebook
Description
This project implements machine learning algorithms from scratch to predict student academic success. The implementation includes three core algorithms with additional bonus features:
1. Decision Tree Learning (C4.5)
Core Implementation:
- Uses gain ratio (information gain / split info) for feature selection
- Handles both numeric and categorical features
- Probability-based approach for missing value handling
- Configurable tree depth, minimum samples for split/leaf
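
As an illustration of the gain ratio criterion listed above, here is a minimal sketch for a single categorical feature. The function names `entropy` and `gain_ratio` are hypothetical and not taken from the project code, and numeric thresholds and missing values are deliberately left out.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())


def gain_ratio(feature_values, labels):
    """Gain ratio of a categorical feature: information gain / split info."""
    total = len(labels)
    # Group the class labels by the value the feature takes.
    groups = {}
    for value, label in zip(feature_values, labels):
        groups.setdefault(value, []).append(label)
    # Information gain = parent entropy - size-weighted child entropy.
    weighted = sum(len(g) / total * entropy(g) for g in groups.values())
    info_gain = entropy(labels) - weighted
    # Split info = entropy of the partition sizes themselves.
    split_info = -sum(
        (len(g) / total) * math.log2(len(g) / total) for g in groups.values()
    )
    # A feature with a single value has zero split info; treat it as uninformative.
    return info_gain / split_info if split_info > 0 else 0.0
```

Dividing by split info penalizes features that fragment the data into many small branches, which plain information gain tends to favor.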
Bonus Features:
- Tree visualization using Graphviz with the `visualize_tree()` method
- Configurable `top_n` parameter to limit displayed tree depth for better readability
- Multiple export formats supported (PNG, PDF, SVG)
- Generated output available in `outputs/dtl/tree_dtl_submission.png`
2. Logistic Regression
Core Implementation:
- Gradient ascent optimization
- Binary and multiclass classification support
- Sigmoid activation for binary, softmax for multiclass
- L2 regularization support
- Training history tracking (loss and parameters)
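
The binary case of the points above can be sketched in plain Python as follows. This is an assumption-laden illustration, not the project's code: `fit_logistic` and its parameters `lr`, `n_iter`, and `l2` are hypothetical names, the multiclass softmax variant is omitted, and the loss recorded is the negative penalized log-likelihood.

```python
import math


def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def fit_logistic(X, y, lr=0.1, n_iter=500, l2=0.0):
    """Binary logistic regression by batch gradient ascent.

    Maximizes the mean log-likelihood minus (l2 / 2) * ||w||^2 and
    returns the weights, the bias, and the per-iteration loss history.
    """
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    loss_history = []
    eps = 1e-12  # guards log(0)
    for _ in range(n_iter):
        probs = [sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b) for xi in X]
        # Record the loss for the current parameters before updating them.
        nll = -sum(
            yi * math.log(pi + eps) + (1 - yi) * math.log(1 - pi + eps)
            for yi, pi in zip(y, probs)
        ) / n
        loss_history.append(nll + 0.5 * l2 * sum(wj * wj for wj in w))
        # Gradient of the objective: X^T (y - p) / n, minus the L2 term.
        grad_w = [
            sum((yi - pi) * xi[j] for xi, yi, pi in zip(X, y, probs)) / n - l2 * w[j]
            for j in range(d)
        ]
        grad_b = sum(yi - pi for yi, pi in zip(y, probs)) / n
        # Ascent step: move along the gradient, not against it.
        w = [wj + lr * gj for wj, gj in zip(w, grad_w)]
        b += lr * grad_b
    return w, b, loss_history
```

Gradient ascent on the log-likelihood is equivalent to gradient descent on the loss tracked here, so the recorded values should shrink as training proceeds.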
Bonus Features:
- Training history tracking implemented with `loss_history` and `param_history` to track model parameters during training
- Infrastructure prepared for video generation showing loss function contour lines and parameter trajectory
3. Support Vector Machine (SVM)
Core Implementation:
- Quadratic programming solver using CVXOPT
- Support for linear and non-linear kernels (Linear, RBF)
- One-vs-rest strategy for multiclass classification
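
To make the kernel options concrete, the sketch below computes the linear and RBF kernels and the quadratic term of the SVM dual, `P[i][j] = y_i * y_j * K(x_i, x_j)`, which is the matrix a quadratic programming solver would receive. The function names and the `gamma` default are assumptions made for illustration; how the project itself builds and passes this matrix to CVXOPT is not shown here.

```python
import math


def linear_kernel(a, b):
    """Dot product of two feature vectors."""
    return sum(ai * bi for ai, bi in zip(a, b))


def rbf_kernel(a, b, gamma=0.5):
    """Radial basis function kernel: exp(-gamma * ||a - b||^2)."""
    sq_dist = sum((ai - bi) ** 2 for ai, bi in zip(a, b))
    return math.exp(-gamma * sq_dist)


def dual_quadratic_term(X, y, kernel):
    """Matrix P with P[i][j] = y_i * y_j * K(x_i, x_j); labels must be +1 or -1."""
    return [
        [yi * yj * kernel(xi, xj) for xj, yj in zip(X, y)]
        for xi, yi in zip(X, y)
    ]
```

For a one-vs-rest setup, the labels of the class being separated would be mapped to +1 and all others to -1 before building this matrix, once per class.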
Bonus Features:
- Fully implemented `SVMVideo` class for training process visualization showing decision boundary evolution, support vector identification, and margin visualization
- Multiple multiclass strategies: One-vs-One (OvO) and One-vs-All (OvA)
- Two optimization backends: the CVXOPT quadratic programming solver, with SciPy's SLSQP optimizer as an alternative
- Both linear and RBF (Radial Basis Function) kernel support
Dataset
The dataset contains demographic data, socio-economic factors, and academic performance information of students enrolled in various undergraduate programs. The goal is to classify the 'Target' feature based on other provided features.
Key Features
- From-scratch implementations: DTL, Logistic Regression, and SVM implemented without using scikit-learn's core algorithms
- Scikit-learn compatibility: All models inherit from `sklearn.base.BaseEstimator` and `ClassifierMixin` for compatibility
- Comprehensive preprocessing pipeline: Data cleaning, transformation, feature selection, and dimensionality reduction
- Model persistence: Save and load models using pickle
- Performance comparison: Compare from-scratch implementations with scikit-learn library implementations
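
Model persistence with pickle can be sketched as below. The helpers `save_model` and `load_model` are hypothetical names, and a plain dict of parameters stands in for a fitted model object; the project's own save and load interface may differ.

```python
import os
import pickle
import tempfile


def save_model(model, path):
    """Serialize a model object to disk with pickle."""
    with open(path, "wb") as f:
        pickle.dump(model, f)


def load_model(path):
    """Load a model object previously written by save_model."""
    with open(path, "rb") as f:
        return pickle.load(f)


# Illustrative stand-in for a fitted model (values are made up).
params = {"weights": [0.4, -1.2], "bias": 0.1, "classes": ["A", "B"]}

with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = os.path.join(tmp_dir, "model.pkl")
    save_model(params, model_path)
    restored = load_model(model_path)
```

Unpickling executes code embedded in the file, so only load model files from sources you trust.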
Contributors
| NIM | Name | Contribution |
|---|---|---|
| 13523035 | M Rayhan Farukh | Support Vector Machine implementation |
| 13523043 | Najwa Kahani Fatima | Logistic Regression implementation |
| 13523073 | Alfian Hanif Fitria Yustanto | Exploratory Data Analysis, Data Cleaning |
| 13523079 | Nayla Zahira | Preprocessing implementation |
| 13523091 | Carlo Angkisan | Decision Tree Learning (C4.5) implementation |