Machine Learning Studies

Overview

Background & Objective

These three Problem Based Learning projects explore complementary model architectures, each tackling a distinct problem. Two studies address the same tabular regression challenge in healthcare analytics: predicting medical insurance charges from demographic and health-related factors such as age, BMI, smoking status, and region. The first uses a Multi-Layer Perceptron (MLP); the second uses a Radial Basis Function (RBF) kernel model. Training both on identical data allows a direct architectural comparison.

The third study shifts to computer vision: a custom Convolutional Neural Network (CNN), trained from scratch, classifies coastal objects as shells or pebbles using a dataset of over 4,000 labelled images. The model reached a training accuracy of 86.14% and a validation accuracy of 75.47%, showing that it learns meaningful patterns from image data, although the gap between the two figures indicates some overfitting. Together, the three projects span supervised ML from tabular regression to image classification.

All models were trained, evaluated, and documented end-to-end: dataset preprocessing, architecture design, training runs, and performance analysis against standard metrics (MSE, R², accuracy, precision/recall). The common thread is understanding how different inductive biases (dense layers, kernel transformations, and spatial convolutions) shape what a model learns and how well it generalises.
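The metrics named above can be written out in a few lines. This is a minimal sketch in plain Python rather than the project's own evaluation code; the function names and the toy values are illustrative assumptions.

```python
def mse(y_true, y_pred):
    """Mean squared error: the average of the squared residuals."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


def precision_recall_f1(tp, fp, fn):
    """Per-class precision, recall and F1 from confusion-matrix counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Toy values for illustration only (not taken from the studies).
y_true = [1.0, 2.0, 3.0, 4.0]
y_pred = [1.1, 1.9, 3.2, 3.8]
toy_mse = mse(y_true, y_pred)
toy_r2 = r2_score(y_true, y_pred)
toy_prf = precision_recall_f1(tp=8, fp=2, fn=4)
```

R² is scale-free, whereas MSE depends on the units (or scaling) of the target, which matters when comparing models trained on differently normalised data.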

Year: 2024-2025
Type: Problem Based Learning (PBL)
Team: 6 members
Guide: Dr. Rashmi P. Shetty
Datasets: Kaggle · Medical Insurance & Coastal Images
Stack: Python · Scikit-learn · TensorFlow · Keras
Institute: NMAM Institute of Technology, NITTE

MLP Train R²: 86.7%

RBF Train R²: 86.8%

CNN Train Acc.: 86.1%

CNN Image Samples: 4,284

The Studies

Three architectures

Regression · Tabular

MLP Regressor

Medical Insurance Cost Prediction

A Multi-Layer Perceptron with three hidden layers (200→100→50 neurons, ReLU activation) trained to predict insurance charges from age, BMI, smoking status, children count, sex, and region. Categorical inputs were label-encoded and all features normalised with a Min-Max scaler. The network was trained with backpropagation for up to 2,000 iterations. Combining RBF feature mapping with the MLP helped capture complex non-linear patterns in the tabular data.
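The configuration described above can be sketched with scikit-learn as follows. This is an illustration under stated assumptions, not the project's code: the data is a synthetic stand-in for the Kaggle table, the 80/20 split and the scaling of the target are assumed, and the RBF feature mapping mentioned above is omitted.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPRegressor
from sklearn.preprocessing import LabelEncoder, MinMaxScaler

rng = np.random.default_rng(0)
n = 300

# Synthetic stand-in for the insurance table (NOT the Kaggle data).
age = rng.integers(18, 65, n)
bmi = rng.uniform(18, 40, n)
children = rng.integers(0, 5, n)
sex = rng.choice(["female", "male"], n)
smoker = rng.choice(["no", "yes"], n, p=[0.8, 0.2])
region = rng.choice(["northeast", "northwest", "southeast", "southwest"], n)
charges = 250 * age + 300 * bmi + 20000 * (smoker == "yes") + rng.normal(0, 2000, n)

# Label-encode the three categorical columns, then stack all six features.
encoded = [LabelEncoder().fit_transform(col) for col in (sex, smoker, region)]
X = np.column_stack([age, bmi, children, *encoded]).astype(float)
y = charges.reshape(-1, 1)

# Assumed 80/20 split; scalers are fitted on the training portion only.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)
x_scaler, y_scaler = MinMaxScaler(), MinMaxScaler()
X_train_s = x_scaler.fit_transform(X_train)
X_test_s = x_scaler.transform(X_test)
y_train_s = y_scaler.fit_transform(y_train).ravel()
y_test_s = y_scaler.transform(y_test).ravel()

# Three hidden layers (200 -> 100 -> 50), ReLU, at most 2,000 iterations.
mlp = MLPRegressor(
    hidden_layer_sizes=(200, 100, 50),
    activation="relu",
    max_iter=2000,
    random_state=0,
)
mlp.fit(X_train_s, y_train_s)

train_r2 = mlp.score(X_train_s, y_train_s)  # R² on the training split
test_r2 = mlp.score(X_test_s, y_test_s)     # R² on the held-out split
```

`MLPRegressor` uses an identity (linear) output activation, which matches the "Linear (out)" entry in the comparison table.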

Train R² 86.73%

Test R² 84.27%

Train MSE 0.00488

Test MSE 0.00622

Regression · Kernel

RBF Kernel Regressor

Medical Insurance Cost Prediction

A Radial Basis Function model implemented via Kernel Ridge Regression (α=0.001, γ=0.08), enhanced with K-Means feature engineering (45 clusters, chosen with the Elbow Method). The RBF kernel K(x, x') = exp(−γ‖x − x'‖²) implicitly maps inputs to a higher-dimensional space in which non-linear relationships can be fitted by a linear model. Features were normalised with StandardScaler and the data split 80/20 into training and test sets. Smoking status emerged as the dominant predictor (correlation 0.79 with charges).
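To show what the kernel and the ridge penalty do, here is a from-scratch NumPy sketch of kernel ridge regression using the same α and γ. It is not the project's scikit-learn code: the helper names and the synthetic data are assumptions, and the K-Means feature step is left out because the source does not say how the cluster features were constructed.

```python
import numpy as np


def rbf_kernel(A, B, gamma):
    """K[i, j] = exp(-gamma * ||A[i] - B[j]||^2) for all row pairs."""
    sq_dists = (
        np.sum(A**2, axis=1)[:, None]
        + np.sum(B**2, axis=1)[None, :]
        - 2.0 * A @ B.T
    )
    # Clip tiny negative values caused by floating-point round-off.
    return np.exp(-gamma * np.maximum(sq_dists, 0.0))


def fit_kernel_ridge(X, y, alpha, gamma):
    """Solve (K + alpha * I) c = y for the dual coefficients c."""
    K = rbf_kernel(X, X, gamma)
    return np.linalg.solve(K + alpha * np.eye(len(X)), y)


def predict_kernel_ridge(X_train, dual_coef, X_new, gamma):
    """Prediction is a kernel-weighted sum over the training points."""
    return rbf_kernel(X_new, X_train, gamma) @ dual_coef


# Synthetic, roughly standardised data (illustration only).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 6))
y = np.sin(X[:, 0]) + 0.5 * X[:, 1] ** 2 + rng.normal(scale=0.1, size=200)

dual_coef = fit_kernel_ridge(X, y, alpha=0.001, gamma=0.08)
y_hat = predict_kernel_ridge(X, dual_coef, X, gamma=0.08)
train_mse = float(np.mean((y - y_hat) ** 2))
```

A larger α shrinks the dual coefficients and smooths the fit; a larger γ makes each training point's influence decay faster with distance.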

Train R² 86.79%

Test R² 85.78%

Train MSE 0.1301

Test MSE 0.1506

Classification · Vision

CNN Classifier

Shells vs. Pebbles Image Classification

A custom three-layer CNN (Conv2D 32→64→128, 3×3 kernels, ReLU + MaxPooling) built from scratch in TensorFlow/Keras for binary image classification. It was trained on 4,284 images (150×150px) with an 80/20 split, using Dropout(0.5) for regularisation, the Adam optimiser, and binary cross-entropy loss over 10 epochs. The pebble class achieved F1=0.82 and the shell class F1=0.62, reflecting class imbalance (548 pebbles vs 308 shells in validation).
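The effect of the three Conv2D + MaxPooling blocks on tensor size can be worked out by hand. The sketch below does that arithmetic in plain Python, assuming details the description does not state: RGB input (3 channels), "valid" padding, stride 1, and 2×2 pooling (the Keras defaults).

```python
def conv_out(size, kernel=3, stride=1, padding=0):
    """Spatial size after a convolution ('valid' padding by default)."""
    return (size + 2 * padding - kernel) // stride + 1


def pool_out(size, pool=2):
    """Spatial size after max pooling with stride equal to the pool size."""
    return size // pool


def conv_params(in_channels, out_channels, kernel=3):
    """Trainable parameters: one kernel per in/out channel pair plus biases."""
    return kernel * kernel * in_channels * out_channels + out_channels


size, channels = 150, 3  # 150x150 input; 3 channels is an assumption
summary = []
for filters in (32, 64, 128):
    params = conv_params(channels, filters)
    size = pool_out(conv_out(size))
    channels = filters
    summary.append((filters, size, params))

# Number of values fed to the dense head after flattening.
flattened = size * size * channels

for filters, out_size, params in summary:
    print(f"Conv2D({filters}) + MaxPool -> {out_size}x{out_size}x{filters}, {params} params")
print("Flattened features:", flattened)
```

Under these assumptions most of the model's parameters would sit in the first dense layer after flattening, which is one reason Dropout is applied there.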

Train Accuracy 86.14%

Val Accuracy 75.47%

Pebbles F1 0.82

Shells F1 0.62

Cross-Study Insight

The two regression models reach near-identical training R² scores (86.73% and 86.79%) on the same insurance dataset, but with very different learning mechanisms. The RBF model generalises slightly better (test R² 85.78% vs 84.27%), suggesting kernel methods may be more robust to the non-linear smoker-charge relationship that dominates this dataset. The MSE figures are best not compared across the two models, since each was normalised with a different scaler; R² is the comparable metric. The CNN's gap of roughly 11 percentage points between training and validation accuracy signals room for improvement via data augmentation or transfer learning.

Comparison

Side-by-side breakdown

| Attribute | MLP Regressor | RBF Regressor | CNN Classifier |
| --- | --- | --- | --- |
| Task | Regression | Regression | Classification |
| Data Type | Tabular (6 features) | Tabular (6 features) | Images 150×150px |
| Dataset Size | 1,338 rows | 1,338 rows | 4,284 images |
| Normalisation | Min-Max Scaler | Standard Scaler | Rescale ÷ 255 |
| Architecture | 3 hidden layers, 200-100-50 | KRR + 45 K-Means clusters | 3× Conv2D + MaxPool + Dense |
| Activation | ReLU (hidden), Linear (out) | RBF Kernel (γ=0.08) | ReLU + Sigmoid (out) |
| Regularisation | Iteration limit (2,000) | Ridge α=0.001 | Dropout 0.5 |
| Train Score | R² 86.73% | R² 86.79% | Acc. 86.14% |
| Test / Val Score | R² 84.27% | R² 85.78% | Acc. 75.47% |
| Key Strength | Flexible depth | Better generalisation | Spatial feature learning |
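The three normalisation schemes in the table differ only in the reference points they use. A minimal plain-Python sketch follows; the function names and sample values are illustrative, and the standard deviation is the population form, as scikit-learn's StandardScaler uses.

```python
def min_max_scale(values):
    """Map values linearly onto [0, 1] using the observed min and max."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]


def standard_scale(values):
    """Centre on the mean and divide by the population standard deviation."""
    n = len(values)
    mean = sum(values) / n
    std = (sum((v - mean) ** 2 for v in values) / n) ** 0.5
    return [(v - mean) / std for v in values]


def rescale_pixels(pixels):
    """Map 8-bit pixel intensities (0-255) onto [0, 1]."""
    return [p / 255 for p in pixels]


ages = [18, 30, 42, 54, 66]  # illustrative values only
ages_min_max = min_max_scale(ages)
ages_standard = standard_scale(ages)
pixels_scaled = rescale_pixels([0, 128, 255])
```

Min-Max output is bounded by the training range, while standardised output has zero mean and unit variance, so an error metric such as MSE takes different magnitudes under each scheme.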

Tech Stack

Built with

Python · Scikit-learn · TensorFlow · Keras · NumPy · Pandas · Matplotlib · Seaborn · Kernel Ridge Regression · K-Means Clustering · Label Encoding · ImageDataGenerator

Feature Importance

In the insurance dataset used by both regression models, smoking status showed the highest correlation with charges (0.79), followed by age (0.30) and BMI (0.20). Region was nearly uncorrelated (-0.01), suggesting it could probably be dropped with little or no loss in model performance.
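The coefficients quoted above are presumably Pearson correlations, which can be computed as in the sketch below. The helper and the eight toy records are hypothetical and are not drawn from the dataset.

```python
def pearson(x, y):
    """Pearson correlation: covariance divided by the product of spreads."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = sum((a - mean_x) ** 2 for a in x) ** 0.5
    spread_y = sum((b - mean_y) ** 2 for b in y) ** 0.5
    return cov / (spread_x * spread_y)


# Hypothetical toy records: smoker flag (0/1) and annual charges.
smoker = [0, 0, 1, 0, 1, 0, 0, 1]
charges = [3200, 4100, 21000, 5200, 34000, 2900, 6100, 27500]
r_smoker = pearson(smoker, charges)
print(f"smoker vs charges: r = {r_smoker:.2f}")
```

Pearson correlation only captures linear association, so a near-zero value for a label-encoded categorical such as region does not by itself rule out a non-linear effect.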

RBF vs MLP Generalisation

The RBF model achieved a higher test R² (85.78%) than the MLP (84.27%), despite near-identical training scores. The kernel method's explicit ridge penalty, combined with the radial distance decay of the RBF kernel, likely produced a slightly smoother fitted function.

CNN Class Imbalance

The validation set contained 548 pebble images vs 308 shell images. This imbalance contributed to the lower shell F1-score (0.62 vs 0.82 for pebbles). Class weighting, weighted sampling, or augmentation of the shell class could help narrow this gap in future iterations.
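One common way to counter such an imbalance is inverse-frequency class weighting. The sketch below applies the usual "balanced" formula, total / (classes × count), to the validation counts quoted above; in practice the weights would be derived from the training split, whose counts are not given here, so this assumes a similar class ratio.

```python
def balanced_class_weights(counts):
    """Inverse-frequency weights: total / (number_of_classes * class_count)."""
    total = sum(counts.values())
    n_classes = len(counts)
    return {label: total / (n_classes * n) for label, n in counts.items()}


# Validation counts from the text; training counts are assumed proportional.
val_counts = {"pebble": 548, "shell": 308}
weights = balanced_class_weights(val_counts)

# Accuracy of always predicting the majority class, for reference.
majority_baseline = max(val_counts.values()) / sum(val_counts.values())

print({label: round(w, 3) for label, w in weights.items()})
print(f"Majority-class baseline accuracy: {majority_baseline:.1%}")
```

On these counts a model that always predicts "pebble" would already score about 64% accuracy, which is useful context for the 75.47% validation accuracy.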

New Data Validation (MLP)

Manual validation on 5 new insurance records (age 25-60, mixed smoking status) showed predictions within a 10% tolerance for all 5 cases. The highest deviation was 5.45%, for a non-smoking 35-year-old female, which supports the model's practical usability on this small sample.
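The tolerance check described above amounts to a relative-error comparison. A minimal sketch follows, with hypothetical helper names and hypothetical charge values chosen only to reproduce a 5.45% deviation; the actual records are not given in the source.

```python
def percent_deviation(actual, predicted):
    """Absolute deviation of the prediction as a percentage of the actual value."""
    return abs(predicted - actual) / abs(actual) * 100


def within_tolerance(actual, predicted, tolerance_pct=10.0):
    """True when the prediction lies within the given percentage tolerance."""
    return percent_deviation(actual, predicted) <= tolerance_pct


# Hypothetical example values (not from the project's validation records).
actual_charge = 10000.0
predicted_charge = 10545.0
deviation = percent_deviation(actual_charge, predicted_charge)
passed = within_tolerance(actual_charge, predicted_charge)
```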
