🧠 Comment Category Prediction Challenge

📌 Multiclass Text Classification | End-to-End ML Project


🚀 Project Summary

This project focuses on classifying user-generated comments into multiple categories using a structured machine learning pipeline.

The workflow covers:

  • 🔍 Data understanding
  • 🧹 Preprocessing
  • 📊 Exploratory analysis
  • 🧠 Feature engineering
  • 🤖 Model training & evaluation

🎯 Objective

  • Accurately classify comments into predefined categories
  • Handle imbalanced class distribution
  • Build a model that generalizes well

🧠 Approach

🔹 Data Understanding & Cleaning

  • Removed irrelevant features
  • Handled missing values
  • Standardized dataset for modeling

🔹 Exploratory Data Analysis

  • 📊 Univariate Analysis: feature distributions, outliers, and descriptive statistics
  • 🔗 Bivariate & Multivariate Analysis: relationships between features and target variable
  • ⚖️ Class Distribution Analysis: identification of class imbalance
  • 📝 Text Analysis: review of sample comments and linguistic patterns
  • ☁️ WordClouds: top words across entire dataset and across individual classes

📊 Visual: Class Distribution

[Figure: Class Distribution]


🔹 Feature Engineering

  • 📝 Text-based features: comment length, word count, average word length

  • ⏱️ Temporal features: extracted hour and month

    • Applied sine-cosine transformation to capture cyclic patterns
  • 🔁 Zero-inflated features

    • Identified features with excessive zeros
    • Converted into binary indicators to capture presence/absence
  • 📉 Handling skewness

    • Evaluated log transformation and Yeo-Johnson transformation
    • Yeo-Johnson selected as it reduced skewness more effectively
  • 🔍 Feature Selection & Comparison

    • Used Mutual Information (MI) for feature importance
    • Compared raw vs transformed features separately
    • Transformed features slightly outperformed the raw features and were retained
  • 🔗 Final Feature Set

    • Transformed numerical features
    • Word-level TF-IDF
    • Character-level TF-IDF (char_wb)
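The hand-crafted features listed above can be sketched roughly as follows. This is a minimal illustration, not the notebook's code: the helper names (`text_stats`, `cyclic_encode`, `presence_flag`) and the sample values are hypothetical.

```python
import numpy as np


def text_stats(comment):
    """Return (character length, word count, average word length)."""
    words = comment.split()
    avg_word_len = sum(len(w) for w in words) / len(words) if words else 0.0
    return len(comment), len(words), avg_word_len


def cyclic_encode(values, period):
    """Project a periodic quantity onto the unit circle as (sin, cos)."""
    angle = 2.0 * np.pi * np.asarray(values, dtype=float) / period
    return np.sin(angle), np.cos(angle)


def presence_flag(values):
    """Binary indicator: 1 where a zero-inflated feature is non-zero."""
    return (np.asarray(values) != 0).astype(int)


# Hours run 0..23, so the period is 24; 23:00 lands next to 00:00.
hour_sin, hour_cos = cyclic_encode([0, 12, 23], period=24)

# Months run 1..12; shift to 0..11 before encoding with period 12.
month_sin, month_cos = cyclic_encode(np.array([1, 6, 12]) - 1, period=12)

# Hypothetical zero-inflated counts turned into presence/absence flags.
flags = presence_flag([0, 3, 0, 1])
```

The sine-cosine pair keeps neighbouring times close in feature space, which a raw integer hour or month cannot do at the wrap-around point.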

🔹 Text Representation

  • Built TF-IDF features using both word-level and character-level (char_wb) analyzers

📌 Word-Level TF-IDF

  • Tuned n-gram range, min_df, max_df, and sublinear TF scaling
  • Captures semantic and contextual word patterns

🔬 Character-Level TF-IDF (char_wb)

  • Tuned n-grams, min_df, and max_df

  • Captures subword patterns and improves robustness to noisy text

  • 🔧 Both vectorizers were independently tuned and optimized

  • 📊 Final feature space consisted of ~125K TF-IDF features
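A minimal sketch of combining word-level and `char_wb` TF-IDF with scikit-learn is shown below. The n-gram ranges, `min_df`, and `max_df` values and the sample comments are placeholders chosen for illustration; the tuned values from the notebook are not reproduced here.

```python
from scipy.sparse import hstack
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical sample comments, including noisy spellings.
comments = [
    "great video, thanks for sharing",
    "thx 4 sharing!!! gr8 vid",
    "this is spam click my link",
    "totally disagree with this take",
]

# Word-level TF-IDF with sublinear TF scaling (illustrative settings).
word_vec = TfidfVectorizer(
    analyzer="word", ngram_range=(1, 2), min_df=1, max_df=0.95, sublinear_tf=True
)

# Character n-grams built only inside word boundaries (char_wb).
char_vec = TfidfVectorizer(
    analyzer="char_wb", ngram_range=(2, 5), min_df=1, max_df=0.95
)

X_word = word_vec.fit_transform(comments)
X_char = char_vec.fit_transform(comments)

# Stack both sparse blocks column-wise into one feature matrix.
X_text = hstack([X_word, X_char]).tocsr()
print(X_text.shape)
```

Keeping the two vectorizers separate allows each one to be tuned on its own before the blocks are stacked.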


🔹 Modeling & Tuning

Models were trained under a consistent pipeline and systematically tuned:

  • Logistic Regression → tuned using RandomizedSearchCV (regularization, tolerance, class weights)
  • Linear SVM (LinearSVC) → tuned for regularization and class weights
  • LightGBM Classifier → manually tuned across multiple parameter combinations
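As an illustration of the Logistic Regression search, the sketch below runs `RandomizedSearchCV` over regularization strength, tolerance, and class weights, scored by Macro F1. The synthetic data, search ranges, and iteration count are assumptions made for the example; the LightGBM and LinearSVC tuning is not shown.

```python
from scipy.stats import loguniform
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RandomizedSearchCV, StratifiedKFold

# Synthetic imbalanced 3-class data standing in for the real features.
X, y = make_classification(
    n_samples=300,
    n_features=20,
    n_informative=6,
    n_classes=3,
    weights=[0.7, 0.2, 0.1],
    random_state=0,
)

# Illustrative search space: regularization, tolerance, class weights.
param_distributions = {
    "C": loguniform(1e-2, 1e2),
    "tol": [1e-4, 1e-3, 1e-2],
    "class_weight": [None, "balanced"],
}

search = RandomizedSearchCV(
    LogisticRegression(max_iter=2000),
    param_distributions=param_distributions,
    n_iter=5,
    scoring="f1_macro",  # same metric as the challenge
    cv=StratifiedKFold(n_splits=3, shuffle=True, random_state=0),
    random_state=0,
)
search.fit(X, y)
print(search.best_params_, round(search.best_score_, 4))
```

Stratified folds keep the minority classes represented in every split, which matters when the score is averaged per class.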

📊 Visual: Model Performance Comparison

[Figure: Model Comparison]


🔎 Evaluation Analysis

  • Evaluated model performance using classification report, confusion matrix, and precision-recall (PR) curves

  • 📊 Classification Report

    • Analyzed precision, recall, and F1-score per class
    • Helped identify performance gaps, especially in minority classes
  • 🔁 Confusion Matrix

    • Provided a clear view of class-wise predictions and misclassifications
    • Useful for understanding where the model was confusing similar categories
  • 📉 Precision-Recall (PR) Curves

    • Focused on minority classes, where performance is harder to capture
    • Helped analyze the precision–recall trade-off under class imbalance
  • ⚠️ ROC Curve not used

    • Omitted: under class imbalance, the large number of true negatives keeps the false positive rate low, which can make ROC curves look overly optimistic
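One way to obtain per-class PR curves in a multiclass setting is a one-vs-rest view of each class. The helper `per_class_pr` and the toy scores below are hypothetical; in practice the scores would come from `predict_proba` or `decision_function` on held-out data.

```python
import numpy as np
from sklearn.metrics import average_precision_score, precision_recall_curve
from sklearn.preprocessing import label_binarize


def per_class_pr(y_true, scores, classes):
    """One-vs-rest PR curve and average precision for each class."""
    y_bin = label_binarize(y_true, classes=classes)
    results = {}
    for i, cls in enumerate(classes):
        precision, recall, _ = precision_recall_curve(y_bin[:, i], scores[:, i])
        ap = average_precision_score(y_bin[:, i], scores[:, i])
        results[cls] = {"precision": precision, "recall": recall, "ap": ap}
    return results


# Toy example: class 2 is the minority; scores rank every sample correctly.
classes = [0, 1, 2]
y_true = np.array([0, 0, 0, 0, 1, 1, 2])
scores = np.eye(3)[y_true] * 0.8 + 0.1  # 0.9 for the true class, 0.1 otherwise

pr = per_class_pr(y_true, scores, classes)
for cls in classes:
    print(cls, round(pr[cls]["ap"], 3))
```

The `precision` and `recall` arrays can be passed directly to Matplotlib to draw one curve per class.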

🎯 Evaluation Metric

Macro F1 Score

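  • Unweighted mean of the per-class F1 scores, so every class counts equally regardless of its size
  • Keeps minority-class performance visible under class imbalance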
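To make the metric concrete, here is a small hand-rolled Macro F1 (the function name `macro_f1` and the toy labels are illustrative). A classifier that always predicts the majority class reaches 80% accuracy on this toy data, yet its Macro F1 is far lower because the missed minority class contributes an F1 of zero.

```python
import numpy as np


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over all classes seen in either array."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    classes = np.unique(np.concatenate([y_true, y_pred]))
    f1_scores = []
    for cls in classes:
        tp = np.sum((y_pred == cls) & (y_true == cls))
        fp = np.sum((y_pred == cls) & (y_true != cls))
        fn = np.sum((y_pred != cls) & (y_true == cls))
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class never appears.
        f1_scores.append(2 * tp / denom if denom else 0.0)
    return float(np.mean(f1_scores))


# 8 majority samples, 2 minority samples, model predicts the majority only.
y_true = [0] * 8 + [1] * 2
y_pred = [0] * 10

accuracy = float(np.mean(np.array(y_true) == np.array(y_pred)))
score = macro_f1(y_true, y_pred)
print(accuracy, round(score, 4))
```

For real evaluation, `sklearn.metrics.f1_score(y_true, y_pred, average="macro")` computes the same quantity.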

📈 Validation Score

Macro F1: 0.8350

🏆 Submission Score (Public Leaderboard)

Macro F1: 0.8344

⚔️ Challenges & Observations

⚖️ Class Imbalance

  • Some categories had significantly fewer samples
  • Addressed by tuning class weights so that errors on minority classes are penalized more heavily
  • Improved recall for minority classes at some cost in precision; the two were balanced to maximize the Macro F1 score
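As a rough sketch of what class weighting does, scikit-learn's `"balanced"` heuristic sets each weight to `n_samples / (n_classes * class_count)`, so rarer classes receive larger weights. The label counts below are made up, and the weights actually tuned in the notebook may differ from this starting point.

```python
import numpy as np
from sklearn.svm import LinearSVC
from sklearn.utils.class_weight import compute_class_weight

# Hypothetical imbalanced labels: 80 / 15 / 5 samples per class.
y = np.array([0] * 80 + [1] * 15 + [2] * 5)
classes = np.unique(y)

# "balanced" weights: n_samples / (n_classes * count of each class).
balanced = compute_class_weight(class_weight="balanced", classes=classes, y=y)
class_weight = {int(c): float(w) for c, w in zip(classes, balanced)}
print(class_weight)

# The dict can be passed to a linear model and adjusted during tuning.
clf = LinearSVC(class_weight=class_weight)
```

Raising a minority weight above its balanced value tends to push recall up and precision down for that class, which is the trade-off described above.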

📝 Noisy Text Data

  • Informal language, inconsistencies, and variations
  • Improved robustness using character-level TF-IDF

📉 Skewed Feature Distributions

  • Numerical features showed strong skewness
  • Evaluated both log transformation and Yeo-Johnson transformation
  • Yeo-Johnson provided better normalization and was selected
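The comparison between the two transformations can be sketched with `scipy.stats` on synthetic right-skewed data. The log-normal sample is an assumption for the example, and which transformation wins depends on the actual feature.

```python
import numpy as np
from scipy import stats

# Synthetic right-skewed feature standing in for a real numerical column.
rng = np.random.default_rng(0)
x = rng.lognormal(mean=0.0, sigma=1.0, size=2000)

# Candidate 1: log(1 + x), valid for non-negative values.
x_log = np.log1p(x)

# Candidate 2: Yeo-Johnson, with lambda estimated by maximum likelihood.
x_yj, lmbda = stats.yeojohnson(x)

skews = {
    "raw": float(stats.skew(x)),
    "log1p": float(stats.skew(x_log)),
    "yeo-johnson": float(stats.skew(x_yj)),
}
for name, value in skews.items():
    print(f"{name:12s} skew = {value:.3f}")
```

Inside a modeling pipeline, `sklearn.preprocessing.PowerTransformer(method="yeo-johnson")` fits lambda on the training split only and reuses it on validation and test data.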

🔍 Feature Selection

  • Used Mutual Information (MI) to evaluate feature importance
  • Compared raw vs transformed features
  • Transformed features slightly outperformed the raw features and were retained
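A minimal sketch of scoring features with Mutual Information using scikit-learn follows. The two synthetic columns are assumptions for the example: one determines the label, the other is pure noise.

```python
import numpy as np
from sklearn.feature_selection import mutual_info_classif

rng = np.random.default_rng(0)
n = 500

# Column 0 determines the label; column 1 is unrelated noise.
informative = rng.normal(size=n)
noise = rng.normal(size=n)
y = (informative > 0).astype(int)
X = np.column_stack([informative, noise])

# The estimator is nearest-neighbour based and randomized; fix the seed.
mi = mutual_info_classif(X, y, random_state=0)

# Rank features from most to least informative.
ranking = np.argsort(mi)[::-1]
print(mi.round(3), ranking)
```

Running the same call on the raw and on the transformed feature sets gives two score vectors that can be compared side by side.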

⚙️ Model Trade-offs

  • Linear models performed strongly on high-dimensional sparse features
  • Extensive hyperparameter tuning was performed across models
  • Trade-off between capturing minority class patterns and risk of overfitting
  • Slight overfitting (training vs. validation) was accepted deliberately, since it came with a higher Macro F1
  • Final model generalized well: the public leaderboard score (0.8344) closely matched the validation score (0.8350)

🧩 Key Insights

  • Feature engineering significantly improved model performance
  • Character-level TF-IDF helped capture minority class patterns more effectively
  • LightGBM outperformed linear models by capturing both linear and non-linear relationships
  • Focus was on balancing precision and recall, especially for minority classes, under the Macro F1 metric

🛠️ Tech Stack

  • 🐍 Core: Python
  • 🤖 Machine Learning: Scikit-learn, LightGBM
  • 📊 Data Handling: Pandas, NumPy
  • 📉 Visualization: Matplotlib, Seaborn, WordCloud
  • 🔤 Text Processing: Regex (re), TF-IDF
  • 📐 Statistical Transformations: SciPy (scipy.stats)

📂 Repository Structure

├── 23f2005144-comment-classification-notebook.ipynb # Complete workflow
├── submission.csv # Final predictions
└── README.md # Documentation


🌟 Summary

A structured end-to-end machine learning pipeline for multiclass text classification, with emphasis on:

  • Data understanding and comprehensive exploratory analysis
  • Advanced feature engineering, including transformations and feature selection
  • Robust text representation using word-level and character-level TF-IDF
  • Systematic model tuning and comparison across linear and boosting models
  • Careful evaluation using class-wise metrics and precision-recall analysis

💡 Notes for Reviewers

  • The notebook contains full experimentation, including feature comparisons and model tuning
  • Evaluation includes classification reports, confusion matrices, and PR curve analysis
  • Emphasis is placed on data-driven decision-making and trade-offs at each stage of the pipeline