
🧠 Comment Category Prediction Challenge

πŸ“Œ Multiclass Text Classification | End-to-End ML Project



πŸš€ Project Summary

This project focuses on classifying user-generated comments into multiple categories using a structured machine learning pipeline.

The workflow covers:

  • πŸ” Data understanding
  • 🧹 Preprocessing
  • πŸ“Š Exploratory analysis
  • 🧠 Feature engineering
  • πŸ€– Model training & evaluation

🎯 Objective

  • Accurately classify comments into predefined categories
  • Handle imbalanced class distribution
  • Build a model that generalizes well

🧠 Approach

πŸ”Ή Data Understanding & Cleaning

  • Removed irrelevant features
  • Handled missing values
  • Standardized dataset for modeling

πŸ”Ή Exploratory Data Analysis

  • πŸ“Š Univariate Analysis: feature distributions, outliers, and descriptive statistics
  • πŸ”— Bivariate & Multivariate Analysis: relationships between features and target variable
  • βš–οΈ Class Distribution Analysis: identification of class imbalance
  • πŸ“ Text Analysis: review of sample comments and linguistic patterns
  • ☁️ WordClouds: top words across entire dataset and across individual classes

πŸ“Š Visual: Class Distribution

(figure: class distribution across comment categories)


πŸ”Ή Feature Engineering

  • πŸ“ Text-based features: comment length, word count, average word length

  • ⏱️ Temporal features: extracted hour and month

    • Applied sine-cosine transformation to capture cyclic patterns
  • πŸ” Zero-inflated features

    • Identified features with excessive zeros
    • Converted into binary indicators to capture presence/absence
  • πŸ“‰ Handling skewness

    • Evaluated log transformation and Yeo-Johnson transformation
    • Yeo-Johnson selected as it reduced skewness more effectively
  • πŸ” Feature Selection & Comparison

    • Used Mutual Information (MI) for feature importance
    • Compared raw vs transformed features separately
    • Transformed features slightly outperformed and were retained
  • πŸ”— Final Feature Set

    • Transformed numerical features
    • Word-level TF-IDF
    • Character-level TF-IDF (char_wb)
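The feature-engineering steps above (cyclic encoding, zero-inflation indicators, Yeo-Johnson) can be sketched roughly as follows. This is a toy illustration: column names like `reply_count` are made up for the example and do not reflect the actual dataset schema.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import PowerTransformer

# Toy frame standing in for the engineered features (illustrative columns)
df = pd.DataFrame({
    "hour": [0, 6, 12, 18, 23],
    "reply_count": [0, 0, 3, 0, 7],           # zero-inflated example
    "comment_length": [12, 240, 35, 980, 60]  # skewed example
})

# Sine-cosine encoding so hour 23 and hour 0 end up numerically close
df["hour_sin"] = np.sin(2 * np.pi * df["hour"] / 24)
df["hour_cos"] = np.cos(2 * np.pi * df["hour"] / 24)

# Zero-inflated feature -> binary presence/absence indicator
df["has_replies"] = (df["reply_count"] > 0).astype(int)

# Yeo-Johnson to reduce skewness (handles zeros, unlike a plain log)
pt = PowerTransformer(method="yeo-johnson")
df["comment_length_yj"] = pt.fit_transform(df[["comment_length"]]).ravel()
```

`PowerTransformer` standardizes its output by default, so the transformed column also ends up zero-mean and unit-variance.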

πŸ”Ή Text Representation

  • Built TF-IDF features using both word-level and character-level (char_wb) analyzers

πŸ“Œ Word-Level TF-IDF

  • Tuned n-gram range, min_df, max_df, and sublinear TF scaling
  • Captures semantic and contextual word patterns

πŸ”¬ Character-Level TF-IDF (char_wb)

  • Tuned n-grams, min_df, and max_df

  • Captures subword patterns and improves robustness to noisy text

  • πŸ”§ Both vectorizers were independently tuned and optimized

  • πŸ“Š Final feature space consisted of ~125K TF-IDF features


πŸ”Ή Modeling & Tuning

Models were trained under a consistent pipeline and systematically tuned:

  • Logistic Regression β†’ tuned using RandomizedSearchCV (regularization, tolerance, class weights)
  • Linear SVM (LinearSVC) β†’ tuned for regularization and class weights
  • LightGBM Classifier β†’ manually tuned across multiple parameter combinations
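The Logistic Regression search could look roughly like this, using synthetic imbalanced data in place of the real features; the parameter ranges are illustrative, not the tuned values.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RandomizedSearchCV

# Synthetic 3-class, imbalanced stand-in for the real feature matrix
X, y = make_classification(n_samples=300, n_classes=3, n_informative=6,
                           weights=[0.6, 0.3, 0.1], random_state=0)

# Illustrative search space: regularization, tolerance, class weights
param_dist = {
    "C": np.logspace(-2, 2, 20),
    "tol": [1e-4, 1e-3],
    "class_weight": [None, "balanced"],
}

search = RandomizedSearchCV(
    LogisticRegression(max_iter=1000),
    param_distributions=param_dist,
    n_iter=10, scoring="f1_macro", cv=3, random_state=0,
)
search.fit(X, y)
```

Scoring with `f1_macro` keeps the tuning objective aligned with the competition metric.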

πŸ“Š Visual: Model Performance Comparison

(figure: performance comparison across the three models)


πŸ”Ž Evaluation Analysis

  • Evaluated model performance using classification report, confusion matrix, and precision-recall (PR) curves

  • πŸ“Š Classification Report

    • Analyzed precision, recall, and F1-score per class
    • Helped identify performance gaps, especially in minority classes
  • πŸ” Confusion Matrix

    • Provided a clear view of class-wise predictions and misclassifications
    • Useful for understanding where the model was confusing similar categories
  • πŸ“‰ Precision-Recall (PR) Curves

    • Focused on minority classes, where performance is harder to capture
    • Helped analyze the precision–recall trade-off under class imbalance
  • ⚠️ ROC Curve not used

    • Dropped due to large number of true negatives, which can make ROC curves overly optimistic in imbalanced settings
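A sketch of this evaluation workflow on toy labels; the per-class probability scores used for the PR curve are made up for illustration.

```python
import numpy as np
from sklearn.metrics import (classification_report, confusion_matrix,
                             precision_recall_curve)

y_true = np.array([0, 0, 1, 1, 1, 2, 2, 0, 1, 2])
y_pred = np.array([0, 1, 1, 1, 0, 2, 2, 0, 1, 1])

# Per-class precision/recall/F1 plus macro averages
report = classification_report(y_true, y_pred, output_dict=True)

# Rows = true class, columns = predicted class
cm = confusion_matrix(y_true, y_pred)

# PR curve for one minority class (one-vs-rest), using toy scores
y_true_bin = (y_true == 2).astype(int)
scores = np.array([0.1, 0.2, 0.1, 0.3, 0.2, 0.9, 0.8, 0.1, 0.2, 0.6])
precision, recall, thresholds = precision_recall_curve(y_true_bin, scores)
```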

🎯 Evaluation Metric

Macro F1 Score

  • Accounts for class imbalance
  • Treats all classes equally
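A small made-up example of why Macro F1 matters here: with a dominant majority class, micro-averaged F1 (i.e. accuracy) can look healthy while the macro average exposes missed minority classes.

```python
from sklearn.metrics import f1_score

y_true = [0, 0, 0, 0, 0, 0, 1, 1, 2]
y_pred = [0, 0, 0, 0, 0, 0, 1, 0, 0]

# Micro F1 looks fine because the majority class dominates...
micro = f1_score(y_true, y_pred, average="micro")

# ...but macro F1 averages per-class scores, so the entirely missed
# class 2 drags it down
macro = f1_score(y_true, y_pred, average="macro", zero_division=0)
```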

πŸ“ˆ Validation Score

Macro F1: 0.8350

πŸ† Submission Score (Public Leaderboard)

Macro F1: 0.8344

βš”οΈ Challenges & Observations

βš–οΈ Class Imbalance

  • Some categories had significantly fewer samples
  • Addressed by tuning class weights so that errors on minority classes are penalized more heavily
  • Improved minority-class recall at some cost in precision; the trade-off was tuned to maximize the Macro F1 score

πŸ“ Noisy Text Data

  • Informal language, inconsistencies, and variations
  • Improved robustness using character-level TF-IDF

πŸ“‰ Skewed Feature Distributions

  • Numerical features showed strong skewness
  • Evaluated both log transformation and Yeo-Johnson transformation
  • Yeo-Johnson provided better normalization and was selected

πŸ” Feature Selection

  • Used Mutual Information (MI) to evaluate feature importance
  • Compared raw vs transformed features
  • Transformed features slightly outperformed and were retained
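The MI comparison can be illustrated with scikit-learn's `mutual_info_classif` on synthetic features, one informative and one pure noise; the real comparison was run on the raw vs. transformed dataset columns.

```python
import numpy as np
from sklearn.feature_selection import mutual_info_classif

rng = np.random.default_rng(0)
y = rng.integers(0, 3, size=500)

# One feature correlated with the class, one pure noise
informative = y + rng.normal(0, 0.3, size=500)
noise = rng.normal(0, 1, size=500)
X = np.column_stack([informative, noise])

# Higher MI = feature carries more information about the target
mi = mutual_info_classif(X, y, random_state=0)
```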

βš™οΈ Model Trade-offs

  • Linear models performed strongly on high-dimensional sparse features
  • Extensive hyperparameter tuning was performed across models
  • Trade-off between capturing minority class patterns and risk of overfitting
  • A slight train/validation gap was accepted deliberately, since it improved the Macro F1 score
  • Final model generalized well, with consistent or slightly improved performance on leaderboard data

🧩 Key Insights

  • Feature engineering significantly improved model performance
  • Character-level TF-IDF helped capture minority class patterns more effectively
  • LightGBM outperformed linear models by capturing both linear and non-linear relationships
  • Focus was on balancing precision and recall, especially for minority classes, under the Macro F1 metric

πŸ› οΈ Tech Stack

  • 🐍 Core: Python
  • πŸ€– Machine Learning: Scikit-learn, LightGBM
  • πŸ“Š Data Handling: Pandas, NumPy
  • πŸ“‰ Visualization: Matplotlib, Seaborn, WordCloud
  • πŸ”€ Text Processing: Regex (re), TF-IDF
  • πŸ“ Statistical Transformations: SciPy (scipy.stats)

πŸ“‚ Repository Structure

β”œβ”€β”€ 23f2005144-comment-classification-notebook.ipynb # Complete workflow
β”œβ”€β”€ submission.csv # Final predictions
└── README.md # Documentation


🌟 Summary

A structured end-to-end machine learning pipeline for multiclass text classification, with emphasis on:

  • Data understanding and comprehensive exploratory analysis
  • Advanced feature engineering, including transformations and feature selection
  • Robust text representation using word-level and character-level TF-IDF
  • Systematic model tuning and comparison across linear and boosting models
  • Careful evaluation using class-wise metrics and precision-recall analysis

πŸ’‘ Notes for Reviewers

  • The notebook contains full experimentation, including feature comparisons and model tuning
  • Evaluation includes classification reports, confusion matrices, and PR curve analysis
  • Emphasis is placed on data-driven decision-making and trade-offs at each stage of the pipeline
