Diffusion-Guided Semantic Embedding Model for Topic Modeling


A diffusion-based topic modeling system that integrates semantic embeddings with generative diffusion models to discover coherent topics and generate topic-conditioned text from large unstructured datasets.


About

The SEDiff-Gen Topic Modeling Framework performs unsupervised topic discovery and topic-conditioned text generation using diffusion models. Unlike traditional topic models such as LDA, this approach captures deep semantic relationships through sentence embeddings and iterative denoising.

System Architecture

The system follows a modular pipeline-based architecture:

  1. Dataset Loading
  2. Text Preprocessing
  3. Semantic Embedding Generation
  4. Diffusion Model Training
  5. Topic Clustering
  6. Topic Evaluation
  7. Topic-Conditioned Text Generation

This design ensures scalability, flexibility, and ease of extension.
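The seven stages above can be sketched as a simple pipeline driver. This is an illustrative sketch only: every function name below is a hypothetical stand-in, not the repository's actual API, and toy letter-count vectors stand in for Sentence-BERT embeddings.

```python
# Illustrative sketch of the seven-stage pipeline; all function names are
# hypothetical stand-ins, not the repository's actual modules.

def load_dataset():
    # 1. Dataset loading: stand-in for reading the BBC News corpus.
    return ["Stocks rally on strong earnings.", "The striker signs for a new club."]

def preprocess(texts):
    # 2. Text preprocessing: lowercase and normalise whitespace.
    return [" ".join(t.lower().split()) for t in texts]

def embed(texts):
    # 3. Semantic embeddings: toy letter-count vectors stand in for SBERT.
    return [[t.count(c) for c in "abcdefghijklmnopqrstuvwxyz"] for t in texts]

def cluster(vectors):
    # 5. Topic clustering: trivial placeholder assignment
    # (the real system uses density-based clustering).
    return [0] * len(vectors)

def run_pipeline():
    # Stages 4 (diffusion training), 6 (evaluation), and 7 (topic-conditioned
    # generation) are omitted here; they plug in between and after these calls.
    texts = preprocess(load_dataset())
    return texts, cluster(embed(texts))
```

Because each stage is a separate function over plain data, any stage can be swapped out (for example, a different embedding model) without touching the rest of the pipeline.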


Limitations


Traditional topic modeling techniques:

  • Depend on bag-of-words assumptions
  • Fail to capture semantic meaning
  • Struggle with large and complex datasets

This project addresses these limitations by combining:

  • Sentence-BERT embeddings
  • Diffusion-based generative learning
  • Density-based clustering

Features

  • Diffusion-based topic modeling
  • Semantic embeddings using Sentence-BERT
  • Topic clustering with HDBSCAN
  • Topic coherence and cluster quality evaluation
  • Topic-conditioned text generation
  • Model comparison (LDA, TF-IDF, SBERT, SEDiff-Gen)
  • Topic cluster visualization
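The diffusion component follows the standard denoising-diffusion formulation. This numpy sketch shows only the closed-form forward (noising) step, q(x_t | x_0) = sqrt(abar_t) * x_0 + sqrt(1 - abar_t) * eps, applied to an embedding vector; the schedule values are illustrative, not the repository's configuration.

```python
import numpy as np

rng = np.random.default_rng(42)

# Linear beta schedule over T steps (illustrative values, not the repo's config).
T = 1000
betas = np.linspace(1e-4, 0.02, T)
alpha_bar = np.cumprod(1.0 - betas)  # abar_t = prod_{s<=t} (1 - beta_s)

def q_sample(x0, t):
    """Closed-form forward process: x_t = sqrt(abar_t)*x0 + sqrt(1-abar_t)*eps."""
    eps = rng.standard_normal(x0.shape)
    return np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1.0 - alpha_bar[t]) * eps

x0 = rng.standard_normal(384)   # stand-in for one semantic embedding
x_noisy = q_sample(x0, t=T - 1) # at the final step the signal is nearly gone
```

Training the denoiser then amounts to sampling a random step t, noising an embedding with `q_sample`, and regressing the injected noise; the reverse of this process is what enables topic-conditioned generation.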

Dataset

  • BBC News Dataset
  • Multi-domain news articles
  • Preprocessed and cleaned for topic modeling

Evaluation Metrics

  • Topic Coherence Score
  • Cluster Quality Score
  • Accuracy
  • Precision, Recall, F1-score
  • Training vs Validation Loss
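Topic coherence can be computed several ways; below is a compact sketch of the UMass variant, which scores a topic's top words by their document co-occurrence. The toy corpus and word lists are illustrative, and this is not the evaluation code used in the repository (it assumes every scored word occurs in at least one document).

```python
import math

def umass_coherence(top_words, docs):
    """UMass coherence: sum over ranked word pairs of
    log((D(w_i, w_j) + 1) / D(w_j)), where D counts documents
    containing all the given words."""
    doc_sets = [set(d.split()) for d in docs]

    def d(*words):
        # Number of documents containing every word in `words`.
        return sum(all(w in s for w in words) for s in doc_sets)

    score = 0.0
    for i in range(1, len(top_words)):
        for j in range(i):
            score += math.log((d(top_words[i], top_words[j]) + 1) / d(top_words[j]))
    return score

docs = [
    "stocks market earnings bank",
    "market bank shares stocks",
    "goal match striker club",
]
coherent = umass_coherence(["stocks", "market", "bank"], docs)   # words co-occur
incoherent = umass_coherence(["stocks", "goal", "club"], docs)   # words do not
```

Words that frequently co-occur in documents yield a higher (less negative) score, which is why coherence is a useful proxy for whether a discovered topic reads as a single theme.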

Results

  • Improved topic coherence compared to LDA and TF-IDF
  • Stable convergence of the diffusion model during training
  • Minimal overfitting
  • Meaningful topic-aligned text generation

Future Scope

  • Multilingual topic modeling
  • Real-time topic discovery
  • Domain-specific fine-tuning
  • Hybrid transformer-diffusion models
  • Integration with downstream NLP tasks

Requirements

  • Python 3.10+
  • PyTorch
  • Sentence-Transformers
  • Scikit-learn
  • HDBSCAN
  • NumPy
  • Pandas
  • Matplotlib

Setup and Installation

  1. Clone the repository:
git clone https://github.com/deepikagandla7456/sediffgen-topic-modeling.git
cd sediffgen-topic-modeling
  2. Install the required packages:
pip install -r requirements.txt

Usage

  • Run the application locally with either of the following commands:
python app.py
python main.py

Screenshots

  • Training vs Validation Loss Across Epochs
  • Global Topic Distribution in 2D Space
  • Focused Topic Cluster Visualization (2D)
  • Model Comparison Scores
  • Model Performance Table
  • Generated and Retrieved Text Samples by Topic
  • Live Topic-Conditioned Text Generation Output
  • Classification Performance Metrics

License

This project is licensed under the MIT License - see the LICENSE file for details.
