Diffusion-Guided Semantic Embedding Model for Topic Modeling


A diffusion-based topic modeling system that integrates semantic embeddings with generative diffusion models to discover coherent topics and generate topic-conditioned text from large unstructured datasets.


About

The SEDiff-Gen Topic Modeling Framework performs unsupervised topic discovery and topic-conditioned text generation using diffusion models. Unlike traditional topic models such as LDA, this approach captures deep semantic relationships through sentence embeddings and iterative denoising.

System Architecture

The system follows a modular pipeline-based architecture:

  1. Dataset Loading
  2. Text Preprocessing
  3. Semantic Embedding Generation
  4. Diffusion Model Training
  5. Topic Clustering
  6. Topic Evaluation
  7. Topic-Conditioned Text Generation

This design ensures scalability, flexibility, and ease of extension.
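The seven stages above can be sketched as a simple pipeline driver. This is an illustrative sketch only: every function name below is a hypothetical stand-in, not the repository's actual API, and toy letter-count vectors stand in for Sentence-BERT embeddings.

```python
# Illustrative sketch of the seven-stage pipeline; all function names are
# hypothetical stand-ins, not the repository's actual modules.

def load_dataset():
    # 1. Dataset loading: stand-in for reading the BBC News corpus.
    return ["Stocks rally on strong earnings.", "The striker signs for a new club."]

def preprocess(texts):
    # 2. Text preprocessing: lowercase and normalise whitespace.
    return [" ".join(t.lower().split()) for t in texts]

def embed(texts):
    # 3. Semantic embeddings: toy letter-count vectors stand in for SBERT.
    return [[t.count(c) for c in "abcdefghijklmnopqrstuvwxyz"] for t in texts]

def cluster(vectors):
    # 5. Topic clustering: trivial placeholder assignment
    # (the real system uses density-based clustering).
    return [0] * len(vectors)

def run_pipeline():
    # Stages 4 (diffusion training), 6 (evaluation), and 7 (topic-conditioned
    # generation) are omitted here; they plug in between and after these calls.
    texts = preprocess(load_dataset())
    return texts, cluster(embed(texts))
```

Because each stage is a separate function over plain data, any stage can be swapped out (for example, a different embedding model) without touching the rest of the pipeline.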


Limitations


Traditional topic modeling techniques:

  • Depend on bag-of-words assumptions
  • Fail to capture semantic meaning
  • Struggle with large and complex datasets

This project addresses these limitations by combining:

  • Sentence-BERT embeddings
  • Diffusion-based generative learning
  • Density-based clustering

Features

  • Diffusion-based topic modeling
  • Semantic embeddings using Sentence-BERT
  • Topic clustering with HDBSCAN
  • Topic coherence and cluster quality evaluation
  • Topic-conditioned text generation
  • Model comparison (LDA, TF-IDF, SBERT, SEDiff-Gen)
  • Topic cluster visualization
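The diffusion component follows the standard denoising-diffusion formulation. This numpy sketch shows only the closed-form forward (noising) step, q(x_t | x_0) = sqrt(abar_t) * x_0 + sqrt(1 - abar_t) * eps, applied to an embedding vector; the schedule values are illustrative, not the repository's configuration.

```python
import numpy as np

rng = np.random.default_rng(42)

# Linear beta schedule over T steps (illustrative values, not the repo's config).
T = 1000
betas = np.linspace(1e-4, 0.02, T)
alpha_bar = np.cumprod(1.0 - betas)  # abar_t = prod_{s<=t} (1 - beta_s)

def q_sample(x0, t):
    """Closed-form forward process: x_t = sqrt(abar_t)*x0 + sqrt(1-abar_t)*eps."""
    eps = rng.standard_normal(x0.shape)
    return np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1.0 - alpha_bar[t]) * eps

x0 = rng.standard_normal(384)   # stand-in for one semantic embedding
x_noisy = q_sample(x0, t=T - 1) # at the final step the signal is nearly gone
```

Training the denoiser then amounts to sampling a random step t, noising an embedding with `q_sample`, and regressing the injected noise; the reverse of this process is what enables topic-conditioned generation.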

Dataset

  • BBC News Dataset
  • Multi-domain news articles
  • Preprocessed and cleaned for topic modeling

Evaluation Metrics

  • Topic Coherence Score
  • Cluster Quality Score
  • Accuracy
  • Precision, Recall, F1-score
  • Training vs Validation Loss
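Topic coherence can be computed several ways; below is a compact sketch of the UMass variant, which scores a topic's top words by their document co-occurrence. The toy corpus and word lists are illustrative, and this is not the evaluation code used in the repository (it assumes every scored word occurs in at least one document).

```python
import math

def umass_coherence(top_words, docs):
    """UMass coherence: sum over ranked word pairs of
    log((D(w_i, w_j) + 1) / D(w_j)), where D counts documents
    containing all the given words."""
    doc_sets = [set(d.split()) for d in docs]

    def d(*words):
        # Number of documents containing every word in `words`.
        return sum(all(w in s for w in words) for s in doc_sets)

    score = 0.0
    for i in range(1, len(top_words)):
        for j in range(i):
            score += math.log((d(top_words[i], top_words[j]) + 1) / d(top_words[j]))
    return score

docs = [
    "stocks market earnings bank",
    "market bank shares stocks",
    "goal match striker club",
]
coherent = umass_coherence(["stocks", "market", "bank"], docs)   # words co-occur
incoherent = umass_coherence(["stocks", "goal", "club"], docs)   # words do not
```

Words that frequently co-occur in documents yield a higher (less negative) score, which is why coherence is a useful proxy for whether a discovered topic reads as a single theme.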

Results

  • Improved topic coherence compared to LDA and TF-IDF
  • Stable convergence of the diffusion model during training
  • Minimal overfitting
  • Meaningful topic-aligned text generation

Future Scope

  • Multilingual topic modeling
  • Real-time topic discovery
  • Domain-specific fine-tuning
  • Hybrid transformer-diffusion models
  • Integration with downstream NLP tasks

Requirements

  • Python 3.10+
  • PyTorch
  • Sentence-Transformers
  • Scikit-learn
  • HDBSCAN
  • NumPy
  • Pandas
  • Matplotlib

Setup and Installation

  1. Clone the repository:
git clone https://github.com/deepikagandla7456/sediffgen-topic-modeling.git
cd sediffgen-topic-modeling
  2. Install the required packages:
pip install -r requirements.txt

Usage

  • Run the application locally with either of the following commands:
python app.py
python main.py

Screenshots

  • Training vs Validation Loss Across Epochs
  • Global Topic Distribution in 2D Space
  • Focused Topic Cluster Visualization (2D)
  • Model Comparison Scores
  • Model Performance Table
  • Generated and Retrieved Text Samples by Topic
  • Live Topic-Conditioned Text Generation Output
  • Classification Performance Metrics

License

This project is licensed under the MIT License - see the LICENSE file for details.
