# SEDiff-Gen Topic Modeling Framework

A diffusion-based topic modeling system that integrates semantic embeddings with generative diffusion models to discover coherent topics and generate topic-conditioned text from large unstructured datasets.
## Table of Contents
- About
- System Architecture
- Features
- Dataset
- Evaluation Metrics
- Results
- Future Scope
- Requirements
- Setup and Installation
- Usage
- Screenshots
- License
## About
The SEDiff-Gen Topic Modeling Framework performs unsupervised topic discovery and topic-conditioned text generation using diffusion models. Unlike traditional topic models such as LDA, this approach captures deep semantic relationships through sentence embeddings and iterative denoising.
## System Architecture
The system follows a modular, pipeline-based architecture:
- Dataset Loading
- Text Preprocessing
- Semantic Embedding Generation
- Diffusion Model Training
- Topic Clustering
- Topic Evaluation
- Topic-Conditioned Text Generation
This design ensures scalability, flexibility, and ease of extension.
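The stages above can be sketched as a chain of composable callables. This is an illustrative sketch only: `run_pipeline` and its `embed`, `train_diffusion`, and `cluster` parameters are hypothetical names, not the repository's actual API.

```python
import numpy as np

def run_pipeline(raw_texts, embed, train_diffusion, cluster):
    """Minimal sketch of the SEDiff-Gen pipeline: each stage is a
    plain callable, so individual steps can be swapped or extended."""
    # 1. Preprocess: lowercase and collapse whitespace
    docs = [" ".join(t.lower().split()) for t in raw_texts]
    # 2. Semantic embeddings (Sentence-BERT in the full system)
    emb = np.asarray(embed(docs))
    # 3. Train the diffusion model on the embedding space
    model = train_diffusion(emb)
    # 4. Cluster embeddings into topics (HDBSCAN in the full system)
    labels = cluster(emb)
    return docs, emb, model, labels
```

Because every stage is injected, each component can be unit-tested or replaced (e.g. a different encoder or clusterer) without touching the rest of the pipeline.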
Traditional topic modeling techniques:
- Depend on bag-of-words assumptions
- Fail to capture semantic meaning
- Struggle with large and complex datasets
This project addresses these limitations by combining:
- Sentence-BERT embeddings
- Diffusion-based generative learning
- Density-based clustering
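The embedding step could look like the hedged sketch below: `embed_corpus` is a hypothetical helper that accepts any encoder exposing an `encode(list[str])` method, such as a `sentence_transformers.SentenceTransformer` instance (the model name in the usage note is a common default, not necessarily the one this project uses).

```python
import numpy as np

def embed_corpus(texts, model):
    """Encode documents into dense semantic vectors.

    `model` is any object with an encode(list[str]) -> array method,
    e.g. sentence_transformers.SentenceTransformer("all-MiniLM-L6-v2").
    """
    emb = np.asarray(model.encode(texts), dtype=np.float64)
    # L2-normalize so cosine similarity reduces to a dot product,
    # which suits density-based clustering of the embedding space
    norms = np.linalg.norm(emb, axis=1, keepdims=True)
    return emb / np.clip(norms, 1e-12, None)
```

With sentence-transformers installed, this would be called as `embed_corpus(docs, SentenceTransformer("all-MiniLM-L6-v2"))`.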
## Features
- Diffusion-based topic modeling
- Semantic embeddings using Sentence-BERT
- Topic clustering with HDBSCAN
- Topic coherence and cluster quality evaluation
- Topic-conditioned text generation
- Model comparison (LDA, TF-IDF, SBERT, SEDiff-Gen)
- Topic cluster visualization
## Dataset
- BBC News Dataset
- Multi-domain news articles
- Preprocessed and cleaned for topic modeling
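Typical cleaning for topic modeling might look like the sketch below (lowercasing, stripping non-letters, dropping very short tokens); the repository's actual preprocessing may differ, and `preprocess` is a hypothetical helper.

```python
import re

def preprocess(text, min_len=3):
    """Basic cleaning for topic modeling: lowercase the text,
    keep only alphabetic tokens, and drop very short tokens."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [t for t in tokens if len(t) >= min_len]
```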
## Evaluation Metrics
- Topic Coherence Score
- Cluster Quality Score
- Accuracy
- Precision, Recall, F1-score
- Training vs Validation Loss
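Two of the metrics above can be made concrete. The sketch below uses the silhouette score as one plausible "Cluster Quality Score" and a UMass-style estimate for topic coherence; both function names, and the choice of these particular formulations, are assumptions rather than the repository's exact implementation.

```python
from itertools import combinations
from math import log

import numpy as np
from sklearn.metrics import silhouette_score

def cluster_quality(embeddings, labels):
    """Silhouette score over non-noise points; range [-1, 1],
    higher means tighter, better-separated topic clusters."""
    labels = np.asarray(labels)
    mask = labels != -1
    return silhouette_score(np.asarray(embeddings)[mask], labels[mask])

def umass_coherence(top_words, doc_tokens, eps=1.0):
    """UMass-style coherence: sum of log co-occurrence ratios over
    pairs of a topic's top words; values closer to 0 are better."""
    docs = [set(d) for d in doc_tokens]
    score = 0.0
    for w_i, w_j in combinations(top_words, 2):
        d_ij = sum(1 for d in docs if w_i in d and w_j in d)
        d_j = sum(1 for d in docs if w_j in d)
        score += log((d_ij + eps) / max(d_j, 1))
    return score
```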
## Results
- Improved topic coherence compared to LDA and TF-IDF
- Stable diffusion model convergence
- Minimal overfitting
- Meaningful topic-aligned text generation
## Future Scope
- Multilingual topic modeling
- Real-time topic discovery
- Domain-specific fine-tuning
- Hybrid transformer-diffusion models
- Integration with downstream NLP tasks
## Requirements
- Python 3.10+
- PyTorch
- Sentence-Transformers
- Scikit-learn
- HDBSCAN
- NumPy
- Pandas
- Matplotlib
## Setup and Installation
- Clone the repository:
```bash
git clone https://github.com/deepikagandla7456/sediffgen-topic-modeling.git
cd sediffgen-topic-modeling
```
- Install the required packages:
```bash
pip install -r requirements.txt
```

## Usage
To run the application locally, start it with:
```bash
python app.py
```
The repository also provides a `main.py` entry point:
```bash
python main.py
```

## Screenshots
- Training vs Validation Loss Across Epochs
- Global Topic Distribution in 2D Space
- Focused Topic Cluster Visualization (2D)
- Model Comparison Scores
- Model Performance Table
- Generated and Retrieved Text Samples by Topic
- Live Topic-Conditioned Text Generation Output
- Classification Performance Metrics
## License
This project is licensed under the MIT License - see the LICENSE file for details.