This repository contains a comprehensive research project on multi-view clustering using various feature extraction methods and the DBSCAN clustering algorithm. The research analyzes the effectiveness of different feature combinations for clustering on two benchmark image datasets: COIL-20 and Caltech-7.
- Primary Goal: Investigate the performance of multi-view clustering using different feature extraction methods
- Secondary Goal: Compare clustering algorithms (K-Means vs DBSCAN) and optimize parameters
- Research Question: Which combination of features provides the best clustering performance for image datasets?
- Source: Subset of Caltech-101 dataset
- Classes: 7 categories with varying image counts
- Total Images: 1,474 images
- Classes Breakdown:
- Faces: 435 images
- Motorbikes: 798 images
- Dollar-Bill: 52 images
- Garfield: 34 images
- Snoopy: 35 images
- Stop-Sign: 64 images
- Windsor-Chair: 56 images
- Source: Columbia Object Image Library
- Cars Subset: 144 images (objects 10 and 14)
- Full Dataset: 1,440 images (20 objects × 72 rotations each)
- Purpose: Validation and comparative analysis
Our research implements 6 different feature extraction techniques to create multiple views of the data:
Local Binary Patterns (LBP):
- Purpose: Texture analysis and local pattern recognition
- Parameters:
- Radius: 3
- Number of points: 24
- Method: 'uniform'
- Feature Dimension: 26 features
- Advantage: Robust to illumination changes, captures local texture patterns
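As a concrete illustration, the 26-dimensional descriptor above can be reproduced with scikit-image's `local_binary_pattern`: with P=24 points and the 'uniform' method, the code values span P+2 = 26 bins. The input image here is a random stand-in, not project data:

```python
import numpy as np
from skimage.feature import local_binary_pattern

def lbp_histogram(gray, P=24, R=3):
    """Uniform LBP with P=24, R=3; codes take P+2 = 26 distinct values."""
    codes = local_binary_pattern(gray, P=P, R=R, method="uniform")
    # Normalized 26-bin histogram of the LBP codes
    hist, _ = np.histogram(codes, bins=np.arange(P + 3), density=True)
    return hist

img = (np.random.rand(64, 64) * 255).astype(np.uint8)  # stand-in grayscale image
feat = lbp_histogram(img)
print(feat.shape)  # (26,)
```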
Histogram of Oriented Gradients (HOG):
- Purpose: Shape and edge detection
- Parameters:
- Pixels per cell: (8,8)
- Cells per block: (2,2)
- Orientations: 9
- Feature Dimension: 8,100 features
- Advantage: Captures shape information and spatial relationships
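For reference, the 8,100-dimensional count implies a 128×128 input: 16×16 cells of 8×8 pixels give 15×15 overlapping 2×2 blocks, each contributing 2×2×9 = 36 values. A sketch with scikit-image, assuming that input size (the image is a random placeholder):

```python
import numpy as np
from skimage.feature import hog

img = (np.random.rand(128, 128) * 255).astype(np.uint8)  # assumed 128x128 input
feat = hog(img, orientations=9, pixels_per_cell=(8, 8),
           cells_per_block=(2, 2), block_norm="L2-Hys", feature_vector=True)
print(feat.shape)  # (8100,)
```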
Wavelet Moments (WM):
- Purpose: Multi-resolution analysis
- Method: Haar wavelet decomposition
- Feature Dimension: 8 features (mean & std of each of the 4 wavelet sub-bands)
- Advantage: Captures both frequency and spatial information
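The 8-dimensional wavelet-moment vector (mean and standard deviation of each of the four single-level Haar sub-bands) can be sketched directly in NumPy; the exact normalization of the transform is an assumption:

```python
import numpy as np

def haar_moments(gray):
    """Mean & std of the 4 sub-bands of a single-level 2-D Haar transform."""
    # Crop to even dimensions, then sample the four pixels of each 2x2 block
    g = gray[: gray.shape[0] // 2 * 2, : gray.shape[1] // 2 * 2].astype(float)
    a, b = g[0::2, 0::2], g[0::2, 1::2]
    c, d = g[1::2, 0::2], g[1::2, 1::2]
    cA = (a + b + c + d) / 2.0            # approximation
    cH = (a + b - c - d) / 2.0            # horizontal detail
    cV = (a - b + c - d) / 2.0            # vertical detail
    cD = (a - b - c + d) / 2.0            # diagonal detail
    return np.array([m for band in (cA, cH, cV, cD)
                     for m in (band.mean(), band.std())])

feat = haar_moments(np.random.rand(64, 64))
print(feat.shape)  # (8,)
```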
CENTRIST:
- Purpose: Texture description using the census transform
- Implementation: LBP-based with 'uniform' method
- Feature Dimension: 26 features
- Advantage: Computationally efficient, good for texture classification
Gabor Filters:
- Purpose: Edge and texture detection at different orientations
- Feature Dimension: 48 features
- Advantage: Mimics human visual system, good for texture analysis
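The 48-dimensional orientation-selective descriptor above is consistent with a Gabor filter bank, e.g. 4 frequencies × 6 orientations with mean and std of each response magnitude; the specific bank below is an assumption, sketched with scikit-image:

```python
import numpy as np
from skimage.filters import gabor

def gabor_features(gray, freqs=(0.1, 0.2, 0.3, 0.4), n_orient=6):
    """Assumed bank: 4 frequencies x 6 orientations x (mean, std) = 48 features."""
    feats = []
    for f in freqs:
        for k in range(n_orient):
            # Complex Gabor response at this frequency/orientation
            real, imag = gabor(gray, frequency=f, theta=k * np.pi / n_orient)
            mag = np.hypot(real, imag)
            feats.extend([mag.mean(), mag.std()])
    return np.asarray(feats)

feat = gabor_features(np.random.rand(32, 32))
print(feat.shape)  # (48,)
```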
GIST:
- Purpose: Global scene descriptors
- Feature Dimension: 512 features
- Advantage: Captures global scene properties and spatial layout
K-Means:
- Usage: Initial clustering experiments and baseline comparison
- Parameters: n_clusters=2-5, random_state=42
- Evaluation: Silhouette scores, cluster visualization
- Limitations: Requires predefined number of clusters, sensitive to initialization
DBSCAN:
- Usage: Advanced clustering with automatic noise detection
- Parameter Tuning: Systematic eps optimization (0.1-2.0 range)
- Advantages:
- No need to specify number of clusters
- Handles noise and outliers effectively
- Discovers clusters of arbitrary shapes
- Evaluation: Silhouette scores, NMI, Rand Index
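The eps sweep described above might look like the following sketch; scoring by silhouette over non-noise points, and the grid step, are assumptions:

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.metrics import silhouette_score

def tune_eps(X, eps_grid=np.arange(0.1, 2.01, 0.1), min_samples=20):
    """Return (eps, score) maximizing the silhouette over non-noise points."""
    best_eps, best_score = None, -1.0
    for eps in eps_grid:
        labels = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(X)
        core = labels != -1                      # DBSCAN marks noise as -1
        if len(set(labels[core])) < 2:
            continue                             # silhouette needs >= 2 clusters
        score = silhouette_score(X[core], labels[core])
        if score > best_score:
            best_eps, best_score = eps, score
    return best_eps, best_score
```

A sweep of this kind, run per feature combination, is what produces the per-combination best-eps values reported in the results table.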
Silhouette Score:
- Purpose: Measures cluster cohesion and separation
- Range: -1 to 1 (higher is better)
- Interpretation: Measures how similar an object is to its own cluster vs other clusters
Normalized Mutual Information (NMI):
- Purpose: Measures cluster-label agreement
- Range: 0 to 1 (higher is better)
- Interpretation: Quantifies the mutual information between true and predicted clusters
Rand Index:
- Purpose: Overall clustering accuracy
- Range: 0 to 1 (higher is better)
- Interpretation: Fraction of pairwise same-cluster/different-cluster decisions that agree with the true labels
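All three metrics are available in scikit-learn; a minimal check on synthetic labels (note that `silhouette_score` needs the feature matrix, while NMI and the Rand index compare two labelings):

```python
import numpy as np
from sklearn.metrics import (silhouette_score,
                             normalized_mutual_info_score, rand_score)

rng = np.random.default_rng(0)
# Two well-separated synthetic clusters, for illustration only
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(6, 1, (50, 2))])
y_true = np.array([0] * 50 + [1] * 50)
y_pred = y_true.copy()          # a perfect clustering

print(silhouette_score(X, y_pred))                   # close to 1 when well separated
print(normalized_mutual_info_score(y_true, y_pred))  # 1.0
print(rand_score(y_true, y_pred))                    # 1.0
```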
- Image Loading: Load images from directory structure
- Preprocessing: Convert to grayscale, normalize pixel values
- Label Extraction: Extract true labels from directory structure
- Data Organization: Organize files for systematic processing
- Individual Feature Extraction: Extract each feature type separately
- Feature Standardization: Apply StandardScaler for normalization
- Dimensionality Reduction: Apply PCA to reduce to 2D for visualization
- Feature Combination: Create various combinations of features
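The standardize-concatenate-project steps above can be sketched as follows; the per-view matrices are random placeholders with the dimensions reported earlier (26 for LBP, 8 for WM, 26 for CENTRIST):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

n = 100                                   # hypothetical number of images
lbp = np.random.rand(n, 26)               # placeholder per-view features
wm = np.random.rand(n, 8)
centrist = np.random.rand(n, 26)

# Standardize each view separately, then concatenate into one multi-view matrix
views = [StandardScaler().fit_transform(v) for v in (lbp, wm, centrist)]
X = np.hstack(views)                      # 60-dimensional combined representation
X2d = PCA(n_components=2).fit_transform(X)  # 2-D projection for visualization
print(X.shape, X2d.shape)  # (100, 60) (100, 2)
```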
- Parameter Tuning: Systematic optimization of DBSCAN parameters
- Multiple Runs: Test different feature combinations
- Performance Evaluation: Compute multiple evaluation metrics
- Statistical Analysis: Compare results across different configurations
- 2D Visualization: PCA plots showing cluster distributions
- Performance Comparison: Tabular comparison of results
- Statistical Significance: Analysis of performance differences
- Result Interpretation: Draw conclusions from experimental data
- Feature Combination: LBP + WM + CENTRIST (60 features total)
- Clustering Method: DBSCAN (eps=1.1, min_samples=20)
- Performance Metrics:
- NMI: 0.320
- Rand Index: 0.720
- Silhouette Score: 0.454
- Clusters Discovered: 2 main clusters with 245 outliers
| Feature Combination | Best EPS | Clusters | NMI | Rand Index | Silhouette Score |
|---|---|---|---|---|---|
| LBP only | 0.7 | 2 | 0.299 | 0.708 | 0.412 |
| WM only | 0.2 | 3 | 0.282 | 0.665 | 0.096 |
| CENTRIST only | 0.7 | 2 | 0.299 | 0.708 | 0.412 |
| LBP + WM | 0.6 | 2 | 0.289 | 0.673 | 0.305 |
| WM + CENTRIST | 0.6 | 2 | 0.289 | 0.673 | 0.305 |
| LBP + CENTRIST | 1.0 | 2 | 0.302 | 0.710 | 0.415 |
| LBP + WM + CENTRIST | 1.1 | 2 | 0.320 | 0.720 | 0.454 |
Feature Complementarity:
- LBP and CENTRIST show similar individual performance
- WM features provide complementary information when combined
- Three-feature combination achieves optimal performance
Clustering Algorithm Performance:
- DBSCAN outperforms K-Means due to noise handling
- Parameter sensitivity is critical for DBSCAN performance
- eps parameter optimization is essential for good results
Dataset Characteristics:
- 2-3 clusters optimal for Caltech-7 dataset
- Significant number of outliers present (245 out of 1474)
- Feature standardization crucial for multi-view clustering
```
caltech/
├── README.md                                      # Project documentation
├── requirements.txt                               # Python dependencies
├── .gitignore                                     # Git ignore file
├── CALTECH7_DB-SCAN.pdf                           # Research results and analysis
│
├── notebooks/                                     # Jupyter notebooks
│   ├── CALTECH7LBP-checkpoint.ipynb               # LBP analysis on Caltech-7
│   ├── finalcal7-checkpoint.ipynb                 # Comprehensive Caltech-7 analysis
│   ├── DBSCAN-checkpoint.ipynb                    # DBSCAN parameter optimization
│   ├── car_coil20-checkpoint.ipynb                # COIL-20 car subset analysis
│   ├── COIL20_Image_Processing-checkpoint.ipynb   # Full COIL-20 processing
│   ├── combined_class_coik_correct-checkpoint.ipynb  # Combined features
│   ├── dbscan_car-checkpoint.ipynb                # DBSCAN on car subset
│   ├── hog_class_coil_correct-checkpoint.ipynb    # HOG feature analysis
│   ├── lbp_class_coil_correct-checkpoint.ipynb    # LBP feature analysis
│   ├── fileCALTECH7-checkpoint.ipynb              # Dataset organization
│   └── filesharing-checkpoint.ipynb               # File management utilities
│
├── data/                                          # Dataset information
│   ├── caltech7_info.md                           # Caltech-7 dataset details
│   └── coil20_info.md                             # COIL-20 dataset details
│
├── results/                                       # Results and analysis
│   └── performance_summary.csv                    # Performance metrics table
│
└── src/                                           # Source code modules (empty)
```
- Python 3.7+
- Jupyter Notebook
- Required Python packages (see requirements.txt)
- Clone the repository:
  ```bash
  git clone https://github.com/yourusername/caltech-multiview-clustering.git
  cd caltech-multiview-clustering
  ```
- Install dependencies:
  ```bash
  pip install -r requirements.txt
  ```
- Download datasets:
  - Caltech-7: place in `data/caltech7/`
  - COIL-20: place in `data/coil20/`

Run individual experiments:
- Open any notebook in the `notebooks/` directory
- Follow the execution order in each notebook

Reproduce results:
- Start with `finalcal7-checkpoint.ipynb` for the comprehensive analysis
- Use `DBSCAN-checkpoint.ipynb` for parameter optimization

Custom experiments:
- Modify feature extraction parameters in `src/feature_extraction.py`
- Adjust clustering parameters in `src/clustering.py`
- `finalcal7-checkpoint.ipynb`: Main research notebook with comprehensive analysis
- `CALTECH7LBP-checkpoint.ipynb`: LBP feature analysis on Caltech-7
- `DBSCAN-checkpoint.ipynb`: DBSCAN parameter optimization and analysis
- `hog_class_coil_correct-checkpoint.ipynb`: HOG feature extraction and clustering
- `lbp_class_coil_correct-checkpoint.ipynb`: LBP feature extraction and clustering
- `combined_class_coik_correct-checkpoint.ipynb`: Combined feature analysis
- `car_coil20-checkpoint.ipynb`: COIL-20 car subset analysis
- `COIL20_Image_Processing-checkpoint.ipynb`: Full COIL-20 dataset processing
- `fileCALTECH7-checkpoint.ipynb`: Dataset organization and preparation
- `filesharing-checkpoint.ipynb`: File management and organization
- Image Preprocessing: Grayscale conversion, normalization
- Feature Computation: Parallel extraction of multiple feature types
- Feature Standardization: Z-score normalization
- Dimensionality Reduction: PCA for visualization (2D)
- Parameter Optimization: Grid search for optimal eps values
- Clustering Execution: DBSCAN with optimized parameters
- Performance Evaluation: Multiple metrics computation
- Visualization: 2D PCA plots with cluster assignments
- Metric Computation: Silhouette score, NMI, Rand Index
- Statistical Analysis: Performance comparison across configurations
- Visualization: Cluster plots, performance charts
- Interpretation: Results analysis and insights
Multi-view clustering leverages multiple representations (views) of the same data to improve clustering performance. Each feature extraction method captures different aspects of the data:
- LBP: Local texture patterns
- HOG: Shape and edge information
- WM: Multi-resolution characteristics
- CENTRIST: Census transform features
- Gabor: Orientation-selective edge and texture responses
- GIST: Global scene layout
DBSCAN was chosen over K-Means because:
- No need to specify cluster count: Automatically discovers optimal number of clusters
- Noise handling: Identifies and handles outliers effectively
- Shape flexibility: Can find clusters of arbitrary shapes
- Caveat: Performance is sensitive to the eps parameter, which is why systematic eps tuning was performed
- Caltech-7: Provides diverse object categories with varying complexity
- COIL-20: Offers controlled rotation variations for validation
- Combined analysis: Allows for comprehensive evaluation across different data characteristics
- Systematic Evaluation: Comprehensive comparison of 6 different feature extraction methods
- Multi-View Analysis: Investigation of feature combination effectiveness
- Parameter Optimization: Systematic DBSCAN parameter tuning methodology
- Performance Benchmarking: Detailed evaluation using multiple metrics
- Reproducible Research: Complete code and data organization for reproducibility
- Additional Feature Types: Include more advanced features (CNN features, SIFT, etc.)
- Advanced Clustering: Test other clustering algorithms (Spectral Clustering, Gaussian Mixture Models)
- Multi-View Fusion: Implement advanced multi-view fusion techniques
- Larger Datasets: Extend analysis to larger image datasets
- Deep Learning Integration: Incorporate deep learning-based feature extraction
- Theoretical Analysis: Mathematical analysis of multi-view clustering performance
- Scalability: Optimization for large-scale datasets
- Real-world Applications: Apply to practical computer vision problems
- Comparative Studies: Compare with state-of-the-art multi-view clustering methods
For questions or collaboration opportunities:
- Email: kunalsali04@gmail.com
- GitHub: kunnaall04
Note: This research demonstrates multi-view clustering using various feature extraction methods and DBSCAN clustering on image datasets.