FSL3-EcoAcousticAlarmDetection is a few-shot learning model that classifies ecological audio recordings into three categories: alarm, non-alarm, and background.

The model first converts MP3 or WAV files into Mel spectrograms. For each episode, an episodic batch sampler splits the data into a support set (5 samples per class), a query set (6 samples per class), and a test set (30 samples per class), generating 100 training episodes in total. A CNN encoder with four convolutional blocks (trained with the Adam optimizer and cross-entropy loss) extracts spectrogram embeddings, followed by an adaptive pooling layer that compresses the frequency dimension into four representative bins while preserving the temporal resolution. The pooled sequence is then passed to an RNN module (a bidirectional GRU by default) to capture the temporal dynamics of bird calls. A multi-head attention mechanism with four attention heads highlights the most informative time steps, and the resulting context vector is passed through a linear projection and a learnable normalization layer that centers the feature embeddings during training.

For classification, the model uses a Prototypical Network: it computes class prototypes from the support set and compares them to query embeddings using cosine similarity. Temperature scaling with linear decay (10.0 -> 3.0) sharpens the softmax distribution, and the final classification is based on log-probabilities derived from the cosine similarities.
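The encoder pipeline described above (four conv blocks, frequency-only adaptive pooling, bidirectional GRU, four-head attention, projection, learnable normalization) can be sketched roughly as follows. This is a minimal PyTorch sketch, not the repository's actual code; the channel counts, hidden sizes, embedding dimension, and the mean-pooling step after attention are all assumptions.

```python
import torch
import torch.nn as nn


class FewShotEncoder(nn.Module):
    """Sketch of the FSL3 encoder: 4 conv blocks -> adaptive pool
    (4 frequency bins, time axis preserved) -> bidirectional GRU ->
    4-head attention -> linear projection + learnable normalization."""

    def __init__(self, n_mels=64, hidden=64, emb_dim=128):
        super().__init__()
        blocks, ch = [], 1
        for out_ch in (16, 32, 64, 64):  # four convolutional blocks
            blocks += [nn.Conv2d(ch, out_ch, 3, padding=1),
                       nn.BatchNorm2d(out_ch), nn.ReLU()]
            ch = out_ch
        self.cnn = nn.Sequential(*blocks)
        # Compress frequency to 4 bins; None keeps full temporal resolution.
        self.freq_pool = nn.AdaptiveAvgPool2d((4, None))
        self.rnn = nn.GRU(ch * 4, hidden, batch_first=True,
                          bidirectional=True)
        self.attn = nn.MultiheadAttention(2 * hidden, num_heads=4,
                                          batch_first=True)
        self.proj = nn.Linear(2 * hidden, emb_dim)
        self.norm = nn.LayerNorm(emb_dim)  # learnable centering/scaling

    def forward(self, spec):                     # spec: (B, 1, n_mels, T)
        x = self.freq_pool(self.cnn(spec))       # (B, C, 4, T)
        b, c, f, t = x.shape
        x = x.permute(0, 3, 1, 2).reshape(b, t, c * f)  # (B, T, C*4)
        x, _ = self.rnn(x)                       # (B, T, 2*hidden)
        x, _ = self.attn(x, x, x)                # weight informative steps
        # Collapse time into a single context vector (assumed: mean).
        return self.norm(self.proj(x.mean(dim=1)))      # (B, emb_dim)
```

With these assumed sizes, a batch of two 64-bin spectrograms of any length maps to a pair of 128-dimensional embeddings.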
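The Prototypical Network step (class prototypes from the support set, cosine similarity to queries, temperature-scaled log-probabilities) can be illustrated as below. The function name, tensor layout, and the exact placement of the temperature factor are assumptions for illustration.

```python
import torch
import torch.nn.functional as F


def proto_log_probs(support, support_labels, queries, n_classes,
                    temperature):
    """Prototypical-network scoring: average the support embeddings of
    each class into a prototype, compare queries to prototypes with
    cosine similarity, and return temperature-scaled log-probabilities."""
    # One prototype per class: mean of that class's support embeddings.
    protos = torch.stack([support[support_labels == c].mean(dim=0)
                          for c in range(n_classes)])        # (C, D)
    # Cosine similarity between every query and every prototype.
    sims = F.cosine_similarity(queries.unsqueeze(1),         # (Q, 1, D)
                               protos.unsqueeze(0), dim=-1)  # (Q, C)
    # Temperature scaling sharpens the distribution before log-softmax.
    return F.log_softmax(temperature * sims, dim=-1)
```

For a 3-way, 5-shot episode with 6 queries per class, `support` would be a (15, D) tensor and the result an (18, 3) matrix of log-probabilities.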
The model achieves 93% accuracy on a test set of 30 samples per class, evaluated over 100 episodes.
Compared to FSL1, this model uses cosine distance rather than Euclidean distance to compare query embeddings with class prototypes. It also maintains temporal structure by applying a pooling layer that compresses the frequency dimension into four representative bins while preserving the time axis, unlike FSL1, which flattens both dimensions. In addition, it incorporates a multi-head attention mechanism with four attention heads to emphasize informative time steps and employs temperature decay across episodes to sharpen the softmax distribution. Finally, it removes relation-based predictions and introduces a learnable normalization layer that centers the feature embeddings during training.
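The pooling difference from FSL1 is easiest to see in the tensor shapes. The snippet below is illustrative only; the batch size, channel count, and bin counts are assumptions, not values taken from either model's code.

```python
import torch
import torch.nn as nn

# A batch of CNN feature maps: (batch, channels, freq, time).
feat = torch.randn(8, 64, 40, 100)

# FSL1-style: pool both axes and flatten into one fixed-size vector,
# discarding the temporal ordering entirely.
flat = nn.AdaptiveAvgPool2d((4, 4))(feat).flatten(1)   # (8, 1024)

# FSL3-style: compress only the frequency axis to 4 bins and keep all
# 100 time steps, yielding a sequence an RNN can consume.
seq = nn.AdaptiveAvgPool2d((4, None))(feat)            # (8, 64, 4, 100)
seq = seq.permute(0, 3, 1, 2).flatten(2)               # (8, 100, 256)
```

The second form is what allows the downstream GRU and attention layers to operate over time steps at all.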
Compared to FSL2, this model replaces the simplified attention mechanism, a single linear layer with Tanh activation, with a multi-head attention module comprising four attention heads. It also adds a learnable normalization layer for better feature centering, removes relation-based predictions, and introduces temperature decay (10.0 -> 3.0), unlike FSL2, which scales the softmax with a fixed temperature of 10.0.
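The linear temperature decay that distinguishes this model from FSL2's fixed temperature can be sketched as follows. Whether the schedule is stepped per episode or per optimizer step is an assumption here; the endpoints (10.0 and 3.0 over 100 episodes) come from the description above.

```python
def temperature_at(episode, n_episodes=100, t_start=10.0, t_end=3.0):
    """Linearly interpolate the softmax temperature from t_start at the
    first episode to t_end at the last (assumed per-episode schedule)."""
    frac = episode / max(n_episodes - 1, 1)
    return t_start + frac * (t_end - t_start)
```

FSL2 would instead use `temperature = 10.0` for every episode; here the scale factor shrinks smoothly from 10.0 to 3.0 as training progresses.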