FSL3-EcoAcousticAlarmDetection is a few-shot learning model that classifies ecological audio recordings into three categories: alarm, non-alarm, and background.

The model first converts MP3 or WAV files into Mel spectrograms. For each episode, an episodic batch sampler splits the data into a support set (5 samples per class), a query set (6 samples per class), and a test set (30 samples per class), generating 100 training episodes in total. A CNN encoder with four convolutional blocks (trained with the Adam optimizer and cross-entropy loss) extracts spectrogram embeddings, followed by an adaptive pooling layer that compresses the frequency dimension into four representative bins while preserving the temporal resolution. The pooled sequence is then passed to an RNN module (a bidirectional GRU by default) to capture the temporal dynamics of bird calls. A multi-head attention mechanism with four attention heads highlights the most informative time steps, and the resulting context vector is passed through a linear projection and a learnable normalization layer that centers the feature embeddings during training.

For classification, the model uses a Prototypical Network: it computes class prototypes from the support set and compares them to query embeddings using cosine similarity. Temperature scaling with linear decay (10.0 -> 3.0) sharpens the softmax distribution, and the final classification is based on log-probabilities derived from the cosine similarities.
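The encoder pipeline described above (four conv blocks, frequency-only adaptive pooling, bidirectional GRU, four-head attention, projection, learnable normalization) can be sketched roughly as follows. This is a minimal PyTorch sketch, not the repository's actual code; the channel counts, hidden sizes, embedding dimension, and the mean-pooling step after attention are all assumptions.

```python
import torch
import torch.nn as nn


class FewShotEncoder(nn.Module):
    """Sketch of the FSL3 encoder: 4 conv blocks -> adaptive pool
    (4 frequency bins, time axis preserved) -> bidirectional GRU ->
    4-head attention -> linear projection + learnable normalization."""

    def __init__(self, n_mels=64, hidden=64, emb_dim=128):
        super().__init__()
        blocks, ch = [], 1
        for out_ch in (16, 32, 64, 64):  # four convolutional blocks
            blocks += [nn.Conv2d(ch, out_ch, 3, padding=1),
                       nn.BatchNorm2d(out_ch), nn.ReLU()]
            ch = out_ch
        self.cnn = nn.Sequential(*blocks)
        # Compress frequency to 4 bins; None keeps full temporal resolution.
        self.freq_pool = nn.AdaptiveAvgPool2d((4, None))
        self.rnn = nn.GRU(ch * 4, hidden, batch_first=True,
                          bidirectional=True)
        self.attn = nn.MultiheadAttention(2 * hidden, num_heads=4,
                                          batch_first=True)
        self.proj = nn.Linear(2 * hidden, emb_dim)
        self.norm = nn.LayerNorm(emb_dim)  # learnable centering/scaling

    def forward(self, spec):                     # spec: (B, 1, n_mels, T)
        x = self.freq_pool(self.cnn(spec))       # (B, C, 4, T)
        b, c, f, t = x.shape
        x = x.permute(0, 3, 1, 2).reshape(b, t, c * f)  # (B, T, C*4)
        x, _ = self.rnn(x)                       # (B, T, 2*hidden)
        x, _ = self.attn(x, x, x)                # weight informative steps
        # Collapse time into a single context vector (assumed: mean).
        return self.norm(self.proj(x.mean(dim=1)))      # (B, emb_dim)
```

With these assumed sizes, a batch of two 64-bin spectrograms of any length maps to a pair of 128-dimensional embeddings.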
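The Prototypical Network step (class prototypes from the support set, cosine similarity to queries, temperature-scaled log-probabilities) can be illustrated as below. The function name, tensor layout, and the exact placement of the temperature factor are assumptions for illustration.

```python
import torch
import torch.nn.functional as F


def proto_log_probs(support, support_labels, queries, n_classes,
                    temperature):
    """Prototypical-network scoring: average the support embeddings of
    each class into a prototype, compare queries to prototypes with
    cosine similarity, and return temperature-scaled log-probabilities."""
    # One prototype per class: mean of that class's support embeddings.
    protos = torch.stack([support[support_labels == c].mean(dim=0)
                          for c in range(n_classes)])        # (C, D)
    # Cosine similarity between every query and every prototype.
    sims = F.cosine_similarity(queries.unsqueeze(1),         # (Q, 1, D)
                               protos.unsqueeze(0), dim=-1)  # (Q, C)
    # Temperature scaling sharpens the distribution before log-softmax.
    return F.log_softmax(temperature * sims, dim=-1)
```

For a 3-way, 5-shot episode with 6 queries per class, `support` would be a (15, D) tensor and the result an (18, 3) matrix of log-probabilities.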
The model achieves 93% accuracy on a test set of 30 samples per class, evaluated over 100 episodes.
Compared to FSL1, this model uses cosine distance rather than Euclidean distance to compare query embeddings with class prototypes. It also maintains temporal structure by applying a pooling layer that compresses the frequency dimension into four representative bins while preserving the time axis, unlike FSL1, which flattens both dimensions. In addition, it incorporates a multi-head attention mechanism with four attention heads to emphasize informative time steps and employs temperature decay across episodes to sharpen the softmax distribution. Finally, it removes relation-based predictions and introduces a learnable normalization layer that centers the feature embeddings during training.
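The pooling difference from FSL1 is easiest to see in the tensor shapes. The snippet below is illustrative only; the batch size, channel count, and bin counts are assumptions, not values taken from either model's code.

```python
import torch
import torch.nn as nn

# A batch of CNN feature maps: (batch, channels, freq, time).
feat = torch.randn(8, 64, 40, 100)

# FSL1-style: pool both axes and flatten into one fixed-size vector,
# discarding the temporal ordering entirely.
flat = nn.AdaptiveAvgPool2d((4, 4))(feat).flatten(1)   # (8, 1024)

# FSL3-style: compress only the frequency axis to 4 bins and keep all
# 100 time steps, yielding a sequence an RNN can consume.
seq = nn.AdaptiveAvgPool2d((4, None))(feat)            # (8, 64, 4, 100)
seq = seq.permute(0, 3, 1, 2).flatten(2)               # (8, 100, 256)
```

The second form is what allows the downstream GRU and attention layers to operate over time steps at all.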
Compared to FSL2, this model replaces the simplified attention mechanism, a single linear layer with Tanh activation, with a multi-head attention module comprising four attention heads. It also adds a learnable normalization layer for better feature centering, removes relation-based predictions, and introduces temperature decay (10.0 -> 3.0), unlike FSL2, which scales the softmax with a fixed temperature of 10.0.
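The linear temperature decay that distinguishes this model from FSL2's fixed temperature can be sketched as follows. Whether the schedule is stepped per episode or per optimizer step is an assumption here; the endpoints (10.0 and 3.0 over 100 episodes) come from the description above.

```python
def temperature_at(episode, n_episodes=100, t_start=10.0, t_end=3.0):
    """Linearly interpolate the softmax temperature from t_start at the
    first episode to t_end at the last (assumed per-episode schedule)."""
    frac = episode / max(n_episodes - 1, 1)
    return t_start + frac * (t_end - t_start)
```

FSL2 would instead use `temperature = 10.0` for every episode; here the scale factor shrinks smoothly from 10.0 to 3.0 as training progresses.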