A completely label-free approach to automatic audio editing. Instead of predicting edit labels or doing frame-to-frame mapping, this system learns what good audio sounds like from your edited examples, then detects "anomalies" (bad segments) in raw audio by reconstruction error.
The key realization: Edited audio is a subset of raw audio. Human editors SELECT good content and CUT bad content.
Instead of trying to classify "good vs bad" (which fails because good content exists in both raw and edited), we:
- Train an autoencoder only on edited (good) audio
- For raw audio, segments with high reconstruction error are anomalies (bad)
- Keep segments with low error, cut segments with high error
The autoencoder only sees "good" audio during training. When presented with:
- Good segments: Model recognizes patterns, low reconstruction error
- Bad segments: Model hasn't seen these patterns, high reconstruction error
This is classic anomaly detection applied to audio editing.
Attempt 1: FaceSwap-style dual autoencoder
- Trained shared encoder + separate decoders for raw/edited
- Failed: All segments scored similarly (0.81-0.89)
- Why: Raw and edited audio are too similar (same melody, key, tempo)

Attempt 2: Binary classifier
- Trained classifier to distinguish edited from raw
- Failed: Classifier learned "music-like" features, not editing patterns
- Why: Good content exists in both datasets, so the classifier can't discriminate

Attempt 3: Anomaly detector (this approach)
- Train autoencoder only on edited audio
- Score by reconstruction error
- Success: Clear separation between good and bad segments
Training:
                    ┌─────────────────┐
Edited Audio ─────► │     Encoder     │ ────► Latent (z)
(training only)     │  (conv + pool)  │
                    └────────┬────────┘
                             │
                             ▼
                    ┌─────────────────┐
                    │     Decoder     │ ────► Reconstruction
                    │ (conv transpose)│
                    └─────────────────┘

Inference:
                    ┌─────────────────┐
Raw Segment ──────► │   Autoencoder   │ ────► Reconstruction
                    └────────┬────────┘
                             │
                             ▼
                    ┌─────────────────┐
                    │  MSE(input,     │ ────► Error Score
                    │  reconstruction)│       (high = anomaly)
                    └─────────────────┘
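Below is a minimal PyTorch sketch of the architecture in the diagram, assuming mel-spectrogram segments of shape (batch, 1, n_mels, segment_frames). The actual SimpleAutoencoder in trainers/anomaly_trainer.py may differ in layer details; only the hidden_dims / latent_dim / segment_frames values are taken from the configuration described below.

```python
import torch
import torch.nn as nn

class ConvAutoencoder(nn.Module):
    """Sketch: conv + pool encoder -> latent z -> conv-transpose decoder."""

    def __init__(self, n_mels=80, segment_frames=64,
                 hidden_dims=(32, 64, 128), latent_dim=64):
        super().__init__()
        enc, ch = [], 1
        for h in hidden_dims:                      # each block halves H and W
            enc += [nn.Conv2d(ch, h, 3, stride=2, padding=1), nn.ReLU()]
            ch = h
        self.encoder = nn.Sequential(*enc)

        self.ch = ch
        self.h = n_mels // 2 ** len(hidden_dims)   # spatial size at the bottleneck
        self.w = segment_frames // 2 ** len(hidden_dims)
        self.to_latent = nn.Linear(ch * self.h * self.w, latent_dim)
        self.from_latent = nn.Linear(latent_dim, ch * self.h * self.w)

        dec, dims = [], list(hidden_dims)[::-1]    # mirror the encoder
        for i, h in enumerate(dims):
            out = dims[i + 1] if i + 1 < len(dims) else 1
            dec.append(nn.ConvTranspose2d(h, out, 4, stride=2, padding=1))
            if out != 1:
                dec.append(nn.ReLU())
        self.decoder = nn.Sequential(*dec)

    def forward(self, x):
        # x: (batch, 1, n_mels, segment_frames) mel-spectrogram segment
        feats = self.encoder(x)
        z = self.to_latent(feats.flatten(1))
        feats = self.from_latent(z).view(-1, self.ch, self.h, self.w)
        return {'reconstruction': self.decoder(feats), 'latent': z}
```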
- Paired raw + edited audio files (but only edited files are used for training)
- Same as before: training_data/input/*.wav and training_data/desired_output/*.wav
- No labels needed - just the edited audio
- Extract random segments from edited audio only
- Train autoencoder to reconstruct these segments
- Model learns the distribution of "good" audio
python _train_anomaly.py

Key configuration:
- segment_frames: Length of segments (64 = ~0.75 seconds)
- hidden_dims: Encoder/decoder capacity ([32, 64, 128])
- latent_dim: Bottleneck size (64)
- epochs: Training iterations (100)
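For orientation, the training step boils down to a plain reconstruction loop over segments drawn only from the edited audio. This is a sketch, not the actual _train_anomaly.py; it assumes a model with the dict-style forward shown above and a DataLoader yielding (batch, 1, n_mels, segment_frames) tensors.

```python
import torch
import torch.nn.functional as F

def train_on_edited(model, edited_loader, epochs=100, lr=1e-3):
    """Fit the autoencoder on edited (good) segments only."""
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    for epoch in range(epochs):
        total = 0.0
        for batch in edited_loader:                 # segments from edited audio only
            recon = model(batch)['reconstruction']
            loss = F.mse_loss(recon, batch)         # learn to reproduce "good" audio
            opt.zero_grad()
            loss.backward()
            opt.step()
            total += loss.item()
        print(f"epoch {epoch}: recon loss {total / len(edited_loader):.4f}")
```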
For each segment of raw audio:
- Feed through autoencoder
- Compute MSE between input and reconstruction
- High error = anomaly (bad segment)
# Using the anomaly detector
from audio_slicer.trainers.anomaly_trainer import SimpleAutoencoder
model = SimpleAutoencoder.from_checkpoint('models/anomaly_detector/best.pt')
# Score a raw segment (a spectrogram tensor produced by the segment extraction code)
error = ((model(segment)['reconstruction'] - segment) ** 2).mean()
# Low error = good segment, high error = bad segment

- Use a percentile-based threshold (e.g., 40th percentile of errors); see the sketch below
- Segments below threshold are kept
- Adjust based on desired compression ratio
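A minimal sketch of that thresholding step (the real selection logic lives in inference/slicer.py and may differ), where errors holds one reconstruction error per raw segment:

```python
import numpy as np

def select_segments(errors, keep_percentile=40):
    """Keep segments whose error is below the given percentile of all errors.
    Lower keep_percentile cuts more aggressively; raise it to keep more audio."""
    errors = np.asarray(errors)
    threshold = np.percentile(errors, keep_percentile)
    return errors <= threshold, threshold           # boolean keep-mask, threshold used
```

keep_percentile is the single knob that controls the resulting compression ratio.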
| Approach | Labels | Frame Mapping | Can Cut | Works |
|---|---|---|---|---|
| super_editor (Phase 1) | Yes | 1:1 | Via DSP | Limited |
| super_editor (Phase 2) | Predicted | 1:1 | Via labels | Collapsed |
| mel_to_mel | No | 1:1 | No | Can't cut |
| FaceSwap-style | No | No | Yes | Failed (no discrimination) |
| Binary classifier | No | No | Yes | Failed (all high scores) |
| Anomaly detector | No | No | Yes | Works! |
- Add a contrastive term: edited segments should cluster together in latent space.
- Add a discriminator to distinguish real edited audio from transformed raw audio.
- Process at multiple time scales (beat, bar, phrase level).
- Align segment boundaries to detected beats for smoother cuts.
- Replace the conv encoder with a transformer for better long-range dependencies.
- Add a VGG-style perceptual loss using audio embeddings (like CLAP or wav2vec).
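As a sketch of the last idea, the reconstruction objective could add a term comparing embeddings from a frozen audio encoder; embed below stands in for any such encoder (a CLAP- or wav2vec-style model adapted to this pipeline) and is purely hypothetical.

```python
import torch
import torch.nn.functional as F

def perceptual_recon_loss(x, recon, embed, alpha=0.1):
    """Spectrogram MSE plus an embedding-space ("perceptual") term.
    `embed` is a frozen audio embedding network -- hypothetical here."""
    with torch.no_grad():
        target_emb = embed(x)                      # no gradients through the target
    return F.mse_loss(recon, x) + alpha * F.mse_loss(embed(recon), target_emb)
```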
- Model may not have learned good representations
- Try training longer
- Increase model capacity
- Threshold too high
- Raw audio very different from edited style
- Lower threshold or check data quality
- Threshold too low
- Model learned identity mapping
- Check that raw and edited are actually different
audio_slicer/
├── __init__.py # Package exports
├── config.py # Configuration dataclasses
├── README.md # This file
├── IDEAS.md # Insights and future directions
├── models/
│ ├── scorer.py # Quality scorer (binary classifier - failed)
│ ├── autoencoder.py # Single autoencoder
│ └── dual_autoencoder.py # FaceSwap-style dual AE (failed)
├── data/
│ └── dataset.py # Segment extraction from cache
├── trainers/
│ ├── trainer.py # Dual AE training (failed)
│ ├── classifier_trainer.py # Binary classifier training (failed)
│ └── anomaly_trainer.py # Anomaly detector training (WORKS)
└── inference/
└── slicer.py # Segment scoring and selection
# Training scripts
_train_anomaly.py # Train the anomaly detector
_test_anomaly.py # Test segment selection
# Output
models/anomaly_detector/ # Trained model checkpoints
Using the anomaly detector on a 641-second raw audio file:
- Input: 641 seconds (raw)
- Output: 313 seconds (49% of original)
- Target: ~42% (based on actual edited audio ratio)
The model successfully identifies:
- Anomalies (high error): Transitions, problematic sections
- Good segments (low error): Clean, consistent audio
Important realization: The anomaly detector approach has a fundamental limitation.
What the model does:
- Learns what "good" audio sounds like
- Scores segments by reconstruction error
What the model does NOT do:
- Learn WHERE humans cut
- Learn WHAT to keep/repeat
- Learn HOW to structure the edit
All editing logic is manual code in the inference script:
- Where to cut → manual (energy-based, beat-aligned, onset detection)
- Minimum segment length → manual (8 seconds)
- How to transition → manual (crossfade mixing)
The model is a quality scorer, not an editor.
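For illustration, the crossfade mixing mentioned above amounts to something like the equal-power fade below; the function and fade length are assumptions, not the actual slicer code.

```python
import numpy as np

def crossfade(a, b, sr, fade_seconds=0.05):
    """Equal-power crossfade between two kept waveform segments (1-D float arrays)."""
    n = min(int(sr * fade_seconds), len(a), len(b))
    if n == 0:
        return np.concatenate([a, b])
    t = np.linspace(0.0, np.pi / 2, n)
    fade_out, fade_in = np.cos(t), np.sin(t)       # equal-power gain curves
    overlap = a[-n:] * fade_out + b[:n] * fade_in
    return np.concatenate([a[:-n], overlap, b[n:]])
```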
To build a system that actually learns editing behavior, we're planning a pointer network approach.
Input: Raw audio + Edited audio (as reference)
Output: Sequence of frame pointers [1000, 1001, ..., 2000, 1000, ...]
The model outputs which frames from raw audio to use, in what order (see the toy example after this list). This allows:
- Cutting: Skip frames
- Keeping: Include frames
- Looping: Point to same frames multiple times
- Reordering: Change sequence
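A toy example of why a pointer sequence is expressive enough: rendering the edit is just a gather over raw frames (the numbers and shapes are illustrative only).

```python
import numpy as np

raw_frames = np.random.randn(50_000, 80)           # ~10 min of mel frames (toy data)

def render_edit(raw_frames, pointers):
    """Assemble the edit by gathering raw frames in pointer order.
    Skipped indices = cuts, repeated indices = loops, any order = reordering."""
    return raw_frames[np.asarray(pointers)]

pointers = list(range(1000, 2000)) * 2             # keep frames 1000-1999, then loop them once
edited = render_edit(raw_frames, pointers)         # shape (2000, 80)
```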
Raw Audio ───→ Encoder ───→ ┐
├─→ Cross-attention ─→ Pointer Decoder ─→ [indices]
Edited Audio ─→ Encoder ───→ ┘
The cross-attention implicitly learns alignment (which parts of raw appear in edited).
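A rough PyTorch sketch of what that could look like; none of this is implemented yet, and every name and dimension below is a placeholder.

```python
import torch
import torch.nn as nn

class PointerDecoder(nn.Module):
    """Encode raw and edited frames, cross-attend, and score every raw frame
    as the candidate pointer for each output step."""

    def __init__(self, n_mels=80, d_model=256, n_heads=4):
        super().__init__()
        self.raw_enc = nn.Linear(n_mels, d_model)    # stand-ins for real encoders
        self.edit_enc = nn.Linear(n_mels, d_model)
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)

    def forward(self, raw, edited):
        # raw: (B, T_raw, n_mels), edited: (B, T_edit, n_mels)
        raw_h = self.raw_enc(raw)
        edit_h = self.edit_enc(edited)
        attended, _ = self.cross_attn(edit_h, raw_h, raw_h)   # each edit step queries raw
        # Pointer logits: similarity of each decoding step to every raw frame
        return torch.einsum('btd,bsd->bts', attended, raw_h)  # (B, T_edit, T_raw)
```

At training time, the logits for step t would be supervised with the raw frame index the human edit actually used at that step, which is exactly the aligned-pairs challenge listed below.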
- Long sequences: 10 min audio = 50k frames to point into
- Variable output length: Model must decide when to stop
- Training signal: Need aligned raw↔edited pairs
- Multiple valid edits: No single "correct" answer
See IDEAS.md for full details on limitations and possible improvements.