Audio Examples


Here we provide 40 audio comparisons between a baseline frequency-masking attack (Qin et al.+; see Qin et al. 2019, Szurley & Kolter 2019, Dörr et al. 2020, Wang et al. 2020) and our proposed adaptive filtering attack (Adaptive Filtering). For reference, we also include the original unperturbed audio (Original), the spoofed utterance from the target speaker (Target), and a waveform-additive projected gradient descent attack that uses the same loss function and optimization procedure as the proposed filtering attack (PGD). All attacks are optimized through a set of simulated "over-the-air" environments, which induces large-magnitude perturbations. For the PGD attack, the adversarial perturbation should be clearly audible. By contrast, the perturbations introduced by the Qin et al.+ attack are rendered more subtle through a complex perceptually-inspired loss and a two-stage optimization procedure. Finally, our proposed Adaptive Filtering attack improves on the perceptual quality of the Qin et al.+ attack without requiring either a complex perceptually-inspired loss or a two-stage optimization procedure. In a user study with a two-way forced choice, listeners rated our proposed attack as less conspicuous than Qin et al.+ in 65.9% of comparisons, versus 34.1% for the baseline. For additional details, see our preprint.
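For readers unfamiliar with the PGD reference attack, the following minimal sketch illustrates a waveform-additive projected gradient descent loop. It is not our implementation: the L-infinity bound `eps`, the step size, the iteration count, and the names `pgd_additive`, `toy_loss`, and `toy_grad` are all assumptions made for illustration. The toy squared-error loss stands in for the actual attack loss, and the simulated over-the-air environments are omitted.

```python
import math


def pgd_additive(x, grad_fn, eps=0.05, step=0.005, n_iter=40):
    """Illustrative waveform-additive PGD (assumed L-inf bound `eps`).

    x       : list of float samples in [-1, 1]
    grad_fn : hypothetical callable returning d(loss)/d(waveform)
    """
    delta = [0.0] * len(x)
    for _ in range(n_iter):
        x_adv = [xi + di for xi, di in zip(x, delta)]
        grad = grad_fn(x_adv)
        for i, g in enumerate(grad):
            sign = (g > 0) - (g < 0)
            d = delta[i] - step * sign          # signed descent step on the loss
            d = max(-eps, min(eps, d))          # project onto the L-inf ball
            # keep the perturbed sample inside the valid waveform range
            d = max(-1.0, min(1.0, x[i] + d)) - x[i]
            delta[i] = d
    return [xi + di for xi, di in zip(x, delta)]


# Toy stand-in for the attack loss (assumption): squared distance to a
# "target" waveform, whose gradient is simply (waveform - target).
n = 160
source = [0.3 * math.sin(2 * math.pi * 5 * k / n) for k in range(n)]
target = [0.3 * math.sin(2 * math.pi * 7 * k / n) for k in range(n)]


def toy_loss(w):
    return 0.5 * sum((a - b) ** 2 for a, b in zip(w, target))


def toy_grad(w):
    return [a - b for a, b in zip(w, target)]


adversarial = pgd_additive(source, toy_grad)
```

Because the perturbation is added directly to each sample and constrained only in magnitude, nothing in this kind of loop shapes the perturbation perceptually, which is consistent with the PGD examples being clearly audible.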


Source Speaker | Target Speaker | Source Audio | Target Audio | PGD | Baseline (Qin et al.+) | Proposed (Adaptive Filtering)
5683 | 1320
1089 | 61
2300 | 4970
5105 | 1580
7021 | 2830
4507 | 8224
4970 | 2300
4077 | 4446
2961 | 5142
3729 | 8463
121 | 8230
237 | 5683
1580 | 7127
8230 | 1089
1188 | 672
1284 | 1995
8463 | 1284
8455 | 237
2830 | 6930
61 | 4077
6930 | 7021
672 | 2094
7176 | 6829
4446 | 908
3575 | 7729
5142 | 1221
8555 | 121
2094 | 2961
908 | 4992
7127 | 3575
8224 | 4507
7729 | 3570
1320 | 5639
3570 | 260
1995 | 8455
4992 | 8555
260 | 5105
1221 | 3729
5639 | 1188
6829 | 7176