Here we provide 40 audio comparisons between a baseline frequency-masking attack (Qin et al.+; see Qin et al. 2019, Szurley & Kolter 2019, Dörr et al. 2020, and Wang et al. 2020) and our proposed adaptive filtering attack (Adaptive Filtering). For reference, we also include the original unperturbed audio (Original), the spoofed utterance from the target speaker (Target), and a waveform-additive projected gradient descent attack that uses the same loss function and optimization procedure as the proposed filtering attack (PGD). All attacks are optimized through a set of simulated "over-the-air" environments, which induces large-magnitude perturbations. In the case of the PGD attack, the adversarial perturbation should be clearly audible. By contrast, the perturbations introduced by the Qin et al.+ attack are rendered more subtle through the use of a complex perceptually-inspired loss and a two-stage optimization procedure. Finally, our proposed Adaptive Filtering attack improves on the perceptual quality of the Qin et al.+ attack without requiring either the complex perceptually-inspired loss or the two-stage optimization procedure. In a user study, listeners given a two-way forced choice rate our proposed attack as less conspicuous than Qin et al.+ by a margin of 65.9% to 34.1%. For additional details, see our preprint.
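As a rough illustration of what "waveform-additive PGD" means, the sketch below adds a bounded perturbation directly to the waveform samples and repeatedly steps it toward a target embedding. It is not the implementation behind the examples on this page. The linear map `W` is a hypothetical stand-in for a speaker-verification model, the L-infinity bound `eps`, the step size, and the iteration count are assumed values, and the simulated over-the-air environments are omitted entirely.

```python
import numpy as np

def emb_loss(x, target_emb, W):
    """Squared distance between the toy embedding of x and the target embedding."""
    return float(np.sum((W @ x - target_emb) ** 2))

def pgd_additive(x, target_emb, W, eps=0.05, step=0.005, n_iter=100):
    """Toy waveform-additive PGD under an L-infinity bound (illustrative only).

    W is a hypothetical linear "embedding model"; eps, step and n_iter are
    assumed values, not settings taken from the paper.
    """
    delta = np.zeros_like(x)
    for _ in range(n_iter):
        # Analytic gradient of the squared embedding distance w.r.t. delta.
        residual = W @ (x + delta) - target_emb
        grad = 2.0 * W.T @ residual
        # Signed gradient step toward the target embedding.
        delta = delta - step * np.sign(grad)
        # Project back onto the L-infinity ball of radius eps.
        delta = np.clip(delta, -eps, eps)
        # Keep the perturbed waveform inside the valid sample range [-1, 1].
        delta = np.clip(x + delta, -1.0, 1.0) - x
    return x + delta

# Synthetic data standing in for a source waveform and a target embedding.
rng = np.random.default_rng(0)
W = rng.standard_normal((8, 256)) / 16.0
x = 0.1 * rng.standard_normal(256)
target_emb = W @ (0.1 * rng.standard_normal(256))

x_adv = pgd_additive(x, target_emb, W)
```

Because the perturbation is constrained only in sample magnitude, nothing in this formulation steers it away from audible regions of the signal, which is consistent with the PGD examples being clearly audible.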
| Source Speaker | Target Speaker | Source Audio | Target Audio | PGD | Baseline (Qin et al.+) | Proposed (Adaptive Filtering) |