Playing with the background, I tried to isolate just the espresso machine and the train sounds in one of their demos, and it seemed to fail. Maybe that's not the intended use case, but I thought it was odd that I could break it so easily on the sample material.
Footsteps, on the other hand, worked pretty well when I tried them. I wonder if a lot of it has to do with how well the model understands what the English description of the sound should sound like...