Strictly in terms of architecture, CNNs are still SOTA for small data visual tasks, especially when the target is a locally specific phenomenon where global context isn't as necessary. It has good inductive bias for this.
The main known way to improve performance on tasks like this is getting more data.