The way I understood it is that while individual fMRI studies can be amazing, it is borderline impossible to compare them across different people or even across different MRI machines. So reproducibility is a big issue, even though the tech itself is extremely promising.
It is in fact difficult even to compare scans of the same person on the same fMRI machine across sessions (and especially in developmental contexts).
Herting, M. M., Gautam, P., Chen, Z., Mezher, A., & Vetter, N. C. (2018). Test-retest reliability of longitudinal task-based fMRI: Implications for developmental studies. Developmental Cognitive Neuroscience, 33, 17–26. https://doi.org/10.1016/j.dcn.2017.07.001
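For a concrete sense of what "test-retest reliability" means here: studies like the one above typically summarize it with an intraclass correlation, often ICC(2,1), computed over per-subject activation estimates from two sessions. A minimal sketch of that statistic (the function name and the toy data are mine, not from the paper):

```python
import numpy as np

def icc_2_1(data):
    """ICC(2,1): two-way random effects, absolute agreement, single measurement.

    data: (n_subjects, n_sessions) array of activation estimates,
    e.g. contrast betas averaged over a region of interest.
    """
    n, k = data.shape
    grand = data.mean()
    row_means = data.mean(axis=1)   # per-subject means
    col_means = data.mean(axis=0)   # per-session means

    # Mean squares from the two-way ANOVA decomposition
    msr = k * np.sum((row_means - grand) ** 2) / (n - 1)   # subjects
    msc = n * np.sum((col_means - grand) ** 2) / (k - 1)   # sessions
    sse = np.sum((data - row_means[:, None] - col_means[None, :] + grand) ** 2)
    mse = sse / ((n - 1) * (k - 1))                         # residual

    return (msr - mse) / (msr + (k - 1) * mse + k * (msc - mse) / n)

# Toy example: 20 subjects scanned twice, with session-to-session noise
rng = np.random.default_rng(0)
true_signal = rng.normal(1.0, 0.5, size=20)
scans = true_signal[:, None] + rng.normal(0, 0.5, size=(20, 2))
print(f"ICC(2,1) = {icc_2_1(scans):.2f}")
```

Values much below ~0.7 are usually read as "poor to fair" reliability, which is roughly where a lot of task-based fMRI measures land.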
The article is pointing out that one of the base assumptions behind fMRI (that increased blood flow, which is what the machine can actually image, is strongly correlated with increased brain activity, which is what you want to measure) does not hold in many situations. This means the whole approach is suspect if you can't tell which situation you're in.
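To make that assumption concrete: standard fMRI analysis models the measured BOLD signal as underlying neural activity convolved with a hemodynamic response function (HRF), so any estimate of "activity" is really an estimate of the vascular response. A rough sketch of why that matters (the double-gamma HRF shape and the coupling gain are illustrative assumptions on my part, not from the article):

```python
import numpy as np
from scipy.stats import gamma

TR = 1.0                      # seconds per sample
t = np.arange(0, 30, TR)      # HRF support, ~30 s

def hrf(t):
    """Canonical double-gamma hemodynamic response function (illustrative)."""
    peak = gamma.pdf(t, 6)          # main positive response
    undershoot = gamma.pdf(t, 16)   # post-stimulus undershoot
    return peak - 0.35 * undershoot

# Identical neural activity: brief events at 20 s and 60 s
neural = np.zeros(120)
neural[[20, 60]] = 1.0

# BOLD = neural activity convolved with the HRF, scaled by neurovascular coupling.
# If coupling differs (vascular changes, medication, age, disease), the same
# neural events produce a different measured signal, and the GLM can't tell.
for coupling in (1.0, 0.5):
    bold = coupling * np.convolve(neural, hrf(t))[:len(neural)]
    print(f"coupling={coupling}: peak BOLD = {bold.max():.2f} for the same neural events")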
Individual fMRI is not a useful diagnostic tool for general conditions. There have been some clinics trying to push it (or SPECT) as a tool for diagnosing things like ADHD or chronic pain, but there is no scientific basis for this. The operator can basically crank up the noise and get some activity to show up, then tell the patient it’s a sign they have “ring of fire type ADHD” because they set the color pattern to reds and a circular pattern showed up at some point.
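On the "crank up the noise" point: if you threshold a statistical map leniently and without multiple-comparison correction, pure noise will reliably produce colorful "activations" somewhere in the brain. A toy sketch (voxel count and thresholds are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(42)
n_voxels = 50_000                      # a small whole-brain mask
z_map = rng.standard_normal(n_voxels)  # pure noise: no real activity anywhere

for z_thresh in (1.65, 2.3, 3.1):      # increasingly strict uncorrected cutoffs
    n_active = int(np.sum(z_map > z_thresh))
    print(f"z > {z_thresh}: {n_active} 'active' voxels out of {n_voxels} (all noise)")
```

Even at a fairly strict voxelwise cutoff, dozens of voxels survive by chance, and a motivated operator can always pick a threshold and color map that make them look meaningful.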
This isn't really true. The issue is that when you combine data across multiple MRI scanners (sites), you need to account for random effects (e.g. site-specific means and variances); see solutions like ComBat. Also, scanners from different manufacturers or with different hardware/software versions can have different SNR profiles. The other issue is that there are many preprocessing steps, each with many ways to perform them. In general, researchers don't process the data in multiple ways and then pick whichever gives them the result they want, or anything nefarious like that, but it does make comparisons difficult, since the effects of different preprocessing choices can be significant. To defend against this, many peer reviewers, myself included, ask researchers to run the preprocessing multiple ways to assess how robust the results are to those choices. Another way the field has combated this issue is standardized pipelines like fMRIPrep.
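Since ComBat came up: the core of it is modeling and removing site-specific location (mean) and scale (variance) effects in each feature. A stripped-down sketch of that idea (the function name and toy data are mine; real ComBat, e.g. the neuroCombat implementation, additionally pools estimates across features with empirical Bayes and can preserve covariates of interest, which this does not):

```python
import numpy as np

def harmonize_location_scale(features, sites):
    """Remove site-specific means and variances from each feature.

    features: (n_subjects, n_features) array, e.g. ROI-wise activation values
    sites:    (n_subjects,) array of site labels
    """
    out = np.empty_like(features, dtype=float)
    grand_mean = features.mean(axis=0)
    grand_std = features.std(axis=0)
    for site in np.unique(sites):
        mask = sites == site
        site_mean = features[mask].mean(axis=0)
        site_std = features[mask].std(axis=0)
        # Standardize within site, then rescale to the pooled distribution
        out[mask] = (features[mask] - site_mean) / site_std * grand_std + grand_mean
    return out

# Toy example: two scanners with different offsets and noise levels
rng = np.random.default_rng(1)
site_a = rng.normal(0.0, 1.0, size=(30, 5))
site_b = rng.normal(0.8, 2.0, size=(30, 5))   # different mean and variance
features = np.vstack([site_a, site_b])
sites = np.array(["A"] * 30 + ["B"] * 30)

harmonized = harmonize_location_scale(features, sites)
print("site means after harmonization:",
      harmonized[sites == "A"].mean().round(2),
      harmonized[sites == "B"].mean().round(2))
```

The danger, and the reason the real method is more careful, is that a naive adjustment like this will also scrub out any true biological differences that happen to be confounded with site.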