CSS has definitely become a breeze to work with since LLMs became a thing. Conceptually it's very much a "memorize how a billion possible combinations of obscure parameters interact with one another under various conditions" kind of problem, so it's a perfect fit for machines and a terrible fit for humans.
The main limitation, I think, is that they're blind as bats and don't understand how things actually look when rendered in the end. Even the best VLMs are still complete trash and can't even tell whether two lines intersect. Slapping an encoder on post-training doesn't do anything to help with visual understanding; it just adds some generic features the text model can react to.
I'll grant that. A lot of the time I want to give it a screenshot and say "here is what's wrong," and that's usually useless.
I will say, though, that multimodal capability varies between models. If I show Copilot a picture of a flower and ask for an ID, it is always wrong, often spectacularly so. If I show the same picture to Google Lens, the accuracy is good. Overall I wouldn't try anything multimodal with Copilot.
For that matter, I am finding these days that Google's AI mode outperforms Copilot and Junie on many coding questions. Faced with a Vite problem, Copilot will write a several-line Vite plugin that doesn't work; Google says "use the vite-ignore attribute."
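For context, Vite does have escape hatches along these lines, though the exact one depends on the problem (the comment above doesn't say which): a `vite-ignore` attribute on a tag in `index.html` tells Vite's HTML transform to leave that element alone, and a `/* @vite-ignore */` comment suppresses Vite's static-analysis warning on a fully dynamic import. A rough sketch of both, with placeholder URLs and variable names:

```html
<!-- index.html: the vite-ignore attribute tells Vite to skip
     processing this script (the src URL is a placeholder) -->
<script src="https://example.com/legacy-widget.js" vite-ignore></script>

<script type="module">
  // The related @vite-ignore comment silences Vite's warning
  // about an import path it can't analyze statically
  // (someRuntimeUrl is a hypothetical runtime-computed string):
  const mod = await import(/* @vite-ignore */ someRuntimeUrl);
</script>
```

Either of these is a one-liner compared to writing a custom Vite plugin, which matches the point being made about the answers' relative quality.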