It's much deeper than that.
Segmentation in 2D is mostly a solved problem (Segment Anything is pretty fucking great). Segmentation in 3D is also fairly well done. You can use DINOv2 to do 3D object detection and segmentation.
The difficult part _after_ that is interacting with the object. Sparse and semi-dense point clouds can be generated and refined in real time, but they are point clouds, not meshes. This means that interacting with the object accurately is super hard, because it's not a simple mesh with surfaces you can test against. It's a bunch of points around the edges.
Where this is useful is that it lets you generate a mostly plausible, simple 3D model that can act as a stand-in for any further interactions. In VR you can use it as a collision object for physics. For robotics you can use it to plan interactions (e.g. placing objects on the table).
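To make the stand-in idea concrete, here's a rough sketch of the crudest possible version: collapse the point cloud into an axis-aligned bounding box and run collision tests against that instead of the raw points. This is only an illustration, not what any particular system does. The names `Box` and `proxy_box` are made up for this example, and a real pipeline would more likely fit a convex hull or a primitive shape rather than a plain box.

```python
from dataclasses import dataclass
from typing import Iterable, Tuple

Point = Tuple[float, float, float]


@dataclass(frozen=True)
class Box:
    """Hypothetical axis-aligned box used as a collision stand-in."""
    lo: Point  # minimum corner (x, y, z)
    hi: Point  # maximum corner (x, y, z)

    def contains(self, p: Point) -> bool:
        # A point is inside if it lies within the box on every axis.
        return all(l <= c <= h for l, c, h in zip(self.lo, p, self.hi))

    def overlaps(self, other: "Box") -> bool:
        # Two boxes overlap only if their extents overlap on every axis.
        return all(
            a_lo <= b_hi and b_lo <= a_hi
            for a_lo, a_hi, b_lo, b_hi in zip(self.lo, self.hi, other.lo, other.hi)
        )


def proxy_box(points: Iterable[Point]) -> Box:
    """Collapse a point cloud into the smallest axis-aligned box around it."""
    pts = list(points)
    if not pts:
        raise ValueError("need at least one point")
    xs, ys, zs = zip(*pts)
    return Box((min(xs), min(ys), min(zs)), (max(xs), max(ys), max(zs)))


# Usage: a handful of made-up points on a bottle, and a made-up "hand" volume.
bottle = proxy_box([(0.0, 0.0, 0.0), (0.1, 0.1, 0.3), (0.05, 0.02, 0.15)])
hand = Box((0.08, 0.05, 0.1), (0.2, 0.2, 0.2))
print(bottle.overlaps(hand))
```

The point is that the box gives you a closed volume to test against, which the raw points don't. It's wrong in the details (a bottle isn't a box), but it's cheap and good enough for "did the hand touch it" or "is there room on the table".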
It's also a step in the direction of answering "whose" object it is, rather than "what" the object is. "Whose water bottle is this?" is much, much harder to answer with machines (without markers) than "is this a water bottle?" or "where is the water bottle in this scene?"