I am confused by the color filter step.
Is the output produced by the sensor RGB or a single value per pixel?
The sensor outputs a single value per pixel. A later processing step is needed to interpret that data given knowledge about the color filter (usually a Bayer pattern) in front of the sensor.
The raw sensor output is a single value per sensor pixel, each of which sits behind a red, green, or blue color filter. So to get a usable image (where each pixel has a value for all three colors), we have to fill in the two missing colors at each pixel from the values of neighboring sensor pixels. This is the "debayering" (or demosaicing) process.
It's a single value per pixel, but each pixel has a different color filter in front of it, so effectively each pixel records only one of R, G, or B.
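To make the "one value per pixel" point concrete, here is a minimal sketch that simulates what a filtered sensor would record from a full-color scene. It assumes an RGGB layout (other orderings exist), and `bayer_color` and `mosaic` are hypothetical helper names made up for illustration, not part of any camera or RAW library.

```python
# Hypothetical helper: which color filter sits over sensor pixel (row, col),
# assuming an RGGB Bayer layout (layouts such as GRBG or BGGR also exist).
def bayer_color(row, col):
    if row % 2 == 0:
        return "R" if col % 2 == 0 else "G"
    return "G" if col % 2 == 0 else "B"


def mosaic(rgb_image):
    """Keep only the channel that the filter passes at each pixel.

    rgb_image is a list of rows, each row a list of (r, g, b) tuples.
    The result has one number per pixel, like raw sensor output.
    """
    channel = {"R": 0, "G": 1, "B": 2}
    return [
        [pixel[channel[bayer_color(r, c)]] for c, pixel in enumerate(row)]
        for r, row in enumerate(rgb_image)
    ]


# A 2x2 scene where every pixel is the same color (r=200, g=100, b=50).
scene = [[(200, 100, 50)] * 2 for _ in range(2)]
raw = mosaic(scene)
print(raw)  # [[200, 100], [100, 50]]
```

Even though the scene is one uniform color, the raw data looks like a checkerboard of different numbers, which is why it cannot be displayed directly as a color image.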
In its most raw form, a camera sensor only sees light intensity, not color.
In front of the sensor is a Bayer filter, so each physical pixel sees light filtered to R, G, or B.
From there, the software onboard the camera or in your RAW converter interpolates to create RGB values at each pixel. For example, if the local pixel is R-filtered, it interpolates its G and B values from nearby pixels behind those filters.
https://en.wikipedia.org/wiki/Bayer_filter
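The interpolation step described above can be sketched roughly as follows. This is only the simplest neighbor-averaging approach, not what any particular camera or RAW converter actually does (real ones use more sophisticated, edge-aware methods). It assumes an RGGB layout and an image of at least 2x2 pixels; `bayer_color` and `demosaic` are hypothetical names for illustration.

```python
# Hypothetical helper: filter color at (row, col), assuming an RGGB layout.
def bayer_color(row, col):
    if row % 2 == 0:
        return "R" if col % 2 == 0 else "G"
    return "G" if col % 2 == 0 else "B"


def demosaic(raw):
    """Naive demosaic: average same-colored neighbors in a 3x3 window.

    raw is a list of rows with one number per pixel (at least 2x2).
    Returns a list of rows of (r, g, b) float tuples.
    """
    height, width = len(raw), len(raw[0])
    out = []
    for r in range(height):
        out_row = []
        for c in range(width):
            pixel = []
            for channel in "RGB":
                if bayer_color(r, c) == channel:
                    # This pixel measured the channel directly.
                    pixel.append(float(raw[r][c]))
                    continue
                # Otherwise, average nearby pixels behind that filter,
                # clipping the 3x3 window at the image borders.
                samples = [
                    raw[rr][cc]
                    for rr in range(max(r - 1, 0), min(r + 2, height))
                    for cc in range(max(c - 1, 0), min(c + 2, width))
                    if bayer_color(rr, cc) == channel
                ]
                pixel.append(sum(samples) / len(samples))
            out_row.append(tuple(pixel))
        out.append(out_row)
    return out


# Raw data from a uniformly colored scene (R=200, G=100, B=50).
raw = [
    [200, 100, 200, 100],
    [100, 50, 100, 50],
    [200, 100, 200, 100],
    [100, 50, 100, 50],
]
rgb = demosaic(raw)
print(rgb[1][1])  # (200.0, 100.0, 50.0)
```

The pixel at row 1, column 1 sits behind a blue filter, so its 50 is a direct measurement, while its red and green values are averages of the four nearest red and four nearest green pixels. On real images with fine detail, this guessing is where softness and color artifacts come from.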
There are alternatives, such as what Fuji does with its X-Trans sensor filter.
https://en.wikipedia.org/wiki/Fujifilm_X-Trans_sensor
Another alternative is Foveon (now owned by Sigma), which makes full-color pixel sensors, but they have not kept up with the state of the art.
https://en.wikipedia.org/wiki/Foveon_X3_sensor
This is also why Leica's B&W-sensor cameras have higher apparent sharpness and ISO sensitivity than the related color-sensor models: there is no color filter in front of the sensor and no software interpolation happening.