I am very interested in this, and I have personally built manual workflows that go YouTube video -> ripped audio -> transcript -> LLM context.
For example, taking a video about building garden retaining walls and generating detailed system prompts for Q&A with the expert in the video.
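A minimal sketch of that last step (transcript -> system prompt), assuming the transcript is already available as plain text. The function name `build_expert_prompt`, the `max_chars` cutoff, and the prompt wording are all made up for illustration, not anyone's actual setup:

```python
def build_expert_prompt(transcript: str, topic: str, max_chars: int = 12000) -> str:
    """Wrap a video transcript in a system prompt for Q&A with the video's expert.

    max_chars is a hypothetical, crude cutoff to keep the prompt within a
    context budget; a real workflow might chunk or summarize instead.
    """
    excerpt = transcript.strip()[:max_chars]
    return (
        f"You are the expert from a video about {topic}.\n"
        "Answer questions using only the transcript below. "
        "If the transcript does not cover something, say so.\n\n"
        f"--- TRANSCRIPT ---\n{excerpt}\n--- END TRANSCRIPT ---"
    )


if __name__ == "__main__":
    demo = "First dig a level trench, then compact a gravel base before the first course."
    print(build_expert_prompt(demo, "building garden retaining walls"))
```

Telling the model to stay within the transcript is one way to keep the answers tied to what the expert actually said.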
I often reference home improvement and tool videos, and the comments frequently contain points of wisdom or even corrections of mistakes (errata) on videos that are otherwise good. For example, videos on setting up a hand plane or on ways to mark a board you're working on.
Do you use video comments in your context? I've (manually) scraped content from educational videos and built prompts to assess signal and incorporate what are likely important errata into the LLM context.
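Roughly, the comment triage could be sketched like this. Note this is a cheap keyword-and-likes pre-filter standing in for the prompt-based assessment described above, and the cue list, the like threshold, and the `Comment` type are all assumptions for illustration:

```python
from dataclasses import dataclass

# Hypothetical cue phrases that often signal a correction; tune per domain.
ERRATA_CUES = ("correction", "actually", "mistake", "wrong", "should be", "instead of")


@dataclass
class Comment:
    text: str
    likes: int = 0


def looks_like_erratum(comment: Comment, min_likes: int = 5) -> bool:
    """Flag comments that mention a correction cue and have some community support."""
    lowered = comment.text.lower()
    return comment.likes >= min_likes and any(cue in lowered for cue in ERRATA_CUES)


def build_errata_context(comments: list[Comment], limit: int = 10) -> str:
    """Format the most-liked candidate errata as a block to append to the LLM context."""
    candidates = [c for c in comments if looks_like_erratum(c)]
    candidates.sort(key=lambda c: c.likes, reverse=True)
    lines = [f"- ({c.likes} likes) {c.text.strip()}" for c in candidates[:limit]]
    if not lines:
        return ""
    header = "Viewer comments that may correct or extend the video (unverified):"
    return "\n".join([header, *lines])


if __name__ == "__main__":
    sample = [
        Comment("Great video!", 100),
        Comment("Actually the cap iron should sit much closer to the edge.", 42),
    ]
    print(build_errata_context(sample))
```

Labeling the block as unverified lets the model treat the comments as hints rather than as a replacement for the transcript. In practice the surviving candidates would still go through an LLM pass to judge whether they are real errata.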
> video/resource —> transcript/text —>
For this step in your pipeline, are you multi-modal? I mean, are you using the LLM to interpret what is shown in the video itself? How is that content used?
Have you thought about allowing people to generate educational content from arbitrary videos?
To your last question, what do you mean by arbitrary? If the video is not educational at all, then the generated course will likely not be good. If the video is pure entertainment, then it's probably not a good use case.
For now we only use the YouTube transcript because, for most educational content, we've found it does about as well at lower cost.
We may make multimodal processing an option, though, since we also offer other resource types (PDF, slides, docs) -> course.