Hacker News

Show HN: Ourguide – OS-wide task guidance system that shows you where to click

33 points by eshaangulati yesterday at 6:19 PM | 13 comments

Hey! I'm eshaan, and I'm building Ourguide, an on-screen task guidance system that shows you where to click, step by step, when you need help.

I started building this because whenever I didn't know how to do something on my computer, I found myself constantly tabbing between chatbots and the app, pasting screenshots, and asking "what do I do next?" Ourguide solves this with two modes. In Guide mode, the app overlays your screen and highlights the specific element to click next, so you never have to leave your current window. In Ask mode, a vision-integrated chat captures your screen context (you can toggle this on and off at any time), so you can ask "How do I fix this error?" without having to explain what "this" is.
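For anyone curious how an overlay like Guide mode can sit on top of every app, here's a minimal Electron sketch of the general pattern: a transparent, click-through, always-on-top window that just draws the highlight. This is an illustration, not Ourguide's actual code; the "highlight" channel and the coordinates are made up.

    // overlay.ts - transparent click-through window that draws a highlight box.
    import { app, BrowserWindow, screen } from "electron";

    function createOverlay(): BrowserWindow {
      const { width, height } = screen.getPrimaryDisplay().bounds;
      const win = new BrowserWindow({
        width, height, x: 0, y: 0,
        transparent: true,   // desktop shows through the window
        frame: false,        // no title bar or borders
        alwaysOnTop: true,   // stays above the app being guided
        skipTaskbar: true,
        hasShadow: false,
      });
      // Forward clicks to the app underneath; only the highlight is rendered on top.
      win.setIgnoreMouseEvents(true, { forward: true });
      win.loadFile("overlay.html"); // renderer draws a box at the given coordinates
      return win;
    }

    app.whenReady().then(() => {
      const overlay = createOverlay();
      overlay.webContents.once("did-finish-load", () => {
        // Hypothetical "highlight" message: the grounding step supplies these coordinates.
        overlay.webContents.send("highlight", { x: 512, y: 300, width: 120, height: 40 });
      });
    });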

It’s an Electron app that works OS-wide, is vision-based, and isn't restricted to the browser.

Figuring out how to show the user where to click was the hardest part. I originally trained a computer vision model on 2,300 screenshots to detect and segment every UI element on a screen, then used a VLM to pick the correct element to highlight. While this worked extremely well (better than SOTA grounding models like UI Tars), the latency was just too high. I'll be making that CV+VLM pipeline OSS soon, but for now I've switched to a simpler implementation that achieves <1s latency.
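To give a sense of the shape of that original pipeline, here's a rough TypeScript sketch. The detectElements and askVlm helpers are hypothetical stand-ins, not the code that will be open-sourced.

    // Two-stage grounding: a detector proposes UI elements, a VLM picks one.
    interface Box { x: number; y: number; width: number; height: number; }
    interface Candidate { id: number; box: Box; crop: Buffer; }

    // Stage 1 (hypothetical helper): fine-tuned detector segments every UI element.
    declare function detectElements(screenshot: Buffer): Promise<Candidate[]>;

    // Stage 2 (hypothetical helper): VLM returns the id of the matching candidate.
    declare function askVlm(prompt: string, crops: Buffer[]): Promise<number>;

    async function ground(screenshot: Buffer, instruction: string): Promise<Box> {
      const candidates = await detectElements(screenshot);
      const chosenId = await askVlm(
        `Which numbered element should the user click to: "${instruction}"?`,
        candidates.map((c) => c.crop),
      );
      const match = candidates.find((c) => c.id === chosenId);
      if (!match) throw new Error("VLM returned an unknown element id");
      return match.box; // handed to the overlay to draw the highlight
    }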

You may ask: if the app can show you where to click, why doesn't it just click for you? While trying to build computer-use agents at my job in Palo Alto, I hit the core limitation of today's computer-use models: benchmark scores hover in the mid-50% range (OSWorld). VLMs often know what to do but not what it looks like; without reliable visual grounding, agents misclick and stall. So I built computer use without the "use": it provides the visual grounding of an agent but keeps the human in the loop for the actual execution, which prevents misclicks.

I personally use it for the AWS Console's "treasure hunt" UI, like creating a public S3 bucket with specific CORS rules. It's also been surprisingly helpful for non-technical tasks, like navigating obscure settings in Gradescope or Spotify. Ourguide works for any task where you're stuck or don't know what to do next.
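For context on why that S3 task feels like a treasure hunt: the end state is just a small CORS configuration, but the console spreads it across several screens. Here's roughly what the same result looks like with the AWS SDK for JavaScript v3 (the bucket name and origin are placeholders):

    import { S3Client, PutBucketCorsCommand } from "@aws-sdk/client-s3";

    const s3 = new S3Client({ region: "us-east-1" });

    // The CORS rules the console flow makes you hunt for.
    await s3.send(new PutBucketCorsCommand({
      Bucket: "my-public-bucket",                 // placeholder bucket name
      CORSConfiguration: {
        CORSRules: [{
          AllowedOrigins: ["https://example.com"], // placeholder origin
          AllowedMethods: ["GET", "PUT"],
          AllowedHeaders: ["*"],
          MaxAgeSeconds: 3000,
        }],
      },
    }));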

You can download and test Ourguide here: https://ourguide.ai/downloads

The project is still very early, and I'd love your feedback: where it fails, where it works well, and which specific niches you think Ourguide would be most helpful for.


Comments

ryannampham today at 3:03 AM

I like how it actually shows an image of your screen and where to place your cursor. This is honestly pretty cool.

DontBreakAlex yesterday at 9:36 PM

Looks cool. I think you should try to target it toward the elderly. My 99-year-old grandpa is capable of using a computer and browsing the web, but he struggles whenever he gets out of the "usual flow" (he accidentally removes the Chrome icon from his taskbar, or the crappy web-based email he insists on using over Thunderbird moves the add-attachment button). I end up having to use TeamViewer to show him what I can't explain over the phone. He would very much use an assistant that shows him what to do, especially if he could speak to it.

davelradindra today at 1:25 AM

Really interesting approach. Having a human in the loop seems like the right tradeoff given where computer-use models are today. One thing that came to mind is that this could become a new interface for software learning. If it works reliably, I could see it replacing static docs and videos!

ecto today at 2:05 AM

Can I put it on my mom's computer yet?

linkdead today at 12:11 AM

Good idea. I hope the AI can automatically learn from the docs of newer versions. Yesterday I used ChatGPT for "how to xxxxx in Blender?". Pasting screenshots manually is bothersome, and the biggest problem is that ChatGPT has no knowledge of Blender 5.

iosguyryan yesterday at 9:49 PM

Nicely conceived! This is the kind of feature Apple ought to have already delivered with on-device models and Private Cloud Compute.

Sending many whole screenshots to an indie mystery box, though, should be a non-starter for anyone without the skills to verify what any given update to this app is doing. Your website's featured use case highlights the risks (to you and users) unintentionally well: "How do I export my passwords?" (I did a double take: was this performance art from The Onion?) If a user opens a plain text file of secrets without closing this app/the help task, what gets captured, sent over the network, and saved to disk? What protections exist for, say, a computer-challenged elderly person's banking details?

A suggestion about the FAQ ...

"Where is my task history stored? Is it private? Your privacy is our top priority. Your task history is stored securely and encrypted on your local machine by default. You have full control over your data."

... This invites unanswered questions about what exactly from the screenshots is stored, for how long, and what design backs the "securely" claim. Being up front about this would invite trust and helpful developer feedback.

aventus-tech today at 1:51 AM

So sick

culopatin yesterday at 9:03 PM

What data do you extract from interactions?

gyanchawdhary yesterday at 9:54 PM

Check out https://techcrunch.com/2016/05/02/google-acquires-synergyse-...

Google acquired these guys back in 2016 to help users learn Google cloud products via interactive tutorials with step-by-step guidance/walkthroughs (the user had to install a Chrome extension).

One of the best use cases would be edtech: think of interactive labs where your product guides learners/students through completing a task and hand-holds them.