I've been trying to get agentic coding to work, but the dissonance between what I'm seeing online and what I'm able to achieve is doing my head in.
Is there real evidence, beyond hype, that agentic coding produces net-positive results? If any of you have actually got it to work, could you share (in detail) how you did it?
By "getting it to work" I mean: * creating more value than technical debt, and * producing code that’s structurally sound enough for someone responsible for the architecture to sign off on.
Lately I’ve seen a push toward minimal or nonexistent code review, with the claim that we should move from “validating architecture” to “validating behavior.” In practice, this seems to mean: don’t look at the code; if tests and CI pass, ship it. I can’t see how this holds up long-term. My expectation is that you end up with "spaghetti" code that works on the happy path but accumulates subtle, hard-to-debug failures over time.
When I tried using Codex on my existing codebases, with or without guardrails, half of my time went into fixing the subtle mistakes it made or the duplication it introduced.
Last weekend I tried building an iOS app for pet feeding reminders from scratch. I instructed Codex to research and propose an architectural blueprint for SwiftUI first. Then, I worked with it to write a spec describing what should be implemented and how.
The first implementation pass was surprisingly good, although it had a number of bugs. Things went downhill fast, however. I spent the rest of my weekend getting Codex to make things work, fix bugs without introducing new ones, and research best practices instead of making stuff up. Although I made it record new guidelines and guardrails as I found them, things didn't improve. In the end I just gave up.
I personally can't accept shipping unreviewed code. It feels wrong. The product has to work, but the code must also be high-quality.
try other harnesses than codex.
I've had more success with review tools, rather than expecting the agent to get the code quality right the first time.
current workflow
1. specs/requirements/design, outputting tasks
2. implementation, outputting code and tests
3. run review scripts/debug loops, outputting tasks
4. implement tasks
5. go back to 3
the quality of specs, tasks, and review scripts makes a big difference
one of the biggest things that improves results is getting a feedback loop from what the app actually does back to the agent: good logs, being able to interact with it / take screenshots a la Playwright, etc.
guidelines and guardrails are best if they're tools that the agent runs, or that run automatically to give feedback.
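Concretely, that feedback loop can be as small as a script the agent is allowed to run after each change. A rough sketch in Python, assuming a web app served locally and Playwright's Python bindings; the port, output paths, and file names here are made up:

```python
# Hypothetical feedback script an agent can run after each change:
# loads the app, collects console errors and a screenshot, and writes
# them where the agent can read them back. Assumes a dev server at
# http://localhost:3000, `pip install playwright`, and
# `playwright install chromium`.
from pathlib import Path
from playwright.sync_api import sync_playwright

OUT = Path("feedback")
OUT.mkdir(exist_ok=True)

with sync_playwright() as p:
    browser = p.chromium.launch()
    page = browser.new_page()
    errors = []
    page.on("console", lambda msg: errors.append(msg.text) if msg.type == "error" else None)
    page.goto("http://localhost:3000", wait_until="networkidle")
    page.screenshot(path=OUT / "home.png", full_page=True)
    (OUT / "console-errors.txt").write_text("\n".join(errors) or "no console errors")
    browser.close()
```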
The way I see it, for non-trivial things you have to build your method piece by piece. Then things start to improve. It's a process of... developing a process.
Write a good AGENTS.md (or CLAUDE.md) and you'll see that code is more idiomatic. Ask it to keep a changelog. Have the LLM write a plan before starting code. Ask it to ask you questions. Write abstraction layers it (along with the fellow humans of course) can use without messing with the low-level detail every time.
In a way you have to develop a framework to guide the LLM behavior. It takes time.
If you're building something new, stick with languages/problems/projects that have plenty of analogues in the opensource world and keep your context windows small, with small changes.
One-shotting an application that is very bespoke and niche is not going to go well, and same goes for working on an existing codebase without a pile of background work on helping the model understand it piece by piece, and then restricting it to small changes in well-defined areas.
It's like teaching an intern.
I've built multiple new apps with it and manage two projects that I wrote. I barely write any code other than frontend, copy, etc.
One is a VSCode extension and has thousands of downloads across different flavors of the IDE -- won't plug it here to spare the downvotes ;)
Been a developer professionally for nearly 20 years. It is 100% replacing most of the things I used to code.
I spend most of my time while it's working testing what it's built to decide on what's next. I also spend way more time on DX of my own setup, improving orchestration, figuring out best practice guidance for the Agent(s), and building reusable tools for my Agents (MCP).
My main rule is never to commit code you don’t understand because it’ll get away from you.
I employ a few tricks:
1- I avoid auto-complete and always try to read what it does before committing. When it is doing something I don’t want, I course correct before it continues
2- I ask the LLM questions about the changes it is making and why. I even ask it to make me HTML schema diagrams of the changes.
3- I use my existing expertise. So I am an expert Swift developer, and I use my Swift knowledge to articulate the style of what I want to see in TypeScript, a language I have never worked in professionally.
4- I add the right testing and build infrastructure to put guardrails on its work (see the sketch after this list).
5- I have an extensive library of good code for it to follow.
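For point 4, a minimal sketch of what a guardrail runner could look like; the build/lint/test commands below are placeholders for whatever the project actually uses, not anyone's real setup:

```python
#!/usr/bin/env python3
# Hypothetical guardrail runner: the agent (or a git hook / CI job)
# must get a clean pass from this before its work is accepted.
# The commands are placeholders for the project's real toolchain.
import subprocess
import sys

CHECKS = [
    ("build", ["npm", "run", "build"]),
    ("lint", ["npm", "run", "lint"]),
    ("tests", ["npm", "test"]),
]

failed = []
for name, cmd in CHECKS:
    print(f"== {name}: {' '.join(cmd)}")
    if subprocess.run(cmd).returncode != 0:
        failed.append(name)

if failed:
    print(f"Guardrail failures: {', '.join(failed)}", file=sys.stderr)
    sys.exit(1)
print("All guardrails passed.")
```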
I am in the same boat as you.
The only positive agentic coding experience I had was using it as a "translator" from some old unmaintained shell + C code to Go.
I gave it the old code, told it to translate to Go. I pre-installed a compiled C binary and told it to validate its work using interop tests.
It took about four hours of what the vibecoding lovers call "prompt engineering" but at the end I have to admit it did give me a pretty decent "translation".
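As an illustration of the interop-test idea (not the actual project; the binary paths, flags, and cases below are invented), a differential check against the legacy binary can be as simple as:

```python
# Hypothetical differential test: run the legacy C binary and the new
# Go port on the same inputs and require identical output and exit code.
# Binary paths and sample inputs are placeholders.
import subprocess

LEGACY = "./legacy/tool-c"   # pre-built C binary
PORT = "./bin/tool-go"       # freshly compiled Go port
CASES = [["--sum", "1", "2", "3"], ["--max", "9", "4"], ["--help"]]

def run(binary, args):
    out = subprocess.run([binary, *args], capture_output=True, text=True)
    return out.returncode, out.stdout

for args in CASES:
    if run(LEGACY, args) != run(PORT, args):
        raise SystemExit(f"mismatch for args: {args}")
print(f"{len(CASES)} cases match the legacy behaviour")
```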
However for everything else I have tried (and yes, vibecoders, "tried" means very tightly defined tasks) all I have ever got is over-engineered vibecoding slop.
The worst part of it is that, because the typical cut-off window is anywhere between 6 and 18 months in the past, you get slop that is full of deprecated code, because there is almost always a newer/more efficient way to do things. Even in languages like Go. The difference between an AI-slop answer for Go 1.20 and a human-coded Go 1.24/1.25 one can be substantial.
I’ve heard coding agents best described as a fleet of junior developers available to you 24/7 and I think that’s about right. With the added downside that they don’t really learn as they go so they will forever be junior developers (until models get better).
There are projects where throwing a dozen junior developers at the problem can work but they’re very basic CRUD type things.
When you have a hammer, everything looks like a nail. Ad nauseam.
AI has made it possible for me to build several one-off personal tools in the matter of a couple of hours and has improved my non-tech life as a result. Before, I wouldn't even have considered such small projects because of the effort needed. It's been relieving not to have to even look at code, assuming you can describe your needs in a good prompt. On the other hand, I've seen vibe coded codebases with excessive layers of abstraction and performance issues that came from a possibly lax engineering culture of not doing enough design work upfront before jumping into implementation. It's a classic mistake, that is amplified by AI.
Yes, average code itself has become cheap, but good code still costs, and amazing code, well, you might still have an edge there for now, but eventually, accept that you will have to move up the abstraction stack to remain valuable when pitted against an AI.
What does this mean? Focus on core software engineering principles, design patterns, and understanding what the computer is doing at a low level. Just because you're writing TypeScript doesn't mean you shouldn't know what's happening at the CPU level.
I predict the rise of AI-slop cleanup consultancies, but they'll be competing with smarter AIs that will clean up after themselves.
Don't use it myself. But I have a client who uses it. The bugs it creates are pretty funny. Constantly replacing parts of code with broken or completely incorrect things. Breaking things that previously worked. Deleting random things.
Yes. Over the last month, I've made heavy use of agentic coding (a bit of Junie and Amp, but mostly Antigravity) to ship https://www.ratatui-ruby.dev from scratch. Not just the website... the entire thing.
The main library (rubygem) has 3,662 code lines and 9,199 comment lines of production Ruby and 4,933 code lines and 710 comment lines of Rust. There are a further 6,986 code lines and 2,304 comment lines of example applications code using the library as documentation, and 4,031 lines of markdown documentation. Plus, 11,902 code lines and 2,164 comment lines of automated tests. Oh, and 4,250 lines in bin/ and tasks/ but those are lower-quality "internal" automation scripts and apps.
The library is good enough that Sidekiq is using it to build their TUI. https://github.com/sidekiq/sidekiq/issues/6898
But that's not all I've built over this timeframe. I'm also a significant chunk of the way through an MVU framework, https://rooibos.run, built on top of it. That codebase is 1,163 code lines and 1,420 comment lines of production Ruby, 4,749 code lines and 521 comment lines of automated tests. I need to add to the 821 code lines 221 comment lines of example application code using the framework as documentation, and to the 2,326 lines of markdown documentation.
It's been going so well that the plan is to build out an ecosystem: the core library, an OOP and an FP library, and a set of UI widgets. There are 6,192 lines of markdown in the wiki about it: mailing list archives, AI chat archives, current design & architecture, etc.
For context, I am a long-time hobbyist Rubyist but I cannot write Rust. I have very little idea of the quality of the Rust code beyond what static analyzers and my test suite can tell me.
It's all been done very much in public. You can see every commit going back to December 22 in the git repos linked from the "Sources" tab here: https://sr.ht/~kerrick/ratatui_ruby/ If you look at the timestamps you'll even notice the wild difference between my Christmas vacation days, and when I went back to work and progress slowed. You can also see when I slowed down to work on distractions like https://git.sr.ht/~kerrick/ramforge/tree and https://git.sr.ht/~kerrick/semantic_syntax/tree.
If it keeps going as well as it has, I may be able to rival Charm's BubbleTea and Bubbles by summertime. I'm doing this to give Rubyists the opportunity to participate in the TUI renaissance... but my ultimate goal is to give folks who want to make a TUI a reason to learn Ruby instead of Go or Rust.
I think of coding agents more like "typing assistants" than programmers. If you know exactly what and how to do what you want, you can ask them to do it with clear instructions and save yourself the trouble of typing the code out.
Otherwise, they are bad.
I have a small-ish vertical SaaS that is used heavily by ~700 retail stores. I have enabled our customer success team to fix bugs using GitHub copilot. I approve the PRs, but they have fixed a surprising number of issues.
Yes, constantly.
I don’t know what I do differently, but I can get Cursor to do exactly what I want all the time.
Maybe it’s because it takes more time and effort, and I don’t connect to GitHub or actual databases, nor do I allow it to run terminal commands 99% of the time.
I have instructions for it to write up readme files of everything I need to know about what it has done. I’ve provided instructions and created an allow list of commands so it creates local backups of files before it touches them, and I always proceed through a plan process for any task that is slightly more complicated, followed by plan cleanup, and execution. I’m super specific about my tech stack and coding expectations too. Tests can be hard to prompt, I’ll sometimes just write those up by hand.
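As an illustration of that backup-before-touching idea (the helper name and backup directory here are invented, not my actual setup), the allow-listed command can be something as small as:

```python
# Hypothetical "backup before edit" helper on the agent's allow list:
# copies the target file into a timestamped local backup directory
# before the agent is permitted to modify it.
import shutil
import sys
from datetime import datetime
from pathlib import Path

BACKUP_ROOT = Path(".backups")

def backup(path: str) -> Path:
    src = Path(path)
    stamp = datetime.now().strftime("%Y%m%d-%H%M%S")
    dest = BACKUP_ROOT / stamp / src
    dest.parent.mkdir(parents=True, exist_ok=True)
    shutil.copy2(src, dest)
    return dest

if __name__ == "__main__":
    for f in sys.argv[1:]:
        print(f"backed up {f} -> {backup(f)}")
```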
Also, I’ve never had to pay over my $60 a month pro plan price tag. I can’t figure out how others are even doing this.
At any rate, I think the problem appears to be the blind commands of “make this thing, make it good, no bugs” and “this broke. Fix!” I kid you not, I see this all the time with devs. Not at all saying this is what you do, just saying it’s out there.
And “high quality code” doesn’t actually mean anything. You have to define what that means to you. Good code to me may be slop to you, but who knows unless it is defined.
Yes.
Caveat: can't be pure vibes. Needs ownership, care, review and willingness to git reset and try again when needed. Needs a lot of tests.
Caveat: Greenfield.
Works pretty great for me, especially Spec-driven development using OpenSpec
- Cleaner code
- Easily 5x speed minimum
- Better docs, designs
- Focus more on the product than the mechanics
- More time for family
Honestly, I only use coding agents when I feel too lazy to type lots of boilerplate code.
As in "Please write just this one for me". Even still, I take care to review each line produced. The key is making small changes at a time.
Otherwise, I type out and think about everything being done when in ‘Flow State’. I don't like the feeling of vibe coding for long periods. It completely changes the way work is done; it takes away agency.
On a bit of a tangent, I can't get in Flow State when using agents. At least not as we usually define it.
I did the same experiment as you, and this is what I learned:
https://www.linkedin.com/pulse/concrete-vibe-coding-jorge-va...
The bottom line is this:
* The developer stops being a developer and becomes a product designer with high technical skills.
* This is a different set of skills than a developer or a product owner currently has. It is a mix of both, and expectations of how agentic development works need to be adjusted.
* Agents behave like junior developers: they type very fast and produce something that has a high probability of working. Their priority will be to make it work, not maintainability, scalability, etc. Agents can achieve those if you detail how to produce them.
* Working with an agent feels more like mentoring the AI than asking and receiving.
* When I start to work on a product that will be vibe coded, I need to have clear in my head all the user stories, the code architecture, the whole system. Then I can start to tell the agent what to build, and correct and annotate the code-quality decisions in the md files so it remembers them.
* Use TDD: ask the agent to create the tests, and then code to the test. Don't correct the bugs yourself; make the agent correct them and explain why each is a bug, especially with code design decisions. Store those in an AGENTS.md file at the root of the project.
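As a minimal illustration of that test-first loop (in Python for brevity; the module and names like Pet and next_feeding are invented), the point is only that the failing test exists before the agent writes the implementation:

```python
# Hypothetical test written *before* asking the agent to implement
# reminder scheduling; the agent's job is to make it pass without
# touching the test.
from datetime import datetime, timedelta

from feeding import Pet, next_feeding  # module the agent will create

def test_next_feeding_is_interval_after_last_meal():
    pet = Pet(name="Milo", interval_hours=8)
    last = datetime(2025, 1, 1, 7, 0)
    assert next_feeding(pet, last_fed=last) == last + timedelta(hours=8)
```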
There are more things that can be done to guide the agent, but I need to have clear in an articulable way the direction of the coding. On the other side, I don't worry about implementation details like how to use libraries and APIs that I am not familiar with, the agent just writes and I test.
Currently I am working on a product, and I can tell you that working no more than 10 hours a week (2 hours here, 3 there, leaving the agent working while I am having dinner with family), I am progressing, I would say, 5 to 10 times faster than without it. So yeah, it works, but I had to adjust how I do my job.
> Scaling long-running autonomous coding https://news.ycombinator.com/item?id=46624541
I review it as I generate it, for quality. I guide it to be self-testing: create unit tests and integration tests according to my standards.
You are asking two very different questions here.
i.e. You are asking a question about whether using agents to write code is net-positive, and then you go on about not reviewing the code agents produce.
I suspect agents are often net-positive AND one has to review their code. Just like most people's code.
Treat it as a pair programmer. Ask it questions like "How do I?", "When I do X, Y happens, why is that?", "I think Z, prove me wrong" or "I want to do P, how do you think we should do it?"
Feed it little tasks (30 seconds to 5 minutes), and if you don't like this or that about the code it gives you, either tell it something like
Rewrite the selection so it uses const, ? and :
or edit something yourself and say I edited what you wrote to make it my own, what do you think about my changes?
If you want to use it as a junior dev who gets sent off to do tickets and comes back with a patch three days later that will fail code review, be my guest, but I greatly enjoy working with a tight feedback loop.

> Last weekend I tried building an iOS app for pet feeding reminders from scratch.
Just start smaller. I'm not sure why people try to jump immediately to creating an entire app when they haven't even gotten any net-positive results at all yet. Just start using it for small time saving activities and then you will naturally figure out how to gradually expand the scope of what you can use it for.
Care to share the pet feeder's code and what the bugs are and how it went off the rails? Seems like a perfect scenario for us to see how much is prompting skill, how much is a setup, how much is just patience for the thing, and how much is hype/lies.
> The product has to work, but the code must also be high-quality.
I think in most cases the speed at which AI can produce code outweighs technical debt, etc.
I haven't (yet) tried Claude but have good experiences with Codex CLI the last few weeks.
Previously I tried to use Aider and OpenAI about 6 or 7 months ago and it was a terrible mess. I went back to pasting snippets in the browser chat window until a few weeks ago and thought agents were mostly hype (I was wrong).
I keep a browser chat window open to talk about the project at a higher level. I'll post command line output like `ls` and `cat` to the higher level chat and use Codex strictly for coding. I haven't tried to one shot anything. I just give it a smallish piece of work at a time and check as it goes in a separate terminal window. I make the commits and delete files (if needed) and anything administrative. I don't have any special agent instructions. I do give Codex good hints on where to look or how to handle things.
It's probably a bit slower than what some people are doing but it's still very fast and so far has worked well. I'm a bit cautious because of my previous experience with Aider which was like roller skating drunk while juggling open straight razors and which did nothing but make a huge mess (to be fair I didn't spend much time trying to tame it).
I'm not sold on Codex or openAI compared to other models and will likely try other agents later, but so far it's been good.
I've been increasingly removing myself from the typing part since August. For the last few months, I haven't written a single line of code, despite producing a lot more.
I'm using Claude Code. I've been building software as a solo freelancer for the last 20+ years.
My latest workflow
- I work on "regular" web apps, C#/.NET on backend, React on web.
- I'm using 3-8 sessions in parallel, depending on the tasks and the mental bandwidth I have, all visible on external display.
- I have markdown rule files & documentation, 30k lines in total. Some of them describe how I want the agent to work (rule files), some describe the features/systems of the app.
- Depending on what I'm working on, I load relevant rule files selectively into the context via commands. I have a /fullstack command that loads @backend.md, @frontend.md and a few more. I have similar /frontend, /backend, /test commands with a few variants. These are the load-bearing columns of my workflow. Agents take a lot more time and produce more slop without these. Each one is written by agents also, with my guidance. They evolve based on what we encounter.
- Every feature in the app, and every system, has a markdown document that's created by the implementing agent, describing how it works, what it does, where it's used, why it's created, main entry points, main logic, gotchas specific to this feature/system etc. After every session, I have /write-system, /write-feature commands that I use to make the agent create/update those, with specific guidance on verbosity, complexity, length.
- Each session I select a specific task for a single system. I reference the relevant rule files and feature/system doc, and describe what I want it to achieve and start plan mode. If there are existing similar features, I ask the agent to explore and build something similar.
- Each task is specifically tuned to be planned/worked in a single session. This is the most crucial role of mine.
- For work that would span multiple sessions, I use a single session to create the initial plan, then plan each phase in depth in separate sessions.
- After it creates the plan, I examine, do a bit of back and forth, then approve.
- I watch it while it builds. Usually I have 1-2 main tasks and a few subtasks going in parallel. I pay close attention to main tasks and intervene when required. Subtasks rarely require intervention due to their scope.
- After the building part is done, I go through the code via editor, test manually via UI, while the agent creates tests for the thing we built, again with specific guidance on what needs to be tested and how. Since the plan is pre-approved by me, this step usually goes without a hitch.
- Then I make the agent create/update the relevant documents.
- Last week I built another system to enhance that flow. I created a /devlog command. With the assist of some CLI tools and Claude log parsing, it creates a devlog file with some metadata (tokens, length, files updated, docs updated etc) and the agent fills it with a title, summary of work, key decisions, lessons learned. The first prompt is also copied there. These also get added to the relevant feature/system document automatically as changelog entries. So, for every session, I have a clear document about what got done, how long it took, what the gotchas were, what went right, what went wrong, etc. This proved to be invaluable even with a week's worth of devlogs, and allows me to further refine my workflows.
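As a rough sketch of what such a post-session devlog script could look like (the transcript format, field names, and output paths here are assumptions for illustration, not the actual implementation):

```python
# Rough sketch of a post-session devlog generator. It assumes the
# session transcript was exported as JSON lines with "role",
# "content" and optional "usage" fields; the real log format and
# the metadata you care about will differ.
import json
import sys
from datetime import date
from pathlib import Path

def summarize(transcript_path: str) -> str:
    prompts, tokens = [], 0
    for line in Path(transcript_path).read_text().splitlines():
        entry = json.loads(line)
        if entry.get("role") == "user":
            prompts.append(str(entry.get("content", ""))[:200])
        usage = entry.get("usage") or {}
        tokens += usage.get("total_tokens", 0)
    first_prompt = prompts[0] if prompts else "(none)"
    return (
        f"# Devlog {date.today()}\n\n"
        f"- user prompts: {len(prompts)}\n"
        f"- approx tokens: {tokens}\n\n"
        f"## First prompt\n\n{first_prompt}\n"
    )

if __name__ == "__main__":
    out = Path("devlogs") / f"{date.today()}.md"
    out.parent.mkdir(exist_ok=True)
    out.write_text(summarize(sys.argv[1]))
    print(f"wrote {out}")
```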
This looks convoluted at first glance, but it's evolved over the months and works great. The code quality is almost the same as what I would have written myself. All because of existing code to use as examples, and the rule files guiding the agents. I was already a fast builder before, but with agents it's a whole new level.
And this flow really unlocked with Opus 4.5. Sonnet 3.5/4/4.5 also worked OK, but required a lot more handholding, steering, and correction. Parallel sessions weren't really possible without producing slop. Opus 4.5 is significantly better.
More technical/close-to-hardware work will most likely require a different set of guidance & flow to create non-slop code. I don't have any experience there.
You need to invest in improving the workflow. The capacity is there in the models. The results all depends on how you use them.
You need to perturb the token distribution by overlaying multiple passes. Any strategy that does this would work.
This is 1/3 of the response to a short prompt about implementation options for moving GitHub runners from a broken server to GitHub Enterprise Cloud:

# EC2-Based GitHub Actions Self-Hosted Runners - Complete Implementation
## Architecture Overview
This solution deploys auto-scaling GitHub Actions runners on EC2 instances that can trigger your existing AWS CodeBuild pipelines. Runners are managed via Auto Scaling Groups with automatic registration and health monitoring.
## Prerequisites
- AWS CLI configured with appropriate credentials
- GitHub Enterprise Cloud organization admin access
- Existing CodeBuild project(s)
- VPC with public/private subnets
## Solution Components
### 1. CloudFormation Template

### 2. GitHub Workflow for CodeBuild Integration

## Deployment Steps
### Step 1: Create GitHub Personal Access Token
1. Navigate to GitHub → Settings → Developer settings → Personal access tokens → Fine-grained tokens
2. Create token with these permissions:
   - *Repository permissions:*
     - Actions: Read and write
     - Metadata: Read
   - *Organization permissions:*
     - Self-hosted runners: Read and write

```bash
# Store token securely
export GITHUB_PAT="ghp_xxxxxxxxxxxxxxxxxxxx"
export GITHUB_ORG="your-org-name"
```
### Step 2: Deploy CloudFormation Stack
```bash
# Set variables
export AWS_REGION=us-east-1
export STACK_NAME=github-runner-ec2
export VPC_ID=vpc-xxxxxxxx
export SUBNET_IDS="subnet-xxxxxxxx,subnet-yyyyyyyy"

# Deploy stack
aws cloudformation create-stack \
  --stack-name $STACK_NAME \
  --template-body file://github-runner-ec2-asg.yaml \
  --parameters \
    ParameterKey=VpcId,ParameterValue=$VPC_ID \
    ParameterKey=PrivateSubnetIds,ParameterValue=\"$SUBNET_IDS\" \
    ParameterKey=GitHubOrganization,ParameterValue=$GITHUB_ORG \
    ParameterKey=GitHubPAT,ParameterValue=$GITHUB_PAT \
    ParameterKey=InstanceType,ParameterValue=t3.medium \
    ParameterKey=MinSize,ParameterValue=2 \
    ParameterKey=MaxSize,ParameterValue=10 \
    ParameterKey=DesiredCapacity,ParameterValue=2 \
    ParameterKey=RunnerLabels,ParameterValue="self-hosted,linux,x64,ec2,aws,codebuild" \
    ParameterKey=CodeBuildProjectNames,ParameterValue="" \
  --capabilities CAPABILITY_NAMED_IAM \
  --region $AWS_REGION

# Wait for completion (5-10 minutes)
aws cloudformation wait stack-create-complete \
  --stack-name $STACK_NAME \
  --region $AWS_REGION

# Get stack outputs
aws cloudformation describe-stacks \
  --stack-name $STACK_NAME \
  --query 'Stacks[0].Outputs' \
  --region $AWS_REGION
```
### Step 3: Verify Runners
```bash
# Check Auto Scaling Group
ASG_NAME=$(aws cloudformation describe-stacks \
  --stack-name $STACK_NAME \
  --query 'Stacks[0].Outputs[?OutputKey==`AutoScalingGroupName`].OutputValue' \
  --output text)

aws autoscaling describe-auto-scaling-groups \
  --auto-scaling-group-names $ASG_NAME \
  --region $AWS_REGION

# List running instances
aws ec2 describe-instances \
  --filters "Name=tag:aws:autoscaling:groupName,Values=$ASG_NAME" \
  --query 'Reservations[].Instances[].[InstanceId,State.Name,PrivateIpAddress]' \
  --output table

# Check CloudWatch logs
aws logs tail /github-runner/instances --follow
```
### Step 4: Verify in GitHub
Navigate to: `https://github.com/organizations/YOUR_ORG/settings/actions/r...`
You should see your EC2 runners listed as "Idle" with labels: `self-hosted, linux, x64, ec2, aws, codebuild`
## Using One Runner for Multiple Repos & Pipelines
### Organization-Level Runners (Recommended)
EC2 runners registered at the organization level can serve all repositories automatically.
*Benefits:*

- Centralized management
- Cost-efficient resource sharing
- Simplified scaling
- Single point of monitoring

*Configuration in CloudFormation:* The template already configures organization-level runners via the UserData script:

```bash
./config.sh --url "https://github.com/${GitHubOrganization}" ...
```

### Multi-Repository Workflow Examples

### Advanced: Runner Groups for Access Control

### Label-Based Runner Selection Strategy
*Create different runner pools with specific labels:*
```bash
# Production runners
RunnerLabels: "self-hosted,linux,ec2,production,high-performance"

# Development runners
RunnerLabels: "self-hosted,linux,ec2,development,general"

# Team-specific runners
RunnerLabels: "self-hosted,linux,ec2,team-platform,specialized"
```
*Use in workflows:*
```yaml
jobs:
  prod-deploy:
    runs-on: [self-hosted, linux, ec2, production]

  dev-test:
    runs-on: [self-hosted, linux, ec2, development]

  platform-build:
    runs-on: [self-hosted, linux, ec2, team-platform]
```

## Monitoring and Maintenance
### Monitor Runner Health
```bash
# Check Auto Scaling Group health
aws autoscaling describe-auto-scaling-groups \
  --auto-scaling-group-names $ASG_NAME \
  --query 'AutoScalingGroups[0].[DesiredCapacity,MinSize,MaxSize,Instances[].[InstanceId,HealthStatus,LifecycleState]]'

# View instance system logs
INSTANCE_ID=$(aws autoscaling describe-auto-scaling-groups \
  --auto-scaling-group-names $ASG_NAME \
  --query 'AutoScalingGroups[0].Instances[0].InstanceId' \
  --output text)

aws ec2 get-console-output --instance-id $INSTANCE_ID

# Check CloudWatch logs
aws logs get-log-events \
  --log-group-name /github-runner/instances \
  --log-stream-name $INSTANCE_ID/runner \
  --limit 50
```
### Connect to Runner Instance (via SSM)
```bash
# List instances
aws autoscaling describe-auto-scaling-groups \
  --auto-scaling-group-names $ASG_NAME \
  --query 'AutoScalingGroups[0].Instances[].[InstanceId,HealthStatus]' \
  --output table

# Connect via Session Manager (no SSH key needed)
aws ssm start-session --target $INSTANCE_ID

# Once connected, check runner status
sudo systemctl status actions.runner.*
sudo journalctl -u actions.runner.* -f
```

### Troubleshooting Common Issues

## Advanced Scaling Configuration
### Lambda-Based Dynamic Scaling
For more sophisticated scaling based on GitHub Actions queue depth:

### Deploy Scaling Lambda

```bash
# Create Lambda function
zip function.zip github-queue-scaler.py

aws lambda create-function \
  --function-name github-runner-scaler \
  --runtime python3.11 \
  --role arn:aws:iam::ACCOUNT_ID:role/lambda-execution-role \
  --handler github-queue-scaler.lambda_handler \
  --zip-file fileb://function.zip \
  --timeout 30 \
  --environment Variables="{ ASG_NAME=$ASG_NAME, GITHUB_ORG=$GITHUB_ORG, GITHUB_TOKEN=$GITHUB_PAT, MAX_RUNNERS=10, MIN_RUNNERS=2 }"

# Create CloudWatch Events rule to trigger every 2 minutes
aws events put-rule \
  --name github-runner-scaling \
  --schedule-expression 'rate(2 minutes)'

aws events put-targets \
  --rule github-runner-scaling \
  --targets "Id"="1","Arn"="arn:aws:lambda:REGION:ACCOUNT:function:github-runner-scaler"
```
## Cost Optimization
### 1. Use Spot Instances
Add to Launch Template in CloudFormation:
```yaml
LaunchTemplateData:
  InstanceMarketOptions:
    MarketType: spot
    SpotOptions:
      MaxPrice: "0.05"  # Set max price
      SpotInstanceType: one-time
```
### 2. Scheduled Scaling
Scale down during off-hours:
```bash
# Scale down at night (9 PM)
aws autoscaling put-scheduled-action \
  --auto-scaling-group-name $ASG_NAME \
  --scheduled-action-name scale-down-night \
  --recurrence "0 21 * * *" \
  --desired-capacity 1

# Scale up in morning (7 AM)
aws autoscaling put-scheduled-action \
  --auto-scaling-group-name $ASG_NAME \
  --scheduled-action-name scale-up-morning \
  --recurrence "0 7 * * MON-FRI" \
  --desired-capacity 3
```
### 3. Instance Type Mix
Use multiple instance types for better availability and cost:
```yaml
MixedInstancesPolicy:
  InstancesDistribution:
    OnDemandBaseCapacity: 1
    OnDemandPercentageAboveBaseCapacity: 25
    SpotAllocationStrategy: price-capacity-optimized
  LaunchTemplate:
    Overrides:
      - InstanceType: t3.medium
      - InstanceType: t3a.medium
      - InstanceType: t2.medium
```
## Security Best Practices
1. *No hardcoded credentials* - Using Secrets Manager for GitHub PAT
2. *IMDSv2 enforced* - Prevents SSRF attacks
3. *Minimal IAM permissions* - Scoped to specific CodeBuild projects
4. *Private subnets* - Runners not directly accessible from internet
5. *SSM for access* - No SSH keys needed
6. *Encrypted secrets* - Secrets Manager encryption at rest
7. *CloudWatch logging* - All runner activity logged
## References
- [GitHub Self-hosted Runners Documentation](https://docs.github.com/en/actions/hosting-your-own-runners/...)
- [GitHub Runner Registration API](https://docs.github.com/en/rest/actions/self-hosted-runners)
- [AWS Auto Scaling Documentation](https://docs.aws.amazon.com/autoscaling/ec2/userguide/what-i...)
- [AWS CodeBuild API Reference](https://docs.aws.amazon.com/codebuild/latest/APIReference/We...)
- [GitHub Actions Runner Releases](https://github.com/actions/runner/releases)
- [AWS Systems Manager Session Manager](https://docs.aws.amazon.com/systems-manager/latest/userguide...)
This solution provides a production-ready, cost-effective EC2-based runner infrastructure with automatic scaling, comprehensive monitoring, and multi-repository support for triggering CodeBuild pipelines.
Googler, opinions are my own.
If agentic coding worked as well as people claim on large codebases, I would be seeing a massive shift at my job... I'm really not seeing it.
We have access to pretty much all the latest and greatest internally at no cost and it still seems the majority of code is still written and reviewed by people.
AI assisted coding has been a huge help to everyone but straight up agentic coding seems like it does not scale to these very large codebases. You need to keep it on the rails ALL THE TIME.
Do not blame the tools? Given a clear description (overall design, various methods to add, inputs, outputs), Google Antigravity often writes better zero-shot code than an average human engineer: consistent checks for special cases, local optimizations, extensive comments, thorough test coverage. Now in terms of reviews, the real focus is reviewing your own code no matter which tools you used to write it, vi or an agentic AI IDE, not someone else reviewing your code. The latter is a safety/mentorship tool in the best circumstances, and all too often just an excuse for senior architects to assert their dominance and justify their own existence at the expense of causing unnecessary stress and delaying getting things shipped.
Now in terms of using AI, the key is to view yourself as a technical lead, not a people manager. You don't stop coding completely or treat underlying frameworks as a black box, you just do less of it. But at some point fixing a bug yourself is faster than writing a page of text explaining exactly how you want it fixed. Although when you don't know the programming language, giving pseudocode or sample code in another language can be super handy.
Claude's response to a query asking for options for a GitHub runner was essentially the same "EC2-Based GitHub Actions Self-Hosted Runners" write-up quoted in full above, and it also generated three more files to review, test, and get working.
It works in the sense that there are lots of professional (as in they earn money from software engineering) developers out there who do work of exactly the same quality. I would even bet they are the majority (or at least were prior to late 2024).
I’ve had a major conversion on this topic within the last month.
I’m not exactly a typical SWE at the moment. The role I’m in is a lot of meeting with customers, understand their issues, and whip up demos to show how they might apply my company’s products to their problem.
So I’m not writing production code, but I am writing code that I want to be maintainable and changeable, so I can stash a demo for a year and then spin it up quickly when someone wants to see it, or update/adapt it as products/problems change. Most of my career has been spent writing aircraft SW, so I am heavily biased toward code quality and assurance. The demos I am building are not trivial or common in the training data. They’re highly domain specific and pretty niche, performance is very important, and they usually span low-level systems code all the way up to a decent looking GUI. As a made-up example, it wouldn’t be unusual for me to have a project to write a medical imaging pipeline from scratch that employs modern techniques from recent papers, etc.
Up until very recently, I only thought coding agents were useful for basic crud apps, etc. I said the same things a lot of people on this thread are saying, eg. people on twitter are all hype, their experience doesn’t match mine, they must be working on easy problems or be really bad at writing code
I recently decided to give into the hype and really try to use the tooling and… it’s kind of blown my mind.
Cursor + Opus 4.5 high are my main tools, and their ability to one-shot major changes across many files and hundreds of lines of code, encompassing low-level systems stuff, GPU-accelerated stuff, networking, etc., is what has won me over.
It’s seriously altering my perception of what software engineering is and will be and frankly I’m still kind of recoiling from it.
Don’t get me wrong, I don’t believe it fundamentally eliminates the need for SWEs. It still takes a lot of work on my part to come up with a spec (though I do have it help me with that part), correct things that I don’t like in its planning, or catch it doing the wrong thing in real time and redirect it. And it will make strange choices that I need to correct on the back end sometimes. But it has legitimately allowed me to build 10x faster than I probably could on my own.
Maybe the most important thing about it is what it enables you to build that would not have been worth the trouble before: stuff like wrapping tools in really nice flexible TUIs, creating visualizations/dashboards/benchmarks, slightly altering how an application works to cover a use case you hadn’t thought of before, wrapping an interface so it’s easy to swap libs/APIs later, etc.
If you are still skeptical, I would highly encourage you to immerse yourself in the SOTA tools right now and just give in to the hype for a bit, because I do think we’re rapidly going to reach a point where if you aren’t using these tools you won’t be employable.
Claude Cowork was apparently completely written by Claude Code. So this appears to yet again be a skill issue.
I’m honestly kind of amazed that more people aren’t seeing the value, because my experience has been almost the opposite of what you’re describing.
I agree with a lot of your instincts. Shipping unreviewed code is wrong. “Validate behavior not architecture” as a blanket rule is reckless. Tests passing is not the same thing as having a system you can reason about six months later. On that we’re aligned.
Where I diverge is the conclusion that agentic coding doesn’t produce net-positive results. For me it very clearly does, but perhaps it's very situation or condition dependent?
For me, I don’t treat the agent as a junior engineer I can hand work to and walk away from. I treat it more like an extremely fast, extremely literal staff member who will happily do exactly what you asked, including the wrong thing, unless you actively steer it. I sit there and watch it work (usually have 2-3 agents working at the same time, ideally on different codebases but sometimes they overlap). I interrupt it. I redirect it. I tell it when it is about to do something dumb. I almost never write code anymore, but I am constantly making architectural calls.
Second, tooling and context quality matter enormously. I’m using Claude Code. The MCP tools I have installed make a huge difference: laravel-boost, context7, and figma (which in particular feels borderline magical at converting designs into code!).
I often have to tell the agent to visit GitHub READMEs and official docs instead of letting it hallucinate “best practices”; the agent will oftentimes guess and get stuck, and if it’s doing that, you’ve already lost.
Third, I wonder if perhaps starting from scratch is actually harder than migrating something real. Right now I’m migrating a backend from Java to Laravel and rebuilding native apps into KMP and Compose Multiplatform. So the domain and data are real, and I can validate against a previous (if buggy) implementation. In that environment, the agent is phenomenal. It understands patterns, ports logic faithfully, flags inconsistencies, and does a frankly ridiculous amount of correct work per hour.
Does it make mistakes? Of course. But they’re few and far between, and they’re usually obvious at the architectural or semantic level, not subtle landmines buried in the code. When something is wrong, it’s wrong in a way that’s easy to spot if you’re paying attention.
That’s the part I think gets missed. If you ask the agent to design, implement, review, and validate itself, then yes, you’re going to get spaghetti with a test suite that lies to you. If instead you keep architecture and taste firmly in human hands and use the agent as an execution engine, the leverage is enormous.
My strong suspicion is that a lot of the negative experiences come from a mismatch between expectations and operating model. If you expect the agent to be autonomous, it will disappoint you. If you expect it to be an amplifier for someone who already knows what “good” looks like, it’s transformative.
So while I guess plenty of hype exists, for me at least, the hype is justified. I’m shipping way (WAY!) more, with better consistency, and with less cognitive exhaustion than ever before in my 20+ years of doing dev work.
1. give it toy assignment which is a simplified subcomponent of your actual task
2. wait
3. post on LinkedIn about how amazing AI now is
4. throw away the slop and write proper code
5. go home, to repeat this again tomorrow
lol no. it is all fomo clown marketing. they make outlandish claims and all fall short of producing anything more than noise.
Since we are on this topic, how would I make an agent that does this job:
I am writing an automation software that interfaces with a legacy Windows CAD program. Depending on the automation, I just need a picture of the part. Sometimes I need part thickness. Sometimes I need to delete parts. Etc. It's very much interacting with the CAD system and checking the CAD file or output for desired results.
I was considering something that would take screenshots and send them back for checks. Not sure what platforms can do this. I am stumped as to how Visual Studio works with this; there are a bunch of pieces like servers, agents, etc.
Even a how-to link would work for me. I imagine this would be extremely custom.
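One low-tech pattern that fits what you describe: script the screenshot capture yourself and hand the image to a vision-capable model as a pass/fail check. A rough sketch, assuming Windows, Pillow for the capture, and an OpenAI-style vision API; the window-cropping logic and the question asked are the parts you would tailor to your CAD program:

```python
# Rough sketch: capture the CAD window and ask a vision model whether
# the expected result is shown. Assumes Windows + Pillow for capture
# and the OpenAI chat completions API; the prompt and any cropping
# logic are placeholders to adapt to the CAD program.
import base64
from io import BytesIO

from PIL import ImageGrab          # pip install pillow
from openai import OpenAI          # pip install openai

def check_screenshot(question: str) -> str:
    shot = ImageGrab.grab()        # full screen; crop to the CAD window as needed
    buf = BytesIO()
    shot.save(buf, format="PNG")
    image_b64 = base64.b64encode(buf.getvalue()).decode()

    client = OpenAI()
    resp = client.chat.completions.create(
        model="gpt-4o",            # any vision-capable model
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": question},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{image_b64}"}},
            ],
        }],
    )
    return resp.choices[0].message.content

print(check_screenshot(
    "Does this view show a single part with visible thickness dimensions? "
    "Answer PASS or FAIL with a one-line reason."
))
```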