logoalt Hacker News

100M-Row Challenge with PHP

157 pointsby brentroosetoday at 10:24 AM77 commentsview on HN

Comments

brentroosetoday at 10:24 AM

A month ago, I went on a performance quest trying to optimize a PHP script that took 5 days to run. Together with the help of many talented developers, I eventually got it to run in under 30 seconds. This optimization process with so much fun, and so many people pitched in with their ideas; so I eventually decided I wanted to do something more.

That's why I built a performance challenge for the PHP community

The goal of this challenge is to parse 100 million rows of data with PHP, as efficiently as possible. The challenge will run for about two weeks, and at the end there are some prizes for the best entries (amongst the prize is the very sought-after PhpStorm Elephpant, of which we only have a handful left).

I hope people will have fun with it :)

show 6 replies
chrismarlow9today at 7:45 PM

I don't have time to put together a submission but I'm willing to bet you can use this:

https://github.com/kjdev/php-ext-jq

And replicate this command:

jq -R ' [inputs | split(",") | {url: .[0], date: .[1] | split("T")[0]}] | group_by(.url) | map({ (.[0].url): ( map(.date) | group_by(.) | map({(.[0]): length}) | add ) }) | add ' < test-data.csv

And it will be faster than anything you can do in native php

Edit: I'm assuming none of the urls have a comma with this but it's more about offloading it through an extension, even if you custom built it

show 1 reply
Xeoncrosstoday at 3:00 PM

This is why I jumped from PHP to Go, then why I jumped from Go to Rust.

Go is the most battery-included language I've ever used. Instant compile times means I can run tests bound to ctrl/cmd+s every time I save the file. It's more performant (way less memory, similar CPU time) than C# or Java (and certainly all the scripting languages) and contains a massive stdlib for anything you could want to do. It's what scripting languages should have been. Anyone can read it just like Python.

Rust takes the last 20% I couldn't get in a GC language and removes it. Sure, it's syntax doesn't make sense to an outsider and you end up with 3rd party packages for a lot of things, but can't beat it's performance and safety. Removes a whole lot of tests as those situations just aren't possible.

If Rust scares you use Go. If Go scares you use Rust.

show 3 replies
pxtailtoday at 12:47 PM

Side note - I wasn't aware that there is active collectors scene for Elephpants, awesome!

https://elephpant.me/

show 2 replies
semiquavertoday at 3:10 PM

Are they just confused about what characters require escaping in JSON strings or is PHP weirder than I remember?

    {
        "\/blog\/11-million-rows-in-seconds": {
            "2025-01-24": 1,
            "2026-01-24": 2
        },
        "\/blog\/php-enums": {
            "2024-01-24": 1
        }
    }
show 4 replies
matei88today at 7:41 PM

It reminds me of a good read about optimizing PHP for 1 billion rows challenge. TLDR; at some point you hit a limit in PHP’s stream layer

https://dev.to/realflowcontrol/processing-one-billion-rows-i...

tveitatoday at 1:36 PM

> Also, the generator will use a seeded randomizer so that, for local development, you work on the same dataset as others

Except that the generator script generates dates relative to time() ?

Retr0idtoday at 1:19 PM

How large is a sample 100M row file in bytes? (I tried to run the generator locally but my php is not bleeding-edge enough)

show 1 reply
csjhtoday at 4:42 PM

Obligatory DuckDB solution:

> duckdb -s "COPY (SELECT url[20:] as url, date, count(*) as c FROM read_csv('data.csv', columns = { 'url': 'VARCHAR', 'date': 'DATE' }) GROUP BY url, date) TO 'output.json' (ARRAY)"

Takes about 8 seconds on my M1 Macbook. JSON not in the right format, but that wouldn't dominate the execution time.

show 1 reply
poizan42today at 3:04 PM

> The output should be encoded as a pretty JSON string.

...

> Your parser should store the following output in $outputPath as a JSON file:

    {
        "\/blog\/11-million-rows-in-seconds": {
            "2025-01-24": 1,
            "2026-01-24": 2
        },
        "\/blog\/php-enums": {
            "2024-01-24": 1
        }
    }
They don't define what exactly "pretty" means, but superflous escapes are not very pretty in my opinion.
show 1 reply
spiderfarmertoday at 12:36 PM

Awesome. I’ll be following this. I’ll probably learn a ton.

wangzhongwangtoday at 1:33 PM

[dead]

tomaytotomatotoday at 5:15 PM

Tempted to submit a Java app wrapped in PHP exec() :D