
garritfra today at 2:07 PM

Thanks!

Honestly, the current implementations are pretty naive — they pass the tests and feel snappy on the small sheets I work with, but they'd buckle pretty quickly under real load. Most of what you're asking about is already on the tracker; I opened a batch of issues citing your comment as the prompt.

Recalculation. Right now it's a full recalc on every edit: recalculate collects all formula cells, computes in-degrees across the whole formula set, topo-sorts, and evaluates top to bottom. The dirty flag gets propagated by mark_dirty but isn't actually used to prune work. It's also re-parsing every formula from its raw string on every pass. Two issues cover this: #8 introduces a batch boundary so paste/fill/CSV import trigger one recalc instead of N, and #7 adds criterion benches so we can actually tell whether the parser, the BFS, or the topo sort is the hotspot before optimizing. AST caching on Cell is the obvious next step once #7 confirms parsing dominates.

https://github.com/garritfra/cell/issues/8 https://github.com/garritfra/cell/issues/7
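For context, the dirty-set recalc I mean is roughly: BFS from the edited cells over the dependents graph to find what's actually stale, then Kahn's topo sort restricted to that subgraph instead of every formula cell. A minimal sketch — `CellId` and the `dependents` map are hypothetical stand-ins for the real `CellPos`-keyed structures, not the current code:

```rust
use std::collections::{HashMap, HashSet, VecDeque};

// Hypothetical cell id; the real code keys on CellPos.
type CellId = u32;

/// Collect the transitive dependents of the edited cells, then
/// topo-sort only that dirty subgraph instead of the whole sheet.
fn dirty_recalc_order(
    edited: &[CellId],
    dependents: &HashMap<CellId, Vec<CellId>>, // cell -> cells that read it
) -> Vec<CellId> {
    // BFS: everything reachable from the edit set is dirty.
    let mut dirty: HashSet<CellId> = edited.iter().copied().collect();
    let mut queue: VecDeque<CellId> = edited.iter().copied().collect();
    while let Some(c) = queue.pop_front() {
        for &d in dependents.get(&c).into_iter().flatten() {
            if dirty.insert(d) {
                queue.push_back(d);
            }
        }
    }
    // In-degrees restricted to the dirty set.
    let mut indeg: HashMap<CellId, usize> = dirty.iter().map(|&c| (c, 0)).collect();
    for &c in &dirty {
        for &d in dependents.get(&c).into_iter().flatten() {
            if dirty.contains(&d) {
                *indeg.get_mut(&d).unwrap() += 1;
            }
        }
    }
    // Kahn's algorithm over the dirty subgraph only.
    let mut ready: VecDeque<CellId> =
        indeg.iter().filter(|&(_, &n)| n == 0).map(|(&c, _)| c).collect();
    let mut order = Vec::with_capacity(dirty.len());
    while let Some(c) = ready.pop_front() {
        order.push(c);
        for &d in dependents.get(&c).into_iter().flatten() {
            if let Some(n) = indeg.get_mut(&d) {
                *n -= 1;
                if *n == 0 {
                    ready.push_back(d);
                }
            }
        }
    }
    order
}
```

The point of #7 is to confirm that this subgraph is usually tiny relative to the sheet before committing to the extra bookkeeping.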

Dependency tracking. The bigger smell is in extract_deps — a range like SUM(A1:A1000) literally enumerates 1000 cell positions into the dep graph, with a HashSet per cell on each side. Fine at hundreds of cells, a disaster at hundreds of thousands. Range expansion is one of the bench cases in #7; the proper fix (interval-keyed deps so ranges stay first-class instead of fanning out) doesn't have its own issue yet — I should open one, since #7 only measures the problem.
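To make the first-class-range idea concrete: a dep becomes one edge carrying its bounds, so SUM(A1:A1000) is a single entry instead of 1000. This is a hypothetical shape, not what extract_deps produces today:

```rust
// Hypothetical dep representation; the current extract_deps fans
// ranges out into individual cell positions instead.
#[derive(Debug)]
enum Dep {
    Cell { col: u32, row: u32 },
    // One edge for the whole range, not rows * cols edges.
    Range { col0: u32, row0: u32, col1: u32, row1: u32 },
}

impl Dep {
    /// Does an edit at (col, row) invalidate this dependency?
    fn covers(&self, col: u32, row: u32) -> bool {
        match *self {
            Dep::Cell { col: c, row: r } => c == col && r == row,
            Dep::Range { col0, row0, col1, row1 } => {
                (col0..=col1).contains(&col) && (row0..=row1).contains(&row)
            }
        }
    }
}
```

The trade-off is that invalidation becomes a containment query instead of a HashSet hit, which is why the reverse index needs to be interval-aware too.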

Undo/redo. This is the worst offender right now. UndoEntry only had a single-cell variant until very recently; #12 added MultiCellEdit, but #13 tracks two destructive paths I missed — visual-mode d and p/P paste — that still don't push undo entries at all. #9 is the broader coalescing story (one dd = one undo, CSV import = one undo, etc.), tied to the batch mechanism from #8 so a single transaction produces a single undo entry. sort_by_column is also non-undoable today and belongs in that bucket.

https://github.com/garritfra/cell/issues/13 https://github.com/garritfra/cell/issues/9
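The shape I have in mind for #9 is the batch boundary from #8 doubling as the undo boundary. MultiCellEdit is the variant #12 added; everything else below is a hypothetical sketch, not the current UndoEntry code:

```rust
// Sketch of undo coalescing via an explicit batch boundary.
struct CellEdit {
    pos: (u32, u32),
    old: Option<String>,
    new: Option<String>,
}

enum UndoEntry {
    SingleCell(CellEdit),
    MultiCellEdit(Vec<CellEdit>), // variant added in #12
}

#[derive(Default)]
struct UndoStack {
    entries: Vec<UndoEntry>,
    open_batch: Option<Vec<CellEdit>>,
}

impl UndoStack {
    fn begin_batch(&mut self) {
        self.open_batch = Some(Vec::new());
    }
    fn record(&mut self, edit: CellEdit) {
        match self.open_batch.as_mut() {
            Some(batch) => batch.push(edit), // coalesce into the open batch
            None => self.entries.push(UndoEntry::SingleCell(edit)),
        }
    }
    fn end_batch(&mut self) {
        if let Some(batch) = self.open_batch.take() {
            if !batch.is_empty() {
                // One dd / paste / CSV import => one undo entry.
                self.entries.push(UndoEntry::MultiCellEdit(batch));
            }
        }
    }
}
```

With that in place, the #13 paths (visual-mode d, p/P) and sort_by_column just become begin_batch/record/end_batch callers rather than special cases.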

Larger CSVs. Storage is HashMap<CellPos, Cell> — fine for sparse sheets, but it carries per-cell overhead; for very wide imports a column-oriented or arena layout would pay off. I haven't profiled it though, so this is speculative; the dependency-graph blowup will hurt before raw storage does. #7 includes a 100k-row CSV load case to put numbers on it.
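For illustration only (this is the same unprofiled speculation), a column-oriented sparse layout might be one sorted row map per column, so a wide import appends per-column and whole-column scans stay contiguous. Names and the String cell payload are placeholders, not the real Cell type:

```rust
use std::collections::BTreeMap;

// Speculative column-oriented layout: one sorted row map per column.
// Whole-column scans (e.g. SUM over a column) walk a single BTreeMap
// in row order instead of hashing every CellPos.
#[derive(Default)]
struct Sheet {
    columns: Vec<BTreeMap<u32, String>>, // col index -> (row -> raw cell)
}

impl Sheet {
    fn set(&mut self, col: usize, row: u32, raw: String) {
        if self.columns.len() <= col {
            self.columns.resize_with(col + 1, BTreeMap::new);
        }
        self.columns[col].insert(row, raw);
    }

    fn get(&self, col: usize, row: u32) -> Option<&String> {
        self.columns.get(col)?.get(&row)
    }
}
```

Whether this beats the HashMap in practice is exactly what the #7 CSV bench should answer before any refactor.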

And #10 is the meta-issue to lift all of this out of source comments and into actual architecture docs, which I probably should have done before posting.

https://github.com/garritfra/cell/issues/10

So: nothing here scales today, but the architecture splits cleanly enough that none of it needs a rewrite — AST caching, dirty-set recalc, range-aware deps, and grouped undo are the four threads, and most have issues attached.


Replies

s_suiindik today at 3:28 PM

Range as first-class is the right priority. Pattern that works: keep ranges as single AST nodes (one dep edge per range, not N), then use interval trees on the reverse side so a cell change at C5 becomes "find intervals covering (C, 5)" instead of scanning all formulas. Pairs well with column-oriented storage if you go there
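A crude version of that reverse-side lookup, using plain per-column interval lists rather than a real interval tree (a tree would make the query O(log n + k) instead of a linear scan per column; all names here are hypothetical):

```rust
use std::collections::HashMap;

type FormulaId = u32;

// Reverse deps keyed by column: each entry is one interval edge,
// so a 1000-row range is one tuple, not 1000 HashSet entries.
#[derive(Default)]
struct ReverseDeps {
    by_col: HashMap<u32, Vec<(u32, u32, FormulaId)>>, // (row_start, row_end, reader)
}

impl ReverseDeps {
    fn add_range(&mut self, col: u32, rows: (u32, u32), f: FormulaId) {
        self.by_col.entry(col).or_default().push((rows.0, rows.1, f));
    }

    /// "Which formulas read (col, row)?" -- containment query,
    /// not a scan over every formula in the sheet.
    fn readers_of(&self, col: u32, row: u32) -> Vec<FormulaId> {
        self.by_col
            .get(&col)
            .map(|v| {
                v.iter()
                    .filter(|&&(lo, hi, _)| lo <= row && row <= hi)
                    .map(|&(_, _, f)| f)
                    .collect()
            })
            .unwrap_or_default()
    }
}
```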

On the AST caching point, worth caching by structural hash of the parsed expression, not the source string. Copy-paste with relative references produces different strings but identical AST shape, which hits a lot in financial-model-style workbooks where parallel columns share structure
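Sketching that idea: if relative references are stored as offsets from the formula's own cell, structurally identical formulas hash to the same key regardless of source string. Toy AST below, not the project's parser types:

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

// Toy expression AST. The key move is RelRef: a reference stored as
// an offset from the formula's own cell, so B2's "=A1+1" and C2's
// "=B1+1" build the exact same tree.
#[derive(Hash)]
enum Expr {
    Num(i64),
    RelRef { dcol: i32, drow: i32 },
    Add(Box<Expr>, Box<Expr>),
}

/// Cache key: hash of the structure, not of the source string.
fn structural_hash(e: &Expr) -> u64 {
    let mut h = DefaultHasher::new();
    e.hash(&mut h);
    h.finish()
}
```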

Also worth a look: the "calculation chain" docs in Microsoft's OOXML SpreadsheetML spec describe how they serialize the dep order in xlsx files. Different problem (persistence vs runtime) but the data model is informative for what level of granularity ends up being practical