Hacker News

When imperfect systems are good: Bluesky's lossy timelines

745 points · by cyndunlop · last Wednesday at 5:48 PM · 290 comments

Comments

pornel · last Wednesday at 10:25 PM

I wonder why timelines aren't implemented as a hybrid gather-scatter choosing strategy depending on account popularity (a combination of fan-out to followers and a lazy fetch of popular followed accounts when follower's timeline is served).

When you have a celebrity account, instead of fanning out every message to millions of followers' timelines, it would be cheaper to do nothing when the celebrity posts, and later, when serving each follower's timeline, fetch the celebrity's posts and merge them into the timeline. When millions of followers do that, it will be a cheap read-only fetch from a hot cache.
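The hybrid gather-scatter strategy could be sketched roughly like this (all names, thresholds, and the in-memory data model are hypothetical stand-ins, not Bluesky's actual ScyllaDB-backed implementation):

```python
# Hybrid fan-out sketch: eager scatter for ordinary accounts, lazy gather
# for celebrities at read time. Threshold and structures are invented.
CELEBRITY_THRESHOLD = 100_000  # assumed cutoff, not a real Bluesky value

class Timelines:
    def __init__(self):
        self.timelines = {}     # user_id -> list of (ts, post_id)
        self.author_posts = {}  # author -> list of (ts, post_id), "hot cache"
        self.followers = {}     # author -> set of follower ids
        self.following = {}     # user_id -> set of followed authors

    def publish(self, author, ts, post_id):
        self.author_posts.setdefault(author, []).append((ts, post_id))
        if len(self.followers.get(author, ())) < CELEBRITY_THRESHOLD:
            # Normal account: eager fan-out to every follower's timeline
            for f in self.followers.get(author, ()):
                self.timelines.setdefault(f, []).append((ts, post_id))
        # Celebrity account: do nothing now; followers merge lazily on read

    def read_timeline(self, user, limit=50):
        entries = list(self.timelines.get(user, ()))
        for a in self.following.get(user, ()):
            if len(self.followers.get(a, ())) >= CELEBRITY_THRESHOLD:
                entries.extend(self.author_posts.get(a, ()))  # lazy gather
        entries.sort(reverse=True)  # newest first by timestamp
        return entries[:limit]
```

The write path stays cheap for celebrities (one append to their own post list), while each follower's read pays a small merge cost against a cache that is hot precisely because millions of readers hit the same author list.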

dsauerbrun · yesterday at 7:06 AM

I'm a bit confused.

The lossy timeline solution basically means you skip updating the feed for some people who are above the number of reasonable followers. I get that

Seeing them get 96% improvements is insane. Does that mean they have a ton of users following an unreasonable number of people, or do they just have a very low threshold for a reasonable number of follows? I doubt it's the latter, since that would mean a lot of people would be missing updates.

How is it possible to get such massive improvements when you're only skipping a presumably small % of people per new post?

EDIT: nvm, I thought it through again. The issue is that a single user with millions of follows will constantly be written to, which slows down the fanout service when a celebrity makes a post, since you're going through many DB pages.

ChuckMcM · last Wednesday at 7:11 PM

As a systems enthusiast I enjoy articles like this. It is really easy to get into the mindset of "this must be perfect".

In the Blekko search engine back end we built an index that was 'eventually consistent' which allowed updates to the index to be propagated to the user facing index more quickly, at the expense that two users doing the exact same query would get slightly different results. If they kept doing those same queries they would eventually get the exact same results.

Systems like this bring in a lot of control systems theory because they have the potential to oscillate if there is positive feedback (and in search engines that positive feedback comes from the ranker which is looking at which link you clicked and giving it a higher weight) and it is important that they not go crazy. Some of the most interesting, and most subtle, algorithm work was done keeping that system "critically damped" so that it would converge quickly.

Reading this description of how users' timelines are sharded, with the same sorts of feedback loops (in this case 'likes' or 'reposts'), this sounds like a pretty interesting problem space to explore.
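The "critically damped" idea above can be illustrated with a toy update rule (this is not Blekko's actual algorithm; the gain and damping constants are invented for the sketch):

```python
# Toy damped feedback update: nudge a ranking weight toward the observed
# click-through rate by only a fraction of the remaining error.
def update_weight(weight, clicks, impressions, gain=0.5, damping=0.8):
    """One damped update of a ranking weight toward observed click-through."""
    observed = clicks / max(impressions, 1)
    error = observed - weight
    # gain * damping < 1 keeps each step a fraction of the remaining error,
    # so the weight converges to the signal instead of overshooting it
    return weight + gain * damping * error
```

Iterating this with a steady signal converges geometrically to the observed rate; pushing `gain * damping` past 1 would overshoot on every step, which is the oscillation the comment warns about.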

ultra-boss · yesterday at 1:57 PM

Love reading these sorts of "technical problem + solution" pieces. The world does not need more content, in general, but it does need more of this kind of quality information sharing.

rakoo · last Wednesday at 8:04 PM

Ok, I'm curious: since this strategy sacrifices consistency, does anyone have thoughts about something that is not full fan-out on reads or on writes?

Let's imagine something like this: instead of writing to every user's timeline, a post is written once for each shard containing at least one follower. This caps the fan-out at write time to hundreds of shards. At read time, getting the content for a given user reads that hot slice and filters to actual followers. It definitely has more load, but

- the read is still colocated inside the shard, so latency remains low

- for mega-followers the page will not see older entries anyway

There are of course other considerations, but I'm curious about what the load for something like that would look like (and I don't have the data nor infrastructure to test it)
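A minimal sketch of the write-once-per-shard idea, with an assumed modulo placement function and in-memory dicts standing in for real storage:

```python
# Per-shard fan-out sketch: a post is written once per shard holding at
# least one follower; reads filter the shard feed to actual followers.
NUM_SHARDS = 256  # invented shard count

def shard_of(user_id: int) -> int:
    return user_id % NUM_SHARDS  # stand-in for the real placement function

class ShardFanout:
    def __init__(self):
        self.shard_feeds = {}  # shard -> list of (ts, author, post_id)
        self.followers = {}    # author -> set of follower ids

    def publish(self, author, ts, post_id):
        # At most NUM_SHARDS writes, no matter how many followers
        for shard in {shard_of(f) for f in self.followers.get(author, ())}:
            self.shard_feeds.setdefault(shard, []).append((ts, author, post_id))

    def read_timeline(self, user, limit=50):
        feed = self.shard_feeds.get(shard_of(user), ())
        # Colocated read: scan the user's shard, keep only followed authors
        return sorted(
            (e for e in feed if user in self.followers.get(e[1], ())),
            reverse=True,
        )[:limit]
```

The trade is visible in the read path: every follower on a shard shares one written row, but each read now scans and filters entries from authors the reader may not follow.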

ramblejam · yesterday at 1:43 PM

Nice problem to have, though. Over on Nostr they're finding it a real struggle to get to the point where you're confident you won't miss replies to your own notes, let alone replies from other people in threads you haven't interacted with.

The current solution is for everyone to use the same few relays, which is basically a polite nod to Bluesky's architecture. The long-term solution is—well, it involves a lot of relay hint dropping and a reliance on Japanese levels of acuity when it comes to picking up on hints (among clients). But (a) it's proving extremely slow going and (b) it only aims to mitigate the "global as relates to me" problem.

spoaceman7777 · yesterday at 12:36 AM

Hmm. Twitter/X appears to do this at quite a low number, as the "Following" tab is incredibly lossy (some users are permanently missing) at only 1,200 followed people.

It's insanely frustrating.

Hopefully you're adjusting the lossy-ness weighting and cut-off by whether a user is active at any particular time? Because, otherwise, applying this rule, if the cap is set too low, is a very bad UX in my experience x_x

rconti · last Wednesday at 8:24 PM

> Additionally, beyond this point, it is reasonable for us to not necessarily have a perfect chronology of everything posted by the many thousands of users they follow, but provide enough content that the Timeline always has something new.

While I'm fine with the solution, the wording of this sentence led me to believe that the solution was going to be imperfect chronology, not dropped posts in your feed.

jadbox · yesterday at 12:31 AM

So, let's say I follow 4k people in the example and have a 50% drop rate. It seems a bit weird that if all (4k - 1) accounts I follow end up posting nothing in a day, I STILL have a 50% chance of not seeing the 1 account that does post that day. It seems to me that the algorithm should consider my feed's age (or the post freshness of the accounts I follow). Am I overthinking?
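The drop rule as described in the comment looks roughly like this (a reasonable limit of 2,000 is assumed here only to match the 4k → 50% example; it is not Bluesky's published number):

```python
import random

REASONABLE_LIMIT = 2_000  # assumed value matching the comment's example

def should_write(num_follows: int, rng=random) -> bool:
    """Decide whether a new post is written to this follower's timeline."""
    if num_follows <= REASONABLE_LIMIT:
        return True
    # Keep probability shrinks as the follow count grows past the limit;
    # at 4,000 follows each post has roughly a 50% chance of landing
    return rng.random() < REASONABLE_LIMIT / num_follows
```

Because the coin is flipped per post, a user following 4k near-silent accounts still loses half of the rare posts that do arrive, which is exactly the freshness objection raised above.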

cavisne · last Wednesday at 10:18 PM

AWS has a cool general approach to this problem (one badly behaving user affecting others on their shard):

https://aws.amazon.com/builders-library/workload-isolation-u...

The basic idea is to assign each user to multiple shards, decreasing the chances of another user sharing all of their shards with the badly behaving user.

Fixing this issue as described in the article makes sense, but if they had done shuffle sharding in the first place, it would cover any new issues without affecting many other users.
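A minimal shuffle-sharding sketch in the spirit of the AWS article (shard counts and the hashing scheme are invented for illustration):

```python
import hashlib
from itertools import combinations

NUM_SHARDS = 16        # toy numbers; real deployments are larger
SHARDS_PER_USER = 2

# Precompute all 2-combinations of shards: C(16, 2) = 120 distinct pairs
_COMBOS = list(combinations(range(NUM_SHARDS), SHARDS_PER_USER))

def user_shards(user_id: str) -> tuple:
    """Deterministically assign a user to a small combination of shards."""
    digest = hashlib.sha256(user_id.encode()).digest()
    return _COMBOS[int.from_bytes(digest[:8], "big") % len(_COMBOS)]
```

A badly behaving user degrades only their own shard pair; another user is fully affected only if assigned the exact same pair, roughly a 1-in-120 chance with these toy numbers, and far rarer with realistic shard counts.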

knallfrosch · last Wednesday at 6:43 PM

Anyone following hundreds of thousands of users is obviously a bot account scraping content. I'd ban them and call it a day.

However, I do love reading about the technical challenge. I think Twitter has a special architecture for celebrities with millions of followers. Given that Bluesky is a quasi-clone, I wonder why they did not follow in those footsteps.

sphars · last Wednesday at 7:20 PM

When I go directly to a user's profile and see all their posts, sometimes one of their posts isn't in my timeline where it should be. I follow less than 100 users on Bluesky, but I guess this explains why I occasionally don't see a user's post in my timeline.

Lossy indeed.

artee_49 · last Wednesday at 8:51 PM

I am a bit perplexed, though, as to why they have implemented fan-out in a way where each "page" blocks the fetching of further pages; they would not have been affected by the high tail latencies if they had not done this:

"In the case of timelines, each “page” of followers is 10,000 users large and each “page” must be fanned out before we fetch the next page. This means that our slowest writes will hold up the fetching and Fanout of the next page."

Basically means that they block on each page, process all the items on the page, and then move on to the next page. Why wouldn't you rather decouple page fetcher and the processing of the pages?

A page fetching activity should be able to continuously keep fetching further set of followers one after another and should not wait for each of the items in the page to be updated to continue.

Something that comes to mind would be to have a fetcher component that fetches pages, stores each page in S3 and publishes the metadata (content) and the S3 location to a queue (SQS) that can be consumed by timeline publishers which can scale independently based on load. You can control the concurrency in this system much better, and you could also partition based on the shards with another system like Kafka by utilizing the shards as keys in the queue to even "slow down" the work without having to effectively drop tweets from timelines (timelines are eventually consistent regardless).

I feel like I'm missing something and there's a valid reason to do it this way.
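One way to sketch that decoupling, using an in-process bounded queue and threads as stand-ins for the S3/SQS components the comment proposes:

```python
import queue
import threading

# Decoupled fan-out sketch: one fetcher streams follower pages into a
# bounded queue while a pool of writers drains it, so one slow write no
# longer stalls fetching the next page. All names here are invented.
def fan_out(fetch_pages, write_timeline, num_writers=4):
    pages = queue.Queue(maxsize=8)  # backpressure if writers fall behind

    def fetcher():
        for page in fetch_pages():   # e.g. 10,000 followers per page
            pages.put(page)
        for _ in range(num_writers):
            pages.put(None)          # one poison pill per writer

    def writer():
        while (page := pages.get()) is not None:
            for follower in page:
                write_timeline(follower)

    threads = [threading.Thread(target=fetcher)]
    threads += [threading.Thread(target=writer) for _ in range(num_writers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
```

The bounded queue gives the same back-pressure knob as a partitioned broker: the fetcher blocks only when all writers are busy, not whenever any single write is slow.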

mpweiher · yesterday at 6:57 PM

On a related note, I am pretty confident that one of the main reasons the WWW succeeded where previous attempts failed was that it very specifically allowed 404s.

NoGravitas · last Wednesday at 10:02 PM

The funny thing is that all of the centralization in Bluesky is defended as being necessary to provide things like global search and all replies in a thread, things that Mastodon simply punts on in the name of decentralization. But then ultimately, Bluesky has to relax those goals after all.

arcastroe · last Wednesday at 10:03 PM

I found it odd to base the loss factor on the number of people you follow, rather than a truer indication of timeline-update frequency. What if I follow 4k accounts, but each of those accounts only posts once a decade? My timeline would become unnecessarily lossy.

skybrian · last Wednesday at 7:52 PM

This design makes sense if you didn’t previously have any limit on the number of people an account could follow. But why not have a limit?

nasso_dev · last Wednesday at 6:23 PM

Interesting! I wonder what value they chose for the `reasonable_limit`.

inportb · yesterday at 12:20 AM

An interesting solution to a challenging problem. Thank you for sharing it.

I must admit, I had some trouble following the author's transition from "celebrity" with many followers to "bot" with many follows. While I assume the work done for a celebrity to scatter a bunch of posts would be symmetric to the work done for a commensurate bot to gather a bunch of posts, I had the impression that the author was introducing an entirely different concept in "Lossy Timelines."

thmrtz · yesterday at 7:18 AM

That’s quite interesting and a challenge I have not thought of. I understand the need for a solution and I believe this works reasonably well, but I am wondering what is happening to users that follow a lot of accounts with below-average activity. This may naturally happen on new social media platforms with people trying out the service and possibly abandoning it.

The "reasonable limit" is likely set to account for such an effect, but I am wondering if a per-user limit based on the activity of the accounts one follows would be an improvement on this approach.

KolmogorovComp · yesterday at 3:48 PM

A simpler option is to put a limit on the number of accounts one can follow. Who needs to follow more than 4k accounts if not bots?

fastest963 · yesterday at 12:06 AM

To help avoid the hot shard problem, I wonder how capping follows per "timeline" would perform. Essentially, each user would have a separate timeline per 1,000 followed accounts, and the client would merge them. You could still do the lossy part, if necessary, by only loading a percentage of the actual timelines. That wouldn't help the celebrity problem, but it was already acknowledged earlier that the solution to that is to not fan out celebrity accounts.
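The client-side merge of those per-1,000 sub-timelines could be a straightforward lazy k-way merge (assuming, as a sketch, that each slice is already sorted newest-first by timestamp):

```python
import heapq

def merge_timelines(sub_timelines, limit=50):
    """Lazily merge pre-sorted (timestamp, post) slices, newest first."""
    merged = heapq.merge(*sub_timelines, reverse=True)
    # Pull only the first `limit` entries; untouched tails are never scanned
    return [item for _, item in zip(range(limit), merged)]
```

Because `heapq.merge` is lazy, loading only a subset of the sub-timelines (the lossy variant the comment mentions) just means passing fewer slices in.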

buxidao · yesterday at 1:23 PM

In the fanout design, why not dynamically move on to the next 10,000-user page as soon as all tasks for the current page are either queued or processing? Would that approach improve throughput, or could it introduce issues like resource contention?

robbale · yesterday at 2:51 PM

The use of fan-out to followers plus a lazy fetch of popular followed accounts when a follower's timeline is served is a good implementation in hot-reload scenarios.

flaburgan · yesterday at 7:22 AM

The solution to this problem is known and implemented already: the social web should be distributed across thousands of pods, each containing at most a few thousand users. Diaspora has already worked like this for 15 years. It is technically harder to build initially, but it then divides all the problems (maintenance, moderation, load, censorship, trust in the owner...), which makes the network much more resilient. Bluesky knows that, and they allow other people to host their software, but they are really not pushing for it, and I highly doubt that the experience of a user on a small external pod is the same as on bluesky.com.

udioron · yesterday at 5:57 AM

> some of them will do abnormal things like… well… following hundreds of thousands of other users.

Sounds like Bluesky Pro.

dtonon · yesterday at 9:30 AM

The typical problem of a centralized infrastructure.

Indeed:

> This means each user gets their own Timeline partition, randomly distributed among shards of our horizontally scalable database (ScyllaDB), replicated across multiple shards for high availability

yibg · yesterday at 3:00 AM

I think something like this was an FB engineering interview question (several years ago), just for Instagram feeds.

Artoooooor · last Wednesday at 11:48 PM

Are users informed that they follow too many creators and now they will not see every post on their timelines?

Nemo_bis · yesterday at 6:25 AM

"Lossy timelines" have already been implemented in ActivityPub and Mastodon by design. Will Bluesky ever catch up? It remains to be seen.

andsoitis · yesterday at 4:50 AM

Principle: Progress over perfection.

nightpool · last Wednesday at 6:05 PM

Note that all of this reflects design decisions on Bluesky's closed-source "AppView" server—any federated servers interacting with Bluesky would need to construct their own timelines, and do not get the benefit of the work described here.

trhway · last Wednesday at 7:51 PM

So the system design puts the burden on what seems to be synchronous, not queued, writes to get easy reads. I usually prefer simpler cheaper writes at the cost of more complicated reads as the reads scale and parallelize better.

JadeNB · last Wednesday at 10:53 PM

I understand that it's a different point, but how can someone write a whole essay called "When imperfect systems are good" without once mentioning Gabriel or https://en.wikipedia.org/wiki/Worse_is_better?

crabbone · last Wednesday at 10:13 PM

Anecdotally, I ran into a similar solution "by chance".

Long ago, I worked for a dating site. Our CTO at the time was a "guest of honor" who was brought in by a family friend working in marketing at the time. The CTO was a university professor who took on the job as a courtesy (he didn't need the money or the fame; he had enough of both, and actually liked teaching).

But he instituted a lot of experimental practices in the company. Such as switching roles every now and then (anyone in the company, except administration, could apply for a different role and try wearing a different hat), or having company-wide discussions of problems where employees would prepare a presentation on their current work (that was very unusual at the time, but the practice became more common in larger companies afterwards).

Once he announced a contest for the problem he was trying to solve. Since we were building a dating site, the obvious problem was matching. The problem was that the more properties there were to match on, the longer matching would take (among other problems). So the program punished site users who took the time to fill out the questionnaires as well as they could, and favored the "slackers".

I didn't have any bright ideas on how to optimize the matching / search for matches. So, ironically, I asked "what if we just threw away properties beyond a certain threshold, randomly?" I was surprised that my idea received any traction at all. And the answer was along the lines of "that would definitely work, but I wouldn't know how to explain this behavior to the users". Which, at the time, I took to be yet another eccentricity of the old man... but hey, the idea stuck with me for a long time!

timewizard · last Wednesday at 7:54 PM

> This process involves looking up all of your followers, then inserting a new row into each of their Timeline tables in reverse chronological order with a reference to your post.

Seriously? Isn't this the nut of your problem right here?

PaulHoule · last Wednesday at 8:03 PM

An airline reservation system has to be perfect (no slack in today's skies), a hotel reservation can be 98% perfect so long as there is some slack and you don't mind putting somebody up in a better room than they paid for from time to time.

A social media system doesn't need to be perfect at all. It was clear to me from the beginning that Bluesky's feeds aren't very fast (not that they're crazy slow), but if it saves money or effort, it's no problem if notifications are delayed 30s.

bitmasher9 · last Wednesday at 6:28 PM

It’s really impressive how well Bluesky is performing. It really feels like a throwback to older social media platforms with its simplicity and lack of dark-patterns. I’m concerned that all the great work on the platform, protocol, etc won’t shine in the long term as they eventually need to find a revenue source.

mifydev · last Wednesday at 9:36 PM

"Hot Shards in Your Area" - 10/10 heading

dang · last Wednesday at 6:14 PM

[stub for offtopicness]

cush · yesterday at 12:09 AM

"Hot Shards in Your Area"... I died

alexnewman · yesterday at 3:48 AM

I don't see much call for Bluesky anymore...
