Hacker News

Show HN: Data Engineering Book – An open source, community-driven guide

167 points by xx123122 yesterday at 9:35 PM | 19 comments

Hi HN! I'm currently a Master's student at USTC (University of Science and Technology of China). I've been diving deep into Data Engineering, especially in the context of Large Language Models (LLMs).

The Problem: I found that learning resources for modern data engineering are often fragmented, scattered across hundreds of Medium articles or disjointed tutorials. It's hard to piece everything together into a coherent system.

The Solution: I decided to open-source my learning notes and build them into a structured book. My goal is to help developers fast-track their learning.

Key Features:

LLM-Centric: Focuses on data pipelines specifically designed for LLM training and RAG systems.

Scenario-Based: Instead of just listing tools, I compare different methods/architectures based on specific business scenarios (e.g., "When to use Vector DB vs. Keyword Search").

Hands-on Projects: Includes full code for real-world implementations, not just "Hello World" examples.

This is a work in progress, and I'm treating it as "Book-as-Code". I would love to hear your feedback on the roadmap or any "anti-patterns" I might have included!

Check it out:

Online: https://datascale-ai.github.io/data_engineering_book/

GitHub: https://github.com/datascale-ai/data_engineering_book


Comments

hliyan today at 5:59 AM

I'm not sure whether this is an artefact of translation, but things like this don't inspire confidence:

> The "Modern Data Stack" (MDS) is a hot concept in data engineering in recent years, referring to a cloud-native, modular, decoupled combination of data infrastructure

https://github.com/datascale-ai/data_engineering_book/blob/m...

Later parts are better and more to the point though: https://github.com/datascale-ai/data_engineering_book/blob/m...

Edit: perhaps I judged too early. The RAG section isn't bad either: https://github.com/datascale-ai/data_engineering_book/blob/m...

esafak today at 4:03 AM

I'd have titled the submission 'Data Engineering for LLMs...' as it is focused on that.

osamabinladen today at 5:15 AM

this is great and i bookmarked it so i can read it later. i’m just curious though, was the readme written by chatgpt? i can’t tell if im paranoid thinking everything is written by chatgpt

alexott today at 8:08 AM

Parquet alone is not enough for modern data engineering. Delta and Iceberg should be on the list.

guillem_lefait today at 12:54 AM

The figures in the different chapters are in English (that's not the case for the image in README_en.md).

xx123122 today at 4:11 AM

[dead]

dvrp today at 12:32 AM

If you are interested in (2026-)internet-scale data engineering challenges (e.g. processing 10s to 100s of petabytes) and pre-training/mid-training/post-training scale challenges, please send me an email at [email protected]!

rafavargascom yesterday at 11:16 PM

谢谢 (thank you)

How is it possible that a Chinese publication gets to the top of HN?

MUSTANG303 today at 6:41 AM

[dead]