Show HN: Data Engineering Book – An open source, community-driven guide

167 points • by xx123122 • yesterday at 9:35 PM • 19 comments

Hi HN! I'm currently a Master's student at USTC (University of Science and Technology of China). I've been diving deep into Data Engineering, especially in the context of Large Language Models (LLMs).

The Problem: I found that learning resources for modern data engineering are often fragmented, scattered across hundreds of Medium articles and disjointed tutorials. It's hard to piece everything together into a coherent system.

The Solution: I decided to open-source my learning notes and build them into a structured book. My goal is to help developers shorten their learning curve.

Key Features:

LLM-Centric: Focuses on data pipelines specifically designed for LLM training and RAG systems.

Scenario-Based: Instead of just listing tools, I compare different methods/architectures based on specific business scenarios (e.g., "When to use Vector DB vs. Keyword Search").

Hands-on Projects: Includes full code for real-world implementations, not just "Hello World" examples.

This is a work in progress, and I'm treating it as "Book-as-Code". I would love to hear your feedback on the roadmap or any "anti-patterns" I might have included!

Check it out:

Online: https://datascale-ai.github.io/data_engineering_book/

GitHub: https://github.com/datascale-ai/data_engineering_book

Comments

hliyan • today at 5:59 AM

I'm not sure whether this is an artefact of translation, but things like this don't inspire confidence:

> The "Modern Data Stack" (MDS) is a hot concept in data engineering in recent years, referring to a cloud-native, modular, decoupled combination of data infrastructure

https://github.com/datascale-ai/data_engineering_book/blob/m...

Later parts are better and more to the point though: https://github.com/datascale-ai/data_engineering_book/blob/m...

Edit: perhaps I judged too early. The RAG section isn't bad either: https://github.com/datascale-ai/data_engineering_book/blob/m...

esafak • today at 4:03 AM

I'd have titled the submission 'Data Engineering for LLMs...' as it is focused on that.

osamabinladen • today at 5:15 AM

this is great and i bookmarked it so i can read it later. i’m just curious though, was the readme written by chatgpt? i can’t tell if im paranoid thinking everything is written by chatgpt

joshuaissac • yesterday at 10:58 PM

English version: https://github.com/datascale-ai/data_engineering_book/blob/m...

alexott • today at 8:08 AM

Parquet alone is not enough for modern data engineering. Delta and Iceberg should be on the list.

guillem_lefait • today at 12:54 AM

The figures in the different chapters are in English (which is not the case for the image in README_en.md).

xx123122 • today at 4:11 AM

[dead]

dvrp • today at 12:32 AM

If you are interested in (2026-)internet-scale data engineering challenges (e.g. processing 10s to 100s of petabytes) and pre-training/mid-training/post-training scale challenges, please send me an email at [email protected]!

rafavargascom • yesterday at 11:16 PM

Thank you

How is it possible that a Chinese publication gets to the top of HN?

MUSTANG303 • today at 6:41 AM

[dead]
