Hi HN! I'm currently a Master's student at USTC (University of Science and Technology of China). I've been diving deep into Data Engineering, especially in the context of Large Language Models (LLMs).
The Problem: I found that learning resources for modern data engineering are often fragmented and scattered across hundreds of medium articles or disjointed tutorials. It's hard to piece everything together into a coherent system.
The Solution: I decided to open-source my learning notes and build them into a structured book. My goal is to help developers fast-track their learning curve.
Key Features:
LLM-Centric: Focuses on data pipelines specifically designed for LLM training and RAG systems.
Scenario-Based: Instead of just listing tools, I compare different methods/architectures based on specific business scenarios (e.g., "When to use Vector DB vs. Keyword Search").
Hands-on Projects: Includes full code for real-world implementations, not just "Hello World" examples.
This is a work in progress, and I'm treating it as "Book-as-Code". I would love to hear your feedback on the roadmap or any "anti-patterns" I might have included!
Check it out:
Online: https://datascale-ai.github.io/data_engineering_book/
GitHub: https://github.com/datascale-ai/data_engineering_book
That's a great point. I completely agree—'Data Engineering for LLMs' is much more accurate given the content.I'll pass this feedback on to the project lead immediately. Thanks for the suggestion.
Project lead? The post gives the impression that you are a student open sourcing your learning notes.
No, I think it gives the impression that the author is using an LLM without much supervision to not only write the submitted content, but to reply to posts here.