What I've learned 6 months into data engineering
I come from a software engineering background. In that world, it’s mostly about having:
- A database
- A back-end server that connects to the database and serves an API
- A front-end client that connects to the API and provides user functionality
But data engineering is very different. For my job, I had to go beyond just building the application and build a data pipeline capable of ingesting both structured and unstructured sources as the minimum requirement to even show data to users. I learned the hard way that data engineering != application engineering. Here are some of my learnings, in no particular order:
Take your time to learn the data
Learning the lingo of an industry and the datasets it works with takes time. It took me weeks just to understand the schemas, and that’s totally fine. Think of it as an investment.
The better you understand the problem space, the industry, and the datasets, the more effective you can be in designing data pipelines.
Just stick to Python
I tried starting off with JavaScript. Don’t laugh too hard - I didn’t have high hopes, but I wanted to see where the ecosystem was at. While there are libraries that mirror their Python equivalents (ex. Danfo.js), the ecosystem didn’t feel large or complete.
While I have many gripes about Python’s approach to virtual environments, packages, and types, it’s hard to beat the sheer amount of online resources focused on getting all types of data work done with Python. Most data APIs offer Python SDKs. Further, LLMs are extremely good at writing fully functional Python code with a deep understanding of all the key libraries like Pandas, Seaborn, XGBoost, etc.
It’s also worth thinking about ease of hiring. When you bring on another data engineer, they are more likely than not to come from a Python background.
Orchestrators are a must
When I started, I had a series of notebooks labelled
001_notebook.ipynb
002_notebook.ipynb
test.ipynb
This worked until the dependencies got tangled and complicated. I started evaluating orchestration frameworks (after first learning what an orchestrator was) - Airflow, Dagster, Prefect, etc. I ultimately settled on Dagster because I liked its Python-first approach to defining all assets, operations, and resources. Transitioning all the notebooks into functional Dagster assets took many hours, but it was amazing when it all came together.
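To give a flavor, here’s a minimal sketch of what a couple of Dagster assets can look like (the asset names and data are made up for illustration):

```python
from dagster import Definitions, asset

@asset
def raw_listings() -> list[dict]:
    # Hypothetical ingestion step; in reality this would pull from a source API or file
    return [
        {"id": 1, "price": 100.0},
        {"id": 2, "price": None},
    ]

@asset
def cleaned_listings(raw_listings: list[dict]) -> list[dict]:
    # Dagster infers the dependency on raw_listings from the parameter name
    return [row for row in raw_listings if row["price"] is not None]

defs = Definitions(assets=[raw_listings, cleaned_listings])
```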
We could visualize dependencies, track the history of our runs, monitor data quality over time, re-run failed jobs, and read logs to diagnose issues, all from one place. Overall, it was a first-class experience compared to what I was doing before.
Don’t use Pandas everywhere
As much as I sing the praises of pandas, I later found that it was a code smell to be using it everywhere for data transformations.
- Most data at the size I work with did not need pandas - lists and dicts were totally fine (see the sketch after this list)
- Once data is in pandas, type hints are lost
- The code can be hard to read for other developers
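As a hypothetical sketch of what I mean, plain typed Python handles a lot of these transformations while keeping the type checker in the loop:

```python
from typing import TypedDict

class Order(TypedDict):
    order_id: str
    amount: float
    status: str

def total_completed(orders: list[Order]) -> float:
    # A type checker like MyPy can verify every field access here;
    # the equivalent DataFrame column lookups would be opaque to it
    return sum(o["amount"] for o in orders if o["status"] == "completed")
```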
The biggest issue I found was that type hints were lost, which leads to my next point:
Use Pydantic if you want your code to not break in a few months
It starts with adding a new column to your data. Then you add another. Then you have to rename one. Then something breaks and it’s hard to tell where the problems are.
It’s good to have as much static type checking as possible, plus runtime type checking for key inputs and outputs.
I wanted to see how close I could get to TypeScript and Zod, which turned out to be MyPy and Pydantic. After some self-flagellation for not learning this sooner, I migrated all of our data models to use Pydantic and immediately eliminated an extremely common class of bug. There is actual confidence now that the system won’t break because a column changed slightly.
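As a rough illustration (the field names are made up), a Pydantic v2 model validates and coerces data at the boundary, so a bad record fails loudly at ingestion instead of silently breaking things downstream:

```python
from pydantic import BaseModel, ValidationError

class Listing(BaseModel):
    listing_id: int
    title: str
    price: float | None = None

# Values arriving as strings get coerced to the declared types
listing = Listing.model_validate({"listing_id": "42", "title": "Example", "price": "19.99"})

try:
    Listing.model_validate({"title": "missing id"})
except ValidationError as exc:
    print(exc)  # the missing listing_id is caught immediately, with a clear error
```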
Don’t look like a bot when scraping
My work often involves scraping data from other websites. Smaller websites rarely have any protections, but larger websites will put up a fight to stop bots from accessing their data.
I’ve tried randomizing user agents. I’ve tried Playwright Stealth. To date, the best library I have found is undetected-chromedriver (UC), which patches Selenium’s Chromedriver to bypass bot detection by modifying browser fingerprints and removing automation flags. I use it through a scraping utility library called SeleniumBase, but it can be used directly too.
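For reference, the SeleniumBase usage looks roughly like this (the URL is a placeholder, and exact options will depend on your setup):

```python
from seleniumbase import SB

# uc=True runs SeleniumBase in UC Mode, backed by undetected-chromedriver
with SB(uc=True) as sb:
    sb.open("https://example.com")  # placeholder URL
    html = sb.get_page_source()
    print(len(html))
```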