The Modern Data Stack is a subject we increasingly read and hear about on the internet when it comes to Data Engineering. So, in this post we are going to explain some concepts related to this topic and break down some trends for the area in 2023.
Data Engineering is about creating, maintaining and optimizing data pipelines, which are the ways of moving data from its sources all the way to the consumer’s end state. Although the workflow in these data pipelines is standard nowadays (consisting of well-known steps to perform the extraction, transformation and loading of data), the process is extremely sensitive to changes in the data, whether in its structure or its values, and these changes can directly affect the availability of the data pipeline, causing failures and making it unavailable. So, here is where Data Observability comes into play.
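As a rough illustration of that standard workflow, here is a minimal extract-transform-load sketch in Python. The databases, table and column names are hypothetical, chosen only for the example; the point is that a single renamed or dropped column upstream is enough to break the transform step.

```python
import sqlite3

import pandas as pd

# Hypothetical source and destination databases, used only for illustration.
source = sqlite3.connect("source.db")
warehouse = sqlite3.connect("warehouse.db")

# Extract: pull the raw rows out of the source system.
orders = pd.read_sql(
    "SELECT order_id, customer_id, amount, created_at FROM orders", source
)

# Transform: any upstream change (e.g. 'amount' renamed to 'total') breaks this step.
orders["created_at"] = pd.to_datetime(orders["created_at"])
daily_revenue = (
    orders.groupby(orders["created_at"].dt.date)["amount"]
    .sum()
    .reset_index(name="revenue")
)

# Load: write the transformed data to the warehouse table that consumers read from.
daily_revenue.to_sql("daily_revenue", warehouse, if_exists="replace", index=False)
```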
Data Engineers need a way to operationalize instant detection within our data pipelines. In other words, as data pipelines grow over time and we constantly need to handle ever larger amounts of data, we need automated tools to prevent downstream issues: when a data pipeline breaks down, we need to find out right away and alert the Data Engineering teams to fix the problem. Therefore, Data Observability is one of today’s hottest topics in Data Engineering.
Today, in the Data Engineering area, we are getting in touch with Data Observability concepts and tools. There is much to learn in this field and to apply to our data pipelines. To make this process easier, we can break down the three pillars of Observability (logs, metrics and traces) and answer questions about our data like: is our data updating correctly? Is our data the right size? Has the data schema/structure changed? Is our data pipeline correctly built in terms of firing alerts whenever a task fails?
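As a concrete starting point, here is a minimal sketch of what such checks could look like in Python with pandas. The table, the freshness threshold, the minimum row count and the expected schema are all assumptions for the example, not a prescription of any particular observability tool.

```python
from datetime import datetime, timedelta, timezone

import pandas as pd


def run_observability_checks(df: pd.DataFrame) -> list[str]:
    """Answer the basic observability questions for one hypothetical table."""
    issues = []
    now = datetime.now(timezone.utc).replace(tzinfo=None)  # assumes naive UTC timestamps

    # Is our data updating correctly? (freshness)
    latest = pd.to_datetime(df["updated_at"]).max()
    if latest < now - timedelta(hours=24):
        issues.append(f"stale data: last update at {latest}")

    # Is our data the right size? (volume; 1,000 rows is an assumed minimum)
    if len(df) < 1_000:
        issues.append(f"unexpected volume: only {len(df)} rows")

    # Has the schema/structure changed?
    expected_schema = {"order_id", "customer_id", "amount", "updated_at"}
    if set(df.columns) != expected_schema:
        issues.append(f"schema drift: got columns {sorted(df.columns)}")

    return issues


# Does the pipeline fire alerts whenever a check (or task) fails?
issues = run_observability_checks(pd.read_csv("orders.csv"))  # hypothetical extract
if issues:
    raise RuntimeError("observability checks failed: " + "; ".join(issues))
```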
I believe this is the best way to start using Data Observability daily and to apply the concepts presented above in our data pipelines. There are many orchestration frameworks that already cover this subject, and we just need to understand them better to take advantage of them and prevent downstream issues in our data pipelines.
Data Quality is a way to assess the accuracy and reliability of the data used and generated within an organization’s data pipelines. The Data Quality process is not very different from the control of other activities carried out by a company, where process improvements are made continuously. In the same way, data also goes through this cycle of constant evaluation and refinement and must be kept under the control of the domain areas.
As with everything in life, when we work with data it is necessary to set priorities, which means assessing which data is most critical and relevant to the business and which applications it will be part of. This analysis guides how the data is processed and used in decision making and, thus, makes it possible to optimize the process by cleaning the data, separating what is useful from what can introduce noise into the analyses.
This work requires support from the data analysts and engineers who are part of the data’s path to the pipeline’s end. To support this quality control process, we can make use of criteria enforced as quality gates. Among them, we can highlight accuracy, completeness, consistency, reliability and timeliness. By implementing these data quality control practices, we make the data’s lineage much more complete, ensuring the availability of the data for use, as well as the security of the information we are generating.
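To make these criteria concrete, here is a minimal sketch of a quality gate written as plain assertions in Python; the column names and thresholds are illustrative assumptions, and in practice each domain would define its own rules.

```python
import pandas as pd


def quality_gate(df: pd.DataFrame) -> None:
    """Block the pipeline when the data fails basic quality criteria (illustrative thresholds)."""
    # Completeness: key fields must not be null.
    assert df["customer_id"].notna().all(), "completeness: null customer_id found"

    # Accuracy: values must fall within a valid range.
    assert (df["amount"] >= 0).all(), "accuracy: negative amounts found"

    # Consistency: business keys must be unique.
    assert df["order_id"].is_unique, "consistency: duplicate order_id found"

    # Timeliness: the newest record must be recent enough to be useful.
    latest = pd.to_datetime(df["created_at"]).max()
    assert latest >= pd.Timestamp.now() - pd.Timedelta(days=1), "timeliness: data is stale"


# Run the gate before loading the data downstream; a failure stops the pipeline here.
orders = pd.read_csv("orders.csv")  # hypothetical dataset
quality_gate(orders)
```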
“A data contract is a written agreement between the owner of a source system and the team ingesting data from that system for use in a data pipeline. The contract should state what data is being extracted, via what method (full, incremental), how often, as well as who (person, team) are the contacts for both the source system and the ingestion” — James Densmore, Data Pipelines Pocket Reference: Moving and Processing Data for Analytics, 1st edition, O’Reilly, 2021.
This means that if we need to ingest data directly from a source database, creating and maintaining a data contract is a way of guaranteeing the data’s schema and of not suffering from unexpected changes to it. Data contracts are also important because, when data becomes popular and widely used in your data pipelines, you need to implement versioning and manage compatibility. This is particularly important because, in a distributed architecture, it is harder to oversee changes.
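As an illustration, a very small data contract could be expressed and enforced in code. Everything below (source name, method, frequency, contacts and field types) is a hypothetical example of what such an agreement might capture, mirroring the elements of the definition quoted above.

```python
import pandas as pd

# Hypothetical contract for an 'orders' ingestion: what is extracted, how, how often, and who to contact.
ORDERS_CONTRACT = {
    "source": "erp.orders",
    "method": "incremental",
    "frequency": "hourly",
    "contacts": {"source": "erp-team@example.com", "ingestion": "data-eng@example.com"},
    "schema": {
        "order_id": "int64",
        "customer_id": "int64",
        "amount": "float64",
        "created_at": "datetime64[ns]",
    },
}


def enforce_contract(df: pd.DataFrame, contract: dict) -> None:
    """Fail the ingestion if the incoming data no longer matches the agreed schema."""
    expected = contract["schema"]
    missing = set(expected) - set(df.columns)
    if missing:
        raise ValueError(f"contract violation: missing columns {sorted(missing)}")
    for column, dtype in expected.items():
        if str(df[column].dtype) != dtype:
            raise ValueError(f"contract violation: {column} is {df[column].dtype}, expected {dtype}")


# Usage at ingestion time (hypothetical file name):
# enforce_contract(pd.read_csv("orders_extract.csv", parse_dates=["created_at"]), ORDERS_CONTRACT)
```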
Nowadays, we see many data pipelines depending on each other, and this can lead to coupling issues. Coupling means that there is a high degree of interdependence between applications (in our case, data pipelines). So, whenever we have this scenario, even a slight change in the source data can introduce errors in a cascading effect. This can go unnoticed and might affect the decision-making process at the end of the pipeline.
Sometimes we see organizations going after tools like data catalogs and anomaly detection to try to alleviate some of these issues, and even building Data Engineering teams to manage the ingestion of data into a centralized Data Warehouse. But these alone do nothing to improve the quality of your data. Therefore, data contracts are positioned to be the solution to this technical problem. A data contract guarantees data compatibility and includes the terms of service. The terms of service describe how the data can be used (for example, only for development, testing, or production). They typically describe the data delivery and interface quality and might include uptime, error rates, and availability, as well as deprecation, a roadmap, and version numbers.
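For instance, the contract sketched earlier could be extended with a version number and a terms-of-service section, plus a small compatibility check that runs before a new schema version is published. The fields and the compatibility rule below are assumptions meant only to illustrate the idea.

```python
# Hypothetical terms-of-service section for the orders contract sketched earlier.
TERMS_OF_SERVICE = {
    "version": "2.1.0",
    "allowed_environments": ["development", "testing", "production"],
    "uptime_target": "99.5%",
    "max_error_rate": "0.1%",
    "deprecation_date": None,  # filled in once a newer major version replaces this one
}

OLD_SCHEMA = {"order_id": "int64", "customer_id": "int64", "amount": "float64"}
NEW_SCHEMA = {**OLD_SCHEMA, "currency": "object"}  # additive change: backward compatible


def is_backward_compatible(old: dict, new: dict) -> bool:
    """A removed column or a changed type is a breaking change and requires a new major version."""
    return all(column in new and new[column] == dtype for column, dtype in old.items())


# Additive changes keep the 2.x line; dropping or retyping 'amount' would force 3.0.0.
assert is_backward_compatible(OLD_SCHEMA, NEW_SCHEMA)
```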