7 September 2025

Introduction to Data Vault 2.0 - Part II

In the last blog post, we gave a basic overview of Data Vault 2.0, its principles, methodology, and approach to data modeling. In this entry, we'll implement an example Data Vault to get a more practical understanding of how the different pieces fit together.

We use the OpenFlights dataset as our raw source tables and host our data warehouse locally on PostgreSQL. For data modeling, we use dbt alongside the DataVault4dbt package created by Scalefree. Of course, one could also implement the Data Vault logic manually, but we would not recommend that for most applications: several automation tools already exist, and using them saves time and reduces complexity.

The source code is available if you wish to take a closer look at the implementation and, as always, please do not hesitate to contact us with any questions or comments you may have. With all that said, let's dive in!

1 Staging Layer

Looking at our six source tables, we have an airlines table, an airports and an airports_extended table, a countries table, a planes table, and a routes table. The entire dataset is modeled in a star schema, with the routes table acting as the fact table and everything else as dimension tables. These raw source tables can be seen in our source code under the 'OpenFlights' folder.

We start with the staging layer, where we standardize raw sources. As a first step, we define our dbt sources (which are simply the raw source tables) in our sources YAML:

version: 2

sources:
  - name: dv
    database: dv
    schema: landing_zone
    tables:
      - name: airlines
      - name: airports
      - name: airports_extended
      - name: countries
      - name: planes
      - name: routes

Next, we need to rename the tables and columns to lower snake case. PostgreSQL folds unquoted identifiers to lowercase, so mixed-case names would otherwise have to be double-quoted in every query, which invites errors later on. The nice thing about dbt is that this sort of operation can easily be done with some Jinja scripting. Here's how we rename the columns of the airlines table as an example:

{% set src_relation = source('dv','airlines') %}
{% set cols = adapter.get_columns_in_relation(src_relation) %}

select
    {%- for col in cols %}
        "{{ col.name }}" as {{ col.name | lower }}{{ "," if not loop.last }}
    {%- endfor %}
from {{ src_relation }}
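
To make the effect of the loop concrete, here is a minimal Python sketch that builds the kind of SELECT statement the template would render. The function name `render_rename_select` and the sample column names are hypothetical and only serve as an illustration; they are not taken from the project or from dbt itself.

```python
def render_rename_select(relation: str, columns: list[str]) -> str:
    """Build a SELECT that mirrors the Jinja loop above.

    Each source column is double-quoted (so PostgreSQL matches its exact
    case) and aliased to its lowercase form.
    """
    select_list = ",\n".join(
        f'    "{name}" as {name.lower()}' for name in columns
    )
    return f"select\n{select_list}\nfrom {relation}"


if __name__ == "__main__":
    # Hypothetical mixed-case column names, for illustration only.
    print(render_rename_select("landing_zone.airlines", ["Airline_ID", "Name", "IATA"]))
```

Note that, like the template, this only lowercases the names; it does not insert underscores into names that lack them.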

Once we're done with the renaming, we can start the actual staging step, which essentially boils down to hashing our primary key columns, adding hash diffs for change data capture in our satellites later down the line, and a few other steps. Luckily for us, we can do this using DataVault4dbt's stage macro, which implements the staging logic for us. Here is the staging code for the airlines table:

{%- set yaml_metadata -%}
source_model: 'airlines'
ldts: CURRENT_TIMESTAMP
rsrc: '!landing_zone.airlines'
include_source_columns: true
hashed_columns:
    hk_airlines:
        - airline_id
    hd_airlines:
        is_hashdiff: true
        columns:
            - name
            - alias
            - iata
            - icao
            - callsign
            - country
            - active
{%- endset -%}

{{ datavault4dbt.stage(yaml_metadata=yaml_metadata) }}
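
For readers who want to see what "hashing the key" and "adding a hash diff" amount to, the following Python sketch shows the idea in isolation. It is not the code the stage macro generates: the hash function (MD5), the `||` delimiter, the `^^` placeholder for missing values, and all function names are assumptions made for this illustration.

```python
import hashlib

# Assumed conventions for this sketch (not taken from the article).
DELIMITER = "||"
NULL_PLACEHOLDER = "^^"


def _normalize(value) -> str:
    """Stringify and trim one value; map None or empty text to the placeholder."""
    if value is None:
        return NULL_PLACEHOLDER
    text = str(value).strip()
    return text if text else NULL_PLACEHOLDER


def hash_key(*business_keys) -> str:
    """Hash key: business keys are upper-cased so 'ba' and 'BA' map to the same key."""
    joined = DELIMITER.join(_normalize(k).upper() for k in business_keys)
    return hashlib.md5(joined.encode("utf-8")).hexdigest()


def hash_diff(row: dict, columns: list[str]) -> str:
    """Hash diff: fingerprint of the descriptive attributes in a fixed column order."""
    joined = DELIMITER.join(_normalize(row.get(c)) for c in columns)
    return hashlib.md5(joined.encode("utf-8")).hexdigest()


if __name__ == "__main__":
    # Hypothetical airline record, for illustration only.
    airline = {"airline_id": 24, "name": "Example Air", "callsign": "EXAMPLE"}
    print(hash_key(airline["airline_id"]))
    print(hash_diff(airline, ["name", "callsign"]))
```

The hash key stays stable as long as the business key does, while the hash diff changes whenever any descriptive attribute changes, which is what the satellites later rely on.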

And believe it or not, that's it for staging! We simply repeat the same step for all of our tables and we have our staging tables. Of course, in reality there would be more steps involved. For example, we are not performing any data checks, which any good data engineer would quickly tell you is a recipe for disaster. In our case, however, we are using an already cleaned dataset, so data testing is not essential to demonstrating the modeling approach.

2 Raw Vault

The next step after staging is to implement our raw vault, and just like before, DataVault4dbt provides macros for creating hubs, links, and satellites. The exact way you would approach modeling your data here is case-specific and depends on what you need from your EDW. In our case, we treat each source table as a core business entity and therefore define a hub for it. Here's how the airlines hub is defined:

{%- set yaml_metadata -%}
hashkey: 'hk_airlines'
business_keys:
    - airline_id
source_models: stg_airlines
{%- endset -%}

{{ datavault4dbt.hub(yaml_metadata=yaml_metadata) }}

Notice how we only take the business key associated with the airline entity in our hub. All other descriptive attributes go into the satellites. The hubs store unique business keys plus metadata for lineage and auditing.
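
The loading behaviour of a hub can be sketched outside of dbt as an insert-only pattern: a business key is written once and never updated. The example below uses Python's built-in `sqlite3` module purely as a stand-in for PostgreSQL; the table layout, the function names, and the placeholder hash values are assumptions for illustration and not the SQL that the hub macro generates.

```python
import sqlite3


def create_hub(conn: sqlite3.Connection) -> None:
    """Create a minimal hub: hash key, business key, load date, record source."""
    conn.execute(
        "create table hub_airlines ("
        "hk_airlines text primary key, airline_id text, ldts text, rsrc text)"
    )


def load_hub(conn: sqlite3.Connection, staged_rows) -> None:
    """Insert only those business keys whose hash key is not in the hub yet."""
    conn.executemany(
        """
        insert into hub_airlines (hk_airlines, airline_id, ldts, rsrc)
        select ?, ?, ?, ?
        where not exists (
            select 1 from hub_airlines where hk_airlines = ?
        )
        """,
        [(hk, bk, ldts, rsrc, hk) for hk, bk, ldts, rsrc in staged_rows],
    )


def hub_size(conn: sqlite3.Connection) -> int:
    """Number of distinct business keys currently stored in the hub."""
    return conn.execute("select count(*) from hub_airlines").fetchone()[0]
```

Loading the same staged batch twice leaves the hub unchanged, which is why hub loads can be re-run safely.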

Once we have defined all hubs, we need to connect them using links. Here is an example of a link between routes and airlines:

{%- set yaml_metadata -%}
link_hashkey: 'hk_routes_airlines'
foreign_hashkeys: 
    - 'hk_routes'
    - 'hk_airlines'
source_models: 
    - name: stg_routes
{%- endset -%}    

{{ datavault4dbt.link(yaml_metadata=yaml_metadata) }}

As you can see, we are using the routes and airlines hub hashkeys as inputs to generate the routes_airlines link hashkey, which uniquely identifies each relationship between a route and an airline. This way, the links model many-to-many relationships between hubs and carry their own metadata.

Last but not least, we generate satellite tables for the descriptive attributes of each hub. Please keep in mind that links can, and often do, have satellites as well, but in our example we omit them. The airlines satellite is created like this:
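
As a rough sketch of what a link row contains, the Python snippet below builds one from the keys of its two parents. It assumes that the link hash key is derived from the combined business keys of both hubs, which is one common convention; MD5, the `||` delimiter, the function names, and the sample key values are all assumptions for illustration.

```python
import hashlib


def md5_hex(text: str) -> str:
    """Hex digest used as a stand-in for the hashing done during staging."""
    return hashlib.md5(text.encode("utf-8")).hexdigest()


def build_link_row(route_key: str, airline_id: str) -> dict:
    """One link row: its own hash key plus the hash keys of both parent hubs."""
    return {
        "hk_routes_airlines": md5_hex(f"{route_key}||{airline_id}"),
        "hk_routes": md5_hex(route_key),
        "hk_airlines": md5_hex(airline_id),
    }


if __name__ == "__main__":
    # Hypothetical keys: the same route served by two different airlines.
    first = build_link_row("FRA-JFK", "24")
    second = build_link_row("FRA-JFK", "137")
    print(first["hk_routes"] == second["hk_routes"])                    # same route hub entry
    print(first["hk_routes_airlines"] == second["hk_routes_airlines"])  # different relationships
```

Because each combination of parents gets its own link hash key, one route can relate to many airlines and one airline to many routes without any change to the hubs.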

{%- set yaml_metadata -%}
parent_hashkey: 'hk_airlines'
src_hashdiff: 'hd_airlines'
src_payload:
    - name
    - alias
    - iata
    - icao
    - callsign
    - country
    - active
source_model: 'stg_airlines'
{%- endset -%}    

{{ datavault4dbt.sat_v0(yaml_metadata=yaml_metadata) }}
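
The role of the hash diff in a satellite load can be sketched as follows: a staged row is appended only if its hash diff differs from the most recent row stored for the same parent hash key. This is a simplified illustration in Python, not the logic generated by the satellite macro; the function name and the short keys `hk`, `hd`, and `ldts` are assumptions.

```python
def load_satellite(satellite: list[dict], staged: list[dict]) -> int:
    """Append staged rows whose hash diff differs from the latest stored row.

    Rows are dicts with 'hk' (parent hash key), 'hd' (hash diff) and
    'ldts' (load date as an ISO string). Returns the number of rows added.
    """
    # Remember the most recent hash diff per parent hash key.
    latest = {}
    for row in sorted(satellite, key=lambda r: r["ldts"]):
        latest[row["hk"]] = row["hd"]

    inserted = 0
    for row in sorted(staged, key=lambda r: r["ldts"]):
        if latest.get(row["hk"]) != row["hd"]:
            satellite.append(row)
            latest[row["hk"]] = row["hd"]
            inserted += 1
    return inserted
```

Re-loading unchanged data therefore adds nothing, while a changed attribute produces a new row and keeps the full history intact.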

Once we have all of our hubs, links, and satellites, we have essentially set up our raw vault! At this point, we hope you are starting to see why using automation tools like DataVault4dbt is so powerful.

One last note before we move on: we do not implement a business vault in our example. However, adding one once the raw vault is in place is straightforward. Below is the Directed Acyclic Graph (DAG) of the completed dbt project.

3 Information Marts

This step is straightforward once the vault is in place. In our information marts (or data marts), we build ad-hoc and custom queries and tables on top of the vault to create useful insights for the business. This is also where the strength of the Data Vault 2.0 modeling approach shows: changes or additions to the data structures of our source systems can be accommodated in the vault without reworking what is already there. Once the vault is set up as our single source of truth, creating the required marts is simply a question of picking what we need. In a real implementation, we would usually surface marts from the vault via Point-in-Time (PIT) and bridge tables to serve BI.
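
Since PIT tables are not part of our example, here is a minimal Python sketch of the lookup they precompute: for a given snapshot date, find the most recent satellite load date at or before it. The function name and the sample dates are hypothetical, and dates are ISO strings so that string order matches chronological order.

```python
from bisect import bisect_right
from typing import Optional


def pit_lookup(load_dates: list[str], snapshot: str) -> Optional[str]:
    """Return the latest load date that is <= snapshot, or None if there is none.

    `load_dates` must be sorted ascending (ISO date strings).
    """
    idx = bisect_right(load_dates, snapshot)
    return load_dates[idx - 1] if idx else None


if __name__ == "__main__":
    # Hypothetical load dates of one airline's satellite rows.
    dates = ["2025-01-01", "2025-03-01"]
    print(pit_lookup(dates, "2025-02-15"))
```

A PIT table stores the result of this lookup per hub key and snapshot date, so that mart queries can join to the right satellite rows with simple equality joins.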

In our example, we implement three different marts to demonstrate various BI requirements:

an OBT (One Big Table) for business analysts to examine the performance of different routes and airlines,
a mart for evaluating the connectivity of each airport,
as well as a mart for analyzing the overall utilization of aircraft types.
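
To give a flavour of what such a mart computes, the following Python sketch derives a simple connectivity measure: the number of distinct destinations reachable from each airport. It is a simplified stand-in and not the query used in the project; the function name and the sample routes are made up for illustration.

```python
from collections import defaultdict


def airport_connectivity(routes: list[tuple[str, str]]) -> dict[str, int]:
    """Count distinct destination airports per source airport.

    `routes` is a list of (source_airport, destination_airport) pairs;
    duplicate pairs (e.g. the same route flown by several airlines)
    are counted once.
    """
    destinations = defaultdict(set)
    for source, destination in routes:
        destinations[source].add(destination)
    return {airport: len(dests) for airport, dests in destinations.items()}


if __name__ == "__main__":
    # Hypothetical routes, for illustration only.
    sample = [("FRA", "JFK"), ("FRA", "LHR"), ("FRA", "JFK"), ("LHR", "FRA")]
    print(airport_connectivity(sample))
```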

Since these are rather long ad-hoc queries, we do not list them here, but we strongly recommend analyzing them yourself. The result of the aircraft type usage is shown below:

4 dbt Features

We hope the implementation above has demonstrated the need for repeatable and consistent logic when setting up a Data Vault 2.0. We personally really like dbt, as it makes the entire data modeling process faster and more pleasant, and packages such as DataVault4dbt or dbt_utils let you set things up and deploy very quickly.

As mentioned before, there are other tools that one could use and there even exists another package for dbt which is also an implementation of Data Vault 2.0 (AutomateDV). The main goal of this blog was to demonstrate Data Vault 2.0 modeling, not a specific tool.

We are happy to assist you in evaluating the various tools to find the one that best suits your requirements.

5 Concluding Remarks

We hope that through this basic practical implementation, you now have a better understanding of Data Vault 2.0 and all of its core components. If you would like to delve deeper into the topic, please contact us or take a look at our sources.

Whether you should or should not use Data Vault is a rather important and nuanced question that you should carefully consider depending on the nature of your data sources, business intelligence needs, and data volume, amongst other factors.

We would be happy to hear what you think of this blog series, alongside any other comments, feedback, or questions you might have. Thank you and see you next time!

Sources

Scalefree (no date) Data Vault 2.0 definition. Available at: https://www.scalefree.com/consulting/data-vault-2-0/ (Accessed: 28 August 2025).
Linstedt, D. and Olschimke, M. (2015) Building a scalable data warehouse with Data Vault 2.0. Waltham, MA: Morgan Kaufmann.
Scalefree International GmbH (2025) DataVault4dbt. Available at: https://www.datavault4dbt.com/ (Accessed: 29 August 2025).
OpenFlights.org (2025) OpenFlights.org: Flight logging, mapping, stats and sharing. Available at: https://openflights.org (Accessed: 29 August 2025).
dbt Labs (2025) Transform your data with dbt. Available at: https://www.getdbt.com/lp/dbt-the-data-build-tool (Accessed: 29 August 2025).
PostgreSQL Global Development Group (2025) PostgreSQL: The world’s most advanced open source database. Available at: https://www.postgresql.org (Accessed: 29 August 2025).

Tags: data, Data Warehouse, Data Lakehouse, Data Vault, Data Vault 2.0, Raw Vault, dbt, dbt Labs, DataVault4dbt, PostgreSQL, Scalefree, dbutils, AutomateDV
