Building A Data Mesh With Starburst

Data mesh is a new, decentralized approach to data that allows end-users to easily access data where it lives without a data lake or data warehouse. Domain-specific teams manage and serve data as a product to be consumed by others. Its objective is to allow for data products to be created from virtually any data source while minimizing intervention from data engineers.

Data mesh has four principles to achieve this objective. The principles are:

Domain-Oriented Ownership
Data as a Product
Self-Service Data Infrastructure
Federated Computational Governance

Starburst can be used to achieve a data mesh. The following sections outline how Starburst can be used in alignment with each of the four principles of data mesh.

Domain-Oriented Ownership

In a data mesh, data teams are organized by domain, which is another word for the subject area. Teams publish data products that other teams can access and use to derive their own new data products. Starburst’s goal is to allow teams to focus less on building infrastructure and data pipelines around serving data products and more on using familiar tools such as SQL to prepare data products for end-users.

To achieve this, Starburst provides a large set of connectors that allows each domain to connect to data wherever and in whatever format it may live using a SQL query interface.

Figure 1 shows various example data sources that can be accessed from Starburst.

*Figure 1: Example data sources in Starburst*

‍

Data as a Product

After connecting to a data source in Starburst, Starburst allows you to curate data products from it for other users to access.

Users can browse the published data products as shown in Figure 2.

*Figure 2: Example data products created from data sources in Starburst*

‍

Self-Service Data Infrastructure

Starburst’s SQL query interface allows users to discover, understand, and evaluate the trustworthiness of data products. Figure 3 shows an example of using the SQL query interface to query an Amazon S3-based data product.

*Figure 3: Querying an S3 file in Starburst*

‍

Using the SQL query interface, you can also join data products from different technologies together. For example, Figure 4 shows an example of joining together a PostgreSQL-based data product with an Amazon S3-based data product on a common field. The result of this join can be considered a new, derived data product that can also be registered in Starburst.

*Figure 4: Deriving a new data product from different technologies using Starburst*

‍

Federated Computational Governance

Data mesh proposes a federated model for data governance that focuses on shared responsibility between the domains and the central IT organization in order to adhere to governance, risk, and compliance concerns while allowing adequate autonomy for the domains.

Starburst provides connectors and access to various data governance and data catalog tools such as Collibra and Alation to help users discover, understand, and evaluate the trustworthiness of data products.

Starburst also significantly reduces the need to create copies of data between systems as Starburst’s query engine can read across data sources and can replace or reduce a traditional ETL/ELT pipeline. Copying data also requires reapplying entitlements, which can result in potential opportunities for a data breach; with Starburst that risk is minimized simply because fewer copies of the data will exist since data is mostly queried at the source. This concept, known as data minimization, means data privacy, security, and governance are more achievable goals in organizations that embrace Starburst together with data mesh.

‍