The motivation behind Cloudera Altus SDX is to enable multiple clusters to share the same consistent view of enterprise data hosted on Amazon S3 and Microsoft ADLS. At the heart of Altus SDX is a repository of attributes describing locations and structure of data, access rights, business glossary definitions, lineage and more.
We often hear from our customers about use cases where data is in the cloud and clusters are created on demand to ingest new datasets. The lifetime of clusters is short relative to the lifetime of the data itself. Being able to destroy idle clusters saves money, but it also means that metadata accumulated by those transient clusters is lost at the end of each ETL pipeline execution.
For example, one common pain point we hear about is that partition information is lost when clusters running ETL jobs terminate. To process the new data on another cluster, the user first needs to recover partitions by instructing Apache Hive to crawl the file system (MSCK REPAIR TABLE). As you can imagine, this can be rather expensive for large datasets stored in S3.
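To make the cost of that crawl concrete, here is a minimal sketch of partition discovery from Hive-style `key=value` path segments. It is not Hive's implementation of MSCK REPAIR TABLE; the function name `discover_partitions`, the object keys, and the table prefix are all hypothetical and chosen only for illustration.

```python
def discover_partitions(object_keys, table_prefix):
    """Derive partition specs from Hive-style key=value directory segments.

    Illustrative only: every object key under the table prefix is inspected,
    which is why a crawl gets slower as the number of objects grows.
    """
    partitions = set()
    for key in object_keys:
        if not key.startswith(table_prefix):
            continue
        relative = key[len(table_prefix):]
        # Drop the trailing file name and keep only the directory segments.
        segments = relative.split("/")[:-1]
        spec = tuple(tuple(s.split("=", 1)) for s in segments if "=" in s)
        if spec:
            partitions.add(spec)
    return sorted(partitions)


# Hypothetical object keys, as a listing of the table location might return them.
keys = [
    "warehouse/sales/year=2018/month=01/part-0000.parquet",
    "warehouse/sales/year=2018/month=01/part-0001.parquet",
    "warehouse/sales/year=2018/month=02/part-0000.parquet",
    "warehouse/other/year=2018/part-0000.parquet",
]
found = discover_partitions(keys, "warehouse/sales/")
```

In this sketch the work is proportional to the number of objects listed, not to the number of partitions found, so a table with a handful of partitions but millions of files still pays for the full listing.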
The root of these metadata persistence problems is that metadata is traditionally stored in a local RDBMS hosted on the cluster itself. This makes metadata part of the cluster state and leaves each cluster responsible for maintaining its own copy. This approach does not scale well to large enterprises where multiple clusters process the same large shared dataset stored in S3 or ADLS.
Altus SDX replaces the local RDBMS with an external repository. Clusters configured with Altus SDX always work with the most recent metadata snapshot, which is shared with other clusters.
Figure 1. An Altus SDX-enabled cluster uses the Altus API to read and write metadata. For example, Hive Metastore (HMS) is shown here using Altus SDX to read and write data locations and schema.
SDX Namespaces are containers for metadata. When two clusters share the same namespace, they see the same metadata such as databases, tables, partitions, statistics, etc.
When creating a namespace, the user must supply the namespace name. This name is later used as a shorthand logical identifier for the namespace, for example, when creating clusters. We’ll come back to this later in the post.
Once a namespace is created, it does not require further administration. A namespace can be deleted if it is no longer in use.
Each namespace receives a globally unique identifier when it is created. This identifier, known as a CRN, can be used to distinguish between namespaces even if they share the same logical name. For example, if a namespace with the name foo gets deleted and then a new namespace foo is created, the new CRN will be different from the old one.
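The relationship between names and CRNs can be sketched with a small in-memory model. This is not the Altus API: the `NamespaceRegistry` class, the CRN string layout, and the rule that a name can be held by only one live namespace at a time are assumptions made for the example.

```python
import uuid


class NamespaceRegistry:
    """Toy model of namespace names and their unique identifiers."""

    def __init__(self):
        self._crn_by_name = {}

    def create(self, name):
        # Assumption: a name can be held by only one live namespace at a time.
        if name in self._crn_by_name:
            raise ValueError(f"namespace {name!r} already exists")
        # Hypothetical CRN layout; the random suffix is what makes it unique.
        crn = f"crn:example:namespace:{name}/{uuid.uuid4()}"
        self._crn_by_name[name] = crn
        return crn

    def delete(self, name):
        del self._crn_by_name[name]


registry = NamespaceRegistry()
old_crn = registry.create("foo")
registry.delete("foo")
new_crn = registry.create("foo")  # same name, different identifier
```

Here `old_crn` and `new_crn` both contain the name foo but differ in their unique suffix, so anything that recorded the old identifier can tell that the namespace it refers to no longer exists.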
Creating Clusters with SDX
When enabling SDX for a new cluster, we must choose which namespace to use. When using the Altus CLI or the Altus Java SDK, we can specify the namespace by either its name or its CRN.
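As a rough illustration of accepting either form, the sketch below resolves a namespace reference that may be a name or a CRN. It does not reproduce the Altus CLI or Java SDK: the `resolve_namespace` helper, the `crn:` prefix check, and the sample identifiers are hypothetical.

```python
def resolve_namespace(reference, crn_by_name):
    """Return the CRN for a reference given as a namespace name or a CRN.

    crn_by_name maps each live namespace name to its CRN (hypothetical data).
    """
    # Assumption: CRNs are recognizable by a "crn:" prefix.
    if reference.startswith("crn:"):
        if reference not in crn_by_name.values():
            raise KeyError(f"no live namespace with CRN {reference}")
        return reference
    # Otherwise treat the reference as a logical name.
    return crn_by_name[reference]


# Hypothetical account state: one live namespace named "foo".
live = {"foo": "crn:example:namespace:foo/1111"}

by_name = resolve_namespace("foo", live)
by_crn = resolve_namespace("crn:example:namespace:foo/1111", live)
```

The design point is that a name resolves to whichever namespace currently holds it, while a CRN pins one specific namespace, so in this model a reference to a deleted namespace's CRN fails instead of silently picking up a newer namespace with the same name.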
Figure 4. Altus Console makes choosing the correct namespace easy by showing a dropdown with all namespaces in the user’s Altus account.
And that is all there is to it. In every other respect, creating an SDX-enabled cluster is the same as creating a regular Altus cluster. Once the cluster is ready, it will appear prepopulated with schema and other metadata available in the associated SDX namespace.
Sign up for immediate Beta access to Cloudera Altus SDX and write some queries!
Vasili Zolotov is the Tech Lead for Altus SDX