
Every Second Counts - Up to 80% Savings Potential Hidden in Cloud Service Costs

Routine reporting of business performance indicators can quietly accumulate long-hidden errors and unnecessary costs in an organization's cloud services. In this blog, I will discuss where these hidden errors typically originate and how costs can be optimized in the Databricks and Microsoft Fabric environments.

Hidden errors and costs in cloud services typically start accumulating in the planning phase, or rather, from a lack of planning. It is not uncommon, for example, for data pipelines to be left running on default settings, for systems never to be turned off, or for jobs not to be properly terminated. Clusters may grind through entire datasets every weekend, producing results nobody uses, without anyone noticing.

Preventing this does not require a six-month customization project. Often a few extra hours of planning time and asking the right questions are enough to stop these issues from arising, and the investment in overall planning yields significant savings in the long run. The foundation of cost-optimized data operations is cloud-native data warehousing. From a cost perspective, the best solutions address two components: data storage and optimizing the use of cloud computing capacity.

From Linear to Logarithmic Cost Development

Business performance indicator reporting is often the first data management use case for which the benefits of cloud technology and data are sought. Typically, the solution is first piloted in one part of the organization or in a single business unit. The pilot generates new management and production control tools, excitement, and a desire to expand the solution more broadly across the organization's units.

As an organization's data operations begin to expand, solutions start to develop under the hood that ignore what other units are doing. Hidden costs accumulate when each business unit expands its own data pipeline from its own starting points, processes data from its own perspective, and does not consider, for example, the possibilities of data sharing. Datasets of thousands or tens of thousands of rows may then be processed several times a day, only to produce the same results that a single run over the dataset would have achieved. Over time, this leads to linearly increasing and long-hidden costs. Instead of a linear growth curve, the goal should be logarithmic cost development: the more data and computing, the lower the cost relative to the amount of data and results.
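To make the silo effect concrete, here is a small back-of-the-envelope sketch. All figures (number of units, runs per day, cost per run) are illustrative assumptions, not from any real customer case:

```python
# Hypothetical illustration: five business units each running their own copy
# of the same 10,000-row transformation four times a day, versus one shared
# pipeline producing the dataset once per scheduled run.
RUNS_PER_DAY = 4
UNITS = 5
COST_PER_RUN = 2.50  # assumed cost in EUR per pipeline run

# Each unit maintains and runs its own duplicate pipeline.
siloed_daily_cost = UNITS * RUNS_PER_DAY * COST_PER_RUN

# One shared pipeline delivers the same results to all units.
shared_daily_cost = RUNS_PER_DAY * COST_PER_RUN

annual_overspend = (siloed_daily_cost - shared_daily_cost) * 365
print(f"siloed: {siloed_daily_cost:.2f} EUR/day, "
      f"shared: {shared_daily_cost:.2f} EUR/day, "
      f"annual overspend: {annual_overspend:.2f} EUR")
```

The absolute numbers are invented, but the structure is the point: duplicated pipelines multiply cost linearly with the number of units while producing identical output.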

Delta Lake Solves Most Cost Issues

Delta Lake, integrated into Databricks and Microsoft Fabric, offers a comprehensive approach to cost optimization throughout the data lifecycle. It is a kind of intermediate data warehouse layer that operates on top of an existing raw data store. Often, simply adopting Delta Lake resolves most data processing cost issues. At best, the solution has produced up to 80 percent savings in cloud costs.

What is Delta Lake?
  • Enables ACID transactions ensuring data integrity during reads and writes. This reduces the need for costly data recovery processes or complex workarounds in case of consistency issues.
  • Enables efficient and scalable metadata processing. This allows it to handle petabytes of data in billions of files with minimal performance impact.
  • Processes batch and real-time (streaming) data in the same manner. This ensures that data processing pipelines are built as efficiently as possible. A unified processing approach reduces the need for separate systems for different types of data workloads, directly reducing infrastructure and maintenance costs.
  • Reduces the amount of data scanned during queries. This not only speeds up query times but can also lower data processing costs in the cloud environment.
  • Compresses and indexes data, improving query performance, which can significantly reduce the computing resources and costs required for data processing tasks.
  • Supports incremental data loading and queries of previous versions. This means that not all data needs to be reprocessed with each new batch of data. Incremental data processing features reduce operational costs by processing only the data that has changed since the last run.
  • Streamlines and simplifies ETL processes (Extract, Transform, Load), reducing the costs associated with data input and preparation.


Task Clusters Save Money

Databricks and Microsoft Fabric enable the optimization of cloud computing capacity as well as the cost-effective definition and processing of customized computing tasks. Task clusters are defined for performing specific, bounded jobs so that they use only the computing capacity needed to perform the task. Jobs can be scheduled to run at certain times, for example when computing capacity is cheaper than during peak hours. A job can also be scheduled on Microsoft's spot capacity, where the company sells excess data center capacity at a discount. Databricks, for example, automates cluster management from deployment through scaling to termination. This reduces operational costs for users and significantly improves capacity utilization.

A task cluster is defined according to size, type (e.g., CPU- or GPU-based), and other settings such as automatic scaling, based on end-user needs. In Databricks, a task refers to a computing task or series of tasks, such as data processing, analysis, or training machine learning models. A job is defined by one or more tasks, which can be notebook tasks, Spark JAR tasks, Python tasks, and so on. Each task defines the code to be executed and the parameters it requires.
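Put together, a cost-conscious job definition might look like the following Databricks Jobs API 2.1 payload, expressed here as a Python dict. Cluster sizes, node types, the notebook path, and the schedule are illustrative assumptions:

```python
# Hedged sketch of a Databricks Jobs API 2.1 job spec: a right-sized job
# cluster with autoscaling, Azure spot capacity with on-demand fallback,
# and a nightly off-peak schedule. All names and values are illustrative.
job_spec = {
    "name": "nightly-sales-refresh",
    "job_clusters": [{
        "job_cluster_key": "small_etl_cluster",
        "new_cluster": {
            "spark_version": "14.3.x-scala2.12",
            "node_type_id": "Standard_DS3_v2",  # CPU-based; choose GPU types for ML training
            "autoscale": {"min_workers": 1, "max_workers": 4},
            "azure_attributes": {
                # Discounted surplus capacity, falling back to on-demand if evicted.
                "availability": "SPOT_WITH_FALLBACK_AZURE",
                "first_on_demand": 1,  # keep the driver on on-demand capacity
            },
        },
    }],
    "tasks": [{
        "task_key": "refresh_sales",
        "job_cluster_key": "small_etl_cluster",
        "notebook_task": {"notebook_path": "/Repos/etl/refresh_sales"},
    }],
    # Run at 02:00 UTC, outside peak hours; the cluster exists only for the run.
    "schedule": {"quartz_cron_expression": "0 0 2 * * ?", "timezone_id": "UTC"},
}
print(job_spec["name"])
```

The key cost levers are all visible in one place: the cluster is created per run and terminated afterwards, autoscaling caps the worker count, and spot availability buys the compute at a discount.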

Databricks and MS Fabric
  • Enable modern data warehousing and the layering needed to provide a flexible solution tailored to the organization's needs.
  • Are designed for handling and analyzing large volumes of data, integrating data preparation, data science, machine learning, and business analytics into a single platform. Collaborative teams do not need to transfer data between different tools and platforms.
  • Are built on Apache Spark, which offers fast analytics and data processing for large volumes of data through distributed computing. Users can easily scale computing resources up or down according to data processing requirements.
  • Provide a built-in collaborative notebook environment where users can write and execute code (e.g., Python, Scala, SQL), visualize data, and share results with their team.

Shared Clusters Keep the Machine Warm

Where task clusters are intended only for scheduled runs, shared clusters make it possible to run data workloads alongside development work. Cost issues typically begin when each developer is allocated a personal cluster instead of a shared one: costs may be incurred by 10–20 clusters when two would suffice.

Shared clusters speed up data processing in data centers. They allow changes to be made without the cluster shutting down in between. When starting a machine takes a few minutes, unoptimized development habits begin to accumulate hidden costs, which over time add up to significant expenses. In Databricks, for example, usage is billed to the second, so small streams eventually form a strong flow of costs if development practices are not cost-optimized and commonly agreed upon.
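Because billing is per second, the effect of idle personal clusters is easy to estimate. The rates and idle pattern below are assumptions chosen only to show the shape of the calculation:

```python
# Per-second billing sketch: every second a cluster sits idle before
# auto-termination is still billed. Rates are illustrative assumptions.
DBU_RATE_EUR = 0.35   # assumed price per DBU-hour
DBUS_PER_HOUR = 8     # assumed DBU consumption of one running cluster

def run_cost(seconds: int) -> float:
    """Cost of keeping one cluster up for the given number of seconds."""
    return seconds / 3600 * DBUS_PER_HOUR * DBU_RATE_EUR

IDLE_SECONDS = 10 * 60  # ten idle minutes before each auto-termination
SESSIONS_PER_DAY = 3

# 20 developers, each with a personal cluster idling three times a day:
idle_waste_per_day = 20 * SESSIONS_PER_DAY * run_cost(IDLE_SECONDS)
# Two shared clusters with the same idle pattern:
shared_waste_per_day = 2 * SESSIONS_PER_DAY * run_cost(IDLE_SECONDS)

print(f"extra idle cost per day: "
      f"{idle_waste_per_day - shared_waste_per_day:.2f} EUR")
```

A few euros per day looks harmless, but multiplied across teams and working days it is exactly the "strong flow of costs" the article describes.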

Cost optimization is possible in many different ways. Tools familiar to us for achieving this include code optimization, reserved capacity, orchestration directly in Databricks, and Databricks Photon. More options can be found in Microsoft's documentation on Fabric's built-in features.

Data Culture Development Journey

The basic idea of cloud services is to enable centralized sharing of information to a wide audience. Too often, this basic premise is overlooked. Ultimately, it is about people and the capabilities of the organization. Cost surprises are often underpinned by similar characteristics: enthusiasm for expanding data development within the organization on the one hand, and a still-maturing data culture on the other. Decision-making on data development is driven by business objectives. This is the right approach, but decision-making structures often do not adequately consider what is happening under the hood of data development, and why.

Data development is often unnecessarily rigid, for example due to issues related to data access rights. It is essential to ask what results each business unit is aiming for and what the specific needs of data development are. For cost-efficiency, it is crucial to ask what kind of common data the different business units are able to utilize. The cost-effectiveness of cloud computing requires clear ownership and data management practices, as well as smooth cooperation between IT and the organization's business units.

Data development and applications must always be business-driven. At the same time, this requires business decision-makers to grasp the whole picture and have enough patience to achieve results cost-effectively. Building a data culture is a step-by-step development journey, where the pacing of the initial phase determines success. Seeking savings in the wrong part of the data development journey may ultimately prove to be an expensive detour masquerading as a shortcut.