Have you ever noticed that when you ask about data strategy, you sometimes receive a response about digitalization strategy instead? This can be confusing and frustrating, as it may not address your original question. Misunderstandings can easily happen when discussing complex concepts, even among professionals in the same field. Communication is challenging, and it's crucial to establish a shared language to ensure clarity. As data professionals, we often need to clarify terminology with business partners, even...
Marko Oja

Recent posts
Let's start right from the beginning with a disclaimer: Artificial intelligence still doesn't know how to figure out data quality deficiencies. It can't even search for them very well on its own. You need to show it some love for it to manage. But you can already get quite a few benefits for the task at hand from machine learning as it is today. Data profiling is one of my all-time favorite data development tools. A few years ago, I got to know the Pandas Profiling Python library, which does so much of the work...
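For readers who haven't tried the library, here is a minimal sketch of what its basic usage looks like; the file name and report title are placeholders, and note that the package has since been renamed ydata-profiling.

```python
# A minimal sketch of profiling a DataFrame with pandas-profiling
# (the package has since been renamed ydata-profiling); the CSV path
# and report title are placeholders, not from the original post.
import pandas as pd
from pandas_profiling import ProfileReport

df = pd.read_csv("customers.csv")          # any tabular dataset
profile = ProfileReport(df, title="Customer data profile", minimal=True)
profile.to_file("customer_profile.html")   # interactive HTML report
```

The generated report covers distributions, missing values, correlations and duplicates per column, which is most of the manual profiling work in one call.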
I don't recall any single technology during my career generating the kind of hype that OpenAI's ChatGPT has. As a regular person, I am just as caught up in that hype as the next person. But for it to revolutionize the data industry? Well, for that we might still have a bit of a way to go. On that path, however, Azure OpenAI is a hefty step in the right direction. Here is a story of my first impressions trying out the new service offering. I wanted to do a text analysis that would, instead of just picking up words, use...
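As a rough illustration of the kind of text analysis I mean, here is a hedged sketch using the openai Python package against an Azure OpenAI deployment; the endpoint, deployment name, environment variables and prompt are illustrative placeholders, not what I actually used.

```python
# A hedged sketch of text analysis via an Azure OpenAI deployment.
# The endpoint, API version, deployment name and sample text are
# placeholders for illustration only.
import os
from openai import AzureOpenAI

client = AzureOpenAI(
    azure_endpoint=os.environ["AZURE_OPENAI_ENDPOINT"],
    api_key=os.environ["AZURE_OPENAI_API_KEY"],
    api_version="2024-02-01",
)

text = "The delivery was late again and support never answered my emails."
response = client.chat.completions.create(
    model="gpt-35-turbo",  # name of your Azure OpenAI deployment
    messages=[
        {"role": "system",
         "content": "Summarise the key themes and the sentiment of the text."},
        {"role": "user", "content": text},
    ],
)
print(response.choices[0].message.content)
```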
In Power BI, as in quite a few other analytics tools, there have always been challenges when moving to really large amounts of data, such as analyzing a table with a billion rows. The usual answer to this kind of problem has been to store part of the data in the analytics tool's memory, which is limited, and to direct detailed queries to a database. A few years back, while testing how Databricks would perform against Power BI's direct queries, I was...
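As a sketch of that pattern, the snippet below pre-computes a small aggregate table in Databricks that an analytics tool could hold in memory, while detailed queries still go against the full table; the table and column names are made up for illustration.

```python
# A hedged sketch of the "aggregate in memory, detail in the database" pattern:
# build a compact pre-aggregated Delta table from a very large detail table.
# Table and column names are illustrative placeholders.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

detail = spark.table("sales_detail")            # e.g. ~1 billion rows, queried directly for drill-down
daily = (
    detail.groupBy("sale_date", "store_id")
          .agg(F.sum("amount").alias("total_amount"),
               F.count("*").alias("row_count"))
)
daily.write.format("delta").mode("overwrite").saveAsTable("sales_daily_agg")
```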
There are many different services on Azure that are used for IIoT solutions. I playfully call this versatile stack a jungle, and why not: it has a few different layers, both from a usage perspective and from an architectural one. But before we go into the details, let's take a step back to see the forest for the trees, so to speak. Download our blueprint about IIoT.
I 💙 pandas, I really do. And I also like to do data profiling. So I can't possibly go wrong with panda profiling, right? Well, jokes aside, in my opinion data profiling should, at least to some extent, be a standard practice in all data processing work. For me and many of my colleagues, though, our history with data development goes back far further, to the time before we had all these nice tools we have now. The process of checking data issues used to be mainly manual. Some of us who are lazy enough...
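For comparison, this is roughly what those manual checks look like in plain pandas; the file and column names are placeholders.

```python
# A minimal sketch of the kind of manual data checks that profiling tools
# now automate; the file and column names are placeholders.
import pandas as pd

df = pd.read_csv("orders.csv")

print(df.describe(include="all"))       # basic stats for every column
print(df.isna().mean().sort_values())   # share of missing values per column
print(df["status"].value_counts())      # category frequencies for one column
print(df.duplicated().sum())            # number of fully duplicated rows
```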
A week after I started writing the first blog post covering the Great Expectations framework, I am back at it again. First I managed to create a custom expectation (i.e., a custom data validation rule), after which I investigated the more formal way of using the framework. Here's how it went and what I learned.
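For context, here is a minimal sketch of the framework's ad hoc validation style, assuming the older pandas-dataset API of pre-1.0 Great Expectations; the column names are placeholders, and the custom expectation itself is not shown.

```python
# A minimal sketch of ad hoc validation with Great Expectations, assuming the
# older pandas-dataset style API (great_expectations < 1.0). File and column
# names are placeholders, not the ones used in the original post.
import pandas as pd
import great_expectations as ge

df = pd.read_csv("orders.csv")
gdf = ge.from_pandas(df)

result = gdf.expect_column_values_to_not_be_null("order_id")
print(result.success)

result = gdf.expect_column_values_to_be_between("amount", min_value=0, max_value=100000)
print(result.success)
```

The "more formal" way adds a data context, expectation suites and checkpoints on top of these single-call validations.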
So, here’s the deal. One weekend I found myself stuck at home in Covid quarantine, waiting for my kid's test results. Instead of watching TV the entire Sunday, I decided I might try to use my time doing something a bit more productive.
If you follow any trends in the data world, you have probably heard about the Data Lakehouse. It’s a new architecture proposed by Databricks that uses Delta tables as the data lake storage format.
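As a minimal, hedged sketch of what that means in practice, the snippet below writes a DataFrame as a Delta table and reads an earlier version back via time travel; the paths are placeholders.

```python
# A hedged sketch of the Delta Lake pattern behind the Lakehouse idea:
# write a DataFrame as a Delta table, then read an older version back.
# Paths are illustrative placeholders.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

df = spark.read.option("header", True).csv("/mnt/raw/customers.csv")
df.write.format("delta").mode("overwrite").save("/mnt/datalake/customers")

# Delta keeps a transaction log, so earlier versions stay queryable:
previous = spark.read.format("delta").option("versionAsOf", 0).load("/mnt/datalake/customers")
```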
In a previous blog post I talked about how a Databricks Data Lakehouse can be created with a low-code implementation only. That is almost true. The system needs to be set up, and for that initial configuration some code is needed. What this code does is create a mount to the storage account that will be used as storage for the Delta tables. Fortunately this code is well documented, and there are multiple guides for accomplishing this, like this one:...
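For reference, a hedged sketch of such a mount looks roughly like the snippet below; it must be run inside a Databricks notebook (where dbutils is available), and the secret scope, storage account, container and tenant values are placeholders.

```python
# A hedged sketch of mounting an ADLS Gen2 container in Databricks with a
# service principal. Secret scope, account, container and tenant values are
# illustrative placeholders; dbutils is only available in a Databricks notebook.
configs = {
    "fs.azure.account.auth.type": "OAuth",
    "fs.azure.account.oauth.provider.type":
        "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider",
    "fs.azure.account.oauth2.client.id": dbutils.secrets.get("kv-scope", "sp-client-id"),
    "fs.azure.account.oauth2.client.secret": dbutils.secrets.get("kv-scope", "sp-client-secret"),
    "fs.azure.account.oauth2.client.endpoint":
        "https://login.microsoftonline.com/<tenant-id>/oauth2/token",
}

dbutils.fs.mount(
    source="abfss://lakehouse@mystorageaccount.dfs.core.windows.net/",
    mount_point="/mnt/lakehouse",
    extra_configs=configs,
)
```

Once the mount exists, the Delta tables can be written and read through the /mnt/lakehouse path without any further code in the pipelines themselves.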