In Power BI, as in quite a few other analytics tools, working with really large amounts of data has always been a challenge: analyzing a table with a billion rows, for example. The usual answer has been to store part of the data in the analytics tool's memory, which is limited, and to direct the detailed queries to a database. A few years back, while testing how Databricks would perform against Power BI's DirectQuery, I was disappointed. The speed was not what I expected. And I heard a similar comment from a report developer just last week. But surely things must have improved somewhat in the last couple of years?
I had heard that managed tables (data stored in the Databricks workspace's own storage) should be faster than those on "external" storage. I had also read an article about the delta cache, which keeps copies of remote data on the cluster nodes' local storage. The thought began to tickle in my mind: put all of these together and get to testing. Surely the technology had matured a bit by now. Nothing more to it than loading the old 30M-row data set and setting up the processing cluster. Views in place and the Power BI report running. The results were confusing.
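For context, the delta cache (nowadays referred to in Databricks documentation as the disk cache) is toggled per cluster with a Spark setting rather than in the query itself. A minimal sketch of the cluster's Spark config, assuming a recent Databricks runtime:

```
# Spark config entry on the cluster: enable the disk (delta) cache
spark.databricks.io.cache.enabled true
```

On many cache-accelerated node types this is on by default, which is worth checking before crediting the cache for any speedup.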
The report with the 30-million-row dataset didn't even cough. What in the world was happening here? External storage and internal cache had an almost imperceptibly small difference in speed. I had created a new cluster with all the latest versions, but the difference was far too big to be explained by a software upgrade alone. Alongside it, I ran the "basic" cluster I normally use for development. Even after upgrading it to the same versions, its speed remained at the more traditional level: queries did get answered, but slowly. On the new cluster, I had clicked on, without a second thought, the Photon Accelerator, something I had heard about at a previous Databricks presentation.
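For what it's worth, Photon requires no code changes; it is just a cluster-level option (the checkbox mentioned above, or a field in the Clusters API). A hedged sketch of the relevant part of a cluster spec; the runtime version and node type here are placeholders, not recommendations:

```
{
  "cluster_name": "photon-test",
  "spark_version": "13.3.x-scala2.12",
  "node_type_id": "Standard_E8ds_v4",
  "num_workers": 2,
  "runtime_engine": "PHOTON"
}
```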
Indeed, in August of last year (2022), Databricks introduced a new piece of functionality: Photon. Based on the marketing claims, at least, this ground-breaking query engine should be something never seen before and wondrous. Well, we have all heard promises like these from time to time. Snowflake is 9 times faster than Azure SQL DW, Databricks is 20 times faster than Snowflake, and so on. (Disclaimer: I didn't actually check the figures, but claims in this ballpark are being made.) Still, all "databases" perform quite sluggishly when you need to crunch large amounts of data without a data architect polishing the SQL queries in between. So, would this feature be just hype and fluff, like so many others before it?
After a painful 15-minute wait, the data was in. I whipped up a Power BI report with a "minimum effort" model. And yes, it must be said: it was surprisingly fast. So that you don't have to rely only on the subjective opinion of the developer (aka me), here is a video where you can evaluate the Power BI report's refresh times yourself. Some queries Power BI could not handle; the 10-million-row limit restricts the visuals. The Delta tables were partitioned by date, and using the partition column is key when doing analysis of this magnitude. Calculations run over the entire mass finish in a fairly reasonable time, but when you select, for example, a week's worth of data (out of 365 days), the results are exhilarating.
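The partitioning itself is a one-liner in the table definition. A minimal sketch in Databricks SQL, with made-up table and column names, showing a date-partitioned Delta table and a query whose filter on the partition column lets the engine skip everything outside the selected week:

```sql
-- Hypothetical fact table, partitioned by the date column
CREATE TABLE sales_fact (
  sale_id   BIGINT,
  amount    DECIMAL(18, 2),
  sale_date DATE
)
USING DELTA
PARTITIONED BY (sale_date);

-- Filtering on sale_date prunes partitions, so only
-- one week's files are read instead of the full year
SELECT sale_date, SUM(amount) AS total
FROM sales_fact
WHERE sale_date BETWEEN '2023-01-02' AND '2023-01-08'
GROUP BY sale_date;
```

If the report's slicers map onto the partition column, Power BI's generated DirectQuery SQL gets the same pruning benefit for free.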