There are several uses of federated query engines that have emerged over the last few decades. Federated analytics technologies were each purpose-built for specific use cases, so it's important to understand those use cases when choosing the right engine.

Drill downs

Federated queries were first used to simplify data access for application integration and Web services. Eventually products like Composite emerged to let applications query across multiple sources without having to move the data or write a lot of code. Application integration and Web services depended on this type of architecture to simplify development.

They were also used for drill down operations in business intelligence tools, to let you "drill back" from a report to the original source and raw data. I remember using this in the late 90s within an ad hoc query tool. It's just not what Presto, Athena or Dremio were designed for. Since they are not part of the ETL process, you would have to build each drilldown manually and expose it as a service you could call from within a BI tool.

Self-service analytics

About a decade ago, some BI and data integration vendors, including Informatica, which is where I was at the time, realized that an analyst could get around the data warehouse and query multiple data sources on their own to quickly build a new report. You could then pass the metadata over to the data warehouse team, who could convert the federated query into an ETL job. This was the birth of self-service analytics. The idea and its technologies later moved into related areas such as data wrangling.

Self-service analytics is an ideal use case for federated queries. There is always a need to create new reports that merge data in the warehouse with some new data. Performance is not as important as getting the report done within hours or a few days. Waiting weeks for a Spark job to be written and run, or for a data warehouse to be tested, is unacceptable. If you implement Presto for self-service analytics, make sure your analysts can do the work on their own, or that at the very least there is an SLA of hours for delivering new data. Otherwise your analysts will go around your query engine and use the #1 engine and BI tool that has been the leader for over 30 years: the spreadsheet.

Federated analytics

Federated analytics mostly started with BI tools building federated queries within the thick clients of the 90s and 2000s. As data exploded and some departmental BI tools started to become enterprise-wide, the clients couldn't scale, so some BI tool vendors started to create server-based versions. In 2012, a few years after self-service analytics started, Facebook built Presto to execute distributed queries at scale across Hadoop and other data sources. They contributed it to open source in 2013. Since then, Presto has become the federated query engine of choice for directly querying data lakes. Amazon Athena, which was released at the end of 2016, is built on Presto. Redshift Spectrum, which is basically Athena inside a Redshift VPC, was released in April 2017. So many vendors rely on federated queries and caching that you need to know how to identify this in a product, because they bring limitations.
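
To make the self-service case a bit more concrete, here is a minimal sketch of what "merging data in the warehouse with some new data" looks like as a single federated query. It is not Presto: I'm using Python's built-in sqlite3 module with two attached in-memory databases as stand-ins for two separate sources, and the table names (`customers`, `web_orders`), the schema aliases (`warehouse`, `newdata`) and the helper `revenue_by_region` are all made up for illustration. The point is only the shape of the query, one SQL statement that joins across sources without first moving the data through ETL.

```python
import sqlite3

# One connection, two attached in-memory databases. Each attached database
# is a stand-in (an assumption for this sketch) for an independent source:
# "warehouse" holds curated data, "newdata" holds a feed not yet loaded by ETL.
conn = sqlite3.connect(":memory:")
conn.execute("ATTACH DATABASE ':memory:' AS warehouse")
conn.execute("ATTACH DATABASE ':memory:' AS newdata")

# Hypothetical tables and sample rows, purely illustrative.
conn.execute("CREATE TABLE warehouse.customers (id INTEGER PRIMARY KEY, region TEXT)")
conn.execute("CREATE TABLE newdata.web_orders (customer_id INTEGER, amount REAL)")
conn.executemany(
    "INSERT INTO warehouse.customers VALUES (?, ?)",
    [(1, "EMEA"), (2, "APAC")],
)
conn.executemany(
    "INSERT INTO newdata.web_orders VALUES (?, ?)",
    [(1, 120.0), (1, 30.0), (2, 75.5)],
)


def revenue_by_region(connection):
    """Join the two sources in a single query and total revenue per region."""
    query = """
        SELECT c.region, SUM(o.amount) AS revenue
        FROM warehouse.customers AS c
        JOIN newdata.web_orders AS o ON o.customer_id = c.id
        GROUP BY c.region
        ORDER BY c.region
    """
    return connection.execute(query).fetchall()


rows = revenue_by_region(conn)
for region, revenue in rows:
    print(region, revenue)
```

In an engine like Presto the two schema prefixes would instead be catalogs pointing at real systems, but the analyst's experience is the same: write one query, get one report, and hand the query to the warehouse team later if it needs to become an ETL job.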