I was manning the Data Discovery & Visualization and Data Warehousing booths at TechEd Europe last week, and we saw lots of excitement over Data Explorer. One of the questions I got was about how Data Explorer does its filtering, and I realized there hasn’t been much shared about this yet. The general assumption seems to be that Data Explorer pulls all rows into Excel and then filters them in memory (as that’s how you build your steps in the UI), but in fact it’s a lot smarter than this: it automatically pushes filters directly into the source query. The team calls this technique “Query Folding”. It is an extremely powerful feature, especially in a “self service ETL” tool where many users aren’t thinking about query performance. Unfortunately, it’s not immediately obvious that this feature exists unless you monitor the underlying queries it uses, so let’s do that now.
Filters
From Data Explorer, we’ll connect to a SQL Server instance:
We’ll read from the DimGeography dimension of the AdventureWorksDW sample:
Take the first five columns:
- GeographyKey
- City
- StateProvinceCode
- StateProvinceName
- CountryRegionCode
Click Done to load the data into Excel. We’ll then launch SQL Profiler, and connect it to our SQL instance. Once profiler is running, we can click the Refresh button in Excel, and see the queries that get executed:
We can see a (surprising) number of queries against SQL Server. The first “set” fetches metadata about data types, columns, and indexes (you’ll see lots of SELECT statements against sys.objects and sys.indexes). We’ll ignore these for now (but they’d make a great topic for a future post).
After retrieving metadata information, we see a query against the DimGeography table.
select [GeographyKey], [City], [StateProvinceCode], [StateProvinceName], [CountryRegionCode] from [dbo].[DimGeography] as [$Table]
Already we can see that Data Explorer is smarter than the average query tool. Even though we selected the full table in the UI before hiding certain columns, the source query contains only the columns we want.
Let’s open the query again in Excel by clicking the Filter & Shape button.
With the query open, we’ll add a filter on the CountryRegionCode field. Click on the field title, select Text Filters | Equals …
We’ll filter on CA.
Leaving us with our records from Canada.
We can save the query by clicking Done, clear our current profiler trace and refresh the workbook to see the updated SQL query.
select [_].[GeographyKey], [_].[City], [_].[StateProvinceCode], [_].[StateProvinceName], [_].[CountryRegionCode] from ( select [GeographyKey], [City], [StateProvinceCode], [StateProvinceName], [CountryRegionCode] from [dbo].[DimGeography] as [$Table] ) as [_] where [_].[CountryRegionCode] = N'CA' and [_].[CountryRegionCode] is not null
We see the query has gotten a bit more complicated, but it now contains the ‘CA’ filter we specified in the UI.
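To make the difference concrete, here is a minimal Python sketch that contrasts the two approaches. It is not what Data Explorer runs: it assumes an in-memory SQLite table as a stand-in for DimGeography, the sample rows are made up, and the helper names (filter_in_memory, filter_folded) are hypothetical. The folded version loosely mirrors the shape of the query above, with the filter applied by the database rather than by the client.

```python
import sqlite3

# Stand-in for dbo.DimGeography (assumption: SQLite instead of SQL Server).
# The rows below are made-up sample data, not the real AdventureWorksDW content.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE DimGeography ("
    "GeographyKey INTEGER PRIMARY KEY, City TEXT, StateProvinceCode TEXT, "
    "StateProvinceName TEXT, CountryRegionCode TEXT, PostalCode TEXT)"
)
conn.executemany(
    "INSERT INTO DimGeography VALUES (?, ?, ?, ?, ?, ?)",
    [
        (1, "Calgary", "AB", "Alberta", "CA", "T2P"),
        (2, "Seattle", "WA", "Washington", "US", "98104"),
        (3, "Toronto", "ON", "Ontario", "CA", "M4B"),
        (4, "Paris", "75", "Seine (Paris)", "FR", "75002"),
    ],
)

# The five columns kept in the walkthrough.
COLUMNS = "GeographyKey, City, StateProvinceCode, StateProvinceName, CountryRegionCode"


def filter_in_memory(country):
    """Naive approach: pull every row to the client, then filter there."""
    rows = conn.execute(f"SELECT {COLUMNS} FROM DimGeography").fetchall()
    return [row for row in rows if row[4] == country]


def filter_folded(country):
    """Folded approach: the filter travels to the source as a WHERE clause."""
    sql = (
        f"SELECT t.* FROM (SELECT {COLUMNS} FROM DimGeography) AS t "
        "WHERE t.CountryRegionCode = ? AND t.CountryRegionCode IS NOT NULL"
    )
    return conn.execute(sql, (country,)).fetchall()


in_memory_rows = filter_in_memory("CA")
folded_rows = filter_folded("CA")

# Both approaches return the same rows; only the place where the work happens differs.
print(sorted(folded_rows) == sorted(in_memory_rows))
```

Both functions return the same rows. The difference is how much data crosses the wire: the in-memory version transfers the whole table before discarding most of it, while the folded version only ever receives the matching rows.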
Sources and Other Types of Folding
Data Explorer isn’t able to do query folding for every source (e.g. there is no “query” when reading from a flat file), but it does so where it can. Here is an (unofficial) list of supported sources from the Data Explorer team:
- SQL Databases
- OData and OData based sources, such as the Windows Azure Marketplace and SharePoint Lists
- Active Directory
- HDFS.Files, Folder.Files, and Folder.Contents (for basic operations on paths)
I should also note that “filters” aren’t the only type of query folding that Data Explorer can do. If you watch the queries, you’ll see that other operations, such as column removal, renaming, joins, and type conversions, are pushed as close to the source as possible. (I’ll explore this in a future post.)
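One way to picture folding across several steps is as a function that wraps the query built so far in a new outer query for each step. The toy Python sketch below is purely illustrative: the step names (select_columns, filter_equals) and the fold_steps helper are hypothetical, and this is not how Data Explorer is actually implemented. It only reproduces the nested "( ... ) as [_]" shape seen in the Profiler trace.

```python
def fold_steps(table, steps):
    """Fold a list of (kind, argument) steps into one nested SQL string.

    Hypothetical illustration only: each step wraps the previous query
    in a subquery aliased as [_], similar in shape to the traced SQL.
    """
    sql = f"select * from {table}"
    for kind, arg in steps:
        if kind == "select_columns":
            # Keep only the listed columns.
            cols = ", ".join(f"[_].[{name}]" for name in arg)
            sql = f"select {cols} from ( {sql} ) as [_]"
        elif kind == "filter_equals":
            # Add an equality filter; single quotes are doubled so the
            # literal stays well-formed (a real tool would use parameters).
            column, value = arg
            escaped = value.replace("'", "''")
            sql = (
                f"select [_].* from ( {sql} ) as [_] "
                f"where [_].[{column}] = N'{escaped}'"
            )
        else:
            raise ValueError(f"step cannot be folded: {kind}")
    return sql


folded_sql = fold_steps(
    "[dbo].[DimGeography]",
    [
        ("select_columns", ["City", "CountryRegionCode"]),
        ("filter_equals", ("CountryRegionCode", "CA")),
    ],
)
print(folded_sql)
```

In this sketch an unknown step raises an error. A real tool would more plausibly run the foldable prefix at the source and apply the remaining steps locally, which is in keeping with the "where it can" behaviour described above.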
Preview warning – this information is current for the June 2013 build of the Data Explorer preview (Version: 1.2.3263.4). The information contained in this post may change prior to RTM.