My previous blog post showed how to iterate over a set of web pages in Power Query using a parameterized function. The post contained two queries – the GetData function, and a query to invoke it over a set number of pages.
GetData function
(page as number) as table =>
let
    Source = Web.Page(Web.Contents("http://boxofficemojo.com/yearly/chart/?page=" & Number.ToText(page) & "&view=releasedate&view2=domestic&yr=2013&p=.htm")),
    Data1 = Source{1}[Data],
    RemoveBottom = Table.RemoveLastN(Data1,3)
in
    RemoveBottom
Query to invoke it
let
    Source = {1..7},
    ToTable = Table.FromList(Source, Splitter.SplitByNothing(), null, null, ExtraValues.Error),
    Renamed = Table.RenameColumns(ToTable,{{"Column1", "Page"}}),
    Added = Table.AddColumn(Renamed, "Custom", each GetData([Page])),
    Expand = Table.ExpandTableColumn(Added, "Custom", {"Rank", "Movie Title (click to view)", "Studio", "Total Gross /Theaters", "Total Gross /Theaters2", "Opening /Theaters", "Opening /Theaters2", "Open", "Close"}, {"Custom.Rank", "Custom.Movie Title (click to view)", "Custom.Studio", "Custom.Total Gross /Theaters", "Custom.Total Gross /Theaters2", "Custom.Opening /Theaters", "Custom.Opening /Theaters2", "Custom.Open", "Custom.Close"})
in
    Expand
This approach uses a pre-generated list of page numbers (Source = {1..7}), which works well if you know the range of pages you want to access. But what do you do if you don’t know the range upfront?
The Power Query Formula Language (M) is (partially) lazy – some steps won’t be fully evaluated until the data they reference is needed. We’ll use this capability to define a query that iterates over a large number of pages (10,000), but dynamically stops itself once the first error is hit.
let
    PageRange = {1..10000},
    Source = List.Transform(PageRange, each try {_, GetData(_)} otherwise null),
    First = List.FirstN(Source, each _ <> null),
    Table = Table.FromRows(First, {"Page", "Column1"}),
    Expanded = Table.ExpandTableColumn(Table, "Column1", {"Rank", "Movie Title (click to view)", "Studio", "Total Gross /Theaters", "Total Gross /Theaters2", "Opening /Theaters", "Opening /Theaters2", "Open", "Close"}, {"Rank", "Movie Title (click to view)", "Studio", "Total Gross /Theaters", "Total Gross /Theaters2", "Opening /Theaters", "Opening /Theaters2", "Open", "Close"})
in
    Expanded
Let’s break this down:
Line 2 (PageRange) defines a large range of page numbers (1 to 10,000).
Line 3 (Source) uses List.Transform to invoke a function (GetData) over each value in the list. It wraps the call in a try…otherwise expression, which catches errors raised by GetData. If an error occurs, the otherwise clause returns null.
Line 4 (First) uses List.FirstN, and passes in a condition (each _ <> null) that essentially says to take items from the list until the first null is reached.
Line 5 (Table) converts the list to a table, and then Line 6 (Expanded) fully expands the table to get at the data.
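To see try…otherwise and List.FirstN in isolation, here is a minimal sketch that runs without any web requests. FakeGetData is a hypothetical stand-in for GetData (it is not part of the original queries): it returns text for pages 1 to 3 and raises an error for anything after that.

```powerquery
let
    // Hypothetical stand-in for GetData: pages 1-3 succeed, later pages raise an error
    FakeGetData = (page as number) as text =>
        if page <= 3
        then "page " & Number.ToText(page)
        else error "No data on this page",

    // try ... otherwise turns each failed call into null instead of failing the query
    Results = List.Transform({1..5}, each try FakeGetData(_) otherwise null),
    // Results is {"page 1", "page 2", "page 3", null, null}

    // With a condition, List.FirstN keeps items until the condition first fails
    Kept = List.FirstN(Results, each _ <> null)
    // Kept is {"page 1", "page 2", "page 3"}
in
    Kept
```

The range here is deliberately tiny, so every item gets evaluated and the sketch shows only how the two pieces combine, not the lazy behavior of the full query.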
The key to this working is that List.Transform is lazily evaluated when the query ends in Table.ExpandTableColumn: the function that goes out and grabs the data from the page (GetData) won’t actually be called until the table is expanded. Since the query specifies that we only want rows up until we get our first error/null, the Power Query engine will stop making calls to GetData once it gets back a null value. In this example, we have 7 pages of data. Page 8 returns an error page with no table, which causes GetData to fail, so the try expression returns null for that page.
Important note: if you paste this code into the Power Query editor and click on any of the steps before the last one (Expanded), List.Transform will not be lazily evaluated. If you watch the requests being made (with Fiddler, for example), you’ll see Power Query trying to evaluate and access all 10,000 pages.
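If depending on lazy evaluation feels too fragile, one alternative is to make the stopping condition explicit with List.Generate. This is a different technique from the query above, offered only as a sketch: it asks for one page at a time and stops as soon as a page fails, so it never needs a 10,000-item range. FakeGetData and TryPage are hypothetical names used for illustration. To run it against the real site, you would swap FakeGetData for GetData and list the real column names in the expand step.

```powerquery
let
    // Hypothetical stand-in for GetData: 7 pages of data, then an error
    FakeGetData = (page as number) as table =>
        if page <= 7
        then #table({"Rank"}, {{page}})
        else error "No table on this page",

    // Returns the page's table, or null if the call raises an error
    TryPage = (page as number) => try FakeGetData(page) otherwise null,

    Pages = List.Generate(
        () => [Page = 1, Data = TryPage(1)],                      // initial state
        each [Data] <> null,                                      // keep going while the page loaded
        each [Page = [Page] + 1, Data = TryPage([Page] + 1)],     // fetch the next page
        each {[Page], [Data]}                                     // emit a {page, table} row
    ),
    AsTable = Table.FromRows(Pages, {"Page", "Data"}),
    Expanded = Table.ExpandTableColumn(AsTable, "Data", {"Rank"})
in
    Expanded
```

With this stub, the result should contain rows for pages 1 through 7. The first failing page (page 8) is requested once, and the loop ends there.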