Channel: Power Query – Matt Masson

Iterating over multiple pages of web data using Power Query


This is one of my go-to demos for showing off the Power Query Formula Language (M). The approach works well with websites that spread data over multiple pages and use one or more query parameters to specify which page or range of values to load. It works best when all pages have the same structure. This post walks through the end-to-end process, which includes:

  1. Creating the initial query to access a single page of data
  2. Turning the query into a parameterized function
  3. Invoking the function for each page of data you want to retrieve

This example uses the yearly box office results provided by BoxOfficeMojo.com.

You can find a video of this demo in the recording Theresa Palmer and I did at TechEd North America earlier this year.

Creating the initial query

Access one of the pages of data on the site using Power Query’s From Web functionality.


From Web actually generates calls to two separate M functions: Web.Contents to access the URL that you enter, and then another function based on the content type of the URL. In this case we are accessing a web page, so Web.Contents gets wrapped by a call to Web.Page. Power Query then brings up a Navigator that lets you select one of the tables found on the page. In this case, we’re interested in the second table on the page (labeled Table 1). Selecting it and clicking Edit will bring up the Query Editor.


From here, we can filter and shape the data as we want it. Once we’re happy with the way it looks, we will convert it to a function.

Turning the query into a parameterized function

Open the Advanced Editor to bring up the M code behind your query.


For the sake of the demo, I’ve kept the query simple. I’m just accessing the data, and I’ve removed the “Changed Type” step that Power Query automatically inserted for me. The only shaping I did was to remove the bottom 3 summary rows on the page. My code now looks like this:

let
    Source = Web.Page(Web.Contents("http://boxofficemojo.com/yearly/chart/?page=1&view=releasedate&view2=domestic&yr=2013&p=.htm")),
    Data1 = Source{1}[Data],
    RemoveBottom = Table.RemoveLastN(Data1,3)
in
    RemoveBottom

Note that the url value in the call to Web.Contents contains a query parameter (page) that specifies the page of data we want to access.
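
If you want to check which query parameters a URL carries before deciding what to swap out, Uri.Parts can break it apart. This is only an illustrative sketch and not part of the original demo; the step names (Url, Parts, Page) are made up for the example.

```powerquery
let
    // URL for page 1 of the 2013 yearly chart
    Url = "http://boxofficemojo.com/yearly/chart/?page=1&view=releasedate&view2=domestic&yr=2013&p=.htm",
    // Uri.Parts splits a URL into a record; its Query field is itself a record
    Parts = Uri.Parts(Url),
    // Query-string values come back as text, not numbers
    Page = Parts[Query][page]
in
    Page
```

The point to notice is that the page value lives in the URL as text, which is why a numeric parameter needs converting before it is concatenated back in.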

To turn this query into a parameterized function, we’ll add the following line before the let statement.

(page as number) as table =>

The two as clauses specify the expected data types for the page parameter (number) and the return value of the function (table). They are optional, but I like specifying types whenever I can.
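
To make the effect of the type annotations concrete, here is a small standalone sketch. The function names (DoubleAny, DoubleNumber) are hypothetical and exist only for this illustration.

```powerquery
let
    // Untyped: accepts any value; a bad argument only fails inside the body
    DoubleAny = (x) => x * 2,
    // Typed: a non-number argument raises an error when the function is invoked
    DoubleNumber = (x as number) as number => x * 2,
    // Both functions behave the same for a valid numeric argument
    Result = [Untyped = DoubleAny(21), Typed = DoubleNumber(21)]
in
    Result
```

With a number as input the two behave identically; the difference only shows up when a caller passes something unexpected, such as text.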

We’ve now turned our query into a function, and have a parameter we can use within the code. We are going to dynamically build up the query string, replacing the existing page value in the URL with the page parameter. Since we’ve indicated that page is a number, we will need to convert the value to text using the Number.ToText function. The updated code looks like this:

(page as number) as table =>
let
    Source = Web.Page(Web.Contents("http://boxofficemojo.com/yearly/chart/?page=" & Number.ToText(page) & "&view=releasedate&view2=domestic&yr=2013&p=.htm")),
    Data1 = Source{1}[Data],
    RemoveBottom = Table.RemoveLastN(Data1,3)
in
    RemoveBottom
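
As an alternative to string concatenation, current versions of Power Query let you pass the query parameters through the Query option of Web.Contents. The following is a sketch of that variation, not the code used in the demo, and it assumes the site still accepts the same parameters.

```powerquery
(page as number) as table =>
let
    // Query values must be text, so the page number is still converted
    Source = Web.Page(
        Web.Contents(
            "http://boxofficemojo.com/yearly/chart/",
            [Query = [
                page = Number.ToText(page),
                view = "releasedate",
                view2 = "domestic",
                yr = "2013",
                p = ".htm"
            ]]
        )
    ),
    Data1 = Source{1}[Data],
    RemoveBottom = Table.RemoveLastN(Data1, 3)
in
    RemoveBottom
```

Keeping each parameter in its own record field makes it harder to splice a value into the wrong place in the URL.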

Clicking Done on the advanced editor brings us back to the query editor. We now have a function expecting a parameter.


You can click on the Invoke button and enter a page value to test it out.


Be sure to delete the Invoked Function step, then give the function a meaningful name (like GetData). Once the function has been given a good name, click Close & Load to save the query.

Invoking the function for each page of data you want to retrieve

Now that we have a function that can get the data, we’ll want to invoke it for each page we want to retrieve. M doesn’t have any concept of Loops – to perform an action multiple times, we’ll need to generate a List (or Table) of values we want to act on.
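
As a small illustration of working over a list instead of looping, List.Transform applies a function to every item. The step names here (Pages, Labels) are just for the example.

```powerquery
let
    Pages = {1..3},
    // List.Transform applies the function to every item, playing the role of a loop
    Labels = List.Transform(Pages, each "page=" & Number.ToText(_))
in
    Labels
```

This evaluates to the list {"page=1", "page=2", "page=3"}.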

From the Power Query ribbon, select From Other Sources –> Blank Query. This brings up an empty editor page. In the formula bar, type the following formula:

= {1..7}

This gives us a list of numbers from 1 to 7.
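
If the number of pages is likely to change, the bounds can be held in named steps rather than typed into the range. This is a sketch under that assumption; FirstPage, LastPage, and the other step names are hypothetical.

```powerquery
let
    FirstPage = 1,
    LastPage = 7,
    // Same list as {1..7}, with the bounds held in variables
    Pages = {FirstPage..LastPage},
    // List.Numbers(start, count) builds the same list from a start and a length
    PagesAlt = List.Numbers(FirstPage, LastPage - FirstPage + 1),
    // Lists compare item by item, so this confirms both forms agree
    Same = (Pages = PagesAlt)
in
    Same
```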


Convert this to a table by clicking the To Table button, and click OK on the prompt.

Rename the column to something more meaningful (e.g. “Page”).

Go to the Add Column menu, and click Add Custom Column.

We can invoke our function (GetData) for each page with the following formula:

GetData([Page])

Click OK to return to the editor. We now have a new column (Custom) with Table values. Note: clicking the whitespace next to the “Table” text (and not “Table” itself) will bring up a preview window in the bottom of the editor.


Click the Expand Columns button to expand the table inline.
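
Behind the scenes, the expand button generates a call to Table.ExpandTableColumn. To see what that function does without hitting the website, here is a self-contained sketch that uses small made-up tables in place of the real page data.

```powerquery
let
    // Two "pages", each holding a nested table (sample values, not real data)
    Pages = #table({"Page", "Custom"}, {
        {1, #table({"Rank", "Studio"}, {{1, "A"}, {2, "B"}})},
        {2, #table({"Rank", "Studio"}, {{3, "C"}})}
    }),
    // Each nested row becomes an outer row; the Page value repeats alongside it
    Expanded = Table.ExpandTableColumn(Pages, "Custom", {"Rank", "Studio"})
in
    Expanded
```

The result has three rows and the columns Page, Rank, and Studio.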


The full query now looks like this:

let
    Source = {1..7},
    ToTable = Table.FromList(Source, Splitter.SplitByNothing(), null, null, ExtraValues.Error),
    Renamed = Table.RenameColumns(ToTable,{{"Column1", "Page"}}),
    Added = Table.AddColumn(Renamed, "Custom", each GetData([Page])),
    Expand = Table.ExpandTableColumn(Added, "Custom", {"Rank", "Movie Title (click to view)", "Studio", "Total Gross /Theaters", "Total Gross /Theaters2", "Opening /Theaters", "Opening /Theaters2", "Open", "Close"}, {"Custom.Rank", "Custom.Movie Title (click to view)", "Custom.Studio", "Custom.Total Gross /Theaters", "Custom.Total Gross /Theaters2", "Custom.Opening /Theaters", "Custom.Opening /Theaters2", "Custom.Open", "Custom.Close"})
in
    Expand
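
If you don't need the Page column in the output, a more compact variation is to transform the list directly and append the resulting tables. This is a sketch of an alternative, not what the UI steps above generate; it assumes the function was saved as GetData and works best when every page returns the same columns.

```powerquery
let
    Pages = {1..7},
    // Invoke GetData once per page; the result is a list of tables
    Tables = List.Transform(Pages, GetData),
    // Append the page tables into a single table
    Combined = Table.Combine(Tables)
in
    Combined
```

The trade-off is that you lose the record of which page each row came from, which the table-based approach keeps.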

Clicking Close & Load brings us back to the workbook. After the query executes, we can scroll to the bottom of the sheet to see that we’ve pulled in 7 pages of data.

 


