Migrating from BigQuery Legacy SQL to Standard SQL


Some time ago the BigQuery team renamed its original SQL dialect to Legacy SQL and introduced beta support for an ANSI SQL 2011 compliant dialect called Standard SQL in the BigQuery world.

This made life easier for the huge army of Digital Analysts, Data Science specialists, SQL developers and, surely, for me. Except that now I have had to migrate all the queries written over the last couple of years to the new Standard SQL. While it is true that the Standard SQL dialect in BigQuery is still at a beta stage, it is pretty usable, and tools like Tableau force users to switch.

Migrating from Legacy SQL to Standard SQL in BigQuery is a very straightforward process, and for every single query I was able to get exactly the same result by only fixing the dialect syntax, leaving the semantic structure untouched. As a second stage in the process, I modified some queries to better reflect the Standard SQL (ANSI 2011) ideology, making them simpler and easier to read. I didn't notice any substantial performance gain, though. In fact, many queries rewritten and adapted for Standard SQL executed 1.2 - 1.5 times slower than in Legacy SQL! What I have also noticed is that the internal caching engine does not work as well as it did for Legacy SQL. Well, that's probably why BigQuery Standard SQL is still in beta.

Changes required for transition from Legacy SQL to Standard SQL in BigQuery
  1. Comma replacements
    Basically, you can't have a trailing comma "," before the FROM keyword. BigQuery Standard SQL is not that forgiving, plus the error message is not clear - so it is good to remember this.
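
    To illustrate, here is a minimal before-and-after sketch (table names follow the placeholders used elsewhere in this post):

    -- Legacy SQL tolerated the trailing comma:
    -- SELECT date, totals.visits, FROM [myproject:XXXXXXXXX.ga_sessions_20170101]
    -- Standard SQL rejects it, so the comma before FROM has to go:
    SELECT date, totals.visits
    FROM `myproject.XXXXXXXXX.ga_sessions_20170101`
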
  2. Addressing tables
    Instead of using square brackets, we now use the backtick "`" symbol, and the project name and dataset name are now separated with a dot instead of a colon.
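
    A minimal sketch of the new notation (again, placeholder names):

    -- Legacy SQL:
    -- FROM [myproject:XXXXXXXXX.ga_sessions_20170101]
    -- Standard SQL:
    FROM `myproject.XXXXXXXXX.ga_sessions_20170101`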

    The changes above are very straightforward and could even be applied to multiple queries using a regex search and replace.

  3. Querying wildcard tables
    The TABLE_DATE_RANGE function does not exist anymore. Instead, there is a _TABLE_SUFFIX pseudo-column. Consider the following query, which extracts the maximum transaction revenue between 01 Jan 2017 and 31 Jan 2017:

    SELECT ROUND(MAX(totals.totalTransactionRevenue) / 1000000) as maxTransactionRevenue
    FROM `myproject.XXXXXXXXX.ga_sessions_*`
    WHERE _TABLE_SUFFIX between '20170101' and '20170131'

    The pseudo-column _TABLE_SUFFIX really made queries look cleaner and allowed for more freedom in some complex JOINs.

  4. Substring search
    The LIKE operator is preferable for simple matches.
    The REGEXP_MATCH function is replaced with REGEXP_CONTAINS, which uses the same signature, so that's a quick fix.
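
    A quick sketch of the rename (the column and pattern are just examples):

    -- Legacy SQL:
    -- WHERE REGEXP_MATCH(trafficSource.campaign, 'dom')
    -- Standard SQL:
    WHERE REGEXP_CONTAINS(trafficSource.campaign, 'dom')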

  5. UNION operation
    The UNION operation in BigQuery Standard SQL requires columns to be in exactly the same order, while BigQuery Legacy SQL matched them by name.
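
    A minimal sketch with placeholder table names - the column order has to match on both sides:

    SELECT date, totals.visits FROM `myproject.XXXXXXXXX.ga_sessions_20170101`
    UNION ALL
    SELECT date, totals.visits FROM `myproject.XXXXXXXXX.ga_sessions_20170102`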

  6. WITHIN RECORD
    This is where things need to get creative. In Legacy SQL, to find out whether a session had a specific Hit Scope Custom Dimension (or a specific Event), you could do something like

    MAX(IF(hits.customDimensions.index = 3, hits.customDimensions.value, NULL)) WITHIN RECORD AS ...,

    This does not inflate the rows and gives you the MAX value of the Hit Scope Custom Dimension within the session.
    There is no direct alternative to it in Standard SQL, so the solution could be to use a subquery:

    (SELECT MAX(IF(index = 3, value, NULL)) FROM UNNEST(hits.customDimensions)) AS ...,


    This may not be optimal, especially if you are querying tens of Custom Dimensions this way.
A few tips & tricks:
  • In most queries hit data is requested, therefore we need to UNNEST the hits and CROSS JOIN them with the session-level data. I recommend aliasing the unnested hits as hits. Much more readable. Seriously :)
    FROM `XXXXXXXX.ga_sessions_*`, UNNEST(hits) as hits
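
    A fuller sketch of the same pattern, assuming the standard GA export schema (placeholder dataset name):

    SELECT
      hits.page.pagePath AS pagePath,
      COUNT(*) AS pageviews
    FROM `XXXXXXXX.ga_sessions_*`, UNNEST(hits) AS hits
    WHERE _TABLE_SUFFIX BETWEEN '20170101' AND '20170131'
      AND hits.type = 'PAGE'
    GROUP BY pagePath
    ORDER BY pageviews DESC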

  • Performance matching a substring
    Anyone working with relational databases knows that it is best to avoid the substring match operator LIKE whenever possible, and only in exceptional situations write LIKE '%needle'. Such a predicate forces the engine to perform a full table scan, because it cannot use a pre-computed index, and that is a very time-consuming operation. In BigQuery, however, every query is a table scan! BigQuery is a really fast engine, and because data is stored per column instead of per row, each query only needs to 'read' the requested columns while doing that full scan.
    So using a LIKE '%needle' pattern is now a perfectly acceptable way of filtering data! And in my tests it was way faster than using the REGEXP_CONTAINS function:

    AND trafficSource.campaign LIKE '%dom%'  --faster
    AND REGEXP_CONTAINS(trafficSource.campaign, 'dom') --slower

  • Another tip: queries that look simpler are better and work faster :)

As an Analytics 360 (formerly Google Analytics Premium) reseller and service provider, we provide full support for Analytics 360 products and services, including BigQuery. Talk to us if you need help migrating Legacy SQL queries to Standard SQL. Pick up the phone, Live Chat, or email us if you would like us to share our skills and knowledge to achieve your business goals and targets.

Internetrix combines digital consulting with winning website design, smart website development and strong digital analytics and digital marketing skills to drive revenue or cut costs for our clients. We deliver web-based consulting, development and performance projects to customers across the Asia Pacific (APAC), Australia, Sydney, and Wollongong, ranging from small business sole traders to ASX listed businesses and all levels of Australian government. Thank you for browsing this post, stay tuned for more from Internetrix. 

  
