Google's approach to Big Data is BigQuery - a web service for the interactive analysis of massive datasets using SQL-like syntax. How big is Big? Millions and billions of rows can be processed in seconds. Oh, and yes - you can run regular expressions in your queries!
While BigQuery syntax is very much SQL-like, it is not a traditional relational database when you consider how it stores and processes data. You may want to denormalise your data to ensure super-fast processing, and you can also use nested records. The BigQuery engine follows the NoSQL (Not Only SQL) concept, which has recently gained popularity.
If you are a happy owner of a Google Analytics Premium account, the BigQuery Export for Google Analytics is available to you. This awesome feature will automatically export raw session and hit level data to BigQuery. Basically, each session is represented as a single row, with every hit (each pageview and event) and all its attributes nested inside. That's right - the data is not aggregated. Do you feel the potential? That's a major difference from the Google Analytics API export. Do we get sampling in BigQuery? Of course not! And it works much, much faster than the GA interface or GA scheduled unsampled reports.
However, it does come at a cost. With the amount of data Google Analytics stores for every hit, this does not come as a surprise. The good news is that Google provides Premium clients with a significant monthly credit to spend on their BigQuery business needs.
Google charges for data storage and processing (measured by the amount of data being processed). At the time of writing, Google charges $0.020 USD per GB, per month for storage and $5 USD per 1 TB of data being processed by queries. To put these numbers into context, let's see how many bytes a single Hit may take.
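To keep the arithmetic below honest, here is the pricing model as a minimal Python sketch. The rates are the ones quoted above, as at the time of writing - they may well have changed since:

```python
# A minimal sketch of the BigQuery pricing model described above.
# Rates are those quoted at the time of writing ($0.020 per GB per
# month for storage, $5 per TB processed) and may have changed since.
STORAGE_USD_PER_GB_MONTH = 0.020
PROCESSING_USD_PER_TB = 5.0

def monthly_storage_cost(gigabytes: float) -> float:
    """USD per month to keep `gigabytes` of data in BigQuery."""
    return gigabytes * STORAGE_USD_PER_GB_MONTH

def processing_cost(terabytes: float) -> float:
    """USD to run queries that scan `terabytes` of data in total."""
    return terabytes * PROCESSING_USD_PER_TB
```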
The sample GA table (ga_sessions_20130910), dating back to Sep 2013, is pretty small: 62KB for 63 rows (one session per row). However, it gives us a clear idea of the average practical row size: less than 1KB per session, or 175 bytes per hit on average. How do I know that? I calculated the number of hits using this simple query:
SELECT sum(totals.hits) FROM [google.com:analytics-bigquery:LondonCycleHelmet.ga_sessions_20130910]
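The per-hit figure can be sanity-checked in a few lines of Python. Note that the hit total of 363 below is my own assumption, chosen to be consistent with the stated 175-byte average - the query above returns the real number:

```python
# Back-of-the-envelope check of the sample table's row sizes.
# `hits = 363` is an assumed query result, consistent with the
# 175-bytes-per-hit average quoted in the text.
table_bytes = 62 * 1024   # the 62KB sample table
sessions = 63             # one row per session
hits = 363                # assumed result of SUM(totals.hits)

bytes_per_session = table_bytes / sessions   # just under 1KB
bytes_per_hit = table_bytes / hits           # roughly 175 bytes
```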
So at 175 bytes per hit and 10 mln hits monthly, we are looking at approximately 1.63GB of data. This leads to a BigQuery storage cost of $0.03 - not much at all. Query your whole dataset approximately 628 times and another $5 will be added to your bill. Storing 1B (that's one billion, lads) hits in BigQuery will result in a monthly storage cost of $3.26. Pretty affordable.
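These numbers reproduce exactly if you treat a GB as 2^30 bytes:

```python
# Reproducing the cost arithmetic above (GB here means 2**30 bytes).
bytes_per_hit = 175

monthly_gb = bytes_per_hit * 10_000_000 / 2**30    # ~1.63GB for 10 mln hits
storage_usd = monthly_gb * 0.020                   # ~$0.03 per month
scans_per_5_usd = 1024 / monthly_gb                # ~628 full scans per $5 (1TB)
billion_hits_usd = bytes_per_hit * 1_000_000_000 / 2**30 * 0.020  # ~$3.26
```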
The numbers above are for simple, standard tagging. Who is doing basic tagging these days? People shop digitally, enjoying micro-moments, and in 2015 improving our ROI means measuring quite a bit more than pageviews and sessions. By doing so, we significantly inflate our bytes-per-row number. How significantly? It depends, as every byte you send is stored and counted against your quota.
Let's get a little bit more technical. Based on the latest export schema we have, the following fields count:
In the BigQuery GA table structure, a single row (a session) can include nested repeated records, such as Custom Dimensions and Custom Metrics. For example, a hit may include 20 or more Custom Dimensions inside a field named customDimensions. This helps save space and makes the table structure more robust; however, it also makes the average row size calculation less accurate and more unpredictable.
In a scenario where every hit (pageview and event) carries 20 Custom Dimensions and 10 Custom Metrics, each Custom Dimension adds one Integer field (the index) and one String field (the value), while each Custom Metric adds two Integer fields.
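The extra per-hit fields in that scenario tally up as follows (a simple sketch of the counting rule just described):

```python
# Extra fields added per hit in the 20 CD / 10 CM scenario above.
custom_dimensions = 20
custom_metrics = 10

extra_strings = custom_dimensions                        # one value string per CD
extra_integers = custom_dimensions + 2 * custom_metrics  # CD index + CM index/value
```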
Strings can be Custom Dimension values (150-byte limit), Event attributes such as Label, Action and Category (150-500 byte limits), page titles, URLs and so on. An educated guess for the average string size would be 65 bytes. Most probably, in your case it would be considerably less, as a whole bunch of app-related string fields (such as hits.appInfo.appId) stay empty in a Web property, and vice versa. It might be more, though.
The approximate average number of bytes required to store our hypothetical analytics data would be:
6,814 bytes = 93 strings × 65 bytes per string + 95 integers × 8 bytes per integer + 9 single-byte Booleans
Or about 6.6KB per hit.
Assuming our tagging produces 10 mln pageviews and 20 mln events, we have a total of 30 mln hits monthly, or 189GB of data, or $3.78 monthly for storage at the $0.020 per GB rate. If you manage to process 2 TB of data (roughly ten full passes over your dataset) you will be charged another $10.
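Chaining the whole estimate together in Python - note that using the exact 6,814 bytes per hit, rather than the rounded 6.6KB, lands slightly higher, at roughly 190GB and $3.81:

```python
# The full estimate chained together. The exact 6,814 bytes per hit
# gives ~190GB and ~$3.81 - a touch above the rounded 6.6KB figure.
strings, integers, booleans = 93, 95, 9
bytes_per_hit = strings * 65 + integers * 8 + booleans * 1  # 6,814 bytes

hits_per_month = 30_000_000                          # 10 mln PVs + 20 mln events
monthly_gb = bytes_per_hit * hits_per_month / 2**30  # ~190GB per month
storage_usd = monthly_gb * 0.020                     # ~$3.81 per month
processing_usd = 2 * 5.0                             # 2TB scanned at $5/TB = $10
```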
Still, pretty affordable.
While this is not the absolute maximum you can technically receive in your BigQuery tables from Google Analytics, the tagging configuration looks close to what you would see in real life. And if you're reaching 30 mln hits, you should be seriously considering Google Analytics Premium anyway, since on a standard account you would get heavy sampling and, as a result, no confidence in the GA reports.
As mentioned before, Google Analytics Premium clients enjoy a monthly $500 credit to spend on BigQuery, which is considerably more than the numbers above, even after adding the processing costs.
In all seriousness, you might want to integrate your GA data with your CRM, especially if you are utilising the User ID feature. This data will count towards your storage and processing limits as well. You may even want to migrate your existing data warehouse infrastructure to the BigQuery cloud, importing huge datasets into Google Cloud Storage. This is when you get serious. This is when you will probably reach your credit limit and might actually end up getting billed.
Some useful links to get you started with BigQuery
What is your average row size in bytes from real-life experience?
As a Google Analytics Partner, Internetrix works with customers, providing analytics solutions that give them a competitive edge. Talk to us if you want to know how Google Analytics Premium can help your business grow.