Universal Analytics Data Deletion

For those who use Google Analytics, there are two key dates coming up:

  • 1st July 2023: Universal Analytics (UA) sunsets for non-360 accounts
  • 1st July 2024: Universal Analytics data is deleted by Google for non-360 accounts

Those on GA360 have a year's grace, but will experience the same sunset in 2024.

Many businesses are already prepared for the UA sunset in July and its replacement with GA4, but few are also prepared for the UA data deletion 12 months later, despite this having as great an impact on business decision-making.

Why is the UA data deletion deadline important?

The deletion of UA data will restrict businesses' ability to understand long-term marketing performance, particularly for year-on-year (YoY) reporting or comparisons to pre-pandemic trends. On the other hand, businesses that take steps to retain their UA data can continue to analyse it and gain insights into marketing performance and customer behaviour.

This data is a valuable asset that your business has invested time and money in gathering. By retaining it, you can ensure that its value is not lost in the transition to GA4. With the deadline for data deletion fast approaching, it is essential to take action now so that your business can continue to use UA data in reporting and analysis.

For example, you may create a regular report on customer transactions and channel acquisition using Google Analytics, including key information such as YoY impacts. You switch to GA4 in July 2023 and keep comparing your data to the equivalent UA data in 2022 using the Google Analytics connectors in Looker Studio. This continues working for 12 months. But on 1st July 2024 all your UA data vanishes. Your previous reports stop working, and you can no longer comment on long-term performance or whether there's a long-term shift in channel acquisition. Without this information you can't adjust your marketing and sales strategies to meet the needs of your customers.

UA data is critical for making informed decisions about your business. By collecting and preserving this data, you can gain valuable insights into customer behaviour that can help you stay ahead of the curve. Without it, you’ll be flying blind, missing out on opportunities to improve and grow your business.

How can I retain my UA data?

Google advises that you can retain your UA data by:

  • downloading it directly from Google Analytics as a CSV/Excel file or
  • exporting it using the GA Reporting API.

In most cases, downloading data will not be robust enough, as it requires you to limit the data you store and to know in advance exactly which data you will need going forward. The GA Reporting API, on the other hand, is a good solution, as it allows you to store all your UA data, but it can be complicated to use.

To use the GA reporting API you'll need access to a cloud data warehouse, such as BigQuery in Google Cloud Platform, and some data engineering skills, such as cloud-based architecture and coding in R. We set out a step-by-step guide in the technical guide below.

We recommend BigQuery as a data warehouse solution because it offers low storage costs, powerful analytical capabilities, simple data management and seamless integration with GA4. Once your UA data is set up in BigQuery, it becomes a powerful resource for businesses that want to retain access to valuable UA data.

If you require any assistance with retaining your UA data, contact 26 at datascience@26-agency.com.

Technical Guide

Overview

To collect UA data before deletion, you can use a cloud-based infrastructure built with R code and Google Cloud Platform (GCP). This architecture uses Cloud Run, a serverless computing platform in Google Cloud, and R code to retrieve data from the GA API. The googleAnalyticsR package simplifies working with the GA API, and the googleCloudRunner package streamlines integration with Cloud Run. The use of parallel processing accelerates data collection, enabling large amounts of data to be retrieved quickly.

Once the infrastructure is set up, sending YAML configuration files to Google Cloud Storage triggers a Pub/Sub message, which in turn starts up Cloud Run, which collects Universal Analytics data from the API and sends it to BigQuery for further analysis.

This technical guide will take you through the four key steps to export all of your UA data:

  1. Initial Set-up
    • 1.1. Prerequisites
    • 1.2 Authentication
  2. Deploying the UA Historic Data Cloud Run API
  3. Setting up Google Cloud Storage and Pub/Sub
    • 3.1 Pub/Sub Configuration
    • 3.2 Create Notification to Pub/Sub on Cloud Storage Update
  4. Running The UA Historic Data API
    • 4.1 Creating The YAML Configuration
    • 4.2 Testing your Cloud Run service
    • 4.3 Running In Parallel
    • 4.4 SQL Considerations

1.     Initial Set-up

1.1 Prerequisites

To retrieve historic UA data, the first step is to set up an R environment and use the functions in the googleCloudRunner package. The setup instructions can be found in the googleCloudRunner documentation. Once you have successfully run the cr_setup() and cr_setup_test() commands, you can begin the process.
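As a minimal sketch of that initial set-up (assuming you are installing googleCloudRunner from CRAN), the commands are:

  install.packages("googleCloudRunner")   # if not already installed
  library(googleCloudRunner)

  cr_setup()       # interactive wizard: links your GCP project, region, bucket and authentication
  cr_setup_test()  # runs checks to confirm the set-up works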

Next, download the project for R from https://github.com/26-agency/uaPastData. This is not a package to install; it contains the Cloud Run UA Past Data API along with some helper files to get it up and running.

1.2 Authentication

Authentication is an important step in this process. To keep things organized, it’s best to create separate service accounts and OAuth Client ID JSONs for the cloud build/run robots and the robot that will collect and store the UA data. This will make it easier to track and understand the actions of each robot.

Since we are using Cloud Run, you should never include the authentication files within the base files for the service, as they would be exposed on the web. Instead, it's best to store them in Secret Manager and reference them from there in Cloud Run.

From APIs & Services for your project you need to create a Desktop app OAuth Client ID (you should have already configured the OAuth consent screen when setting up googleCloudRunner) and a service account (shown in Figure 1).

Figure 1: GCP API and Services Library

For the Client ID you should download the JSON and store it on your computer for later.

For the Service account you need to create a key and save it on your computer to upload back into GCP later.

Now we can upload these JSON files into Secret Manager by creating two secrets. You'll need to enable the Secret Manager API if it isn't already enabled. Upload each of the saved JSON files as its own secret and create it. It's good practice to then delete these files from your computer.

For use in examples later, the OAuth Client ID will be called ga_client_secret (as in Figure 2) and the service account is called ua_past_data_service_account.

Figure 2: Creating a secret in Google Secret Manager
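If you prefer the command line to the console, an equivalent sketch using gcloud is below; the JSON file names are placeholders for wherever you saved the downloads:

  gcloud secrets create ga_client_secret --data-file=client_secret.json
  gcloud secrets create ua_past_data_service_account --data-file=service_account_key.json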

The service account needs roles granting access to BigQuery and Cloud Storage. Admin roles for these will work, but ideally a service account should follow the Principle of Least Privilege. To do this, create a custom role from IAM & Admin with the permissions set out in Figure 3.

Figure 3: Service Account permissions

bigquery.datasets.create
bigquery.datasets.get
bigquery.jobs.create
bigquery.tables.create
bigquery.tables.get
bigquery.tables.getData
bigquery.tables.update
bigquery.tables.updateData
 
storage.buckets.get
storage.buckets.update
storage.objects.create
storage.objects.delete
storage.objects.get
storage.objects.list

In addition, the service account email has to have Read & Analyze (Viewer) permission on the Universal Analytics views that you want to pull data from.

2.  Deploying the UA Historic Data Cloud Run API

Before running the googleCloudRunner::cr_deploy_run() function, contained in the cloud_run_deploy.R file, you need to set up the variables used within the function (a sketch follows Figure 4). These variables are:

  • secretManagerClientSecret is the name of the OAuth 2.0 client ID saved in Secret Manager.
  • secretManagerServiceAccount is the name of the service account saved in Secret Manager.
  • CLIENT_SECRET_PATH and SERVICE_ACCOUNT_PATH point to a non-accessible location for the Cloud Run service. You don't have to change these values, but the paths leading to the two files must not be identical.
  • BUCKET is the bucket configured through googleCloudRunner.
  • NUMBER_OF_CONFIGURATIONS is the number of config files that are to be present in the Google Cloud Storage bucket inside a folder called config. This exists to facilitate parallel processing. More on this later.
  • cpu and memory are the maximum amounts of CPU and memory, respectively, that the Cloud Run service can use. You will most likely want to test and monitor this depending on your requirements, so it's likely that you'll change these values later.
Figure 4: cr_deploy_run function  
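Purely as an illustrative sketch (the real call is in cloud_run_deploy.R and shown in Figure 4), the values might be set along these lines, with the paths and bucket name as placeholders:

  secretManagerClientSecret   <- "ga_client_secret"               # secret holding the OAuth client ID JSON
  secretManagerServiceAccount <- "ua_past_data_service_account"   # secret holding the service account key
  CLIENT_SECRET_PATH   <- "/secrets/client/client_secret.json"    # placeholder mount path
  SERVICE_ACCOUNT_PATH <- "/secrets/service/service_account.json" # must not share a path with the client secret
  BUCKET <- "your-googlecloudrunner-bucket"
  NUMBER_OF_CONFIGURATIONS <- 1     # number of YAML files expected in the config folder
  cpu    <- 4                       # maximum CPUs for the Cloud Run service
  memory <- "8Gi"                   # maximum memory for the Cloud Run service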

Running this function creates a build, which is stored in the storage bucket. A successful Cloud Build deployment shows something like Figure 5.

Figure 5: Successful Build creation

Following this, it will run the Cloud Run deployment in Cloud Build, as in Figure 6.

Figure 6: Successful Cloud Run Deployment

This will output the URL of the deployed API. Copy and paste this URL for later use, for example https://ua-past-data-api-xj2r3f1gfg-ew.a.run.app.

In your storage bucket you should also be able to see a .tar.gz file, as in Figure 7. This contains all the package requirements, already installed for when the Cloud Run service is triggered.

Figure 7: .tar.gz file in storage bucket

If you haven’t got a config folder, it is worth creating one in the bucket at this point too.

3.  Setting up Google Cloud Storage and Pub/Sub

We want the API to trigger whenever a file is uploaded to our storage bucket. To do this we need to create a Pub/Sub topic and subscription, then create a notification for that storage bucket.

3.1 Pub/Sub Configuration

Go to Pub/Sub from the Cloud Console and create a new topic, called universal_analytics_past_data in this example (Figure 8).

Figure 8: Pub/Sub topic creation

Once the topic is created, edit the default subscription, as in Figure 9, to choose the Push delivery type and enter the API URL we just saved, appending "/pubsub" to the end. It's important to set a reasonable message retention period, such as 7 days, as we may encounter API limits and quotas while using the GA reporting API. I've set the subscription to expire in a year, well after the UA data deletion date, but you could be more specific and set it to expire at the start of 2024. For the retry policy, I've set a minimum backoff delay of 100 seconds to prevent hitting the API pull quota limit. However, it's still possible to exceed the limit, especially if multiple instances are created and run from Cloud Run simultaneously. The retry policy includes an exponential backoff delay, which is the recommended approach for handling GA reporting API requests.

Figure 9: Edit default Pub/Sub subscription

3.2 Create Notification to Pub/Sub on Cloud Storage Update

Now we need to create a notification to Pub/Sub whenever a file is uploaded or updated in the config folder within the bucket. You can do this with a gsutil command from the Cloud Shell. Start by clicking on the Activate Cloud Shell icon at the top right of the Cloud Console. Then fill in the relevant values in the command below and paste it into the Cloud Shell:
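As a sketch, with TOPIC_NAME and BUCKET_NAME as placeholders for your own topic and bucket (the -f json flag sets the notification payload format):

  gsutil notification create -t TOPIC_NAME -f json -e OBJECT_FINALIZE -p config/ gs://BUCKET_NAME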

-t is topic destination.

-e is the event type – OBJECT_FINALIZE means an object has been created (or updated).

-p is a prefix path and allows us to specify that the notification should only trigger when the config folder has been edited (and not just anything in the bucket).

An example of running the gsutil command in the Cloud Shell, using the values from this demonstration, is shown in Figure 10.

Figure 10: gsutil command example in cloud shell.

Now we can test whether this has worked by listing the notification configs for our bucket, as in Figure 11.

Figure 11: notification configs command  
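For reference, the listing command is simply the following, with BUCKET_NAME again a placeholder:

  gsutil notification list gs://BUCKET_NAME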

4.   Running The UA Historic Data API

4.1 Creating The YAML Configuration

In the working directory for uaPastData.Rproj there is a config folder, as in Figure 12, which outlines the possible YAML fields to fill in.

Figure 12: uaPastData config  

One important field to consider is gcp.dataset.location. This should be set to the same location as the GA4 BigQuery dataset, otherwise you won't be able to join the tables easily.

If you have set the Cloud Run service to collect a year's worth of data, for example, and then realise you need another year's worth, you could easily run the service again with the new dates and write_disposition set to WRITE_APPEND.

Let's start with some examples of running API requests on a single view with fewer dimensions and metrics than the maximum limits for one request. This will help you get familiar with the Cloud Run and GA APIs and test whether the Cloud Run service is set up correctly. Once we're comfortable with that, we'll move on to more complex API requests, such as running GA reporting API requests in parallel.

4.2 Testing your Cloud Run service

We can test the service by doing a simple pull of aggregated channel data over a few days, using one YAML configuration file.

We therefore make sure that the NUMBER_OF_CONFIGURATIONS environment variable for the ua-past-data-api Cloud Run service is set to 1.

Once you have done this and entered the values into the YAML, all you need to do is upload the file into the folder called config within the Google Cloud Storage bucket that googleCloudRunner is set to use. The file can be named whatever you like; however, you must keep the number of configuration files equal to the value set by NUMBER_OF_CONFIGURATIONS.
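If you'd rather upload from R than drag and drop in the console, one way is via the googleCloudStorageR package (an assumption here; the console works just as well). The file and bucket names are placeholders:

  library(googleCloudStorageR)
  gcs_auth()   # authenticate with an account that can write to the bucket

  # The object name must sit under the config/ prefix so the notification fires
  gcs_upload("simple_channel_test.yaml",
             bucket = "your-googlecloudrunner-bucket",
             name   = "config/simple_channel_test.yaml")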

Since this is a simple pull, I can remove the options in the YAML file for slow fetch and batching over days. The highest-cardinality dimension is date (not an ID like a transaction ID or a user ID), so the number of rows is only something approximating the number of days multiplied by the number of channels (i.e. small).
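Purely to illustrate the shape of such a test pull, a hypothetical configuration might look like the sketch below; apart from gcp.dataset.location and write_disposition, which are discussed in this guide, the field names are illustrative only, so check the config folder in the repo (Figure 12) for the real schema:

  # Hypothetical field names - see the repo's config folder for the real schema
  view_id: "12345678"
  date_range:
    start: "2023-01-01"
    end: "2023-01-07"
  dimensions: ["ga:date", "ga:channelGrouping"]
  metrics: ["ga:sessions", "ga:transactions"]
  gcp:
    dataset:
      name: "ua_past_data"
      location: "EU"          # match the location of your GA4 BigQuery dataset
  write_disposition: "WRITE_TRUNCATE"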

4.3 Running In Parallel

If you're trying to gather a large amount of data over several years, you may end up with around ten tables, each with tens of millions of rows. To speed up this process you can run API requests in parallel, although there are some limitations, such as the GA reporting API's limit on the number of concurrent requests. Nevertheless, the Cloud Run service is built with Pub/Sub, so even if you exceed that limit, the request will be retried with exponential backoff.

The most crucial part of this is determining what data to collect. In the uaPastData.Rproj working directory, there is an R file called parallelGAFields.R, which contains a variable called dimensionsAndMetrics. I created this by comparing the Dimensions & Metrics Explorer with the GA360 BigQuery export schema to produce something similar for the GA reporting API. Even if you haven't configured this export, there is a sample in the bigquery-public-data project that you can browse.

To tailor the data collection to your specific needs, you will need to customise the dimensionsAndMetrics variable. Although I have included most of the possible data, there are some things that I have left out, such as custom dimensions, which can be complex due to their different scopes; to include these, you may need separate requests for each scope. Additionally, I have not incorporated any advertising data, such as Google Ads or CM360 data, since it should still be possible to get this data from the platforms themselves and you cannot join it by client ID. It's also essential to consider the data you don't want, to avoid unnecessary API requests for data that doesn't exist. For instance, if you don't have any ecommerce data, there's no need to include the product1, product2, transaction, promotion and refund tables. I strongly recommend reviewing the data in the dimensionsAndMetrics variable to ensure that it includes everything you need. When building out the data required, use the Dimensions & Metrics Explorer to check that the API request is possible, since some dimensions and metrics cannot be combined.

As mentioned before, for the purposes of this demonstration we are going to assume we need to collect a large amount of data in a parallel process, and do this for a single view. You can use this as a basis to tailor to your own needs.

The reason I have separated the data to be collected from the YAML file is to make it easier for you to customise the data. Once you're happy with the data you can create a base YAML configuration like Figure 13.

Figure 13: base YAML configuration  

Save your base config file in the working directory and keep a note of the name. Now edit the variables within the create_configs_from_base.R file, as in Figure 14, and run it.

  • SOURCE_FILE links to the file containing the dimensionsAndMetrics variable. Here you can choose parallel (parallelGaFields.R) or sequential (sequentialGaFields.R); alternatively you can create a different source file, as long as it defines a dimensionsAndMetrics variable.
  • BASE_CONFIG_FILE is the file name for the base configuration YAML.
  • REPLACE_CLIENT_ID_WITH_CD_INDEX is the custom dimension index for the client ID. If using ga:clientId instead, set this to NULL.

This code updates the configuration with the views list and writes the updated configuration to a YAML file. If the source is parallel, the code loops through each table in the dimensionsAndMetrics list, creates a new configuration for each table, updates the view configuration with the dimensions and metrics for the current table, updates the list of BigQuery tables with the name of the current table, and writes the updated configuration to a YAML file named after the current table. The output is one YAML file per table, sent to a folder called config.

Figure 14: create_configs_from_base.R file
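A simplified sketch of that loop is below; the fields inside the config object are hypothetical, so treat create_configs_from_base.R in the repo as the authoritative version:

  library(yaml)

  base_config <- read_yaml(BASE_CONFIG_FILE)
  source(SOURCE_FILE)   # defines the dimensionsAndMetrics list
  dir.create("config", showWarnings = FALSE)

  for (table_name in names(dimensionsAndMetrics)) {
    config <- base_config
    # Hypothetical field names: swap in the dimensions/metrics and table for this export
    config$view$dimensions <- dimensionsAndMetrics[[table_name]]$dimensions
    config$view$metrics    <- dimensionsAndMetrics[[table_name]]$metrics
    config$bigquery$table  <- table_name
    write_yaml(config, file.path("config", paste0(table_name, ".yaml")))
  }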

We need to ready the Cloud Run service for parallel processing, which involves updating the NUMBER_OF_CONFIGURATIONS environment variable in Cloud Run. This should be set to the same value as length(dimensionsAndMetrics): if length(dimensionsAndMetrics) == 10, then NUMBER_OF_CONFIGURATIONS in Cloud Run should be set to 10.
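If you'd rather not run a full redeploy, one way to change just this variable is from the Cloud Shell; the service name matches the API deployed earlier and the region is an assumption based on the URL:

  gcloud run services update ua-past-data-api --region europe-west1 --update-env-vars NUMBER_OF_CONFIGURATIONS=10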

Once the service has been redeployed (or its environment variable updated), all you have to do is upload all the configuration YAML files to the config folder in your bucket. Remember to delete any other config files that are no longer relevant.

Once the service has been triggered you should see something like Figure 15 and Figure 16 in the Cloud Run metrics.

Figure 15: Initial Operation During Drop-off (A Few Hours In)

Figure 16: Operation After The Initial Drop-off

For our current configuration, we have allocated 8 GiB of memory and 4 CPUs. However, since the CPU utilization never exceeds 50%, we could potentially reduce the CPU count to 2 and the memory allocation to 4 GiB in order to cut costs. It's important to note that during the operation, the received bytes will flatline after a few hours due to exceeding the maximum number of requests permitted per day for the view ID. To determine the number of requests made, you can check the Analytics Reporting API metrics and cross-check against the API's requests limits and quotas.

If you are exceeding the GCP project API limit, requesting an increase may not be approved by Google, since the API is due to be deprecated soon. A faster alternative is to set up the Cloud Run service in multiple projects, one for each view, or simply accept the wait time.

After the initial drop-off period, you may be able to further decrease the CPU or memory allocation to save costs, but you should continuously monitor CPU usage to ensure it doesn't exceed the limits. While some memory errors are expected, you should check that memory usage is not consistently high. For example, there are times during the process when 2 CPUs and 2 GiB would be acceptable, but as soon as there is a maximum CPU error or consistent memory errors in the logs, these would need to be increased again.

When reviewing the service logs, expect to see a consistent POST 504 error message indicating that the request has been terminated because it has reached the maximum request timeout. This is normal for a Cloud Run service, as it has a shorter maximum run time than Cloud Build, and the Cloud Run service is built with this in mind.

4.4 SQL Considerations

Once you've gathered all the required data, it's essential to bear in mind the possibility of duplicate rows. Pub/Sub can experience a backlog of requests during drop-off periods when the API limits have been reached, leading to two API requests writing to BigQuery at around the same time before the dates are verified. Because we're dealing with time-based data and client IDs, employing a SELECT DISTINCT operator in the SQL queries for each table is one approach to address this problem. You may also want to write queries that recreate the tables with the SELECT DISTINCT operator applied and sort the tables correctly. Furthermore, consider writing a query to create a new table that denormalizes the data into one table to save on storage costs. At this point, you may want to delete the normalized tables (or store them in long-term Cloud Storage as a backup).
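As a sketch of the deduplication step for one table, with the project, dataset and table names as placeholders:

  -- Replaces the table with a de-duplicated copy of itself
  CREATE OR REPLACE TABLE `your-project.ua_past_data.sessions` AS
  SELECT DISTINCT *
  FROM `your-project.ua_past_data.sessions`;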

If you created multiple projects to get around the per-project API limit, there isn't a way for the service to send the data to a different project, so you may want to write queries afterwards that consolidate the data into a single project.

Lewis Bryan is Principal Data Engineer at MSQ’s 26
