Google Tag Manager: Hidden Data Leaks and Privacy Violations?

Aaron Burnett / 22nd January 2024 / Comment / Analytics

 

Video Transcript

Does Google Tag Manager suffer from hidden or even intentional data leaks?

Does Google Tag Manager give the illusion of consent policy control for site owners but ignore those same policies when it comes to their own tags?

Does GTM potentially violate GDPR?

These are some of the claims that come out of a new study out of Europe. The claims are pretty audacious, and if they’re accurate, they definitely have implications for anyone operating a website in Europe and in the US for anyone considering GTM as part of a HIPAA-compliant data strategy.

So, I think it’s worth a few minutes digging into the study, looking at the methods they use, what we can and cannot learn from it, and what this might mean for organizations considering server-side Google Tag Manager as part of a HIPAA compliance strategy in the US.

The study comes from the National Institute for Research in Digital Science and Technology (Inria), an organization out of France. It was founded in 1967 by the French government and is intended to be a bridge between academia and the commercial world. The title of the study is, “Google Tag Manager: Hidden Data Leaks and its Potential Violations under EU Data Protection Law.”

The paper doesn’t make for the easiest reading – it’s an academic paper and they’re rarely easy bedtime reading. But what’s also true is that the lines get pretty blurry between client-side and server-side tag management implementations, so there are claims and findings that are not clearly identified as specific to client-side or server-side.

The paper includes both conclusions and legal analysis provided in concert with an attorney. We’re going to focus on the conclusions and won’t get into the legal analysis. There are enough issues with the structure and methodology of the paper that it just doesn’t seem fair or right to focus on the legal analysis.

The study evaluated both client-side and server-side Google Tag Manager. On the client side, they evaluated 78 client tags; on the server side, they evaluated eight server-side tags. They also looked at two different consent management platforms. What they discovered:

·       Multiple hidden data leaks, they claim

·       Tags bypassing GTM permission systems to inject scripts

·       And disagreement between the consent state that is indicated by a user and what is actually collected by Google Tag Manager

So again, pretty alarming claims. So it’s worth looking at their methods. It’s worth considering whether these claims are well-founded and whether this is a study that actually discovered something new or this is a study that had its objective in mind when it was executed and was structured in a way to achieve that objective. And I’ll sort of foreshadow here, it’s much more the latter than the former.

The study was conducted in late 2023 and was published on December 22nd, 2023. And here’s how they went about setting things up. They structured a new domain with:

·       One page of content

·       One paragraph of text

·       One HTML login form to test whether form interactions were being captured in an inadvertent manner as well.

On the client side, they tested all of the 78 tags that are formally approved and endorsed by Google. They installed those tags one by one and monitored what was being fired by those tags, comparing that with what was actually disclosed to the website publisher and what was indicated by configuration, and then looked for disagreement among those things.

They also performed more rigorous tests on three of the most popular tags: one is the Google tag, the second is the Pinterest tag, and the third is Hotjar’s own tracking code. In carrying out the tests on the client side, they repeatedly visited and navigated the page for 20 seconds each time. Every time they visited, they made sure it was a fresh session – no cookies, no cached data, nothing that would be carried over from a prior visit.

They used Chromium debug tools to analyze traffic and look for any scripts that were being downloaded with the tag (looking for script injection). And for the three tags that they evaluated more rigorously, they also inspected GET parameters and POST bodies and looked for data exchange via the WebSocket protocol to identify any data collected.
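As a rough illustration of this kind of traffic inspection, the sketch below parses a HAR export (the JSON format that Chromium's Network panel can save) and lists the GET parameters and POST bodies sent to hosts other than the site's own. The function name, hostnames, and sample capture are hypothetical, and this is not the tooling the study authors used.

```python
import json
from urllib.parse import urlsplit, parse_qs


def third_party_payloads(har: dict, first_party_host: str) -> list[dict]:
    """List data sent to hosts other than first_party_host in a HAR capture."""
    findings = []
    for entry in har["log"]["entries"]:
        request = entry["request"]
        parts = urlsplit(request["url"])
        if parts.hostname == first_party_host:
            continue  # skip the site's own traffic
        findings.append({
            "host": parts.hostname,
            "method": request["method"],
            # parse_qs maps each GET parameter name to a list of values
            "query": parse_qs(parts.query),
            # HAR stores a POST body under postData.text when one is present
            "body": request.get("postData", {}).get("text", ""),
        })
    return findings


# Hypothetical capture: one first-party request and one third-party request.
sample_har = {"log": {"entries": [
    {"request": {"method": "GET", "url": "https://example.test/index.html"}},
    {"request": {"method": "POST",
                 "url": "https://collector.example/g/collect?cid=123&dl=login",
                 "postData": {"text": "en=page_view"}}},
]}}

for finding in third_party_payloads(sample_har, "example.test"):
    print(json.dumps(finding))
```

A listing like this only shows what left the browser; deciding whether a given parameter counts as undisclosed collection still requires comparing it against what the tag's configuration says it sends.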

On the server side, they implemented the three most used tags in the same sort of paradigm: so new website, new domain, single page of content, single paragraph of text, single HTML login form. The three most used tags were the Google Analytics tag, the Facebook Conversions API and Mixpanel.

They provisioned the server container on a private server so they didn’t use Google’s automatic provisioning, which would implement the container in Google’s cloud environment. Instead, they wanted full control over the technical environment and used their own private server. They created a new web container, they installed the Google tag in that web container and then again they analyzed outgoing traffic from the server container by capturing and analyzing network exchanges. They used Wireshark to collect traffic and import encryption keys. And they implemented the Cookiebot consent management solution.

When testing the cookie consent solutions, they implemented and tested two different solutions which they configured to be GDPR compliant. They used Google Tag Manager’s consent mode features and tested in a similar fashion to client-side and server-side: visiting the site multiple times, 20 seconds per visit, reloading the page, navigating for another 20 seconds, using a new browser session each time and varying the consent state each time. They used GTM’s debugger to know when the tags were running. They used the browser debugger to monitor outgoing traffic, and they compared the data collected to the consent state that they created when they were testing those consent management platforms.
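The comparison described above (consent state on one side, observed traffic on the other) could be sketched roughly as follows. The `Visit` record, the host names, and the sample visits are all hypothetical; this only illustrates the idea of flagging visits where tracking endpoints were contacted without granted consent.

```python
from dataclasses import dataclass


@dataclass
class Visit:
    consent: str                # "granted", "denied", or "undefined"
    outgoing_hosts: list[str]   # hosts contacted during the visit


def flag_mismatches(visits: list[Visit], tracking_hosts: set[str]) -> list[Visit]:
    """Return visits where a tracking host was contacted without granted consent."""
    return [
        visit for visit in visits
        if visit.consent != "granted"
        and tracking_hosts.intersection(visit.outgoing_hosts)
    ]


# Hypothetical test visits, each from a fresh browser session.
visits = [
    Visit("granted", ["example.test", "collector.example"]),
    Visit("denied", ["example.test", "collector.example"]),
    Visit("denied", ["example.test"]),
    Visit("undefined", ["example.test", "collector.example"]),
]

flagged = flag_mismatches(visits, {"collector.example"})
print([visit.consent for visit in flagged])
```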

And here, in consent management testing, we get the first indication that this was maybe a bit of a stacked deck – an outcome that was achieved by a study that was structured to achieve that outcome. So, in consent management, you’re looking for one of two values – consent granted or consent denied. The study authors discovered that Google Tag Manager interprets a consent state of “undefined” (meaning that consent hasn’t been granted or denied) as granted. And they flagged this as a fault of Google Tag Manager. But this is a configuration fault. This is an engineering fault. You would not set Google Tag Manager up in this way. Instead, you would implement defaults that apply before the consent system is encountered, which prevents an undefined state from being treated as consent. So again, highlighting a configuration error as an error that’s inherent in Google Tag Manager isn’t quite fair in terms of overall findings.
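To make the point about defaults concrete, here is a simplified model of the behaviour described above. It is an assumption-laden sketch, not GTM's actual implementation: the function `tag_may_fire` and its parameters are hypothetical, and the "undefined means granted" rule simply mirrors what the study reported.

```python
from typing import Optional


def tag_may_fire(consent: Optional[str], default: Optional[str] = None) -> bool:
    """Decide whether a tag may fire for a given consent signal.

    consent: "granted", "denied", or None if the visitor has not chosen yet.
    default: value applied before the visitor chooses, or None if unset.
    """
    effective = consent if consent is not None else default
    if effective is None:
        # Behaviour reported by the study: with no default configured,
        # an undefined consent state is treated as granted.
        return True
    return effective == "granted"


print(tag_may_fire(None))                       # no default configured
print(tag_may_fire(None, default="denied"))     # default denies until a choice is made
print(tag_may_fire("granted", default="denied"))
```

In a real deployment the equivalent step is declaring default consent states before any tag runs, which Google's consent mode supports; the model above only shows why leaving that default unset produces the result the authors flagged.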

The study claims that tags collect data that they don’t disclose, and they claim that this is true for both client-side and server-side implementations. We know this is true client-side. Part of the risk with client-side tags is that the tags are published by a third party and the data that is collected by those tags is governed by data collection libraries that are created by those third parties and those libraries are live. They can be modified by the third parties and that modification can result in script injection and collection of all sorts of data that isn’t explicitly indicated by the tag, so that’s a known issue.

They further claim that, on the client side, 56 of 78 officially supported client-side tags have a permission set to inject scripts which bypasses Google Tag Manager’s own permission system. So for example, Hotjar is one of these tags and, through this injection process, Hotjar can track precise mouse movements, which is one of its values to its users. But this occurs via injection that isn’t occurring in the construct of a consent framework, and so that’s problematic. Five tags that are owned by Google in fact do this – again bypassing their own permission system to inject scripts. Eleven of the 78 officially supported tags didn’t have the inject script permission set, but they do it anyway, and seven of these come from Google on the server side.

Really the significant finding is that the Google tag always sends the info that you can see in this table, regardless of consent state. That is certainly problematic in a GDPR context. Google does promise not to use that data if consent has not been granted. But in a GDPR context it really doesn’t matter. You don’t have permission to collect that data without consent.

But here again, this feels like a little bit of a stacked deck. If you were implementing Google Tag Manager in Europe and wanting to be compliant with GDPR, you would not use the Google tag in a default configuration. You would, for example, do what we do and create a private client ID through which we can control every data element in a hyper-granular fashion – whether it is collected or not collected – and we do so positively at the moment of collection and then can also control (in the context of a tag management solution) what is shared out to other partners and platforms. The study authors followed the instructions provided by Google for a default implementation, but you wouldn’t implement Google Tag Manager in a default state to be compliant with GDPR.
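One way to picture controlling every data element "positively at the moment of collection" is an explicit allowlist: only fields that have been approved are kept, and anything else is dropped before it can be forwarded. The sketch below is a generic illustration under that assumption; the field names and the `filter_event` helper are hypothetical and do not represent any particular vendor's or the author's implementation.

```python
# Hypothetical allowlist of fields approved for collection.
ALLOWED_FIELDS = {"event_name", "page_path", "timestamp"}


def filter_event(event: dict, allowed: set[str] = ALLOWED_FIELDS) -> dict:
    """Keep only explicitly approved fields; everything else is dropped."""
    return {key: value for key, value in event.items() if key in allowed}


# Hypothetical incoming event carrying more than was approved.
raw_event = {
    "event_name": "page_view",
    "page_path": "/login",
    "timestamp": 1705900000,
    "ip_address": "203.0.113.7",
    "user_agent": "ExampleBrowser/1.0",
}

print(filter_event(raw_event))
```

An allowlist fails closed: a new field added by a tag or library is not collected until someone approves it, which is the opposite of a default configuration that sends whatever the tag emits.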

So, what does all this mean for organizations operating in the US? In particular, what does it mean for what are called covered entities – organizations that are governed by HIPAA? Well, again, about a year ago, the Office for Civil Rights at HHS issued new guidance around the “Use of Online Tracking Technologies by HIPAA Covered Entities and their Business Associates.” The effect was to expand the definition of Protected Health Information (PHI) so that it’s now:

·       All web visitors (not just known patients)

·       Any data that enables identification (including IP address)

·       Any data that relates to an individual’s past, present, or intended healthcare, health state, or treatment

The effect really was to render virtually all third-party cookies non-compliant with HIPAA because third-party cookies by definition identify a visitor, track their behavior, and don’t allow control or refinement of this tracking by a covered entity.

So, third-party cookies and client-side cookies are off the table for HIPAA covered entities and their partners. So then really, the implication is for folks who are considering Google Tag Manager in a server-side context. There has been pretty murky info on this, both around whether GA4 can be implemented in a HIPAA-compliant manner and around whether Google will sign a BAA.

Google recently announced that they are offering a HIPAA-compliant version of Google Cloud services, which can be used in combination with Google Tag Manager, ostensibly to achieve HIPAA compliance using server-side GTM and by extension GA4 as well. But, as this study illustrates, employing a default, server-side implementation of GTM may be a risky proposition. Certainly, if you use the Google collect tag and you’re using it in an unmodified fashion, you’re going to end up collecting data that may rise to the level of PHI. Now, you can share that data. You could use Google Cloud and GTM and collect that data if you were under a BAA – and Google now does offer a BAA as part of their HIPAA compliance offering. But it’s their BAA and it’s non-negotiable. It’s really much more like an end user license agreement (EULA). You find it in the settings for your Google Cloud instance, you review it and you click “I agree”. That is the sum total of any negotiation.

We’ve negotiated BAAs with a lot of different organizations and I don’t know of a single healthcare organization that would simply accept a BAA from a large platform provider like this – certainly not one that had even a single attorney or compliance expert on their teams. The risk is just too great. BAAs need to be carefully crafted for the particular operating context of that organization and need to be suitable for their particular use. I’ll include a link in the notes for this posting if you’re interested in seeing Google’s BAA.

So, what’s the TLDR here? What do we take from this study that was confusingly executed, seemed to have an outcome in mind, but did come up with some interesting results? This study certainly makes a very strong case that client-side GTM is not compliant with GDPR, but I think that was pretty well known prior to the study. We already know that client-side GTM is not compliant with HIPAA. Server-side GTM could be compliant, but there is clear reason to tread carefully here, including seriously considering not using Google’s tag in a server-side context. And, if you’re going to use GTM in the Google Cloud context, you have to sign their BAA and that BAA is a template. It’s not negotiable and won’t be tailored to your risk profile or your use case.

I hope this review has been helpful. I hope this saves you from having to read through the 17 pages of densely worded text in that study. We’ll publish more information on this on a regular basis in hopes of helping organizations as we collectively navigate the end of third-party cookies, the rise in criticality of first-party data and the work that we’re all doing to identify and implement the best privacy-first solutions that enable us to continue to be very effective, high-performance marketers for our organizations and our clients.

By Aaron Burnett