Investigating Google's FLoC: Open-sourcing our SimHash implementation

There has been a lot of activity in the advertising industry over the past months and years, all focused on the same subject: better online privacy for users. This ranges from laws (GDPR, CCPA, LGPD) to the planned deprecation of third-party cookies and the planned introduction of a privacy budget.

Third-party cookies are used by the advertising industry to track users, to create profiles that improve ad targeting, and to attribute conversions to the ads shown. In other words, the current ad tech scene depends heavily on them.

The privacy budget will be a way to limit the number of properties a site can request from a web browser. These properties can be used for good, such as fetching the user's country to show the correct language or the device resolution to show a suitably scaled version, but they can also be used for bad, such as fingerprinting.

The planned removal of cookies (jokingly known in ad circles as the cookiepocalypse) will drive significant changes to our industry. Simplifying, the changes fall into two broad categories:

How can you get some idea of what ad to show to a user/visitor? This is the problem of cookieless targeting.
How can you attribute a purchase to an ad shown? This is the problem of cookieless attribution.

At Hybrid Theory we are always at the forefront of changes to the online advertising landscape, as our constantly updated cookieless positioning whitepaper shows. This is a multi-faceted issue, which means we need to:

Stay up to date with technologies and methods,
Keep our clients informed and prepared,
Plan for any future changes well before they happen.

One example where engineering has been the leading force is our evaluation of one of the targeting solutions proposed by Google, Federated Learning of Cohorts (FLoC). In short, the proposal is based on computing certain properties of the user's browsing history locally in the browser, (possibly) sending a hashed version of them to a central store, and then grouping users together in flocks.

To guarantee anonymity, a flock can’t have fewer than N users (several numbers have been proposed, N=1000 being the lowest suggested so far). This allows for broad targeting of users that have something in common. You can read more about the approach in the Google proposal document or the Google Ads blog post.
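
To picture the minimum-size rule, here is a minimal Python sketch. It assumes each user already has an integer fingerprint (the hashed version of their browsing history) and that users sharing the same leading bits belong to the same flock. The function name assign_flocks, its parameters, and the prefix-based grouping are our own illustrative assumptions, not the method described in Google's proposal.

```python
from collections import defaultdict


def assign_flocks(fingerprints, prefix_bits=8, total_bits=64, min_size=1000):
    """Group users by the leading bits of their fingerprint.

    Hypothetical helper for illustration only: `fingerprints` maps a
    user id to an integer fingerprint of `total_bits` bits. Groups
    with fewer than `min_size` users are dropped, mimicking the
    "no flock smaller than N" rule.
    """
    buckets = defaultdict(list)
    shift = total_bits - prefix_bits
    for user, fingerprint in fingerprints.items():
        # Users whose fingerprints share the same prefix land together.
        buckets[fingerprint >> shift].append(user)
    # Keep only the groups that are large enough to preserve anonymity.
    return {
        flock_id: users
        for flock_id, users in buckets.items()
        if len(users) >= min_size
    }
```

With fewer prefix bits the groups get larger and more users end up in a valid flock, at the cost of coarser targeting; that trade-off is exactly what a simulation on real browsing data can quantify.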

The fundamental part of the algorithm is thus creating these flocks, and the key to doing so is the SimHash similarity algorithm. Once you have a fast SimHash implementation and enough browsing data, you can simulate what flocks would look like and then evaluate the impact they could have on targeting and audience creation.
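
For readers unfamiliar with SimHash, the following is a minimal, unoptimized Python sketch of the general technique, not our open-sourced implementation. It assumes the input is a collection of string features (for example, visited domains), uses SHA-256 from the standard library as the per-feature hash, and weights every feature equally; the names simhash and hamming_distance are chosen here for illustration.

```python
import hashlib


def simhash(features, bits=64):
    """Return a `bits`-wide SimHash fingerprint of an iterable of strings.

    Illustrative sketch: `bits` is assumed to be a multiple of 8 and at
    most 256, because the per-feature hash is a truncated SHA-256 digest.
    """
    counts = [0] * bits
    for feature in features:
        digest = hashlib.sha256(feature.encode("utf-8")).digest()
        value = int.from_bytes(digest[: bits // 8], "big")
        for i in range(bits):
            # Each feature votes +1 or -1 on every bit position.
            counts[i] += 1 if (value >> i) & 1 else -1
    fingerprint = 0
    for i, count in enumerate(counts):
        # A bit is set when the positive votes win at that position.
        if count > 0:
            fingerprint |= 1 << i
    return fingerprint


def hamming_distance(a, b):
    """Count the bit positions in which two fingerprints differ."""
    return bin(a ^ b).count("1")
```

Inputs that share most of their features tend to produce fingerprints with a small Hamming distance, which is the property that makes SimHash usable for grouping similar browsing histories.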

As part of our effort to help others and contribute to the open-source community, we have open-sourced our SimHash implementation. You can find more technical information in its README.md.

We hope this implementation serves other companies that want to investigate FLoC ahead of its release later this month, or anyone who wants to use a performant SimHash implementation in Python.