Was Google’s unofficial spell check API part of a huge data breach?

Çağlar Arlı

I have recently seen some legacy systems which used an unofficial Google API to perform spell checking. Clients would send a list of words to http://www.google.com/tbproxy/spell which would respond with spelling suggestions.

I am curious if this was part of a pretty large data breach. To me, it seems very likely that information which was presumed to be private, such as emails and drafts of blog posts, was sent to Google in massive numbers for many years. Many organizations and legal systems would consider this a data breach. And I assume that there are legacy systems which are still sending today.

Background

This API was used by the Google Toolbar. It was unofficial, undocumented, and unsupported. Apparently it was "discovered" around 2005 and used by many projects without being officially offered or supported by Google. Discussions from this time indicate that people were aware that it was unofficial, that there were no Terms of Service, and that there were concerns about data privacy and lack of encryption.

It was used in popular products such as Drupal (CMS), Roundcube (webmail), TinyMCE (WYSIWYG web plugin), and many others. TinyMCE itself was used in a large number of projects.

Some integrations were updated to filter out data that did not look like words (for example, credit card numbers, phone numbers). This indicates that at least some people thought that there was a risk here. But this was a quick fix to a larger issue.
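To illustrate the kind of filtering described above, here is a minimal sketch in Python. It is not taken from any of the projects mentioned; the function name `words_safe_to_send` and the rule "letters only, with inner apostrophes or hyphens" are my own assumptions about what such a filter might have looked like.

```python
import re

# Hypothetical rule: a "word" is one or more letters, optionally joined
# by inner apostrophes or hyphens (e.g. "don't", "well-known").
# [^\W\d_] matches any Unicode letter (word char that is not a digit or "_").
WORD_RE = re.compile(r"^[^\W\d_]+(?:['’-][^\W\d_]+)*$")

# Punctuation trimmed from both ends of a token before testing it.
EDGE_PUNCTUATION = '.,;:!?"()[]{}'


def words_safe_to_send(text: str) -> list[str]:
    """Return only the tokens that look like ordinary words.

    Tokens containing digits, "@", or other symbols (phone numbers,
    card numbers, email addresses) are dropped.
    """
    kept = []
    for token in text.split():
        stripped = token.strip(EDGE_PUNCTUATION)
        if stripped and WORD_RE.match(stripped):
            kept.append(stripped)
    return kept


if __name__ == "__main__":
    draft = "Call me at 555-123-4567, card 4111111111111111 thx!"
    print(words_safe_to_send(draft))
```

Note that a filter like this only removes tokens that are obviously not words. Every remaining word of the draft would still leave the machine, which is why this was only a quick fix.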

On July 9th, 2013, Google shut down this API.

Several pages that mention the shutdown link to more information on Google Product Forums, which unfortunately is no longer available. I have been unable to recover it using the Wayback Machine or Google's cache.

Here are some instances of projects which adapted to the shutdown by removing the relevant feature or replacing it with other solutions: Drupal, TinyMCE, jquery-spellchecker.

While Google now returns a 404 response, some old systems are probably still sending requests to it. And some of them are probably still using unencrypted http:// URLs.
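For anyone who wants to check whether a legacy codebase still references this endpoint, here is a rough Python sketch. The function names and the list of file suffixes are my own assumptions, and the scheme detection is a simple per-line heuristic. Some integrations built the request from a separate host and path, so a line that matches may not contain a full URL.

```python
from pathlib import Path

# Path fragment of the endpoint mentioned in the question.
ENDPOINT_FRAGMENT = "tbproxy/spell"


def find_endpoint_references(text: str) -> list[tuple[int, str]]:
    """Return (line_number, scheme) for each line mentioning the endpoint.

    scheme is "http", "https", or "unknown" (no full URL on that line).
    """
    hits = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        if ENDPOINT_FRAGMENT not in line:
            continue
        if "https://" in line:
            scheme = "https"
        elif "http://" in line:
            scheme = "http"
        else:
            scheme = "unknown"
        hits.append((lineno, scheme))
    return hits


def scan_tree(root: str, suffixes=(".php", ".js", ".py", ".inc")) -> dict:
    """Scan source files under root; the suffix list is only a guess."""
    results = {}
    for path in Path(root).rglob("*"):
        if path.is_file() and path.suffix in suffixes:
            hits = find_endpoint_references(path.read_text(errors="ignore"))
            if hits:
                results[str(path)] = hits
    return results
```

Calling `scan_tree("/path/to/legacy/app")` would return a mapping from file path to the matching line numbers. Any hit marked `"http"` would mean the text was sent unencrypted.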

We can see from the information above that this API was in popular use from roughly 2006 to 2013. Many systems from that era (before 2014) are still running today. I would guess that the number of documents sent to this API is in the billions, and still growing. I don't see any official information from Google about this today. From the perspective of 2024, it seems clear that this dataset would be a valuable asset for machine learning. It is unclear if Google has stored it.

Questions:

  • Has this been covered as a data-security news story anywhere? (I could not find any)
  • Does it seem newsworthy?
  • Does Google have any responsibility to communicate publicly about this?
  • Should Google clarify if it stores this information today?