This week, various media outlets reported that WhatsApp exposed user data on a massive scale. Some called it the “largest data leak in history,” others claimed the situation was even worse than it might appear at first glance. We somewhat disagree. A brief assessment.
Austrian researchers wrote a program that runs through all possible phone numbers and checks whether each one is linked to a WhatsApp account. If this is the case, the program stores not only the phone number but also the associated profile data (the profile picture and the About text), provided this profile data is public.
This way, the researchers were able to compile a complete directory of all phone numbers linked to a WhatsApp account, including public profile data.
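To illustrate the principle (not the researchers’ actual tooling), a minimal sketch in Python could look as follows. The lookup_account function and the number range are hypothetical placeholders; the real study queried WhatsApp’s contact-discovery interface at a vastly larger scale.

```python
# Illustrative sketch only: enumerate candidate phone numbers and record
# which ones are linked to an account, along with any public profile data.
# The lookup function is a hypothetical stub; the actual study relied on
# WhatsApp's contact-discovery mechanism, which is not reproduced here.

from typing import Optional


def lookup_account(phone_number: str) -> Optional[dict]:
    """Hypothetical stub standing in for a contact-discovery lookup.

    A real lookup would ask the messenger whether the number is registered
    and return whatever profile data (picture, About text) is public.
    """
    return None  # stub: pretend no account was found


def build_directory(country_code: str, start: int, end: int) -> dict:
    """Enumerate a number range and collect all numbers linked to an account."""
    directory = {}
    for n in range(start, end):
        number = f"{country_code}{n}"
        profile = lookup_account(number)
        if profile is not None:
            # Store the number together with its public profile data.
            directory[number] = profile
    return directory


if __name__ == "__main__":
    # Tiny range for demonstration purposes; the study covered billions.
    result = build_directory("+43", 6600000000, 6600000100)
    print(f"{len(result)} of 100 numbers are linked to an account")
```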
According to media reports, the ability to create such a user directory entails the following undesirable consequences, which are particularly problematic from a data protection perspective:
The generated user directory reveals more about WhatsApp than Meta might be comfortable with (for competitive and regulatory reasons), such as how many individuals and companies use WhatsApp in each country or how large the user churn is.
The directory discloses personal information. Random samples suggest that around two-thirds of public profile data includes a human face as a profile picture, and About fields sometimes contain email addresses, other personal data, or references to the users’ identities.
The leaked information could be life-threatening. If authorities in countries where WhatsApp is banned are able to identify those using WhatsApp illegally, the consequences could be drastic.
The generated user directory certainly provides much more detailed information about Meta’s messaging platform than can be derived from app store charts or similar sources. However, this issue does not directly affect user privacy, especially since the information relates to the platform as a whole and not to individual users.
It is important to stress that the accessed profile data was all public. If a WhatsApp user makes their About text accessible to everyone, it can be viewed (and potentially saved) by anyone – this is not unexpected. Apparently, around 30% of all WhatsApp users have a public About text, while for the other 70%, it was not possible to retrieve the About text.
Because WhatsApp uses the phone number as a unique identifier, it is also not unexpected that it must be possible to determine whether or not a WhatsApp user account is associated with a given phone number. Otherwise, it would not be possible to add contacts in WhatsApp.
In light of this, it is somewhat misleading to speak of a “data leak.” What we have here is classic “scraping”: the researchers succeeded in systematically exporting public information.
Nevertheless, the generated user directory poses a serious problem from a data protection perspective. According to random samples, for example, two-thirds of the public profile pictures show a face. So if a person’s face is known, facial recognition software could potentially be used to search the directory to obtain that person’s phone number.
This point is the most sensitive. Without doubt, it is extremely worrying if authorities in totalitarian states gain access to a directory of all WhatsApp users’ phone numbers.
However, using WhatsApp in such a scenario is highly problematic from the outset. Like any service that uses the phone number as an identifier, WhatsApp has to verify a user’s phone number via SMS. SMS messages are not end-to-end encrypted, and in totalitarian states, it must be assumed that this communication channel is monitored.
Consequently, authorities can find out that a phone number is linked to a WhatsApp user account as soon as it is registered, and no periodic scraping is required.
As has been shown, it is somewhat of an exaggeration to speak of a “data leak” in this case, given that the accessed information was public. When users make information public, it is in the nature of things that anyone can view – and potentially save – it. Having said that, many users may not have been aware of the scope of the privacy settings or what they actually mean.
Even if this is “only” scraping, it is, of course, highly problematic that Meta has not implemented effective measures to prevent scraping on such a massive scale.
Still, even the best anti-scraping mechanisms cannot prevent anyone from determining whether a given phone number is associated with a WhatsApp user account. For example, the fact that Mark Zuckerberg uses the Signal app came to light after a security researcher published his private phone number.
This demonstrates the fundamental problem with using phone numbers as unique identifiers. For a whole range of reasons relating to data protection, phone numbers are not an ideal means for this purpose:
They cannot be easily changed (e.g., after a leak).
They provide information about which country they belong to.
They are not anonymous – in many countries, official identification is required for registration.
If different platforms require a phone number, users can be identified across these platforms.
Phone numbers are reassigned: a new owner may receive a number that previously belonged to someone else, which can have various problematic consequences.
For these reasons, Threema deliberately does not require a phone number and instead uses a random string of characters (the Threema ID) as a unique identifier, which is completely anonymous and can be revoked at any time.
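For illustration, generating such a random identifier could look like the following sketch, which simply draws eight characters from A–Z and 0–9. It is not Threema’s actual ID-assignment code; in practice, the server must also ensure that each generated ID is unique.

```python
# Illustrative sketch: generate a random 8-character identifier from
# uppercase letters and digits, similar in form to a Threema ID.
# This is not Threema's actual assignment logic; a real implementation
# would also have to guarantee server-side uniqueness of each ID.

import secrets
import string

ALPHABET = string.ascii_uppercase + string.digits  # A-Z, 0-9


def generate_random_id(length: int = 8) -> str:
    """Draw `length` characters uniformly at random from the alphabet."""
    return "".join(secrets.choice(ALPHABET) for _ in range(length))


if __name__ == "__main__":
    print(generate_random_id())  # e.g. "7K3PQ0ZB"
```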