Commentary

Privacy Landmine: Pre-Trained Language Models Can Leak Email Addresses

by Ray Schultz, Columnist, May 27, 2022

Pre-trained Language Models (PLMs), those useful tools for NLP tasks like perfecting, translating and generating text, may be creating privacy risks, including exposure of email addresses, according to a new study from the University of Illinois at Urbana-Champaign.

The danger of spam attacks is remote at this point, but the potential is there because of how much text PLMs memorize.

The authors tested GPT-Neo models on a “large public corpus” containing text collected from 22 diverse datasets.

They determined that PLMs “truly memorize a large number of email addresses.” But the models do not understand the exact associations between names and email addresses, and few addresses can be predicted correctly by querying with names, the authors write.

That means you can’t simply go in and ask for someone’s email address. But the danger exists that attackers can gain access to addresses the models have memorized.

The authors conclude that PLMs do leak personal information through memorization, but that “the risk of specific personal information being recovered by PLMs is low since they cannot associate personal information with the owner meaningfully.”

However, they add that “some conditions, e.g., a long text pattern associated with the email address, knowledge about the owner, and scale of the model may increase the attack success rate, causing potential privacy risks.”

Moreover, attackers can use existing knowledge to gain more information about owners from PLMs. Also, the researchers warn of these threats:

“Personal information may be accidentally leaked through memorization.

“Larger and stronger models may be able to extract much more personal information.”

BEC Attacks Now Use Language as the Main Vector

Speaking of cyber threats, a new study from Armorblox found that language-based attacks are the new normal for business email compromise (BEC) attacks, with 74% of those efforts using language as the main attack vector.

In addition, the study notes that despite manual work and rule writing, “70% of impersonation emails slipped past native email security controls.”

Dropbox, Microsoft and Docusign were among the most impersonated brands last year.


About the Author

Ray Schultz is the former editor of DM News, Chief Marketer, Direct, Circulation Management and other marketing titles.