Pre-trained Language Models (PLMs), those useful tools for NLP tasks like perfecting, translating and generating text, may be creating privacy risks, including exposure of email addresses,
according to a new study from the University of Illinois at Urbana-Champaign.
The danger of spam attacks is remote at this point, but the potential is there because of how much
PLMs memorize.
The authors tested GPT-Neo models on a “large public corpus” containing text collected from 22 diverse datasets.
They determined that PLMs “truly memorize a large number of email addresses.” But the models do not understand the exact associations between names and email addresses, and few addresses can be predicted correctly by querying
with names, the authors write.
That means an attacker can’t simply ask a model for someone’s email address. But the danger remains that attackers could extract addresses in other ways.
The
authors conclude that PLMs do leak personal information through memorization, but that “the risk of specific personal information being recovered by PLMs is low since they cannot associate
personal information with the owner meaningfully.”
However, they continue that “some conditions, e.g., a long text pattern associated with the email address, knowledge about the
owner, and scale of the model may increase the attack success rate, causing potential privacy risks.”
Moreover, attackers can use existing knowledge to gain more information about
owners from PLMs. The researchers also warn of these threats:
“Personal information may be accidentally leaked through memorization.
“Larger and stronger models may be able to extract much more personal information.”
BEC Attacks Now Use Language As the Main Vector
Speaking of cyber threats, a new study from Armorblox found that
language-based attacks are the new normal for business email compromise (BEC) attacks, with 74% of those efforts using language as the main attack vector.
In addition, the study notes
that despite manual work and rule writing, “70% of impersonation emails slipped past native email security controls.”
Dropbox, Microsoft and Docusign were among the most
impersonated brands last year.