Commentary

Gathering Data On Web Jerks: Q&A With SUNY Binghamton's Blackburn

How often do you find a blog from someone whose opinion is so outrageous that you begin to wonder, who is this, and who could they possibly expect to read their stuff?

Leave it to Jeremy Blackburn, an assistant professor in the department of computer science at Binghamton University, to find the answer. His specialty is, as he describes it, “large-scale measurement of socio-technical phenomena. In a more straightforward manner: I study jerks on the web.”

Blackburn has recently been awarded a five-year $517,484 grant from the National Science Foundation to devise ways to make online content -- particularly from emerging social media platforms -- easier to gather and sort.

Charlene Weisler: What are the biggest challenges currently in gathering and analyzing online content generally, and social media content specifically?

Jeremy Blackburn: The biggest challenge is keeping up with the explosion of features and platforms. While most people are familiar with Twitter and Facebook and maybe Reddit, there are numerous other platforms that have consistently arisen, like Gab, Parler, etc. Other platforms have been around for years, like 4chan and 8chan, but are specifically designed to be ephemeral and anonymous.

So, there is a very real challenge just with respect to developing systems to collect all these different data sources. The next challenge is figuring out how to fit all these different platforms into a consistent and comparable framework. Not all sites have a "retweet" concept, and some sites might mix concepts from other platforms, e.g., up/downvote and retweet features.

Weisler: Tell me about your multiplatform media dataset.

Blackburn: Most of our data is in JSON or archived HTML. We also collect multimedia (e.g., images), but the bulk (in terms of number of data points at least) is text. We have built systems to cover a variety of platforms over time. Probably the best way to get an idea of what kind of things we have going is to look at some of the data sets we have released: https://idrama.science/datasets/

Weisler: What is your sentiment rating system?

Blackburn: What we are working toward is a new formulation of the problem. For the most part, this problem domain has been thought about as looking at pieces of content in isolation. This leads to some consequences with respect to interpreting scores. At a high level, what does it mean for a piece of content to exhibit 0.01 more positive sentiment than another piece of content?

Instead, we treat the problem as a competition between two pieces of content: for example, which of these two pieces of content has more positive sentiment? This seems pretty straightforward, but doing it this way allows us to borrow from rating systems that are very interpretable: matchmaking systems, such as the Elo rating system that's used in chess.

This opens up a ton of new opportunities because matchmaking systems are actually among the most widely deployed systems on the planet due to their use in online video games. These rating systems have a relatively simple interpretation. For example, I can tell you precisely what it means if one piece of content has an 1800 positive sentiment score versus one that has a 1785 score.
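To make that interpretation concrete, here is a minimal Python sketch of the standard Elo formulas as used in chess. The interview does not say which rating variant Blackburn's group uses, so this is only an illustration: the 400-point scale and the K-factor of 32 are common chess conventions assumed here, and the function names are hypothetical.

```python
def expected_score(rating_a, rating_b, scale=400.0):
    """Probability that A beats B under the standard Elo model.

    In the sentiment setting, "A beats B" would mean that content A is
    judged more positive than content B (an assumed mapping).
    """
    return 1.0 / (1.0 + 10.0 ** ((rating_b - rating_a) / scale))


def update(rating_a, rating_b, a_won, k=32.0):
    """Return the new (rating_a, rating_b) after one pairwise comparison."""
    expected_a = expected_score(rating_a, rating_b)
    actual_a = 1.0 if a_won else 0.0
    delta = k * (actual_a - expected_a)
    # Elo is zero-sum: whatever A gains, B loses.
    return rating_a + delta, rating_b - delta


# A 15-point gap (1800 vs. 1785) is only a slight edge for the higher score.
print(f"{expected_score(1800, 1785):.3f}")  # prints 0.522
```

Under these assumptions, a score of 1800 against 1785 means the higher-rated item would be expected to be judged more positive in roughly 52 of every 100 comparisons.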

Weisler: Can you delineate irony? Parody? Sarcasm?

Blackburn: These are interesting sub-problems, but, for the most part, they remain open challenges. One of the issues here is that these tend to be difficult problems for humans! This is really compounded when looking at social media where there is not just a loss of nuance, but also (pseudo)-anonymity, and cross-cultural differences.

Weisler: If measuring images, how do you handle memes? Do images include video?

Blackburn: We have a fair bit of work about memes. Here are some whitepapers describing our work.

Video is an area we are actively moving into, but I don't really have any results to share yet.

Weisler: How do you collect into communities for modeling? Are you able to build new segmentations?

Blackburn: There are a variety of ways that we group communities together. For example, we have used clustering based on the things that different users talk about, as well as the type of memes they post.

Weisler: Donald Trump has been deplatformed from social media. Do you have any insights into this?

Blackburn: I have done work on what happens when communities are de-platformed. For example, we looked at what happened when Reddit's r/The_Donald and the incel community migrated to their own stand-alone platforms after being banned from Reddit. Here, we found that while there was definitely an effect on the communities (e.g., fewer members and less overall content posted), those who remained in the community became more engaged, and in some cases showed increasing signs of radicalization.

Weisler: How will you use what you have discovered to make the internet a safer, more honest space?

Blackburn: Beyond the obvious goals of helping the general public understand the modern Internet, there are more practical applications. For example, the techniques we are developing could help us identify mis- and disinformation campaigns in something approaching real time by looking for anomalous changes in behavior.
