What is it going to reveal?
So what’s the consensus on Ederer’s paper?
-
I'm sorry. Not PGS. Paul Goldsmith-Pinkham. Who got caught plagiarizing off of another already published work. That Paul Goldsmith-Pinkham who stole a paper and then used his position to silence others, and was only called out here on EJMR. He is the guy who wants to geolocate EJMR users.
-
At a high level:
You can compute various metrics about any message posted on EJMR. Anything you can think of: average word length, average sentence length, paragraph structure, vocabulary, etc. And some metrics more complicated than that.
You can then manually tag individual messages. For example, you can assume that most of the people in the German thread are Germans and/or working in German places. You can then compute the average difference, in that set of metrics, between the patterns you found with messages tagged "German" as opposed to the entire set of messages. Bam, you now have a way to predict, in any other thread, if the author is "German" or at least "German-like".
You can do that across other threads. You can assume that any thread about e.g. Harvard will have more Harvard people than average, and so on.
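To make the steps above concrete, here is a minimal sketch in Python, not what the paper's authors actually did. The three features, the function names, and the dot-product score are all my own assumptions for illustration; a real stylometry pipeline would use many more features and a proper classifier.

```python
import re
from statistics import mean

def features(text):
    """Toy stylometric features for one message (illustrative choice only)."""
    words = re.findall(r"[A-Za-z']+", text)
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    return {
        "avg_word_len": mean(len(w) for w in words) if words else 0.0,
        "avg_sentence_len": len(words) / len(sentences) if sentences else 0.0,
        "vocab_richness": len({w.lower() for w in words}) / len(words) if words else 0.0,
    }

def centroid(messages):
    """Average feature vector over a non-empty list of messages."""
    rows = [features(m) for m in messages]
    return {k: mean(r[k] for r in rows) for k in rows[0]}

def tag_profile(tagged, everything):
    """Average difference between the tagged messages and the full set."""
    t, base = centroid(tagged), centroid(everything)
    return {k: t[k] - base[k] for k in t}

def tag_score(message, profile, everything):
    """Dot product of the message's deviation from the baseline with the
    tag profile. Positive means it leans toward the tagged group.
    Note: features are not rescaled here, which a real system would need."""
    base = centroid(everything)
    f = features(message)
    return sum((f[k] - base[k]) * profile[k] for k in profile)
```

You would call `tag_profile(german_thread_messages, all_messages)` once (both lists are hypothetical inputs you would have to scrape and tag yourself), then `tag_score(msg, profile, all_messages)` on any message from another thread. The score is only a "German-like" indicator, not a calibrated probability.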
Given any message, you can now extract probabilities that the author is:
- German, or Australian, or French, or British, etc. (or at such institutions)
- from MIT, Yale, Harvard, whatever
You can now build archetypes. It's pretty clear that someone doing theory at, say, MIT versus someone doing micro at a French b-school will have different ways of writing. Those are your archetypes.
-
So basically it tells you: "from a probabilistic point of view, this message is strongly associated with 'native French speaker' and weakly associated with 'theory', 'microeconomics' and 'MIT'."
This is how you can, *in theory*, de-anonymize, because you can narrow that down quite a bit.
That's all there is to it.
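A rough sketch of that narrowing step, in Python. Everything here is hypothetical: the candidate records, the attribute names, the 0.8 cutoff for what counts as a "strong" association, and the idea of filtering on strong signals while only ranking on weak ones are my own assumptions, not anything taken from the paper.

```python
def narrow(candidates, signals, strong=0.8):
    """Filter a candidate pool using strong signals, then rank by weak ones.

    candidates: list of dicts like {"id": "p1", "attrs": {"french", "theory"}}
    signals:    dict mapping attribute -> association score in [0, 1]
    strong:     assumed cutoff above which a signal is used as a hard filter
    """
    strong_attrs = {a for a, s in signals.items() if s >= strong}
    # Keep only candidates that have every strongly associated attribute.
    kept = [c for c in candidates if strong_attrs <= c["attrs"]]

    def weak_score(c):
        # Sum of the weak association scores this candidate matches.
        return sum(s for a, s in signals.items() if s < strong and a in c["attrs"])

    return sorted(kept, key=weak_score, reverse=True)
```

With made-up signals such as `{"french": 0.9, "theory": 0.3, "mit": 0.2}`, anyone not tagged "french" is dropped and the rest are ordered by how many of the weak tags they match. The output is a shortlist, not an identification, and it is only as good as the tags going in.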
-
And that's all you can do about what is publicly available.
Any other method involving the hash ID or cookies or whatever would be such an advance in security research that:
1. They would have a CS co-author
2. They would have published in USENIX or similar
-
Isn’t the third guy a CS Ph.D./entrepreneur?
https://som.yale.edu/faculty-research/faculty-directory/kyle-jensen
-
Seems that way. No one paid attention to him because everyone was focusing on FE and PGP and what they can and can't do. Seems like this guy can do a lot of techy stuff.
-
Yes, that other guy actually has a CS background, so I stand corrected.
But it looks like his expertise is actually in machine learning and language processing, which confirms my intuition above.
"Natural language processing" is one of the keywords of their paper. They literally did what I described above.
-
In principle, you could extend that method outside of what is posted on EJMR.
You can build a similar profile from academic papers, and then rank a given EJMR message by the probability it has been written by a given author (due to style similarity).
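Here is a minimal sketch of that ranking idea in Python, assuming a function-word frequency profile and cosine similarity. The word list, the function names, and the `papers_by_author` input are hypothetical choices of mine. The score is a style-similarity number, not a true probability.

```python
import math
import re
from collections import Counter

# Small hand-picked list for illustration; real studies use far longer lists.
FUNCTION_WORDS = ["the", "of", "and", "to", "in", "that", "is", "it",
                  "we", "which", "however", "thus"]

def profile(text):
    """Relative frequency of each function word in the text."""
    words = re.findall(r"[a-z']+", text.lower())
    counts = Counter(words)
    total = len(words) or 1  # avoid division by zero on empty text
    return [counts[w] / total for w in FUNCTION_WORDS]

def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def rank_authors(message, papers_by_author):
    """Return (author, similarity) pairs, most similar style first.

    papers_by_author: dict mapping an author label to a list of paper texts.
    """
    msg = profile(message)
    scores = {
        author: cosine(msg, profile(" ".join(texts)))
        for author, texts in papers_by_author.items()
    }
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
```

Keep in mind that a short forum post gives very few function words to count, so scores from something this simple would be noisy.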
None of that is new. You can search for the keyword "de-anonymization" and see that people have been using those techniques for many years to try to de-anonymize darknets, social media, etc., with varying success.
-
They will come up with shocking allegations like:
There is a 90% chance that there is a poster or posters somewhere in New York who on average is slightly aggressive to DEI policies…
Or…..
There is a non-zero percent chance that someone in the sacred zip code may have said something hostile towards people who voted for the former President….
We cannot be 100% sure about anything, but our cutesy approach with cookies, hashtags, and IP code mumbo jumbo allows us to make such astute and critical assertions.
-
How would this work for “toxic” posts, which is what they claim to focus on? There doesn't seem to be much information conveyed in posts like: “[certain ethnic group] are smelly and need to stop invading the West”