On the “Anonymity” of the Facebook Dataset - Via michaelzimmer.org:
A group of researchers have released a dataset of Facebook profile information from a group of college students for research purposes, which I know a lot of people will find quite valuable. (Thanks to Fred Stutzman for bringing it to my attention.)
Here is the description from the Berkman Center’s announcement:
The dataset comprises machine-readable files of virtually all the information posted on approximately 1,700 FB profiles by an entire cohort of students at an anonymous, northeastern American university. Profiles were sampled at one-year intervals, beginning in 2006. This first wave covers first-year profiles, and three additional waves of data will be added over time, one for each year of the cohort’s college career.
Though friendships outside the cohort are not part of the data, this snapshot of an entire class over its four years in college, including supplementary information about where students lived on campus, makes it possible to pose diverse questions about the relationships between social networks, online and offline.
Access to the dataset requires the submission of a research statement (which I haven’t yet done), but the codebook is publicly available.
Of course, this sounds like an AOL-search-data-release-style privacy disaster waiting to happen. Recognizing this, the researchers detail some of the steps they’ve taken to try to protect the privacy of the subjects, including:
[...]
First, as the AOL debacle taught us, one might think “all identifying information” has been deleted, but often random bits of our data trail that alone seem anonymous can be pieced together, possibly exposing clues to our identity. The fact that the dataset includes each subject’s gender, race, ethnicity, hometown state, and major makes it increasingly possible that individuals could be identified. For example, if the data reveals that student #746 is a white Bulgarian male from Montana, majoring in East Asian Studies, there probably aren’t that many who fit such a description. Re-identification may be unlikely, but the anonymization is not bullet-proof.
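To make the concern concrete, here is a minimal sketch of how one might check whether a combination of seemingly harmless attributes singles anyone out. The field names and the three toy records are invented for illustration; they are assumptions, not the codebook's actual variable names or any real data.

```python
from collections import Counter

# Hypothetical field names standing in for the attributes listed above.
QUASI_IDENTIFIERS = ("gender", "race", "ethnicity", "home_state", "major")


def group_sizes(records, keys=QUASI_IDENTIFIERS):
    """Count how many records share each combination of attribute values."""
    return Counter(tuple(r[k] for k in keys) for r in records)


def unique_subjects(records, keys=QUASI_IDENTIFIERS):
    """Return the ids of records whose attribute combination appears only once."""
    sizes = group_sizes(records, keys)
    return [r["id"] for r in records if sizes[tuple(r[k] for k in keys)] == 1]


# Toy, made-up records (not drawn from the dataset).
records = [
    {"id": 1, "gender": "F", "race": "white", "ethnicity": "Irish",
     "home_state": "MA", "major": "Economics"},
    {"id": 2, "gender": "F", "race": "white", "ethnicity": "Irish",
     "home_state": "MA", "major": "Economics"},
    {"id": 746, "gender": "M", "race": "white", "ethnicity": "Bulgarian",
     "home_state": "MT", "major": "East Asian Studies"},
]

# The smallest group size is the "k" in k-anonymity; k == 1 means at least
# one subject is unique on these attributes alone.
k = min(group_sizes(records).values())
singled_out = unique_subjects(records)
```

In this toy example the first two records hide each other, while the third is alone in its group, which is the situation the student #746 example describes.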
[...]
But one more thing…
Since I first saw the press release for this dataset, I’ve been bothered by the description of the data as “approximately 1,700 FB profiles by an entire cohort of students at an anonymous, northeastern American university.”
Right off the bat, the source university loses full anonymity since it is identified as being in the northeastern US. Further, according to the codebook, this is a private, co-ed institution, whose class of 2009 initially had 1640 students in it.
A quick search for schools reveals there are only 7 private, co-ed colleges in New England states (CT, ME, MA, NH, RI, VT) with total undergraduate populations between 5000 and 7500 students (a likely range if there were 1640 in the 2006 freshman class): Tufts University, Suffolk University, Yale University, University of Hartford, Quinnipiac University, Brown University, and Harvard College. (The total bumps up to about 18 if we include NY and NJ.)
Is one of these the source?
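The narrowing-down above is just a filter over a few public attributes. The sketch below shows the idea under stated assumptions: the `School` type, the `candidates` function, and the placeholder schools with their enrollment figures are all hypothetical, and the "class size times four" estimate is a rough rule of thumb rather than anything taken from the codebook.

```python
from dataclasses import dataclass

NEW_ENGLAND = {"CT", "ME", "MA", "NH", "RI", "VT"}


@dataclass(frozen=True)
class School:
    name: str
    state: str
    private: bool
    coed: bool
    undergrads: int


def rough_undergrad_total(first_year_class, years=4):
    """Crude estimate: assume every class year is about the same size."""
    return first_year_class * years


def candidates(schools, states=NEW_ENGLAND, low=5000, high=7500):
    """Keep private, co-ed schools in the given states within the size range."""
    return [
        s for s in schools
        if s.state in states and s.private and s.coed
        and low <= s.undergrads <= high
    ]


# Placeholder schools with invented figures, for illustration only.
schools = [
    School("Example College A", "MA", True, True, 6500),
    School("Example College B", "CT", True, True, 12000),   # too large
    School("Example College C", "NY", True, True, 6000),    # outside New England
    School("Example State University", "VT", False, True, 7000),  # public
]

estimate = rough_undergrad_total(1640)
shortlist = [s.name for s in candidates(schools)]
```

With a first-year class of 1640, the estimate comes to 6560 undergraduates, which sits inside the 5000 to 7500 range used for the search; widening `states` to include NY and NJ enlarges the shortlist in the same way the parenthetical above describes.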
(Read Original Article - Via michaelzimmer.org.)