Tuesday, August 8, 2006
While text mining 330,000 New York Times articles
poses an interesting challenge, it's not as interesting as sifting
through 70 million words (from over 70,000 unique documents) found in
the Congressional Record. A team of political science researchers has done just that (PDF), and found that their software was able to answer questions too difficult for humans to handle on their own.
The Congressional Record is a unique source of political
information. It contains verbatim transcripts of floor speeches made in
both the House and the Senate and provides a view of political debate
far more nuanced than the one provided by election returns, opinion
polls, and vote counts.
But how to make use of that vast treasure trove of words? The research team notes that the Record has rarely been used as a source for analysis because "it contains too much information
to absorb manually." Even with a large team of grad students at their
disposal, researchers find it difficult to tag more than a small subset
of the speeches in question, and computers have not traditionally been
useful for mining text.
12:50:44 PM
Tracking the Congressional Attention Span. Turismo writes "Ars Technica covers a new research project that uses computers to look at 70 million words from the Congressional Record. The project's goal was to track what our representatives were talking about at any given time, and researchers were able to do it without human training or intervention. From the article: '...researchers found, for instance, that "judicial nominations" have consumed steadily more Congressional attention between 1997 and 2004. In fact, the topic produced the most number of words published in a single "day" of the Congressional Record: 230,000 on November 12, 2003.' It looks like automated topic analysis has truly arrived." [Slashdot]
Editor: Just remember, not everything in the Congressional Record was actually said aloud by the member it is attributed to. Members are allowed to insert large amounts of prepared written statements in such a way that you can't tell the difference.
12:48:15 PM
Did AOL Betray Its Users? In what could be a breach of federal law, AOL releases search logs on 650,000 users to researchers. While the files are down, they are in the wild and lawyers may be circling. In 27B Stroke 6. [Wired News: Top Stories]
12:27:56 PM
Ray Beckerman of Recording Industry vs. The People put together an article that explains how the RIAA's militant enforcement arm, its
legal team, finds, obtains records on, and sues ISP account holders who may
or may not have ever been users of P2P applications. It's a great
reference, but (no offense intended to Ray) it's dry like a
bread sandwich.
I decided to take a stab at rewriting it in something closer to English than lawyer, in hopes that it would be more accessible.
So, with thanks to Ray Beckerman, let's take a look at The RIAA vs. John Doe, in what I hope serves as a layperson's guide to filesharing lawsuits.
11:24:35 AM
We probably shouldn't hold our breath waiting for the civil
liberties implications of this to dawn on Gordon, but the complexities
and impracticalities of actually doing it will likely come to his
attention sooner. How would the check be set up? Would warrants on the
police national computer be matched by an automatic flagging of the
individual on the NIR? No, because the police don't necessarily want
everybody to know who they're looking for, and the 'automagic' linking
would be a pig to set up, considering the current state of police
systems. What would happen when a fugitive was IDed at POS? Tricky one
this - you can't safely alert the checkout operative, or the
potentially dangerous terrorist currently buying a kumquat. So it has
to be an alert tripped at the NIR level and then a further alert has to
go to the police response centre covering the area, then a patrol
vehicle has to be alerted... Need we go on? By the time it gets to the
response centre you need to have time, location, name and nature of the
suspect, and he'll be long gone.
Aside from the obvious technical issues, there's the problem of
convincing businesses - what's in it for them? Identity fraud, the
Government keeps telling us, is a major concern (but apparently not
major enough to warrant the Government measuring it properly)
and needs to be fought. Banks, credit card companies and major
retailers however aren't automatically going to line up behind 'rock
solid ID' at any cost, and nor will their customers. Yes, ID
fraud is a cost to business and an inconvenience for the victims, but
the costs are bearable, and the more security you have in a system, the
more inconvenient it's likely to become. So there's a pretty strong
argument that businesses think that they've got just about the right
level of security now, and that they can keep losses within boundaries
and absorb them as a cost of business. If an ID check at POS didn't
take any time and was 100 per cent reliable and didn't require new hardware investment and cost virtually nothing, then maybe they'd see it as useful. Otherwise?
In addition to this, businesses aren't likely to want to trust the
accuracy, reliability and security of Government systems. The banks and
credit card companies have run customer databases for years, generally
fairly effectively and with relatively few security breaches. More
recently the supermarkets have got fairly cute at running loyalty
schemes, and while these can be vaguely sinister, they're voluntary,
and there are limits to what the supermarkets can do with them without
triggering massive PR disasters. Government, on the other hand, has
shown itself incapable of getting absentee parents to pay for their
children's upkeep, while Gordon Brown's own department is the one that
gives away money on the Internet after massive ID theft from a Government department.
Really, no sensible business that knows what it's doing as regards
networks and personal data is going to want to play with these people
unless the law forces it to.
11:20:17 AM
The UK's Total Surveillance. Budenny writes "The Register has a story in its ongoing coverage of the UK ID Card story. This one suggests, with links to a weekend news story, that the Prime Minister in waiting has bought the idea that all electronic transactions in the UK should be linked to a central government/police database. Every cash withdrawal, every credit card purchase, every loyalty card use ... And that data should flow back from the police database to (eg) a loyalty card use. So, for example, not only would the government know what books you were buying, but the bookstore would also know if you had an outstanding speeding ticket!" [Slashdot: Your Rights Online]
11:14:01 AM
Will AOL Flap Help Privacy Awareness? Might AOL's release of the logs of nearly 20 million web searches documenting three months of activity by 650,000 AOL users serve to raise awareness of the privacy concerns with web search surveillance (that I've been writing about forever)? Seth Finkelstein hopes so, but also warns that the potential abuse of the released data by hackers and big business might be even worse than what we were concerned about when the DOJ asked for it:
AOL has just given us the world's biggest real-world experiment as to whether privacy invasion can be done from search-engine data. Previously, when discussing the Google Search subpoena, all people could do was speculate - the data might have this, it could include that, maybe possibly someone could do this from it. Now we have both a huge amount of data, and many interested geeks playing with it and mining it.
I joked we'll now see a huge distributed reverse-engineering collaborative effort to track down as many anonymous user ID's as possible. At least, I hope that was joke. Maybe it wasn't.
Note this data is being far, far, more widely released than the subpoena data, which would have been under confidentiality agreements and protective orders. Worrying about Big Government can be a distraction over far worse Big Corporations.
[michaelzimmer.org]
10:59:24 AM
© Copyright 2006 Paul Hardwick.
Last update: 9/2/06; 4:21:33 AM.