Friday, August 4, 2006
Topic Mining: Digging Value from News Archives
Sorting unstructured information by topic is harder than you might think. People do it well, but so far machines have had trouble with this task.
Dr. Padhraic Smyth, one of UC-Irvine's topic-mining researchers. (Photo: Univ. CA-Irvine)
According to the popular tech blog Ars Technica, researchers at the Univ. of Calif., Irvine recently used software to sort 330,000 archived New York Times articles from 2000 to 2002.
This software was designed "to find patterns of words which occurred together. ...Once these word patterns were indexed, the software then turned them into topics and was able to construct a map of such topics over time. The team's example is a set of words that tended to appear in the same article: rider, bike, race, and Lance Armstrong. The topic for this story would obviously be the Tour de France, and the software could use its word patterns to chart how often the bike race was discussed in the newspaper."
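For the curious, here is a toy Python sketch of the general idea in that quote: count which words tend to show up in the same article, then chart how often a cluster of words appears each year. To be clear, this is not the UCI team's software (they used statistical topic modeling on a far larger scale). The sample articles, the stopword list, and the function names are all made up for illustration.

```python
from collections import Counter
from itertools import combinations

# Hypothetical toy archive: (year, text) pairs standing in for news articles.
ARTICLES = [
    (2000, "the rider won the bike race in france"),
    (2000, "the senate passed the budget bill"),
    (2001, "armstrong the rider led the bike race again"),
    (2001, "a bike race rider crashed near paris"),
    (2002, "the budget bill stalled in the senate"),
]

# Assumed, deliberately tiny stopword list for this example only.
STOPWORDS = {"the", "in", "a", "near", "again"}


def tokenize(text):
    """Lowercase, split on whitespace, and drop stopwords; returns a set."""
    return {w for w in text.lower().split() if w not in STOPWORDS}


def cooccurrence_counts(articles):
    """Count how many articles each unordered word pair appears in together."""
    counts = Counter()
    for _, text in articles:
        words = sorted(tokenize(text))  # sorted so each pair has one canonical order
        counts.update(combinations(words, 2))
    return counts


def topic_by_year(articles, topic_words, min_hits=2):
    """Count, per year, the articles containing at least min_hits topic words."""
    per_year = Counter()
    for year, text in articles:
        if len(tokenize(text) & topic_words) >= min_hits:
            per_year[year] += 1
    return dict(per_year)


pairs = cooccurrence_counts(ARTICLES)
cycling = {"rider", "bike", "race"}  # word cluster picked by hand here
timeline = topic_by_year(ARTICLES, cycling)
```

In this sketch the "topic" word cluster is picked by hand after looking at the pair counts. A real topic model would infer such clusters from the data, which is the hard part.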
According to a Univ. Calif. press release, "UCI researchers didn't invent topic modeling, but they developed a technique that allows the technology to be used on huge document collections. They also are among the first to demonstrate its ease and effectiveness by applying it to a newspaper archive."
What's not clear is how accurate this automated sorting was compared to human-compiled topic maps of the same content. Still, even if this technique is less accurate than what librarians can do, it could be a time-saving starting point for indexing large document collections, much like Podzinger and Podscope (mentioned earlier).
Now I wonder if we could tweak this software to comb through the Federal Register, Congressional Record, or the Thomas legislative database to quickly locate all the buried riders and clauses on particular issues, regardless of how cryptically they're phrased... Well, I can dream...
(UPDATE AUG. 7: It turns out that another team of researchers is trying to mine the Congressional Record.)