After running the Billy the Kid piece through the wringer a few times, we extracted some promising patterns about simulacra, biblical imagery, lost pilots and the rugged individualism of the American psyche. Now let's use those patterns as the basis for a set of filters on the rest of the data set, i.e. the full text of Daumier, by Donald Barthelme.
By treating the first excerpt as a representative element of the whole space (the text), we can focus tightly on behaviors that we suspect might turn up throughout the data.
Wednesday, February 11, 2009
3 comments:
As a side note, extracting these patterns automatically is part of the machine learning task known as generative clustering. We start out with a blank slate, assuming nothing about the data set, and from there, using only the properties of the data (in this case text), we figure out the patterns and recurring themes that bind the whole thing together.
One of the filters I am working on now does the same thing: it builds an orthogonal basis with no assumption of behavior or shape (as opposed to, e.g., the sines and cosines of a Fourier basis). Are you guys using singular spectrum analysis or something similar?
The method I'm using is known as the Chinese Restaurant Franchise formulation of the Hierarchical Dirichlet Process (long name, I know; I could probably think of something shorter, but I'm busy). The primary reference can be found here. It's all about doing unsupervised clustering of raw data and inferring the correct number of clusters, rather than assuming it a priori.
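To give a feel for how the number of clusters can be inferred rather than fixed up front, here is a minimal sketch of the plain Chinese Restaurant Process, the single-level building block behind the franchise formulation. It is not the full Hierarchical Dirichlet Process used on the text: there is no sharing of clusters across documents and no likelihood term for the data. The function name `crp_assign` and the concentration parameter `alpha` are assumptions made for illustration.

```python
import random

def crp_assign(n_customers, alpha, seed=0):
    """Seat customers one at a time via the Chinese Restaurant Process.

    Illustrative sketch only: this samples from the CRP prior and
    ignores the data, so it is not a full clustering algorithm.
    alpha must be positive.
    """
    rng = random.Random(seed)
    table_sizes = []  # table_sizes[k] = number of customers at table k
    seating = []      # seating[i] = table index chosen by customer i
    for _ in range(n_customers):
        # An existing table k is chosen with weight table_sizes[k];
        # a brand-new table is opened with weight alpha.
        weights = table_sizes + [alpha]
        k = rng.choices(range(len(weights)), weights=weights)[0]
        if k == len(table_sizes):
            table_sizes.append(1)   # open a new table (new cluster)
        else:
            table_sizes[k] += 1     # join an existing table
        seating.append(k)
    return seating, table_sizes

seating, table_sizes = crp_assign(100, alpha=1.0)
print(len(table_sizes), "tables for", len(seating), "customers")
```

The number of tables is never passed in; it grows as customers arrive, and a larger `alpha` tends to produce more tables. In the franchise version, several such restaurants (one per document, say) draw their tables' dishes from a shared menu, which is what lets themes recur across the whole text.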