Wednesday, February 11, 2009

The rest of the text.

After running the Billy the Kid piece through the wringer a few times, we extracted some promising patterns about simulacra, biblical imagery, lost pilots and the rugged individualism of the American psyche. Now let's use those patterns as the basis for a set of filters on the rest of the data set, i.e. the full text of "Daumier" by Donald Barthelme.

By treating the first excerpt as a representative element of the whole space (text) we can focus tightly on behaviors that we suspect might turn up throughout the data.

3 comments:

Matthew said...

As a side note, extracting these patterns automatically is part of the machine learning task known as generative clustering. We start with a blank slate, assuming nothing about the data set, and from there, using only the properties of the data (in this case text), we figure out the patterns and recurring themes that bind the whole thing together.

Josh said...

One of the filters I am working on now does the same thing: build an orthogonal basis with no assumption of behavior or shape (as opposed to, e.g., sines and cosines for Fourier). Are you guys using singular spectrum analysis or something similar?

Matthew said...

The method I'm using is known as the Chinese Restaurant Franchise formulation of the Hierarchical Dirichlet Process (long name, I know; I could probably think of something shorter, but I'm busy). The primary reference can be found here. It's all about doing unsupervised clustering of raw data and inferring the correct number of clusters, rather than assuming it a priori.
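
To give a feel for how the number of clusters can fall out of the data rather than being fixed up front, here is a minimal sketch of the plain Chinese Restaurant Process, which is only the single-restaurant building block and not the full franchise or HDP inference described above. The function name `crp_assignments` and the concentration parameter `alpha` are illustrative assumptions, not anything taken from the actual project code.

```python
import random


def crp_assignments(n_customers, alpha, seed=0):
    """Seat customers one at a time under a Chinese Restaurant Process.

    Hypothetical helper for illustration only. alpha > 0 controls how
    readily a new table (cluster) is opened.
    """
    rng = random.Random(seed)
    table_sizes = []   # table_sizes[k] = number of customers at table k
    assignments = []   # assignments[i] = table index chosen by customer i
    for _ in range(n_customers):
        # An existing table is chosen in proportion to its current size;
        # a brand-new table is chosen in proportion to alpha.
        weights = table_sizes + [alpha]
        k = rng.choices(range(len(weights)), weights=weights)[0]
        if k == len(table_sizes):
            table_sizes.append(1)      # open a new table
        else:
            table_sizes[k] += 1        # join an existing table
        assignments.append(k)
    return assignments, table_sizes


if __name__ == "__main__":
    assignments, table_sizes = crp_assignments(100, alpha=1.0)
    print("tables opened:", len(table_sizes))
    print("table sizes:", table_sizes)
```

Nothing in the call says how many tables to expect; the count depends on `alpha` and on the random seating. In the franchise version, as I understand it, each document would be its own restaurant and the restaurants would share a common menu of dishes (topics), which this sketch leaves out.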