ABSTRACT
Recent events have made it clear that some kinds of technical
texts, generated by machine and essentially meaningless,
can be confused with authentic technical texts written by
humans. We identify this as a potential problem, since no
existing systems, such as those for the web, can or do discriminate on
this basis. We believe that there are subtle short- and long-range
word or even string co-occurrences present in human
texts, but not in many classes of computer-generated texts,
that can be used to discriminate based on meaning. In
this paper we employ universal lossless source coding
algorithms to generate features in a high-dimensional space
and then apply support vector machines to discriminate
between the classes of authentic and inauthentic texts.
Compression profiles for the two kinds of text are distinct:
the authentic texts are bounded by various classes of more
compressible or less compressible texts that are computer
generated. This in turn led to the high prediction accuracy
of our models, which supports our conjecture that there exists
a relationship between meaning and compressibility. Our
results show that the learning algorithm based upon the
compression profile outperformed standard term-frequency
text categorization schemes on several non-trivial classes of
inauthentic texts.
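
As a rough illustration of what a compression profile might look like, the sketch below maps a text to a small vector of compression ratios. It is a minimal sketch under stated assumptions, not the feature construction used in this paper: the choice of coders (zlib, bz2, and lzma from the Python standard library), the function name compression_profile, and the two toy inputs are all hypothetical and chosen only to make the idea concrete.

```python
import bz2
import lzma
import random
import zlib


def compression_profile(text: str) -> list[float]:
    """Return compressed-size / original-size ratios under three coders.

    Assumption: three stdlib compressors stand in for the universal
    lossless source coding algorithms; the paper's actual coders and
    feature dimensions are not specified here.
    """
    data = text.encode("utf-8")
    if not data:
        raise ValueError("text must be non-empty")
    n = len(data)
    return [
        len(zlib.compress(data, 9)) / n,  # LZ77 + Huffman coding
        len(bz2.compress(data, 9)) / n,   # Burrows-Wheeler based
        len(lzma.compress(data)) / n,     # LZMA with the default preset
    ]


# Toy inputs invented for illustration: a highly repetitive string and
# a string of uniformly random characters of the same length.
rng = random.Random(0)
repetitive = "the cat sat on the mat. " * 200
random_chars = "".join(
    rng.choice("abcdefghijklmnopqrstuvwxyz ") for _ in range(len(repetitive))
)

profile_repetitive = compression_profile(repetitive)
profile_random = compression_profile(random_chars)
```

The two toy strings sit at opposite extremes of compressibility, echoing the observation that authentic texts fall between more compressible and less compressible machine-generated classes. Vectors of this kind, computed for labeled texts, could then be passed to a support vector machine for training.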