What is tm.plugin.koRpus?

While the koRpus package focuses mostly on the analysis of individual texts, tm.plugin.koRpus adds several new object classes and corresponding methods that can be used to analyse complete text corpora in a single step. These classes are also a first step towards combining the object classes of both the koRpus and tm packages.

At the core of this package is one particular object class – kRp.hierarchy – which can be used to construct hierarchically nested object trees. That is, you are able to categorise corpora on as many levels as you need. The examples in this vignette use two levels: one for the different topics the texts in the sample corpus deal with, and one for the different sources the texts come from.

If you don’t need these hierarchical levels, you can simply call readCorpus() to create a corpus object, which represents a plain collection of texts. To distinguish texts that come from different sources or deal with different topics, use the hierarchy argument, which generates nested sub-corpora for each source and topic given. The resulting objects are only valid if there are texts for each topic from each source.

Now, if this still confuses you, let’s look at a small example.

Tokenizing corpora

As with koRpus, the first step of text analysis is tokenizing and possibly POS tagging. This step is performed by the readCorpus() function mentioned above. The package includes four sample texts taken from Wikipedia[1] in its tests directory, which we can use for a demonstration:

library(tm.plugin.koRpus)
library(koRpus.lang.de)
# set the root path to the sample files
sampleRoot <- file.path(path.package("tm.plugin.koRpus"), "tests", "testthat", "samples")
# the next call uses "hierarchy" to describe the directory structure
# and its meaning; see below
sampleTexts <- readCorpus(
  dir=sampleRoot,
  hierarchy=list(
    Topic=c(
      C3S="C3S SCE",
      GEMA="GEMA e.V."
    ),
    Source=c(
      Wikipedia_alt="Wikipedia (alt)",
      Wikipedia_neu="Wikipedia (neu)"
    )
  ),
  tagger="tokenize",
  lang="de"
)
processing corpus 
  Topic "C3S SCE"
    Source "Wikipedia (alt)", 1 text...
    Source "Wikipedia (neu)", 1 text...
  Topic "GEMA e.V."
    Source "Wikipedia (alt)", 1 text...
    Source "Wikipedia (neu)", 1 text...

The hierarchy argument

The hierarchy argument describes our corpus in a very condensed format. It is a named list of named character vectors, where each list entry represents a hierarchical level. In this case, the top level is called “Topic”; below that is the level “Source”. These hierarchical levels must also be reflected in the directory structure of the texts to parse, and the names of the character vector elements must be identical to the directory names below the root directory specified by dir.[2]

So on your file system, what the hierarchy argument above describes is the following layout:

.../samples/
  C3S/
    Wikipedia_alt/
      Text01.txt
      Text02.txt
      ...
    Wikipedia_neu/
      Text03.txt
      Text04.txt
      ...
  GEMA/
    Wikipedia_alt/
      Text05.txt
      Text06.txt
      ...
    Wikipedia_neu/
      Text07.txt
      Text08.txt
      ...

Since we’re using the koRpus package for all actual analysis, you can also set up your environment with set.kRp.env() and POS-tag all texts with TreeTagger.[3]
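For illustration, here is a minimal sketch of such a setup; the TreeTagger installation path ("~/bin/treetagger") and the preset are assumptions that you would need to adjust to your own system:

```r
library(koRpus)
library(koRpus.lang.de)
# point koRpus to a local TreeTagger installation;
# the path below is a hypothetical example, adjust as needed
set.kRp.env(
  TT.cmd="manual",
  lang="de",
  TT.options=list(
    path="~/bin/treetagger",
    preset="de"
  )
)
# readCorpus() can then use these settings via the "kRp.env" tagger
sampleTexts <- readCorpus(
  dir=sampleRoot,
  hierarchy=list(
    Topic=c(C3S="C3S SCE", GEMA="GEMA e.V."),
    Source=c(Wikipedia_alt="Wikipedia (alt)", Wikipedia_neu="Wikipedia (neu)")
  ),
  tagger="kRp.env",
  lang="de"
)
```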

Analysing corpora

After the initial tokenizing, we can analyse the corpus by calling the provided methods, for instance lex.div() for lexical diversity:

sampleTexts <- lex.div(sampleTexts)
corpusSummary(sampleTexts)
                    corpus     Topic          Source              doc_id
C3SWikipedia_alt01           C3S SCE Wikipedia (alt)  C3SWikipedia_alt01
C3SWikipedia_neu01           C3S SCE Wikipedia (neu)  C3SWikipedia_neu01
GEMAWikipedia_alt01        GEMA e.V. Wikipedia (alt) GEMAWikipedia_alt01
GEMAWikipedia_neu01        GEMA e.V. Wikipedia (neu) GEMAWikipedia_neu01
                    stopwords    a    C CTTR   HDD     K lgV0 MATTR MSTTR
C3SWikipedia_alt01         NA 0.16 0.95 6.13 38.14 49.92 6.21  0.81  0.79
C3SWikipedia_neu01         NA 0.17 0.94 6.82 38.05 54.88 6.10  0.82  0.76
GEMAWikipedia_alt01        NA 0.17 0.94 7.07 37.61 65.08 6.11  0.80  0.78
GEMAWikipedia_neu01        NA 0.16 0.94 7.13 37.87 60.14 6.24  0.81  0.79
                      MTLD MTLDMA     R    S  TTR     U
C3SWikipedia_alt01  100.16     NA  8.68 0.93 0.78 39.92
C3SWikipedia_neu01  123.01     NA  9.65 0.92 0.73 36.46
GEMAWikipedia_alt01 106.94    192 10.00 0.92 0.71 35.96
GEMAWikipedia_neu01 111.64     NA 10.08 0.92 0.73 37.47

As you can see, corpusSummary() returns a data.frame object with the summarised results of all texts.

There are quite a number of corpus*() getter/setter methods for slots of these objects, e.g., corpusReadability() to get the readability() results from objects of class kRp.hierarchy.
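For instance, continuing the session above, one could run a readability analysis and then fetch the full result objects from the corresponding slot. Note that corpusLexDiv() is an assumed name, inferred by analogy with corpusReadability(); check the package index for the exact getter names:

```r
# run a readability analysis on the corpus (results are stored in a slot)
sampleTexts <- readability(sampleTexts)
# fetch the full readability() result objects from that slot
readabilityResults <- corpusReadability(sampleTexts)
# the lex.div() results from earlier can be fetched the same way;
# corpusLexDiv() is assumed here by analogy with corpusReadability()
lexDivResults <- corpusLexDiv(sampleTexts)
```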

As we explained earlier, the nested S4 object classes used by tm.plugin.koRpus are rather complex. Two methods can be especially helpful for further analysis. The first is tif_as_tokens_df(), which returns a data.frame including all texts of the tokenized corpus in a format compatible with the Text Interchange Formats standard.
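A short sketch of how this could be used in the running session:

```r
# export the tokenized corpus as one large data.frame, one row per token,
# in a layout compatible with the Text Interchange Formats (TIF) standard
sampleTextsDf <- tif_as_tokens_df(sampleTexts)
# inspect the first tokens together with their document IDs
head(sampleTextsDf)
```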

The second is a family of [, [<-, [[ and [[<- shortcuts to directly interact with the data.frame object you get via corpusSummary(). Here’s an example of how to use this to plot interactions:

library(sciplot)
lineplot.CI(
  x.factor=sampleTexts[["Source"]],
  response=sampleTexts[["MTLD"]],
  group=sampleTexts[["Topic"]],
  type="l",
  main="MTLD",
  xlab="Media source",
  ylab="Lexical diversity score",
  col=c("grey", "black"),
  lwd=2
)