A pre-processor that converts the Reuters-21578 dataset from SGM format to TSV, following the ApteMod train/test split. It keeps only the documents that belong to at least one category with at least one document in both the training set and the test set. The resulting dataset has 90 categories, with 7,769 training documents and 3,019 test documents.
Contains code snippets from ankailou/reuters-preprocessing.
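
The category filter described above can be illustrated with a minimal sketch. This is not the code from this repository: the `Doc` record, the function names, and the TSV column layout are hypothetical, and it assumes the SGM files have already been parsed into records with a split and a list of topics. It also assumes that labels outside the shared categories are dropped from the documents that are kept.

```python
import csv
from typing import Iterable, List, NamedTuple, Set, TextIO, Tuple


class Doc(NamedTuple):
    """Hypothetical record for one parsed Reuters document."""
    doc_id: str
    split: str                 # "train" or "test"
    topics: Tuple[str, ...]    # category labels
    text: str


def filter_by_shared_categories(docs: Iterable[Doc]) -> Tuple[List[Doc], Set[str]]:
    """Keep documents with at least one category seen in both splits."""
    docs = list(docs)
    train_cats = {t for d in docs if d.split == "train" for t in d.topics}
    test_cats = {t for d in docs if d.split == "test" for t in d.topics}
    shared = train_cats & test_cats

    kept = []
    for d in docs:
        # Assumption: labels outside the shared set are discarded.
        topics = tuple(t for t in d.topics if t in shared)
        if topics:
            kept.append(d._replace(topics=topics))
    return kept, shared


def write_tsv(docs: Iterable[Doc], out: TextIO) -> None:
    """Write one document per row; the column layout is an assumption."""
    writer = csv.writer(out, delimiter="\t", lineterminator="\n")
    writer.writerow(["id", "split", "topics", "text"])
    for d in docs:
        # Collapse tabs and newlines so each document stays on one row.
        text = " ".join(d.text.split())
        writer.writerow([d.doc_id, d.split, ",".join(d.topics), text])
```

Computing the shared category set before filtering means a document is kept or dropped based on the whole corpus, not on the order in which documents are read.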