Data Engineering project.
We have a collection of N documents. In our case, there is one file per document, and the file name is simply the document's index number, as shown in figure 1.
We want a dictionary that maps each word contained in the documents to a unique identifier, **word_id**, as shown in figure 2.
Using both the dataset and the dictionary, we can build a reverse index that gives, for each word, the list of documents in which the word appears; see figure 3.2.
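As a concrete illustration of these three structures, here is a hypothetical miniature (the documents, words, and identifiers below are invented for the example, not taken from the actual dataset):

```python
# Hypothetical miniature of the three structures described above:
documents = {0: "the cat sat", 1: "the dog ran"}                  # file name -> contents
dictionary = {"the": 0, "cat": 1, "sat": 2, "dog": 3, "ran": 4}   # word -> word_id
reverse_index = {0: [0, 1], 1: [0], 2: [0], 3: [1], 4: [1]}       # word_id -> doc ids
```

For instance, word_id 0 ("the") appears in documents 0 and 1, so the reverse index maps it to the ordered list [0, 1].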
We want a solution that works for a large data set.
These are the 4 steps of the algorithm:
- Read the documents and emit each pair (**wordId**, **documentId**)
- Sort the pairs by **wordId**, then by **documentId**
- For each **wordId**, group the pairs to obtain the list of documents containing the word
- Merge the intermediate results to obtain the final reverse index
See figure 4 for the steps of the algorithm.
**Attention**: the index must be ordered by word identifier and, for each word identifier, its list of document identifiers must also be ordered.
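The four steps above can be sketched in pure Python on an in-memory toy input (the dictionary and documents below are hypothetical placeholders; a real implementation would read the files from the dataset directory):

```python
from itertools import groupby
from operator import itemgetter

# Hypothetical inputs: a word_id dictionary and the documents' contents.
dictionary = {"cat": 0, "dog": 1, "the": 2}
documents = {0: "the cat", 1: "the dog", 2: "cat dog"}

# Step 1: read the documents and emit every (wordId, documentId) pair.
pairs = [
    (dictionary[word], doc_id)
    for doc_id, text in documents.items()
    for word in text.split()
]

# Step 2: sort the pairs by wordId, then by documentId.
pairs.sort()

# Steps 3-4: group the sorted pairs by wordId and merge. Because the
# pairs are sorted, the index comes out ordered by word identifier and
# each document list is ordered too, as required.
reverse_index = [
    (word_id, sorted({doc_id for _, doc_id in group}))
    for word_id, group in groupby(pairs, key=itemgetter(0))
]
```

The set comprehension also deduplicates a word that appears several times in the same document, so each document id is listed at most once per word.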
You will find the document dataset in the *dataset* directory of the repository on GitHub.
- Implement a *Job* to build the dictionary. Each line of the output file must be a word followed by its identifier (**word_id**). In the documents stored in the *dataset* directory, words are separated by one or more spaces.
- Implement one or more *Jobs* to build the reverse index.
You must use either **PySpark or pure Python** for these *Jobs*.
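A minimal pure-Python sketch of the two *Jobs* follows. It assumes, as stated above, that each file in the dataset directory is named with the document's index number and that words are separated by one or more spaces; the function names and the throwaway demo directory are illustrative, not part of the assignment:

```python
import os
import tempfile
from itertools import groupby
from operator import itemgetter

def build_dictionary(dataset_dir):
    """Dictionary job: assign a unique word_id to each distinct word.
    str.split() with no argument handles runs of one or more spaces."""
    dictionary = {}
    for name in sorted(os.listdir(dataset_dir)):
        with open(os.path.join(dataset_dir, name)) as f:
            for word in f.read().split():
                if word not in dictionary:
                    dictionary[word] = len(dictionary)
    return dictionary

def build_reverse_index(dataset_dir, dictionary):
    """Reverse-index job: emit (word_id, [document ids]), ordered by
    word_id, with each document list ordered as well."""
    pairs = []
    for name in sorted(os.listdir(dataset_dir)):
        doc_id = int(name)  # the file name is the document's index number
        with open(os.path.join(dataset_dir, name)) as f:
            for word in f.read().split():
                pairs.append((dictionary[word], doc_id))
    pairs.sort()
    return [
        (word_id, sorted({doc_id for _, doc_id in group}))
        for word_id, group in groupby(pairs, key=itemgetter(0))
    ]

# Tiny demo on a throwaway directory with hypothetical contents.
demo_dir = tempfile.mkdtemp()
for doc_id, text in {0: "the cat", 1: "the  dog"}.items():
    with open(os.path.join(demo_dir, str(doc_id)), "w") as f:
        f.write(text)

dictionary = build_dictionary(demo_dir)
reverse_index = build_reverse_index(demo_dir, dictionary)
```

In a PySpark version, the per-file loops would typically become transformations on an RDD or DataFrame (e.g. reading the files, flat-mapping to pairs, then sorting and grouping by key), but the four logical steps stay the same.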