The program extracts keywords from an article using techniques from information retrieval and computational linguistics. First, the term frequencies are computed. Next, the inverse document frequencies are incorporated to penalize words that are common across the collection. Finally, the TF-IDF score of each term is calculated, and the top-ranking terms are selected as the article's keywords.
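The pipeline above can be sketched as follows. This is a minimal illustration, not the program's actual implementation: the tokenizer, the raw-count TF normalization, and the plain `log(N/df)` IDF are assumptions, and `tf_idf_keywords` is a hypothetical helper name.

```python
import math
import re
from collections import Counter


def tokenize(text):
    # Lowercase and split on runs of non-letter characters.
    return [w for w in re.split(r"[^a-z]+", text.lower()) if w]


def tf_idf_keywords(documents, doc_index, top_k=3):
    """Rank the terms of documents[doc_index] by TF-IDF and return the top_k."""
    tokenized = [tokenize(d) for d in documents]
    n_docs = len(documents)

    # Document frequency: how many documents contain each term.
    df = Counter()
    for tokens in tokenized:
        df.update(set(tokens))

    tokens = tokenized[doc_index]
    tf = Counter(tokens)

    # TF-IDF: term frequency scaled down by how widespread the term is.
    scores = {
        term: (count / len(tokens)) * math.log(n_docs / df[term])
        for term, count in tf.items()
    }
    return [t for t, _ in sorted(scores.items(), key=lambda kv: -kv[1])[:top_k]]
```

A term that appears in every document gets `log(N/N) = 0`, so ubiquitous words like "the" are pushed to the bottom of the ranking even when they occur often within the article.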
Data Set: The data set consists of 40 arbitrarily selected articles extracted from the English Wikipedia. The articles are stored in 40 separate text files, named "1.txt", "2.txt", ..., "40.txt".
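Loading the data set could look like the sketch below. The directory layout (all 40 files in one folder) and UTF-8 encoding are assumptions, and `load_articles` is a hypothetical helper name.

```python
from pathlib import Path


def load_articles(directory=".", n=40):
    """Read the data set files "1.txt" through "<n>.txt" into a list of strings.

    Assumes the files sit directly in `directory` and are UTF-8 encoded.
    """
    return [
        Path(directory, f"{i}.txt").read_text(encoding="utf-8")
        for i in range(1, n + 1)
    ]
```

Reading the files in numeric order keeps list index `i` aligned with article number `i + 1`, which makes it easy to report which file each extracted keyword set came from.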