This is the source code for the first assignment of the University of Southern Maine's Text Mining course.
First, clone the repository:
git clone https://github.com/jgore077/COS-470-Assignment-1
Before running the program, install the requirements:
pip install -r requirements.txt
After the requirements have been installed, run scraping.py to collect the data:
python scraping.py
Once all the data has been collected, run analysis.py:
python analysis.py
Optionally, to see what the WordPiece tokenizer does to the supplied sentence, run:
python wordpiece.py
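For context, WordPiece tokenization is commonly described as a greedy longest-match-first split of each word into subword pieces, with non-initial pieces marked by a `##` prefix. The sketch below is not the code in wordpiece.py. It is a minimal illustration that assumes a tiny made-up vocabulary (`TOY_VOCAB`) and a hypothetical helper name (`wordpiece_tokenize`); a real tokenizer uses a learned vocabulary with tens of thousands of entries.

```python
# Toy vocabulary, invented for illustration only (not taken from the repository).
TOY_VOCAB = {"[UNK]", "un", "##aff", "##able", "##ly", "play", "##ing", "the"}


def wordpiece_tokenize(word, vocab=TOY_VOCAB, unk_token="[UNK]"):
    """Split one word into subword pieces by greedy longest-match-first."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Shrink the candidate from the right until it appears in the vocabulary.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # mark non-initial pieces
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            # Nothing matched at this position: the whole word becomes unknown.
            return [unk_token]
        pieces.append(match)
        start = end
    return pieces


if __name__ == "__main__":
    for w in ["unaffable", "playing", "godfather"]:
        print(w, wordpiece_tokenize(w))
```

With this toy vocabulary, a word is split only if every position can be covered by a known piece; otherwise the whole word falls back to the unknown token.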
After running these steps you will have 2 HTML files containing the movie scripts, 2 text files containing Michael Corleone's dialogue, 2 text files containing the words he has spoken, and 2 word clouds for each movie.