This is a repository for DigHum 101, a course that I took in Summer 2020. The course equips students with Data Science tools and knowledge to solve humanities-centered problems in a data-driven manner, and also pushes them to think about the implications their data-driven solutions have on society.
Exploring the biases and assumptions influencing big data in the Digital Humanities: Group Project
The purpose of this group project was to explore the various biases and assumptions at play in the realm of data. As the world grows technologically, socially, and economically every year, a plethora of data is generated every moment, and it is up to us humans to make sense of that data through various tools and techniques: to decipher what story it is trying to tell us, or what problems it is allowing us to pinpoint and solve. However, we are unfortunately prone to confirmation bias, the tendency to process and analyze information in a way that supports one's pre-existing ideas and convictions. This phenomenon plays out very often in the conclusions and solutions we draw from the data we analyze!

My group and I wanted to explore the issue by making an effort to answer three research questions: "To what extent are human bias and assumptions observed within the field of big data and data analytics?", "What are the short-term and long-term consequences of algorithmic bias and data misrepresentation?", and "In what ways can researchers prevent the incorporation of assumptions or bias in order to make reliable conclusions from data?". We used readings like Sculley's "Meaning and Mining: the Impact of Implicit Assumptions in Data Mining for the Humanities", Owens's "Defining Data for Humanists: Text, Artifact, Information, or Evidence", and Boyd's "Critical Questions for Big Data" as references, and went on to connect our conclusions to Nan Z. Da's "The Digital Humanities Debacle". Through our research, we explore how to employ a critical lens while analyzing data, how to extensively explore the data we're working with, how to consider the context of our data, and why it is important to remember that just because our results are interpretable doesn't mean they are accurate.
Overall, doing this project in a group and discussing the problem with my teammates helped me gain clarity on how important it is to approach and accept data analysis results with a skeptical mindset, and not to let personal judgement or bias cloud our thinking while coming to conclusions about the data.
ML techniques for classifying Phishing websites: Individual Project
The premise of this individual project is to explore efficient machine learning driven algorithms for detecting phishing websites. I thought it was crucial to explore this problem because the COVID-19 era has pushed all of us to use the internet more than ever, and, unfortunately, bad actors are responding by using legitimate-looking phishing websites to scam people out of their personal information! It is up to cybersecurity professionals to employ sophisticated tools and methods to protect users' confidentiality and the perceived integrity of the internet. So, in this project I test the efficiency of individual machine learning algorithms like logistic regression, decision trees, and random forest against the "super learner model", an ensemble machine learning algorithm that combines all the models and model configurations you might investigate for a predictive modeling problem. I also decided to use the ROC-AUC metric as a measure of model quality, as opposed to the raw accuracy percentage. Through these techniques I hope to explore methods of finding a convenient ML-based solution for detecting phishing websites, and hence contribute to the bigger pool of ML research in the domain of phishing.
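The comparison described above can be sketched with scikit-learn. This is a minimal illustration, not the project's actual code: it uses synthetic binary-labelled data as a stand-in for the phishing feature set (an assumption), and approximates the super learner with scikit-learn's `StackingClassifier`, which combines base-model predictions through a meta-learner.

```python
# Sketch: individual classifiers vs. a stacked ensemble, scored by ROC-AUC.
# Synthetic data stands in for the real phishing-website features (assumption).
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.metrics import roc_auc_score

# Hypothetical stand-in for extracted website features (0 = legitimate, 1 = phishing).
X, y = make_classification(n_samples=1000, n_features=20, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42)

base_models = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "decision_tree": DecisionTreeClassifier(random_state=42),
    "random_forest": RandomForestClassifier(n_estimators=100, random_state=42),
}

# Stacking approximates the "super learner": out-of-fold predictions from the
# base models are fed to a meta-learner (here, logistic regression).
stack = StackingClassifier(
    estimators=list(base_models.items()),
    final_estimator=LogisticRegression(max_iter=1000),
    cv=5,
)

# ROC-AUC uses the predicted probability of the positive (phishing) class,
# which is more informative than raw accuracy on imbalanced data.
auc_scores = {}
for name, model in {**base_models, "stacking_ensemble": stack}.items():
    model.fit(X_train, y_train)
    proba = model.predict_proba(X_test)[:, 1]
    auc_scores[name] = roc_auc_score(y_test, proba)

for name, auc in sorted(auc_scores.items(), key=lambda kv: -kv[1]):
    print(f"{name}: ROC-AUC = {auc:.3f}")
```

A design note worth stating: ROC-AUC is threshold-independent, so it ranks classifiers by how well they separate phishing from legitimate sites across all decision thresholds, rather than at the single cutoff that raw accuracy implicitly assumes.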