rottingresearch / rottingresearch

A project devoted to helping academics and researchers provide robust citations and mitigate link rot.

Home Page: https://rottingresearch.org

License: GNU General Public License v3.0

Python 8.95% CSS 16.31% HTML 67.56% Procfile 0.06% JavaScript 6.78% Dockerfile 0.34%

hacktoberfest flask-application python linkrot flask academia content drift link research

Introduction

A project devoted to helping academics and researchers provide robust citations and mitigate link rot. Visit rottingresearch.org to see it in action.

Mission

Link rot is an established phenomenon that affects everyone who uses the internet. Researchers looking at individual subjects have recently addressed the extent of link rot’s influence on scholarly publications. One recent study found that 36% of all links in research articles were broken. 37% of DOIs, once seen as a tool to prevent link rot, were broken (Miller, 2022).

Rotting Research allows academics and researchers to upload their work and check the reliability of their citations. It extracts all of the links from the document and then checks to see if the link is accessible to the public.

Check out our website at rottingresearch.org.

The status of our services can be observed at status.rottingresearch.org/status/rr.

Installation

Requirements

Python3 (3.10+)
Pip3
Redis

Docker Instructions

Local Development

Set the APP_SECRET_KEY environment variable, for example APP_SECRET_KEY="RANDOM_SECRET_KEY".
Run the Docker containers with docker-compose up --build. Add the -d flag to run the containers in 'detached' mode.
Open 127.0.0.1:8080 in your browser.

Because a Docker volume is used, any changes you make are reflected immediately. To view the container logs, run docker logs -f rottingresearch. The -f flag follows the logs as they are written.

Building Image

Build the Docker image: docker build --tag rottingresearch .
Run the image: docker run -d -p 8080:8080 rottingresearch

Linux/Mac

Clone Repository: git clone https://github.com/rottingresearch/rottingresearch
Change directory to rottingresearch - cd rottingresearch
Run source setup.sh - the script automatically installs the packages and sets up the environment variables

Windows

Clone Repository: git clone https://github.com/rottingresearch/rottingresearch
Change directory to rottingresearch - cd rottingresearch
Install Python Packages: pip3 install -r requirements.txt
Edit app.py and set app.config['UPLOAD_FOLDER'] to a valid temporary folder.
Set the APP_SECRET_KEY environment variable: setx APP_SECRET_KEY "random"
Set the ENV environment variable: setx ENV "DEV"
Run Redis: redis-server
Set the REDIS_URL environment variable: setx REDIS_URL "redis://localhost:6379"
Run the app: python3 app.py
Run the Celery worker: celery -A app:celery_app worker -B
Open 127.0.0.1:8080 in your browser.

Code of Conduct

For our code of conduct, please visit our Code of Conduct page.

License

This program is licensed under the GNU General Public License v3.0 (GPLv3).

rottingresearch's People

aditirao7 anmolag10 joaovictor3g jayeclark ananth-p m-faheem-khan mailtodanish detoxmango furawu vintello vladimirsosnitskiy rajdeep1311 greenyng gpuligundla timcrob anshikjain18 blncmusa c0d33ngr

rottingresearch's Issues

Streamline Celery Process to Easily allow for multiple workers.

Create Docker Image

Creating a Docker Image would allow us to test and deploy this app much faster. I'd prefer to use Docker over other containerization solutions, but am open to ideas.

Flask sessions and cookies for multiple users

Enable proper usage for multiple users using flask sessions and cookies.
Currently, 2 different users analysing the same pdf and downloading references raises an error because files already exist in the downloads folder.
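
One possible way to avoid the file clash described above is to give every session its own download folder. The sketch below is only an illustration under stated assumptions: the helper name session_download_dir and the "download_id" key are hypothetical, and a plain dict stands in for flask.session so the example runs without Flask.

```python
import tempfile
import uuid
from pathlib import Path


def session_download_dir(session, base_dir):
    """Return a download directory unique to this session, creating it if needed.

    `session` is any dict-like object (in the app this would be flask.session).
    The "download_id" key is a hypothetical name chosen for this sketch.
    """
    if "download_id" not in session:
        # A random hex ID keeps two users' files apart even for the same PDF.
        session["download_id"] = uuid.uuid4().hex
    path = Path(base_dir) / session["download_id"]
    path.mkdir(parents=True, exist_ok=True)
    return path


if __name__ == "__main__":
    base = tempfile.mkdtemp()
    first_user, second_user = {}, {}
    print(session_download_dir(first_user, base))
    print(session_download_dir(second_user, base))
```

Because the ID is stored in the session, repeated requests from the same user resolve to the same folder, while a second user analysing the same PDF writes to a different one.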

Setup BrowserStack Testing

I will set up a GitHub action to deploy the testing app in browserstack.

Add Link Archiving

I'd like to add a feature that takes all links that are verified to be active and adds them to the Internet Archive Wayback Machine to preserve them in time. Ideally, this would be added to the parent project, Linkrot.

The basic concept is that if you navigate to https://web.archive.org/save/{url} the service automatically archives that page. So after verifying that it returns a valid code, we would just connect to all of those sites, and it would create a snapshot. I'd love for this to be an option on the results page. So after all the links are checked, you have the option to archive the valid ones only. This way it is optional, and we don't take more resources than we need.

Another option would be to use ArchiveNow. This repository isn't updated regularly, so I'd prefer using Linkrot, but if this is the easiest way to achieve the desired outcome, I am OK with that.

Anyone able to complete this task, please take a stab at it.
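
The save-URL idea described in this issue could be sketched roughly as follows. This is a minimal sketch, not the project's implementation: the names SAVE_ENDPOINT, archive_requests, and submit are hypothetical, and treating only HTTP 200 as a "valid code" is an assumption.

```python
import urllib.request

# Prefix taken from the issue text: https://web.archive.org/save/{url}
SAVE_ENDPOINT = "https://web.archive.org/save/"


def archive_requests(link_statuses):
    """Build Wayback Machine save URLs for links that returned HTTP 200.

    `link_statuses` maps each checked URL to its status code (or None when
    the check failed). Only status 200 counts as valid in this sketch.
    """
    return [
        SAVE_ENDPOINT + url
        for url, status in link_statuses.items()
        if status == 200
    ]


def submit(save_url, timeout=30):
    """Request one save URL and return the HTTP status of the response."""
    with urllib.request.urlopen(save_url, timeout=timeout) as response:
        return response.status


if __name__ == "__main__":
    checked = {"https://example.org/a": 200, "https://example.org/b": 404}
    for save_url in archive_requests(checked):
        print(save_url)  # call submit(save_url) to actually request a snapshot
```

Keeping the URL building separate from the network call makes it easy to run the archiving step only when the user opts in on the results page.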

Add Redis

A lot of new features are being added, like link archiving and multiple file upload. It has become apparent that we will need to scale and Redis seems like the logical next step. Open to other suggestions.

Fix code scanning alert - Flask app is run in debug mode

Tracking issue for:

https://github.com/marshalmiller/rottingresearch/security/code-scanning/1
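
One common way to address this kind of alert is to enable debug mode only in development. The sketch below assumes the ENV environment variable from the installation instructions is the switch. The helper name debug_enabled is hypothetical, and whether app.py reads ENV this way is an assumption.

```python
import os


def debug_enabled(environ=None):
    """Return True only when ENV is set to DEV (case-insensitive)."""
    # Accept an explicit mapping so the check can be tested without
    # touching the real process environment.
    environ = os.environ if environ is None else environ
    return environ.get("ENV", "").strip().upper() == "DEV"


if __name__ == "__main__":
    print(debug_enabled({"ENV": "DEV"}))
    print(debug_enabled({"ENV": "PROD"}))
```

In a Flask app the result could then be passed along as app.run(debug=debug_enabled()), so a production deployment without ENV set never starts in debug mode.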

Revise DOI and ARXIV sanitation URLS

Right now, the sanitization pulls any URL that contains doi.org or arxiv.org. The concern is that this gives incorrect results for Internet Archive URLs, which contain the archived URL inside them. I will address this ASAP.
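
A hostname comparison, rather than a substring match, is one way to avoid this. The sketch below uses only the standard library. The function names host_matches and classify are hypothetical and do not come from the project's code.

```python
from urllib.parse import urlparse


def host_matches(url, domain):
    """Return True if the URL's host is `domain` or one of its subdomains."""
    host = (urlparse(url).hostname or "").lower()
    return host == domain or host.endswith("." + domain)


def classify(url):
    """Label a reference as 'doi', 'arxiv', or plain 'url' by its hostname."""
    if host_matches(url, "doi.org"):
        return "doi"
    if host_matches(url, "arxiv.org"):
        return "arxiv"
    return "url"


if __name__ == "__main__":
    for link in (
        "https://doi.org/10.1000/182",
        "https://arxiv.org/abs/2206.00785",
        "https://web.archive.org/web/2022/https://doi.org/10.1000/182",
    ):
        print(classify(link), link)
```

Because only the parsed hostname is compared, an archived copy whose path happens to contain doi.org or arxiv.org is treated as an ordinary URL.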

Project Banner

I need a Project Banner so that it shows up when you share this repo. Will eventually use it on pip and App as well.

Deployment Issue

I'm trying to set up continuous integration for the permanent home of the app, but I seem to be having issues with the secret. I'm sure it's me not having a full understanding of how it works in Flask. I'm trying to use the GitHub secrets function. I made one called Heroku_Secret and put it in the code, but the app still does not appear to recognize that it has a value. @aditirao7 Any ideas? You've deployed it on your test site. What am I missing?

Create Rotting Research Branding

I am looking for some sort of branding for the app so that it is attractive to users.

AttributeError: 'NoneType' object has no attribute 'findall'

Sometimes the App crashes with this error. It is not common. It appears to be triggered by the PDF itself.

Add comments to all files

Make Download Report a PDF and Visually Appealing

Right now, the download report saves an HTML version of the page with unformatted text and looks pretty bad. It would be great to be able to have this report look as good as the site does.

Few links have incorrect status

Describe the bug
A few of the links show status N/A although their redirect URLs lead to 200 status pages.

To Reproduce
Steps to reproduce the behavior:

Go to rottingresearch.org
Upload 2206.00785.pdf
Number of arxiv and DOI references shows 0

Expected behavior
Should show 200 status for the last 2 arxiv links.

Desktop (please complete the following information):

OS: Linux (Ubuntu 20.04)
Browser: Chrome
Version: 117.0.5938.92

Refactor Code & Adopt Standard Code Linter

The HTML code stinks, mainly upload.html, because there is no code consistency and the JavaScript is all over the place. I believe refactoring it would improve readability and code hygiene.

Solution: Standard Code Lint configuration (Prettier)

Inconsistent Indentation & Closing of Tags
https://github.com/marshalmiller/rottingresearch/blob/cf3181de677964a656c41debaf366599f2a7c137/templates/upload.html#L86-L126

JavaScript Optimization - JS code can be moved to the bottom (order of priority)
https://github.com/marshalmiller/rottingresearch/blob/cf3181de677964a656c41debaf366599f2a7c137/templates/upload.html#L29-L50

Security Policy and Reporting

Update the Security.MD file to include a full policy.

Add Option to Download Results

Instead of having them download the file itself, maybe they could download the report?

Create App Logo

I am looking for a distinct logo to use for this app. Something that shows the theme of the app and is attractive.

Create Functional Tests

Create tests to test the overall functionality of the app.

Incorrectly parsing mailto URI as http://mailto:[email protected]

Describe the bug
While testing the application I noticed that the application is parsing mailto URIs incorrectly. The link is supposed to be mailto://[email protected] but is being shown as http://mailto:[email protected].

PDF: Machine learning and the physical sciences

The issue exists because the sort_ref(ref) function in app.py appends all references of type URL to the URLs array. A simple check to see whether the URI is a mailto URI can fix this.

Code: app.py/sort_refs
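
The check suggested above might look something like the following. This is a standalone sketch, not the actual sort_refs code: the function name split_mailto and its list-of-strings input are assumptions made for illustration.

```python
def split_mailto(refs):
    """Separate mailto references from web URLs.

    Returns a (urls, emails) pair. The mailto test runs before any scheme
    such as http:// is prepended, so addresses never end up among the URLs.
    """
    urls, emails = [], []
    for ref in refs:
        cleaned = ref.strip()
        if cleaned.lower().startswith("mailto:"):
            # Drop the scheme (and any stray slashes) to keep the address only.
            emails.append(cleaned[len("mailto:"):].lstrip("/"))
        else:
            urls.append(cleaned)
    return urls, emails


if __name__ == "__main__":
    urls, emails = split_mailto(
        ["https://example.org/paper", "mailto:someone@example.org"]
    )
    print(urls)
    print(emails)
```

The separated addresses could then be dropped entirely or listed in their own section, matching either of the expected behaviours described below.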

To Reproduce
Steps to reproduce the behavior:

Start the Application
Make sure the APP_SECRET_KEY environment variable is set
Upload the Machine learning and the physical sciences PDF
See URL References Section on the right

Expected behavior
Either the mailto URI should not be shown at all OR be moved under Linkrot Summary

Screenshots
Screenshot of the mailto URI being parsed as HTTP URI

Possibly move mailing references to its own section under the Linkrot Summary section

Desktop (please complete the following information):

OS: Ubuntu 20.04 LTS x64
Browser: chromium
Version: 101.0.4951.64 (Official Build) snap (64-bit)

Create Database to store results

I think it would be extremely helpful if we were to store the results from each analysis in a database. I do not want to store user data, only things like the date written (pulled from metadata) and the data submitted to us: overall links, DOI links, error codes, and such. This would allow us to keep long-term stats for further research, and even to track how the same documents change over time. We could then create a dashboard to share those results.
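
As a rough illustration of what such a store could hold, here is a small sketch using Python's built-in sqlite3 module. The table layout, column names, and helper functions are all hypothetical. The issue does not choose a database or schema, and no user data is stored.

```python
import sqlite3

# Hypothetical schema: one row of aggregate counts per analysed document.
SCHEMA = """
CREATE TABLE IF NOT EXISTS analyses (
    id INTEGER PRIMARY KEY,
    submitted_at TEXT NOT NULL,
    document_date TEXT,
    total_links INTEGER NOT NULL,
    doi_links INTEGER NOT NULL,
    broken_links INTEGER NOT NULL
)
"""


def open_store(path=":memory:"):
    """Open (or create) the results database and make sure the table exists."""
    conn = sqlite3.connect(path)
    conn.execute(SCHEMA)
    return conn


def record_analysis(conn, submitted_at, document_date,
                    total_links, doi_links, broken_links):
    """Insert one analysis summary and return its row id."""
    cur = conn.execute(
        "INSERT INTO analyses (submitted_at, document_date, total_links,"
        " doi_links, broken_links) VALUES (?, ?, ?, ?, ?)",
        (submitted_at, document_date, total_links, doi_links, broken_links),
    )
    conn.commit()
    return cur.lastrowid


def broken_ratio(conn):
    """Return the share of broken links across all stored analyses."""
    total, broken = conn.execute(
        "SELECT COALESCE(SUM(total_links), 0), COALESCE(SUM(broken_links), 0)"
        " FROM analyses"
    ).fetchone()
    return broken / total if total else 0.0


if __name__ == "__main__":
    store = open_store()
    record_analysis(store, "2023-10-01", "2022-06-01", 10, 2, 3)
    print(broken_ratio(store))
```

Aggregates like broken_ratio are the kind of long-term statistic a dashboard could display without exposing anything about individual uploaders.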

Update ReadME File

The ReadMe file is still from the linkrot project.

Getting 500 errors on Test Site

Hey @aditirao7 I'm getting these errors when trying to test out the test site. Both when I use a URL or upload a PDF. Is there something I am doing wrong?

Update CSS to add a few pixels of space between buttons

On the analysis page, the 2 download buttons have no space between them which can be noticed on hovering over them. I think adding some space in between would make the page look much better. This could be a good first issue to pick up.

Number of arxiv and DOI references not updating

Describe the bug
The numbers of arXiv and DOI references are not reflected at the top of the report.

To Reproduce
Steps to reproduce the behavior:

Go to rottingresearch.org
Upload 2206.00785.pdf
Number of arxiv and DOI references shows 0

Expected behavior
Should show 3 and 2 respectively.

Desktop (please complete the following information):

OS: Linux (Ubuntu 20.04)
Browser: Chrome
Version: 117.0.5938.92

Upload Multiple Files at Once

A great feature would be if we could upload more than one file at a time. I know this is a major step, but I believe that this would be of great use to many.

Timeout Issues

Sometimes, with larger files, the process can time out and result in a 500 server error.

Remove Paste a Link Feature

We should remove the ability to paste a link and limit it to just uploading a file. The reason is copyright. Publishers tend to be highly litigious. With the link feature, the website downloads a copy and analyzes it, whereas when someone uploads a file, they are bound to the user agreement and release the site from any responsibility. I know this is pretty cynical, but I think it's an unfortunate reality. If someone wants to run it on remote files, they can use the Python app, Linkrot.

Redesign Analysis Page

Redesign the Analysis report page.

Make better-looking javascript alerts

Alerts are generated when the user submits without uploading a file/url or when the file/url is not a pdf.

The alerts are generated using the standard window.alert() and don't look that great. Improve alert design.

Fix code scanning alert - Incomplete URL substring sanitization

Tracking issue for:

https://github.com/marshalmiller/rottingresearch/security/code-scanning/4

Add loading animation when download button is pressed

Create a loading animation while files are being generated for download.

Agreement Checkbox

We should add some sort of agreement checkbox. We could add a tooltip stating that the uploader agrees that they have the rights to upload the document, just to cover certain liabilities.

Add Summary to Analysis

Can we add the summary to the report that it generates? The original script generates something like this:

Fix code scanning alert - Incomplete URL substring sanitization

Tracking issue for:

https://github.com/marshalmiller/rottingresearch/security/code-scanning/5

Add Checkback and Internet Archive Functionality

The linkrot project (https://github.com/rottingresearch/linkrot), which is the foundation of this project, has added the ability to add valid URLs to the Internet Archive using the -a tag. I would like to add this ability to this project by adding a check box of some sort to enable the archiving of valid links found in the PDFs uploaded.

Add DOI and ArXiv articles to results summary.

It would be amazing to add number of DOI links and ArXiv references to the link summary page.

Merge Celery Container with Main Container

Right now, the app can spin up three containers—one for the app, one for the celery workers, and one for Redis. Since you can always manually add Redis with the environments and official image, it seems strange to maintain one ourselves. So now we have two. We can't do a docker pull because it will never have the celery image. If possible, I would like to find a way to incorporate the celery worker into the main image. I see many advantages to this, which I can expand on if you'd like to.

(Update) CodeQL Github action to run only when change in Python

I think we should update the CodeQL/Analyze (python) action to run only when Python code has changed. Right now, the CodeQL analysis for Python still runs whenever any non-.py file is changed.

We can also remove the Autobuild section from the action as there is no code compilation being done.
https://github.com/marshalmiller/rottingresearch/blob/cf3181de677964a656c41debaf366599f2a7c137/.github/workflows/codeql-analysis.yml#L56-L57

Incomplete Analysis

Describe the bug
The status codes for many of the links do not get updated on the analysis page. Only noticed this bug on my mobile, desktop seems fine. It looks like the link checking is aborted halfway for some reason.

To Reproduce
Steps to reproduce the behavior:

Go to https://rottingresearch.org/
Check link https://arxiv.org/pdf/2206.00785.pdf
Many links are N/A status code even if they work

Expected behavior
All links should have a valid status code.

Smartphone (please complete the following information):

Device: Pixel 3a
OS: Android
Browser: Chrome
Version: 102.0.5005.125

Unnecessary scroll on landing page (desktop view)

The landing page is overflowing and causing unnecessary horizontal and vertical scroll on desktop view. Screenshot attached below.

Add setup instructions for contributors

Maybe contributing guidelines can be added with setup instructions on how to run the website on localhost for those not familiar with flask.

Add loading page while report is being generated

Create a loading page while the report is being generated instead of staying on the landing page.

Resources to get started:
Flask loading page

Create Electron App

I was looking at several Electron Wrappers and thought it would be cool to generate an Electron app for this project. It would work the same as the web version but would provide packages for Linux, Windows, and Mac.

I thought we could implement something like this project: https://github.com/samuelmeuli/action-electron-builder

Or if we could convert this project to a GitHub action for deployment: https://github.com/nativefier/nativefier

Create Simulated Tests with Playwright

In order to make sure every build works correctly and connects with both the celery and Redis servers, we need to create a test that spins up a live server and tests the functions using simulation. In my research, it appears that Playwright would be the best solution. It works with Pytest and Flask to accomplish this. Link below:

https://playwright.dev/python/