augustwester / searchthearxiv


The code powering searchthearxiv.com, a simple semantic search engine for more than 300,000 ML papers on arXiv.

Home Page: https://searchthearxiv.com

License: GNU General Public License v3.0

Dockerfile 1.41% Python 36.41% JavaScript 20.93% CSS 24.75% HTML 14.75% Shell 1.75%
arxiv machine-learning semantic-search web-app word-embeddings deep-learning scientific-papers search-engine

searchthearxiv's Introduction

searchthearXiv

This repo contains the implementation of searchthearxiv.com, a simple semantic search engine for more than 300,000 ML papers on arXiv (and counting). The code is separated into two parts, app and data. app contains the implementation of both the frontend and backend of the web app, while data is responsible for updating the database at regular intervals using OpenAI and Pinecone. Both app and data contain a Dockerfile for easy deployment to cloud platforms. I don't expect (or encourage) anyone to run a clone of the project on their own (that would be weird), but it might serve as inspiration for people building a similar type of semantic search engine.

To run the code, you need to supply the following environment variables:

KAGGLE_USERNAME=your_kaggle_username
KAGGLE_API_KEY=your_kaggle_api_key
OPENAI_API_KEY=your_openai_api_key
PINECONE_API_KEY=your_pinecone_api_key
PINECONE_INDEX_NAME=your_pinecone_index_name

The Kaggle username and API key are required to fetch the arXiv metadataset, maintained (and updated weekly) by Cornell University. The OpenAI API key is used to embed new papers using the text-embedding-ada-002 model. The Pinecone API key and index name are used to connect to the index (i.e. vector database) hosted on Pinecone.
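For reference, here is a minimal sketch of how these pieces might fit together, assuming the v1 openai and current pinecone Python clients; the actual code in data may structure this differently, and the metadata fields below are assumptions:

import os
from openai import OpenAI
from pinecone import Pinecone

# Clients are configured from the environment variables listed above.
openai_client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])
pc = Pinecone(api_key=os.environ["PINECONE_API_KEY"])
index = pc.Index(os.environ["PINECONE_INDEX_NAME"])

def embed_and_upsert(paper_id: str, title: str, abstract: str) -> None:
    """Embed a paper (title + abstract assumed here) and store the vector in Pinecone."""
    text = f"{title}\n\n{abstract}"
    response = openai_client.embeddings.create(
        model="text-embedding-ada-002", input=text
    )
    vector = response.data[0].embedding  # 1536-dimensional for ada-002
    index.upsert(vectors=[{
        "id": paper_id,
        "values": vector,
        "metadata": {"title": title, "abstract": abstract},
    }])

def search(query: str, top_k: int = 10):
    """Embed a query the same way and match it against the index."""
    response = openai_client.embeddings.create(
        model="text-embedding-ada-002", input=query
    )
    return index.query(
        vector=response.data[0].embedding, top_k=top_k, include_metadata=True
    )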

If you are only interested in the embeddings, I have published the full dataset on Kaggle. The current size is around 10GB but grows slightly every week as new papers are added.

If, for some reason, you still want to embed the papers on your own, you can run embed.py in data after downloading the metadataset from Kaggle, setting the environment variables, and creating a Pinecone index. If you don't want to use Pinecone, you are free to modify the code however you want. Since the index will initially be empty, the script will embed all ML papers (again, more than 300,000). However, before doing so, it will estimate a price using OpenAI's tiktoken tokenizer and ask you to confirm. You can skip this step by running python3 embed.py --no-confirmation.
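As a rough illustration of how such an estimate can be computed with tiktoken (the price constant below is an assumption; check OpenAI's current pricing):

import tiktoken

# Assumed price for text-embedding-ada-002; check OpenAI's pricing page.
PRICE_PER_1K_TOKENS = 0.0001  # USD

def estimate_cost(texts: list[str]) -> float:
    """Estimate the cost of embedding a list of texts with ada-002."""
    enc = tiktoken.encoding_for_model("text-embedding-ada-002")
    num_tokens = sum(len(enc.encode(text)) for text in texts)
    return num_tokens / 1000 * PRICE_PER_1K_TOKENS

# Ask for confirmation before spending money, as embed.py does.
abstracts = ["A sample abstract...", "Another sample abstract..."]
cost = estimate_cost(abstracts)
if input(f"Estimated cost: ${cost:.4f}. Continue? [y/N] ").lower() == "y":
    print("Embedding...")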

If you like searchthearxiv.com and would like to see something improved, feel free to submit a pull request 🤗

searchthearxiv's People

Contributors

augustwester, dependabot[bot]


searchthearxiv's Issues

Option for SPECTER2 embeddings?

Hi,

I recently started using Semantic Scholar's SPECTER2 model to create visualisations of BibTeX files. The model is specialised for scientific text, so it seems a good fit for the proximity search that searchthearxiv.com offers. While there are gaps in which papers have an associated embedding in their database, its scope extends beyond arXiv.

I was thinking of creating a "What was that paper again?" service that would

  • Take in a description from the user, one to ten sentences long
  • Embed this description using SPECTER2
  • Do a proximity search
  • Return matching candidates

In summary, the advantages would be

  • broader scope beyond ArXiv
  • potentially longer queries
  • a more accurate backend model

It would be exciting to see this functionality integrated into SearchTheArXiv; I would be very willing to build a prototype!
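For concreteness, here is a rough sketch of the embedding step being proposed, assuming the adapters package (the successor to adapter-transformers) and the usage documented on the allenai/specter2_base model card; function and variable names are illustrative:

import torch
from transformers import AutoTokenizer
from adapters import AutoAdapterModel

# Load the SPECTER2 base model and activate the proximity adapter on top of it.
tokenizer = AutoTokenizer.from_pretrained("allenai/specter2_base")
model = AutoAdapterModel.from_pretrained("allenai/specter2_base")
model.load_adapter("allenai/specter2", source="hf",
                   load_as="proximity", set_active=True)

def embed_description(description: str) -> torch.Tensor:
    """Embed a free-text "What was that paper again?" description with SPECTER2."""
    inputs = tokenizer(description, return_tensors="pt",
                       truncation=True, max_length=512)
    with torch.no_grad():
        output = model(**inputs)
    # The [CLS] token embedding serves as the document/query representation.
    return output.last_hidden_state[:, 0, :]

query = "A paper that trains a diffusion model to generate protein structures."
vector = embed_description(query)  # shape: (1, 768)

(The model card also lists a separate adhoc-query adapter, allenai/specter2_adhoc_query, which may be better suited to short free-text queries than the proximity adapter used above.)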

Server down?

The https://searchthearxiv.com/search?query= endpoint currently seems to be returning a 500 Internal Server Error.
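For reference, a minimal way to reproduce the check (the query parameter name is taken from the URL above; the sample query is arbitrary):

import requests

# Hit the search endpoint with a sample query and report the status code.
resp = requests.get(
    "https://searchthearxiv.com/search",
    params={"query": "diffusion models"},
    timeout=10,
)
print(resp.status_code)  # 500 at the time of the report; 200 when healthy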
