fredriko / metacurate-regularly

Finding the top news stories of 2022 among 54,000+ news on AI, ML, NLP, data science and related fields.

clustering machine-learning ml natural-language-processing nlp sentence-embeddings visualization data-science hdbscan sentence-transformers

metacurate-regularly: clustering of news headlines.

TL;DR: This repository contains an experiment for embedding and clustering news headlines, as well as for describing the resulting clusters, and plotting them on a timeline.

The screenshot below shows the output of the clustering exercise: the top 50 news stories of 2022 on AI, machine learning, data science, and related fields, based on data collected by metacurate.io. Here is the live graph showing the top 50 news stories, and here is a list of the top 200 stories, including all constituent headlines.

Top 50 AI/ML/data science news 2022 according to metacurate.io

In 2022, my hobby project metacurate.io collected 54k+ news items from sources related to artificial intelligence, machine learning, natural language processing, data science, and other tech news. This repository contains code for experimenting with the clustering of headlines, and describing the clusters.
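
As a rough illustration of the embed-then-cluster idea, the sketch below uses the sentence-transformers and hdbscan packages named in the repository's topic tags. It is not the repository's actual code: the model name all-MiniLM-L6-v2, the min_cluster_size value, and the three function names are assumptions chosen for this example.

```python
from collections import defaultdict


def embed_headlines(headlines, model_name="all-MiniLM-L6-v2"):
    """Turn headlines into dense vectors (model name is an assumption)."""
    # Imported lazily so the rest of the module loads without the package.
    from sentence_transformers import SentenceTransformer

    model = SentenceTransformer(model_name)
    return model.encode(headlines, show_progress_bar=True)


def cluster_embeddings(embeddings, min_cluster_size=5):
    """Assign a cluster label to every vector; -1 marks noise."""
    import hdbscan

    # min_cluster_size=5 is an illustrative value, not taken from the repo.
    clusterer = hdbscan.HDBSCAN(min_cluster_size=min_cluster_size)
    return clusterer.fit_predict(embeddings)


def group_by_cluster(headlines, labels):
    """Collect headlines per cluster label, skipping HDBSCAN noise (-1)."""
    clusters = defaultdict(list)
    for headline, label in zip(headlines, labels):
        if label == -1:
            continue
        clusters[int(label)].append(headline)
    return dict(clusters)
```

Under these assumptions, calling group_by_cluster(headlines, cluster_embeddings(embed_headlines(headlines))) would yield a mapping from cluster id to the headlines in that cluster, which is the kind of grouping a later description step could work from.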

The input data is available in data/metacurate_news_2022.csv. Example output is available in data/output/2022_1/. The output folder contains:

Installation with virtualenv

Requirements:

  • git
  • Python 3.9 or newer (earlier versions might work, but have not been tested)
  • pip
  • virtualenv
  • An API key from Cohere
  • Optional: Plotly Chart Studio credentials

Set up and activate a virtual Python environment by executing the following commands at a terminal prompt:

mkdir -p ~/venv
virtualenv -p python3 ~/venv/metacurate-regularly/
source ~/venv/metacurate-regularly/bin/activate

Clone the source code to your local machine and install its dependencies:

git clone git@github.com:fredriko/metacurate-regularly.git
cd metacurate-regularly
pip install -r requirements.txt

Get and set up a Cohere API Key

To use Topically to describe the clusters, you need an API key from Cohere. Get one by following the instructions in the Topically repository. Take note of the key, and set the environment variable COHERE_API_KEY like so:

export COHERE_API_KEY=<your_key>
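
Since a missing key would otherwise only surface once the cluster description step starts, it can help to check for it up front. The following is a minimal sketch of such a check; the function name get_cohere_api_key and the error message are hypothetical and not part of the repository.

```python
import os


def get_cohere_api_key(env=None):
    """Return the Cohere API key, or fail early with a readable message.

    `env` defaults to the process environment; passing a dict makes the
    function easy to try out without touching real environment variables.
    """
    env = os.environ if env is None else env
    key = env.get("COHERE_API_KEY", "").strip()
    if not key:
        raise RuntimeError(
            "COHERE_API_KEY is not set; export it before running the code."
        )
    return key
```

Calling get_cohere_api_key() before the long-running steps begin would avoid spending time on embedding and clustering only to fail when the clusters are to be described.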

Optional: Get and set up Plotly Chart Studio credentials

To publish the generated Plotly plot to the web (Plotly Chart Studio), you need an account and locally stored credentials. Follow the instructions for getting an account here and edit the file src/set_up_plotly_credentials.py to include your username and api_key.

Run the file:

python src/set_up_plotly_credentials.py

to generate and store the credentials. This only has to be done once.

Run the code

To run the code, issue the following command:

python main.py -c configs/metacurate_news_2022_1.json

NOTE that this is a long-running process: the vectorization step will take a long time if you're running on a CPU, and the clustering takes quite some time too.
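
The -c flag in the command above points at a JSON configuration file. How main.py actually reads it is not shown here, so the following is only a sketch of one common way to handle such a flag with the standard library. The long option --config and the function names parse_args and load_config are assumptions, and no particular configuration keys are implied.

```python
import argparse
import json
from pathlib import Path


def parse_args(argv=None):
    """Parse the command line; `-c` mirrors the flag used above."""
    parser = argparse.ArgumentParser(description="Cluster news headlines.")
    parser.add_argument(
        "-c",
        "--config",  # hypothetical long form of the flag
        required=True,
        type=Path,
        help="Path to a JSON configuration file.",
    )
    return parser.parse_args(argv)


def load_config(path):
    """Read the JSON file at `path` and return its parsed content."""
    with open(path, encoding="utf-8") as handle:
        return json.load(handle)
```

With this shape, load_config(parse_args().config) returns the parsed configuration, and a missing -c argument is reported by argparse before any expensive work starts.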
