NewsBlaster

NewsBlaster is a system that helps users find the news of most interest to them. The system automatically collects, clusters, categorizes, and summarizes news from several sites on the web (CNN, Reuters, Fox News, etc.) on a daily basis.

This is the group space being used to improve and further develop Columbia's NewsBlaster system.


NewsBlaster at Columbia University Project

Installing

  1. Clone the repository

    git clone https://github.com/kedz/newsblaster.git

  2. Go to the NewsBlaster directory

    cd newsblaster/

  3. Execute the install script

    ./install.sh

This installs all dependencies and required files in your home directory by default. To override the location, set NB_HOME, for example:

    export NB_HOME=/tmp/newsblaster

Running

  1. Start NewsBlaster

    ./newsblaster.sh start

  2. Check for news articles. See Current Usage for details.

Crawls currently run on schedules ranging from every 30 minutes to every 3 hours, depending on the spider. As a result, no articles will be available for at least 30 minutes after startup. The intervals can be changed by editing the Celery schedule.

  3. Stop NewsBlaster

    ./newsblaster.sh stop
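
The Celery schedule mentioned above is, in general, a dictionary that maps entry names to a task and an interval. The following is only a sketch of what such a schedule can look like with the 30-minute and 3-hour intervals described here. The entry names and task paths (`tasks.crawl_fast_spiders`, `tasks.crawl_slow_spiders`) are hypothetical and are not taken from the NewsBlaster code.

```python
from datetime import timedelta

# Hypothetical entry and task names; the real ones live in the project's
# Celery configuration and may differ.
CRAWL_SCHEDULE = {
    "crawl-fast-spiders": {
        "task": "tasks.crawl_fast_spiders",
        "schedule": timedelta(minutes=30),
    },
    "crawl-slow-spiders": {
        "task": "tasks.crawl_slow_spiders",
        "schedule": timedelta(hours=3),
    },
}


def shortest_wait(schedule):
    """Return the smallest interval, i.e. the earliest a first crawl can fire."""
    return min(entry["schedule"] for entry in schedule.values())


# With a Celery app object, a dict like this is assigned to the beat schedule
# setting (`beat_schedule` in Celery 4+, `CELERYBEAT_SCHEDULE` in older releases).
print(shortest_wait(CRAWL_SCHEDULE))
```

Lowering the smallest interval shortens the wait before the first articles arrive, at the cost of crawling the source sites more often.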

Current Usage

Documentation will be updated as we continue to improve and build out NewsBlaster.

All summaries are currently stored in the MongoDB database that is automatically installed and configured when you deploy NewsBlaster.

For example, a multi-document summary looks like this:

> db.summaries.findOne()
{
	"_id" : ObjectId("573353f03168e60986fa1ac0"),
	"date" : ISODate("2016-05-11T11:46:56.488Z"),
	"summary_type" : "lexrank",
	"cluster_id" : ObjectId("573351a33168e60606eec00a"),
	"sentences" : [
		{
			"text" : "Brian Sandoval said Thursday he was \"incredibly grateful\" to be mentioned in the conversation over who President Obama would possibly select to replace Justice Antonin Scalia, but that he does \"not wish to be considered at this time\" for a spot on the U.S. Supreme Court.",
			"sentence_id" : 2,
			"article_id" : ObjectId("573347feb44be453490c5dde")
		},
		{
			"text" : "An intense political fight has erupted since the Feb. 13 death of long-serving conservative Justice Antonin Scalia, as Republicans maneuver to foil Obama's ability to choose a replacement who could tilt the court to the left for the first time in decades.",
			"sentence_id" : 2,
			"article_id" : ObjectId("57334808b44be453490c5e01")
		},
		{
			"text" : "The U.S. presidential election is set for Nov. 8 and Republicans want the next president to fill Scalia's vacancy, hoping a Republican will be elected.",
			"sentence_id" : 12,
			"article_id" : ObjectId("57334808b44be453490c5e01")
		},
		{
			"text" : "It's a duty that I take seriously, and one that I will fulfill in the weeks ahead,\" Obama, sounding undeterred by the Republican-led Senate's opposition, wrote in a blog post on the independent SCOTUSblog.com website.",
			"sentence_id" : 5,
			"article_id" : ObjectId("57334808b44be453490c5e01")
		},
		{
			"text" : "\"In the meantime, the American people are going to have the ability to gauge whether the person I've nominated is well within the mainstream, is a good jurist, is somebody who's worthy to sit on the Supreme Court,\" Obama told reporters in the Oval Office.",
			"sentence_id" : 30,
			"article_id" : ObjectId("573347feb44be453490c5dde")
		}
	]
}
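
Until an API is available, a summary document like the one above can be post-processed directly. The sketch below assumes the document has already been loaded into a Python dict (for instance through a MongoDB driver) and uses only the field names shown in the example. The helper names and the shortened sample data are illustrative.

```python
def summary_text(doc):
    """Join the summary sentences in stored order into one paragraph."""
    return " ".join(sentence["text"] for sentence in doc["sentences"])


def sentences_by_article(doc):
    """Map each source article to the sentence ids that were selected from it."""
    grouped = {}
    for sentence in doc["sentences"]:
        grouped.setdefault(sentence["article_id"], []).append(sentence["sentence_id"])
    return grouped


# Shortened stand-in for a document from the `summaries` collection;
# article ids are plain strings here instead of ObjectId values.
example = {
    "summary_type": "lexrank",
    "sentences": [
        {"text": "First sentence.", "sentence_id": 2, "article_id": "a1"},
        {"text": "Second sentence.", "sentence_id": 12, "article_id": "a2"},
        {"text": "Third sentence.", "sentence_id": 30, "article_id": "a1"},
    ],
}

print(summary_text(example))
print(sentences_by_article(example))
```

Grouping by `article_id` shows how many source articles contributed to a multi-document summary and which sentences came from each.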

API access for querying summaries and related metadata will be exposed at a later date.

Papers

Contributors

kedz, loswojos, rmzi, skillachie, yanrongwo

newsblaster's Issues

Remove ElasticSearch

Replace ElasticSearch with MongoDB, since MongoDB can serve as both the database and the Celery backend. The fewer dependencies, the better.

Flatten ES Schema

Flatten the ES schema according to previous discussion with Chris

Create Messaging Module

This module will be imported and reused by all spiders and components that require communicating with the broker.

Make light

- Remove all old dependencies
- Clean up the startup script and legacy dependencies for running on old CS machines

Add other RSS feeds for the People's Daily

http://english.people.com.cn/98373/98471/

Also, the authors are not being extracted for those pages because we currently have no code that looks for that particular div class (wb_6 clear):

<div class="wb_6 clear"><span><a href="mailto:[email protected]">Email</a>|<a href="#" onclick="doPrint()">Print</a>|<a href="#liuyan">Comments</a><img src="/img/2011english/images/icon19.gif" onclick="AXzhz('AX')" style="cursor:pointer;"></span><em>(Editor:王欣、张茜)</em></div>
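
One possible way to pick up the editor names from that div is sketched below using only the Python standard library. It is an illustration, not the project's extractor: the class and function names are made up, and the "Editor:" prefix with names separated by "、" is assumed from the single sample above.

```python
import re
from html.parser import HTMLParser


class EditorParser(HTMLParser):
    """Collect the text of <em> elements inside <div class="wb_6 clear">."""

    def __init__(self):
        super().__init__()
        self._div_depth = 0  # > 0 while inside the target div
        self._in_em = False
        self.raw = []

    def handle_starttag(self, tag, attrs):
        if tag == "div":
            classes = (dict(attrs).get("class") or "").split()
            # Enter the target div, or track a div nested inside it.
            if self._div_depth or {"wb_6", "clear"} <= set(classes):
                self._div_depth += 1
        elif tag == "em" and self._div_depth:
            self._in_em = True

    def handle_endtag(self, tag):
        if tag == "div" and self._div_depth:
            self._div_depth -= 1
        elif tag == "em":
            self._in_em = False

    def handle_data(self, data):
        if self._in_em:
            self.raw.append(data)


def extract_editors(html):
    """Return the editor names found in the byline div, or an empty list."""
    parser = EditorParser()
    parser.feed(html)
    parser.close()
    match = re.search(r"Editor\s*[:：]\s*([^)）]+)", "".join(parser.raw))
    if not match:
        return []
    return [name.strip() for name in re.split(r"[、,，]", match.group(1)) if name.strip()]


sample = (
    '<div class="wb_6 clear"><span><a href="#">Print</a></span>'
    "<em>(Editor:王欣、张茜)</em></div>"
)
print(extract_editors(sample))
```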

Automated Installation, Start & Stop Script

An install script that can deploy and run NewsBlaster from end to end:

  1. Initialize database
  2. Configure RabbitMQ
  3. Start workers
  4. Start spiders

A script to shut down all processes cleanly:

  1. Disable spiders
  2. Stop all running services associated with NewsBlaster

Replace sleep.

Replace sleep with a check of the RabbitMQ logs to identify when the process has fully started.
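
A minimal sketch of that idea follows: poll a log file until a marker line appears or a timeout expires. The function name, the log path, and the marker text are assumptions. RabbitMQ typically logs a line containing "Server startup complete" once it is ready, but the exact wording and log location should be checked against the installed version.

```python
import time


def wait_for_log_line(path, marker, timeout=30.0, interval=0.5):
    """Return True once `marker` appears in the file at `path`, False on timeout."""
    deadline = time.monotonic() + timeout
    while True:
        try:
            with open(path, "r", errors="replace") as handle:
                if any(marker in line for line in handle):
                    return True
        except FileNotFoundError:
            pass  # The log file may not have been created yet.
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)


# Hypothetical usage in a startup script (path and marker are assumptions):
# ready = wait_for_log_line("/path/to/rabbit.log", "Server startup complete")
```

Because the whole file is rescanned on each poll, a marker left over from an earlier run would also match. Rotating or truncating the log before startup, or remembering the file offset, avoids that false positive.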

Pipeline

Will be used to:

  1. Further extract and process raw HTML.
  2. Send processed content to a queue on the broker (RabbitMQ).

Create DB Worker

A worker that will be responsible for consuming messages from the broker and inserting or updating the items in the database.
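
The insert-or-update step of such a worker could look roughly like the sketch below. It makes several assumptions that are not in the issue: messages are JSON objects, each item is identified by a `url` field, and an in-memory dict stands in for the database. In a real worker this function would be called from the broker's message callback.

```python
import json


def upsert_article(store, message_body, key="url"):
    """Insert the item from a JSON message, or merge it into an existing one."""
    item = json.loads(message_body)
    item_key = item[key]
    created = item_key not in store
    merged = dict(store.get(item_key, {}))
    merged.update(item)  # Newer fields overwrite older ones.
    store[item_key] = merged
    return "inserted" if created else "updated"


store = {}  # Stand-in for the database collection.
print(upsert_article(store, '{"url": "http://example.com/a", "title": "Draft"}'))
print(upsert_article(store, '{"url": "http://example.com/a", "title": "Final"}'))
```

Keying on a stable identifier makes redelivered messages harmless, since processing the same message twice leaves the stored item unchanged.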
