hengqujushi / pulsar

This project forked from platonai/pulsarrpa

Turn large Web sites into tables and charts using simple SQLs.

License: Apache License 2.0

Shell 0.61% Java 78.19% Kotlin 4.70% JavaScript 0.28% HTML 16.10% PLSQL 0.09% CSS 0.04%

Pulsar README

Pulsar is a full-featured Web crawler and a Web mining framework.

Features

  • Web SQL: do all Web mining jobs using SQL
  • BI integration: turn Web sites into tables and charts with a single SQL query
  • Ajax support: access the Web automatically and behave like a human visitor
  • Web site monitoring: monitor news sites and e-commerce sites out of the box
  • Highly extensible and scalable: runs on Hadoop, Spark, and other big data infrastructure
  • Various database support: store data in your favourite database, such as MongoDB or HBase

Web SQL

Turn a Web page into a table:

SELECT
    DOM_TEXT(DOM) AS TITLE,
    DOM_ABS_HREF(DOM) AS LINK
FROM
    LOAD_AND_SELECT('https://en.wikipedia.org/wiki/Topology', '.references a.external');

The SQL above downloads a Web page from Wikipedia, finds the references section, and extracts all external reference links.

Check sql-history.sql to see more example SQLs. All SQL functions can be found under fun.platonic.pulsar.ql.h2.udfs.

BI Integration

Use the customized BI tool Metabase to write Web SQLs and turn Web sites into tables and charts immediately. Everyone in your company can now ask questions of Web data and learn from it.

Build & Run

Build from source

git clone git@github.com:platonai/pulsar.git
cd pulsar && mvn -Pthird -Pplugins

Install dependencies

bin/tools/install-depends.sh

Install mongodb

You can skip this step; in that case, all data will be lost when Pulsar shuts down. On Ubuntu/Debian:

sudo apt-get install mongodb
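
Because skipping MongoDB means data does not survive a shutdown, it can help to check for it before starting the server. The following is a minimal sketch, not part of Pulsar: the function name `storage_mode` and the messages are hypothetical, and it only checks that a `mongod` binary is on the PATH, not that the service is running or that Pulsar is configured to use it.

```shell
#!/bin/sh
# Report whether a MongoDB server binary can be found on PATH.
# "mongod" is the standard MongoDB daemon name; storage_mode is an
# illustrative helper, not a Pulsar command.
storage_mode() {
    if command -v "${1:-mongod}" >/dev/null 2>&1; then
        echo "persistent"
    else
        echo "volatile"
    fi
}

case "$(storage_mode)" in
    persistent) echo "mongod found: data can be kept in MongoDB" ;;
    volatile)   echo "mongod not found: data will be lost when Pulsar shuts down" ;;
esac
```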

Start the pulsar server

bin/pulsar

Use Web console

The Web console at http://localhost:8082 should now be open in your browser. Enjoy playing with Web SQL.

Execute a single Web SQL

bin/pulsar sql -sql "SELECT DOM_TEXT(DOM) AS TITLE, DOM_ABS_HREF(DOM) AS LINK FROM LOAD_AND_SELECT('https://en.wikipedia.org/wiki/Topology', '.references a.external')"
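
Longer queries are awkward to quote on one line. As a sketch of one way around that, the script below keeps the same query in a quoted here-document, collapses it to a single line, and passes it to the `-sql` option shown above. The variable name `QUERY` and the guard around `bin/pulsar` are illustrative choices, not something the project prescribes.

```shell
#!/bin/sh
# A quoted here-document ('SQL') keeps the query's single quotes and
# parentheses literal, so nothing needs escaping.
# tr -s collapses every run of whitespace (including newlines) into one
# space, giving the same single-line form as the command above.
QUERY=$(tr -s '[:space:]' ' ' <<'SQL'
SELECT
    DOM_TEXT(DOM) AS TITLE,
    DOM_ABS_HREF(DOM) AS LINK
FROM
    LOAD_AND_SELECT('https://en.wikipedia.org/wiki/Topology', '.references a.external')
SQL
)

# Run the query only when the script is started from the Pulsar directory.
if [ -x bin/pulsar ]; then
    bin/pulsar sql -sql "$QUERY"
else
    echo "bin/pulsar not found; run this from the Pulsar directory" >&2
fi
```

Keeping the SQL in a here-document also makes it easy to paste in other statements from sql-history.sql without reworking the quoting.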

Use GUI-free console

bin/pulsar sql

Use advanced BI tool

Download Metabase Web SQL edition, and run:

-- coming soon ..
java -jar metabase.jar

Large scale Web spider

To crawl the open Web from seeds and index the text content with Solr, run the script:

-- coming soon ..
bin/crawl.sh default false awesome_crawl_task http://master:8983/solr/awesome_crawl_task/ 1

Enterprise Edition:

Pulsar Enterprise Edition comes with lots of exciting features:

Advanced AI to do Web content mining:

1. Extract data from Web pages at large scale with above-human-level accuracy using advanced AI
2. Learn and generate SQLs for sites

Full featured Web SQL:

1. Any source, any format, any volume, ETL the data and turn it into a table by just one simple SQL
2. Monitor a Web site and turn it into a table by just one simple SQL
3. Integrated algorithms for Web extraction, data mining, NLP, knowledge graphs, machine learning, etc.
4. Do business intelligence on unstructured data

Enterprise Edition will be open sourced step by step.

Coming soon ...

Cloud Edition:

Write your own Web SQLs to create data products anywhere, anytime, to share, or for sale

Coming soon ...

pulsar's People

Contributors

galaxyeye, insidegalaxyeye
