PAP search

Description

Scrapes information about public administration publications and stores it in an Elasticsearch instance. Currently supports publications from the Diario Oficial de Galicia (DOGA).

Setup

Create a Virtual Environment

python -m venv papenv # On Mac/Linux use python3

Activate your Virtual Environment

papenv\Scripts\activate # On Windows
source papenv/bin/activate # On Mac/Linux

Install project dependencies

pip install -r requirements.txt

Generate the list of initial pages (seed URLs) that configures the crawler. The following script generates the pages for the current year:

python define_start_urls.py # On Mac/Linux use python3

It stores a list of URLs in "data/start_urls.json" that point to the DOGA documents of the current year.
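As a rough illustration of what such a seed file might look like, the sketch below builds one URL per calendar day of a given year and writes the result as a JSON list. The URL template, the function names and the one-URL-per-day layout are all assumptions made for illustration; the real define_start_urls.py and the real DOGA URL scheme may differ.

```python
import json
from datetime import date, timedelta
from pathlib import Path

# Hypothetical placeholder template; NOT the real DOGA URL scheme.
URL_TEMPLATE = "https://example.org/doga/{year}/{day:%Y%m%d}/index.html"


def build_start_urls(year, template=URL_TEMPLATE):
    """Return one URL per calendar day of `year` (assumed layout)."""
    day = date(year, 1, 1)
    urls = []
    while day.year == year:
        urls.append(template.format(year=year, day=day))
        day += timedelta(days=1)
    return urls


def write_start_urls(urls, path):
    """Write the URL list as JSON, creating the parent folder if needed."""
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(urls, indent=2), encoding="utf-8")
```

Calling write_start_urls(build_start_urls(date.today().year), "data/start_urls.json") would produce a file in the location mentioned above.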

Crawl

To run the crawler, execute the following command:

scrapy crawl doga_spider

It crawls the seed URLs from "data/DOGA_start_urls.json". When it finishes, the file "data/TMP_output.json" contains a dictionary of the scraped elements. You have to rename this file manually to "data/DOGA_output.json".
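The manual rename can also be scripted. The following is a minimal sketch, not part of the project: the function name promote_output is hypothetical, and only the two file names come from the text above.

```python
from pathlib import Path


def promote_output(data_dir="data",
                   tmp_name="TMP_output.json",
                   final_name="DOGA_output.json"):
    """Rename the crawler's temporary output and return the new path."""
    src = Path(data_dir) / tmp_name
    dst = Path(data_dir) / final_name
    if not src.exists():
        raise FileNotFoundError(f"{src} not found; run the crawler first")
    # Path.replace overwrites an existing destination, so an older
    # DOGA_output.json is replaced by the fresh crawl.
    src.replace(dst)
    return dst
```

Overwriting is a deliberate choice here; if earlier crawl results should be kept, check for an existing destination file before calling it.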

Store data

The options to deploy a development setup are:

  1. Run an Elasticsearch container
docker run -d --name elasticsearch -p 9200:9200 -p 9300:9300 -e "discovery.type=single-node" docker.elastic.co/elasticsearch/elasticsearch:7.10.0
  2. Run an Elasticsearch instance

    Download Elasticsearch

    Elasticsearch 8 uses HTTPS by default. This can be changed by editing the configuration file config/elasticsearch.yml and adding the following directives at the bottom:

xpack.security.enabled: false
xpack.security.transport.ssl.enabled: false
xpack.security.http.ssl.enabled: false

To store the scraped documents in Elasticsearch, run the command:

python bulk_post_documents.py # On Mac/Linux use python3
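For orientation, the Elasticsearch bulk endpoint (POST /_bulk with Content-Type application/x-ndjson) expects newline-delimited JSON: an action line followed by the document itself, with a trailing newline at the end of the body. The sketch below only builds that request body. It is an assumption about the general approach, not the contents of bulk_post_documents.py, and the helper name and the index name "doga" are hypothetical.

```python
import json


def build_bulk_body(documents, index_name):
    """Build an NDJSON body for the Elasticsearch _bulk endpoint.

    `documents` is any iterable of dicts; for a dictionary of
    elements, pass its .values().
    """
    lines = []
    for doc in documents:
        # Action line: index the next document into `index_name`.
        lines.append(json.dumps({"index": {"_index": index_name}}))
        # Source line: the document itself, on a single line.
        lines.append(json.dumps(doc, ensure_ascii=False))
    if not lines:
        return ""
    # The bulk API requires the body to end with a newline.
    return "\n".join(lines) + "\n"
```

With the development setup described above, such a body would be sent to http://localhost:9200/_bulk.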

Run webapp

There is also a client that consumes the stored data. See the PAP Search Client repository for instructions on how to run it.

scrapy genspider boe_spider boe.es

Contributors

pablomarino
