
LacunaDB

This repository contains Singapore legal data obtained from various public sources and converted into a machine-readable format.

You can view and query the data using this Datasette instance.

The code and configuration files in this repository are licensed under the EUPL-1.2 as set out in the LICENCE file.

The data remains owned by its respective owners. This repository is not affiliated with the Singapore Academy of Law, Singapore Courts, Law Society, or any government agency, and is provided for convenience only.

Architecture

```mermaid
flowchart LR
  subgraph pipeline["Data Pipeline"]
    subgraph input_scripts["/input/ scripts"]
    Website-->data["/data/ (JSON files)"]
    end
    subgraph build_script["build_db.bb script"]
    data-->sqlite["SQLite DB (/data/data.db)"]
    end
  end
  subgraph backend["Backend"]
    Datasette
  end
  build_script-->backend
  subgraph frontend["Frontend"]
    html["HTML templates"]
    cljs["CLJS scripts"]
  end
  frontend-- served by -->backend
```

Frontend

See /app/README.md for frontend development.

Data pipeline

In the data pipeline, everything is just a script (aka a microservice™). Although most of the scripts are Babashka scripts written in Clojure, new scripts can be in any language.

The data is obtained periodically via scheduled GitHub Actions workflows and committed to this repository. Each GitHub Action runs one of the input scripts in the /input folder. Each input script stores the data it obtains in a JSON file in the /data folder. Each JSON file is just a snapshot in time, i.e. it contains only the data obtained in the last run of the respective script, not all data ever obtained by that script.
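As a rough illustration (not an actual script from /input; the URL, keys, and file name below are placeholders), an input script can be as small as a Babashka script that fetches a page and writes a JSON snapshot into /data:

```clojure
#!/usr/bin/env bb
;; Hypothetical input script: fetch one public source and overwrite its
;; snapshot in /data. The URL, map keys, and file name are placeholders.
(require '[babashka.curl :as curl]
         '[cheshire.core :as json])

(let [resp (curl/get "https://example.com/some-public-register")
      snapshot {:fetched-at (str (java.time.Instant/now))
                :body       (:body resp)}]
  (spit "data/example_source.json"
        (json/generate-string snapshot {:pretty true})))
```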

The /.github/workflows/deploy.yml workflow runs the /scripts/build_db.bb script, which uses the git-history tool to create a SQLite database from the historical data across all the commits in this repository. The script then builds a Datasette Docker image and deploys it via Fly.io.
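For orientation only (the real logic lives in /scripts/build_db.bb), the core of that step amounts to shelling out to git-history once per snapshot file, roughly as follows; the snapshot file name and the --id column are assumptions:

```clojure
;; Sketch: turn the commit history of one JSON snapshot into SQLite tables.
;; git-history is installed via Poetry; the id column name is illustrative.
(require '[babashka.process :refer [shell]])

(shell "poetry" "run" "git-history" "file"
       "data/data.db" "data/example_source.json"
       "--id" "id")
```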

Some of the scripts in the /scripts folder run Python tools. This project uses Poetry to manage its Python dependencies, so do install Poetry and the dependencies before running those scripts.

Setup

Make sure you have Babashka, Python, and Poetry installed.

Install the Poetry dependencies by running poetry install --no-root.

This project uses various CLI utilities, which you will need to install to run the input scripts:

pdftotext

pdftotext is used to extract text from PDFs. It is bundled within poppler.

On Ubuntu/Debian:

sudo apt install poppler-utils

On macOS, you can install it using Homebrew:

brew install poppler
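For reference, an input script can shell out to pdftotext roughly like this (a sketch only; the PDF path is a placeholder):

```clojure
;; Extract the text of a downloaded PDF to a string, preserving layout.
;; "-" tells pdftotext to write to stdout; the input path is illustrative.
(require '[babashka.process :refer [shell]])

(-> (shell {:out :string} "pdftotext" "-layout" "some-judgment.pdf" "-")
    :out
    println)
```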
ocrmypdf

ocrmypdf is used to run OCR on PDFs. It is a Poetry dependency already, but it does require tesseract and ghostscript to be installed.

On Ubuntu/Debian:

sudo apt install tesseract-ocr ghostscript

On macOS:

brew install tesseract ghostscript
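Similarly, scripts that need OCR can invoke ocrmypdf through Poetry, along these lines (a sketch; the file names are placeholders):

```clojure
;; Run OCR over a scanned PDF, skipping pages that already contain text,
;; and write the result to a new file. Paths are illustrative.
(require '[babashka.process :refer [shell]])

(shell "poetry" "run" "ocrmypdf" "--skip-text"
       "scanned-judgment.pdf" "scanned-judgment.ocr.pdf")
```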

Local development

After cloning this repository and following the setup steps above, you can generate the SQLite database on your machine by running the /scripts/build_db.bb script.

If you do not have SQLite installed, you will need to install it.

On Ubuntu/Debian:

sudo apt install sqlite3

On macOS:

brew install sqlite3

This may take some time (possibly >1h) as there have been many commits to this repository. The build_db.bb script also does some processing on the data, e.g. it creates and populates certain columns for ease of use based on the raw data (see e.g. /scripts/computed_columns.bb). Alternatively, you can download a copy of the database from law-archive-data.fly.dev.
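To give a flavour of what that processing looks like (a hypothetical sketch, not the contents of /scripts/computed_columns.bb; the table and column names are made up), a computed column can be added with a couple of SQL statements via the go-sqlite3 pod:

```clojure
;; Hypothetical computed column: derive a `year` column from a
;; `decision_date` column on a `judgments` table. All names are made up.
(require '[babashka.pods :as pods])
(pods/load-pod 'org.babashka/go-sqlite3 "0.1.0")
(require '[pod.babashka.go-sqlite3 :as sqlite])

(sqlite/execute! "data/data.db"
  ["ALTER TABLE judgments ADD COLUMN year TEXT"])
(sqlite/execute! "data/data.db"
  ["UPDATE judgments SET year = substr(decision_date, 1, 4)"])
```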

Backend

Once you have the SQLite database, you can analyse it by running Datasette locally using the /scripts/dev_docker.bb script:

cd lacunadb
bb scripts/dev_docker.bb

It may be helpful to refer to the Docker images or the GitHub Actions workflows for a better idea of how the project functions and how to run certain scripts.
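If you prefer to skip Docker, you can also point Datasette directly at the generated database. This is a sketch rather than a documented workflow of this project, and it assumes Datasette is installed separately (e.g. with pip or pipx):

```clojure
;; Serve the local database with Datasette on port 8001.
;; Assumes the `datasette` executable is on the PATH.
(require '[babashka.process :refer [shell]])

(shell "datasette" "serve" "data/data.db" "--port" "8001")
```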
