This project forked from guardian/giant

Platform for journalists to search, analyse, categorise and share unstructured data

License: Apache License 2.0

Shell 0.62% JavaScript 15.69% Ruby 0.01% Java 0.03% Scala 52.87% TypeScript 27.40% CSS 0.14% HTML 0.10% SCSS 3.13%

Giant

Giant makes it easier for journalists to search, analyse, categorise and share unstructured data. It takes many file formats, indexes them (including converting images to text using OCR) and provides a UI for search. Users can upload their own files but it also scales up to terabytes of data.

Screenshot of Giant search

Giant is part of the Guardian's "Platform for Investigations" suite, so you will see references to pfi in the code. Under development since 2017, it is written in Scala and TypeScript and is maintained by the Investigations & Reporting team.

If Giant doesn't fit your needs, check out Aleph from the OCCRP and Datashare from the ICIJ.

(Users) Getting started

(Developers) Getting started

Giant has the following pre-requisites for local development:

Giant uses three databases, run locally in Docker through docker-compose.yaml:

There are three optional dependencies:

  • Tesseract
    • To extract text from images (OCR).
    • brew install tesseract
  • LibreOffice
    • To convert and preview Microsoft Office documents in the UI
  • wkhtmltopdf
    • To preview HTML files (such as emails)
    • brew install wkhtmltopdf

Elasticsearch requires Docker to have at least 4GB of memory, which you can set from the Docker preferences menu. Otherwise it will exit with no log output and error 137.
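A quick way to check the memory requirement before starting the databases is sketched below. `has_enough_memory` is a hypothetical helper and is not part of Giant's scripts. The 4096 MiB default mirrors the 4GB figure above, and Docker may report slightly less than the configured value, so the threshold is left adjustable.

```shell
#!/usr/bin/env bash
# Hypothetical helper, not part of Giant's scripts.
# Usage: has_enough_memory <bytes> [min_mib]
# Succeeds (exit 0) when <bytes> is at least min_mib MiB (default 4096).
has_enough_memory() {
  local bytes="$1"
  local min_mib="${2:-4096}"
  [ "$bytes" -ge $((min_mib * 1024 * 1024)) ]
}

# Example use, assuming the Docker daemon is running:
#   has_enough_memory "$(docker info --format '{{.MemTotal}}')" \
#     || echo "Docker may have too little memory for Elasticsearch" >&2
```

If the check fails only narrowly, pass a lower threshold as the second argument rather than treating the result as definitive.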

For Guardian developers:

  • Janus credentials are not required to run Giant locally.
  • The Giant Runbook

Select the correct version of node:

nvm use

Then run the setup script:

./scripts/setup.sh

Seed the configuration:

./scripts/cluster-setup.sh

Run the Scala backend:

./scripts/start-backend.sh

This will also automatically launch the databases in the background by running docker-compose up -d.

In a separate terminal, run the Create React App frontend:

./scripts/start-frontend.sh

The frontend script will wait for the backend to start before launching Giant at http://localhost:3000.
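The wait-then-launch behaviour described above can be approximated with a small retry loop. This is a minimal sketch of the general technique, not the contents of start-frontend.sh; `wait_for` and `BACKEND_URL` are hypothetical names introduced for illustration.

```shell
#!/usr/bin/env bash
# Hypothetical helper: retry a command until it succeeds or attempts run out.
# Usage: wait_for <attempts> <delay_seconds> <command...>
wait_for() {
  local attempts="$1" delay="$2"
  shift 2
  local i
  for ((i = 1; i <= attempts; i++)); do
    # Discard the command's output; only its exit status matters here.
    if "$@" >/dev/null 2>&1; then
      return 0
    fi
    sleep "$delay"
  done
  return 1
}

# Example use. BACKEND_URL is an assumption; set it to the address your
# backend actually listens on.
#   wait_for 30 2 curl --silent --fail "$BACKEND_URL" && echo "backend is up"
```

Bounding the number of attempts means a backend that never starts produces a clear failure instead of an endless wait.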

Once Giant has started, follow the admin quickstart guide.

dev-nginx proxy

You can use dev-nginx to more easily access Giant and the backing databases whilst running locally.

dev-nginx setup-app util/nginx-mapping.yml

Running Tests

To run all unit tests:

sbt test

To run all integration tests:

sbt int:test

To run a specific integration test:

sbt 'int:testOnly controllers.api.WorkspacesITest'

Stopping databases

To terminate the databases without losing data:

docker-compose down

To terminate and delete data:

docker-compose down -v

Contributing

The Guardian welcomes contributions to Giant. We do not yet have a publicly accessible CI server, so please ensure all tests pass by running the build script locally:

./scripts/teamcity.sh

We do not yet publish deployment templates for Giant, either for cloud hosts or for local deployment. If you are interested in deploying Giant, please get in touch by raising a GitHub issue on this repository.

Architecture

architecture diagram for uploading files

https://docs.google.com/drawings/d/1wcTY9KLhkYqxmwzsyZ3DsWcc0v-ax5kMKWtYb4HZgF0

Licensing

Giant uses the Apache 2.0 licence. Some libraries used are licensed separately:

Supported file formats

  • .rar archives (v4 and below)
  • .zip archives
  • .eml RFC 5322 emails
  • .mbox email archives
  • .msg Outlook email files
  • .pst Outlook email archives
  • .olm Outlook for Mac email archives/backups
  • .png, .jpg, .tiff images (including OCR)
  • .pdf (including OCR)
  • Microsoft Office Word, Excel and PowerPoint files
  • Various plain text files (see DocumentBodyExtractor)
  • Audio files
    • fully supported
      • .wav
      • .mpeg
      • .opus
      • .caf
      • .mp4
      • .aac (Tika sometimes has trouble detecting these)
    • transcribed but preview doesn't work
      • .aff
      • .amr
      • .wma
  • Video files
    • fully supported
      • .mov, .qt
      • .m4v
      • .3gpp
      • .mp4
    • transcribed but preview doesn't work
      • .flv
      • .wmv
      • .msvideo
      • .mpeg
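
The audio and video lists above can be read as a lookup from media kind and extension to a support level. The sketch below encodes them that way; `preview_support` is a hypothetical helper, not part of Giant, and it keys on the extension purely for illustration (the Tika note above suggests Giant itself detects types by content). The media kind is a required argument because .mpeg appears in both lists with different support levels.

```shell
#!/usr/bin/env bash
# Hypothetical lookup built from the audio/video lists above.
# Usage: preview_support <audio|video> <extension>
# Prints "full", "transcribe-only" or "unlisted".
preview_support() {
  local kind="$1"
  local ext
  # Normalise case and accept both "wav" and ".wav".
  ext="$(printf '%s' "$2" | tr '[:upper:]' '[:lower:]')"
  ext="${ext#.}"
  case "$kind:$ext" in
    audio:wav|audio:mpeg|audio:opus|audio:caf|audio:mp4|audio:aac)
      echo full ;;
    audio:aff|audio:amr|audio:wma)
      echo transcribe-only ;;
    video:mov|video:qt|video:m4v|video:3gpp|video:mp4)
      echo full ;;
    video:flv|video:wmv|video:msvideo|video:mpeg)
      echo transcribe-only ;;
    *)
      echo unlisted ;;
  esac
}

# Example use:
#   preview_support audio .mpeg   # audio .mpeg is listed as fully supported
#   preview_support video .mpeg   # video .mpeg is transcribed without preview
```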

Experimental features

Experimental features are enabled through feature flags in the Settings page:

  • New UI: a simplified UI implemented using the Elastic UI toolkit
  • Page Viewer: a unified document viewer showing text, OCR and search highlights inline on the original document

Credits

In addition to any contributors named in this repository, the following contributed to Giant whilst it was closed source at the Guardian:

Contributors

gtrufitt, hoyla, itsibitzi, joelochlann, kenoir, marjisound, marsavar, mbarton, mchv, philmcmahon, samanthagottlieb, snyk-bot, srbd, zekehuntergreen
