tokyo, a REST API which, when given any type of document 📄, identifies its mime-type 🧐, suggests an extension 🦔, and also extracts text 💪.

License: Eclipse Public License 2.0


tokyo

greed2411

"When you hit rock-bottom, you still have a way to go until the abyss." - Tokyo, Netflix's "Money Heist" (La Casa De Papel)



image belongs to teepublic

When one is limited by the technology of the time, one resorts to Java APIs using Clojure.

This is my first attempt at Clojure: a REST API which, when a file is uploaded, identifies its mime-type, its extension, and any text inside the file, and returns that information as JSON. It works for several types of files, including ones which require OCR, thanks to Tesseract. See the complete list of file formats supported by Tika.

It uses ring as the Clojure HTTP server abstraction, jetty as the actual HTTP server, and pantomime as a Clojure abstraction over Apache Tika. It can optionally be served behind traefik acting as a reverse proxy.

Installation

Two options:

  1. Install openjdk-11 and lein, then run lein uberjar.
  2. Use the Dockerfile (Recommended)

Building

  1. You can obtain the .jar file from the releases (if one is available).
  2. Otherwise, build the Docker image using the Dockerfile and run it:
docker build ./ -t tokyo
docker run tokyo:latest

Note: the server defaults to port 80, which is the port exposed in the Docker image. To change it, set the environment variable TOKYO_PORT to whichever port number you'd like, either inside the Dockerfile or in your shell when running the .jar file.
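
As a client-side counterpart to this note, here is a minimal sketch that derives the service URL from the same TOKYO_PORT variable and falls back to port 80. The helper name tokyo_url, its parameters, and the localhost default are assumptions made for illustration; they are not part of tokyo itself.

```python
import os


def tokyo_url(route="/file", host="localhost", env=None):
    """Build a URL for a tokyo route, honouring TOKYO_PORT when it is set.

    `env` is a hypothetical hook so the lookup can be tested without
    touching the real process environment.
    """
    env = os.environ if env is None else env
    # The README states that the server defaults to port 80.
    port = int(env.get("TOKYO_PORT", "80"))
    return f"http://{host}:{port}{route}"
```

Calling tokyo_url() with no TOKYO_PORT set yields the same URL used in the Usage examples.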

I've also added a docker-compose.yml which uses traefik as a reverse proxy. To use it, run docker-compose up.

Usage

  1. The /file route. Make a POST request that uploads a file.

    • the command line approach using curl
    curl -XPOST  "http://localhost:80/file" -F file=@/path/to/file/sample.doc
    
    {"mime-type":"application/msword","ext":".bin","text":"Lorem ipsum \nLorem ipsum dolor sit amet, consectetur adipiscing elit. Nunc ac faucibus odio."}
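    
    • the Python approach using requests
    >>> import requests
    >>> import json
    
    >>> url = "http://localhost:80/file"
    >>> # open in binary mode so the file's bytes are uploaded unchanged
    >>> files = {"file": open("/path/to/file/sample.doc", "rb")}
    >>> response = requests.post(url, files=files)
    >>> json.loads(response.content)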
    
    {'mime-type': 'application/msword', 'ext': '.bin', 'text': 'Lorem ipsum \nLorem ipsum dolor sit amet, consectetur adipiscing elit. Nunc ac faucibus odio.'}

    The general API response (its JSON schema) has the form:

    :mime-type (string) - the mime-type of the file. eg: application/msword, text/plain etc.
    :ext       (string) - the extension of the file. eg: .txt, .jpg etc.
    :text      (string) - the text content of the file.
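
    A client may want to check a response against this schema before using it. The following is a minimal sketch, assuming all three fields are always present as strings (the schema suggests this, but it has not been verified against the server). The function name parse_tokyo_response is hypothetical.

```python
import json

# Field names taken from the schema listed above.
EXPECTED_KEYS = ("mime-type", "ext", "text")


def parse_tokyo_response(raw):
    """Decode a /file response body and check the three documented fields."""
    data = json.loads(raw)
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")
    for key in EXPECTED_KEYS:
        if not isinstance(data.get(key), str):
            raise ValueError(f"missing or non-string field: {key}")
    return data
```

    With requests, this could be called as parse_tokyo_response(response.content).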
    

Note: uploaded files are stored as temp files in /tmp and removed about an hour later (assuming the JVM is still running at that point).
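
To make the lifetime rule concrete, here is a small age-based cleanup sketch. It only illustrates the idea of expiring temp files after an hour; it is not how tokyo does it, since the server is written in Clojure and its actual cleanup mechanism is not described here. The function name remove_expired and its parameters are assumptions.

```python
import os
import time

ONE_HOUR = 3600  # seconds, matching the lifetime described in the note


def remove_expired(directory, max_age=ONE_HOUR, now=None):
    """Delete regular files in `directory` older than `max_age` seconds.

    Age is judged by last-modification time. Returns the sorted names
    of the files that were removed.
    """
    now = time.time() if now is None else now
    removed = []
    # Materialise the listing first so deletion does not disturb iteration.
    for entry in list(os.scandir(directory)):
        if entry.is_file() and now - entry.stat().st_mtime > max_age:
            os.remove(entry.path)
            removed.append(entry.name)
    return sorted(removed)
```

Such a function only has an effect while the process that calls it keeps running, which mirrors the caveat about the JVM.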

  2. The / route. A GET request returns Hello World as plain text, to act as a ping.
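
The ping route lends itself to a simple health check. This is a sketch using only the standard library, assuming the body is exactly Hello World apart from surrounding whitespace. The name is_alive and the opener parameter (a hook for substituting a fake in tests) are hypothetical.

```python
from urllib.request import urlopen


def is_alive(base_url="http://localhost:80/", opener=urlopen, timeout=5):
    """Return True if GET / answers 200 with the plain-text body 'Hello World'."""
    try:
        with opener(base_url, timeout=timeout) as resp:
            body = resp.read().decode("utf-8", errors="replace")
            return resp.status == 200 and body.strip() == "Hello World"
    except OSError:
        # URLError, HTTPError and socket timeouts are all OSError subclasses.
        return False
```

A deployment script could poll is_alive() until it returns True before sending uploads.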

If you go down the path of using docker-compose, the request changes to

curl -XPOST  -H Host:tokyo.localhost http://localhost/file -F file=@/path/to/file/sample.doc

{"mime-type":"application/msword","ext":".bin","text":"Lorem ipsum \nLorem ipsum dolor sit amet, consectetur adipiscing elit. Nunc ac faucibus odio."}

and

>>> response = requests.post(url, files=files, headers={"Host": "tokyo.localhost"})

where tokyo.localhost is the host name configured in docker-compose.yml.
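
For environments where requests is not installed, the same upload can be sketched with the standard library alone. The example below builds, but does not send, a multipart/form-data POST with a single file field and an optional Host header for the traefik setup. The function name build_upload_request and the application/octet-stream part type are assumptions made for illustration.

```python
import uuid
from urllib.request import Request


def build_upload_request(url, filename, payload, host=None):
    """Build a multipart/form-data POST carrying `payload` in a 'file' field."""
    boundary = uuid.uuid4().hex
    body = b"".join([
        f"--{boundary}\r\n".encode(),
        f'Content-Disposition: form-data; name="file"; filename="{filename}"\r\n'.encode(),
        b"Content-Type: application/octet-stream\r\n\r\n",
        payload,
        f"\r\n--{boundary}--\r\n".encode(),
    ])
    headers = {"Content-Type": f"multipart/form-data; boundary={boundary}"}
    if host:
        # Explicit Host header, e.g. "tokyo.localhost" behind traefik.
        headers["Host"] = host
    return Request(url, data=body, headers=headers, method="POST")
```

The returned object can be passed to urllib.request.urlopen to perform the upload, for example build_upload_request("http://localhost/file", "sample.doc", data, host="tokyo.localhost").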

Why?

I had to do this because neither of the Python options is as good as Tika: filetype doesn't identify .doc, .docx, or plain text, and textract is a hacky way of extracting text that needs to know the extension beforehand. The Go version of filetype doesn't offer a way to extract text. So I resorted to spiraling down the path of Java's Apache Tika, via the Clojure pantomime library.

License

Copyright © 2020 greed2411/tokyo

This program and the accompanying materials are made available under the terms of the Eclipse Public License 2.0 which is available at http://www.eclipse.org/legal/epl-2.0.

This Source Code may also be made available under the following Secondary Licenses when the conditions for such availability set forth in the Eclipse Public License, v. 2.0 are satisfied: GNU General Public License as published by the Free Software Foundation, either version 2 of the License, or (at your option) any later version, with the GNU Classpath Exception which is available at https://www.gnu.org/software/classpath/license.html.
