azita-abdollahi / pdf2word


A CRUD REST API file server that converts PDF files to Word files, built with Express.js, MongoDB, Nginx, and Mongo-Express

Dockerfile 4.32% JavaScript 95.68%
docker docker-compose expressjs mongodb mongoose multer multer-storage nginx nodejs mongo-express

pdf2word's Introduction

A Dockerized PDF to Word converter using Express.JS & MongoDB

An application that converts text-based and scanned PDF files to Word documents. MongoDB is used as the database to store files and metadata.

Used packages:

  • Mongoose (An object data modeling library that maps Node.js objects to MongoDB documents.)
  • Config (It lets you define a set of default parameters and extend them for different deployment environments.)
  • Express (The web framework that handles the HTTP routes and requests.)
  • BodyParser (Middleware that parses incoming request bodies, such as content from HTML forms.)
  • Multer (Middleware for handling multipart/form-data, which is used for file uploads.)
  • Gridfs-stream (Easily stream files to and from MongoDB GridFS.)
  • Multer-gridfs-storage (A storage engine for Multer that stores uploaded files in MongoDB GridFS.)
  • pdf-extract (Node PDF is a set of tools that takes in PDF files and converts them to usable formats for data processing. The library supports both extracting text from searchable PDF files and performing OCR on PDFs that are just scanned images of text.)
  • pdf-parse (A pure JavaScript cross-platform module to extract text from PDFs.)
  • stream-to-array (Concatenates a readable stream's data into a single array. Files fetched from the database arrive as a stream, so the data must be buffered before it can be processed as a PDF.)
  • cors (CORS is a Node.js package that provides a Connect/Express middleware for enabling CORS with various options.)
  • officegen (Creates Office Open XML files (Word, Excel and PowerPoint) for Microsoft Office 2007 and later without external tools, in pure JavaScript.)
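The buffering step described for stream-to-array can be sketched with Node.js built-ins alone. This is a minimal illustration, not the project's code: `streamToBuffer` is a hypothetical helper name, and the in-memory stream only stands in for a GridFS download stream.

```javascript
const { Readable } = require('stream');

// Collect every chunk of a readable stream and join them into one Buffer.
// Hypothetical helper; the project itself uses the stream-to-array package.
async function streamToBuffer(readable) {
  const chunks = [];
  for await (const chunk of readable) {
    // Chunks may arrive as strings or Buffers; normalize to Buffer.
    chunks.push(Buffer.isBuffer(chunk) ? chunk : Buffer.from(chunk));
  }
  return Buffer.concat(chunks);
}

// Usage: an in-memory stream standing in for a file read from the database.
const demo = Readable.from([Buffer.from('%PDF-'), Buffer.from('1.4')]);
streamToBuffer(demo).then((buf) => console.log(buf.length, 'bytes buffered'));
```

Once the whole file is in a single Buffer, it can be handed to a parser that expects the complete PDF rather than a stream.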

Pdf-extract prerequisites:

  • pdftk

    pdftk splits a multi-page PDF into single pages.

  • pdftotext

    pdftotext is used to extract text from searchable PDF documents.

  • ghostscript

    ghostscript is an OCR preprocessor that converts PDFs to TIFF files for input into tesseract.

  • tesseract

    tesseract performs the actual OCR on your scanned images.
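To show how the four tools fit together, here is a rough sketch that only builds the command lines for one PDF and executes nothing. The function name `planExtraction`, the file names, the 300 dpi resolution, and the `tiffgray` device are illustrative assumptions; pdf-extract may invoke these tools with different options.

```javascript
// Build the external commands for one PDF. Nothing is executed here.
// All file names and the 300 dpi setting are assumptions for illustration.
function planExtraction(pdfPath, { scanned, lang = 'eng' }) {
  if (!scanned) {
    // Searchable PDF: pdftotext writes the embedded text to a file.
    return [['pdftotext', '-layout', pdfPath, 'out.txt']];
  }
  return [
    // Split the document into single-page PDFs.
    ['pdftk', pdfPath, 'burst', 'output', 'page_%02d.pdf'],
    // Render one page to a grayscale TIFF for OCR.
    ['gs', '-dNOPAUSE', '-dBATCH', '-sDEVICE=tiffgray', '-r300',
      '-sOutputFile=page_01.tif', 'page_01.pdf'],
    // OCR the image; tesseract appends .txt to the output base name.
    ['tesseract', 'page_01.tif', 'page_01', '-l', lang],
  ];
}

// Usage: a scanned Persian document needs the "fas" language data.
for (const cmd of planExtraction('scan.pdf', { scanned: true, lang: 'fas' })) {
  console.log(cmd.join(' '));
}
```

A searchable PDF needs only the single pdftotext step, while a scanned PDF goes through splitting, rasterizing, and OCR for each page.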

Instructions for installing each of these packages on any operating system are written here. These prerequisites are already installed in the Dockerfile.

Dockerfile:

FROM node:14
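# Install the pdf-extract prerequisites in one layer and drop the apt cache.
RUN apt-get update \
    && apt-get install -y pdftk poppler-utils ghostscript tesseract-ocr tesseract-ocr-fas \
    && apt-get clean \
    && rm -rf /var/lib/apt/lists/*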
RUN mkdir /app
WORKDIR /app
COPY package*.json ./
RUN npm install 
COPY . .
EXPOSE 3000
CMD ["npm", "run", "start"]

NOTE: Install tesseract-ocr-fas for Persian language support. Visit this GitHub project for more information on using your preferred language.

docker-compose.yml:

version: "3"
services:
  backend-file-server:
    image: file-server
    container_name: file-server-container
    build:
      context: .
    restart: on-failure
    volumes:
      - "./word/:/app/word/"
    depends_on: 
      - mongodb
    networks:
      - file-net
    ports: 
      - "3000:3000"
  mongodb:
    image: mongo:4.2
    container_name: mongodb
    restart: on-failure
    env_file: ./mongo_env
    volumes: 
      - ./mongo-data:/data/db
    networks:
      - file-net
  mongo-express:
    image: mongo-express:0.54.0
    container_name: mongo-express
    depends_on:
      - mongodb
    networks:
      - file-net
    env_file: ./mongo-express_env 
  nginx:
    image: nginx:1.21
    container_name: nginx_proxy
    restart: on-failure
    depends_on:
      - backend-file-server
    networks:
      - file-net
    ports:
      - "8080:8080"
      - "8081:8081"
    volumes:
      - ./conf.d/:/etc/nginx/conf.d/
networks:
  file-net:
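Because all services share the file-net network, the backend can reach the database by its service name, mongodb. The following is a minimal sketch of building such a connection string; `buildMongoUri`, the database name, and the credentials are hypothetical, since the contents of mongo_env are not shown here.

```javascript
// Build a MongoDB connection string that targets the Compose service name.
// Hypothetical helper; the default port 27017 is MongoDB's standard port.
function buildMongoUri({ user, password, host = 'mongodb', port = 27017, db }) {
  // Percent-encode credentials so characters like "@" or ":" stay valid.
  const auth = user
    ? `${encodeURIComponent(user)}:${encodeURIComponent(password)}@`
    : '';
  return `mongodb://${auth}${host}:${port}/${db}`;
}

// Usage with placeholder values (not the project's real settings).
console.log(buildMongoUri({ user: 'root', password: 'p@ss', db: 'files' }));
```

Depending on how the database user was created, an authSource query parameter may also be needed in the URI.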
