chew's Introduction

A Go library for processing various content types into markdown/plaintext.

About

Chew is a Go library that processes various content types into markdown or plaintext. It supports multiple content types, including HTML, PDF, CSV, JSON, YAML, DOCX, PPTX, Markdown, Plaintext, MP3, FLAC, and WAVE.

Installation

go get github.com/mmatongo/chew

Usage

Here's a basic example of how to use Chew:

package main

import (
	"context"
	"errors"
	"fmt"
	"log"
	"time"

	"github.com/mmatongo/chew"
)

func main() {
	urls := []string{
		"https://example.com",
	}

	// The context is optional; here it caps processing at 5 seconds.
	ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
	defer cancel()

	chunks, err := chew.Process(urls, ctx)
	if err != nil {
		// errors.Is also matches a wrapped context.DeadlineExceeded.
		if errors.Is(err, context.DeadlineExceeded) {
			log.Println("Operation timed out")
		} else {
			log.Printf("Error processing URLs: %v", err)
		}
		return
	}

	// Each chunk carries its extracted text and the URL it came from.
	for _, chunk := range chunks {
		fmt.Printf("Source: %s\nContent: %s\n\n", chunk.Source, chunk.Content)
	}
}

Output

Source: https://example.com
Content: Example Domain

Source: https://example.com
Content: This domain is for use in illustrative examples in documents. You may use this domain in literature without prior coordination or asking for permission.

Source: https://example.com
Content: More information...

You can find more examples in the examples directory as well as instructions on how to use Chew with Ruby and Python.

Contributing

Contributions are welcome! Feel free to open an issue or submit a pull request if you have any suggestions or improvements.

License

This project is licensed under the MIT License - see the LICENSE file for details.

Logo

The logo was made by the amazing MariaLetta.

Similar Projects

docconv

Roadmap

The roadmap for this project is available here. It's meant more as a guide than a strict plan because I only work on this project in my free time.

chew's People

Contributors

mmatongo

chew's Issues

Reorganise internal

  • Refactor and reorganise the code under internal.
    The folder is currently a bit of a mess (subjective?), so this is mostly housekeeping to reorganise and restructure it.

Add an open source transcription option

Right now the only options are Google and OpenAI (under development). This doesn't leave much room for people who would rather not use these.

Task

  • Implement a generic interface to allow chew to work with other open source alternatives for transcription

Blocker

  • I first need to implement the Whisper integration so that I have an idea of how to work around this.

Notes

  • Whisper is open source (research)

For context, this is my first public project in Go. I am still learning, so most of my process involves a lot of iterative refinement.

Enhance selection for HTML and EPUB

Currently the selectors are hardcoded, which makes it tricky for anyone who wants to target specific elements.

	doc.Find("p, h1, h2, h3, h4, h5, h6, li").Each(func(_ int, s *goquery.Selection) {
		text := strings.TrimSpace(s.Text())
		if text != "" {
			chunks = append(chunks, common.Chunk{Content: text, Source: url})
		}
	})

This is far from ideal and doesn't allow for much flexibility, so users should be able to specify their own selectors.
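
One possible shape for this, as a minimal sketch: build the selector string from user-supplied values and fall back to the current hardcoded list when none are given. `buildSelector` and `defaultSelectors` are hypothetical names, not existing chew functions; the resulting string is what would be passed to doc.Find in the snippet above.

```go
package main

import (
	"fmt"
	"strings"
)

// defaultSelectors mirrors the hardcoded list quoted in the issue.
var defaultSelectors = []string{"p", "h1", "h2", "h3", "h4", "h5", "h6", "li"}

// buildSelector is a hypothetical helper. It joins user-supplied selectors
// into one comma-separated selector string, ignoring blank entries, and
// falls back to the defaults when nothing usable was supplied.
func buildSelector(custom []string) string {
	cleaned := make([]string, 0, len(custom))
	for _, s := range custom {
		if s = strings.TrimSpace(s); s != "" {
			cleaned = append(cleaned, s)
		}
	}
	if len(cleaned) == 0 {
		cleaned = defaultSelectors
	}
	return strings.Join(cleaned, ", ")
}

func main() {
	fmt.Println(buildSelector(nil))                          // p, h1, h2, h3, h4, h5, h6, li
	fmt.Println(buildSelector([]string{"article p", " h1 "})) // article p, h1
}
```

Keeping the defaults as the fallback means existing callers would see no change in behaviour.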

Implement transcribing using the whisper api

I tested out Whisper and wow, it's much faster and much more accurate than Google's Cloud Speech API. Implementing this will require reworking and reorganising the speech module to avoid spaghettification.

Fix flaky process test

{
	name: "multiple URLs",
	args: args{
		urls: []string{server.URL + "/text", server.URL + "/html"},
	},
	want: []common.Chunk{
		{Content: "A plain text file.", Source: server.URL + "/text"},
		{Content: "An HTML file.", Source: server.URL + "/html"},
	},
	wantErr: false,
},

The test fails intermittently because the order of the returned chunks is not deterministic: it can be text then HTML, or HTML then text.
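
One common way to make a test like this stable is to compare the results without regard to order. The sketch below sorts copies of both slices before comparing them. `Chunk` is a local stand-in for common.Chunk with only the two fields shown in the test case, and `sortChunks` and `sameChunks` are hypothetical helpers, not part of the repository.

```go
package main

import (
	"fmt"
	"reflect"
	"sort"
)

// Chunk is a local stand-in for common.Chunk, limited to the two fields
// that appear in the test case above.
type Chunk struct {
	Content string
	Source  string
}

// sortChunks returns a sorted copy, ordered by Source and then Content,
// leaving the input slice untouched.
func sortChunks(in []Chunk) []Chunk {
	out := append([]Chunk(nil), in...)
	sort.Slice(out, func(i, j int) bool {
		if out[i].Source != out[j].Source {
			return out[i].Source < out[j].Source
		}
		return out[i].Content < out[j].Content
	})
	return out
}

// sameChunks reports whether got and want hold the same chunks in any order.
func sameChunks(got, want []Chunk) bool {
	return reflect.DeepEqual(sortChunks(got), sortChunks(want))
}

func main() {
	want := []Chunk{
		{Content: "A plain text file.", Source: "/text"},
		{Content: "An HTML file.", Source: "/html"},
	}
	got := []Chunk{want[1], want[0]} // same chunks, opposite order
	fmt.Println(sameChunks(got, want))
}
```

In the test itself, the existing equality check would be swapped for a call like sameChunks(got, tt.want), assuming the result order is not something callers rely on.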
