
nlp4all / nlp4all


NLP4All is a learning platform for educational institutions to help students that are not in data-oriented fields to understand natural language processing techniques and applications.

Shell 0.20% JavaScript 2.12% Python 72.65% CSS 1.41% HTML 23.03% Dockerfile 0.50% Mako 0.10%
education educational-institutions learning-platform nlp educational-resources language learning platform machine-learning natural-language-processing nlp-pipeline nlp-pipelines

nlp4all's Introduction

nlp4all

This is the repository for NLP4All. NLP4All is built on Flask.

nlp4all's People

Contributors

arthurhjorth, dependabot[bot], emilroenn, franciscoabenza, rbkhb, telmaau, zeyus


nlp4all's Issues

Refactor data import

The new implementation reads the schema and imports the data straight after upload.

This still needs to be benchmarked. Reading from the filesystem is slow, but it may still be quicker to read the schema by iterating over all the rows first and then import only the selected fields. Right now, importing 400k tweets takes about 40 minutes, and the subsequent delete query takes an additional 20 minutes if done in a single query, or more than 60 minutes if done in individual queries. (These delete timings are pre-index; the indexed version is being tested now.)

While we're at it, consider using MongoDB for document storage and joining with a unique key (or the document ID).

See also: https://github.com/NLP4ALL/nlp4all/wiki/Performance

If we go this route (probably more performant), it will require hooks on init-db and drop-db, as well as when deleting and adding data sources.

Update
The version with a GIN index on the document column actually takes longer for both import and property deletion. This makes sense, as it has to update more information at each step, and the indexing probably doesn't extend to such deep nesting (it could, if the structure were consistent). MongoDB seems like the way to go.

Update 2

MongoDB has now been implemented, which brings the import down to about 3 minutes. Key deletion still takes around 8 minutes, but that leaves just one remaining task: process the schema BEFORE import and import only the required keys. The whole process will then be much quicker, probably totalling around the same 3 minutes.
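
The "process the schema before import" step could look roughly like the sketch below. This is not the NLP4All implementation. It assumes the upload is newline-delimited JSON and that nested keys are addressed by dotted paths; the function names `infer_schema`, `project`, and `filtered_rows` are hypothetical.

```python
import json
from typing import Any, Dict, Iterable, Iterator, Set


def infer_schema(lines: Iterable[str]) -> Set[str]:
    """First pass: collect every dotted key path that holds a non-dict value."""
    paths: Set[str] = set()

    def walk(value: Any, prefix: str) -> None:
        if isinstance(value, dict):
            for key, child in value.items():
                walk(child, f"{prefix}.{key}" if prefix else key)
        elif prefix:
            # Lists and scalars are treated as leaves.
            paths.add(prefix)

    for line in lines:
        if line.strip():
            walk(json.loads(line), "")
    return paths


def project(row: Dict[str, Any], keep: Set[str]) -> Dict[str, Any]:
    """Copy only the selected dotted paths from one row."""
    out: Dict[str, Any] = {}
    for path in keep:
        source: Any = row
        parts = path.split(".")
        for part in parts:
            if not isinstance(source, dict) or part not in source:
                break  # this row does not have the path; skip it
            source = source[part]
        else:
            target = out
            for part in parts[:-1]:
                target = target.setdefault(part, {})
            target[parts[-1]] = source
    return out


def filtered_rows(lines: Iterable[str], keep: Set[str]) -> Iterator[Dict[str, Any]]:
    """Second pass: yield each row reduced to the selected keys."""
    for line in lines:
        if line.strip():
            yield project(json.loads(line), keep)
```

The rows yielded by `filtered_rows` are what would be handed to the database insert, so unwanted keys never need to be deleted afterwards. Keys that themselves contain a dot would be ambiguous under this path convention and would need a different separator.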

data importer

(User uploads) Are we only supporting CSV and/or JSON, or did Arthur mention something else?

create style library

Things like default buttons, page layouts, what the import page looks like, dropdowns, etc.

rename tweet* to text* in models

This is part of the larger body of work of getting the original functionality back into the site, especially changing from the old way of loading data to the new way.

move datasourcemanager to postgres

I think the best option here, now that we have Postgres, is to use a defined structure for the required columns that we can pull in:
id, data_source_id, text, and category (maybe?), then user_data

The data source manager can then have an id, name, description, project(?), and structure, where structure describes the fields within the JSONB user_data column.

Postgres supports the JSONB format, which can be used for querying data within JSON columns. We can validate app-side before the data goes to the database, since we have the structure for that data source defined.

See: https://medium.com/geekculture/postgres-jsonb-usage-and-performance-analysis-cdbd1242a018

also: https://docs.sqlalchemy.org/en/20/dialects/postgresql.html#sqlalchemy.dialects.postgresql.JSONB
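
The app-side validation mentioned above might be sketched as follows. The issue does not say how `structure` is encoded, so this assumes a flat mapping from field name to a type name such as `"str"` or `"int"`; `TYPE_MAP` and `validate_user_data` are hypothetical names, not part of NLP4All or SQLAlchemy.

```python
from typing import Any, Dict, List

# Hypothetical mapping from the type names used in a data source's
# `structure` to Python types; the issue does not define this format.
TYPE_MAP = {"str": str, "int": int, "float": (int, float), "bool": bool}


def validate_user_data(structure: Dict[str, str], user_data: Dict[str, Any]) -> List[str]:
    """Return a list of problems; an empty list means the row is valid."""
    errors: List[str] = []
    for field, type_name in structure.items():
        if field not in user_data:
            errors.append(f"missing field: {field}")
            continue
        value = user_data[field]
        expected = TYPE_MAP[type_name]
        # bool is a subclass of int in Python, so reject it for non-bool fields.
        if isinstance(value, bool) and type_name != "bool":
            errors.append(f"wrong type for {field}: expected {type_name}")
        elif not isinstance(value, expected):
            errors.append(f"wrong type for {field}: expected {type_name}")
    for field in user_data:
        if field not in structure:
            errors.append(f"unexpected field: {field}")
    return errors
```

For example, with `structure = {"lang": "str", "retweets": "int"}`, a row whose `user_data` is `{"lang": "en", "retweets": 3}` yields no errors and could be written to the JSONB column, while a row with a string in `retweets` would be rejected before it reaches the database.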

fix linting errors

let's get the build passing so we have a nice start and a bright future
