dataiku / dss-plugin-nlp-embedding

Dataiku DSS plugin to extract vector embeddings from text data 👾

Home Page: https://www.dataiku.com/product/plugins/sentence-embedding/

License: Apache License 2.0

Makefile 7.40% Python 92.60%
dataiku dss-plugin nlp embeddings natural-language-processing deep-learning text-embedding

dss-plugin-nlp-embedding's Introduction

dss-plugin-nlp-embedding's People

Contributors

alexcombessie, dependabot-preview[bot], dependabot[bot], du-phan, mhham, muennighoff

dss-plugin-nlp-embedding's Issues

How to use sentence embedding in machine learning model?

Here is what I did: I uploaded a dataset with a column containing one text document per row, and cleaned that column with the DSS natural language processing tools. I then passed the column as input to the sentence embedding plugin, using pre-trained GloVe word embeddings. The output dataset contains a new column holding the embedding vectors.

I want to run text clustering or classification on that column using the vectors, so I can determine which categories appear in the text.
I ran the DSS K-means algorithm against the vector column, but it kept failing with "list index out of range" errors.

Can you advise me on how to cluster the text data with unsupervised learning using the sentence embeddings, so that it displays the correct categories?
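
One possible preparation step, sketched below, is to expand the vector column into one numeric column per dimension before clustering, since many clustering tools expect numeric features rather than a single array-valued column. This is only a sketch under assumptions that the thread does not confirm: that the plugin writes each vector as a JSON array string, and that the index errors come from the format of that column. The column name `text_embedded`, the `emb_` prefix, and the helper itself are hypothetical.

```python
import json


def expand_embedding_column(rows, column, prefix="emb_"):
    """Replace a serialized vector column with one float column per dimension.

    `rows` is a list of dicts. Rows whose vector is missing or empty are
    skipped, because they cannot be placed in the feature space.
    """
    expanded = []
    expected_dim = None
    for row in rows:
        raw = row.get(column)
        if not raw:
            continue  # no embedding for this row
        # Assumption: vectors are JSON array strings such as "[0.1, 0.2]".
        vector = json.loads(raw) if isinstance(raw, str) else list(raw)
        if not vector:
            continue
        if expected_dim is None:
            expected_dim = len(vector)
        elif len(vector) != expected_dim:
            raise ValueError(
                "expected %d dimensions, got %d" % (expected_dim, len(vector))
            )
        new_row = {key: value for key, value in row.items() if key != column}
        for i, value in enumerate(vector):
            new_row[prefix + str(i)] = float(value)
        expanded.append(new_row)
    return expanded


if __name__ == "__main__":
    sample = [
        {"id": 1, "text_embedded": "[0.1, 0.2, 0.3]"},
        {"id": 2, "text_embedded": ""},  # skipped: empty embedding
    ]
    print(expand_embedding_column(sample, "text_embedded"))
```

The resulting `emb_0`, `emb_1`, ... columns can then be selected as numeric features for a clustering algorithm such as K-means.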

Recent versions of smart_open break Gensim on Python 2.7

Hello there,

This is my first time posting an issue anywhere on Git, so I hope I am doing this right!

I'm currently implementing the plugin in our DSS environment and I've stumbled upon several issues related to libraries included in the code environment created for the plugin:

File "/var/opt/data/dataiku/datadir/code-envs/python/plugin_dss_sentence_embedding_managed/lib/python2.7/site-packages/smart_open/smart_open_lib.py", line 24, in
[11:59:18] [INFO] [dku.utils] - import urllib.parse
[11:59:18] [INFO] [dku.utils] - ImportError: No module named parse

This seems to be related to this issue about Gensim running on Python 2.7:
piskvorky/gensim#2786

A workaround I found is to add the smart-open library to requirements.txt and pin it to a version older than 1.11.0 (1.10.1 works for me).

Cheers,
Pierre.
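
The version rule from the workaround above can be written as a small check, for instance to validate a code environment before running the plugin. This is a minimal sketch: the 1.11.0 threshold is taken from the report above rather than from the smart_open changelog, and both helper names are hypothetical.

```python
import sys


def version_tuple(version):
    """Turn '1.10.1' into (1, 10, 1), stopping at the first non-numeric part."""
    parts = []
    for piece in version.split("."):
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        if not digits:
            break
        parts.append(int(digits))
    return tuple(parts)


def smart_open_ok_for_python(version, python_major=sys.version_info[0]):
    """Apply the reported rule: on Python 2, smart_open must be older than 1.11.0."""
    if python_major >= 3:
        return True  # the reported failure only concerns Python 2
    return version_tuple(version) < (1, 11, 0)


if __name__ == "__main__":
    for candidate in ("1.10.1", "1.11.0"):
        print(candidate, smart_open_ok_for_python(candidate, python_major=2))
```

In requirements.txt the same constraint would typically be expressed as a pin such as `smart-open<1.11.0`.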

Plugin doesn't work for 3.7

Hi Team,

I tried to integrate this plugin into a Python 3.7 code environment, but I kept getting module-not-found errors caused by importing abstract_model_language.py as a module. For the life of me I couldn't figure it out, and I spent 12 hours trying to fix it. I tried downloading 20 GB worth of different pre-trained embeddings, reinstalled the plugin 15 times (including versions from GitHub), reinstalled Dataiku, and even tried on Mac, Windows-Linux, and Linux machines. I must have clicked all the links on Google.
Then I saw this comment where the code was updated to make it work on Python 2.7. This made me want to set up a Python 2.7 code environment, but that didn't work because requirements.txt specified numpy 1.18.2 and tensorflow 1.15.2, which aren't supported. So I downgraded numpy to 1.16 and upgraded tensorflow to 2.0, and it finally worked. I thought I'd share my pain, since possibly something you fixed for Python 2.7 made the plugin stop working on Python 3.7.
Please look into this issue.
Other than that, I really like how the plugin works and I'm excited to use it in ML models.

Error in OCR due to deprecated 'triu'

Describe the bug
With Python 3.9, I encountered the following error when running both the "Compute sentence embeddings" recipe and the "Compute sentence similarity" recipe:

Error in Python process: At line 4: <class 'ImportError'>: cannot import name 'triu' from 'scipy.linalg' (/data/dataiku/dss_data/code-envs/python/plugin_sentence-embedding_managed/lib64/python3.9/site-packages/scipy/linalg/__init__.py)

To Reproduce
Steps to reproduce the behavior:
See: https://design.analytics.ondku.net/public-webapps/support/ticket/67516

Expected behavior
No errors.

Additional context
DSS 12.6.1
Python 3.9
scipy 1.13.1
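
For context, `triu` was deprecated in `scipy.linalg` and is no longer available in recent SciPy releases such as the 1.13.1 listed above, while NumPy provides a `triu` function of its own. In code you control, a fallback import is one way to tolerate both situations. The sketch below is a generic helper, not the plugin's fix: the traceback does not show which package performs the failing import, and `import_first_available` is a hypothetical name.

```python
import importlib


def import_first_available(candidates):
    """Return the first attribute that resolves from (module, name) pairs.

    Each candidate is tried in order. A missing module or a missing
    attribute moves on to the next candidate.
    """
    errors = []
    for module_name, attr_name in candidates:
        try:
            module = importlib.import_module(module_name)
            return getattr(module, attr_name)
        except (ImportError, AttributeError) as exc:
            errors.append("%s.%s: %s" % (module_name, attr_name, exc))
    raise ImportError(
        "none of the candidates could be imported: " + "; ".join(errors)
    )


if __name__ == "__main__":
    # Needs at least NumPy installed; prefers SciPy's triu where it still exists.
    triu = import_first_available([("scipy.linalg", "triu"), ("numpy", "triu")])
    print(triu([[1, 2], [3, 4]]))
```

When the failing import lives in a third-party dependency, this helper does not apply, and constraining the SciPy version in the plugin's code environment to a release that still ships `scipy.linalg.triu` is the more usual workaround.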
