
sciwing's People

Contributors

abhinavkashyap, honhaochen, kzzj217, yajingyang

sciwing's Issues

ElmoEmbedder deprecated in AllenNLP

I see that this was included in the code (modules/elmo_lstm_encoder.py):
    @deprecated(
        reason="ElmoEmbedder will be deprecated "
        "Please use concat embedder and lstm 2 vec module to achieve the same thing. "
        "This will be removed in version 1",
        version=0.1,
    )

It is still being called in the current version of sciwing:
from sciwing.modules.embedders.elmo_embedder import ElmoEmbedder

Which gives the module not found error when calling:
ModuleNotFoundError: No module named 'allennlp.commands.elmo'

Since the deprecation of ElmoEmbedder was already anticipated in the code, the fix seems to be to change these lines in modules/embedders/bow_elmo_embedder.py:

Line 2, replace:
from allennlp.commands.elmo import ElmoEmbedder

with:
from sciwing.modules.embedders.elmo_embedder import ElmoEmbedder

Line 71, replace:
self.elmo = ElmoEmbedder(cuda_device=self.cuda_device_id)

with:
self.elmo = ElmoEmbedder()

After making that correction, the following error comes up:
RuntimeError: Error(s) in loading state_dict for RnnSeqCrfTagger: Missing key(s) in state_dict: "rnn2seqencoder.embedder.embedder_elmo.elmo.elmo._elmo_lstm._token_embedder._char_embedding_weights", "rnn2seqencoder.embedder.embedder_elmo.elmo.elmo._elmo_lstm._token_embedder.char_conv_0.weight", "rnn2seqencoder.embedder.embedder_elmo.elmo.elmo._elmo_lstm._token_embedder.char_conv_0.bias", "rnn2seqencoder.embedder.embedder_elmo.elmo.elmo._elmo_lstm._token_embedder.char_conv_1.weight", "rnn2seqencoder.embedder.embedder_elmo.elmo.elmo._elmo_lstm._token_embedder.char_conv_1.bias", "rnn2seqencoder.embedder.embedder_elmo.elmo.elmo._elmo_lstm._token_embedder.char_conv_2.weight", "rnn2seqencoder.embedder.embedder_elmo.elmo.elmo._elmo_lstm._token_embedder.char_conv_2.bias", "rnn2seqencoder.embedder.embedder_elmo.elmo.elmo._elmo_lstm._token_embedder.char_conv_3.weight", "rnn2seqencoder.embedder.embedder_elmo.elmo.elmo._elmo_lstm._token_embedder.char_conv_3.bias", "rnn2seqencoder.embedder.embedder_elmo.elmo.elmo._elmo_lstm._token_embedder.char_conv_4.weight", "rnn2seqencoder.embedder.embedder_elmo.elmo.elmo._elmo_lstm._token_embedder.char_conv_4.bias", "rnn2seqencoder.embedder.embedder_elmo.elmo.elmo._elmo_lstm._token_embedder.char_conv_5.weight", "rnn2seqencoder.embedder.embedder_elmo.elmo.elmo._elmo_lstm._token_embedder.char_conv_5.bias", "rnn2seqencoder.embedder.embedder_elmo.elmo.elmo._elmo_lstm._token_embedder.char_conv_6.weight", "rnn2seqencoder.embedder.embedder_elmo.elmo.elmo._elmo_lstm._token_embedder.char_conv_6.bias", "rnn2seqencoder.embedder.embedder_elmo.elmo.elmo._elmo_lstm._token_embedder._highways._layers.0.weight", "rnn2seqencoder.embedder.embedder_elmo.elmo.elmo._elmo_lstm._token_embedder._highways._layers.0.bias", "rnn2seqencoder.embedder.embedder_elmo.elmo.elmo._elmo_lstm._token_embedder._highways._layers.1.weight", "rnn2seqencoder.embedder.embedder_elmo.elmo.elmo._elmo_lstm._token_embedder._highways._layers.1.bias", "rnn2seqencoder.embedder.embedder_elmo.elmo.elmo._elmo_lstm._token_embedder._projection.weight", "rnn2seqencoder.embedder.embedder_elmo.elmo.elmo._elmo_lstm._token_embedder._projection.bias", "rnn2seqencoder.embedder.embedder_elmo.elmo.elmo._elmo_lstm._elmo_lstm.forward_layer_0.input_linearity.weight", "rnn2seqencoder.embedder.embedder_elmo.elmo.elmo._elmo_lstm._elmo_lstm.forward_layer_0.state_linearity.weight", "rnn2seqencoder.embedder.embedder_elmo.elmo.elmo._elmo_lstm._elmo_lstm.forward_layer_0.state_linearity.bias", "rnn2seqencoder.embedder.embedder_elmo.elmo.elmo._elmo_lstm._elmo_lstm.forward_layer_0.state_projection.weight", "rnn2seqencoder.embedder.embedder_elmo.elmo.elmo._elmo_lstm._elmo_lstm.backward_layer_0.input_linearity.weight", "rnn2seqencoder.embedder.embedder_elmo.elmo.elmo._elmo_lstm._elmo_lstm.backward_layer_0.state_linearity.weight", "rnn2seqencoder.embedder.embedder_elmo.elmo.elmo._elmo_lstm._elmo_lstm.backward_layer_0.state_linearity.bias", "rnn2seqencoder.embedder.embedder_elmo.elmo.elmo._elmo_lstm._elmo_lstm.backward_layer_0.state_projection.weight", "rnn2seqencoder.embedder.embedder_elmo.elmo.elmo._elmo_lstm._elmo_lstm.forward_layer_1.input_linearity.weight", "rnn2seqencoder.embedder.embedder_elmo.elmo.elmo._elmo_lstm._elmo_lstm.forward_layer_1.state_linearity.weight", "rnn2seqencoder.embedder.embedder_elmo.elmo.elmo._elmo_lstm._elmo_lstm.forward_layer_1.state_linearity.bias", "rnn2seqencoder.embedder.embedder_elmo.elmo.elmo._elmo_lstm._elmo_lstm.forward_layer_1.state_projection.weight", 
"rnn2seqencoder.embedder.embedder_elmo.elmo.elmo._elmo_lstm._elmo_lstm.backward_layer_1.input_linearity.weight", "rnn2seqencoder.embedder.embedder_elmo.elmo.elmo._elmo_lstm._elmo_lstm.backward_layer_1.state_linearity.weight", "rnn2seqencoder.embedder.embedder_elmo.elmo.elmo._elmo_lstm._elmo_lstm.backward_layer_1.state_linearity.bias", "rnn2seqencoder.embedder.embedder_elmo.elmo.elmo._elmo_lstm._elmo_lstm.backward_layer_1.state_projection.weight", "rnn2seqencoder.embedder.embedder_elmo.elmo.elmo.scalar_mix_0.gamma", "rnn2seqencoder.embedder.embedder_elmo.elmo.elmo.scalar_mix_0.scalar_parameters.0", "rnn2seqencoder.embedder.embedder_elmo.elmo.elmo.scalar_mix_0.scalar_parameters.1", "rnn2seqencoder.embedder.embedder_elmo.elmo.elmo.scalar_mix_0.scalar_parameters.2".

I guess elmo_embedder.py is not generating the embedder in the needed way, or it was not intended to be used the way I tried to use it to work around the deprecated ElmoEmbedder module in AllenNLP.

10 fold CV support for Engine

Some tasks have very small datasets and require 10 fold CV, where the average performance over the 10 folds is reported. Right now, this has to be done manually: the Engine is set up 10 times, 10 experiments are run, and the average of the 10 experiments is reported.

Should there be a way for 10 fold CV to be supported for certain tasks? A wrapper around the Engine where the cross validation is handled automatically would be one option. This is open for discussion.
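
A minimal sketch of what such a wrapper could look like. Everything here is hypothetical: make_engine is an assumed factory, and engine.run() / engine.best_validation_metric are stand-ins for whatever the Engine actually exposes.

    from statistics import mean

    def run_cross_validation(make_engine, folds):
        # folds: list of (train_split, val_split) pairs
        scores = []
        for train_split, val_split in folds:
            engine = make_engine(train_split, val_split)  # fresh Engine per fold
            engine.run()  # assumed: trains and evaluates this fold
            scores.append(engine.best_validation_metric)  # assumed attribute
        return mean(scores)  # average performance over the folds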

Need an abstract class for Classification Dataset

ParsectDataset and GenericSectDataset share some common functionality, and so will many classification datasets. There should be an abstract base class for these, with a few concretely implemented methods and some compulsory (abstract) methods.
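
A minimal sketch of what such a base class could look like; the method names are illustrative, not sciwing's actual API:

    from abc import ABC, abstractmethod

    class BaseClassificationDataset(ABC):
        @abstractmethod
        def get_lines_labels(self):
            """Compulsory: return parallel lists of texts and labels."""

        @abstractmethod
        def get_classname2idx(self):
            """Compulsory: return the label-to-index mapping."""

        def get_num_classes(self):
            # concrete method shared by all subclasses for free
            return len(self.get_classname2idx())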

Cli for exploring different datasets

There are different classification datasets that are part of the repo now. A cli to explore the different datasets would be a nice feature to have. Get stats is already part of the interface.

The cli should (a minimal sketch follows this list)

  1. Ask which dataset to explore
  2. The user should be able to see basic vocab stats of the dataset (number of distinct words, most frequent words)
  3. Other information about the dataset, like the number of training and validation examples and the maximum instance length
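
A sketch using only the standard library; DATASETS and load_dataset are hypothetical stand-ins for whatever sciwing exposes:

    from collections import Counter

    DATASETS = ["parsect", "genericsect"]  # illustrative names

    def load_dataset(name):
        # hypothetical stand-in for sciwing's dataset loading
        return ["a small example line", "another line"], ["label1", "label2"]

    def explore():
        print("Which dataset do you want to explore?")
        for idx, name in enumerate(DATASETS):
            print(f"  [{idx}] {name}")
        lines, labels = load_dataset(DATASETS[int(input("> "))])

        counter = Counter(tok for line in lines for tok in line.split())
        print(f"distinct words: {len(counter)}")
        print(f"most frequent words: {counter.most_common(10)}")
        print(f"training examples: {len(lines)}")
        print(f"number of classes: {len(set(labels))}")
        print(f"max instance length: {max(len(l.split()) for l in lines)}")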

Usage on Google Colab

I am not able to get sciwing to run properly on Google Colab.

I suspect that it may be due to unclear dependencies during installation.

Here is what I did:

! pip install sciwing

This works (with warnings), but
from sciwing.models.neural_parscit import NeuralParscit
gives the error
ModuleNotFoundError: No module named 'allennlp.commands.elmo'

Following issue #22, I installed allennlp==0.9.0 before installing sciwing:

! pip install allennlp==0.9.0

but now it gives a spacy error:

KeyError: 'PUNCTSIDE_FIN'

Installing the spacy version given in requirements.txt (or a newer version of spacy) did not help.

Suggestion for feature improvement: Return dictionary with parsed references

Hello all:

A suggestion for an improvement to NeuralParscit: return a dictionary that groups tokens by their predicted label. For example:

After parsing the reference:
"Calzolari, N. (1982) Towards the organization of lexical definitions on a database structure. In E. Hajicova (Ed.), COLING '82 Abstracts, Charles University, Prague, pp.61-64."

the result would be:
'author author date title title title title title title title title title title editor editor editor editor booktitle booktitle booktitle editor institution location pages'

However, it could instead return a dictionary like this:
{'author': ['Calzolari,', 'N.'], 'date': ['(1982)'], 'title': ['Towards', 'the', 'organization', 'of', 'lexical', 'definitions', 'on', 'a', 'database', 'structure.'], 'editor': ['In', 'E.', 'Hajicova', '(Ed.),', 'Charles'], 'booktitle': ['COLING', "'82", 'Abstracts,'], 'institution': ['University,'], 'location': ['Prague,'], 'pages': ['pp.61-64.']}

The dictionary above could later be used to detokenize the lists and get something like:
{'author': 'Calzolari, N.',
'date': '(1982)',
'title': 'Towards the organization of lexical definitions on a database structure.',
'editor': 'In E. Hajicova (Ed.), Charles',
'booktitle': "COLING '82 Abstracts,",
'institution': 'University,',
'location': 'Prague,',
'pages': 'pp.61-64.'}

The code to get something like the above would be:

    from sacremoses import MosesDetokenizer

    # `reference` is the raw citation string shown above and
    # `neural_parscit` is an already-instantiated NeuralParscit model
    md = MosesDetokenizer(lang="en")
    reference_tokenized = reference.split()

    result_parsing = neural_parscit.predict_for_text(text=reference, show=False)
    result_labels = result_parsing.split(" ")

    result_dict = {}
    for token, token_label in zip(reference_tokenized, result_labels):
        if token_label not in result_dict:
            result_dict[token_label] = []
        result_dict[token_label].append(token)

    # detokenize everything
    result_dict = {k: md.detokenize(v) for k, v in result_dict.items()}

The detokenizer used is MosesDetokenizer, from the sacremoses library.

Other languages

Hi, is there a way to select another language for analyzing documents?

Do pre-trained models exist in other languages? I'm interested in Spanish and French.

Enhancement: Use loguru for logging

I am using json-logging for logging information to files, and it's a mess in the file engine.py. I came across loguru (log-guru), which is a slick library for logging. There are other logging libraries like zerolog. Evaluate the different libraries and change the code to include something that is easier.
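
For reference, a minimal sketch of what file logging with loguru could look like (the sink name and options are illustrative, assuming loguru were adopted):

    from loguru import logger

    # serialize=True emits JSON lines, similar to what json-logging produces
    logger.add("engine.log", rotation="10 MB", serialize=True)
    logger.info("epoch {} finished, loss={}", 3, 0.42)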

Different emb loader for word and characters are not needed

There are two different classes, WordEmbLoader and CharEmbLoader. This is not needed.
We can have just one embedding loader: the EmbeddingLoader abstraction does not care whether it loads words or characters. Provide the appropriate embedding type, and the tokens will be instantiated with the appropriate values.
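
A minimal sketch of the single-loader idea; the class and the file format it reads are illustrative, not sciwing's actual API:

    class EmbeddingLoader:
        """Loads token -> vector mappings; agnostic to words vs characters."""

        def __init__(self, embedding_type: str, embedding_path: str):
            # embedding_type picks the pretrained file; the loading logic
            # below is identical for word and character embeddings
            self.embedding_type = embedding_type
            self.token2vec = {}
            with open(embedding_path) as fp:
                for line in fp:  # assumes "token v1 v2 ..." text format
                    token, *values = line.split()
                    self.token2vec[token] = [float(v) for v in values]

        def get(self, token, default=None):
            return self.token2vec.get(token, default)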

Error when running the API

Hi,

When I try to run:

uvicorn api:app --reload

I get this message:

import sciwing.api.conf as config
ModuleNotFoundError: No module named 'sciwing'

I am in the sciwing/api folder.
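
A likely cause, though this is an assumption: when uvicorn is launched from inside sciwing/api, the sciwing package itself is not on Python's import path. Installing the package (for example with pip install -e . from the repo root) or launching uvicorn from a directory where the sciwing package is importable should make the import resolve.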

RuntimeError in parscit example

I'm getting RuntimeError: size mismatch when trying to run the parscit.sh file from the examples folder.

The complete traceback: [screenshot: parscit_error]

Outdated packages

The packages in requirements.txt are outdated. In particular, torch is pinned at 1.5, which is not available for Python 3.10.

Initial Update

The bot created this issue to inform you that pyup.io has been set up on this repo.
Once you have closed it, the bot will open pull requests for updates as soon as they are available.

Preprocessing - Parsect sentences are half complete

  • The dataset has incomplete sentences. For example, in "Pre-trained word embeddings from bidirection-", the token "bidirection-" is continued on the next line as "al".
  • Replace "bidirection-" with "bidirectional", fixing both the current line and the next line (a sketch follows this list).
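
A minimal sketch of one way to do the merge (this keeps the completed word only in the current line and drops the fragment from the next line; whether the issue wants the full word duplicated in both lines is left open):

    def fix_hyphenation(lines):
        # merge words split across lines with a trailing hyphen, e.g.
        # "... from bidirection-" + "al ..." -> "... from bidirectional"
        fixed = []
        for i, line in enumerate(lines):
            line = line.rstrip()
            if line.endswith("-") and i + 1 < len(lines):
                next_words = lines[i + 1].split()
                if next_words:
                    line = line[:-1] + next_words[0]  # glue the continuation
                    lines[i + 1] = " ".join(next_words[1:])  # drop it from next line
            fixed.append(line)
        return fixed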

Cli for letting users configure different models in the system

  • Right now, there is no easy way to configure the different models in an interactive manner
  • Provide a cli to configure/experiment with different models
  • Different models require different configurations, so this will be an error-free way to enable users to configure the different models, either with default options or certain allowed options (a sketch follows this list)
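
A minimal sketch of the allowed-options idea; the model names and parameters are illustrative, not sciwing's actual models:

    ALLOWED = {
        "bow_elmo": {"layer": ["first", "last", "average"]},
        "bi_lstm": {"hidden_size": ["128", "256", "512"]},
    }

    def configure(model_name):
        config = {}
        for param, options in ALLOWED[model_name].items():
            value = input(f"{param} {options} > ")
            if value not in options:  # error-free: reject disallowed values
                raise ValueError(f"{value!r} is not allowed for {param}")
            config[param] = value
        return config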

Decorators for datasets: help in easy creation of datasets

Easy Creation of Datasets

  • Creation of datasets should be easy for the user
  • There should be flexibility for the user and cater to different file formats, different classes of datasets.
  • But some of the common features for all datasets should be provided to the user without much hassle

Suggestion

  • Provide common requirements like word_vocab and char_vocab for free when the user requests them using decorators (a sketch follows this list)
  • The user can specify what is needed for the current dataset and have it readily available from the framework
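
A minimal sketch of what such a decorator could look like; get_lines and the vocabulary construction are assumptions for illustration, not sciwing's actual API:

    from functools import wraps

    def with_word_vocab(dataset_cls):
        # class decorator: build a word vocabulary after __init__ runs
        original_init = dataset_cls.__init__

        @wraps(original_init)
        def init(self, *args, **kwargs):
            original_init(self, *args, **kwargs)
            tokens = (tok for line in self.get_lines() for tok in line.split())
            self.word_vocab = sorted(set(tokens))  # attached for free

        dataset_cls.__init__ = init
        return dataset_cls

A dataset author would then write @with_word_vocab above their class and get word_vocab without any boilerplate.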

Inference code can break due to not flushing buffers during output

From @honhaochen's report, there could be an issue where the inference code runs into memory problems because it does not buffer its reads and writes of test instances. Instead, it currently loads the entire dataset into memory, does inference, and then writes everything to disk. A better solution is to read smaller batches of data and flush the inference output to disk as it is produced, so that memory use stays roughly constant.
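
A minimal sketch of the batched, flushing approach; infer_fn is a hypothetical callable that maps a batch of instances to predictions:

    def batched_inference(instances, infer_fn, out_path, batch_size=32):
        with open(out_path, "w") as out_file:
            batch = []
            for instance in instances:
                batch.append(instance)
                if len(batch) == batch_size:
                    out_file.writelines(f"{p}\n" for p in infer_fn(batch))
                    out_file.flush()  # keep memory and buffers small
                    batch = []
            if batch:  # final partial batch
                out_file.writelines(f"{p}\n" for p in infer_fn(batch))
                out_file.flush()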

Tests: Setting up tests best practices

There are a million places where different datasets are created.
For example, a dataset is created in test_parsect_dataset, test_engine, and other places. Write common fixtures that can be used by the different test scripts.
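
A minimal sketch of a shared pytest fixture; the file format and names are illustrative:

    # conftest.py
    import pytest

    @pytest.fixture(scope="session")
    def classification_data_file(tmp_path_factory):
        # one small dataset file, created once and reused by every test
        data_file = tmp_path_factory.mktemp("data") / "train.txt"
        data_file.write_text("some text###label1\nmore text###label2\n")
        return str(data_file)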
