abhinavkashyap / sciwing
SciWING is a modern toolkit for scientific document processing from WING-NUS
Home Page: https://www.sciwing.io
License: MIT License
I see that this was included in the code (modules/elmo_lstm_encoder.py):
@deprecated(
    reason="ElmoEmbedder will be deprecated. "
    "Please use concat embedder and lstm 2 vec module to achieve the same thing. "
    "This will be removed in version 1",
    version=0.1,
)
It is still being imported in the current version of sciwing:
from sciwing.modules.embedders.elmo_embedder import ElmoEmbedder
which gives a module-not-found error when called:
ModuleNotFoundError: No module named 'allennlp.commands.elmo'
Since your code was already considering the deprecation of ElmoEmbedder, the fix seems to be to change these lines in modules/embedders/bow_elmo_embedder.py:
Line 2:
from allennlp.commands.elmo import ElmoEmbedder
to:
from sciwing.modules.embedders.elmo_embedder import ElmoEmbedder
Line 71:
self.elmo = ElmoEmbedder(cuda_device=self.cuda_device_id)
to:
self.elmo = ElmoEmbedder()
After making that correction, the following error comes up:
RuntimeError: Error(s) in loading state_dict for RnnSeqCrfTagger: Missing key(s) in state_dict: "rnn2seqencoder.embedder.embedder_elmo.elmo.elmo._elmo_lstm._token_embedder._char_embedding_weights", "rnn2seqencoder.embedder.embedder_elmo.elmo.elmo._elmo_lstm._token_embedder.char_conv_0.weight", "rnn2seqencoder.embedder.embedder_elmo.elmo.elmo._elmo_lstm._token_embedder.char_conv_0.bias", "rnn2seqencoder.embedder.embedder_elmo.elmo.elmo._elmo_lstm._token_embedder.char_conv_1.weight", "rnn2seqencoder.embedder.embedder_elmo.elmo.elmo._elmo_lstm._token_embedder.char_conv_1.bias", "rnn2seqencoder.embedder.embedder_elmo.elmo.elmo._elmo_lstm._token_embedder.char_conv_2.weight", "rnn2seqencoder.embedder.embedder_elmo.elmo.elmo._elmo_lstm._token_embedder.char_conv_2.bias", "rnn2seqencoder.embedder.embedder_elmo.elmo.elmo._elmo_lstm._token_embedder.char_conv_3.weight", "rnn2seqencoder.embedder.embedder_elmo.elmo.elmo._elmo_lstm._token_embedder.char_conv_3.bias", "rnn2seqencoder.embedder.embedder_elmo.elmo.elmo._elmo_lstm._token_embedder.char_conv_4.weight", "rnn2seqencoder.embedder.embedder_elmo.elmo.elmo._elmo_lstm._token_embedder.char_conv_4.bias", "rnn2seqencoder.embedder.embedder_elmo.elmo.elmo._elmo_lstm._token_embedder.char_conv_5.weight", "rnn2seqencoder.embedder.embedder_elmo.elmo.elmo._elmo_lstm._token_embedder.char_conv_5.bias", "rnn2seqencoder.embedder.embedder_elmo.elmo.elmo._elmo_lstm._token_embedder.char_conv_6.weight", "rnn2seqencoder.embedder.embedder_elmo.elmo.elmo._elmo_lstm._token_embedder.char_conv_6.bias", "rnn2seqencoder.embedder.embedder_elmo.elmo.elmo._elmo_lstm._token_embedder._highways._layers.0.weight", "rnn2seqencoder.embedder.embedder_elmo.elmo.elmo._elmo_lstm._token_embedder._highways._layers.0.bias", "rnn2seqencoder.embedder.embedder_elmo.elmo.elmo._elmo_lstm._token_embedder._highways._layers.1.weight", "rnn2seqencoder.embedder.embedder_elmo.elmo.elmo._elmo_lstm._token_embedder._highways._layers.1.bias", 
"rnn2seqencoder.embedder.embedder_elmo.elmo.elmo._elmo_lstm._token_embedder._projection.weight", "rnn2seqencoder.embedder.embedder_elmo.elmo.elmo._elmo_lstm._token_embedder._projection.bias", "rnn2seqencoder.embedder.embedder_elmo.elmo.elmo._elmo_lstm._elmo_lstm.forward_layer_0.input_linearity.weight", "rnn2seqencoder.embedder.embedder_elmo.elmo.elmo._elmo_lstm._elmo_lstm.forward_layer_0.state_linearity.weight", "rnn2seqencoder.embedder.embedder_elmo.elmo.elmo._elmo_lstm._elmo_lstm.forward_layer_0.state_linearity.bias", "rnn2seqencoder.embedder.embedder_elmo.elmo.elmo._elmo_lstm._elmo_lstm.forward_layer_0.state_projection.weight", "rnn2seqencoder.embedder.embedder_elmo.elmo.elmo._elmo_lstm._elmo_lstm.backward_layer_0.input_linearity.weight", "rnn2seqencoder.embedder.embedder_elmo.elmo.elmo._elmo_lstm._elmo_lstm.backward_layer_0.state_linearity.weight", "rnn2seqencoder.embedder.embedder_elmo.elmo.elmo._elmo_lstm._elmo_lstm.backward_layer_0.state_linearity.bias", "rnn2seqencoder.embedder.embedder_elmo.elmo.elmo._elmo_lstm._elmo_lstm.backward_layer_0.state_projection.weight", "rnn2seqencoder.embedder.embedder_elmo.elmo.elmo._elmo_lstm._elmo_lstm.forward_layer_1.input_linearity.weight", "rnn2seqencoder.embedder.embedder_elmo.elmo.elmo._elmo_lstm._elmo_lstm.forward_layer_1.state_linearity.weight", "rnn2seqencoder.embedder.embedder_elmo.elmo.elmo._elmo_lstm._elmo_lstm.forward_layer_1.state_linearity.bias", "rnn2seqencoder.embedder.embedder_elmo.elmo.elmo._elmo_lstm._elmo_lstm.forward_layer_1.state_projection.weight", "rnn2seqencoder.embedder.embedder_elmo.elmo.elmo._elmo_lstm._elmo_lstm.backward_layer_1.input_linearity.weight", "rnn2seqencoder.embedder.embedder_elmo.elmo.elmo._elmo_lstm._elmo_lstm.backward_layer_1.state_linearity.weight", "rnn2seqencoder.embedder.embedder_elmo.elmo.elmo._elmo_lstm._elmo_lstm.backward_layer_1.state_linearity.bias", "rnn2seqencoder.embedder.embedder_elmo.elmo.elmo._elmo_lstm._elmo_lstm.backward_layer_1.state_projection.weight", 
"rnn2seqencoder.embedder.embedder_elmo.elmo.elmo.scalar_mix_0.gamma", "rnn2seqencoder.embedder.embedder_elmo.elmo.elmo.scalar_mix_0.scalar_parameters.0", "rnn2seqencoder.embedder.embedder_elmo.elmo.elmo.scalar_mix_0.scalar_parameters.1", "rnn2seqencoder.embedder.embedder_elmo.elmo.elmo.scalar_mix_0.scalar_parameters.2".
I guess elmo_embedder.py does not generate the embedder in the way needed, or it was not intended to be used the way I tried to use it to work around the deprecated ElmoEmbedder module in AllenNLP.
Some tasks have very small datasets and require 10-fold CV, where the average performance over the 10 folds is reported. Right now, this has to be done manually: the Engine is set up 10 times, 10 experiments are run, and the average of the 10 experiments is reported.
Should there be a way to support 10-fold CV for certain tasks? A wrapper around the Engine where the cross-validation is handled automatically. This is open for discussion.
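Such a wrapper could be sketched in plain Python, independent of the Engine internals; `run_experiment` is a hypothetical callable that trains and evaluates on the given index split, and `k_fold_indices` / `cross_validate` are illustrative names:

```python
import random
from statistics import mean

def k_fold_indices(n_examples, k=10, seed=42):
    """Split range(n_examples) into k disjoint, shuffled folds."""
    indices = list(range(n_examples))
    random.Random(seed).shuffle(indices)
    fold_size, remainder = divmod(n_examples, k)
    folds, start = [], 0
    for i in range(k):
        end = start + fold_size + (1 if i < remainder else 0)
        folds.append(indices[start:end])
        start = end
    return folds

def cross_validate(run_experiment, n_examples, k=10):
    """run_experiment(train_idx, test_idx) -> metric; returns the mean over k folds."""
    folds = k_fold_indices(n_examples, k)
    scores = []
    for i, test_idx in enumerate(folds):
        # Train on every fold except the held-out one.
        train_idx = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        scores.append(run_experiment(train_idx, test_idx))
    return mean(scores)
```

The Engine setup and teardown would live inside `run_experiment`, so the wrapper itself stays agnostic to the model being trained.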
ParsectDataset and GenericSectDataset share some common functionality, and so will many classification datasets. There should be an abstract class implementation for these, with a few concretely implemented methods and some compulsory methods.
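One possible shape, using Python's abc module; the class and method names here are illustrative, not sciwing's actual API:

```python
from abc import ABC, abstractmethod

class BaseClassificationDataset(ABC):
    """Shared skeleton for classification datasets (hypothetical names)."""

    def __init__(self, lines, labels):
        self.lines = lines
        self.labels = labels

    def __len__(self):
        return len(self.lines)

    def get_stats(self):
        # Concrete helper shared by all subclasses: label frequency counts.
        counts = {}
        for label in self.labels:
            counts[label] = counts.get(label, 0) + 1
        return counts

    @abstractmethod
    def get_lines_labels(self, filename):
        """Compulsory: each dataset defines how raw files map to (lines, labels)."""
        ...
```

Subclasses like ParsectDataset would then only implement the parts that genuinely differ, such as `get_lines_labels`.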
There are different classification datasets that are part of the repo now. A CLI to explore the different datasets would be a nice feature to have. Get stats is already part of the interface.
The CLI should
ParsectDataset uses the default WordTokenizer class. There should be a way to provide a custom tokenizer.
I am not able to get sciwing to run properly on Google Colab.
I suspect that it may be due to unclear dependencies during installation.
Here is what I did:
! pip install sciwing
this works (with warnings) but
from sciwing.models.neural_parscit import NeuralParscit
gives the error
ModuleNotFoundError: No module named 'allennlp.commands.elmo'
Following issue #22, I installed allennlp==0.9.0 before installing sciwing:
! pip install allennlp==0.9.0
but now it gives a spacy error
KeyError: 'PUNCTSIDE_FIN'
Installing the version given in requirements.txt (or a newer version of spacy) did not help.
Hello all:
A suggestion for an improvement in NeuralParscit is to return a dictionary grouping the tokens by label. For example:
After parsing the reference:
"Calzolari, N. (1982) Towards the organization of lexical definitions on a database structure. In E. Hajicova (Ed.), COLING '82 Abstracts, Charles University, Prague, pp.61-64."
the result would be:
'author author date title title title title title title title title title title editor editor editor editor booktitle booktitle booktitle editor institution location pages'
However, it could rather return a dictionary like this:
{'author': ['Calzolari,', 'N.'], 'date': ['(1982)'], 'title': ['Towards', 'the', 'organization', 'of', 'lexical', 'definitions', 'on', 'a', 'database', 'structure.'], 'editor': ['In', 'E.', 'Hajicova', '(Ed.),', 'Charles'], 'booktitle': ['COLING', "'82", 'Abstracts,'], 'institution': ['University,'], 'location': ['Prague,'], 'pages': ['pp.61-64.']}
The dictionary above could later be used to detokenize the lists and get something like:
{'author': 'Calzolari, N.',
'date': '(1982)',
'title': 'Towards the organization of lexical definitions on a database structure.',
'editor': 'In E. Hajicova (Ed.), Charles',
'booktitle': "COLING '82 Abstracts,",
'institution': 'University,',
'location': 'Prague,',
'pages': 'pp.61-64.'}
The code to get something like the above would be:
from sacremoses import MosesDetokenizer

md = MosesDetokenizer(lang="en")
tokens = reference.split(" ")
labels = neural_parscit.predict_for_text(text=reference, show=False).split(" ")

result_dict = {}
for token, token_label in zip(tokens, labels):
    result_dict.setdefault(token_label, []).append(token)
result_dict = {k: md.detokenize(v) for k, v in result_dict.items()}
The detokenizer used is MosesDetokenizer, from the sacremoses library.
Hi, is there a way to select another language for analyzing documents?
Do pre-trained models exist in other languages? I am interested in Spanish and French.
I am using json-logging for logging information to files, and it is a mess in the file engine.py. I came across loguru, which is a slick library for logging. There are other logging libraries like zerolog. Evaluate the different libraries and change the code to include something that is easier.
There are two different classes, WordEmbLoader and CharEmbLoader. This is not needed. We can have just one embedding loader: the EmbeddingLoader abstraction does not care whether it is a word or a character. Provide the appropriate embedding type, and the tokens will be instantiated with appropriate values.
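A minimal sketch of such a single loader; the `embedding_type` strings, dimensions, and zero-vector placeholder here are illustrative, not sciwing's actual registry or file-loading logic:

```python
class EmbeddingLoader:
    """One loader for word and character embeddings alike (sketch)."""

    SUPPORTED = {"glove_6B_50": 50, "char_embedding_random": 25}

    def __init__(self, embedding_type: str):
        if embedding_type not in self.SUPPORTED:
            raise ValueError(f"unknown embedding type: {embedding_type}")
        self.embedding_type = embedding_type
        self.dimension = self.SUPPORTED[embedding_type]

    def load(self, tokens):
        # The loader does not care whether tokens are words or characters;
        # it maps every token to a vector of the configured dimension.
        # (Real code would read pretrained vectors from disk here.)
        return {token: [0.0] * self.dimension for token in tokens}
```

Callers that previously chose between WordEmbLoader and CharEmbLoader would instead pass the embedding type string and get the same interface back.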
Hi,
When I try to run:
uvicorn api:app --reload
I get this message:
import sciwing.api.conf as config
ModuleNotFoundError: No module named 'sciwing'
I am running it from the sciwing/api folder.
The PrecisionRecallFMeasure has a long, unwieldy calc_metric function. Worse, the functionality within it is repeated in different methods. Decompose the calc_metric method into multiple functions, including some private functions that are reusable within the class.
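One possible shape for the decomposition, with hypothetical class and method names; each private helper does one thing and is reused for every class label:

```python
class PrecisionRecallFMeasureSketch:
    """Sketch of calc_metric split into small reusable private helpers."""

    def calc_metric(self, predictions, gold):
        classes = set(gold) | set(predictions)
        return {c: self._prf_for_class(predictions, gold, c) for c in classes}

    def _prf_for_class(self, predictions, gold, cls):
        # Count confusion-matrix cells for this one class.
        tp = sum(p == g == cls for p, g in zip(predictions, gold))
        fp = sum(p == cls and g != cls for p, g in zip(predictions, gold))
        fn = sum(g == cls and p != cls for p, g in zip(predictions, gold))
        precision = self._safe_divide(tp, tp + fp)
        recall = self._safe_divide(tp, tp + fn)
        fscore = self._safe_divide(2 * precision * recall, precision + recall)
        return {"precision": precision, "recall": recall, "fscore": fscore}

    @staticmethod
    def _safe_divide(numerator, denominator):
        # Shared guard against zero denominators, instead of repeating it inline.
        return numerator / denominator if denominator else 0.0
```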
Installing the pip version results in a lot of errors. The packages in requirements.txt are outdated. In particular, torch is listed at 1.5, which is not available for Python 3.10.
The padding of the input to the desired length can be done in a more elegant way.
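For instance, a compact helper along these lines; the marker tokens `<pad>`, `<SOS>`, and `<EOS>` are illustrative defaults, not necessarily sciwing's actual constants:

```python
def pad_sequence(tokens, max_length, pad_token="<pad>",
                 start_token="<SOS>", end_token="<EOS>"):
    """Truncate or pad tokens to exactly max_length items,
    wrapping them in start/end markers."""
    clipped = tokens[: max_length - 2]          # leave room for the markers
    padded = [start_token] + clipped + [end_token]
    padded += [pad_token] * (max_length - len(padded))
    return padded
```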
The bot created this issue to inform you that pyup.io has been set up on this repo.
Once you have closed it, the bot will open pull requests for updates as soon as they are available.
This in turn causes inference with SectLabel to crash.
The PrecisionRecallFMeasure does not report micro-precision and macro-precision scores. Add calculation of micro-precision and macro-precision @abhinavkashyap.
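For reference, micro-precision pools true and false positives across all classes, while macro-precision averages the per-class precisions; a self-contained sketch (the function name is hypothetical):

```python
def micro_macro_precision(predictions, gold, classes):
    """Return (micro_precision, macro_precision) over the given classes."""
    per_class, tp_total, fp_total = [], 0, 0
    for cls in classes:
        tp = sum(p == g == cls for p, g in zip(predictions, gold))
        fp = sum(p == cls and g != cls for p, g in zip(predictions, gold))
        per_class.append(tp / (tp + fp) if tp + fp else 0.0)
        tp_total += tp
        fp_total += fp
    # Micro: one global ratio; macro: unweighted mean of per-class ratios.
    micro = tp_total / (tp_total + fp_total) if tp_total + fp_total else 0.0
    macro = sum(per_class) / len(per_class)
    return micro, macro
```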
A flag called return_instances was used before a refactoring. The flag is lying around without much use. Refactor and remove it.
Is there any possibility of passing a PDF document as input to sciwing, rather than a reference string from within that PDF?
The script files to run experiments are becoming unwieldy. We can look at jsonnet or other data-templating languages that help in setting up experiments. I think even AllenNLP uses jsonnet.
word_vocab, char_vocab for free when the user requests it, using decorators

From @honhaochen's report, there could be an issue where the inference code runs into memory problems, because it does not buffer reads and writes of test instances. Instead, it currently loads the entire data into memory, does inference, and then writes to disk. A better solution is to read smaller batches of data and flush the inference output to disk, so that memory use stays more constant.
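A sketch of the batched read-and-flush loop described above; `predict_batch` is a hypothetical callable standing in for the model's batch inference:

```python
def run_inference_buffered(infile, outfile, predict_batch, batch_size=64):
    """Stream input lines in fixed-size batches instead of loading the
    whole file, flushing predictions to disk as each batch completes."""
    batch = []
    with open(infile) as fin, open(outfile, "w") as fout:
        for line in fin:
            batch.append(line.rstrip("\n"))
            if len(batch) == batch_size:
                fout.write("\n".join(predict_batch(batch)) + "\n")
                fout.flush()            # keep memory use roughly constant
                batch = []
        if batch:                       # leftover partial batch
            fout.write("\n".join(predict_batch(batch)) + "\n")
            fout.flush()
```

Peak memory is then bounded by `batch_size` instances rather than by the size of the test file.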
There are a million places where different datasets are created. For example, a dataset is created in test_parsect_dataset, test_engine, and other places. Write common fixtures that can be used by different test scripts.
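A shared pytest fixture along those lines might look like this; the `###`-separated file format and all names are illustrative, not sciwing's actual ones:

```python
import pytest

def make_dummy_dataset(path):
    """Write a tiny labelled dataset to disk and parse it back."""
    path.write_text("introduction###line one\nrelated_work###line two\n")
    lines, labels = [], []
    for row in path.read_text().splitlines():
        label, text = row.split("###")
        lines.append(text)
        labels.append(label)
    return lines, labels

@pytest.fixture(scope="session")
def classification_dataset(tmp_path_factory):
    # The same fixture can then back test_parsect_dataset, test_engine, etc.,
    # instead of each test module building its own dataset.
    return make_dummy_dataset(tmp_path_factory.mktemp("data") / "dummy.txt")
```

Placing this in a top-level `conftest.py` would make it available to every test module without imports.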