abhinavkashyap / sciwing
SciWING is a modern toolkit for scientific document processing from WING-NUS
Home Page: https://www.sciwing.io
License: MIT License
I see that this was included in the code (modules/elmo_lstm_encoder.py):
@deprecated(
    reason="ElmoEmbedder will be deprecated. "
    "Please use concat embedder and lstm 2 vec module to achieve the same thing. "
    "This will be removed in version 1",
    version=0.1,
)
It is still being imported in the current version of sciwing:
from sciwing.modules.embedders.elmo_embedder import ElmoEmbedder
which gives a module-not-found error when called:
ModuleNotFoundError: No module named 'allennlp.commands.elmo'
Since your code was already considering the deprecation of ElmoEmbedder, the fix seems to be to change these lines in modules/embedders/bow_elmo_embedder.py:
Line 2:
from allennlp.commands.elmo import ElmoEmbedder
to:
from sciwing.modules.embedders.elmo_embedder import ElmoEmbedder
Line 71:
self.elmo = ElmoEmbedder(cuda_device=self.cuda_device_id)
to:
self.elmo = ElmoEmbedder()
After making that correction, the following error comes up:
RuntimeError: Error(s) in loading state_dict for RnnSeqCrfTagger: Missing key(s) in state_dict: "rnn2seqencoder.embedder.embedder_elmo.elmo.elmo._elmo_lstm._token_embedder._char_embedding_weights", "rnn2seqencoder.embedder.embedder_elmo.elmo.elmo._elmo_lstm._token_embedder.char_conv_0.weight", "rnn2seqencoder.embedder.embedder_elmo.elmo.elmo._elmo_lstm._token_embedder.char_conv_0.bias", "rnn2seqencoder.embedder.embedder_elmo.elmo.elmo._elmo_lstm._token_embedder.char_conv_1.weight", "rnn2seqencoder.embedder.embedder_elmo.elmo.elmo._elmo_lstm._token_embedder.char_conv_1.bias", "rnn2seqencoder.embedder.embedder_elmo.elmo.elmo._elmo_lstm._token_embedder.char_conv_2.weight", "rnn2seqencoder.embedder.embedder_elmo.elmo.elmo._elmo_lstm._token_embedder.char_conv_2.bias", "rnn2seqencoder.embedder.embedder_elmo.elmo.elmo._elmo_lstm._token_embedder.char_conv_3.weight", "rnn2seqencoder.embedder.embedder_elmo.elmo.elmo._elmo_lstm._token_embedder.char_conv_3.bias", "rnn2seqencoder.embedder.embedder_elmo.elmo.elmo._elmo_lstm._token_embedder.char_conv_4.weight", "rnn2seqencoder.embedder.embedder_elmo.elmo.elmo._elmo_lstm._token_embedder.char_conv_4.bias", "rnn2seqencoder.embedder.embedder_elmo.elmo.elmo._elmo_lstm._token_embedder.char_conv_5.weight", "rnn2seqencoder.embedder.embedder_elmo.elmo.elmo._elmo_lstm._token_embedder.char_conv_5.bias", "rnn2seqencoder.embedder.embedder_elmo.elmo.elmo._elmo_lstm._token_embedder.char_conv_6.weight", "rnn2seqencoder.embedder.embedder_elmo.elmo.elmo._elmo_lstm._token_embedder.char_conv_6.bias", "rnn2seqencoder.embedder.embedder_elmo.elmo.elmo._elmo_lstm._token_embedder._highways._layers.0.weight", "rnn2seqencoder.embedder.embedder_elmo.elmo.elmo._elmo_lstm._token_embedder._highways._layers.0.bias", "rnn2seqencoder.embedder.embedder_elmo.elmo.elmo._elmo_lstm._token_embedder._highways._layers.1.weight", "rnn2seqencoder.embedder.embedder_elmo.elmo.elmo._elmo_lstm._token_embedder._highways._layers.1.bias", 
"rnn2seqencoder.embedder.embedder_elmo.elmo.elmo._elmo_lstm._token_embedder._projection.weight", "rnn2seqencoder.embedder.embedder_elmo.elmo.elmo._elmo_lstm._token_embedder._projection.bias", "rnn2seqencoder.embedder.embedder_elmo.elmo.elmo._elmo_lstm._elmo_lstm.forward_layer_0.input_linearity.weight", "rnn2seqencoder.embedder.embedder_elmo.elmo.elmo._elmo_lstm._elmo_lstm.forward_layer_0.state_linearity.weight", "rnn2seqencoder.embedder.embedder_elmo.elmo.elmo._elmo_lstm._elmo_lstm.forward_layer_0.state_linearity.bias", "rnn2seqencoder.embedder.embedder_elmo.elmo.elmo._elmo_lstm._elmo_lstm.forward_layer_0.state_projection.weight", "rnn2seqencoder.embedder.embedder_elmo.elmo.elmo._elmo_lstm._elmo_lstm.backward_layer_0.input_linearity.weight", "rnn2seqencoder.embedder.embedder_elmo.elmo.elmo._elmo_lstm._elmo_lstm.backward_layer_0.state_linearity.weight", "rnn2seqencoder.embedder.embedder_elmo.elmo.elmo._elmo_lstm._elmo_lstm.backward_layer_0.state_linearity.bias", "rnn2seqencoder.embedder.embedder_elmo.elmo.elmo._elmo_lstm._elmo_lstm.backward_layer_0.state_projection.weight", "rnn2seqencoder.embedder.embedder_elmo.elmo.elmo._elmo_lstm._elmo_lstm.forward_layer_1.input_linearity.weight", "rnn2seqencoder.embedder.embedder_elmo.elmo.elmo._elmo_lstm._elmo_lstm.forward_layer_1.state_linearity.weight", "rnn2seqencoder.embedder.embedder_elmo.elmo.elmo._elmo_lstm._elmo_lstm.forward_layer_1.state_linearity.bias", "rnn2seqencoder.embedder.embedder_elmo.elmo.elmo._elmo_lstm._elmo_lstm.forward_layer_1.state_projection.weight", "rnn2seqencoder.embedder.embedder_elmo.elmo.elmo._elmo_lstm._elmo_lstm.backward_layer_1.input_linearity.weight", "rnn2seqencoder.embedder.embedder_elmo.elmo.elmo._elmo_lstm._elmo_lstm.backward_layer_1.state_linearity.weight", "rnn2seqencoder.embedder.embedder_elmo.elmo.elmo._elmo_lstm._elmo_lstm.backward_layer_1.state_linearity.bias", "rnn2seqencoder.embedder.embedder_elmo.elmo.elmo._elmo_lstm._elmo_lstm.backward_layer_1.state_projection.weight", 
"rnn2seqencoder.embedder.embedder_elmo.elmo.elmo.scalar_mix_0.gamma", "rnn2seqencoder.embedder.embedder_elmo.elmo.elmo.scalar_mix_0.scalar_parameters.0", "rnn2seqencoder.embedder.embedder_elmo.elmo.elmo.scalar_mix_0.scalar_parameters.1", "rnn2seqencoder.embedder.embedder_elmo.elmo.elmo.scalar_mix_0.scalar_parameters.2".
I guess elmo_embedder.py does not generate the embedder in the way needed, or it was not intended to be used the way I tried to use it to work around the deprecated ElmoEmbedder module in AllenNLP.
Some tasks have very small datasets and require 10-fold CV, where the average performance over the 10 folds is reported. Right now, this has to be done manually: the Engine is set up 10 times, 10 experiments are run, and the average of the 10 experiments is reported.
Should there be a way to support 10-fold CV for certain tasks? A wrapper around the Engine where the cross-validation is handled automatically. This is open for discussion.
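Such a wrapper could be sketched in plain Python, independent of the Engine internals; `run_experiment` is a hypothetical callable that trains and evaluates on the given index split, and `k_fold_indices` / `cross_validate` are illustrative names:

```python
import random
from statistics import mean

def k_fold_indices(n_examples, k=10, seed=42):
    """Split range(n_examples) into k disjoint, shuffled folds."""
    indices = list(range(n_examples))
    random.Random(seed).shuffle(indices)
    fold_size, remainder = divmod(n_examples, k)
    folds, start = [], 0
    for i in range(k):
        end = start + fold_size + (1 if i < remainder else 0)
        folds.append(indices[start:end])
        start = end
    return folds

def cross_validate(run_experiment, n_examples, k=10):
    """run_experiment(train_idx, test_idx) -> metric; returns the mean over k folds."""
    folds = k_fold_indices(n_examples, k)
    scores = []
    for i, test_idx in enumerate(folds):
        # Train on every fold except the held-out one.
        train_idx = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        scores.append(run_experiment(train_idx, test_idx))
    return mean(scores)
```

The Engine setup and teardown would live inside `run_experiment`, so the wrapper itself stays agnostic to the model being trained.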
ParsectDataset and GenericSectDataset share some common functionality, and so will many classification datasets. There should be an abstract class implementation for these, with a few concretely implemented methods and some compulsory methods.
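One possible shape, using Python's abc module; the class and method names here are illustrative, not sciwing's actual API:

```python
from abc import ABC, abstractmethod

class BaseClassificationDataset(ABC):
    """Shared skeleton for classification datasets (hypothetical names)."""

    def __init__(self, lines, labels):
        self.lines = lines
        self.labels = labels

    def __len__(self):
        return len(self.lines)

    def get_stats(self):
        # Concrete helper shared by all subclasses: label frequency counts.
        counts = {}
        for label in self.labels:
            counts[label] = counts.get(label, 0) + 1
        return counts

    @abstractmethod
    def get_lines_labels(self, filename):
        """Compulsory: each dataset defines how raw files map to (lines, labels)."""
        ...
```

Subclasses like ParsectDataset would then only implement the parts that genuinely differ, such as `get_lines_labels`.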
There are different classification datasets that are part of the repo now. A CLI to explore the different datasets would be a nice feature to have. Get stats is already part of the interface.
The CLI should
ParsectDataset uses the default WordTokenizer class. There should be a way to provide a custom tokenizer.
I am not able to get sciwing to run properly on Google Colab.
I suspect that it may be due to unclear dependencies during installation.
Here is what I did:
! pip install sciwing
this works (with warnings) but
from sciwing.models.neural_parscit import NeuralParscit
gives the error
ModuleNotFoundError: No module named 'allennlp.commands.elmo'
Following issue #22, I installed allennlp==0.9.0 before installing sciwing:
! pip install allennlp==0.9.0
but now it gives a spacy error
KeyError: 'PUNCTSIDE_FIN'
Installing the version given in requirements.txt (or a newer version of spacy) did not help.
Hello all:
A suggestion for an improvement in NeuralParscit is to return a dictionary grouping the tokens by label. For example:
After parsing the reference:
"Calzolari, N. (1982) Towards the organization of lexical definitions on a database structure. In E. Hajicova (Ed.), COLING '82 Abstracts, Charles University, Prague, pp.61-64."
the result would be:
'author author date title title title title title title title title title title editor editor editor editor booktitle booktitle booktitle editor institution location pages'
However, it could rather return a dictionary like this:
{'author': ['Calzolari,', 'N.'], 'date': ['(1982)'], 'title': ['Towards', 'the', 'organization', 'of', 'lexical', 'definitions', 'on', 'a', 'database', 'structure.'], 'editor': ['In', 'E.', 'Hajicova', '(Ed.),', 'Charles'], 'booktitle': ['COLING', "'82", 'Abstracts,'], 'institution': ['University,'], 'location': ['Prague,'], 'pages': ['pp.61-64.']}
The dictionary above could later be used to detokenize the lists and get something like:
{'author': 'Calzolari, N.',
'date': '(1982)',
'title': 'Towards the organization of lexical definitions on a database structure.',
'editor': 'In E. Hajicova (Ed.), Charles',
'booktitle': "COLING '82 Abstracts,",
'institution': 'University,',
'location': 'Prague,',
'pages': 'pp.61-64.'}
The code to get something like the above would be:
from sacremoses import MosesDetokenizer

md = MosesDetokenizer(lang="en")
tokens = reference.split(" ")
labels = neural_parscit.predict_for_text(text=reference, show=False).split(" ")

result_dict = {}
for token, token_label in zip(tokens, labels):
    result_dict.setdefault(token_label, []).append(token)
result_dict = {k: md.detokenize(v) for k, v in result_dict.items()}
The detokenizer used is MosesDetokenizer, from the sacremoses library.
Hi, is there a way to select another language for analyzing documents?
Do pre-trained models exist in other languages? I am interested in Spanish and French.
I am using json-logging for logging information to files, and it is a mess in the file engine.py. I came across loguru, which is a slick library for logging. There are other logging libraries like zerolog. Evaluate the different libraries and change the code to include something that is easier.
There are two different classes, WordEmbLoader and CharEmbLoader. This is not needed. We can have just one embedding loader: the EmbeddingLoader abstraction does not care whether it is a word or a character. Provide the appropriate embedding type, and the tokens will be instantiated with appropriate values.
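A minimal sketch of such a single loader; the `embedding_type` strings, dimensions, and zero-vector placeholder here are illustrative, not sciwing's actual registry or file-loading logic:

```python
class EmbeddingLoader:
    """One loader for word and character embeddings alike (sketch)."""

    SUPPORTED = {"glove_6B_50": 50, "char_embedding_random": 25}

    def __init__(self, embedding_type: str):
        if embedding_type not in self.SUPPORTED:
            raise ValueError(f"unknown embedding type: {embedding_type}")
        self.embedding_type = embedding_type
        self.dimension = self.SUPPORTED[embedding_type]

    def load(self, tokens):
        # The loader does not care whether tokens are words or characters;
        # it maps every token to a vector of the configured dimension.
        # (Real code would read pretrained vectors from disk here.)
        return {token: [0.0] * self.dimension for token in tokens}
```

Callers that previously chose between WordEmbLoader and CharEmbLoader would instead pass the embedding type string and get the same interface back.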
Hi,
When I try to run:
uvicorn api:app --reload
I get this message:
import sciwing.api.conf as config
ModuleNotFoundError: No module named 'sciwing'
I am running it from the sciwing/api folder.
The PrecisionRecallFMeasure has a long, unwieldy calc_metric function. Worse, the functionality within it is repeated in different methods. Decompose the calc_metric method into multiple functions, including some private functions that are reusable within the class.
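One possible shape for the decomposition, with hypothetical class and method names; each private helper does one thing and is reused for every class label:

```python
class PrecisionRecallFMeasureSketch:
    """Sketch of calc_metric split into small reusable private helpers."""

    def calc_metric(self, predictions, gold):
        classes = set(gold) | set(predictions)
        return {c: self._prf_for_class(predictions, gold, c) for c in classes}

    def _prf_for_class(self, predictions, gold, cls):
        # Count confusion-matrix cells for this one class.
        tp = sum(p == g == cls for p, g in zip(predictions, gold))
        fp = sum(p == cls and g != cls for p, g in zip(predictions, gold))
        fn = sum(g == cls and p != cls for p, g in zip(predictions, gold))
        precision = self._safe_divide(tp, tp + fp)
        recall = self._safe_divide(tp, tp + fn)
        fscore = self._safe_divide(2 * precision * recall, precision + recall)
        return {"precision": precision, "recall": recall, "fscore": fscore}

    @staticmethod
    def _safe_divide(numerator, denominator):
        # Shared guard against zero denominators, instead of repeating it inline.
        return numerator / denominator if denominator else 0.0
```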
Installing the pip version results in a lot of errors. The packages in requirements.txt are outdated. In particular, torch is listed at 1.5, which is not available for Python 3.10.
The padding of the input to the desired length can be done in a more elegant way.
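For instance, a compact helper along these lines; the marker tokens `<pad>`, `<SOS>`, and `<EOS>` are illustrative defaults, not necessarily sciwing's actual constants:

```python
def pad_sequence(tokens, max_length, pad_token="<pad>",
                 start_token="<SOS>", end_token="<EOS>"):
    """Truncate or pad tokens to exactly max_length items,
    wrapping them in start/end markers."""
    clipped = tokens[: max_length - 2]          # leave room for the markers
    padded = [start_token] + clipped + [end_token]
    padded += [pad_token] * (max_length - len(padded))
    return padded
```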
The bot created this issue to inform you that pyup.io has been set up on this repo.
Once you have closed it, the bot will open pull requests for updates as soon as they are available.
This in turn causes inference with SectLabel to crash.
The PrecisionRecallFMeasure does not report micro-precision and macro-precision scores. Add calculation of micro-precision and macro-precision @abhinavkashyap.
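For reference, micro-precision pools true and false positives across all classes, while macro-precision averages the per-class precisions; a self-contained sketch (the function name is hypothetical):

```python
def micro_macro_precision(predictions, gold, classes):
    """Return (micro_precision, macro_precision) over the given classes."""
    per_class, tp_total, fp_total = [], 0, 0
    for cls in classes:
        tp = sum(p == g == cls for p, g in zip(predictions, gold))
        fp = sum(p == cls and g != cls for p, g in zip(predictions, gold))
        per_class.append(tp / (tp + fp) if tp + fp else 0.0)
        tp_total += tp
        fp_total += fp
    # Micro: one global ratio; macro: unweighted mean of per-class ratios.
    micro = tp_total / (tp_total + fp_total) if tp_total + fp_total else 0.0
    macro = sum(per_class) / len(per_class)
    return micro, macro
```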
A flag called return_instances was used before a refactoring. The flag is lying around without much use. Refactor and remove it.
Is there any possibility of passing a PDF document as input to sciwing, rather than a reference string from within that PDF?
The script files to run experiments are becoming unwieldy. We can look at jsonnet or other data-templating languages that help in setting up experiments. I think even AllenNLP uses jsonnet.
word_vocab, char_vocab for free when the user requests it, using decorators

From @honhaochen's report, there could be an issue where the inference code runs into memory problems, because it does not buffer reads and writes of test instances. Instead, it currently loads the entire data into memory, does inference, and then writes to disk. A better solution is to read smaller batches of data and flush the inference output to disk, so that memory use stays more constant.
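A sketch of the batched read-and-flush loop described above; `predict_batch` is a hypothetical callable standing in for the model's batch inference:

```python
def run_inference_buffered(infile, outfile, predict_batch, batch_size=64):
    """Stream input lines in fixed-size batches instead of loading the
    whole file, flushing predictions to disk as each batch completes."""
    batch = []
    with open(infile) as fin, open(outfile, "w") as fout:
        for line in fin:
            batch.append(line.rstrip("\n"))
            if len(batch) == batch_size:
                fout.write("\n".join(predict_batch(batch)) + "\n")
                fout.flush()            # keep memory use roughly constant
                batch = []
        if batch:                       # leftover partial batch
            fout.write("\n".join(predict_batch(batch)) + "\n")
            fout.flush()
```

Peak memory is then bounded by `batch_size` instances rather than by the size of the test file.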
There are a million places where different datasets are created. For example, a dataset is created in test_parsect_dataset, test_engine, and other places. Write common fixtures that can be used by different test scripts.
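A shared pytest fixture along those lines might look like this; the `###`-separated file format and all names are illustrative, not sciwing's actual ones:

```python
import pytest

def make_dummy_dataset(path):
    """Write a tiny labelled dataset to disk and parse it back."""
    path.write_text("introduction###line one\nrelated_work###line two\n")
    lines, labels = [], []
    for row in path.read_text().splitlines():
        label, text = row.split("###")
        lines.append(text)
        labels.append(label)
    return lines, labels

@pytest.fixture(scope="session")
def classification_dataset(tmp_path_factory):
    # The same fixture can then back test_parsect_dataset, test_engine, etc.,
    # instead of each test module building its own dataset.
    return make_dummy_dataset(tmp_path_factory.mktemp("data") / "dummy.txt")
```

Placing this in a top-level `conftest.py` would make it available to every test module without imports.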