bionlplab / bioc

Data structures and code to read/write BioC XML and Json.

License: MIT License

Python 100.00%

bioc xml json reader writer bionlp


bioc's Issues

Brat parser: AssertionError: Illegal format: M

First of all, thank you for this incredibly useful library.

I am trying to parse a brat file of this resource but I get the following error:

[ins] In [26]: a2 = brat.load_ann("devel/PMC-3333881-05-MATERIALS_AND_METHODS.a1")
---------------------------------------------------------------------------
AssertionError                            Traceback (most recent call last)
Input In [26], in <module>
----> 1 a2 = brat.load_ann("devel/PMC-3333881-05-MATERIALS_AND_METHODS.a1")

File ~/.venv/bigbio/lib/python3.9/site-packages/bioc/brat/decoder.py:214, in load_ann(fp, docid)
    212     doc.add_annotation(loads_brat_note(line))
    213 if line[0] == 'A' or line[0] == 'M':
--> 214     doc.add_annotation(loads_brat_attribute(line))
    215 if line[0] == '*':
    216     doc.add_annotation(loads_brat_equiv(line))

File ~/.venv/bigbio/lib/python3.9/site-packages/bioc/brat/decoder.py:16, in loads_brat_attribute(s)
     12 """
     13 ID [tab] TYPE REFID [FLAG1 FLAG2 ...]
     14 """
     15 toks = s.split('\t')
---> 16 assert len(toks) == 2, 'Illegal format: %s' % s
     18 att = BratAttribute()
     19 att.id = toks[0]

AssertionError: Illegal format: M
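
The docstring shown in the traceback gives the expected attribute layout as `ID [tab] TYPE REFID [FLAG1 FLAG2 ...]`. As a rough sketch of a more tolerant approach, the helper below parses that layout and returns `None` for a line that does not match, instead of raising an assertion. `parse_brat_attribute` and `BratAttributeLine` are hypothetical names used for illustration only; they are not part of bioc.

```python
from typing import List, NamedTuple, Optional


class BratAttributeLine(NamedTuple):
    id: str
    type: str
    refid: str
    flags: List[str]


def parse_brat_attribute(line: str) -> Optional[BratAttributeLine]:
    """Parse 'ID<TAB>TYPE REFID [FLAG1 FLAG2 ...]'; return None if the line does not fit."""
    toks = line.rstrip('\r\n').split('\t')
    # Exactly one tab is expected, and the ID must start with 'A' or 'M'.
    if len(toks) != 2 or toks[0][:1] not in ('A', 'M'):
        return None
    fields = toks[1].split()
    # TYPE and REFID are mandatory; anything after them is treated as a flag.
    if len(fields) < 2:
        return None
    return BratAttributeLine(toks[0], fields[0], fields[1], fields[2:])
```

A caller could skip or log the lines for which this returns `None`, which makes it easier to see exactly which line in the `.a1` file is malformed.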

Cannot read 'annotation' in biocxml format

I'm writing a Python script to convert a biocxml file into a PubTator file.
I did not find an existing script for this, so I am writing one myself.

The BioC files were downloaded from:
https://ftp.ncbi.nlm.nih.gov/pub/lu/BioRED/BioRED.zip

I tried to read "Test.BioC.XML" in two ways:
1:
with open(fpath, 'r') as fp:
    collection = biocxml.load(fp)
    docs = collection.documents
2:
with biocxml.iterparse(fpath) as reader:
    collection_info = reader.get_collection_info()
    for doc in reader:
        pass  # per-document processing goes here

Strangely, all annotations are missing, but relations are parsed correctly.

Any idea why this happens?
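
One possible explanation, which the thread does not confirm, is that the `annotation` elements in the file sit under `passage` rather than directly under `document`. In that case they would show up in each passage's `annotations` list and not at document level. The sketch below uses only the standard library (not bioc) to collect annotations at any depth and format them as tab-separated PubTator-style lines. The sample XML and the function name `annotations_to_pubtator` are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Tiny made-up BioC document with one passage-level annotation.
SAMPLE = """<collection><source>example</source><date/><key/>
<document><id>123</id>
<passage><infon key="type">title</infon><offset>0</offset>
<text>BRCA1 and cancer</text>
<annotation id="0">
<infon key="identifier">672</infon>
<infon key="type">GeneOrGeneProduct</infon>
<location offset="0" length="5"/>
<text>BRCA1</text>
</annotation>
</passage></document></collection>"""


def annotations_to_pubtator(xml_text):
    """Return lines of the form: docid, start, end, text, type, identifier (tab-separated)."""
    root = ET.fromstring(xml_text)
    lines = []
    for doc in root.iter('document'):
        docid = doc.findtext('id') or ''
        # iter() descends into passages and sentences, so nesting depth does not matter.
        for ann in doc.iter('annotation'):
            infons = {i.get('key'): (i.text or '') for i in ann.findall('infon')}
            loc = ann.find('location')
            start = int(loc.get('offset'))
            end = start + int(loc.get('length'))
            lines.append('\t'.join([
                docid, str(start), str(end), ann.findtext('text') or '',
                infons.get('type', ''), infons.get('identifier', ''),
            ]))
    return lines
```

Comparing this count with what the bioc objects report at document level and at passage level should show where the annotations actually live.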

lxml dependency

First of all, great library. Thank you for your work.

I'm wondering whether there is a strict requirement for lxml==4.4.1, or whether it could be more flexible, such as lxml>=4.4.1. lxml does not seem to have introduced breaking changes (https://lxml.de/4.5/changes-4.5.0.html), and a looser pin would really help in projects with multiple dependencies that can conflict.

Incrementally decoding the BioC Json of `.tar.gz`-collection

PubMed Central provides its Open Access articles in the BioC JSON format (see API and Bulk Download). I downloaded one portion with
wget https://ftp.ncbi.nlm.nih.gov/pub/wilbur/BioC-PMC/PMC095XXXXX_json_ascii.tar.gz
and want to apply a filter document by document (to save memory). I tried the following code:

from tqdm import tqdm
import gzip
import io
import json

keyword = 'diabetes'
my_doi_list = []
path_file_PMC = '/content/PMC095XXXXX_json_ascii.tar.gz'
path_file_PMC_filtered = '/content/result'

with gzip.open(path_file_PMC, 'rb') as gz, open(path_file_PMC_filtered, 'wb') as f_out:
    f = io.BufferedReader(gz)
    for line in tqdm(f.readlines()):
        record = json.loads(line)
        # doi = record['documents'][0]['passages'][0]['infons']['article-id_doi']
        if keyword in record['documents'][0]['passages'][0]['text']: 
            # my_doi_list.append(doi)
            f_out.write(line)

But face an error:

JSONDecodeError: Expecting value: line 1 column 1 (char 0)

0%|          | 0/95046 [00:00<?, ?it/s]
---------------------------------------------------------------------------
JSONDecodeError                           Traceback (most recent call last)
<ipython-input-21-cc459cfaf959> in <module>
     19     # f = gz
     20     for line in tqdm(f.readlines()):
---> 21         record = json.loads(line)
     22         doi = record['documents'][0]['passages'][0]['infons']['article-id_doi']
     23         if keyword in record['documents'][0]['passages'][0]['text']:   # TODO: <<< change this to your filter

2 frames
/usr/lib/python3.7/json/__init__.py in loads(s, encoding, cls, object_hook, parse_float, parse_int, parse_constant, object_pairs_hook, **kw)
    346             parse_int is None and parse_float is None and
    347             parse_constant is None and object_pairs_hook is None and not kw):
--> 348         return _default_decoder.decode(s)
    349     if cls is None:
    350         cls = JSONDecoder

/usr/lib/python3.7/json/decoder.py in decode(self, s, _w)
    335 
    336         """
--> 337         obj, end = self.raw_decode(s, idx=_w(s, 0).end())
    338         end = _w(s, end).end()
    339         if end != len(s):

/usr/lib/python3.7/json/decoder.py in raw_decode(self, s, idx)
    353             obj, end = self.scan_once(s, idx)
    354         except StopIteration as err:
--> 355             raise JSONDecodeError("Expecting value", s, err.value) from None
    356         return obj, end

JSONDecodeError: Expecting value: line 1 column 1 (char 0)

Then I found your Python package and thought I could use one of the code examples provided here: https://bioc.readthedocs.io/en/latest/biocjson.html

However, I don't want to gunzip or untar the archive, and the code examples do not make clear which format the file at filename should have. Is it possible to use the functions of your package with a .tar.gz file as input, or do I need to decompress it first (without untarring)?
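
For context on the error above: `gzip.open` removes only the gzip layer, so what comes out is a raw tar stream whose first bytes are a tar header, not JSON. That matches `json.loads` failing at character 0. A minimal standard-library sketch that walks the archive member by member without unpacking it to disk is shown below. It assumes each archive member is a single JSON object with the `documents`/`passages` layout used in the question. `matching_members` is a hypothetical helper and does not use bioc.

```python
import json
import tarfile


def matching_members(path, keyword):
    """Yield (member name, parsed JSON) for members whose first passage contains keyword."""
    # Mode 'r|gz' reads the gzip-compressed tar as a sequential stream,
    # so only one member is held in memory at a time.
    with tarfile.open(path, mode='r|gz') as tar:
        for member in tar:
            if not member.isfile():
                continue  # skip directories and other non-file entries
            record = json.load(tar.extractfile(member))
            # Assumed layout, copied from the question: documents -> passages -> text.
            first_passage = record['documents'][0]['passages'][0]
            if keyword in first_passage.get('text', ''):
                yield member.name, record
```

Usage would look like `for name, record in matching_members(path_file_PMC, keyword): ...`, writing each matching record out as it arrives. If the members turn out to have a different top-level shape, only the two lines that index into `record` need to change.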

syntax error in import bioc

Hi,

I am attempting to use BioC on MacOS X and when I try to import bioc in Python 2.7.10 I get the following error:

>>> import bioc
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Library/Python/2.7/site-packages/bioc/__init__.py", line 4, in <module>
    from .bioc import BioCCollection, BioCDocument, BioCPassage, BioCSentence, BioCAnnotation, \
  File "/Library/Python/2.7/site-packages/bioc/bioc.py", line 24
    def __init__(self, refid: str, role: str):
                            ^
SyntaxError: invalid syntax

I have updated to the most recent version:

sudo pip install bioc
Password:
Requirement already satisfied: bioc in /Library/Python/2.7/site-packages (1.3.1)
Requirement already satisfied: docutils==0.14 in ./Library/Python/2.7/lib/python/site-packages (from bioc) (0.14)
Requirement already satisfied: lxml==4.2.5 in /Library/Python/2.7/site-packages (from bioc) (4.2.5)
Requirement already satisfied: jsonlines==1.2.0 in /Library/Python/2.7/site-packages (from bioc) (1.2.0)
Requirement already satisfied: six in ./Library/Python/2.7/lib/python/site-packages (from jsonlines==1.2.0->bioc) (1.11.0)

Thank you in advance for your help!
Karyn

Normalized datasets fail during parsing

For datasets like tmVar v2.0, pubtator parsing fails because of the "RSID: xxxxxxxxx" field at the end of the annotations. Example file here: https://www.ncbi.nlm.nih.gov/CBBresearch/Lu/Demo/tmTools/download/tmVar/tmVar.Normalization.txt

Relations in pubtator not showing

I'm trying to read the BioRED files from here: https://ftp.ncbi.nlm.nih.gov/pub/lu/BioRED/
As you will see, they include PubTator-format files. The annotations are read in properly, but when I try to read the relations, I get no output.
The relations are written as follows:

14510914	Association	D050033	D007454	No
14510914	Positive_Correlation	p|DEL|439_443|	C564766	Novel
14510914	Positive_Correlation	p|DEL|439_443|	D003409	Novel
14510914	Association	C564766	D007454	No

Is this something bioc.pubtator supports?
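
Independently of what bioc.pubtator supports, relation lines of the shape shown above can be picked out by hand. The sketch below assumes the columns are tab-separated and that the fifth column is a novelty label; the field name `novelty` is a guess at its meaning, and `parse_relation_line` is a hypothetical helper, not a bioc function.

```python
from typing import NamedTuple, Optional


class PubTatorRelation(NamedTuple):
    pmid: str
    type: str
    arg1: str
    arg2: str
    novelty: Optional[str]


def parse_relation_line(line: str) -> Optional[PubTatorRelation]:
    """Return a relation for 'PMID<TAB>TYPE<TAB>ARG1<TAB>ARG2[<TAB>LABEL]', else None."""
    fields = line.rstrip('\r\n').split('\t')
    # Annotation lines have six columns with an integer offset in column 2;
    # title/abstract lines ('PMID|t|...') normally contain no tabs at all.
    if len(fields) not in (4, 5) or fields[1].isdigit():
        return None
    novelty = fields[4] if len(fields) == 5 else None
    return PubTatorRelation(fields[0], fields[1], fields[2], fields[3], novelty)
```

Running this over every line of a PubTator file and keeping the non-`None` results gives the relations per PMID, which can then be attached to whatever document objects the rest of the script uses.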

import biocjson error for python 2.7

We are using bioc version 1.2.3 with Python 2.7.15 and are not able to import biocjson under Python 2.
We checked the biocjson code, and it seems to be compatible only with Python 3.
Could you please suggest a way to use biocjson with Python 2? We are using negbio + BLLIPParser.

Can't use 'with' with BioCXMLDocumentReader

Hi, the new release has stopped the use of BioCXMLDocumentReader using a 'with' statement. It looks like the __enter__ and __exit__ methods were removed in commit 8174c1d. The documentation still suggests that you can do it that way. It'd be really useful if this could be reintroduced.

Here's a short bit of example code that gives the error below it.

import bioc
with bioc.BioCXMLDocumentReader('test.bioc.xml') as reader:
        pass

Traceback (most recent call last):
  File "itertest.py", line 3, in <module>
    with bioc.BioCXMLDocumentReader('test.bioc.xml') as reader:
AttributeError: __enter__
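
Until the `with` support returns, one generic workaround is to wrap the reader in a small context manager of your own. The sketch below is library-agnostic: `managed_reader` is a hypothetical helper, and whether `BioCXMLDocumentReader` accepts an open binary file object (rather than a path) is an assumption to check against the installed version.

```python
from contextlib import contextmanager


@contextmanager
def managed_reader(path, reader_factory):
    """Open path in binary mode, build a reader from the file object, and close the file on exit."""
    fp = open(path, 'rb')
    try:
        # reader_factory is any callable that takes the open file and returns a reader.
        yield reader_factory(fp)
    finally:
        fp.close()
```

If the assumption holds, usage would be `with managed_reader('test.bioc.xml', bioc.BioCXMLDocumentReader) as reader: ...`, which keeps the calling code close to what the documentation describes.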

Convert BioC to brat?

I have a BioC-formatted dataset that I'd like to be able to use in brat. I looked at the code for brat2bioc, and it looks like there's no way to convert in the opposite direction. However, the brat2bioc tech report says that the original Java code allowed conversion in both directions ("that translates annotations originally in brat format into BioC and vice versa").

Is there a way to bring this functionality into the python module?

EDIT: spelling

Brat export of documents without entities

Hi,

I used the brat export function on a protected corpus stored as a BioC XML file, but I get an error:
AttributeError: 'BioCDocument' object has no attribute 'entities'
Is it possible to create BioC files without the definition of 'entities'?

I created the entities myself:
i = 0
for passage in doc.passages:
    i = i + 1
    annotations = []

    for ann in passage.annotations:
        loc = ann.locations[0]
        key = len(annotations)
        start = loc.offset
        end = loc.offset + loc.length
        # BioC location offsets are normally document-level, so subtract the
        # passage offset before slicing the passage text.
        text = passage.text[start - passage.offset:end - passage.offset]
        line = 'T' + str(key) + '\t' + ann.infons['type'] + ' ' + str(start) + ' ' + str(end) + '\t' + text
        annotations.append(line)

Do you have an idea?

Best regards, Christina
