bioc's People

Contributors

dependabot[bot], jakelever, mart1nro, yfpeng

bioc's Issues

Brat parser: AssertionError: Illegal format: M

First of all, thank you for this incredibly useful library.

I am trying to parse a brat file of this resource but I get the following error:

[ins] In [26]: a2 = brat.load_ann("devel/PMC-3333881-05-MATERIALS_AND_METHODS.a1")
---------------------------------------------------------------------------
AssertionError                            Traceback (most recent call last)
Input In [26], in <module>
----> 1 a2 = brat.load_ann("devel/PMC-3333881-05-MATERIALS_AND_METHODS.a1")

File ~/.venv/bigbio/lib/python3.9/site-packages/bioc/brat/decoder.py:214, in load_ann(fp, docid)
    212     doc.add_annotation(loads_brat_note(line))
    213 if line[0] == 'A' or line[0] == 'M':
--> 214     doc.add_annotation(loads_brat_attribute(line))
    215 if line[0] == '*':
    216     doc.add_annotation(loads_brat_equiv(line))

File ~/.venv/bigbio/lib/python3.9/site-packages/bioc/brat/decoder.py:16, in loads_brat_attribute(s)
     12 """
     13 ID [tab] TYPE REFID [FLAG1 FLAG2 ...]
     14 """
     15 toks = s.split('\t')
---> 16 assert len(toks) == 2, 'Illegal format: %s' % s
     18 att = BratAttribute()
     19 att.id = toks[0]

AssertionError: Illegal format: M
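
One way to narrow this down is to list the lines that would trip the assertion shown in the traceback. The following is a minimal diagnostic sketch, not part of bioc: `find_malformed_attribute_lines` is a hypothetical helper, the sample text is made up, and it assumes attribute and modification lines follow the `ID [tab] TYPE REFID [FLAG1 FLAG2 ...]` layout from the docstring above.

```python
def find_malformed_attribute_lines(text):
    """Return (line_number, line) pairs for A/M lines without exactly two tab-separated fields.

    Hypothetical debugging helper. It mirrors the `len(toks) == 2` check
    from the traceback but does not call bioc itself.
    """
    bad = []
    for number, line in enumerate(text.splitlines(), start=1):
        if not line or line[0] not in ("A", "M"):
            continue  # only attribute (A) and modification (M) lines are checked
        if len(line.split("\t")) != 2:
            bad.append((number, line))
    return bad


# Made-up sample: the M2 line uses a space instead of a tab after the ID.
sample = "T1\tProtein 0 4\tIL-2\nM1\tNegation E1\nM2 Speculation E2\n"
print(find_malformed_attribute_lines(sample))
```

Running the helper on the contents of the failing `.a1` file should show which line, if any, deviates from the expected layout.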

Cannot read 'annotation' in biocxml format

I'm writing a Python script to convert a BioC XML file into a PubTator file.
I could not find an existing script for this, so I am writing my own.

The bioc files are downloaded from :
https://ftp.ncbi.nlm.nih.gov/pub/lu/BioRED/BioRED.zip

I tried to read "Test.BioC.XML" in two ways:

1:
    with open(fpath, 'r') as fp:
        collection = biocxml.load(fp)
        docs = collection.documents

2:
    with biocxml.iterparse(fpath) as reader:
        collection_info = reader.get_collection_info()
        for doc in reader:
            ...  # per-document processing omitted in the original post

Strangely, all annotations are missing, but the relations are parsed correctly.

Any idea why this happens?
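
One possible explanation, offered as an assumption rather than a confirmed diagnosis: in BioC XML, annotations usually sit inside `passage` elements, while relations may sit directly under `document`. A quick check with the standard library can show at which level each element appears. The sample XML below is made up for illustration and is not taken from BioRED; `count_by_level` is a hypothetical helper.

```python
import xml.etree.ElementTree as ET

# Made-up miniature document for illustration; not taken from BioRED.
SAMPLE = """<collection>
  <document>
    <id>1</id>
    <passage>
      <offset>0</offset>
      <text>BRCA1 and cancer</text>
      <annotation id="0">
        <infon key="type">Gene</infon>
        <location offset="0" length="5"/>
        <text>BRCA1</text>
      </annotation>
    </passage>
    <relation id="R0">
      <infon key="type">Association</infon>
    </relation>
  </document>
</collection>"""


def count_by_level(xml_text):
    """Count annotations and relations directly under documents vs. under passages."""
    root = ET.fromstring(xml_text)
    counts = {"doc_annotations": 0, "passage_annotations": 0, "doc_relations": 0}
    for doc in root.iter("document"):
        # findall() with a bare tag name matches direct children only.
        counts["doc_annotations"] += len(doc.findall("annotation"))
        counts["doc_relations"] += len(doc.findall("relation"))
        for passage in doc.findall("passage"):
            counts["passage_annotations"] += len(passage.findall("annotation"))
    return counts


print(count_by_level(SAMPLE))
```

If the real file shows the same pattern, the annotations would likely be reachable through each passage (`doc.passages`, then `passage.annotations`) rather than on the document object.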

lxml dependency

First of all, great library. Thank you for your work.

I'm wondering whether there is a strict requirement for lxml==4.4.1, or whether it could be more flexible, such as lxml>=4.4.1. lxml does not seem to have introduced breaking changes (https://lxml.de/4.5/changes-4.5.0.html), and a looser pin would really help in projects with multiple dependencies that can conflict.

Incrementally decoding the BioC JSON of a `.tar.gz` collection

PubMed Central provides its Open Access articles in the BioC JSON format (see API and Bulk Download). I downloaded one portion with
wget https://ftp.ncbi.nlm.nih.gov/pub/wilbur/BioC-PMC/PMC095XXXXX_json_ascii.tar.gz
and want to apply a filter document by document, because I need to save memory. I tried the following code:

from tqdm import tqdm
import gzip
import io
import json

keyword = 'diabetes'
my_doi_list = []
path_file_PMC = '/content/PMC095XXXXX_json_ascii.tar.gz'
path_file_PMC_filtered = '/content/result'

with gzip.open(path_file_PMC, 'rb') as gz, open(path_file_PMC_filtered, 'wb') as f_out:
    f = io.BufferedReader(gz)
    for line in tqdm(f.readlines()):
        record = json.loads(line)
        # doi = record['documents'][0]['passages'][0]['infons']['article-id_doi']
        if keyword in record['documents'][0]['passages'][0]['text']: 
            # my_doi_list.append(doi)
            f_out.write(line)

But I get an error:

JSONDecodeError: Expecting value: line 1 column 1 (char 0)

0%|          | 0/95046 [00:00<?, ?it/s]
---------------------------------------------------------------------------
JSONDecodeError                           Traceback (most recent call last)
<ipython-input-21-cc459cfaf959> in <module>
     19     # f = gz
     20     for line in tqdm(f.readlines()):
---> 21         record = json.loads(line)
     22         doi = record['documents'][0]['passages'][0]['infons']['article-id_doi']
     23         if keyword in record['documents'][0]['passages'][0]['text']:   # TODO: <<< change this to your filter

2 frames
/usr/lib/python3.7/json/__init__.py in loads(s, encoding, cls, object_hook, parse_float, parse_int, parse_constant, object_pairs_hook, **kw)
    346             parse_int is None and parse_float is None and
    347             parse_constant is None and object_pairs_hook is None and not kw):
--> 348         return _default_decoder.decode(s)
    349     if cls is None:
    350         cls = JSONDecoder

/usr/lib/python3.7/json/decoder.py in decode(self, s, _w)
    335 
    336         """
--> 337         obj, end = self.raw_decode(s, idx=_w(s, 0).end())
    338         end = _w(s, end).end()
    339         if end != len(s):

/usr/lib/python3.7/json/decoder.py in raw_decode(self, s, idx)
    353             obj, end = self.scan_once(s, idx)
    354         except StopIteration as err:
--> 355             raise JSONDecodeError("Expecting value", s, err.value) from None
    356         return obj, end

JSONDecodeError: Expecting value: line 1 column 1 (char 0)

Then I found your Python package and thought I could use one of the code examples provided here: https://bioc.readthedocs.io/en/latest/biocjson.html

However, I don't want to unzip or untar the archive, and the code examples do not make clear which format the file at filename must have. Is it possible to use the functions of your package with a .tar.gz file as input, or do I need to unzip it first (without untarring)?
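
A likely cause of the JSONDecodeError is that `gzip.open` removes only the gzip layer. What remains is a tar stream whose first bytes are a tar header rather than JSON. The sketch below reads the archive member by member with the standard-library `tarfile` module, without unpacking to disk. It assumes each archive member is a single JSON file with the `documents`/`passages` layout used in the code above; the helper names and the in-memory demo archive are hypothetical.

```python
import io
import json
import tarfile


def iter_tar_json(fileobj):
    """Yield (member_name, parsed_json) for each regular file in a .tar.gz stream."""
    # Mode "r|gz" reads the archive sequentially, so nothing is unpacked to disk.
    with tarfile.open(fileobj=fileobj, mode="r|gz") as tar:
        for member in tar:
            if not member.isfile():
                continue
            extracted = tar.extractfile(member)
            # Parse before advancing: stream mode cannot seek back to earlier members.
            yield member.name, json.load(extracted)


def first_passage_contains(record, keyword):
    """Same filter as in the question: first passage of the first document."""
    return keyword in record["documents"][0]["passages"][0]["text"]


def build_demo_archive():
    """Create a tiny in-memory .tar.gz with two made-up records (illustration only)."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w:gz") as tar:
        for name, text in [("a.json", "diabetes study"), ("b.json", "other topic")]:
            data = json.dumps({"documents": [{"passages": [{"text": text}]}]}).encode()
            info = tarfile.TarInfo(name)
            info.size = len(data)
            tar.addfile(info, io.BytesIO(data))
    buf.seek(0)
    return buf


matches = [name for name, rec in iter_tar_json(build_demo_archive())
           if first_passage_contains(rec, "diabetes")]
print(matches)
```

For the downloaded file, the same generator could be fed with `open(path_file_PMC, 'rb')`. Only one member is held in memory at a time.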

syntax error in import bioc

Hi,

I am attempting to use BioC on MacOS X and when I try to import bioc in Python 2.7.10 I get the following error:

>>> import bioc
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Library/Python/2.7/site-packages/bioc/__init__.py", line 4, in <module>
    from .bioc import BioCCollection, BioCDocument, BioCPassage, BioCSentence, BioCAnnotation, \
  File "/Library/Python/2.7/site-packages/bioc/bioc.py", line 24
    def __init__(self, refid: str, role: str):
                            ^
SyntaxError: invalid syntax

I have updated to the most recent version:

sudo pip install bioc
Password:
Requirement already satisfied: bioc in /Library/Python/2.7/site-packages (1.3.1)
Requirement already satisfied: docutils==0.14 in ./Library/Python/2.7/lib/python/site-packages (from bioc) (0.14)
Requirement already satisfied: lxml==4.2.5 in /Library/Python/2.7/site-packages (from bioc) (4.2.5)
Requirement already satisfied: jsonlines==1.2.0 in /Library/Python/2.7/site-packages (from bioc) (1.2.0)
Requirement already satisfied: six in ./Library/Python/2.7/lib/python/site-packages (from jsonlines==1.2.0->bioc) (1.11.0)

Thank you in advance for your help!
Karyn

Relations in pubtator not showing

I'm trying to read the BioRED file from here: https://ftp.ncbi.nlm.nih.gov/pub/lu/BioRED/
As you will see, it includes files in PubTator format. The annotations are read in properly, but when I try to read the relations, I get no output.
The relations are written as follows:

14510914	Association	D050033	D007454	No
14510914	Positive_Correlation	p|DEL|439_443|	C564766	Novel
14510914	Positive_Correlation	p|DEL|439_443|	D003409	Novel
14510914	Association	C564766	D007454	No

Is this something bioc.pubtator supports?
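
Independently of what bioc.pubtator supports, relation lines of this shape can be read with a few lines of plain Python. This is a minimal sketch, not the library's API: `Relation` and `parse_relation_lines` are hypothetical names, the field names are guesses based on the sample lines above, and the parser assumes that relation lines are exactly the lines with five tab-separated fields.

```python
from typing import List, NamedTuple


class Relation(NamedTuple):
    # Field names are assumptions inferred from the sample lines.
    pmid: str
    rel_type: str
    arg1: str
    arg2: str
    novelty: str


def parse_relation_lines(text: str) -> List[Relation]:
    """Collect every line that has exactly five tab-separated fields."""
    relations = []
    for line in text.splitlines():
        fields = line.split("\t")
        if len(fields) == 5:
            relations.append(Relation(*fields))
    return relations


# Two of the relation lines quoted above.
sample = (
    "14510914\tAssociation\tD050033\tD007454\tNo\n"
    "14510914\tPositive_Correlation\tp|DEL|439_443|\tC564766\tNovel\n"
)
for rel in parse_relation_lines(sample):
    print(rel.rel_type, rel.arg1, rel.arg2, rel.novelty)
```

Title and abstract lines contain no tabs, so the five-field check skips them. Entity lines would also be skipped as long as they carry a different number of fields.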

import biocjson error for python 2.7

We are using bioc version 1.2.3 with Python 2.7.15, and we are not able to import biocjson under Python 2.
We checked the biocjson code, and it seems to be compatible only with Python 3.
Could you please suggest a way to use biocjson with Python 2? We need it because we are using negbio with BLLIPParser.

Can't use 'with' with BioCXMLDocumentReader

Hi, the new release no longer allows BioCXMLDocumentReader to be used in a 'with' statement. It looks like the __enter__ and __exit__ methods were removed in commit 8174c1d. The documentation still suggests that you can do it that way. It would be really useful if this could be reintroduced.

Here's a short bit of example code that gives the error below it.

import bioc
with bioc.BioCXMLDocumentReader('test.bioc.xml') as reader:
    pass

Traceback (most recent call last):
  File "itertest.py", line 3, in <module>
    with bioc.BioCXMLDocumentReader('test.bioc.xml') as reader:
AttributeError: __enter__
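
Until the methods return, a thin wrapper can restore `with` support on the caller's side. This is a minimal sketch of the context-manager protocol, not bioc code: `ManagedReader` is a hypothetical wrapper and `DummyReader` is a stand-in used only to make the example runnable. Whether the real reader exposes a `close()` method is an assumption, so the wrapper calls it only if it exists.

```python
class ManagedReader:
    """Hypothetical wrapper that adds `with` support to an iterable reader."""

    def __init__(self, reader):
        self._reader = reader

    def __enter__(self):
        return self._reader

    def __exit__(self, exc_type, exc_value, traceback):
        # Call close() only if the wrapped object offers one.
        close = getattr(self._reader, "close", None)
        if callable(close):
            close()
        return False  # do not swallow exceptions


class DummyReader:
    """Stand-in for a document reader; not a bioc class."""

    def __init__(self, docs):
        self._docs = list(docs)
        self.closed = False

    def __iter__(self):
        return iter(self._docs)

    def close(self):
        self.closed = True


dummy = DummyReader(["doc1", "doc2"])
with ManagedReader(dummy) as reader:
    seen = list(reader)
print(seen, dummy.closed)
```

The `AttributeError: __enter__` above is what Python raises when the object after `with` lacks these two methods.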

Convert BioC to brat?

I have a BioC-formatted dataset that I'd like to use in brat. I looked at the code for brat2bioc, and there appears to be no way to convert in the opposite direction; however, the brat2bioc tech report says that the original Java code allowed conversion in both directions ("that translates annotations originally in brat format into BioC and vice versa").

Is there a way to bring this functionality into the Python module?

EDIT: spelling

Brat export of documents without entities

Hi,

I used the brat export function on a protected corpus given as a BioC XML file, but I get an error:
AttributeError: 'BioCDocument' object has no attribute 'entities'
Is it possible to create BioC files without defining 'entities'?

I created the entities myself:

    for passage in doc.passages:
        i = i + 1  # i is initialised earlier in the script (not shown)
        annotations = []

        for ann in passage.annotations:
            loc = ann.locations[0]
            key = len(annotations)
            start = loc.offset
            end = loc.offset + loc.length
            # Assumes annotation offsets index directly into passage.text;
            # subtract passage.offset first if they are document-level.
            text = passage.text[start:end]
            # brat text-bound line: ID <tab> TYPE START END <tab> TEXT
            line = 'T' + str(key) + '\t' + ann.infons['type'] + ' ' + str(start) + ' ' + str(end) + '\t' + text
            annotations.append(line)

Do you have an idea?

Best regards, Christina
