scheduled-bots

These bots ran on a Jenkins instance that was hosted on AWS.

Bot Creation Guidelines

Data Sources

Data that a bot uses and that is regularly updated by an external source should be handled by our instance of the Biothings.api. The bot should read the data from the MongoDB server, which runs on the same instance as Jenkins.

Bots

See Bot Status

Wikidata - Disease Ontology Feedback Loop

Code for detecting changes and creating robot templates is located here: https://github.com/SuLab/scheduled-bots/blob/master/scheduled_bots/disease_ontology/robot/run.py

Installation

Install the packages listed in requirements.txt, plus WikiDataIntegrator, via pip. To install scheduled-bots itself, change into the directory that contains setup.py and run pip install -e .

To import from scheduled-bots, you may also need to add that same directory to your system path with sys.path.insert(0, '<path>').
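
The path tweak mentioned above could be wrapped as follows. This is only a sketch: the helper name add_repo_to_path is hypothetical, and the directory you pass in is wherever your checkout of the repository lives.

```python
import os
import sys


def add_repo_to_path(repo_dir):
    """Put repo_dir at the front of sys.path unless it is already listed.

    Hypothetical helper, not part of scheduled-bots.
    """
    repo_dir = os.path.abspath(repo_dir)
    if repo_dir not in sys.path:
        sys.path.insert(0, repo_dir)
    return repo_dir
```

Calling it twice with the same directory leaves sys.path unchanged the second time, which avoids piling up duplicate entries in long-running sessions.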

scheduled-bots's People

Contributors

alexanderpico, andrawaag, andrewsu, ariutta, egonw, gtsueng, jacobsonmt, sebotic, stuppie, turoger

scheduled-bots's Issues

investigate annotations to deprecated GO terms

There are currently 80 items that have a BP annotation to single organismal cell-cell adhesion, despite that term being deprecated by GO. I spot checked a few (e.g., https://www.uniprot.org/uniprot/P35222) and those annotations appear to no longer exist. I would have expected the GO bot to delete these annotations then. We should investigate whether this is the intended bot behavior for reasons I'm forgetting, or whether this is a bug / edge case that needs to be dealt with.

(side note -- would be interested to know how many deprecated GO terms are in WD currently... that would give us a sense of the scale of the issue here.)

GeneBot_microbes fails

The GeneBot_microbes failed on its last regular attempt.

The error message is:
File "GeneBot.py", line 858, in <module>
    metadata = mgd.get_metadata()['src_version']
KeyError: 'src_version'
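
One way to make this failure easier to diagnose is to check for the key explicitly and report what the metadata actually contains. The sketch below assumes that mgd.get_metadata() returns a plain dict; the helper name require_src_version is hypothetical, and it does not say where the version information has moved to.

```python
def require_src_version(metadata):
    """Return metadata['src_version'], or fail with a message listing the keys present.

    Hypothetical helper; assumes `metadata` is a dict.
    """
    if "src_version" not in metadata:
        raise KeyError(
            "'src_version' missing from metadata; available keys: %s"
            % sorted(metadata)
        )
    return metadata["src_version"]
```

The bot would still stop, but the log would show which keys are available, which is usually enough to tell whether the upstream metadata layout changed.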

misplaced : in the MeSH identifiers

As per https://www.wikidata.org/wiki/User_talk:ProteinBoxBot#MeSH_descriptor_ID_(P486)_edits_again (diff)

Looking at dystonia (Q906492) I see three bad edits on the MeSH property P486. Two are recent and by ProteinBoxBot, and one is by User:Andrawaag from earlier in 2020.

The edits are all technically incorrect, from the point of view of adding incorrect IDs.

See also the previous discussion above. All those edits are sourced to Disease Ontology. Two relate to the MeSH term "Dystonic Disorders" with ID D020821. This is distinct from "Dystonia" which is ID D004421. I thought we had discussed exhaustively why DO referencing should not be used to introduce this sort of database constraint violation here.

A couple of hundred such bot edits with the ":" prefix have appeared.

I would like to comment also that there is a WikiCite e-scholarship that has been given for work on the MeSH statements. To support the developer working on that project, I have been bearing down on the P486 constraint violations, because the project will rely on there being no avoidable duplications. The number of duplications applying to the D-numbers and logged at Wikidata:Database reports/Constraint violations/P486 had been reduced to about a dozen. It is really not acceptable, after the discussion above, and the one I had with Andra in Berlin last year, that this issue should recur. Charles Matthews (talk) 13:09, 8 December 2020 (UTC)
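
A format check before writing could catch the stray ":" prefix. The sketch below assumes a descriptor ID is the letter D followed by six or nine digits; that pattern and the helper name clean_mesh_id are assumptions, not taken from the bot. It addresses only the malformed prefix, not the separate concern about mapping a disease to the wrong MeSH term.

```python
import re

# Assumed shape of a MeSH descriptor ID: "D" plus six or nine digits.
MESH_DESCRIPTOR = re.compile(r"D\d{6}(\d{3})?")


def clean_mesh_id(raw):
    """Strip whitespace and leading/trailing colons; return None if still malformed."""
    candidate = raw.strip().strip(":").strip()
    return candidate if MESH_DESCRIPTOR.fullmatch(candidate) else None


print(clean_mesh_id(":D020821"))  # stray prefix removed
print(clean_mesh_id("MESH:D1"))   # rejected, so the bot can skip the write
```

Returning None instead of a best guess lets the caller log the value and skip the statement rather than add a constraint violation.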

A merge between Q55950049 and Q861224 blocks GeneDiseaseBot from updating the WD item on DOID:626

The console output from the GeneDiseaseBot (http://jenkins.sulab.org/job/GeneDiseaseBot/332/console) indicates an issue with updating Disease Ontology ID 626. A merge seems to have caused this behaviour.

Error while writing to Wikidata
{'error': {'code': 'modification-failed', 'info': '[[Q55950049|Q55950049]] not found', 'messages': [{'name': 'wikibase-validator-no-such-entity', 'parameters': ['[[Q55950049|Q55950049]]'], 'html': {'': 'Q55950049 not found'}}], '': 'See https://www.wikidata.org/w/api.php for API usage. Subscribe to the mediawiki-api-announce mailing list at <https://lists.wikimedia.org/mailman/listinfo/mediawiki-api-announce> for notice of API deprecations and breaking changes.'}, 'servedby': 'mw1348'}
Error while writing to Wikidata
{'error': {'code': 'modification-failed', 'info': '[[Q55950049|Q55950049]] not found', 'messages': [{'name': 'wikibase-validator-no-such-entity', 'parameters': ['[[Q55950049|Q55950049]]'], 'html': {'': 'Q55950049 not found'}}], '': 'See https://www.wikidata.org/w/api.php for API usage. Subscribe to the mediawiki-api-announce mailing list at <https://lists.wikimedia.org/mailman/listinfo/mediawiki-api-announce> for notice of API deprecations and breaking changes.'}, 'servedby': 'mw1225'}
Error while writing to Wikidata
{'error': {'code': 'modification-failed', 'info': '[[Q55950049|Q55950049]] not found', 'messages': [{'name': 'wikibase-validator-no-such-entity', 'parameters': ['[[Q55950049|Q55950049]]'], 'html': {'': 'Q55950049 not found'}}], '': 'See https://www.wikidata.org/w/api.php for API usage. Subscribe to the mediawiki-api-announce mailing list at <https://lists.wikimedia.org/mailman/listinfo/mediawiki-api-announce> for notice of API deprecations and breaking changes.'}, 'servedby': 'mw1346'}
Begin creating Wikidata Disease items with new relationships
P2293 not found in fastrun
Error while writing to Wikidata
{'error': {'code': 'modification-failed', 'info': '[[Q55790461|Q55790461]] not found', 'messages': [{'name': 'wikibase-validator-no-such-entity', 'parameters': ['[[Q55790461|Q55790461]]'], 'html': {'': 'Q55790461 not found'}}], '': 'See https://www.wikidata.org/w/api.php for API usage. Subscribe to the mediawiki-api-announce mailing list at <https://lists.wikimedia.org/mailman/listinfo/mediawiki-api-announce> for notice of API deprecations and breaking changes.'}, 'servedby': 'mw1346'}
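
If the bot kept a map of known merges, it could swap a stale QID for its merge target before writing. The sketch below is hypothetical: the function name and the redirect map are illustrative, and the direction of the merge (Q55950049 into Q861224) is inferred from the error message, not stated in the log.

```python
def resolve_redirect(qid, redirects, max_hops=10):
    """Follow merge redirects (old QID -> new QID) until a QID with no redirect is reached."""
    seen = set()
    while qid in redirects:
        if qid in seen or len(seen) >= max_hops:
            raise ValueError("redirect loop or chain too long at %s" % qid)
        seen.add(qid)
        qid = redirects[qid]
    return qid


# Assumed direction of the merge reported above.
known_merges = {"Q55950049": "Q861224"}
print(resolve_redirect("Q55950049", known_merges))
```

How the redirect map gets populated is left open here; the point is only that statement values should be resolved before the write is attempted.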

[clingen bot]: refactor to use mygene.info instead of downloading flat file directly

From slack discussion with @sabahzero

"refactor to use mygene.info": the intent here was to change your bot to get data from mygene.info, rather than directly downloading the csv file. This call http://mygene.info/v3/query?q=_exists_:clingen will get you the list of gene IDs with clingen data, and calls like this http://mygene.info/v3/gene/144568?fields=clingen will get all the info your bot needs. This might be a bit cleaner because the mygene.info team will make sure the parsing stays up to date with any changes in the file format.

the potential disadvantage is that it introduces another dependency (on mygene.info), but obviously that's something we're committed to for the long term...
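
A rough sketch of the two calls mentioned above, using only the standard library. The helper names are hypothetical; the URLs are the ones quoted in the discussion, and paging through large result sets is not handled.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

BASE = "http://mygene.info/v3"


def clingen_query_url():
    """URL listing genes that carry clingen data."""
    return BASE + "/query?" + urlencode({"q": "_exists_:clingen"}, safe=":")


def gene_clingen_url(gene_id):
    """URL returning only the clingen field for one gene."""
    return "%s/gene/%s?%s" % (BASE, gene_id, urlencode({"fields": "clingen"}))


def fetch_json(url, timeout=30):
    """Download and decode a JSON response (requires network access)."""
    with urlopen(url, timeout=timeout) as response:
        return json.load(response)
```

A bot would call fetch_json(clingen_query_url()) once to collect gene IDs, then fetch_json(gene_clingen_url(gene_id)) per gene.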

Gene disease bot fails

The Gene Disease Bot just failed. (log).

The reason given is
KeyError: 'item'
Build step 'Execute shell' marked build as failure
Archiving artifacts

This is consistent with a ShEx bot I am currently developing. In that bot I fixed it by storing the SPARQL results as pandas dataframes. I need to figure out whether this should be the solution here as well, but given that it fails on scheduled bots, it might make sense to address it in WDI.
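
The log does not show where the KeyError is raised. One plausible cause, assumed here, is a result row in which the item variable is unbound. A defensive reader over the standard SPARQL JSON results layout could look like this; the function name is hypothetical and this is not the fix used in the ShEx bot.

```python
def binding_values(results, var):
    """Yield the value of `var` from every SPARQL JSON binding that contains it."""
    for row in results.get("results", {}).get("bindings", []):
        cell = row.get(var)
        if cell is not None:
            yield cell["value"]


# Made-up sample response: the second row has no binding for "item".
sample = {
    "results": {
        "bindings": [
            {"item": {"type": "uri", "value": "http://www.wikidata.org/entity/Q1"}},
            {"other": {"type": "literal", "value": "x"}},
        ]
    }
}
print(list(binding_values(sample, "item")))
```

Rows without the variable are skipped rather than crashing the run; whether skipping is acceptable depends on what the bot does with the rows.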

GO OWL Bot

Some notes

Things we may want to add (that aren't being used now)

trans-synaptic signaling by endocannabinoid, modulating synaptic transmission (GO_0099553) ->
mediated_by (GOREL_0001007) ->
cannabinoid (CHEBI_67194)

query to get regulates (and other owl:Restriction values)

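PREFIX owl: <http://www.w3.org/2002/07/owl#>
PREFIX oboInOwl: <http://www.geneontology.org/formats/oboInOwl#>
PREFIX obo: <http://purl.obolibrary.org/obo/>
# rdfs is used below, so declare it for endpoints that do not predefine it
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
SELECT * WHERE {
  ?_id rdfs:subClassOf ?obj .
  ?obj owl:onProperty ?op .
  ?obj owl:someValuesFrom ?svf .
}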

New namespace from identifiers.org in Wikipathways

Initially, Wikipathways used http://identifiers.org/.... as the prefix in IRIs; this has since been changed to "https". This causes the bot to fail on those pathways that use the new prefix.
Two things need to be done:
[ ] Update the bot to parse the new IRI prefix for identifiers.org (e.g. here, here, here and here).
[ ] Decide how to deal with the old identifiers.org prefix. Do we keep the old mappings, or do we update all previous mappings to https://identifiers.org as well?
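
For the first item, a parser that accepts both schemes could be sketched as below. The function name is hypothetical and the pattern assumes the path form namespace/accession; other identifiers.org IRI forms would need their own handling.

```python
import re

# Accept both the old (http) and the new (https) prefix.
IDENTIFIERS_ORG = re.compile(
    r"^https?://identifiers\.org/(?P<namespace>[^/]+)/(?P<accession>.+)$"
)


def parse_identifiers_iri(iri):
    """Return (namespace, accession), or None if the IRI does not match."""
    match = IDENTIFIERS_ORG.match(iri)
    if match is None:
        return None
    return match.group("namespace"), match.group("accession")


# Illustrative IRIs, not taken from a specific pathway.
print(parse_identifiers_iri("http://identifiers.org/ncbigene/1234"))
print(parse_identifiers_iri("https://identifiers.org/ncbigene/1234"))
```

Because both schemes yield the same tuple, downstream mapping code does not need to know which prefix a pathway used.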

Missing some CIViC evidence statements in Wikidata

Hi guys, I've just realized that some evidence statements coming from CIViC are not echoed in Wikidata at all. E.g.,

EGFR AMPLIFICATION in Wikidata => 0 statements
EGFR AMPLIFICATION in CIViC => 9 accepted evidence statements

ERBB2 AMPLIFICATION in Wikidata => 0 statements
ERBB2 AMPLIFICATION in CIViC => 58 accepted evidence statements

Taking a quick look at the bot, it seems that it gets all variants:

r = requests.get('https://civic.genome.wustl.edu/api/variants?count=999999999')

then, it gets statements for each variant, here

for record in tqdm(records):
    try:
        run_one(record['id'], retrieved, fast_run, write, login)
    except Exception as e:
        traceback.print_exc()
        wdi_core.WDItemEngine.log("ERROR", wdi_helpers.format_msg(
            record['id'], PROPS['CIViC Variant ID'], None, str(e), type(e)))

Later it processes data and it loads them in Wikidata. Theoretically speaking, it should get also the aforementioned ones but they are not present.

CGI bot does not run because of duplicates for combination therapies on Wikidata

The CGI bot fails repeatedly. This is caused when duplicate items are created for the same combination therapy. Here are two examples:

buparlisib / paclitaxel / carboplatin combination therapy (Q58644763) and carboplatin / buparlisib / paclitaxel combination therapy (Q88405264).

The issue is fixed when the two items are merged into one item and the bot runs again.

The issue emerges with the following error message:
Traceback (most recent call last):
  File "normalize_drugs.py", line 136, in <module>
    assert len(combo_qid) == len(qid_combo)

Identifying which items are the culprits in this case is a bit tedious. I wrote a script to identify the items that need to be merged. Running that script identifies the two items. In the future we could include it in the bot, so that the bot does not fail but fixes the duplicates as it runs.

For now, I prefer to do the fixing manually to keep an eye on the merging process.

After merging the above-mentioned items, the bot ran successfully.
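
The script itself is not shown here. As an illustration of the idea, duplicates can be found by grouping items on their unordered set of components. The function name and the input shape (QID mapped to component names) are assumptions; a real bot would use component QIDs, not labels.

```python
from collections import defaultdict


def find_duplicate_combos(item_parts):
    """Group item QIDs that describe the same unordered set of components."""
    groups = defaultdict(list)
    for qid, parts in item_parts.items():
        groups[frozenset(parts)].append(qid)
    return [sorted(qids) for qids in groups.values() if len(qids) > 1]


# Component names stand in for component QIDs; Q_OTHER is a placeholder.
items = {
    "Q58644763": ["buparlisib", "paclitaxel", "carboplatin"],
    "Q88405264": ["carboplatin", "buparlisib", "paclitaxel"],
    "Q_OTHER": ["buparlisib", "paclitaxel"],
}
print(find_duplicate_combos(items))
```

Each returned group is a candidate for a manual merge, which keeps a human in the loop as preferred above.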

CGI bot is disabled

CGI has been taken off the production tab.

The bot chokes on how combinatorial drugs are described in Wikidata.

We had the same issue on build 32. Back then this was fixed by normalizing the combinatorial drugs manually.

Documentation for the CGI bot is missing.

Jenkins runs a variety of bots, and the documentation of the individual bots could be improved. This issue is to work towards better documentation for the CGI bot, which could also act as a template for the other bots.
The documentation should include (among other things) a description of the bot, the schema on Wikidata, and the mapping between the schema and the original source.

Gene Bot: Ensembl only genes

These currently get skipped:

{'_id': 'ENSG00000269044',
 '_score': 1.55,
 'ensembl': {'gene': 'ENSG00000269044',
  'transcript': 'ENST00000598112',
  'translation': []},
 'genomic_pos': {'chr': '19',
  'end': 16635269,
  'start': 16633797,
  'strand': -1},
 'genomic_pos_hg19': {'chr': '19',
  'end': 16746080,
  'start': 16744614,
  'strand': -1},
 'symbol': 'CTC-429P9.3',
 'taxid': 9606}

zero size edits by WDI

Because the fast run functionality relied heavily on the WDQS, it should no longer be used now that the WDQS has countermeasures in place against heavy overall use.
This means that our zero-size edits are back, and an alternative approach is required.

I suggest looking for a solution in the MediaWiki API calls

BRAF has no Ensembl Gene ID anymore

Hey guys, I noticed that BRAF lost its Ensembl Gene ID. It was probably removed on Aug 16 by ProteinBoxBot. AFAIK this is the last item version with the Ensembl ID.

I'm not sure it is normal. It could be a symptom of a more general problem which may compromise mappings/queries based on Ensembl Gene IDs.

Possible broken items

Excuse me if the title is not very informative, but it seems like something weird is going on in some items annotated with Ensembl Gene ID (P594).

According to Wikidata data model, we should have this kind of graph pattern:

wd:item wdt:property ?value .
wd:item p:property ?statement .
?statement ps:property ?value .

Sometimes it seems like such a pattern does not exist at all. E.g.,

ASK WHERE
{
  wd:Q18047295 wdt:P594 ?ensembl.
  wd:Q18047295 p:P594 ?statement .
  ?statement ps:P594 ?ensembl .
}

# False

Indeed, if I try:

SELECT * WHERE
{
  wd:Q18047295 p:P594 ?statement .
  ?statement ps:P594 ?ensembl . 
}

I get no matching records. Interestingly, if I try the same queries after a couple of minutes, the result changes. It seems that the result depends on the server node I'm actually hitting (or on its caching system), so if the query hits a node with broken items, I get no data.

Could it be a kind of problem like the one discussed in SuLab/WikidataIntegrator#65?

OBOGraph based bots don't behave as expected

After Jenkins was restarted with the required throttling measures in place, the OBOGraph bots (Disease Ontology and Gene Ontology) don't work as expected. The Disease Ontology bots only update a very small subset of statements. The Gene Ontology bot does not respect the fastrun.
Need to inspect and fix.

Jenkins/scheduled-bot cleanup needed

The microbial and mammalian GeneBots have been consolidated into a single bot.
Both the geneprotein folder in scheduled-bots and Jenkins need some reordering to make this more visible.
Also, before this consolidation, multiple bots ran on mammalian genes (e.g. rat and mouse) that don't run now. They need to resume in order to maintain the content already added.

Bot best practices

Dealing with label & description changes
https://www.wikidata.org/w/index.php?title=Q24726011&type=revision&diff=411053368&oldid=409920659
Need to keep label and add to aliases?

Don't overwrite things people have added
https://www.wikidata.org/w/index.php?title=Q2838685&diff=prev&oldid=410957448
Fixed by adding instance of to append values, but:

Actual change, that should have been changed, but also wiped something someone added
https://www.wikidata.org/w/index.php?title=Q24724709&diff=prev&oldid=410957751
Have to check which statements are from previous interpro versions and only wipe those
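
The last point could be sketched as a filter that only marks a statement for removal when it is both outdated and sourced to an earlier release of the bot's own source. Everything here is hypothetical: the statement representation (dicts with value and stated_in keys), the function name, and the placeholder identifiers.

```python
def split_statements(existing, current_values, own_source_qids):
    """Return (to_remove, to_keep).

    A statement is removed only if it was sourced to one of the bot's own
    earlier releases AND its value is absent from the current release.
    Statements added by other editors are always kept.
    """
    to_remove, to_keep = [], []
    for statement in existing:
        from_bot = statement.get("stated_in") in own_source_qids
        outdated = statement["value"] not in current_values
        (to_remove if from_bot and outdated else to_keep).append(statement)
    return to_remove, to_keep


# Placeholder identifiers for illustration only.
existing = [
    {"value": "IPR_A", "stated_in": "Q_RELEASE_OLD"},
    {"value": "IPR_B", "stated_in": "Q_RELEASE_OLD"},
    {"value": "IPR_C", "stated_in": None},  # added by a human editor
]
removed, kept = split_statements(existing, {"IPR_A"}, {"Q_RELEASE_OLD"})
print([s["value"] for s in removed], [s["value"] for s in kept])
```

In this example only IPR_B is dropped: IPR_A is still current and IPR_C was not added by the bot.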

Double ENSG codes in Wikidata human genes

Hello Guys. I hope I'm posting in the right place. I was mapping my local Ensembl IDs to Wikidata when I found some double ENSG codes in Wikidata human genes collection.

For example: Q18035090 and Q30251272 have the same ID.

Here is the full list.

It seemed strange to me but maybe it's perfectly normal. 😉

Bye!

Duplicated Ensembl IDs

Hi guys! I am opening this issue to report a potential problem that I found in the data.

According to this query:

SELECT ?item ?itemLabel ?item2 ?item2Label 
WHERE 
{
  ?item wdt:P594 ?ensg .
  ?item2 wdt:P594 ?ensg .
  FILTER (str(?item) > str(?item2))
  SERVICE wikibase:label { bd:serviceParam wikibase:language "[AUTO_LANGUAGE],en". }
}

Some Ensembl IDs are reused across items, which seems pretty strange.

For example, Q413766 is Fibronectin 1 protein, and Q14819473 is its encoding gene. Both items share
?item wdt:P594 'ENSG00000115414'. AFAIK, ENSG* should be reserved to genes.

Is there something to check in the data loading process?
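
The same check can be run offline on (item, Ensembl ID) pairs, for example during loading. This is a sketch with a hypothetical function name; the sample pair is the Fibronectin example above.

```python
from collections import defaultdict


def shared_ids(pairs):
    """Map each identifier used by more than one item to the sorted list of those items."""
    by_id = defaultdict(set)
    for item, ensg in pairs:
        by_id[ensg].add(item)
    return {ensg: sorted(items) for ensg, items in by_id.items() if len(items) > 1}


pairs = [
    ("Q413766", "ENSG00000115414"),    # Fibronectin 1 protein
    ("Q14819473", "ENSG00000115414"),  # its encoding gene
]
print(shared_ids(pairs))
```

Unlike the SPARQL query, this reports each shared ID once with all of its items, rather than one row per item pair.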

PS: guys at SuLab, please don't hate me too much for my issue submissions 😃

GeneBot_microbes is disabled

The bot does not run. The output suggests that there is a data issue with: ftp://ftp.ncbi.nlm.nih.gov/genomes/GENOME_REPORTS/prokaryotes.txt

pandas.errors.ParserError: Error tokenizing data. C error: Expected 23 fields in line 33502, saw 24
see: console output for details
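
To locate rows like the one reported above without pandas, the field count of every row can be compared with the header. The sketch assumes a tab-delimited file with a header line and no quoting; the function name is hypothetical.

```python
import csv
import io


def rows_with_unexpected_width(text, delimiter="\t"):
    """Return (line_number, field_count) for rows whose width differs from the header."""
    reader = csv.reader(io.StringIO(text), delimiter=delimiter, quoting=csv.QUOTE_NONE)
    header = next(reader, None)
    if header is None:
        return []
    bad = []
    for row in reader:
        if row and len(row) != len(header):
            bad.append((reader.line_num, len(row)))
    return bad


# Made-up three-line sample: the last row has one field too many.
sample_text = "name\ttaxid\nfoo\t1\nbar\t2\textra\n"
print(rows_with_unexpected_width(sample_text))
```

Running this over the downloaded report would show whether the problem is a single stray tab or a change in the file layout.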

Changes to Gene Bot

Fix HGNC symbols
bitbucket issue
Example
mygene.info
Issue is mygene.info has incorrect values in the symbol field for certain genes. I don't know where these come from.
Fix: If an entry doesn't have a "HGNC" field, the "symbol" field won't be used.
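
The fix described above amounts to a small guard. A minimal sketch, assuming a mygene.info record is a dict and using a hypothetical function name:

```python
def usable_symbol(doc):
    """Return doc['symbol'] only when the record also has an 'HGNC' field."""
    if "HGNC" not in doc:
        return None
    return doc.get("symbol")


# The Ensembl-only record style has a symbol but no HGNC field.
print(usable_symbol({"symbol": "CTC-429P9.3", "taxid": 9606}))
# Placeholder values, for illustration only.
print(usable_symbol({"HGNC": "0000", "symbol": "EXAMPLE"}))
```

The first call returns None, so the questionable symbol is never written; the second returns the symbol unchanged.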

Add chromosome items as qualifier to genomic positions
Needed for wikigenomes

Add genome assembly as qualifier to chromosome
Issue: It is possible that a gene is located on two different chromosomes in two different builds
example
mygene.info
These genes need two chromosome statements, qualify using genome assembly
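
A sketch of how one chromosome claim per assembly could be derived from a mygene.info record. The mapping of the genomic_pos and genomic_pos_hg19 fields to assembly names is an assumption to verify, and the function name is hypothetical. Positions may be a single dict or a list of dicts, so both are handled.

```python
# Assumed mapping from mygene.info fields to assembly names; verify before use.
ASSEMBLY_BY_FIELD = {"genomic_pos": "GRCh38", "genomic_pos_hg19": "GRCh37"}


def chromosome_claims(doc):
    """Return unique (chromosome, assembly) pairs found in the record."""
    claims = []
    for field, assembly in ASSEMBLY_BY_FIELD.items():
        positions = doc.get(field)
        if positions is None:
            continue
        if isinstance(positions, dict):
            positions = [positions]
        for pos in positions:
            claim = (pos["chr"], assembly)
            if claim not in claims:
                claims.append(claim)
    return claims


# Made-up record with a different chromosome in each build.
record = {"genomic_pos": {"chr": "2"}, "genomic_pos_hg19": {"chr": "1"}}
print(chromosome_claims(record))
```

Each pair would become one chromosome statement with the assembly as its qualifier, so a gene placed differently in two builds gets two qualified statements.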

Some genes are missing chromosome
Example
mygene.info
Issue: mygene.info has an alternative chromosome identifier specified.
Fix: The identifier mapping is located here (for human). The chromosome should be added to the item, along with the RefSeq-Accn

miscellaneous things to handle
This gene has multiple genomic positions: http://mygene.info/v3/gene/8924
Gene on multiple chromosomes: https://mygene.info/v3/gene/150786
