elixirtess / tess_scrapers
TeSS HTML page scrapers in Ruby looking for training resources and events metadata.
License: Other
The BITSVIB people are writing JSON-LD expressions of all their things.
It would be worth replacing the existing BITS scraper with one for this feed: http://dev.bits.vib.be/eulife/all_events-vib_conferences.json
Keep the old data and associate any new stuff with the new old content provider
http://www.bioinfo.no/Training
Not well formatted, so don't attempt. Approaching them about schema.org first.
We've got a materials one, but now they're implementing schema.org for events
They separate old and new
* http://www.france-bioinformatique.fr/en/evenements_upcoming for upcoming events
* http://www.france-bioinformatique.fr/en/evenements_previous for our previous events.
Not searching within nodes properly.
Probably due to overuse of the "optional" flag, for example in the following code:
pattern RDF::Query::Pattern.new(material_uri, RDF::Vocab::SIOC.has_creator, :author_obs, optional: true)
pattern RDF::Query::Pattern.new(:author_obs, RDF::Vocab::SCHEMA.name, :authors, optional: true)
If no author_obs were found, authors contains all the RDF::Vocab::SCHEMA.name values from the whole page.
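The effect can be illustrated with a plain-Ruby model of the two patterns. This is only a sketch and does not use RDF.rb: the triples, the match helper and the sample names are all hypothetical. It assumes the behaviour described above, i.e. that when the first optional pattern binds nothing, the name pattern runs with an unbound subject. Restricting the name lookup to author nodes that were actually found avoids that.

```ruby
# Plain-Ruby model of the two query patterns (hypothetical data, not RDF.rb).
TRIPLES = [
  ['page',      'schema:name',      'Some Training Portal'],
  ['material1', 'schema:name',      'Intro to RNA-seq'],
  ['material2', 'sioc:has_creator', 'author1'],
  ['author1',   'schema:name',      'Ada Example']
].freeze

# Return the objects of triples matching subject and predicate.
# A nil subject stands for an unbound variable and matches any subject.
def match(subject, predicate)
  TRIPLES.select { |s, p, _| (subject.nil? || s == subject) && p == predicate }
         .map { |_, _, o| o }
end

# Models the reported behaviour: the author node is optional, and the name
# pattern still runs (with an unbound subject) when no author node was found.
def authors_loose(material)
  author_nodes = match(material, 'sioc:has_creator')
  subjects = author_nodes.empty? ? [nil] : author_nodes
  subjects.flat_map { |node| match(node, 'schema:name') }
end

# Only looks up names for author nodes that were actually found.
def authors_strict(material)
  match(material, 'sioc:has_creator').flat_map { |node| match(node, 'schema:name') }
end
```

With this data, authors_loose('material1') picks up every name on the page, while authors_strict('material1') returns an empty list.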
a) scraper hasn't been run in a long while
b) Duplicate records for some materials https://tess.elixir-europe.org/materials?content_provider=IFB+French+Institute+of+Bioinformatics&page=3
c) New scraper for events - https://www.france-bioinformatique.fr/en/evenements_upcoming
The scrapers need to be updated so that records can be updated. Ideally there will be some sort of versioning active on the main TeSS site so that the previous revision will be kept.
GobletRdfaScraper: redirection forbidden: http://www.mygoblet.org/training-portal/materials-xml -> https://www.mygoblet.orgtraining-portal/materials-xml
(missing / between domain and path)
But the actual URL returns a 404: https://www.mygoblet.org/training-portal/materials-xml
Contact the maintainer and let them know their web server config is broken.
The training materials are listed in TSV format here:
https://github.com/Bioconductor/bioconductor.org/blob/master/etc/course_descriptions.tsv
For each material you can add the title, keyword, instructor (as author) and URL columns. Extract and add the first 'material' URL as the link, and any subsequent URLs as associated resources.
All can be linked with the bio.tools tool, Bioconductor:
https://bio.tools/bioconductor
Any upcoming events can be added as events and they should have a link to the added material too!
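A rough sketch of the mapping described above, in plain Ruby. The column names (title, keywords, instructor, material), the output keys and the sample data are assumptions made for illustration, so they would need checking against the real TSV header and against what TeSS expects.

```ruby
# Sketch: map rows of the course TSV to material hashes.
# Column names and output keys are hypothetical; check the real file header.
BIO_TOOLS_URL = 'https://bio.tools/bioconductor'

def parse_course_tsv(text)
  lines = text.lines.map(&:chomp).reject(&:empty?)
  return [] if lines.empty?

  header = lines.shift.split("\t")
  lines.map do |line|
    # Pair each header name with the value in the same column.
    row = header.zip(line.split("\t", -1)).to_h
    # Pull every URL out of the material column, in order of appearance.
    urls = row['material'].to_s.scan(%r{https?://[^\s,;]+})
    {
      title: row['title'],
      keywords: row['keywords'].to_s.split(',').map(&:strip).reject(&:empty?),
      authors: [row['instructor']].compact,
      url: urls.first,                      # first URL becomes the link
      associated_resources: urls.drop(1),   # the rest become associated resources
      bio_tools_url: BIO_TOOLS_URL
    }
  end
end
```

Splitting on tabs by hand keeps the sketch dependency-free; a CSV library with a tab separator would cope better with quoted fields.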
We need to bring back the old GOBLET API as the RDFa one is insufficient
github.com/ElixirUK/TeSS_scrapers/blob/9f22a17065e0d2c97f9023568e69979f8ae68dd3/unrefactored_scrapers/goblet_api_scraper.rb
It needs putting in the new scraper framework format
Could not email: From: TeSS <[email protected]>
To: TeSS <[email protected]>
Subject: Scraper Failure
It would seem that the following scrapers have failed to run:
GalaxyScraper: undefined method `each' for nil:NilClass
GobletRdfaScraper: redirection forbidden: http://www.mygoblet.org/training-portal/materials-xml -> https://www.mygoblet.orgtraining-portal/materials-xml
| 553 5.1.8 <[email protected]>... Domain of sender address [email protected] does not exist
http://www.ucl.ac.uk/isd/services/research-it/training/#training
Contact them (or James Hetherington) about structured data first
Some scrapers randomly change the order of certain array fields, causing an activity log to be generated even though nothing really changed.
scraper updated "Key-terms", a learning game for conceptual consolidation at 2017-10-05 03:10:19 UTC.
changed Remote updated date to: 2017-10-05
scraper updated "Key-terms", a learning game for conceptual consolidation at 2017-10-04 03:10:04 UTC.
changed Target audience to: ["Trainers", "Educators", "Ontologists"]
changed Remote updated date to: 2017-10-04
scraper updated "Key-terms", a learning game for conceptual consolidation at 2017-10-03 09:15:05 UTC.
changed Target audience to: ["Educators", "Trainers", "Ontologists"]
changed Remote updated date to: 2017-10-03
scraper updated "Key-terms", a learning game for conceptual consolidation at 2017-09-28 03:21:47 UTC.
changed Target audience to: ["Ontologists", "Trainers", "Educators"]
changed Remote updated date to: 2017-09-28
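One possible way to avoid the spurious updates is to compare array fields without regard to order before deciding that something changed. A minimal sketch, assuming the existing and scraped records are available as hashes; the helper names are made up for illustration and are not part of the scraper framework.

```ruby
# Compare two values, ignoring element order when both are arrays.
def same_value?(old_value, new_value)
  if old_value.is_a?(Array) && new_value.is_a?(Array)
    old_value.sort_by(&:to_s) == new_value.sort_by(&:to_s)
  else
    old_value == new_value
  end
end

# Return only the scraped fields whose values really differ from the
# existing record.
def changed_fields(existing, scraped)
  scraped.reject { |field, value| same_value?(existing[field], value) }
end
```

An alternative with the same effect would be to sort such arrays in the scraper before submitting them, so the order is stable from run to run.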
http://www.elixir-czech.cz/events/workshops-and-courses
They have Google Calendar links
Needs to handle cases where metadata isn't available, or is in a different form to what was expected.
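As a sketch of what that could look like for list-type fields, the helper below accepts a value that is missing, a single string, a comma-separated string or an array, and always returns a clean array. The helper name is hypothetical, and treating commas as separators is an assumption that will not suit every source.

```ruby
# Normalise a metadata value that may be nil, a (comma-separated) string,
# or already an array, into an array of non-empty, de-duplicated strings.
def normalise_list(value)
  items = case value
          when nil then []
          when Array then value
          else value.to_s.split(',')
          end
  items.map { |item| item.to_s.strip }.reject(&:empty?).uniq
end
```

Calling code can then iterate over the result without first checking for nil.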
Cross issue with: ElixirTeSS/TeSS#360
Integrating TeSS workflows in bio.tools. The bio.tools team wants a read-only workflow viewer in their registry. I said we'd do it as we know the stuff well. They're using git in a private repo, so I am waiting to get access. For the development of this:
phase 0: Add EDAM scientific topics to TeSS workflows. We'll have to extend this to include EDAM operations too at some point.
phase 1: They'll store some workflows in the original Cytoscape JSON format on their server. We'll add the Cytoscape/TeSS-workflow jQuery library to their codebase to render it.
phase 2: Have them read the workflow from our API so it renders the latest version
From a user regarding the SIB scraper:
Now a "last" request... could Switzerland appear in the Country drop down menu on the Events left hand side? :)
(hummm, I think Wageningen is not a country...).
Uses reveal https://github.com/bgruening/training-material
BiVi - Bioinformatics Visualization - have made some RSS feeds for us. These are the three RSS feeds listed below: one event, one material, and one to be ignored for now as it seems more to be tools than TeSS content. The <description></description> element has encoded some attributes within the text in the format
#field: <value>
So these will need to be regexed out, and some are lists (e.g. keywords) so will need to be split by commas.
Ignore: http://bivi.co/visualisation-feed
Materials: http://bivi.co/presentation-feed
Events: http://bivi.co/event-feed
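A minimal sketch of pulling the attributes out of a description, assuming each one looks like "#field: value" and runs until the next "#" or the end of the text. The set of list fields and the sample field names are assumptions, and a value that itself contains "#" would be cut short.

```ruby
# Fields whose values are comma-separated lists (assumed; extend as needed).
LIST_FIELDS = %w[keywords].freeze

# Extract "#field: value" attributes from a feed description into a hash.
def parse_description(description)
  description.scan(/#([\w-]+):\s*([^#]*)/).each_with_object({}) do |(field, raw), attrs|
    value = raw.strip
    attrs[field] = if LIST_FIELDS.include?(field)
                     value.split(',').map(&:strip).reject(&:empty?)
                   else
                     value
                   end
  end
end
```

If the description text is HTML-escaped in the feed, it would need unescaping before being passed to this helper.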
Text for content provider should be
About BiVi
The Biological Visualisation Network (BiVi) provides a forum for dissemination, training and discussion for life-scientists to discover and promote complex data visualisation ideas and solutions. BiVi, funded by the BBSRC, is a central resource for information on bio-visualisation and is supplemented with annual meetings for networking and educational purposes, focussed around emerging trends in visualisation and challenges facing biology.
The Hub said our scraper has not been retrieving all of the events from here https://www.elixir-europe.org/events
Software Carpentry has an ICS file of all training events. It should be easy enough to write a scraper for http://software-carpentry.org/workshops.ics
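A rough starting point for reading such a file in plain Ruby. This sketch unfolds continuation lines and collects the properties of each VEVENT into a hash, but it ignores time zones, escaped characters and recurrence rules, so a dedicated iCalendar library may be the better choice for a real scraper. Fetching the file is left out.

```ruby
# Minimal ICS reader: returns one hash of property name => value per VEVENT.
def parse_ics_events(ics_text)
  # Unfold continuation lines (a line break followed by a space or tab).
  unfolded = ics_text.gsub(/\r?\n[ \t]/, '')
  events = []
  current = nil
  unfolded.each_line do |line|
    line = line.strip
    case line
    when 'BEGIN:VEVENT'
      current = {}
    when 'END:VEVENT'
      events << current if current
      current = nil
    else
      next unless current              # skip lines outside an event
      name, value = line.split(':', 2)
      next if value.nil?
      key = name.split(';').first      # drop parameters such as ;VALUE=DATE
      current[key] = value
    end
  end
  events
end
```

Each resulting hash can then be mapped to an event record, for example SUMMARY to the title and DTSTART to the start date.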
Need to check the query being used is sensible
undefined method `include?' for nil:NilClass
/home/tess/TeSS_scrapers/app/scrapers/data_carpentry_scraper.rb:47:in `block in scrape'
TeSS itself is using Nominatim for Geocoder lookups:
The scrapers should perhaps be updated to operate similarly.
Should just be the terms
Also, scientific_topic_names are now missing
Relates to ElixirTeSS/TeSS#123
Parse NBIS. It's the Swedish node's stuff. https://www.googleapis.com/calendar/v3/calendars/bils.elixir%40gmail.com/events?key=AIzaSyA7tQAGCL4d8mNBSUZRBhedexrswhzgY6s&orderBy=startTime&singleEvents=true
Look back into making a pull request for adding schema.org to the Software Carpentry template. Contact jduckles
If for some reason a scraper fails to run, send out an e-mail informing the TeSS info mailing list / Niall
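A sketch of how failures could be collected and turned into a message body. It assumes each scraper object responds to scrape; the helper names are hypothetical, and actually sending the mail is left out because it depends on how mail is configured on the server.

```ruby
# Run every scraper, recording the error message of any that fail.
# Returns a hash of scraper class name => error message.
def run_scrapers(scrapers)
  scrapers.each_with_object({}) do |scraper, failures|
    begin
      scraper.scrape
    rescue StandardError => e
      failures[scraper.class.name] = e.message
    end
  end
end

# Build the e-mail body, or nil when nothing failed.
def failure_report(failures)
  return nil if failures.empty?

  lines = ['It would seem that the following scrapers have failed to run:']
  failures.each { |name, message| lines << "#{name}: #{message}" }
  lines.join("\n")
end
```

Rescuing per scraper means one failure does not stop the remaining scrapers from running.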
Scraper hasn't run for ages.
Need to filter by tag that says 'tess'.
Archive feature on the website. Click on an event that has been and gone. There's a field at the bottom called 'course materials', which gives you a link to the course material
undefined method `address_components' for nil:NilClass
/home/tess/TeSS_scrapers/app/scrapers/software_carpentry_events_scraper.rb:38:in `block (2 levels) in scrape'
/home/tess/TeSS_scrapers/app/scrapers/software_carpentry_events_scraper.rb:29:in `each'
Hasn't run on production for 4 weeks and is potentially missing some fields, e.g.
old: short_description: material['schema:about'],
new: short_description: material['http://schema.org/description'],
There are a huge number!
Training events from Cambridge need to be added. No API or possibility of better structured data. It'll have to be HTML :(
Looks nicely formatted!
http://www.prace-ri.eu/upcoming-patc-events/
Separate out our schema.org extraction functionality into a separate repository in bioschemas. This will allow people to extend it for all sorts of schema types and use it in the creation of new tools.
IFB Scraper
Khan Academy
CSC Events
SIB Scraper