ropensci / fishbaseapi


Fishbase API

Home Page: https://fishbaseapi.readme.io/

License: MIT License

Ruby 54.11% Shell 32.69% R 1.75% HTML 11.08% Dockerfile 0.37%

rest-api sinatra database unicorn caddy-server

fishbaseapi's Introduction

FishBase API

Update: The Ruby-based FishBase API with custom endpoints has been deprecated.

Fishbase and Sealifebase data can now be accessed programmatically using a standard S3 API at the following endpoints:

https://fishbase.ropensci.org/fishbase
https://fishbase.ropensci.org/sealifebase

These endpoints are provided by the open source MinIO server, which conforms to the current (v4) AWS S3 REST API. This supports direct REST queries as well as any of the many well-maintained client packages and tools, including the MinIO client, Python boto, Apache Arrow, etc.
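Because the endpoints speak the S3 REST API, listing a bucket comes down to an HTTP GET on the bucket URL with the standard ListObjectsV2 query parameters. The sketch below only builds such a URL with the Ruby standard library. It assumes path-style addressing (the bucket name is the last path segment of the endpoints above) and that the bucket allows anonymous listing, neither of which is confirmed here.

```ruby
require 'uri'

# Host taken from the endpoints listed above.
ENDPOINT = 'https://fishbase.ropensci.org'

# Build a path-style S3 ListObjectsV2 URL for a bucket.
# `prefix` and `max_keys` map to the standard S3 query parameters
# `prefix` and `max-keys`; both are optional.
def list_objects_url(bucket, prefix: nil, max_keys: nil)
  query = { 'list-type' => 2 }
  query['prefix'] = prefix if prefix
  query['max-keys'] = max_keys if max_keys
  "#{ENDPOINT}/#{bucket}?#{URI.encode_www_form(query)}"
end

puts list_objects_url('fishbase', max_keys: 10)
```

If anonymous listing is permitted, fetching that URL (for example with Net::HTTP.get) should return an XML listing of object keys; otherwise a signed request through an S3 client is needed.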


fishbaseapi's Issues

get SpecCode from species name (or higher taxonomy)

@sckott Do we have a way to get the FishBase SpecCode for a species, given the species name? Or, similarly, a list of all SpecCodes for species matching a queried genus, family, etc.?

DO box

@cboettig

So we currently have discuss.ropensci.org on a 1 GB droplet. A few thoughts:

Will we run into any trouble serving the API from the same server as the discussion forum, since discuss.ropensci.org is mapped to that server's IP address and, I think, port 80? Can we map <server.ip.address>:80 to a domain name, e.g., fishbaseapi.org, exposing only the FishBase API and not the forum?

Box size: currently on a 1 GB droplet at $10/mo. If we decide we need more room, the next one up is 2 GB at $20/mo. I guess we should decide what resources we'll need first, then adjust the box size accordingly. The discussion forum, I think, uses very few resources right now since there is little activity.

Provide full URL to images

Only the file name is provided now, e.g.,

http://fishbase.ropensci.org/species?genus=Abalistes&species=filamentosus

Only gives "PicPreferredName": "Abfil_u0.jpg"

Base url: http://www.fishbase.org/images/species

e.g., full image urls
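A minimal sketch of how a record could carry the full URL alongside the file name, assuming images live directly under the base URL quoted above. The output field name PicPreferredURL is hypothetical, not part of the API.

```ruby
# Base URL quoted in the issue above.
IMAGE_BASE_URL = 'http://www.fishbase.org/images/species'

# Add a full image URL next to the bare file name in a species record.
# Records without a picture are returned unchanged.
# 'PicPreferredURL' is a made-up field name for illustration.
def with_image_url(record)
  name = record['PicPreferredName']
  return record if name.nil? || name.empty?
  record.merge('PicPreferredURL' => "#{IMAGE_BASE_URL}/#{name}")
end

p with_image_url('PicPreferredName' => 'Abfil_u0.jpg')
```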

nginx / login-only settings for elasticsearch and kibana?

Do we want secure credentials to be required for login access to the elasticsearch api?

I think the best way to do this would be an nginx layer with a CA certificate, since that should allow the API to work under programmatic calls without the user having to click or authenticate anything, as long as they had the secure CA certificate installed on their machine.

Add a function to ping / reconnect the mysql server?

Hey @sckott ,

I'm having some stability problems with the mysql server still. It seems that often it is sufficient to ping the server to re-establish the connection. Could we add an endpoint that would just ping the server? (Would at least be handy for debugging). Might need to explore some more tricks to babysit the mysql server too.

Redis thoughts

We may at various points need to flush the redis cache, etc. so need to look into how to make this easy since it's running in a docker container
We may want to set an expiration time for caching in redis, e.g., items are only cached for 24 hrs, or 1 week? or no expiration?
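To make the expiration question concrete, here is a small in-memory stand-in for a cache with a per-key time to live. It is only an illustration of the semantics: with Redis itself the expiry would be applied server-side (for example via the SETEX command), and the class name and injected clock are assumptions made for this sketch.

```ruby
# In-memory stand-in for a cache with per-key expiration.
# The clock is injectable so expiry can be exercised without waiting.
class TtlCache
  def initialize(ttl_seconds, clock: -> { Time.now.to_f })
    @ttl = ttl_seconds
    @clock = clock
    @store = {}
  end

  def set(key, value)
    @store[key] = [value, @clock.call + @ttl]
  end

  # Returns nil for a missing key or once the entry has expired.
  def get(key)
    value, expires_at = @store[key]
    return nil if expires_at.nil?
    if @clock.call >= expires_at
      @store.delete(key)
      return nil
    end
    value
  end

  # Drops every entry, similar in spirit to flushing the cache.
  def flush
    @store.clear
  end
end
```

For the flush itself, running redis-cli inside the container (something like docker exec fbredis redis-cli FLUSHALL, assuming the image ships redis-cli) is probably the simplest route.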

Search for term anywhere in database

@sckott I'm wondering if we can construct an endpoint to search for a term anywhere in the database.

Given how disorganized the FishBase SQL is, it can be pretty hard to know which table to find something in (e.g. min/max temp, ropensci/rfishbase#47, that several people have requested recently -- it must be there somewhere since it's on the species summary pages).

Haven't found a great solution for doing this in SQL, but there's a few ideas:

The information_schema answer in the first link looks promising. Let me know if you get a chance to take a whack at this?
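A rough sketch of the information_schema idea: first list the text columns, then generate one search query per column. The helper below only builds the SQL strings. The function name and the example column are hypothetical, and the search term is left as a ? placeholder to be bound through a prepared statement.

```ruby
# Build one COUNT query per text column, given [table, column] pairs.
# The pairs would come from a query such as:
#   SELECT table_name, column_name FROM information_schema.columns
#   WHERE table_schema = ? AND data_type IN ('char', 'varchar', 'text')
def search_queries(columns)
  columns.map do |table, column|
    # Backtick-quote identifiers, doubling any embedded backtick.
    t = "`#{table.gsub('`', '``')}`"
    c = "`#{column.gsub('`', '``')}`"
    "SELECT COUNT(*) AS hits FROM #{t} WHERE #{c} LIKE ?"
  end
end

# 'Comments' is a made-up column name for illustration.
puts search_queries([%w[species Comments]])
```

Running every generated query with the term wrapped in % wildcards and keeping those with hits > 0 would show which tables mention it. On a large database this is slow, so it is better suited to exploration than to a public endpoint.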

Use Watcher?

https://www.elastic.co/products/watcher

watches and gives notifications on elasticsearch server

DRY out the API script a bit?

@sckott Now that it seems like we've somewhat pinned down a standard pattern for defining endpoints, would it be worth functionalizing the API a bit more? (I think you suggested this a bit earlier.) I'm getting to the point where it would be convenient to add a bunch more endpoints, but I don't want to do a bunch of copy-paste that will make more work for you later, so I thought I'd touch base on this first. I'm happy either way, just want to follow your lead here.
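One possible shape for this, sketched under assumptions: a table of endpoints plus a single loop that registers a route for each. The ENDPOINTS contents, register_endpoints, and the handler callable are all hypothetical names; app is anything that responds to get(pattern, &block), such as a Sinatra::Base subclass, and the handler stands in for the existing cache lookup, query, and give_data steps.

```ruby
# Hypothetical endpoint table: route name => SQL table and id column.
ENDPOINTS = {
  'species' => { table: 'species', id: 'SpecCode' },
  'genera'  => { table: 'genera',  id: 'GenCode' }
}.freeze

# Register one GET route per entry, using the same route pattern
# as the hand-written endpoints.
def register_endpoints(app, endpoints, handler)
  endpoints.each do |name, cfg|
    app.get("/#{name}/?:id?/?") do
      # In Sinatra this block runs in request scope, so `params` is available.
      handler.call(name, cfg[:table], cfg[:id], params)
    end
  end
end
```

Adding an endpoint then means adding one line to the table rather than copying a route body.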

More recent fishbase dump?

@cboettig At some point we should get a more recent dump to make sure we aren't developing against old table/field names.

error handling / recovery wrt redis

Should we handle failures of get_cached with a rescue clause that would just repeat the call against the database? E.g., I think something like this would fall back successfully if either redis call failed.

get '/species/?:id?/?' do
  key = rediskey('species', params)
  begin
    # Serve from the redis cache when the key is present.
    if redis_exists(key)
      obj = get_cached(key)
    else
      obj = get_new_ids(client, key, 'species', 'SpecCode', params)
    end
  rescue
    # If either redis call fails, fall back to querying the database.
    obj = get_new_ids(client, key, 'species', 'SpecCode', params)
  end
  return give_data(obj)
end

(related to #23)

moving log file location

Hey @sckott ,

Trying to get the logstash to work with kibana URL being configured at runtime instead of the crazy solution in #25. I should just be able to pass in the name of the server as the env var ES_HOST when running the logstash container, but for some reason doing so is causing the logstash container to crash.

Apparently the problem may be due to writing logs (and thus having to link volume of) /root (which is ~ since api runs as root on the docker container). See: pblittle/docker-logstash#56

I tried to switch the location of the logfile to /var/log/fishbase/api.log but it does not seem to be writing logs now, even though it is creating the file: (see https://github.com/ropensci/fishbaseapi/blob/master/api.rb#L13 and https://github.com/ropensci/fishbaseapi/blob/master/api.rb#L100). And for some reason, linking this volume instead still causes a crash.

Stability / Error handling

Might be good for the app to be robust to any of the non-essential components going down or not responding? This would also allow a user to deploy without things like the redis and/or logstash containers if they'd prefer.

a recovery option to skip or handle a failed attempt to connect to redis, which would also need to trip some kind of flag to avoid any later calls to write to or read from the cache, I think.
similarly for logging, though I suppose logs could be written locally
might be good to be able to turn off logging completely?
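The flag idea for redis could look roughly like the wrapper below: the first failure disables caching for the rest of the process, and every request still gets its data from the block. OptionalCache and the get/set backend interface are assumptions for this sketch, not the API's existing helpers.

```ruby
# Wraps a cache backend so that the first failure disables caching
# instead of failing every later request.
class OptionalCache
  attr_reader :enabled

  # `backend` is any object with get(key) and set(key, value),
  # or nil to run without a cache at all.
  def initialize(backend)
    @backend = backend
    @enabled = !backend.nil?
  end

  # Return the cached value, or compute it with the block on a miss
  # or when the cache is unavailable.
  def fetch(key)
    if @enabled
      begin
        cached = @backend.get(key)
        return cached unless cached.nil?
      rescue StandardError
        @enabled = false # trip the flag; skip the cache from now on
      end
    end
    value = yield
    store(key, value)
    value
  end

  private

  def store(key, value)
    @backend.set(key, value) if @enabled
  rescue StandardError
    @enabled = false
  end
end
```

Passing nil as the backend covers the case of deploying without a redis container. A real version would likely rescue only the redis client's connection errors and perhaps retry after a while rather than staying disabled.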

Explore halt to throw good error info

http://myronmars.to/n/dev-blog/2012/01/why-sinatras-halt-is-awesome

Kibana needs manual configuration

docker exec -ti fblogstash bash
apt-get update && apt-get install -y vim-tiny
cd /opt/logstash/vendor/kibana
vi config.js

Find the line that has the server address as http://127.0.0.1 and change to match your external server address.

unit testing

Might be worth thinking if there's any kind of testing we can run here?

Might look more like queries against the testing API instance than actually deploying the API here (though I suppose the tests on rfishbase2.0 rather fulfill that role already). Or perhaps we could have a dummy SQL database with no real data, for testing purposes only.

The docker calls can all be run on CircleCI.

Prevent SQL injection

Inline/API-based documentation of endpoints?

Hey @sckott,

I keep feeling it would be nice to have some inline documentation of the endpoints, and I'm wondering if just editing the heartbeat list directly is the best way to do this. It's nice to get a table of endpoints and all, but at minimum it would be useful to have a description of what each endpoint does (particularly since using the FishBase SQL table names helps in programming, but makes it all the harder to know what an endpoint with a name like intrcase actually returns). Thoughts?

Which tables need api routes?

@cboettig curious what tables you think would be good to have routes on? or are you mostly interested in replicating what the package already does, and the routes are secondary?

ecosystems route seems to not exist, or be broken somehow

/ecosystems/ gives a 500 server error, which is a catch all, so not sure what's wrong yet

delete merged branches that we're done with (unicorn, logging, ...)

Just thought I should check in about this first

prevent all http methods that are not GET

or maybe allow HEAD in addition to GET?
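Since Sinatra runs on Rack, one way to do this is a small middleware that answers 405 for anything other than GET and HEAD. This is a sketch; the class name and the JSON error body are assumptions. It follows the plain Rack calling convention (call(env) returning status, headers, body), so it does not depend on any gem.

```ruby
# Rack middleware that rejects every method except GET and HEAD.
class ReadOnly
  ALLOWED = %w[GET HEAD].freeze

  def initialize(app)
    @app = app
  end

  def call(env)
    return @app.call(env) if ALLOWED.include?(env['REQUEST_METHOD'])

    body = '{"error":"method not allowed"}'
    headers = {
      'content-type' => 'application/json',
      'allow' => ALLOWED.join(', '), # tells clients which methods work
      'content-length' => body.bytesize.to_s
    }
    [405, headers, [body]]
  end
end
```

It would be enabled with use ReadOnly in the Sinatra app or in config.ru.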

JSON-API standard-ish

@cboettig http://jsonapi.org/ interesting - if there's something that is getting adoption, we could consider modifying our responses

testing

http://www.sinatrarb.com/intro.html#Testing

handle reconnecting to mysql if container goes down

Currently if the mysql container exits, I need to bring everything down and restart. We should just be able to (auto) restart the mysql container.

e.g.

deploy on server: ./docker.sh. Then docker ps shows all 5 containers running, and we can hit /mysqlping successfully.
Do: docker stop fbmysql, simulates the sql container going down. /mysqlping is now false.
Restart it: docker start fbmysql. Container comes back up, but /mysqlping is still false.

Currently I need to take everything down and restart to recover:

docker rm -f fbapi fbmysql fbredis fblogstash fbnginx
./docker.sh

If we resolve this, then we should be able to also add auto-restarting for the mysql docker container and get a more stable system. Again not sure why it goes down as frequently as it does, though could be the limited resources on my test server.
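The behaviour described above suggests the API keeps using a client object whose connection died with the old container. A sketch of one way around that: hold the client behind a wrapper that discards it on failure and builds a fresh one before retrying. ReconnectingClient and the connect callable are hypothetical; in a real version the rescue would be narrowed to the database driver's error class.

```ruby
# Runs a block against a database client, reconnecting after a failure.
# `connect` is a callable that returns a fresh client.
class ReconnectingClient
  def initialize(connect, retries: 1)
    @connect = connect
    @retries = retries
    @client = nil
  end

  def with_client
    attempts = 0
    begin
      @client ||= @connect.call
      yield @client
    rescue StandardError
      @client = nil # drop the dead connection
      attempts += 1
      retry if attempts <= @retries
      raise
    end
  end
end
```

A /mysqlping route could then call with_client { |c| c.ping } and recover once the container is back. The mysql2 gem also documents a reconnect option on the client, which may be worth trying first.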

mangled json responses

Hey @sckott ,

Occasionally I will get the following error message (e.g. on a large call such as: "http://fishbaseapi.info/taxa?family=&limit=40000", though it happens occasionally on smaller calls as well)

Error in parseJSON(txt) : parse error: trailing garbage
           "Actinopterygii"     }   ] }HTTP/1.1 500 Internal Server Er
                     (right here) ------^

If you put that link into a browser you will probably get the same thing -- a long field of valid JSON that suddenly terminates with the HTTP error message. Because the header is intact, httr etc see the return as response code 200 and continue, and the error doesn't occur until the JSON parser attempts to parse the content and freaks out.

Any idea why this error is ending up in the JSON output like this? Or ideas on how to avoid it?

Taxonomy endpoint

to get taxonomic data only,

Pagination

something to start with

https://github.com/mislav/will_paginate
https://github.com/deepfryed/sinatra-paginate

In SQL we can simply query with ... limit 10 offset 20 to get a certain range of results.
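As a sketch of the arithmetic, a page number and page size translate into LIMIT and OFFSET like this. The function name and the 1-based page convention are assumptions; both inputs are coerced to integers, so the clause is safe to interpolate.

```ruby
# Translate 1-based page / per_page values into a LIMIT/OFFSET clause.
def page_clause(page, per_page)
  page = [page.to_i, 1].max
  per_page = [per_page.to_i, 1].max
  offset = (page - 1) * per_page
  "LIMIT #{per_page} OFFSET #{offset}"
end

puts page_clause(3, 10) # LIMIT 10 OFFSET 20
```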

HTTP caching

http://www.sinatrarb.com/intro.html#Cache%20Control
http://www.w3.org/Protocols/rfc2616/rfc2616-sec13.html
https://devcenter.heroku.com/articles/increasing-application-performance-with-http-cache-headers

As far as I can tell, this only helps when in the browser, but maybe also helps with non-browser access?
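HTTP caching is not browser-only: any client or proxy that sends conditional headers such as If-None-Match can benefit. A minimal sketch of the server-side decision, using an MD5 digest of the body as the ETag; the function names are assumptions, and Sinatra's own etag and cache_control helpers would normally do this work.

```ruby
require 'digest'

# Compute a quoted ETag value for a response body.
def etag_for(body)
  %("#{Digest::MD5.hexdigest(body)}")
end

# Decide the status for a conditional GET: 304 when the client's
# If-None-Match header already names the current ETag (or is "*"),
# otherwise 200.
def conditional_status(body, if_none_match)
  return 200 if if_none_match.nil?

  tags = if_none_match.split(',').map(&:strip)
  (tags.include?(etag_for(body)) || tags.include?('*')) ? 304 : 200
end
```

A non-browser client that stores the ETag from a previous response and sends it back would get an empty 304 instead of the full JSON body when nothing has changed.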

Use stats/logging

Just thinking it might be valuable to be able to compute some basic metrics of use for the FishBase team to monitor traffic; e.g. which endpoints are getting hit, how often, perhaps from what geographic areas. Have you looked into this?

Query-able fields?

@cboettig what fields should users be allowed to query? since you're more familiar with the data...

Right now, I've set it up so that users can query on any field. Do you think this is best? Or do you think only certain fields should be exposed to query on? If we do limit which fields can be queried, that makes it a little easier to solve #7 because we don't have to account for every field being queried.

Note that this is different from what gets returned (all fields, unless there's some reason not to).

Elasticsearch fail behavior

Wonder if there's a way to not lose old indices (each day's collection of logs) when the container crashes. E.g., today @cboettig you put it back up, but we only have today's index. However, it's weird, cause it had data only from back in April, and we know there's been requests since then

Set maximum value for limit parameter

to 5000 for now
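A sketch of the clamp, assuming the raw value arrives as a string from the query parameters. The default of 10 is hypothetical and not taken from the API.

```ruby
MAX_LIMIT = 5000   # cap proposed above
DEFAULT_LIMIT = 10 # hypothetical default, not taken from the source

# Coerce a raw `limit` query parameter into an integer in 1..MAX_LIMIT.
def clamp_limit(raw)
  return DEFAULT_LIMIT if raw.nil? || raw.to_s.strip.empty?

  raw.to_i.clamp(1, MAX_LIMIT)
end

puts clamp_limit('40000') # 5000
```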

Route on api that redirects to the documentation

E.g., Martin has Lagotto setup so that alm.plos.org/api redirects to the docs (see http://alm.plos.org/api)

We could have e.g., fishbaseapi.info/docs redirect to http://docs.fishbaseapi.apiary.io with a simple

get '/docs' do
    redirect 'http://docs.fishbaseapi.apiary.io'
end

passenger info

not saying we should switch, but just collecting thoughts since martin recommended it:

Check parameters

Some routes will fail when a parameter is passed that is not a field in the table being queried. Right now we use check_fields() to make sure the user isn't requesting a field that doesn't exist, but we should do similarly for fields queried on.
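A sketch of the same check applied to the fields being queried on: split the incoming parameters into filters on known columns and names that match nothing. The function name and the CONTROL_PARAMS list are assumptions, and table_columns stands for whatever column list check_fields() already uses.

```ruby
# Parameters that control the query rather than filter on a column.
# This list is an assumption for illustration.
CONTROL_PARAMS = %w[limit offset fields].freeze

# Split query parameters into filters on known columns and unknown names.
def split_filters(params, table_columns)
  filters = params.reject { |k, _| CONTROL_PARAMS.include?(k) }
  known, unknown = filters.partition { |k, _| table_columns.include?(k) }
  [known.to_h, unknown.map(&:first)]
end

known, unknown = split_filters(
  { 'genus' => 'Abalistes', 'swisscheese' => '1', 'limit' => '5' },
  %w[genus species]
)
p known   # {"genus"=>"Abalistes"}
p unknown # ["swisscheese"]
```

The route can then either return an error listing the unknown names or drop them silently, depending on which behaviour is preferred.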

Docker script

@cboettig on trying to set this up on a DO droplet, lines in docker.sh work great, but then on trying these lines https://github.com/ropensci/fishbaseapi/blob/master/docker.sh#L22-L31 I get FATA[0000] cannot enable tty mode on non tty input

I did scp the sql database into the droplet, so it's there and named correctly. I'm probably missing something easy...

Devise plan for shuttling elasticsearch log data elsewhere

Since security is an issue with elasticsearch, we could use ES's snapshot and restore tools http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/modules-snapshots.html that should allow us to create backups that we can easily scp or similar to our own machines, load into ES locally, and explore...

Fields that do not exist/spelled wrong - seems like most APIs I've played with silently ignore fields that don't exist - if we do this, we have to compare to fields that exist in a table, and drop those that don't match...
path doesn't exist (e.g., user calls /swisscheese) -> throw 404

curl -v http://fishbaseapi.info/genera56

and returns genus code 56.


{
  "count": 1,
  "returned": 1,
  "error": null,
  "data": [
    {
      "GenCode": 56,
      "GenName": "Euprotomicrus",
      "GenAuthorYear": "Gill, 1865",
      "GenAuthor": "Gill",
      "GenYear": 1865,
...cutoff

I imagine this is because the trailing slash is allowed to be absent (see https://github.com/ropensci/fishbaseapi/blob/master/api.rb#L203), but really this should throw a 404.
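To illustrate why /genera56 matches, here are two plain regular expressions: a rough analogue of a permissive pattern like '/genera/?:id?/?' (an approximation, not Sinatra's actual compiled route) and a stricter variant that only accepts an id after a slash.

```ruby
# Rough regex analogue of the permissive pattern: the slash before the
# id is optional, so "/genera56" matches with id "56".
LOOSE = %r{\A/genera/?([^/]+)?/?\z}

# Stricter variant: an id is only accepted when preceded by a slash.
STRICT = %r{\A/genera(?:/([^/]+))?/?\z}

%w[/genera /genera/ /genera/56 /genera56].each do |path|
  puts format('%-12s loose=%-5s strict=%s', path,
              LOOSE.match?(path), STRICT.match?(path))
end
```

In Sinatra itself, one option would be to define the collection and the single-record routes separately (for example '/genera/?' and '/genera/:id/?'), so that /genera56 falls through to the 404 handler.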
