The annotation of biological data and not only. (Wikidata source)

All scripts were developed for python3.x for Windows users (Other platforms are not verified) ###Repo contains 3 scripts:

find_file_on_wikidata.py

Script is mainly responsible for reading files which contains information that has to be found, and write the output to another file.

All configurations that script is using can be found in config.json

Main configurations and paths that have to be specified in config.json:

"classify_entities"

If you don't know the instances of each column, you would like to get it, and to write the id of the instances of to the config file. To do this, you have to put "True", and afterward, you have to specify the:

"get_possible_output_method"
There are 2 methods, by which we can define columns:

Method is searching the instaces_of separately each column, and counting only values, where in search output is only one element.
Config file looks like:

  "get_possible_output": "True",
  "get_possible_output_method": 2,
  ...

The output will be e.g.:

  {   0: [('Q7187', 486)],
      1: [('Q8054', 318), ('Q66826848', 3), ('Q417841', 1)],
      2: [('Q2996394', 447), ('Q4915012', 2), ('Q30612', 1), ('Q210973', 1), ('Q20732156', 1)],
      3: [('Q14860489', 437), ('Q67015883', 1)],
      4: [('Q5058355', 59), ('Q78155096', 6), ('Q67015883', 1), ('Q67101749', 1)]
  }

"search_after_classification"
write "True" if you want to search after classification and based on classification

Method is searching all instances_of of all related items, and then counts them.
Config file looks like:

  "get_possible_output": "True",
  "get_possible_output_method": 1,
  ...

The output will be e.g.:

          ### numb: 1
          # Q8054 --> 58.33%
          # Q7187 --> 41.67%
          ### numb: 2
          # Q7187 --> 25.0%
          # Q8054 --> 75.0%
          ### numb: 3
          # Q2996394 --> 66.67%
          #  --> 33.33%
          ### numb: 4
          #  --> 83.33%
          # Q14860489 --> 16.67%
          ### numb: 5
          # Q5058355 --> 75.0%
          #  --> 25.0%

"file_in"

The Path to the file with input data
example input file (input_file):

SCO3114,SCO3114,protein transport,,integral component of membrane
SPy_0779,SPy0779,,,
Serine transporter BC3398,serine transporter BC3398,amino acid transmembrane transport,,integral component of membrane
kdsD,PP_0957,carbohydrate metabolic process,metal ion binding,
smi_1464,smi_1464,,,integral component of membrane

The input data can be whatever it is, but every item has to be connected with at least one item.

"file_out"

Output of the script is file with links:
file_example (output_file):

http://www.wikidata.org/entity/Q23284357,http://www.wikidata.org/entity/Q27750183,http://www.wikidata.org/entity/Q14860325,,http://www.wikidata.org/entity/Q14327652
http://www.wikidata.org/entity/Q23235634,http://www.wikidata.org/entity/Q23497168,,,
http://www.wikidata.org/entity/Q23196205,http://www.wikidata.org/entity/Q23514357,http://www.wikidata.org/entity/Q14905294,,http://www.wikidata.org/entity/Q14327652
http://www.wikidata.org/entity/Q22311550,http://www.wikidata.org/entity/Q22318912,http://www.wikidata.org/entity/Q2734081,http://www.wikidata.org/entity/Q13667380,
http://www.wikidata.org/entity/Q23235886,http://www.wikidata.org/entity/Q23548717,,,http://www.wikidata.org/entity/Q14327652

"row_number_to_check"

Specify, how many rows do we want to check in the file

"multiprocessing_number"

Specify, how many cores do we want to use to fined related items

"api_search_quantity"

Specify, how many items we want to search with wikidata API, you have to specify value between 1 and 50. The bigger value, better result, but slower it will go.

"data_instances"

Specify, the data instaces for all the clumns, if you know. If you don't know, you can run program with get_possible_output: "True", and find this instaces_of.

It has to look like:

    {
    ... ,
    "data_instances": ["Q7187", "Q8054", "Q2996394", "Q14860489", "Q5058355"]
}

Or you can specify it None (if you don't know it)

    {
    ... ,
    "data_instances": "None"
}

find_single_row_on_wikidata.py

Script to find linked data in wikidata. Script uses library: requests and pywikibot to find and download data on wikidata portal.

To run the script first, you have to call the class "FindWikiPage":

data_instances = ['Q7187', 'Q8054', 'Q2996394', 'Q14860489', 'Q5058355']
find_wiki = FindWikiPage(instances=data_instances, api_search_quantity=30)

where instances are the instance_of the item in the wikidata (If not known, write None)
api_search_quantity is number of items we want to search in wikidata.

Main Methods in FindWikiPage are:

search(data)

Method for searching information about strings, but returning nothing, just remembering information that was found data - is a list of strings that are related, that we want to find

get_answer()

Method returning ids, that has been found

search_and_get(data)

Method is a combination of 2 methods above

get_list_of_possible_answers(with_instances=True)

Method returning all possible answers for this data If yo would like to get all possible answers with instance_of this items, you would have to specify with_instances=True, otherwise: with_instances=False

Some examples:

First of all call the class "FindWikiPage":

data_instances = ['Q7187', 'Q8054', 'Q2996394', 'Q14860489', 'Q5058355']
find_wiki = FindWikiPage(instances=data_instances, api_search_quantity=15)

Search and get_answers the ids:

data1 = ['SCO3114', 'protein transport', '', 'SCO3114', 'integral component of membrane']

find_wiki.search(data1)
output = find_wiki.get_answer()
print(output)

The same result will be by using search_and_get() The return is:

['Q23284357', 'Q27750183', 'Q14860325', '', 'Q14327652']

Get all possible answers with instances:

    find_wiki.search(data1)
    possible_answers = find_wiki.get_list_of_possible_answers(with_instances=True)
    print("Possible answers are:")
    pprint.pprint(possible_answers)

output:

Possible answers are:
[{'instances': ['Q8054', 'Q2996394', '', 'Q7187', 'Q5058355'],
  'items': ['Q27750183', 'Q14860325', '', 'Q23284357', 'Q14327652']},
 {'instances': ['Q7187', 'Q2996394', '', 'Q8054', 'Q5058355'],
  'items': ['Q23284357', 'Q14860325', '', 'Q27750183', 'Q14327652']},
 {'instances': ['Q8054', 'Q2996394', '', 'Q8054', 'Q5058355'],
  'items': ['Q27750183', 'Q14860325', '', 'Q27750183', 'Q14327652']}]

check_two_csvs.py

check_two_csvs(results_path,ground_truth_path)
For given results and ground truth paths, gives a weighted accuracy of the results based on the ground truth. The scoring method is not done per cell, but per row. This method is adjusted to the annotation project where the first two columns are more important than the last three.

To run this program do this:

Open check_two_csvs.py
Change to your path results_path and ground_truth_path this rows:

    check_two_csvs(results_path,
                   ground_truth_path)

Save and Run program

leoningel / annotation-of-biological-data Goto Github PK

annotation-of-biological-data's Introduction

The annotation of biological data and not only. (Wikidata source)

find_file_on_wikidata.py

Main configurations and paths that have to be specified in config.json:

find_single_row_on_wikidata.py

Some examples:

check_two_csvs.py

Recommend Projects

React

Vue.js

Typescript

TensorFlow

Django

Laravel

D3

Recommend Topics

javascript

web

server

Machine learning

Visualization

Game

Recommend Org

Facebook

Microsoft

Google

Alibaba

D3

Tencent