cordis-projects-crawler's Issues

H2020 compatible

Is this crawler compatible with the H2020 file format?
Also, some results briefs are published for FP7 projects.

v2.0

This is the plan for v2.0. I'll update this description as things change, or as new ideas come up here or in the comments below.

Features (based on 1.x)

Crawl project RCNs: single RCN, RCN list, RCN range
Crawl RCNs found in output directory
Crawl all available RCNs
Crawl RCNs of search URL
CSV/TSV export
MySQL export
~~RCN list export + seed?~~ v2.1
Java API for developers
CLI for users
Fancy documentation with Docsify (MySQL docs, extending docs)

Improvements

Use CORDIS XML and OpenAIRE API
Test old and new projects too
~~Unified view - separate ticket? may need help?~~ v2.1

Under the hood

JitPack compatible POM
Complete rewrite in Kotlin
Batch processing pattern (Kotlin sequences)
Parse XML with Simple framework
Modular design - interfaces and IoC framework
Write in batches for better performance
~~database: use an ORM framework (OrmLite? Hibernate? jOOQ?)~~ - stick with plain JDBC
~~database: should remove relation records before inserting new ones~~ - now I don't think it's needed
~~use Spring Boot framework? would simplify config handling and ORM~~ - keep it simple
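
To make the "batch processing" and "write in batches" items a bit more concrete, here is a minimal sketch of the idea. It is written in Java rather than the planned Kotlin, and `Batches` and `forEachBatch` are hypothetical names, not part of the crawler. The writer is just a callback here; the actual export code is an open design question.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.function.Consumer;

// Hypothetical helper, not part of the crawler: groups a stream of items
// into fixed-size lists so an exporter can write them batch by batch.
final class Batches {

    /**
     * Passes items to the writer in lists of at most batchSize elements
     * and returns the number of batches that were handed over.
     */
    static <T> int forEachBatch(Iterator<T> items, int batchSize, Consumer<List<T>> writer) {
        if (batchSize < 1) {
            throw new IllegalArgumentException("batchSize must be positive");
        }
        int batches = 0;
        List<T> buffer = new ArrayList<>(batchSize);
        while (items.hasNext()) {
            buffer.add(items.next());
            if (buffer.size() == batchSize) {
                writer.accept(buffer);
                buffer = new ArrayList<>(batchSize); // start a fresh batch
                batches++;
            }
        }
        if (!buffer.isEmpty()) { // flush the last, possibly smaller batch
            writer.accept(buffer);
            batches++;
        }
        return batches;
    }
}
```

With plain JDBC, the writer callback could add each item to a `PreparedStatement` with `addBatch()` and then call `executeBatch()` once per list. In Kotlin, `Sequence.chunked(size)` would probably replace this helper entirely.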

Add seed: RCN list file

What problem this feature would solve (e.g. who needs it and why)

Someone may collect RCNs by hand in a spreadsheet. Turning that into a -s rcn1,rcn2,... list can be difficult.

How would it solve it

It would be easier to export the RCN column or the whole spreadsheet to a text file, then tell the crawler to read RCNs from there.

How do you imagine using the feature (e.g. CLI configuration, output format)

A seed argument such as -s file.tsv,1 would tell the crawler to read file.tsv as a TSV file and look for RCNs in the first column. Let's use 1-based indexes here.

The column index can be optional, defaulting to 1.

We can accept both *.tsv and *.txt file extensions.
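
A minimal sketch of how such a seed value could be parsed, in Java. `SeedFileSpec` and `parse` are hypothetical names, not existing crawler code. It assumes that a trailing `,N` counts as the column index only when N consists of digits, so file names containing commas still work.

```java
import java.util.Locale;

// Hypothetical holder for a parsed "file.tsv,1" seed value.
final class SeedFileSpec {
    final String path;
    final int column; // 1-based column index, as proposed above

    private SeedFileSpec(String path, int column) {
        this.path = path;
        this.column = column;
    }

    /** Parses "path" or "path,column"; the column defaults to 1. */
    static SeedFileSpec parse(String arg) {
        String path = arg.trim();
        int column = 1; // default when no index is given
        int comma = path.lastIndexOf(',');
        if (comma >= 0) {
            String tail = path.substring(comma + 1).trim();
            if (tail.matches("\\d+")) { // only a numeric tail is a column index
                column = Integer.parseInt(tail);
                path = path.substring(0, comma).trim();
            }
        }
        if (column < 1) {
            throw new IllegalArgumentException("Column index is 1-based: " + column);
        }
        String lower = path.toLowerCase(Locale.ROOT);
        if (!lower.endsWith(".tsv") && !lower.endsWith(".txt")) {
            throw new IllegalArgumentException("Expected a *.tsv or *.txt file: " + path);
        }
        return new SeedFileSpec(path, column);
    }
}
```

For example, `SeedFileSpec.parse("rcns.txt")` would yield the path `rcns.txt` with column 1, while `file.csv` or `file.tsv,0` would be rejected with an `IllegalArgumentException`.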

The reader would iterate over the file line by line and skip non-numeric values.
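
The line-by-line reading could look roughly like this in Java. `RcnColumnReader` and `read` are hypothetical names. The sketch assumes tab-separated cells, treats header rows, blank lines, and short lines as non-numeric and skips them, and accepts only plain digit strings as RCNs.

```java
import java.io.BufferedReader;
import java.io.Reader;
import java.util.ArrayList;
import java.util.List;

// Hypothetical reader for an RCN list file; not existing crawler code.
final class RcnColumnReader {

    /** Collects the numeric values found in the given 1-based column. */
    static List<Long> read(Reader source, int column) {
        if (column < 1) {
            throw new IllegalArgumentException("Column index is 1-based: " + column);
        }
        List<Long> rcns = new ArrayList<>();
        // lines() wraps any I/O failure in an UncheckedIOException
        new BufferedReader(source).lines().forEach(line -> {
            String[] cells = line.split("\t", -1); // -1 keeps trailing empty cells
            if (column > cells.length) {
                return; // the line has no such column
            }
            String cell = cells[column - 1].trim();
            if (cell.matches("\\d{1,18}")) { // digits only, always fits in a long
                rcns.add(Long.parseLong(cell));
            }
        });
        return rcns;
    }
}
```

To read an actual file, a reader from `Files.newBufferedReader(Paths.get("file.tsv"))` could be passed as the source. A plain *.txt file with one RCN per line is simply the single-column case with column 1.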
