Comments (3)
This project might be good for inspiration:
https://github.com/apache/systemml
from josimtext.
There seem to be three interesting places showing how to create scripts that provide a good entry point into Spark:
- https://github.com/apache/systemml/blob/master/bin/systemml
- https://github.com/apache/systemml/blob/master/src/main/resources/scripts/sparkDML.sh
- https://github.com/apache/systemml/tree/master/src/main/resources/scripts
I am not yet sure how they fit together, though.
But you can see that they favor your solution of explicitly passing Spark configuration, such as --driver-memory. What I do not like is how they have hard-coded the defaults.
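For concreteness, explicitly passing Spark settings through a single script could look roughly like this. This is only a sketch: the jar path, class name, and memory values are placeholders, not taken from either project, and the command is printed rather than executed so the sketch runs anywhere.

```shell
#!/usr/bin/env bash
# Sketch of a single entry point that passes Spark settings explicitly
# on the command line instead of via environment variables.
# All paths, names, and values below are illustrative placeholders.
set -euo pipefail

jar="target/josimtext.jar"   # hypothetical artifact path
main_class="Main"            # hypothetical main class

# Build the spark-submit invocation as an array, one option per line,
# so every setting is visible and overridable in one place.
cmd=(spark-submit
  --master yarn
  --driver-memory 4g
  --executor-memory 8g
  --class "$main_class"
  "$jar" "$@")

# Print the command instead of running it, so the sketch is runnable
# without a Spark installation.
echo "${cmd[*]}"
```

The point of the array form is that nothing is hidden: a researcher reading the script sees every Spark option in one block.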
What I like is that they have a single entry point.
And I have a suggestion for how this would be possible in our situation, even with the concerns you have mentioned (having a fast starting point for researchers, with an overview of all model parameters and no need to write them down manually). The solution could be to extract the model parameters into a separate key-value file and then provide only this key-value file for each "method". The nice part about this idea is that we can later regenerate such a file and include it in the output folder. (That is, by the way, similar to what Spark does in MLlib's model persistence.)
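As a sketch of this idea (the file name, keys, and values below are invented, not the project's real parameters): one key-value file per method, loaded by the entry point and then copied into the output folder so every run records the exact parameters that produced it.

```shell
#!/usr/bin/env bash
# Sketch: keep model parameters in a key=value file, load them in the
# entry point, and reproduce the file in the output folder.
# File name, keys, and values are invented for illustration.
set -euo pipefail

workdir="$(mktemp -d)"
params_file="$workdir/method.params"
output_dir="$workdir/output"

# An example parameter file, one key=value per line:
cat > "$params_file" <<'EOF'
num_clusters=200
min_word_count=5
similarity=cosine
EOF

# Each key becomes a shell variable.
source "$params_file"

# Record the parameters next to the results, so a run is reproducible.
mkdir -p "$output_dir"
cp "$params_file" "$output_dir/"

echo "num_clusters=$num_clusters similarity=$similarity"
```

Because the file is plain key=value pairs, regenerating it from a run is just writing the same pairs back out, no reflection needed.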
My main point is that a single entry point would reduce boilerplate and make it easier to resolve issues in the scripts.
- Take, for example, the last argument of any of the 20 scripts, which is <config.sh>. If you open any of those scripts, you see that this <config.sh> is sourced and then some variables are used that are never defined beforehand. A reader might assume that this <config.sh> contains those variables and that it lives in the config folder, but it takes enough reasoning to question how explicit this is.
- Another problem is that the model parameters are not named on the command line.
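One way to address the naming problem, sketched here with invented parameter names and defaults, is for the entry point to accept named options, so a call reads `run.sh --num-clusters 200` instead of a row of bare positional values:

```shell
#!/usr/bin/env bash
# Sketch: named command-line parameters instead of positional ones.
# Parameter names and defaults are invented for illustration.
set -euo pipefail

num_clusters=100     # defaults are visible right here in the script
min_word_count=3

while [[ $# -gt 0 ]]; do
  case "$1" in
    --num-clusters)   num_clusters="$2";   shift 2 ;;
    --min-word-count) min_word_count="$2"; shift 2 ;;
    *) echo "unknown argument: $1" >&2; exit 1 ;;
  esac
done

echo "num_clusters=$num_clusters min_word_count=$min_word_count"
```

The trade-off is a small parsing loop in the script; in exchange, every invocation documents itself.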
> But you can see that they favor your solution of explicitly passing Spark configuration, such as --driver-memory.
I thought about this again, and I am ready to say that I am very much in favor of setting Spark configuration explicitly, not via environment variables.
> What I do not like is how they have hard-coded the defaults.
Yeah, what we do now seems to be even more advanced.
> create scripts that provide a good entry point into Spark.
My main bias is to make the scripts as simple as possible, which is not really the case in this project. I want them, ideally, to have no while or for loops, no functions, and as few ifs as possible, so that even a kid (= a researcher) can read such a bash script. In this project, the scripts are quite complex.
> The nice part about this idea is that we can later regenerate such a file and include it in the output folder. (That is, by the way, similar to what Spark does in MLlib's model persistence.)
I strongly oppose the "later" part. If it is a benefit, then we need to do it now, or not consider it at all. Actually, it is not clear to me how you would do it. Through reflection?
Please answer: which problem are you trying to solve by changing the configuration? Please answer this question very clearly and in as much detail as possible. For now, I cannot really understand it, and this is very important.
from josimtext.
Related Issues (12)
- dt_spark.sh error
- CoNLL processing exception
- Support for not only enhanced dependencies
- Large scale conll tests
- From trigrams to n-grams
- For trigrams, make possible to turn off lowercasing
- Support of multiword expressions and named entities
- Support of multiword expressions for the trigrams
- Trigrams: remove "." after the tokens
- Alternative feature extraction approach based on positional unigrams
- List of stop dependency types