fmarten / josimtext

This project is forked from uhh-lt/josimtext.


A system for word sense induction and disambiguation based on the JoBimText approach

Scala 89.15% Shell 1.65% Python 9.11% Makefile 0.09%

josimtext's People

Contributors

alexanderpanchenko


josimtext's Issues

Support of multiword expressions and named entities

Motivation

The current implementation of lefex supports feature generation for both single- and multiword terms, while the current Spark implementation at https://github.com/uhh-lt/josimtext/blob/master/src/main/scala/de/uhh/lt/jst/dt/CoNLL2DepTermContext.scala only generates features for single words.

Implementation

The idea of feature generation for MWEs is illustrated in the figure below:

[figure: dependency-based feature generation for the MWE "mickey mouse"]

Here, the features of "mickey mouse" are all dependencies of "mickey" plus all dependencies of "mouse", minus the dependencies between "mickey" and "mouse".

Here is an example of a named entity from our data, "Lower Johnson".

[figure: CoNLL analysis of a sentence containing the named entity "Lower Johnson"]

It should be represented with the feature "pobj(@,to)" (i.e. line 18 in the figure). Note that "nn(@,Johnson)" (i.e. line 17) is not a feature of this entity, since it is a dependency between the parts of the entity itself.

Another example:

[figure: CoNLL analysis of a sentence containing "New York City"]

Here the "New York City" should have the features: "prep_in(@,hotel)"

Trigrams: remove "." after the tokens

Problem

Our current tokenization strategy is too simple (just splitting on whitespace), which means that many tokens end up with a trailing ".", "," or ";".

Solution

For any token, remove the trailing ".", ";" or "," if the token is longer than one character.

Examples:

"man." --> "man"  
"woman," --> "woman"
"." --> "."

List of stop dependency types

During CoNLL-based feature extraction it makes sense to take only some dependency types into account, e.g. to remove the ROOT type.

Please add a constant array listing the types that are filtered out, and for now put only the "ROOT" type in this array:

[screenshot of the code location where the filter should be applied]
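
A minimal sketch of such a constant; a `Set` is used instead of an array for cheap membership tests, and the names are hypothetical:

```scala
// Dependency types filtered out during CoNLL-based feature extraction;
// currently only ROOT.
val stopDependencyTypes: Set[String] = Set("ROOT")

// Hypothetical usage while emitting (term, context) pairs:
// deps.filterNot(d => stopDependencyTypes.contains(d.depType))
```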

From trigrams to n-grams

Motivation

This is part of a small series of improvements aiming to increase the number of feature extractors available natively in Spark. Currently, most such extractors are part of lefex or of the "classical" jobimtext project and are thus only available in Hadoop.

Currently, only trigrams can be computed. However, it would make sense to allow users to specify n in the n-gram model (the size of the context window).

Implementation

Add a command-line argument to https://github.com/uhh-lt/josimtext/blob/master/src/main/scala/de/uhh/lt/jst/dt/Text2TrigramTermContext.scala that specifies the size of the left/right context. The default value n=1 corresponds to the trigrams used now; for n=2 we obtain 5-grams, for n=3 we obtain 7-grams, etc.
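
A minimal sketch of the generalized extraction; the function name is hypothetical and the CLI wiring into Text2TrigramTermContext.scala is not shown:

```scala
// For each token, take the n tokens to its left and the n tokens to its
// right as the context; n = 1 reproduces the current trigram behaviour,
// n = 2 yields 5-grams, n = 3 yields 7-grams, etc.
def termContexts(tokens: Array[String], n: Int = 1): Seq[(String, String)] =
  tokens.indices.map { i =>
    val left  = tokens.slice(math.max(0, i - n), i)
    val right = tokens.slice(i + 1, math.min(tokens.length, i + n + 1))
    (tokens(i), (left ++ Array("@") ++ right).mkString("_"))
  }

// termContexts(Array("a", "b", "c"))         // contains ("b", "a_@_c")
// termContexts(Array("a", "b", "c", "d"), 2) // contains ("c", "a_b_@_d")
```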

CoNLL processing exception

Trying to extract features from this file causes the exception listed below (a possible diagnosis is sketched after the trace).

~/Desktop/JoSimText/scripts$ bash dt_spark.sh conll ~/Desktop/test/cc16-conll-copp-sample.csv ~/Desktop/test/conll-output-3/ config/l.sh

The input file, with newlines inserted after each sentence:
http://panchenko.me/data/joint/corpora/cc16-conll-copp-sample-newlines.csv.gz

Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
17/08/18 23:09:13 INFO SparkContext: Running Spark version 2.2.0
17/08/18 23:09:14 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
17/08/18 23:09:19 INFO SparkContext: Submitted application: CoNLL2DepTermContext$
17/08/18 23:09:19 INFO SecurityManager: Changing view acls to: panchenko
17/08/18 23:09:19 INFO SecurityManager: Changing modify acls to: panchenko
17/08/18 23:09:19 INFO SecurityManager: Changing view acls groups to:
17/08/18 23:09:19 INFO SecurityManager: Changing modify acls groups to:
17/08/18 23:09:19 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users  with view permissions: Set(panchenko); groups with view permissions: Set(); users  with modify permissions: Set(panchenko); groups with modify permissions: Set()
17/08/18 23:09:19 INFO Utils: Successfully started service 'sparkDriver' on port 56249.
17/08/18 23:09:19 INFO SparkEnv: Registering MapOutputTracker
17/08/18 23:09:19 INFO SparkEnv: Registering BlockManagerMaster
17/08/18 23:09:19 INFO BlockManagerMasterEndpoint: Using org.apache.spark.storage.DefaultTopologyMapper for getting topology information
17/08/18 23:09:19 INFO BlockManagerMasterEndpoint: BlockManagerMasterEndpoint up
17/08/18 23:09:19 INFO DiskBlockManager: Created local directory at /private/var/folders/tf/cy2lzyld3rz6mg8tqxm7zstr0000gn/T/blockmgr-1a908ae1-d1d7-4acb-8bbd-0b61aa43e047
17/08/18 23:09:19 INFO MemoryStore: MemoryStore started with capacity 4.1 GB
17/08/18 23:09:19 INFO SparkEnv: Registering OutputCommitCoordinator
17/08/18 23:09:19 INFO Utils: Successfully started service 'SparkUI' on port 4040.
17/08/18 23:09:20 INFO SparkUI: Bound SparkUI to 0.0.0.0, and started at http://10.0.1.3:4040
17/08/18 23:09:20 INFO SparkContext: Added JAR file:/Users/panchenko/Desktop/JoSimText/scripts/../target/scala-2.11/josimtext_2.11-0.4.jar at spark://10.0.1.3:56249/jars/josimtext_2.11-0.4.jar with timestamp 1503090560059
17/08/18 23:09:20 INFO Executor: Starting executor ID driver on host localhost
17/08/18 23:09:20 INFO Utils: Successfully started service 'org.apache.spark.network.netty.NettyBlockTransferService' on port 56250.
17/08/18 23:09:20 INFO NettyBlockTransferService: Server created on 10.0.1.3:56250
17/08/18 23:09:20 INFO BlockManager: Using org.apache.spark.storage.RandomBlockReplicationPolicy for block replication policy
17/08/18 23:09:20 INFO BlockManagerMaster: Registering BlockManager BlockManagerId(driver, 10.0.1.3, 56250, None)
17/08/18 23:09:20 INFO BlockManagerMasterEndpoint: Registering block manager 10.0.1.3:56250 with 4.1 GB RAM, BlockManagerId(driver, 10.0.1.3, 56250, None)
17/08/18 23:09:20 INFO BlockManagerMaster: Registered BlockManager BlockManagerId(driver, 10.0.1.3, 56250, None)
17/08/18 23:09:20 INFO BlockManager: Initialized BlockManager: BlockManagerId(driver, 10.0.1.3, 56250, None)
17/08/18 23:09:20 INFO SharedState: Setting hive.metastore.warehouse.dir ('null') to the value of spark.sql.warehouse.dir ('file:/Users/panchenko/Desktop/JoSimText/scripts/spark-warehouse/').
17/08/18 23:09:20 INFO SharedState: Warehouse path is 'file:/Users/panchenko/Desktop/JoSimText/scripts/spark-warehouse/'.
17/08/18 23:09:21 INFO StateStoreCoordinatorRef: Registered StateStoreCoordinator endpoint
17/08/18 23:09:23 INFO FileSourceStrategy: Pruning directories with:
17/08/18 23:09:23 INFO FileSourceStrategy: Post-Scan Filters:
17/08/18 23:09:23 INFO FileSourceStrategy: Output Data Schema: struct<value: string>
17/08/18 23:09:23 INFO FileSourceScanExec: Pushed Filters:
17/08/18 23:09:23 INFO FileOutputCommitter: File Output Committer Algorithm version is 1
17/08/18 23:09:23 INFO SQLHadoopMapReduceCommitProtocol: Using output committer class org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter
17/08/18 23:09:24 INFO CodeGenerator: Code generated in 158.486406 ms
17/08/18 23:09:24 INFO CodeGenerator: Code generated in 51.904684 ms
17/08/18 23:09:24 INFO MemoryStore: Block broadcast_0 stored as values in memory (estimated size 277.3 KB, free 4.1 GB)
17/08/18 23:09:24 INFO MemoryStore: Block broadcast_0_piece0 stored as bytes in memory (estimated size 23.4 KB, free 4.1 GB)
17/08/18 23:09:24 INFO BlockManagerInfo: Added broadcast_0_piece0 in memory on 10.0.1.3:56250 (size: 23.4 KB, free: 4.1 GB)
17/08/18 23:09:24 INFO SparkContext: Created broadcast 0 from broadcast at DefaultSource.scala:86
17/08/18 23:09:24 INFO FileSourceScanExec: Planning scan with bin packing, max size: 23280808 bytes, open cost is considered as scanning 4194304 bytes.
17/08/18 23:09:24 INFO SparkContext: Starting job: text at CoNLL2DepTermContext.scala:29
17/08/18 23:09:24 INFO DAGScheduler: Got job 0 (text at CoNLL2DepTermContext.scala:29) with 4 output partitions
17/08/18 23:09:24 INFO DAGScheduler: Final stage: ResultStage 0 (text at CoNLL2DepTermContext.scala:29)
17/08/18 23:09:24 INFO DAGScheduler: Parents of final stage: List()
17/08/18 23:09:24 INFO DAGScheduler: Missing parents: List()
17/08/18 23:09:24 INFO DAGScheduler: Submitting ResultStage 0 (MapPartitionsRDD[4] at text at CoNLL2DepTermContext.scala:29), which has no missing parents
17/08/18 23:09:24 INFO MemoryStore: Block broadcast_1 stored as values in memory (estimated size 99.2 KB, free 4.1 GB)
17/08/18 23:09:24 INFO MemoryStore: Block broadcast_1_piece0 stored as bytes in memory (estimated size 35.6 KB, free 4.1 GB)
17/08/18 23:09:24 INFO BlockManagerInfo: Added broadcast_1_piece0 in memory on 10.0.1.3:56250 (size: 35.6 KB, free: 4.1 GB)
17/08/18 23:09:24 INFO SparkContext: Created broadcast 1 from broadcast at DAGScheduler.scala:1006
17/08/18 23:09:24 INFO DAGScheduler: Submitting 4 missing tasks from ResultStage 0 (MapPartitionsRDD[4] at text at CoNLL2DepTermContext.scala:29) (first 15 tasks are for partitions Vector(0, 1, 2, 3))
17/08/18 23:09:24 INFO TaskSchedulerImpl: Adding task set 0.0 with 4 tasks
17/08/18 23:09:24 INFO TaskSetManager: Starting task 0.0 in stage 0.0 (TID 0, localhost, executor driver, partition 0, PROCESS_LOCAL, 5300 bytes)
17/08/18 23:09:24 INFO TaskSetManager: Starting task 1.0 in stage 0.0 (TID 1, localhost, executor driver, partition 1, PROCESS_LOCAL, 5300 bytes)
17/08/18 23:09:24 INFO TaskSetManager: Starting task 2.0 in stage 0.0 (TID 2, localhost, executor driver, partition 2, PROCESS_LOCAL, 5300 bytes)
17/08/18 23:09:24 INFO TaskSetManager: Starting task 3.0 in stage 0.0 (TID 3, localhost, executor driver, partition 3, PROCESS_LOCAL, 5300 bytes)
17/08/18 23:09:24 INFO Executor: Running task 2.0 in stage 0.0 (TID 2)
17/08/18 23:09:24 INFO Executor: Running task 1.0 in stage 0.0 (TID 1)
17/08/18 23:09:24 INFO Executor: Running task 3.0 in stage 0.0 (TID 3)
17/08/18 23:09:24 INFO Executor: Running task 0.0 in stage 0.0 (TID 0)
17/08/18 23:09:24 INFO Executor: Fetching spark://10.0.1.3:56249/jars/josimtext_2.11-0.4.jar with timestamp 1503090560059
17/08/18 23:09:25 INFO TransportClientFactory: Successfully created connection to /10.0.1.3:56249 after 46 ms (0 ms spent in bootstraps)
17/08/18 23:09:25 INFO Utils: Fetching spark://10.0.1.3:56249/jars/josimtext_2.11-0.4.jar to /private/var/folders/tf/cy2lzyld3rz6mg8tqxm7zstr0000gn/T/spark-5eea660b-5cc0-476a-8347-be8044dd897e/userFiles-21df9990-d05e-4469-b394-b397092a2323/fetchFileTemp5738246302582940000.tmp
17/08/18 23:09:25 INFO Executor: Adding file:/private/var/folders/tf/cy2lzyld3rz6mg8tqxm7zstr0000gn/T/spark-5eea660b-5cc0-476a-8347-be8044dd897e/userFiles-21df9990-d05e-4469-b394-b397092a2323/josimtext_2.11-0.4.jar to class loader
17/08/18 23:09:25 INFO CodeGenerator: Code generated in 30.108062 ms
17/08/18 23:09:25 INFO FileOutputCommitter: File Output Committer Algorithm version is 1
17/08/18 23:09:25 INFO FileOutputCommitter: File Output Committer Algorithm version is 1
17/08/18 23:09:25 INFO FileOutputCommitter: File Output Committer Algorithm version is 1
17/08/18 23:09:25 INFO FileOutputCommitter: File Output Committer Algorithm version is 1
17/08/18 23:09:25 INFO SQLHadoopMapReduceCommitProtocol: Using output committer class org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter
17/08/18 23:09:25 INFO SQLHadoopMapReduceCommitProtocol: Using output committer class org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter
17/08/18 23:09:25 INFO SQLHadoopMapReduceCommitProtocol: Using output committer class org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter
17/08/18 23:09:25 INFO SQLHadoopMapReduceCommitProtocol: Using output committer class org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter
17/08/18 23:09:25 INFO FileScanRDD: Reading File path: file:///Users/panchenko/Desktop/test/cc16-conll-copp-sample.csv, range: 0-23280808, partition values: [empty row]
17/08/18 23:09:25 INFO FileScanRDD: Reading File path: file:///Users/panchenko/Desktop/test/cc16-conll-copp-sample.csv, range: 69842424-88928930, partition values: [empty row]
17/08/18 23:09:25 INFO FileScanRDD: Reading File path: file:///Users/panchenko/Desktop/test/cc16-conll-copp-sample.csv, range: 23280808-46561616, partition values: [empty row]
17/08/18 23:09:25 INFO FileScanRDD: Reading File path: file:///Users/panchenko/Desktop/test/cc16-conll-copp-sample.csv, range: 46561616-69842424, partition values: [empty row]
17/08/18 23:09:25 INFO CodeGenerator: Code generated in 13.221717 ms
17/08/18 23:09:25 ERROR Utils: Aborting task
org.apache.spark.SparkException: Failed to execute user defined function(anonfun$1: (string) => array<string>)
	at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown Source)
	at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
	at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:395)
	at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
	at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
	at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:439)
	at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
	at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown Source)
	at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
	at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:395)
	at org.apache.spark.sql.execution.datasources.FileFormatWriter$SingleDirectoryWriteTask.execute(FileFormatWriter.scala:315)
	at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask$3.apply(FileFormatWriter.scala:258)
	at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask$3.apply(FileFormatWriter.scala:256)
	at org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1375)
	at org.apache.spark.sql.execution.datasources.FileFormatWriter$.org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask(FileFormatWriter.scala:261)
	at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1$$anonfun$apply$mcV$sp$1.apply(FileFormatWriter.scala:191)
	at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1$$anonfun$apply$mcV$sp$1.apply(FileFormatWriter.scala:190)
	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
	at org.apache.spark.scheduler.Task.run(Task.scala:108)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:335)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
	at java.lang.Thread.run(Thread.java:745)
Caused by: java.lang.NullPointerException
	at com.univocity.parsers.common.LineReader.read(LineReader.java:51)
	at com.univocity.parsers.common.input.DefaultCharInputReader.reloadBuffer(DefaultCharInputReader.java:75)
	at com.univocity.parsers.common.input.AbstractCharInputReader.updateBuffer(AbstractCharInputReader.java:159)
	at com.univocity.parsers.common.input.AbstractCharInputReader.start(AbstractCharInputReader.java:145)
	at com.univocity.parsers.common.AbstractParser.beginParsing(AbstractParser.java:232)
	at com.univocity.parsers.common.AbstractParser.parseLine(AbstractParser.java:523)
	at de.uhh.lt.conll.CoNLLParser$$anonfun$3.apply(CoNLLParser.scala:42)
	at de.uhh.lt.conll.CoNLLParser$$anonfun$3.apply(CoNLLParser.scala:42)
	at scala.collection.immutable.List.map(List.scala:273)
	at de.uhh.lt.conll.CoNLLParser$.parseSingleSentence(CoNLLParser.scala:42)
	at de.uhh.lt.jst.dt.CoNLL2DepTermContext$$anonfun$1.apply(CoNLL2DepTermContext.scala:40)
	at de.uhh.lt.jst.dt.CoNLL2DepTermContext$$anonfun$1.apply(CoNLL2DepTermContext.scala:40)
	... 23 more
[the identical "ERROR Utils: Aborting task" block repeats three more times, once per remaining task; stack traces omitted]
17/08/18 23:09:25 ERROR FileFormatWriter: Job job_20170818230925_0000 aborted.
17/08/18 23:09:25 ERROR Executor: Exception in task 2.0 in stage 0.0 (TID 2)
org.apache.spark.SparkException: Task failed while writing rows
	at org.apache.spark.sql.execution.datasources.FileFormatWriter$.org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask(FileFormatWriter.scala:272)
	at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1$$anonfun$apply$mcV$sp$1.apply(FileFormatWriter.scala:191)
	at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1$$anonfun$apply$mcV$sp$1.apply(FileFormatWriter.scala:190)
	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
	at org.apache.spark.scheduler.Task.run(Task.scala:108)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:335)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
	at java.lang.Thread.run(Thread.java:745)
Caused by: org.apache.spark.SparkException: Failed to execute user defined function(anonfun$1: (string) => array<string>)
	at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown Source)
	at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
	at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:395)
	at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
	at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
	at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:439)
	at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
	at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown Source)
	at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
	at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:395)
	at org.apache.spark.sql.execution.datasources.FileFormatWriter$SingleDirectoryWriteTask.execute(FileFormatWriter.scala:315)
	at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask$3.apply(FileFormatWriter.scala:258)
	at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask$3.apply(FileFormatWriter.scala:256)
	at org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1375)
	at org.apache.spark.sql.execution.datasources.FileFormatWriter$.org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask(FileFormatWriter.scala:261)
	... 8 more
Caused by: java.lang.NullPointerException
	at com.univocity.parsers.common.LineReader.read(LineReader.java:51)
	at com.univocity.parsers.common.input.DefaultCharInputReader.reloadBuffer(DefaultCharInputReader.java:75)
	at com.univocity.parsers.common.input.AbstractCharInputReader.updateBuffer(AbstractCharInputReader.java:159)
	at com.univocity.parsers.common.input.AbstractCharInputReader.start(AbstractCharInputReader.java:145)
	at com.univocity.parsers.common.AbstractParser.beginParsing(AbstractParser.java:232)
	at com.univocity.parsers.common.AbstractParser.parseLine(AbstractParser.java:523)
	at de.uhh.lt.conll.CoNLLParser$$anonfun$3.apply(CoNLLParser.scala:42)
	at de.uhh.lt.conll.CoNLLParser$$anonfun$3.apply(CoNLLParser.scala:42)
	at scala.collection.immutable.List.map(List.scala:273)
	at de.uhh.lt.conll.CoNLLParser$.parseSingleSentence(CoNLLParser.scala:42)
	at de.uhh.lt.jst.dt.CoNLL2DepTermContext$$anonfun$1.apply(CoNLL2DepTermContext.scala:40)
	at de.uhh.lt.jst.dt.CoNLL2DepTermContext$$anonfun$1.apply(CoNLL2DepTermContext.scala:40)
	... 23 more
[identical "ERROR Executor: Exception in task" blocks for tasks 1.0 (TID 1), 0.0 (TID 0) and 3.0 (TID 3) omitted]
17/08/18 23:09:25 WARN TaskSetManager: Lost task 0.0 in stage 0.0 (TID 0, localhost, executor driver): org.apache.spark.SparkException: Task failed while writing rows
	[same stack trace as above; omitted]

17/08/18 23:09:25 ERROR TaskSetManager: Task 0 in stage 0.0 failed 1 times; aborting job
17/08/18 23:09:25 INFO TaskSetManager: Lost task 2.0 in stage 0.0 (TID 2) on localhost, executor driver: org.apache.spark.SparkException (Task failed while writing rows) [duplicate 1]
17/08/18 23:09:25 INFO TaskSetManager: Lost task 3.0 in stage 0.0 (TID 3) on localhost, executor driver: org.apache.spark.SparkException (Task failed while writing rows) [duplicate 2]
17/08/18 23:09:25 INFO TaskSchedulerImpl: Removed TaskSet 0.0, whose tasks have all completed, from pool
17/08/18 23:09:25 INFO TaskSetManager: Lost task 1.0 in stage 0.0 (TID 1) on localhost, executor driver: org.apache.spark.SparkException (Task failed while writing rows) [duplicate 3]
17/08/18 23:09:25 INFO TaskSchedulerImpl: Removed TaskSet 0.0, whose tasks have all completed, from pool
17/08/18 23:09:25 INFO TaskSchedulerImpl: Cancelling stage 0
17/08/18 23:09:25 INFO DAGScheduler: ResultStage 0 (text at CoNLL2DepTermContext.scala:29) failed in 0.720 s due to Job aborted due to stage failure: Task 0 in stage 0.0 failed 1 times, most recent failure: Lost task 0.0 in stage 0.0 (TID 0, localhost, executor driver): org.apache.spark.SparkException: Task failed while writing rows
	[same stack trace as above; omitted]

Driver stacktrace:
17/08/18 23:09:25 INFO DAGScheduler: Job 0 failed: text at CoNLL2DepTermContext.scala:29, took 0.892415 s
17/08/18 23:09:25 ERROR FileFormatWriter: Aborting job null.
org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 0.0 failed 1 times, most recent failure: Lost task 0.0 in stage 0.0 (TID 0, localhost, executor driver): org.apache.spark.SparkException: Task failed while writing rows
	[same stack trace as above; omitted]

Driver stacktrace:
	at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1499)
	at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1487)
	at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1486)
	at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
	at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
	at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1486)
	at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:814)
	at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:814)
	at scala.Option.foreach(Option.scala:257)
	at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:814)
	at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1714)
	at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1669)
	at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1658)
	at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
	at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:630)
	at org.apache.spark.SparkContext.runJob(SparkContext.scala:2022)
	at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply$mcV$sp(FileFormatWriter.scala:188)
	at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply(FileFormatWriter.scala:173)
	at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply(FileFormatWriter.scala:173)
	at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:65)
	at org.apache.spark.sql.execution.datasources.FileFormatWriter$.write(FileFormatWriter.scala:173)
	at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand.run(InsertIntoHadoopFsRelationCommand.scala:145)
	at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:58)
	at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:56)
	at org.apache.spark.sql.execution.command.ExecutedCommandExec.doExecute(commands.scala:74)
	at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:117)
	at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:117)
	at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:138)
	at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
	at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:135)
	at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:116)
	at org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:92)
	at org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:92)
	at org.apache.spark.sql.execution.datasources.DataSource.writeInFileFormat(DataSource.scala:438)
	at org.apache.spark.sql.execution.datasources.DataSource.write(DataSource.scala:474)
	at org.apache.spark.sql.execution.datasources.SaveIntoDataSourceCommand.run(SaveIntoDataSourceCommand.scala:48)
	at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:58)
	at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:56)
	at org.apache.spark.sql.execution.command.ExecutedCommandExec.doExecute(commands.scala:74)
	at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:117)
	at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:117)
	at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:138)
	at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
	at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:135)
	at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:116)
	at org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:92)
	at org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:92)
	at org.apache.spark.sql.DataFrameWriter.runCommand(DataFrameWriter.scala:610)
	at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:233)
	at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:217)
	at org.apache.spark.sql.DataFrameWriter.text(DataFrameWriter.scala:555)
	at de.uhh.lt.jst.dt.CoNLL2DepTermContext$.main(CoNLL2DepTermContext.scala:29)
	at de.uhh.lt.jst.dt.CoNLL2DepTermContext.main(CoNLL2DepTermContext.scala)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:755)
	at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:180)
	at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:205)
	at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:119)
	at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Caused by: org.apache.spark.SparkException: Task failed while writing rows
	at org.apache.spark.sql.execution.datasources.FileFormatWriter$.org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask(FileFormatWriter.scala:272)
	at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1$$anonfun$apply$mcV$sp$1.apply(FileFormatWriter.scala:191)
	at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1$$anonfun$apply$mcV$sp$1.apply(FileFormatWriter.scala:190)
	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
	at org.apache.spark.scheduler.Task.run(Task.scala:108)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:335)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
	at java.lang.Thread.run(Thread.java:745)
Caused by: org.apache.spark.SparkException: Failed to execute user defined function(anonfun$1: (string) => array<string>)
	at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown Source)
	at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
	at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:395)
	at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
	at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
	at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:439)
	at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
	at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown Source)
	at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
	at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:395)
	at org.apache.spark.sql.execution.datasources.FileFormatWriter$SingleDirectoryWriteTask.execute(FileFormatWriter.scala:315)
	at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask$3.apply(FileFormatWriter.scala:258)
	at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask$3.apply(FileFormatWriter.scala:256)
	at org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1375)
	at org.apache.spark.sql.execution.datasources.FileFormatWriter$.org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask(FileFormatWriter.scala:261)
	... 8 more
Caused by: java.lang.NullPointerException
	at com.univocity.parsers.common.LineReader.read(LineReader.java:51)
	at com.univocity.parsers.common.input.DefaultCharInputReader.reloadBuffer(DefaultCharInputReader.java:75)
	at com.univocity.parsers.common.input.AbstractCharInputReader.updateBuffer(AbstractCharInputReader.java:159)
	at com.univocity.parsers.common.input.AbstractCharInputReader.start(AbstractCharInputReader.java:145)
	at com.univocity.parsers.common.AbstractParser.beginParsing(AbstractParser.java:232)
	at com.univocity.parsers.common.AbstractParser.parseLine(AbstractParser.java:523)
	at de.uhh.lt.conll.CoNLLParser$$anonfun$3.apply(CoNLLParser.scala:42)
	at de.uhh.lt.conll.CoNLLParser$$anonfun$3.apply(CoNLLParser.scala:42)
	at scala.collection.immutable.List.map(List.scala:273)
	at de.uhh.lt.conll.CoNLLParser$.parseSingleSentence(CoNLLParser.scala:42)
	at de.uhh.lt.jst.dt.CoNLL2DepTermContext$$anonfun$1.apply(CoNLL2DepTermContext.scala:40)
	at de.uhh.lt.jst.dt.CoNLL2DepTermContext$$anonfun$1.apply(CoNLL2DepTermContext.scala:40)
	... 23 more
Exception in thread "main" org.apache.spark.SparkException: Job aborted.
	at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply$mcV$sp(FileFormatWriter.scala:215)
	at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply(FileFormatWriter.scala:173)
	at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply(FileFormatWriter.scala:173)
	at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:65)
	at org.apache.spark.sql.execution.datasources.FileFormatWriter$.write(FileFormatWriter.scala:173)
	at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand.run(InsertIntoHadoopFsRelationCommand.scala:145)
	at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:58)
	at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:56)
	at org.apache.spark.sql.execution.command.ExecutedCommandExec.doExecute(commands.scala:74)
	at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:117)
	at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:117)
	at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:138)
	at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
	at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:135)
	at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:116)
	at org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:92)
	at org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:92)
	at org.apache.spark.sql.execution.datasources.DataSource.writeInFileFormat(DataSource.scala:438)
	at org.apache.spark.sql.execution.datasources.DataSource.write(DataSource.scala:474)
	at org.apache.spark.sql.execution.datasources.SaveIntoDataSourceCommand.run(SaveIntoDataSourceCommand.scala:48)
	at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:58)
	at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:56)
	at org.apache.spark.sql.execution.command.ExecutedCommandExec.doExecute(commands.scala:74)
	at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:117)
	at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:117)
	at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:138)
	at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
	at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:135)
	at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:116)
	at org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:92)
	at org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:92)
	at org.apache.spark.sql.DataFrameWriter.runCommand(DataFrameWriter.scala:610)
	at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:233)
	at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:217)
	at org.apache.spark.sql.DataFrameWriter.text(DataFrameWriter.scala:555)
	at de.uhh.lt.jst.dt.CoNLL2DepTermContext$.main(CoNLL2DepTermContext.scala:29)
	at de.uhh.lt.jst.dt.CoNLL2DepTermContext.main(CoNLL2DepTermContext.scala)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:755)
	at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:180)
	at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:205)
	at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:119)
	at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 0.0 failed 1 times, most recent failure: Lost task 0.0 in stage 0.0 (TID 0, localhost, executor driver): org.apache.spark.SparkException: Task failed while writing rows
	at org.apache.spark.sql.execution.datasources.FileFormatWriter$.org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask(FileFormatWriter.scala:272)
	at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1$$anonfun$apply$mcV$sp$1.apply(FileFormatWriter.scala:191)
	at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1$$anonfun$apply$mcV$sp$1.apply(FileFormatWriter.scala:190)
	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
	at org.apache.spark.scheduler.Task.run(Task.scala:108)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:335)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
	at java.lang.Thread.run(Thread.java:745)
Caused by: org.apache.spark.SparkException: Failed to execute user defined function(anonfun$1: (string) => array<string>)
	at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown Source)
	at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
	at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:395)
	at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
	at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
	at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:439)
	at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
	at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown Source)
	at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
	at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:395)
	at org.apache.spark.sql.execution.datasources.FileFormatWriter$SingleDirectoryWriteTask.execute(FileFormatWriter.scala:315)
	at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask$3.apply(FileFormatWriter.scala:258)
	at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask$3.apply(FileFormatWriter.scala:256)
	at org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1375)
	at org.apache.spark.sql.execution.datasources.FileFormatWriter$.org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask(FileFormatWriter.scala:261)
	... 8 more
Caused by: java.lang.NullPointerException
	at com.univocity.parsers.common.LineReader.read(LineReader.java:51)
	at com.univocity.parsers.common.input.DefaultCharInputReader.reloadBuffer(DefaultCharInputReader.java:75)
	at com.univocity.parsers.common.input.AbstractCharInputReader.updateBuffer(AbstractCharInputReader.java:159)
	at com.univocity.parsers.common.input.AbstractCharInputReader.start(AbstractCharInputReader.java:145)
	at com.univocity.parsers.common.AbstractParser.beginParsing(AbstractParser.java:232)
	at com.univocity.parsers.common.AbstractParser.parseLine(AbstractParser.java:523)
	at de.uhh.lt.conll.CoNLLParser$$anonfun$3.apply(CoNLLParser.scala:42)
	at de.uhh.lt.conll.CoNLLParser$$anonfun$3.apply(CoNLLParser.scala:42)
	at scala.collection.immutable.List.map(List.scala:273)
	at de.uhh.lt.conll.CoNLLParser$.parseSingleSentence(CoNLLParser.scala:42)
	at de.uhh.lt.jst.dt.CoNLL2DepTermContext$$anonfun$1.apply(CoNLL2DepTermContext.scala:40)
	at de.uhh.lt.jst.dt.CoNLL2DepTermContext$$anonfun$1.apply(CoNLL2DepTermContext.scala:40)
	... 23 more

Driver stacktrace:
	at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1499)
	at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1487)
	at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1486)
	at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
	at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
	at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1486)
	at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:814)
	at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:814)
	at scala.Option.foreach(Option.scala:257)
	at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:814)
	at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1714)
	at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1669)
	at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1658)
	at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
	at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:630)
	at org.apache.spark.SparkContext.runJob(SparkContext.scala:2022)
	at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply$mcV$sp(FileFormatWriter.scala:188)
	... 45 more
Caused by: org.apache.spark.SparkException: Task failed while writing rows
	at org.apache.spark.sql.execution.datasources.FileFormatWriter$.org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask(FileFormatWriter.scala:272)
	at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1$$anonfun$apply$mcV$sp$1.apply(FileFormatWriter.scala:191)
	at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1$$anonfun$apply$mcV$sp$1.apply(FileFormatWriter.scala:190)
	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
	at org.apache.spark.scheduler.Task.run(Task.scala:108)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:335)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
	at java.lang.Thread.run(Thread.java:745)
Caused by: org.apache.spark.SparkException: Failed to execute user defined function(anonfun$1: (string) => array<string>)
	at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown Source)
	at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
	at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:395)
	at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
	at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
	at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:439)
	at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
	at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown Source)
	at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
	at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:395)
	at org.apache.spark.sql.execution.datasources.FileFormatWriter$SingleDirectoryWriteTask.execute(FileFormatWriter.scala:315)
	at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask$3.apply(FileFormatWriter.scala:258)
	at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask$3.apply(FileFormatWriter.scala:256)
	at org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1375)
	at org.apache.spark.sql.execution.datasources.FileFormatWriter$.org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask(FileFormatWriter.scala:261)
	... 8 more
Caused by: java.lang.NullPointerException
	at com.univocity.parsers.common.LineReader.read(LineReader.java:51)
	at com.univocity.parsers.common.input.DefaultCharInputReader.reloadBuffer(DefaultCharInputReader.java:75)
	at com.univocity.parsers.common.input.AbstractCharInputReader.updateBuffer(AbstractCharInputReader.java:159)
	at com.univocity.parsers.common.input.AbstractCharInputReader.start(AbstractCharInputReader.java:145)
	at com.univocity.parsers.common.AbstractParser.beginParsing(AbstractParser.java:232)
	at com.univocity.parsers.common.AbstractParser.parseLine(AbstractParser.java:523)
	at de.uhh.lt.conll.CoNLLParser$$anonfun$3.apply(CoNLLParser.scala:42)
	at de.uhh.lt.conll.CoNLLParser$$anonfun$3.apply(CoNLLParser.scala:42)
	at scala.collection.immutable.List.map(List.scala:273)
	at de.uhh.lt.conll.CoNLLParser$.parseSingleSentence(CoNLLParser.scala:42)
	at de.uhh.lt.jst.dt.CoNLL2DepTermContext$$anonfun$1.apply(CoNLL2DepTermContext.scala:40)
	at de.uhh.lt.jst.dt.CoNLL2DepTermContext$$anonfun$1.apply(CoNLL2DepTermContext.scala:40)
	... 23 more
17/08/18 23:09:25 INFO SparkContext: Invoking stop() from shutdown hook
17/08/18 23:09:25 INFO SparkUI: Stopped Spark web UI at http://10.0.1.3:4040
17/08/18 23:09:25 INFO MapOutputTrackerMasterEndpoint: MapOutputTrackerMasterEndpoint stopped!
17/08/18 23:09:25 INFO MemoryStore: MemoryStore cleared
17/08/18 23:09:25 INFO BlockManager: BlockManager stopped
17/08/18 23:09:25 INFO BlockManagerMaster: BlockManagerMaster stopped
17/08/18 23:09:25 INFO OutputCommitCoordinator$OutputCommitCoordinatorEndpoint: OutputCommitCoordinator stopped!
17/08/18 23:09:25 INFO SparkContext: Successfully stopped SparkContext
17/08/18 23:09:25 INFO ShutdownHookManager: Shutdown hook called
17/08/18 23:09:25 INFO ShutdownHookManager: Deleting directory /private/var/folders/tf/cy2lzyld3rz6mg8tqxm7zstr0000gn/T/spark-5eea660b-5cc0-476a-8347-be8044dd897e

Support for basic (not only enhanced) dependencies

Problem

The CoNLL2DepTermContext extractor currently takes dependency values from the column with the so-called enhanced dependencies. However, this column is very often not filled, as in the file below: http://panchenko.me/data/joint/corpora/cc16-conll-copp-sample-newlines-no-enhanced.csv.gz

In fact, in the majority of the CoNLL corpora available online this column is not filled. An example of such a file is shown below:

image

As a result, many existing CoNLL files are currently not usable with our tool.

Solution

If no dependencies are found in the enhanced-dependencies column, i.e. it contains "" or "_", then another column must be used to generate features: the two columns which directly precede the currently used column, i.e. the head index and the dependency relation.
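
A minimal sketch of this fallback, assuming the standard 10-column CoNLL-U layout (the Dependency case class and the field indices are illustrative, not the project's actual types):

case class Dependency(head: String, rel: String)

object DepFallback {
  // Extract dependencies for one CoNLL token row, falling back to the basic
  // HEAD/DEPREL columns when the enhanced DEPS column is empty or "_".
  def dependenciesOf(fields: Array[String]): Seq[Dependency] = {
    val head   = fields(6) // HEAD:   index of the governing token
    val deprel = fields(7) // DEPREL: basic dependency relation
    val deps   = fields(8) // DEPS:   enhanced dependencies, e.g. "2:nsubj|4:conj"

    if (deps.nonEmpty && deps != "_")
      deps.split('|').toSeq.map { d =>
        val parts = d.split(":", 2) // "head:relation"
        Dependency(parts(0), parts(1))
      }
    else if (head.nonEmpty && head != "_")
      Seq(Dependency(head, deprel)) // fallback to the basic dependency
    else
      Seq.empty
  }
}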

Alternative feature extraction approach based on positional unigrams

Motivation

n-gram features can be too sparse to be practical, and the number of such features grows very fast. This ticket implements a very simple alternative strategy which may work better in some situations and gives users more choice of extraction methods.

Implementation

A positional feature always represents one context word/MWE with only a single term. However, in addition to the word itself, its position within the context window is stored. The example below clarifies this:

Input text:

This ice cream is sweet.

Input MWE vocabulary:

ice cream

Features generated using trigrams:

this _@_ice
ice this_@_cream
cream ice_@_is
is cream_@_sweet
ice cream this_@_is

Features generated using the positional approach with a context window of size n=1 (the same window as for trigrams):

this ice_+1
ice this_-1
ice cream_+1
cream ice_-1
cream is_+1
is cream_-1
is sweet_+1
ice cream this_-1
ice cream is_+1

Some of the features generated using the positional approach with a context window of size n=2:

this ice_+1
this cream_+2
ice this_-1
ice cream_+1
ice is_+2
...
...
ice cream this_-1
ice cream is_+1
ice cream sweet_+2
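
A minimal sketch of positional feature generation for single tokens, with MWE handling omitted (the names and the extract function are illustrative, not the project's actual API):

object PositionalFeatures {
  // Emit (term, feature) pairs where each feature is a neighbouring word
  // annotated with its relative position, e.g. ("ice", "this_-1").
  def extract(tokens: Seq[String], n: Int): Seq[(String, String)] =
    for {
      (term, i) <- tokens.zipWithIndex
      offset    <- -n to n
      if offset != 0
      j = i + offset
      if j >= 0 && j < tokens.length
    } yield (term, s"${tokens(j)}_${if (offset > 0) "+" else ""}$offset")
}

For "This ice cream is sweet." with n=1 this produces the single-word pairs listed above (plus "sweet is_-1" for the final token).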

dt_spark.sh error

The following line should be deleted:

 set -o nounset # Error on referencing undefined variables, shorthand: set -n

Otherwise the script does not work (at least on macOS):

~/Desktop/JoSimText/scripts$ bash dt_spark.sh
dt_spark.sh: line 6: $1: unbound variable

~/Desktop/JoSimText/scripts$ vim dt_spark.sh # removing line here

~/Desktop/JoSimText/scripts$ bash dt_spark.sh
Compute a DT from different input formats (conll, corpus, termcontext)
parameters: <format> <input-directory> <output-directory> <config.sh>

Support of multiword expressions for the trigrams

Motivation

n-gram based feature extraction should also support multiword expressions. Otherwise, only single terms can be represented, and important terms such as "ice cream" or "new york times" in principle cannot end up in a DT.

Implementation

  1. Add an optional parameter to https://github.com/uhh-lt/josimtext/blob/master/src/main/scala/de/uhh/lt/jst/dt/Text2TrigramTermContext.scala that takes a vocabulary file as input, in the same format as this file: https://github.com/uhh-lt/josimtext/blob/master/src/test/resources/voc-tiny.csv

  2. Generate features for all single words exactly as now, but in addition generate features for all multiword expressions found in the input list. Example:

Input text:

This ice cream is sweet.

Input MWE vocabulary:

ice cream

Features generated:

this _@_ice
ice this_@_cream
cream ice_@_is
is cream_@_sweet
ice cream this_@_is

Multiword expressions loaded from the input file can be lowercased, and a "lower-cased match" in the text should be considered sufficient: to check a match, it is enough to test whether the lowercased form of a text sequence is in the dictionary of loaded MWEs.
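
A minimal sketch of this matching step, assuming MWEs of 2 to 4 tokens and a lowercased vocabulary set (the names and the length bound are illustrative assumptions, not the tool's actual interface):

object MweTrigrams {
  // For every occurrence of a known MWE, emit the MWE paired with a
  // trigram-style feature "left_@_right" built from its sentence context.
  // `mwes` is assumed to contain lowercased multiword expressions.
  def extract(tokens: Seq[String], mwes: Set[String]): Seq[(String, String)] = {
    val toks = tokens.map(_.toLowerCase) // the tool currently lowercases everything
    for {
      start <- toks.indices
      len   <- 2 to 4 // assumed maximum MWE length in tokens
      if start + len <= toks.length
      candidate = toks.slice(start, start + len).mkString(" ")
      if mwes.contains(candidate) // a lower-cased match is sufficient
      left  = if (start > 0) toks(start - 1) else ""
      right = if (start + len < toks.length) toks(start + len) else ""
    } yield (candidate, s"${left}_@_$right")
  }
}

For "This ice cream is sweet." with the vocabulary {"ice cream"}, this adds exactly the pair "ice cream  this_@_is" from the example above.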

Enhance configuration mechanics

The objectives could be:

  • configuration mechanics should be more standard, in the best case mimicking Spark's behavior:
    • We should use uppercase environment variables
    • We should reuse environment variables whenever possible; compare JoSimText's file /scripts/config/local.config.sh with $SPARK_HOME/conf/spark-env.sh.template. For example, instead of creating spark_gb=8 and then passing it explicitly as a CLI argument to each of the ~20 bash scripts, we can just use SPARK_EXECUTOR_MEMORY and SPARK_DRIVER_MEMORY (see the sketch after this list)
  • configuration should respect that there are not only different environments, but also different usage patterns. The same configuration should support the sbt console and custom spark-submit.sh commands, and be used by all of the scripts in /scripts/*.sh
  • Reduce the boilerplate of model parametrization in bash scripts. There should be a way to run a script with a specific model parametrization without changing the script. The parameters should be easy enough to copy to other usage patterns, i.e. unit tests, the sbt console, etc.
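
A minimal sketch of what reusing Spark's variables could look like on the Scala side (the JstConfig object and the default values are hypothetical, not existing project code):

object JstConfig {
  // Reuse Spark's standard environment variables instead of ad-hoc,
  // lowercase ones like spark_gb; fall back to a default when unset.
  val executorMemory: String = sys.env.getOrElse("SPARK_EXECUTOR_MEMORY", "4g")
  val driverMemory: String   = sys.env.getOrElse("SPARK_DRIVER_MEMORY", "4g")
}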

It seems reasonable to tackle this early, first because tackling it late will leave us no time to gain real-world experience with the changes. There is already a lot of boilerplate, and continuing as we are forces us to create even more in order to stay consistent.

For trigrams, make it possible to turn off lowercasing

Motivation

Currently the text is always lowercased, which (i) does not correspond to the "original" JBT implementation, and (ii) will cause confusion on Chris's side, as he often prefers to keep the original case:

https://github.com/uhh-lt/josimtext/blob/master/src/main/scala/de/uhh/lt/jst/dt/Text2TrigramTermContext.scala#L35

Implementation

Add a boolean command line parameter that controls lowercasing. Set this parameter to true by default (lowercase everything, as now).
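
A minimal sketch of the flag, assuming a positional CLI argument (the argument layout and names are illustrative, not the tool's actual interface):

object LowercaseFlagSketch {
  def main(args: Array[String]): Unit = {
    // args: <input> <output> [lowercase]; defaults to true so the current
    // behaviour (lowercase everything) is preserved.
    val lowercase = if (args.length > 2) args(2).toBoolean else true
    val normalize: String => String =
      if (lowercase) (s: String) => s.toLowerCase else (s: String) => s
    // ... apply `normalize` to every token before trigram extraction ...
  }
}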
