Comments (21)

stevenoh93 avatar stevenoh93 commented on August 23, 2024

Hello,

One of the requirements for the training data is: "The file must contain at least 49 unique questions."
Could you try to get more training data, or use a corpus from StackExchange?

from professor-languo.

germanattanasio avatar germanattanasio commented on August 23, 2024

@dbanda, do you have an idea of what this is?

dbanda avatar dbanda commented on August 23, 2024

I agree with Steven. It looks like there is something wrong with your training data. Check that it complies with the requirements.

poonamsaini17 avatar poonamsaini17 commented on August 23, 2024

I used the health Stack Exchange corpus.
While indexing, it showed almost 50 answers...

poonamsaini17 avatar poonamsaini17 commented on August 23, 2024

I tried again. Can you tell me where I am making the mistake now?
1)
output_ingestion.txt

2016-05-05 12:59:46,412 INFO  [main] pipeline.PipelineDriver (PipelineDriver.java:93) - Training set has 54 questions and 54 answers
2016-05-05 12:59:46,412 INFO  [main] pipeline.PipelineDriver (PipelineDriver.java:99) - Validation set has 0 questions and 0 answers
2016-05-05 12:59:46,412 INFO  [main] pipeline.PipelineDriver (PipelineDriver.java:105) - Test set has 0 questions and 0 answers
2016-05-05 12:59:46,793 INFO  [main] api.PipelineQuestionAnswerer (PipelineQuestionAnswerer.java:288) - Initializing scheduler with 1 threads.
2016-05-05 12:59:47,303 INFO  [main] pipeline.RnrMergerAndRanker (RnrMergerAndRanker.java:116) - Solr cluster - sc75089e73_2a3b_4e9f_ae1e_4c7cfa09878d
2016-05-05 12:59:47,303 INFO  [main] pipeline.RnrMergerAndRanker (RnrMergerAndRanker.java:117) - Solr collection - collection_name
2016-05-05 12:59:47,303 INFO  [main] pipeline.RnrMergerAndRanker (RnrMergerAndRanker.java:118) - Ranker name - ranker_name
2016-05-05 12:59:47,303 INFO  [main] pipeline.RnrMergerAndRanker (RnrMergerAndRanker.java:119) - Rows per query - 50
2016-05-05 12:59:47,353 INFO  [main] api.PipelineQuestionAnswerer (PipelineQuestionAnswerer.java:135) - number of folds:1
2016-05-05 12:59:47,353 INFO  [main] api.PipelineQuestionAnswerer (PipelineQuestionAnswerer.java:136) - trainable components:[]
2016-05-05 12:59:47,423 INFO  [pool-3-thread-1] api.PipelineQuestionAnswerer$3 (PipelineQuestionAnswerer.java:150) - 8794
2016-05-05 12:59:53,524 INFO  [pool-3-thread-1] api.PipelineQuestionAnswerer$3 (PipelineQuestionAnswerer.java:150) - 8914
2016-05-05 12:59:57,498 INFO  [pool-3-thread-1] api.PipelineQuestionAnswerer$3 (PipelineQuestionAnswerer.java:150) - 8894
2016-05-05 13:00:02,387 INFO  [pool-3-thread-1] api.PipelineQuestionAnswerer$3 (PipelineQuestionAnswerer.java:150) - 9015
2016-05-05 13:00:06,019 INFO  [pool-3-thread-1] api.PipelineQuestionAnswerer$3 (PipelineQuestionAnswerer.java:150) - 9123
2016-05-05 13:00:07,762 INFO  [pool-3-thread-1] api.PipelineQuestionAnswerer$3 (PipelineQuestionAnswerer.java:150) - 8185
2016-05-05 13:00:09,243 INFO  [pool-3-thread-1] api.PipelineQuestionAnswerer$3 (PipelineQuestionAnswerer.java:150) - 8390
2016-05-05 13:00:10,504 INFO  [pool-3-thread-1] api.PipelineQuestionAnswerer$3 (PipelineQuestionAnswerer.java:150) - 8414
2016-05-05 13:00:11,755 INFO  [pool-3-thread-1] api.PipelineQuestionAnswerer$3 (PipelineQuestionAnswerer.java:150) - 8647
2016-05-05 13:00:12,706 INFO  [pool-3-thread-1] api.PipelineQuestionAnswerer$3 (PipelineQuestionAnswerer.java:150) - 8588
2016-05-05 13:00:13,677 INFO  [pool-3-thread-1] api.PipelineQuestionAnswerer$3 (PipelineQuestionAnswerer.java:150) - 9858
2016-05-05 13:00:14,641 INFO  [pool-3-thread-1] api.PipelineQuestionAnswerer$3 (PipelineQuestionAnswerer.java:150) - 9229
2016-05-05 13:00:15,591 INFO  [pool-3-thread-1] api.PipelineQuestionAnswerer$3 (PipelineQuestionAnswerer.java:150) - 9317
2016-05-05 13:00:16,302 INFO  [pool-3-thread-1] pipeline.RnrMergerAndRanker (RnrMergerAndRanker.java:252) - So far, 10 out of 12 questions have been retained
2016-05-05 13:00:16,302 INFO  [pool-3-thread-1] api.PipelineQuestionAnswerer$3 (PipelineQuestionAnswerer.java:150) - 9362
2016-05-05 13:00:17,524 INFO  [pool-3-thread-1] api.PipelineQuestionAnswerer$3 (PipelineQuestionAnswerer.java:150) - 9375
2016-05-05 13:00:18,237 INFO  [pool-3-thread-1] api.PipelineQuestionAnswerer$3 (PipelineQuestionAnswerer.java:150) - 9433
2016-05-05 13:00:18,937 INFO  [pool-3-thread-1] api.PipelineQuestionAnswerer$3 (PipelineQuestionAnswerer.java:150) - 9445
2016-05-05 13:00:19,612 INFO  [pool-3-thread-1] api.PipelineQuestionAnswerer$3 (PipelineQuestionAnswerer.java:150) - 9499
2016-05-05 13:00:20,350 INFO  [pool-3-thread-1] api.PipelineQuestionAnswerer$3 (PipelineQuestionAnswerer.java:150) - 9561
2016-05-05 13:00:21,126 INFO  [pool-3-thread-1] api.PipelineQuestionAnswerer$3 (PipelineQuestionAnswerer.java:150) - 11147
2016-05-05 13:00:21,806 INFO  [pool-3-thread-1] api.PipelineQuestionAnswerer$3 (PipelineQuestionAnswerer.java:150) - 2508
2016-05-05 13:00:22,533 INFO  [pool-3-thread-1] api.PipelineQuestionAnswerer$3 (PipelineQuestionAnswerer.java:150) - 2521
2016-05-05 13:00:23,723 INFO  [pool-3-thread-1] api.PipelineQuestionAnswerer$3 (PipelineQuestionAnswerer.java:150) - 2316
2016-05-05 13:00:24,926 INFO  [pool-3-thread-1] api.PipelineQuestionAnswerer$3 (PipelineQuestionAnswerer.java:150) - 2660
2016-05-05 13:00:25,873 INFO  [pool-3-thread-1] pipeline.RnrMergerAndRanker (RnrMergerAndRanker.java:252) - So far, 20 out of 23 questions have been retained
2016-05-05 13:00:25,873 INFO  [pool-3-thread-1] api.PipelineQuestionAnswerer$3 (PipelineQuestionAnswerer.java:150) - 3966
2016-05-05 13:00:26,846 INFO  [pool-3-thread-1] api.PipelineQuestionAnswerer$3 (PipelineQuestionAnswerer.java:150) - 13122
2016-05-05 13:00:28,135 INFO  [pool-3-thread-1] api.PipelineQuestionAnswerer$3 (PipelineQuestionAnswerer.java:150) - 13055
2016-05-05 13:00:29,356 INFO  [pool-3-thread-1] api.PipelineQuestionAnswerer$3 (PipelineQuestionAnswerer.java:150) - 13070
2016-05-05 13:00:30,300 INFO  [pool-3-thread-1] api.PipelineQuestionAnswerer$3 (PipelineQuestionAnswerer.java:150) - 13179
2016-05-05 13:00:31,340 INFO  [pool-3-thread-1] api.PipelineQuestionAnswerer$3 (PipelineQuestionAnswerer.java:150) - 12790
2016-05-05 13:00:32,592 INFO  [pool-3-thread-1] api.PipelineQuestionAnswerer$3 (PipelineQuestionAnswerer.java:150) - 12620
2016-05-05 13:00:35,569 INFO  [pool-3-thread-1] api.PipelineQuestionAnswerer$3 (PipelineQuestionAnswerer.java:150) - 12563
2016-05-05 13:00:39,409 INFO  [pool-3-thread-1] api.PipelineQuestionAnswerer$3 (PipelineQuestionAnswerer.java:150) - 4975
2016-05-05 13:00:43,446 INFO  [pool-3-thread-1] api.PipelineQuestionAnswerer$3 (PipelineQuestionAnswerer.java:150) - 12748
2016-05-05 13:00:45,968 INFO  [pool-3-thread-1] api.PipelineQuestionAnswerer$3 (PipelineQuestionAnswerer.java:150) - 12679
2016-05-05 13:00:47,946 INFO  [pool-3-thread-1] api.PipelineQuestionAnswerer$3 (PipelineQuestionAnswerer.java:150) - 12264
2016-05-05 13:00:49,643 INFO  [pool-3-thread-1] pipeline.RnrMergerAndRanker (RnrMergerAndRanker.java:252) - So far, 30 out of 35 questions have been retained
2016-05-05 13:00:49,643 INFO  [pool-3-thread-1] api.PipelineQuestionAnswerer$3 (PipelineQuestionAnswerer.java:150) - 12288
2016-05-05 13:00:51,105 INFO  [pool-3-thread-1] api.PipelineQuestionAnswerer$3 (PipelineQuestionAnswerer.java:150) - 13797
2016-05-05 13:00:52,056 INFO  [pool-3-thread-1] api.PipelineQuestionAnswerer$3 (PipelineQuestionAnswerer.java:150) - 5287
2016-05-05 13:00:53,263 INFO  [pool-3-thread-1] api.PipelineQuestionAnswerer$3 (PipelineQuestionAnswerer.java:150) - 13556
2016-05-05 13:00:53,944 INFO  [pool-3-thread-1] api.PipelineQuestionAnswerer$3 (PipelineQuestionAnswerer.java:150) - 13688
2016-05-05 13:00:54,856 INFO  [pool-3-thread-1] api.PipelineQuestionAnswerer$3 (PipelineQuestionAnswerer.java:150) - 13759
2016-05-05 13:00:55,647 INFO  [pool-3-thread-1] api.PipelineQuestionAnswerer$3 (PipelineQuestionAnswerer.java:150) - 13281
2016-05-05 13:00:56,358 INFO  [pool-3-thread-1] api.PipelineQuestionAnswerer$3 (PipelineQuestionAnswerer.java:150) - 13311
2016-05-05 13:00:57,048 INFO  [pool-3-thread-1] api.PipelineQuestionAnswerer$3 (PipelineQuestionAnswerer.java:150) - 5613
2016-05-05 13:00:57,729 INFO  [pool-3-thread-1] api.PipelineQuestionAnswerer$3 (PipelineQuestionAnswerer.java:150) - 13371
2016-05-05 13:00:58,445 INFO  [pool-3-thread-1] pipeline.RnrMergerAndRanker (RnrMergerAndRanker.java:252) - So far, 40 out of 45 questions have been retained
2016-05-05 13:00:58,446 INFO  [pool-3-thread-1] api.PipelineQuestionAnswerer$3 (PipelineQuestionAnswerer.java:150) - 5812
2016-05-05 13:00:59,466 INFO  [pool-3-thread-1] api.PipelineQuestionAnswerer$3 (PipelineQuestionAnswerer.java:150) - 13411
2016-05-05 13:01:00,381 INFO  [pool-3-thread-1] api.PipelineQuestionAnswerer$3 (PipelineQuestionAnswerer.java:150) - 5830
2016-05-05 13:01:01,600 INFO  [pool-3-thread-1] api.PipelineQuestionAnswerer$3 (PipelineQuestionAnswerer.java:150) - 5735
2016-05-05 13:01:02,812 INFO  [pool-3-thread-1] api.PipelineQuestionAnswerer$3 (PipelineQuestionAnswerer.java:150) - 6293
2016-05-05 13:01:03,792 INFO  [pool-3-thread-1] api.PipelineQuestionAnswerer$3 (PipelineQuestionAnswerer.java:150) - 6246
2016-05-05 13:01:04,757 INFO  [pool-3-thread-1] api.PipelineQuestionAnswerer$3 (PipelineQuestionAnswerer.java:150) - 6308
2016-05-05 13:01:05,702 INFO  [pool-3-thread-1] api.PipelineQuestionAnswerer$3 (PipelineQuestionAnswerer.java:150) - 7721
2016-05-05 13:01:06,632 INFO  [main] pipeline.RnrMergerAndRanker (RnrMergerAndRanker.java:160) - Attempting to create ranker.
2016-05-05 13:01:15,051 INFO  [main] pipeline.RnrMergerAndRanker (RnrMergerAndRanker.java:163) - Writing training data to disk
2016-05-05 13:01:15,095 INFO  [main] pipeline.RnrMergerAndRanker (RnrMergerAndRanker.java:167) - Writing training data to disk done
2016-05-05 13:01:15,105 ERROR [main] pipeline.RnrMergerAndRanker (RnrMergerAndRanker.java:194) - The key [ranker_id] was not in the map
2016-05-05 13:01:15,105 INFO  [main] pipeline.RnrMergerAndRanker (RnrMergerAndRanker.java:196) - Ranker ID: null
2016-05-05 13:01:15,195 INFO  [main] pipeline.PipelineDriver (PipelineDriver.java:133) - PipelineDriver total elapsed time: 1 minute 29 seconds

stevenoh93 avatar stevenoh93 commented on August 23, 2024

Hi,
During indexing, the corpus might have more than 50 questions, but the service is looking for 50 "good" questions, meaning questions that have a selected "correct" answer.
There may be questions in the corpus that have not been answered yet; the PipelineQuestionAnswerer removes them to save space, since they would be dropped during the training phase anyway.

2016-05-05 13:00:58,445 INFO  [pool-3-thread-1] pipeline.RnrMergerAndRanker (RnrMergerAndRanker.java:252) - So far, 40 out of 45 questions have been retained

This line means that 5 questions did not have any candidate answers, so they were removed.
In our experience, a corpus with around 50K questions yields good results.
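
As a rough illustration of how to read that progress line, the sketch below pulls the last "So far, X out of Y questions have been retained" message out of a log and reports how many questions were dropped. The function name `last_retained_count` and the regular expression are assumptions made for this example; they are not part of professor-languo.

```python
import re

# Matches progress lines such as:
#   "... - So far, 40 out of 45 questions have been retained"
RETAINED_RE = re.compile(r"So far, (\d+) out of (\d+) questions have been retained")


def last_retained_count(log_text):
    """Return (retained, seen) from the last progress line, or None if absent."""
    matches = RETAINED_RE.findall(log_text)
    if not matches:
        return None
    retained, seen = matches[-1]
    return int(retained), int(seen)


# Two progress lines copied from the log posted in this thread.
sample_log = """\
2016-05-05 13:00:49,643 INFO  [pool-3-thread-1] pipeline.RnrMergerAndRanker (RnrMergerAndRanker.java:252) - So far, 30 out of 35 questions have been retained
2016-05-05 13:00:58,445 INFO  [pool-3-thread-1] pipeline.RnrMergerAndRanker (RnrMergerAndRanker.java:252) - So far, 40 out of 45 questions have been retained
"""

retained, seen = last_retained_count(sample_log)
print(f"{retained} retained, {seen - retained} dropped")
```

Note that the progress message appears to be printed only every ten retained questions, so the last line can undercount the final total.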

poonamsaini17 avatar poonamsaini17 commented on August 23, 2024

I don't understand. Can you please explain in more detail?
2016-05-06 17:36:44,291 INFO [main] pipeline.RnrMergerAndRanker (RnrMergerAndRanker.java:160) - Attempting to create ranker.
2016-05-06 17:37:09,237 INFO [main] pipeline.RnrMergerAndRanker (RnrMergerAndRanker.java:163) - Writing training data to disk
2016-05-06 17:37:09,255 INFO [main] pipeline.RnrMergerAndRanker (RnrMergerAndRanker.java:167) - Writing training data to disk done
2016-05-06 17:37:09,291 ERROR [main] pipeline.RnrMergerAndRanker (RnrMergerAndRanker.java:194) - The key [ranker_id] was not in the map
2016-05-06 17:37:09,291 INFO [main] pipeline.RnrMergerAndRanker (RnrMergerAndRanker.java:196) - Ranker ID: null
2016-05-06 17:37:09,361 INFO [main] pipeline.PipelineDriver (PipelineDriver.java:133) - PipelineDriver total elapsed time: 2 minutes 58 seconds

Attaching a new output file
Uploading outputOfPipelineRunner.txt…

poonamsaini17 avatar poonamsaini17 commented on August 23, 2024

closed by mistake

stevenoh93 avatar stevenoh93 commented on August 23, 2024

Sorry if that wasn't clear.

The StackExchange database has a system where the user asking the question chooses a 'correct' answer among the various answers they receive.

The ranker needs at least 50 questions to train. However, it ignores any questions that do not have a selected answer, because they have no value in the training process.

Therefore, to save space, the PipelineQuestionAnswerer discards such questions from the training data before they are sent off to the training process.

In the output log, you'll find this message:

So far, 40 out of 45 questions have been retained

This means that, out of the 45 questions you had, five did not have a chosen answer and were discarded.
So even if you have more than 50 questions to feed into the training process, some of them might be discarded, and you may end up with fewer than 50 questions at the end. That would result in a failure to create the ranker.

Essentially, I believe your data set is still too small for this training process.
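
The filtering described above can be sketched as follows. This is a minimal illustration, not the actual PipelineQuestionAnswerer logic: the dictionary layout, the `accepted_answer_id` field, and the function names are hypothetical, and the threshold of 50 is the figure quoted in this thread.

```python
MIN_QUESTIONS = 50  # threshold mentioned in this thread; confirm against the service docs


def usable_questions(questions):
    """Keep only questions that have an accepted ("correct") answer."""
    return [q for q in questions if q.get("accepted_answer_id") is not None]


def enough_to_train(questions, minimum=MIN_QUESTIONS):
    """True if enough questions survive the filter to attempt ranker training."""
    return len(usable_questions(questions)) >= minimum


# Made-up corpus: 54 questions, every sixth one has no accepted answer.
corpus = [
    {"id": i, "accepted_answer_id": None if i % 6 == 0 else 1000 + i}
    for i in range(54)
]

print(len(corpus), "questions,", len(usable_questions(corpus)), "usable")
print("enough to train:", enough_to_train(corpus))
```

With this made-up corpus, 54 questions go in but only 45 survive the filter, so the check fails even though the raw count is above 50.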

dgterry avatar dgterry commented on August 23, 2024

Thanks @stevenoh93
@poonamsaini17, assuming it fits your needs, one thing that might make this easier is to use the english.stackexchange.com.7z file from the Stack Exchange archive site. This is the file we used to host the running version of the app, and we can confirm it is large enough to pass the training phase.

poonamsaini17 avatar poonamsaini17 commented on August 23, 2024

@stevenoh93
Thanks for clearing that up.
But yesterday I tried with a much larger data set:
170 out of 180 questions were retained.
So the count is now more than 50.

The ranker ID still came out as null.

poonamsaini17 avatar poonamsaini17 commented on August 23, 2024

outputOfPipelineRunner.txt

stevenoh93 avatar stevenoh93 commented on August 23, 2024

Could you attach the res/trainingdata.csv?

stevenoh93 avatar stevenoh93 commented on August 23, 2024

Could you also tell us what corpus you are using?

poonamsaini17 avatar poonamsaini17 commented on August 23, 2024

The English corpus was taking a very long time,
so I tried the biology one.

stevenoh93 avatar stevenoh93 commented on August 23, 2024

Do you have the res/trainingdata.csv file? I think that could clear up some problems that you might be having. It's not yet clear to me what the problem is just from the log statements.

poonamsaini17 avatar poonamsaini17 commented on August 23, 2024

image

poonamsaini17 avatar poonamsaini17 commented on August 23, 2024

I am not allowed to upload a CSV file here, so I just pasted a screenshot of that file.

stevenoh93 avatar stevenoh93 commented on August 23, 2024

Maybe check it in to a repository and share the link? The screenshot is not enough information unfortunately :(

stevenoh93 avatar stevenoh93 commented on August 23, 2024

Okay, I think I know what the problem is.
Here is a requirement from the documentation:

The number of records must be at least 50 times the number of fields that are identified in your Solr configuration. For example, if your collection defines five fields, you must have at least 250 records in your training data.

With the addition of custom scorers, this requirement is not met.
Could you disable all custom scorers by opening src/main/resources/app_config.properties and deleting all scorers? The property name is EGA_METADATA_FEATURE_SCORERS, and it should look like this:

EGA_METADATA_FEATURE_SCORERS=

After saving, try running it again.
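
For reference, here is a small sketch of the two steps above: computing the minimum record count from the quoted "50 times the number of fields" rule, and blanking the scorer property in the text of a properties file. The helper names, the `SOME_OTHER_SETTING` entry, and the scorer values `scorerA,scorerB` are made up for illustration; only EGA_METADATA_FEATURE_SCORERS comes from this thread.

```python
RECORDS_PER_FIELD = 50  # from the documentation excerpt quoted above


def required_records(num_fields):
    """Minimum number of training records for a given number of fields."""
    return RECORDS_PER_FIELD * num_fields


def clear_property(properties_text, key):
    """Return properties_text with `key` set to an empty value."""
    out = []
    for line in properties_text.splitlines():
        # Compare only the part before the first "=" so other keys are untouched.
        if line.split("=", 1)[0].strip() == key:
            out.append(f"{key}=")
        else:
            out.append(line)
    return "\n".join(out)


# Hypothetical file contents; the real app_config.properties will differ.
config = "SOME_OTHER_SETTING=1\nEGA_METADATA_FEATURE_SCORERS=scorerA,scorerB"

print(required_records(5))  # the documentation's example: five fields
print(clear_property(config, "EGA_METADATA_FEATURE_SCORERS"))
```

The reasoning in this thread is that every extra scorer adds a feature and so raises the record count the ranker needs. Clearing the property lowers that requirement again.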

germanattanasio avatar germanattanasio commented on August 23, 2024

I'm going to close this. I think @stevenoh93's suggestion fixed the issue.
