Comments (11)
@muffpy you can take over
from charabia.
Hey @Abastien1734,
We do not assign people to issues; you can start working straight away!
Happy to know you're interested in the product. Thanks!
from charabia.
If I want to work on this, does it need to be assigned to me, or can I just fork and start working?
from charabia.
@Abastien1734 are you still working on this or can I take over?
from charabia.
@ManyTheFish Hi, I was looking at the part of the implementation that you pointed out in the description of this issue. I was thinking of a change like this in the mod.rs file:
if s.chars().all(char::is_numeric) {
    Some(s)
} else {
    self.current = self.segmenter.segment_str(s);
    self.next()
}
Please let me know if I am going in the right direction.
from charabia.
Hello @239yash,
What you suggested could work.
However, it's not sufficient for floating-point numbers: the provided string could contain dots, and such a string would not pass your check.
Could you make sure to add some tests in the codebase?
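To illustrate the limitation described above, here is a minimal standalone sketch of the proposed check. The function name is_all_numeric is hypothetical and not part of Charabia, and the empty-string guard is an added assumption (all() on an empty iterator returns true).

```rust
/// Hypothetical helper mirroring the proposed check:
/// true when `s` is non-empty and every character is numeric.
fn is_all_numeric(s: &str) -> bool {
    // Assumption: an empty string should not count as a number.
    !s.is_empty() && s.chars().all(char::is_numeric)
}
```

With this check, "1234" is accepted, while "1234.5" is rejected because '.' is not a numeric character.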
from charabia.
@ManyTheFish I will add handling for floating-point numbers too if the current approach doesn't work out. I will also add the test cases after the implementation. Thanks!
from charabia.
if s.parse::<f64>().is_ok() {
    Some(s)
} else {
    self.current = self.segmenter.segment_str(s);
    self.next()
}
I have tested this with floating-point numbers, but it's not working: the string "1234.5" is already split into two pieces, "1234" and "5", by the time it reaches the code block above. The logic works fine for pure integer strings. Can you help me with this?
Should I open a PR for this so that you can take a look?
One more thing: while running the test cases, I noticed that the segmenter_segment_str test is also failing for the languages below.
segmenter::japanese::test::segmenter_segment_str
segmenter::khmer::test::segmenter_segment_str
segmenter::thai::test::segmenter_segment_str
The error printed is:
---- segmenter::japanese::test::segmenter_segment_str stdout ----
thread 'segmenter::japanese::test::segmenter_segment_str' panicked at charabia/src/segmenter/japanese.rs:144:5:
assertion `left == right` failed:
Segmenter JapaneseSegmenter didn't segment the text as expected.
help: the `segmented` text provided to `test_segmenter!` does not corresponds to the output of the tested segmenter, it's probably due to a bug in the segmenter or a mistake in the provided segmented text.
left: ["関西", "国際", "空港", "限定", "トート", "バッグ", " ", "1", "2", "3", "4", " ", "すもも", "も", "もも", "も", "もも", "の", "うち"]
right: ["関西", "国際", "空港", "限定", "トート", "バッグ", " ", "1234", " ", "すもも", "も", "もも", "も", "もも", "の", "うち"]
The number "1234" is getting split into individual digit strings. Can you help me with what extra changes are needed for this? I tried a few things without success.
from charabia.
Hello @239yash,
Why don't you go for something simpler, like:
if s.chars().all(|c| c.is_numeric() || c.is_ascii_punctuation()) {
You may know that the separators are customizable in Charabia, meaning that they have already been processed before calling this part of the code.
Let's say the . is part of the separators, then the text 123.456 will be preprocessed as ["123", ".", "456"].
For the test case, it's expected that the digit characters will now be joined together.
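As a rough standalone sketch of the simpler check suggested above (the function name is_numeric_or_punctuation is hypothetical, and is_ascii_punctuation is assumed as the punctuation test; Charabia itself may use something different):

```rust
/// Hypothetical helper: true when every character of `s` is either
/// numeric or ASCII punctuation.
fn is_numeric_or_punctuation(s: &str) -> bool {
    // Note: a lone "." and the empty string also satisfy this check.
    s.chars()
        .all(|c| c.is_numeric() || c.is_ascii_punctuation())
}
```

Each pre-split piece of ["123", ".", "456"] passes this check, and so does an unsplit "123.456" when . is not a separator.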
from charabia.
Hello @ManyTheFish,
I am trying my luck on this as my first issue, and it led me to two questions:
- Are floating-point numbers currently supported? In the Latin segmenter tests, 32.3 is expected to become ["32", ".", "3"].
- The segmenter_segment_str test uses AhoSegmentedStrIter directly and not SegmentedStrIter, so languages like Thai that use the FST_SEGMENTER won't pass the integer test with the fix you proposed. Is it an issue with the fix or with the test?
It looks like I got to the same point as @239yash; I would be happy to collaborate if they are still working on this issue.
from charabia.
Hello @42plamusse,
Are floating-point numbers currently supported? In the Latin segmenter tests, 32.3 is expected to become ["32", ".", "3"].
Yes, you're right, but the test uses the default separator set, which includes . and therefore separates the lemmas linked by it. However, this set is customizable, and . could be removed from the list, meaning that it would no longer be split out, so it's possible to receive ["32.3"], but not by default.
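The effect of the separator set can be pictured with a toy splitter. This is only an illustrative sketch, not Charabia's implementation; the function split_keeping_separators and its signature are hypothetical.

```rust
/// Toy splitter (hypothetical): cuts `text` at every character found in
/// `separators` and keeps each separator as its own piece.
fn split_keeping_separators<'a>(text: &'a str, separators: &[char]) -> Vec<&'a str> {
    let mut pieces = Vec::new();
    let mut start = 0;
    for (idx, ch) in text.char_indices() {
        if separators.contains(&ch) {
            // Emit the text accumulated before the separator, if any.
            if start < idx {
                pieces.push(&text[start..idx]);
            }
            // Emit the separator itself as a separate piece.
            let end = idx + ch.len_utf8();
            pieces.push(&text[idx..end]);
            start = end;
        }
    }
    // Emit whatever remains after the last separator.
    if start < text.len() {
        pieces.push(&text[start..]);
    }
    pieces
}
```

With '.' in the separator list, "32.3" comes out as three pieces; with '.' removed from the list, it stays as the single piece "32.3".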
The segmenter_segment_str test is using AhoSegmentedStrIter directly and not SegmentedStrIter, so languages like Thai using the FST_SEGMENTER won't pass the integer test with the fix you proposed. Is it a fix issue or is it a test issue?
Yes, you're right, the test macro should be updated to separate the expected output of segmenter_segment_str from that of segment.
On the other hand, we could remove numbers from the tests in the specialized tokenizer.
from charabia.