Comments (11)

akoumjian commented on July 27, 2024

This will be a tricky one, thinking on it!

from datefinder.

ranchodeluxe commented on July 27, 2024

@akoumjian: when you get a chance, check out my fix for this issue in my dev branch: 31d3da7

This solution does not deal with returning date ranges, which is one possible meaning of date strings like this. The general idea, using 'i am looking for dates on 01/01/2013 to 02/01/2016' as an example, is that every match goes through this workflow:

  1. First pass with DateFinder.extract_dates to match smaller tokens >> 'on 01/01/2013 to 02/01/2016'
  2. DateParser._find_and_replace substitutes REPLACEMENT keywords with a special character >> '$ 01/01/2013 $ 02/01/2016'
  3. Second pass with DateFinder.extract_dates, adding the results to a lazy_stack so we don't change the meaning or functionality of the existing generators >> [ '01/01/2013', '02/01/2016' ]
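
The workflow above can be sketched roughly as follows. This is a minimal illustration, not the code from the commit: CHUNK_RE, TOKEN_RE, DATE_RE, replace_tokens and extract_dates here are hypothetical stand-ins built on plain regexes, and only numeric dates written like 01/01/2013 are handled.

```python
import re
from collections import deque

# Illustrative patterns only; these are NOT datefinder's real regexes.
DATE = r"\d{1,2}/\d{1,2}/\d{4}"
TOKEN = r"\b(?:on|to)\b"          # assumed REPLACEMENT keywords
PIECE = "(?:" + TOKEN + "|" + DATE + ")"

DATE_RE = re.compile(DATE)
TOKEN_RE = re.compile(TOKEN)
# Pass 1: a run of dates and keywords separated by whitespace.
CHUNK_RE = re.compile(PIECE + r"(?:\s+" + PIECE + r")*")


def replace_tokens(chunk, placeholder="$"):
    """Step 2: swap each keyword for a special character."""
    return TOKEN_RE.sub(placeholder, chunk)


def extract_dates(text):
    """Yield date strings lazily, one chunk at a time."""
    stack = deque()
    for chunk in CHUNK_RE.finditer(text):        # step 1: first pass
        cleaned = replace_tokens(chunk.group())  # step 2: substitution
        stack.extend(DATE_RE.findall(cleaned))   # step 3: second pass
        while stack:
            yield stack.popleft()


text = "i am looking for dates on 01/01/2013 to 02/01/2016"
print(list(extract_dates(text)))  # ['01/01/2013', '02/01/2016']
```

Pushing the second-pass results onto a queue and draining it inside the generator keeps the caller-facing behaviour lazy, which is the property the lazy_stack is meant to preserve.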

akoumjian commented on July 27, 2024

I agree that we don't want to return any sense of 'ranges', that we just want the individual dates. Your solution appears to work, but as your commit message predicts, I do not love the multiple pass approach. :-P

Something to consider: 'to' is in the list of extra tokens, along with tokens like 'due', purely for extra functionality I wanted in my original use case. The idea was to capture a little extra context around each datetime string. That way, if you pass source=True, you can do further analysis on the source string to determine which of the returned dates is likely the one you are looking for.

One alternative would be to let the user pass in an EXTRA_TOKENS list. That way, if it breaks things, as 'to' does in this date range example, it's on the user and not on the package.
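
As a rough sketch of what a user-supplied token list could look like: the names build_pattern, find_date_strings and DEFAULT_EXTRA_TOKENS below are hypothetical, and the regex is a toy that only knows numeric dates. It is not the package's API, only an illustration of the trade-off.

```python
import re

DATE = r"\d{1,2}/\d{1,2}/\d{4}"
# Assumed default list for illustration; not the package's actual tokens.
DEFAULT_EXTRA_TOKENS = ("due", "by", "on", "to")


def build_pattern(extra_tokens=DEFAULT_EXTRA_TOKENS):
    """Build a matcher for runs of dates and the given context tokens."""
    if extra_tokens:
        token = r"\b(?:" + "|".join(re.escape(t) for t in extra_tokens) + r")\b"
        piece = "(?:" + token + "|" + DATE + ")"
    else:
        piece = DATE
    return re.compile(piece + r"(?:\s+" + piece + r")*")


def find_date_strings(text, extra_tokens=DEFAULT_EXTRA_TOKENS):
    """Return matched source strings that contain at least one date."""
    pattern = build_pattern(extra_tokens)
    return [m.group() for m in pattern.finditer(text)
            if re.search(DATE, m.group())]


text = "report due 01/01/2013 to 02/01/2016"
print(find_date_strings(text))
# ['due 01/01/2013 to 02/01/2016']
print(find_date_strings(text, extra_tokens=("due",)))
# ['due 01/01/2013', '02/01/2016']
```

With 'to' in the list the range is swallowed as one source string; dropping it keeps the 'due' context while letting the two dates match separately.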

ranchodeluxe commented on July 27, 2024

I agree, two passes feels odd. I was never totally sold on it. Out of that hack I just liked the whole idea of keeping the solution lazy and open to retries by pushing the work back on the stack.

At first I liked the idea of the user passing EXTRA_TOKENS. But then that's more for them to think about and it really asks them to know something about how this tool works, which feels gross.

There are a couple of issues that this ticket and the work on it have exposed:

  1. Regardless of which tokens we choose, there could always be a case where it matches dates that wrap an EXTRA_TOKEN. For example, consider 'I like to list dates such as 01/01/2015, 02/01/2016, 03/01/2017'. A good solution would make sure that doesn't happen regardless of the token.
  2. The above issue points to a simpler base case, which I feel the regex should solve but doesn't:
In [6]: list(datefinder_instance.extract_date_strings('01/01/2015 02/01/2016 03/01/2017'))
Out[6]: 
[('01/01/2015 02/01/2016 03/01/2017',
  (0, 32),
  {'days': [],
   'delimiters': ['/', '/', ' ', '/', '/', ' ', '/', '/'],
   'digits': ['01', '01', '2015', '02', '01', '2016', '03', '01', '2017'],
   'digits_modifier': [],
   'extra_tokens': [],
   'hours': [],
   'minutes': [],
   'months': [],
   'seconds': [],
   'time': [],
   'time_periods': [],
   'timezones': []})]

akoumjian commented on July 27, 2024

Okay, this just got way more complicated

akoumjian commented on July 27, 2024

Some follow up thoughts. Two approaches pop into my head:

  1. Switch over to a laundry list of regexes, or use a laundry list of regexes for the more basic valid types and apply them in a second pass.
  2. Use the number of matches per group type to infer how and why to split up the match, then re-run the pieces through the engine as is.

The second one of course gets extremely messy. However, the order of the delimiters is preserved. That could be our key out of this whole thing! We could then match against delimiter patterns to infer intention. Again, this requires a bit of laundry listing.
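
A minimal sketch of the delimiter-pattern idea, assuming the captures arrive as parallel 'digits' and 'delimiters' lists like the ones in the output shown earlier in the thread. KNOWN_PATTERNS and split_on_delimiter_patterns are hypothetical names, and the pattern list is deliberately tiny.

```python
# The "laundry list": delimiter sequences that make up a single date.
KNOWN_PATTERNS = (("/", "/"), ("-", "-"), (".", "."))


def split_on_delimiter_patterns(digits, delimiters, patterns=KNOWN_PATTERNS):
    """Regroup captured digits into dates using the delimiter order.

    delimiters[i] is assumed to sit between digits[i] and digits[i + 1].
    """
    dates, start = [], 0
    while start < len(digits):
        for pattern in patterns:
            end = start + len(pattern)
            if end < len(digits) and tuple(delimiters[start:end]) == pattern:
                pieces = [digits[start]]
                for offset, delim in enumerate(pattern, start=1):
                    pieces += [delim, digits[start + offset]]
                dates.append("".join(pieces))
                start = end + 1  # also skips the delimiter separating dates
                break
        else:
            start += 1  # no known pattern starts here; skip this digit group
    return dates


digits = ['01', '01', '2015', '02', '01', '2016', '03', '01', '2017']
delimiters = ['/', '/', ' ', '/', '/', ' ', '/', '/']
print(split_on_delimiter_patterns(digits, delimiters))
# ['01/01/2015', '02/01/2016', '03/01/2017']
```

Each recovered string could then be handed back to the existing parser individually, which is the "re-run through the engine" step from the second approach.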

akoumjian commented on July 27, 2024

While we're at it, we may also want to consider taking better advantage of dateutil's fuzzy matching and extensibility:

http://dateutil.readthedocs.org/en/latest/parser.html

For instance, fuzzy and fuzzy_with_tokens already let you pass in a string containing common tokens, which are then ignored. And you can extend or replace that token list via subclassing!

Even with fuzzy options, dateutil still won't recognize multiple datetimes, which is to be expected. The other problem is how incredibly liberal dateutil is in parsing strings.

>>> parser.parse('01')
datetime.datetime(2016, 3, 1, 0, 0)

If that weren't the case, then we could just try splitting on delimiters like commas and spaces.

akoumjian commented on July 27, 2024

I have a terrible idea. PR incoming

akoumjian commented on July 27, 2024

Okay, luckily the terrible idea I had didn't work. I was going to take strings which dateutil couldn't parse and, assuming they contain multiple dates, attempt to parse incrementing substrings until I could find one that wasn't a datetime. Then I would backtrack, call that date1, start the substring slicing at the new index, and repeat.

Thankfully, this didn't work. So onto the next solution.

Chinmay41018 commented on July 27, 2024

Even if parsing is used, for instance on 'How many people were born from May to June 2017?', and the phrase is split on the 'to', 'May' is still not identified as a month, since no year is specified. Also, a bare month such as 'July' as the input text isn't identified. Is there a solution for this?

itsshavar commented on July 27, 2024

Hi, I am not able to extract the dates from this sentence:
"In continuation with same meets happened on 16-JUL-2020, 06-AUG-2020, 03-SEP-2020, 22-oxytocin challenge test-2020, 12-NOV-2020, 03-DEC-2020, 28-DEC-2020, 18-JAN-2021, 18-JAN-2021, 09-MAR-2021 and on 06-Apr-2021."

Only 06-Apr-2021 is getting extracted.
