Comments (11)
This will be a tricky one, thinking on it!
from datefinder.
@akoumjian: when you get a chance, check out my fix for this issue in my dev branch: 31d3da7
This solution does not deal with returning date ranges, which is one possible meaning of parsing date strings like this. The general idea, using "i am looking for dates on 01/01/2013 to 02/01/2016" as an example, is that every match will go through this workflow:
- first pass on DateFinder.extract_dates to match smaller tokens >> on 01/01/2013 to 02/01/2016
- DateParser._find_and_replace substitutes REPLACEMENT keywords with a special character >> $ 01/01/2013 $ 02/01/2016
- second pass with DateFinder.extract_dates and add results to a lazy_stack so we don't change the meaning or functionality of existing generators >> [ '01/01/2013', '02/01/2016' ]
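The replace-then-rescan part of that workflow can be sketched roughly as follows. This is a minimal standard-library illustration, not the code from the commit: REPLACEMENTS, SEPARATOR, DATE_RE, find_and_replace and extract_after_replacement are hypothetical names, and the date pattern is deliberately narrowed to the dd/mm/yyyy shape used in the example.

```python
import re

# Hypothetical keyword list; the real package may use different tokens.
REPLACEMENTS = ("to", "on")
SEPARATOR = "$"

# Deliberately narrow: only matches dates shaped like 01/01/2013.
DATE_RE = re.compile(r"\b\d{2}/\d{2}/\d{4}\b")

KEYWORD_RE = re.compile(
    r"\b(?:%s)\b" % "|".join(re.escape(word) for word in REPLACEMENTS),
    re.IGNORECASE,
)


def find_and_replace(text):
    """Swap whole-word REPLACEMENTS keywords for the separator character."""
    return KEYWORD_RE.sub(SEPARATOR, text)


def extract_after_replacement(text):
    """Lazily yield each date string found between separator characters."""
    for chunk in find_and_replace(text).split(SEPARATOR):
        for match in DATE_RE.finditer(chunk):
            yield match.group(0)
```

Because extract_after_replacement is a generator, callers that already consume results lazily would not need to change.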
from datefinder.
I agree that we don't want to return any sense of 'ranges', that we just want the individual dates. Your solution appears to work, but as your commit message predicts, I do not love the multiple pass approach. :-P
Something to consider is that 'to' in the list of extra tokens, along with tokens like 'due', is there purely for extra functionality I wanted in my original use case. The idea was to produce a little extra context about where we are capturing the datetime strings. That way, if you pass source=True, you can do further analysis on the source string to determine which of the dates you are getting back is likely the one you are looking for.
One alternative would be allowing the user to pass in an EXTRA_TOKENS list. That way, if it breaks things, as 'to' does in this date range example, it's just on the user and not on the package.
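As a rough illustration of that alternative, the sketch below builds a matcher from a caller-supplied token list. Everything in it is hypothetical: build_pattern, find_with_context, DEFAULT_EXTRA_TOKENS and the extra_tokens parameter are not part of datefinder's API, and the date pattern is narrowed to dd/mm/yyyy for brevity.

```python
import re

# Hypothetical defaults; a caller could pass a shorter list or an empty one.
DEFAULT_EXTRA_TOKENS = ("due", "by", "on", "to")
DATE = r"\d{2}/\d{2}/\d{4}"


def build_pattern(extra_tokens=DEFAULT_EXTRA_TOKENS):
    """Compile a pattern for a date with an optional leading context token."""
    if extra_tokens:
        tokens = "|".join(re.escape(token) for token in extra_tokens)
        prefix = r"(?:\b(?P<extra_token>%s)\s+)?" % tokens
    else:
        prefix = ""
    return re.compile(prefix + r"(?P<date>%s)" % DATE, re.IGNORECASE)


def find_with_context(text, extra_tokens=DEFAULT_EXTRA_TOKENS):
    """Return (context token or None, date string) pairs found in text."""
    pattern = build_pattern(extra_tokens)
    return [
        (match.groupdict().get("extra_token"), match.group("date"))
        for match in pattern.finditer(text)
    ]
```

A caller who does not want 'to' treated as context would pass a list without it, for example find_with_context(text, extra_tokens=("due",)).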
from datefinder.
I agree, two passes feels odd. I was never totally sold on it. Out of that hack I just liked the whole idea of keeping the solution lazy and open to retries by pushing the work back on the stack.
At first I liked the idea of the user passing EXTRA_TOKENS. But then that's more for them to think about and it really asks them to know something about how this tool works, which feels gross.
There are a couple issues that this ticket and the work on it have exposed:
- Regardless of what tokens we choose, there could always be a case where it will match dates that wrap an EXTRA_TOKEN. For example, consider "I like to list dates such as 01/01/2015, 02/01/2016, 03/01/2017". Seems like a good solution would make sure that doesn't happen regardless of the token.
- The above issue begs a simpler base case which I feel like the regex should solve but doesn't:
In [6]: list(datefinder_instance.extract_date_strings('01/01/2015 02/01/2016 03/01/2017'))
Out[6]:
[('01/01/2015 02/01/2016 03/01/2017',
(0, 32),
{'days': [],
'delimiters': ['/', '/', ' ', '/', '/', ' ', '/', '/'],
'digits': ['01', '01', '2015', '02', '01', '2016', '03', '01', '2017'],
'digits_modifier': [],
'extra_tokens': [],
'hours': [],
'minutes': [],
'months': [],
'seconds': [],
'time': [],
'time_periods': [],
'timezones': []})]
from datefinder.
Okay, this just got way more complicated
from datefinder.
Some follow up thoughts. Two approaches pop into my head:
- Switch over to using a laundry list of regexes, or use a laundry list of regexes for the more basic valid types and apply them in a second pass.
- Use the number of group type matches to infer how and why to split up the match, then re-run the pieces through the engine as is after splitting.
The second one of course gets extremely messy. Although, the order of delimiters is preserved. That could be our key out of this whole thing! Then we could match against delimiter patterns to infer intention. Again, requires a bit of laundry listing.
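To make the delimiter idea concrete, here is a minimal sketch that regroups the captured digits using the order of the delimiters. It assumes input shaped like the 'digits' and 'delimiters' lists in the capture output earlier in this thread; split_on_delimiter_pattern and its date_delimiters parameter are hypothetical, and the rule it applies (three digit groups per date, any other delimiter starts a new date) is only one entry in the laundry list.

```python
def split_on_delimiter_pattern(digits, delimiters, date_delimiters=("/", "-", ".")):
    """Regroup captured digit groups into separate date strings.

    Assumes each date is at most three digit groups joined by date
    delimiters, and that any other delimiter (such as a space) or a
    fourth digit group starts a new date.
    """
    if not digits:
        return []
    dates = []
    current = digits[0]
    parts = 1
    # delimiters[i] sits between digits[i] and digits[i + 1].
    for delimiter, digit in zip(delimiters, digits[1:]):
        if delimiter in date_delimiters and parts < 3:
            current += delimiter + digit
            parts += 1
        else:
            dates.append(current)
            current = digit
            parts = 1
    dates.append(current)
    return dates
```

Each resulting string could then be re-run through the existing engine on its own.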
from datefinder.
While we're at it, we may also want to consider taking better advantage of dateutil's fuzzy matching and extensibility:
http://dateutil.readthedocs.org/en/latest/parser.html
For instance, the fuzzy/fuzzy_with_tokens options already let you pass in a string containing common tokens, which are then ignored. And you can extend or replace that list via subclassing!
Even with fuzzy options, dateutil still won't recognize multiple datetimes, which is to be expected. The other problem is how incredibly liberal dateutil is in parsing strings.
>>> parser.parse('01')
datetime.datetime(2016, 3, 1, 0, 0)
If that weren't the case, then we could just try splitting on delimiters like commas and spaces.
from datefinder.
I have a terrible idea. PR incoming
from datefinder.
Okay, luckily the terrible idea I had didn't work. I was going to take strings which dateutil couldn't parse and, assuming they contain multiple dates, attempt to parse incrementing substrings until I could find one that wasn't a datetime. Then I would backtrack, call that date1, start the substring slicing at the new index, and repeat.
Thankfully, this didn't work. So onto the next solution.
from datefinder.
Even if parsing is used, for instance on 'How many people were born from May to June 2017?', and the phrase is split on the 'to', it is still unable to identify 'May' as a month, since no year is specified. Also, just the month 'July' as input text isn't identified. Is there a solution for this?
from datefinder.
Hi, I am not able to extract the dates from this sentence.
"In continuation with same meets happened on 16-JUL-2020, 06-AUG-2020, 03-SEP-2020, 22-oxytocin challenge test-2020, 12-NOV-2020, 03-DEC-2020, 28-DEC-2020, 18-JAN-2021, 18-JAN-2021, 09-MAR-2021 and on 06-Apr-2021."
Only 06-Apr-2021 is getting extracted.
from datefinder.