Code for alexlamson.com
alexlamson / datawrangler
Makes quick and dirty data mining easier in Sublime Text
License: MIT License
It tries to be a little smart and guess what kind of spacing is being used between the columns (e.g. tabs vs. spaces vs. commas vs. commas with spaces).
Options to fix the problem:
* remember where the columns are when you align the columns (easier)
* build a detector function that automatically detects that format (better if the data gets changed after aligning)
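A rough sketch of what the detector option could look like. It only counts separators per line and is not taken from the plugin. The name `guess_separator` and the `CANDIDATES` list are made up for illustration, and quoted cells that contain a separator are not handled here.

```python
from collections import Counter

# Hypothetical candidate list. Longer separators come first so that
# ", " wins a tie against "," on data like "a, b, c".
CANDIDATES = [", ", ",", "\t", "  "]

def guess_separator(lines):
    """Return the candidate that gives the most lines the same non-zero separator count."""
    best, best_score = None, 0
    for sep in CANDIDATES:
        # How many times does this separator occur on each non-blank line?
        counts = Counter(line.count(sep) for line in lines if line.strip())
        if not counts:
            continue
        n_seps, n_lines = counts.most_common(1)[0]
        # Strict ">" keeps the earlier (longer) candidate when scores tie.
        if n_seps > 0 and n_lines > best_score:
            best, best_score = sep, n_lines
    return best
```

Because it looks at the text each time it runs, this kind of check keeps working if the data is edited after aligning, which is the advantage noted for the second option.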
String to reproduce error:
ReportId, ActivityId, Name, GroupId, CommonGroupId, GroupListId, CommonGroupListId, IsActive, IsBillable, StartLocalTime, StartUtcTime, EndLocalTime, EndUtcTime, Notes, RelatedActivityId, SourceId, CurrentChangeSequence, CurrentChangeRandomValue, Other
(3, 611, 'ReportId, ActivityId, Name, GroupId, CommonGroupId • (Quantified Self, manictime sample) - Sublime Text (UNREGISTERED)', 62, 12, None, None, 1, None, '2018-06-27 13:39:31', '2018-06-27 17:39:31', '2018-06-27 13:39:36', '2018-06-27 17:39:36', None, None, None, 1500, 1218390469, '{}')
Stack trace:
File "C:\Program Files\Sublime Text 3\sublime_plugin.py", line 1072, in run_
return self.run(edit)
File "C:\Users\Alex\AppData\Roaming\Sublime Text 3\Packages\User\DataWrangler.py", line 205, in run
column_widths = detect_col_widths(self, sep, num_columns)
File "C:\Users\Alex\AppData\Roaming\Sublime Text 3\Packages\User\DataWrangler.py", line 59, in detect_col_widths
column_widths[i] = max(len(cell_string), column_widths[i])
IndexError: list index out of range
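A likely cause: the quoted title in the data row contains ", " itself, so a plain split yields more cells than the header has columns. Below is a minimal sketch of a width calculation that tolerates this. It is a standalone illustration, not the plugin's `detect_col_widths`; the name `col_widths` and its parameters are assumptions, and it does not strip the surrounding parentheses of the sample row.

```python
import csv

def col_widths(lines, delimiter=",", quotechar="'"):
    """Compute the max cell width per column, tolerating rows of unequal length."""
    widths = []
    # csv.reader keeps separators inside quoted cells from splitting the cell.
    reader = csv.reader(lines, delimiter=delimiter, quotechar=quotechar,
                        skipinitialspace=True)
    for row in reader:
        for i, cell in enumerate(row):
            if i == len(widths):
                widths.append(0)  # grow the list instead of indexing past its end
            widths[i] = max(widths[i], len(cell))
    return widths
```

Growing the list on demand avoids the IndexError even when a row still ends up with more cells than expected.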
AA
BB
CC
DD
EE
The above example doesn't retain the CC in the output. There should be a setting to enable this behavior.
This may have to exist in a setting or keybind to allow switching this on and off
* add spaces to columns such that there is no vertical overlap between columns
* maybe also re-order the columns such that the narrow columns come first?
Example:
aaaa bb ccc
dd eeeeee ff
Becomes:
aaaa bb     ccc
dd   eeeeee ff
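A minimal sketch of the padding step, assuming cells are separated by whitespace and contain no spaces themselves. `align_columns` is a hypothetical name, and the column re-ordering idea is not covered.

```python
def align_columns(lines):
    """Pad whitespace-separated cells so each column starts at the same offset."""
    rows = [line.split() for line in lines]
    n_cols = max((len(row) for row in rows), default=0)
    # Widest cell in each column; short rows simply do not contribute.
    widths = [max(len(row[i]) for row in rows if i < len(row))
              for i in range(n_cols)]
    return [" ".join(cell.ljust(widths[i]) for i, cell in enumerate(row)).rstrip()
            for row in rows]
```

For the example above, `align_columns(["aaaa bb ccc", "dd eeeeee ff"])` pads `bb` to the width of `eeeeee` and `dd` to the width of `aaaa`, so no column overlaps its neighbour vertically.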
List of stopwords can be found here:
https://gist.github.com/sebleier/554280
For each line, remove it if the string matches any of the following regexes:
^$
^\t+$
^,+$
^ +$
^(, )+$
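The filter above can be sketched in a few lines. This assumes each line has already had its trailing newline removed; `BLANK_LINE` and `drop_blank_lines` are illustrative names only.

```python
import re

# The five patterns listed above, combined into one alternation.
BLANK_LINE = re.compile(r"^$|^\t+$|^,+$|^ +$|^(, )+$")

def drop_blank_lines(lines):
    """Keep only the lines that match none of the 'empty row' patterns."""
    return [line for line in lines if not BLANK_LINE.match(line)]
```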
Automatically detecting dtypes and column separators could potentially be done with pandas. If that's possible, it should be done.
Not sure how it should be implemented, but take all lines that are uncommon but within a certain edit distance of a commonly occurring line, and replace the uncommon line with the common line.
Also, it may be wise to have some human oversight into this process while it happens.
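One possible shape for this, as a sketch only. It uses the similarity ratio from the standard library's `difflib` in place of a true edit distance, and it returns suggestions rather than rewriting anything, so a human can review them first. The function name and both thresholds are assumptions.

```python
from collections import Counter
from difflib import SequenceMatcher

def suggest_corrections(lines, min_common=3, min_ratio=0.8):
    """Map each rare line to the most similar common line, if it is similar enough."""
    counts = Counter(lines)
    common = [line for line, n in counts.items() if n >= min_common]
    suggestions = {}
    for line, n in counts.items():
        if n >= min_common:
            continue  # already a common line
        # Score this rare line against every common line and keep the best.
        scored = [(SequenceMatcher(None, line, c).ratio(), c) for c in common]
        if scored:
            ratio, best = max(scored)
            if ratio >= min_ratio:
                suggestions[line] = best
    return suggestions
```

Comparing every rare line with every common line is quadratic in the worst case, which is probably fine for the quick and dirty documents this plugin targets.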
Make it use regex or something to separate the words
Use NLTK to do word segmentation on the document. Spaces between words will be replaced with newlines.
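For comparison, a regex-only sketch of the same idea. This is a stand-in and not NLTK; the pattern is an assumption that treats letters, digits and underscores as word characters and keeps internal apostrophes and hyphens.

```python
import re

# A word is a run of word characters, optionally joined by ' or - (isn't, data-mining).
WORD = re.compile(r"\w+(?:['-]\w+)*")

def one_word_per_line(text):
    """Return the words of `text` joined by newlines, dropping punctuation."""
    return "\n".join(WORD.findall(text))
```

A real tokenizer will handle more cases than this pattern does, such as abbreviations and non-English text.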
Convert dates and times to the format YYYY-MM-DD HH:MM:SS.SS (where the hour is zero-padded 24-hour time)
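A minimal sketch using only `datetime`. The list of accepted input formats is hypothetical and would need to grow to cover real data. Fractional seconds are truncated to two digits, not rounded.

```python
from datetime import datetime

# Hypothetical input formats to try, in order.
INPUT_FORMATS = ["%Y-%m-%d %H:%M:%S", "%Y-%m-%d %H:%M:%S.%f", "%m/%d/%Y %I:%M %p"]

def normalize_timestamp(text):
    """Return `text` as YYYY-MM-DD HH:MM:SS.SS, or None if no format matches."""
    for fmt in INPUT_FORMATS:
        try:
            parsed = datetime.strptime(text.strip(), fmt)
        except ValueError:
            continue  # try the next format
        hundredths = parsed.microsecond // 10000
        return parsed.strftime("%Y-%m-%d %H:%M:%S") + ".{:02d}".format(hundredths)
    return None
```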
When copying from Google Sheets, there are lines at the end of the document that contain nothing but tabs. This will remove those lines.
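A small sketch of that cleanup, assuming the document is available as a list of lines. `strip_trailing_tab_rows` is an illustrative name; only rows at the end are dropped, so tab-only rows in the middle of the data stay put.

```python
def strip_trailing_tab_rows(lines):
    """Drop rows at the end of the document that contain only tabs (or nothing)."""
    end = len(lines)
    # Walk backwards while the last remaining row is empty apart from tabs/line endings.
    while end > 0 and lines[end - 1].strip("\t\r\n") == "":
        end -= 1
    return lines[:end]
```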
For instance,
https://github.com/recite/re-cite.org/
The plugin seems useful, btw.
Assuming that the document being viewed is a table, this command should delete the column that the cursor is currently in
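A sketch of the core logic, kept independent of the Sublime Text API. `row` and `col` stand for the cursor's line number and character offset, which a plugin would presumably read from the view's selection; the function name, the parameters and the plain `split` on the separator are all assumptions.

```python
def delete_column_at(lines, row, col, sep=","):
    """Remove from every line the column that contains offset `col` on line `row`."""
    # The number of separators left of the cursor is the column index.
    index = lines[row][:col].count(sep)
    result = []
    for line in lines:
        cells = line.split(sep)
        if index < len(cells):
            del cells[index]  # short rows are left as they are
        result.append(sep.join(cells))
    return result
```

For example, with the cursor on the second cell of the first line, `delete_column_at(["a,bb,c", "1,22,3"], 0, 3)` gives `["a,c", "1,3"]`.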
Given a list of elements, list all pairs of elements from that list
ex.
Input
a,b,c,d
Output
a b
a c
a d
b c
b d
c d
Can be done with the following code
import itertools
# Print each unordered pair on its own line, e.g. "a b".
for pair in itertools.combinations("a,b,c,d".split(","), 2):
    print(" ".join(pair))