karlicoss / hpi

Human Programming Interface 🧑👽🤖

Home Page: https://beepb00p.xyz/hpi.html

License: MIT License

Languages: Python 98.92%, Shell 1.08%
Topics: quantified-self, extended-mind, personal-api, lifelogging, data-liberation

hpi's Introduction

If you’re in a hurry, feel free to jump straight to the demos.

  • see SETUP for the installation/configuration guide
  • see DEVELOPMENT for the development guide
  • see DESIGN for the design goals
  • see MODULES for module-specific setup
  • see MODULE_DESIGN for some thoughts on structuring modules, and possibly extending HPI
  • see exobrain/HPI for some of my raw thoughts and todos on the project

TLDR: I’m using the HPI (Human Programming Interface) package as a means of unifying, accessing and interacting with all of my personal data.

HPI is a Python package (named my), a collection of modules for:

  • social networks: posts, comments, favorites
  • reading: e-books and pdfs
  • annotations: highlights and comments
  • todos and notes
  • health data: sleep, exercise, weight, heart rate, and other body metrics
  • location
  • photos & videos
  • browser history
  • instant messaging

The package hides the gory details of locating data, parsing, error handling and caching. You simply ‘import’ your data and get to work with familiar Python types and data structures.

  • Here’s a short example to give you an idea: “which subreddits do I find the most interesting?”
    import my.reddit.all
    from collections import Counter
    print(Counter(s.subreddit for s in my.reddit.all.saved()).most_common(4))
        
    [('orgmode', 62), ('emacs', 60), ('selfhosted', 51), ('QuantifiedSelf', 46)]

I consider my digital trace an important part of my identity (#extendedmind). Usually the data is siloed, and accessing it is inconvenient and borderline frustrating. This feels very wrong.

In contrast, once the data is available as Python objects, I can easily plug it into existing tools, libraries and frameworks. It makes building new tools considerably easier and opens up new ways of interacting with the data.

I tried different things over the years and I think I’m getting to the point where other people can also benefit from my code by ‘just’ plugging in their data, and that’s why I’m sharing this.

Imagine if all your life was reflected digitally and available at your fingertips. This library is my attempt to achieve this vision.

Table of contents:
  • Why?
  • How does a Python package help?
    • Why don’t you just put everything in a massive database?
  • What’s inside?
  • How do you use it?
  • Ad-hoc and interactive
    • What were my music listening stats for 2018?
    • What are the most interesting Slate Star Codex posts I’ve read?
    • Accessing exercise data
    • Book reading progress
    • Messenger stats
    • Which month in 2020 did I make the most git commits in?
    • Querying Roam Research database
  • How does it get input data?
  • Q & A
    • Why Python?
    • Can anyone use it?
    • How easy is it to use?
    • What about privacy?
    • But should I use it?
    • Would it suit me?
    • What isn’t it?
  • HPI Repositories
  • Related links

Why?

The main reason that led me to develop this is my dissatisfaction with the current situation:

  • Our personal data is siloed and trapped across cloud services and various devices

    Even when it’s possible to access the data via an API, it’s hardly useful unless you’re an experienced programmer willing to invest time and infrastructure.

  • We have insane amounts of data scattered across the cloud, yet we’re left at the mercy of those who collect it to provide something useful based on it

    Integrations of data across silo boundaries are almost non-existent. There is so much potential and it’s all wasted.

  • I’m not willing to wait till some vaporware project reinvents the whole computing model from scratch

    As a programmer, I am in a position to do something right now, even though it’s not necessarily perfect and consistent.

I’ve written a lot about it here, so allow me to simply quote:

  • search and information access
    • Why can’t I search over all of my personal chat history with a friend, whether it’s ICQ logs from 2005 or Whatsapp logs from 2019?
    • Why can’t I have incremental search over my tweets? Or browser bookmarks? Or over everything I’ve ever typed/read on the Internet?
    • Why can’t I search across my watched youtube videos, even though most of them have subtitles hence allowing for full text search?
    • Why can’t I see the places my friends recommended to me on Google maps (or any other maps app)?
  • productivity
    • Why can’t my Google Home add shopping list items to Google Keep? Let alone other todo-list apps.
    • Why can’t I create a task in my todo list or calendar from a conversation on Facebook Messenger/Whatsapp/VK.com/Telegram?
  • journaling and history
    • Why do I have to lose all my browser history if I decide to switch browsers?
    • Why can’t I see all the places I traveled to on a single map and photos alongside?
    • Why can’t I see what my heart rate (i.e. excitement) and speed were side by side with the video I recorded on GoPro while skiing?
    • Why can’t I easily transfer all my books and metadata if I decide to switch from Kindle to PocketBook or vice versa?
  • consuming digital content
    • Why can’t I see stuff I highlighted on Instapaper as an overlay on top of the web page?
    • Why can’t I have a single ‘read it later’ list, unifying all things saved on Reddit/Hackernews/Pocket?
    • Why can’t I use my todo app instead of the ‘Watch later’ playlist on youtube?
    • Why can’t I ‘follow’ some user on Hackernews?
    • Why can’t I see if I’ve run across a Youtube video because my friend sent me a link months ago?
    • Why can’t I have uniform music listening stats based on my Last.fm/iTunes/Bandcamp/Spotify/Youtube?
    • Why am I forced to use Spotify’s music recommendation algorithm and don’t have an option to try something else?
    • Why can’t I easily see which books/music/art were recommended by my friends or by some specific Twitter/Reddit/Hackernews users?
    • Why doesn’t my otherwise perfect Hackernews app for Android share saved posts/comments with the website?
  • health and body maintenance
    • Why can’t I tell if I was more sedentary than usual during the past week and whether I need to compensate by doing a bit more exercise?
    • Why can’t I see what’s the impact of aerobic exercise on my resting HR?
    • Why can’t I have a dashboard for all of my health: food, exercise and sleep to see baselines and trends?
    • Why can’t I see the impact of temperature or CO2 concentration in my room on my sleep?
    • Why can’t I see how holidays (as in, not going to work) impact my stress levels?
    • Why can’t I take my Headspace app data and see how/if meditation impacts my sleep?
    • Why can’t I run a short snippet of code and check some random health advice on the Internet against my health data?
  • personal finance
    • Why am I forced to manually copy transactions from different banking apps into a spreadsheet?
    • Why can’t I easily match my Amazon/Ebay orders with my bank transactions?
  • Why can’t I do anything when I’m offline or have a wonky connection?
  • tools for thinking and learning
    • Why, when something like a ‘mind palace’ is literally possible with VR technology, do we not see any in use?
    • Why can’t I easily convert select Instapaper highlights or new foreign words I encountered on my Kindle into Anki flashcards?
  • mediocre interfaces
    • Why do I have to suffer from poor management and design decisions in UI changes, even if the interface is not the main reason I’m using the product?
    • Why can’t I leave priorities and notes on my saved Reddit/Hackernews items?
    • Why can’t I leave private notes on Deliveroo restaurants/dishes, so I’d remember what to order/not to order next time?
    • Why do people have to suffer from Google Inbox shutdown?
  • communication and collaboration
    • Why can’t I easily share my web or book highlights with a friend? Or just make highlights in select books public?
    • Why can’t I easily find out another person’s expertise without interrogating them, just by looking at what they read instead?
  • backups
    • Why do I have to think about it and actively invest time and effort?
  • I’m tired of having to use multiple different messengers and social networks
  • I’m tired of shitty bloated interfaces

    Why do we have to be at the mercy of their developers, designers and product managers? If we had our data at hand, we could fine-tune interfaces for our needs.

  • I’m tired of mediocre search experience

    Text search is something computers do exceptionally well. Yet, often it’s not available offline, it’s not incremental, everyone reinvents their own query language, and so on.

  • I’m frustrated by poor information exploring and processing experience

    While for many people services like Reddit or Twitter are simply time killers (and I don’t judge), some want to use them efficiently, as a source of information and research. The modern bookmarking experience makes that far from easy.

You can dismiss this as a list of first-world problems, and you would be right, they are. But the major reason I want to solve these problems is to be better at learning and working with knowledge, so I could be better at solving the real problems.

How does a Python package help?

When I started solving some of these problems for myself, I noticed a common pattern: the hardest bit is actually getting hold of your data in the first place. It’s inherently error-prone and frustrating.

But once you have the data in a convenient representation, working with it is pleasant – you get to explore and build instead of fighting with yet another stupid REST API.

This package knows how to find data on your filesystem, deserialize it and normalize it to a convenient representation. You have the full power of the programming language to transform the data and do whatever comes to your mind.
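
To make this concrete, here’s a minimal sketch of the pattern (the module, file layout and field names are invented for illustration, not an actual HPI module): locate the raw exports on disk, parse them, and expose typed objects.

import json
from datetime import datetime
from pathlib import Path
from typing import Iterator, NamedTuple

class Post(NamedTuple):
    dt: datetime
    text: str

def posts(export_dir: Path = Path('~/exports/someservice').expanduser()) -> Iterator[Post]:
    # each JSON file is one full export; go through them oldest to newest
    for f in sorted(export_dir.glob('*.json')):
        for raw in json.loads(f.read_text()):
            yield Post(dt=datetime.fromtimestamp(raw['created']), text=raw['text'])

Real modules additionally hide the gory details of deduplication, error handling and caching, but the shape is the same.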

Why don’t you just put everything in a massive database?

Glad you’ve asked! I wrote a whole post about it.

In short: while databases are efficient and easy to read from, often they aren’t flexible enough to fit your data. You’re probably going to end up writing code anyway.

While working with your data, you’ll inevitably notice common patterns and code repetition, which you’ll probably want to extract somewhere. That’s where a Python package comes in.

What’s inside?

Here’s the (incomplete) list of the modules:

my.bluemaestro - Bluemaestro temperature/humidity/pressure monitor
my.body.blood - Blood tracking (manual org-mode entries)
my.body.exercise.all - Combined exercise data
my.body.exercise.cardio - Cardio data, filtered from various data sources
my.body.exercise.cross_trainer - My cross trainer exercise data, arbitrated from different sources (mainly, Endomondo and manual text notes)
my.body.weight - Weight data (manually logged)
my.calendar.holidays - Holidays and days off work
my.coding.commits - Git commits data for repositories on your filesystem
my.demo - Just a demo module for testing and documentation purposes
my.emfit - Emfit QS sleep tracker
my.endomondo - Endomondo exercise data
my.fbmessenger - Facebook Messenger messages
my.foursquare - Foursquare/Swarm checkins
my.github.all - Unified Github data (merged from GDPR export and periodic API updates)
my.github.gdpr - Github data (uses official GDPR export)
my.github.ghexport - Github data: events, comments, etc. (API data)
my.hypothesis - Hypothes.is highlights and annotations
my.instapaper - Instapaper bookmarks, highlights and annotations
my.kobo - Kobo e-ink reader: annotations and reading stats
my.lastfm - Last.fm scrobbles
my.location.google - Location data from Google Takeout
my.location.home - Simple location provider, serving as a fallback when more detailed data isn't available
my.materialistic - Materialistic app for Hackernews
my.orgmode - Programmatic access and queries to org-mode files on the filesystem
my.pdfs - PDF documents and annotations on your filesystem
my.photos.main - Photos and videos on your filesystem, their GPS and timestamps
my.pinboard - Pinboard bookmarks
my.pocket - Pocket bookmarks and highlights
my.polar - Polar articles and highlights
my.reddit - Reddit data: saved items/comments/upvotes/etc.
my.rescuetime - Rescuetime (phone activity tracking) data
my.roamresearch - Roam data
my.rss.all - Unified RSS data, merged from different services I used historically
my.rss.feedbin - Feedbin RSS reader
my.rss.feedly - Feedly RSS reader
my.rtm - Remember The Milk tasks and notes
my.runnerup - Runnerup exercise data (TCX format)
my.smscalls - Phone calls and SMS messages
my.stackexchange.gdpr - Stackexchange data (uses official GDPR export)
my.stackexchange.stexport - Stackexchange data (uses API via stexport)
my.taplog - Taplog app data
my.time.tz.main - Timezone data provider, used to localize timezone-unaware timestamps for other modules
my.time.tz.via_location - Timezone data provider, guesses timezone based on location data (e.g. GPS)
my.twitter.all - Unified Twitter data (merged from the archive and periodic updates)
my.twitter.archive - Twitter data (uses official Twitter archive export)
my.twitter.twint - Twitter data (tweets and favorites), uses Twint data export
my.vk.vk_messages_backup - VK data (exported by Totktonada/vk_messages_backup)

Some modules are private, and need a bit of cleanup before merging:

my.workouts - Exercise activity, from Endomondo and manual logs
my.sleep.manual - Subjective sleep data, manually logged
my.nutrition - Food and drink consumption data, logged manually from different sources
my.money - Expenses and shopping data
my.webhistory - Browsing history (part of promnesia)

How do you use it?

Mainly I use it as a data provider for my scripts, tools, and dashboards.

Also, check out my infrastructure map. It might be helpful for understanding my vision for HPI.

Instant search

Typical search interfaces make me unhappy as they are siloed, slow, awkward to use and don’t work offline. So I built my own ways around it! I write about it in detail here.

In essence, I’m mirroring most of my online data like chat logs, comments, etc., as plaintext. I can overview it in any text editor, and incrementally search over all of it in a single keypress.
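
A rough sketch of what that mirroring can look like, assuming each message exposes dt and text attributes (the output path is arbitrary):

from pathlib import Path

from my.fbmessenger import messages

out = Path('~/plaintext/fbmessenger.txt').expanduser()
out.parent.mkdir(parents=True, exist_ok=True)
# one message per line; any editor, grep or ripgrep can then search it
out.write_text('\n'.join(f'{m.dt:%Y-%m-%d %H:%M} {m.text}' for m in messages()))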

orger

orger is a tool that helps you generate an org-mode representation of your data.

It lets you benefit from the existing tooling and infrastructure around org-mode, the most famous being Emacs.

I’m using it for:

  • searching, overviewing and navigating the data
  • creating tasks straight from the apps (e.g. Reddit/Telegram)
  • spaced repetition via org-drill

Orger comes with some existing modules, but it should be easy to adapt your own data source if you need something else.

I write about it in detail here and here.

promnesia

promnesia is a browser extension I’m working on to escape silos by unifying annotations and browsing history from different data sources.

I’ve been using it for more than a year now and working on final touches to properly release it for other people.

dashboard

As a big fan of #quantified-self, I’m working on a personal health, sleep and exercise dashboard, built from various data sources.

I’m working on making it public, you can see some screenshots here.

timeline

Timeline is a #lifelogging project I’m working on.

I want to see all my digital history, search in it, filter it, easily jump to a specific point in time and see the context of what was happening then. That way it works as a sort of external memory.

Ideally, it would look similar to Andrew Louis’s Memex, or might even reuse his interface if he open sources it. I highly recommend watching his talk for inspiration.

Ad-hoc and interactive

What were my music listening stats for 2018?

Single import away from getting tracks you listened to:

from my.lastfm import scrobbles
list(scrobbles())[200: 205]
[Scrobble(raw={'album': 'Nevermind', 'artist': 'Nirvana', 'date': '1282488504', 'name': 'Drain You'}),
 Scrobble(raw={'album': 'Dirt', 'artist': 'Alice in Chains', 'date': '1282489764', 'name': 'Would?'}),
 Scrobble(raw={'album': 'Bob Dylan: The Collection', 'artist': 'Bob Dylan', 'date': '1282493517', 'name': 'Like a Rolling Stone'}),
 Scrobble(raw={'album': 'Dark Passion Play', 'artist': 'Nightwish', 'date': '1282493819', 'name': 'Amaranth'}),
 Scrobble(raw={'album': 'Rolled Gold +', 'artist': 'The Rolling Stones', 'date': '1282494161', 'name': "You Can't Always Get What You Want"})]

Or, as a pretty Pandas frame:

import pandas as pd
df = pd.DataFrame([{
    'dt': s.dt,
    'track': s.track,
} for s in scrobbles()]).set_index('dt')
df[200: 205]
                                                                       track
dt                                                                          
2010-08-22 14:48:24+00:00                                Nirvana — Drain You
2010-08-22 15:09:24+00:00                           Alice in Chains — Would?
2010-08-22 16:11:57+00:00                   Bob Dylan — Like a Rolling Stone
2010-08-22 16:16:59+00:00                               Nightwish — Amaranth
2010-08-22 16:22:41+00:00  The Rolling Stones — You Can't Always Get What...

We can use the calmap library to plot a github-style music listening activity heatmap:

import matplotlib.pyplot as plt
plt.figure(figsize=(10, 2.3))

import calmap
df = df.set_index(df.index.tz_localize(None)) # calmap expects tz-unaware dates
calmap.yearplot(df['track'], how='count', year=2018)

plt.tight_layout()
plt.title('My music listening activity for 2018')
plot_file = 'hpi_files/lastfm_2018.png'
plt.savefig(plot_file)
print(plot_file)

https://beepb00p.xyz/hpi_files/lastfm_2018.png

This isn’t necessarily very insightful data, but fun to look at now and then!

What are the most interesting Slate Star Codex posts I’ve read?

My friend asked me if I could recommend some posts I found interesting on Slate Star Codex. With a few lines of Python I can quickly recommend the posts I engaged with most, i.e. the ones I annotated most on Hypothes.is.

from my.hypothesis import pages
from collections import Counter
cc = Counter({(p.title + ' ' + p.url): len(p.highlights) for p in pages() if 'slatestarcodex' in p.url})
print(cc.most_common(10))
The Anti-Reactionary FAQ http://slatestarcodex.com/2013/10/20/the-anti-reactionary-faq/ (32)
Reactionary Philosophy In An Enormous, Planet-Sized Nutshell https://slatestarcodex.com/2013/03/03/reactionary-philosophy-in-an-enormous-planet-sized-nutshell/ (17)
The Toxoplasma Of Rage http://slatestarcodex.com/2014/12/17/the-toxoplasma-of-rage/ (16)
What Universal Human Experiences Are You Missing Without Realizing It? https://slatestarcodex.com/2014/03/17/what-universal-human-experiences-are-you-missing-without-realizing-it/ (16)
Meditations On Moloch http://slatestarcodex.com/2014/07/30/meditations-on-moloch/ (12)
Universal Love, Said The Cactus Person http://slatestarcodex.com/2015/04/21/universal-love-said-the-cactus-person/ (11)
Untitled http://slatestarcodex.com/2015/01/01/untitled/ (11)
Considerations On Cost Disease https://slatestarcodex.com/2017/02/09/considerations-on-cost-disease/ (10)
In Defense of Psych Treatment for Attempted Suicide http://slatestarcodex.com/2013/04/25/in-defense-of-psych-treatment-for-attempted-suicide/ (9)
I Can Tolerate Anything Except The Outgroup https://slatestarcodex.com/2014/09/30/i-can-tolerate-anything-except-the-outgroup/ (9)

Accessing exercise data

E.g. see use of my.workouts here.

Book reading progress

I publish my reading stats on Goodreads so other people can see what I’m reading/have read, but Kobo lacks integration with Goodreads. I’m using kobuddy to access my Kobo data, and I’ve got a regular task that reminds me to sync my progress once a month.

The task looks like this:

* TODO [#C] sync [[https://goodreads.com][reading progress]] with kobo
  DEADLINE: <2019-11-24 Sun .+4w -0d>
[[eshell: python3 -c 'import my.kobo; my.kobo.print_progress()']]

With a single Enter keypress on the inlined eshell: command I can print the progress and fill in the completed books on Goodreads, e.g.:

A_Mathematician's_Apology by G. H. Hardy
Started : 21 Aug 2018 11:44
Finished: 22 Aug 2018 12:32

Fear and Loathing in Las Vegas: A Savage Journey to the Heart of the American Dream (Vintage) by Thompson, Hunter S.
Started : 06 Sep 2018 05:54
Finished: 09 Sep 2018 12:21

Sapiens: A Brief History of Humankind by Yuval Noah Harari
Started : 09 Sep 2018 12:22
Finished: 16 Sep 2018 07:25

Inadequate Equilibria: Where and How Civilizations Get Stuck by Eliezer Yudkowsky
Started : 31 Jul 2018 22:54
Finished: 16 Sep 2018 07:25

Albion Dreaming by Andy Roberts
Started : 20 Aug 2018 21:16
Finished: 16 Sep 2018 07:26

Messenger stats

How much do I chat on Facebook Messenger?
from my.fbmessenger import messages

import pandas as pd
import matplotlib.pyplot as plt

df = pd.DataFrame({'dt': m.dt, 'messages': 1} for m in messages())
df.set_index('dt', inplace=True)

df = df.resample('M').sum() # by month
df = df.loc['2016-01-01':'2019-01-01'] # past subset for determinism

fig, ax = plt.subplots(figsize=(15, 5))
df.plot(kind='bar', ax=ax)

# todo wonder if that vvv can be less verbose...
x_labels = df.index.strftime('%Y %b')
ax.set_xticklabels(x_labels)

plot_file = 'hpi_files/messenger_2016_to_2019.png'
plt.tight_layout()
plt.savefig(plot_file)
print(plot_file)

https://beepb00p.xyz/hpi_files/messenger_2016_to_2019.png

Which month in 2020 did I make the most git commits in?

If you like the shell or just want to quickly convert/grab some information from HPI, it also comes with a JSON query interface - so you can export the data, or just pipeline to your heart’s content:

# --stream: stream JSON objects as they're read
# --order-type datetime: find the 'datetime' attribute and order by that
# --after/--before: restrict to commits made in 2020
# jq extracts the datetime; the rest groups by month and graphs it
$ hpi query my.coding.commits.commits --stream \
    --order-type datetime \
    --after '2020-01-01' --before '2021-01-01' \
  | jq '.committed_dt' -r \
  | cut -d'-' -f-2 | sort | uniq -c | awk '{print $2,$1}' | sort -n | termgraph
2020-01: ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 458.00
2020-02: ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 440.00
2020-03: ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 545.00
2020-04: ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 585.00
2020-05: ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 518.00
2020-06: ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 755.00
2020-07: ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 467.00
2020-08: ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 449.00
2020-09: ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 1.03 K
2020-10: ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 791.00
2020-11: ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 474.00
2020-12: ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 383.00

See the query docs for more examples.

Querying Roam Research database

I’ve got some code examples here.

How does it get input data?

If you’re curious about any specific data sources I’m using, I’ve written it up in detail.

Also see the “Data flow” documentation, with some nice diagrams explaining specific examples.

In short:

  • The data is periodically synchronized from the services (cloud or not) locally, on the filesystem

    As a result, you get JSONs/sqlite (or other formats, depending on the service) on your disk.

    Once you have it, it’s trivial to back it up and synchronize to other computers/phones, if necessary.

    To schedule periodic sync, I’m using cron.

  • The my. package only accesses the data on the filesystem

    That makes it extremely fast, reliable, and fully offline capable.

As you can see, in such a setup the data lags behind ‘realtime’. I consider that a necessary sacrifice to make everything fast and resilient.

In theory, it’s possible to make the system almost realtime by having a service that sucks in data continuously (rather than periodically), but that’s harder to build as well.
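
For illustration, the sync side can be as simple as a script like the following, invoked from cron (the export command and paths here are hypothetical):

import subprocess
from datetime import date
from pathlib import Path

EXPORTS = Path('~/exports/someservice').expanduser()

def sync() -> None:
    EXPORTS.mkdir(parents=True, exist_ok=True)
    out = EXPORTS / f'export-{date.today().isoformat()}.json'
    with out.open('w') as f:
        # 'someservice-export' stands in for whatever export tool the service has
        subprocess.run(['someservice-export'], stdout=f, check=True)

if __name__ == '__main__':
    sync()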

Q & A

Why Python?

I don’t consider Python unique as a language suitable for such a project. It just happens to be the one I’m most comfortable with. I do have some reasons that I think make it specifically good, but explaining them is out of this post’s scope.

In addition, Python offers a very rich ecosystem for data analysis, which we can use to our benefit.

That said, I’ve never seen anything similar in other programming languages, and I would be really interested, so please send me links if you know of some. I’ve heard LISPs are great for data? ;)

Overall, I wish FFIs were a bit more mature, so we didn’t have to think about specific programming languages at all.

Can anyone use it?

Yes!
  • you can plug in your own data
  • most modules are isolated, so you can use only the ones you want
  • everything is easily extensible

    Starting from simply adding new modules, up to any dynamic hackery you can possibly imagine within Python.

How easy is it to use?

The whole setup requires some basic programmer literacy:
  • installing/running and potentially modifying Python code
  • using symlinks
  • potentially running cron jobs

If you have any ideas on making the setup simpler, please let me know!

What about privacy?

The modules contain no data, only code to operate on the data.

Everything is *local first*, the input data is on your filesystem. If you’re truly paranoid, you can even wrap it in a Docker container.

There is still a question of whether you trust yourself at even keeping all the data on your disk, but it is out of the scope of this post.

If you’d rather keep some code private too, it’s also trivial to achieve with a private subpackage.

But should I use it?

Sure, maybe you can achieve a perfect system where you can instantly find and recall anything that you’ve done. Do you really want it? Wouldn’t that, like, make you less human?

I’m not a gatekeeper of what it means to be human, but I don’t think that the shortcomings of the human brain are what makes us such.

So I can’t answer that for you. I certainly want it though. I’m quite open about my goals – I’d happily get merged/augmented with a computer to enhance my thinking and analytical abilities.

While at the moment we don’t even remotely understand what such merging or “mind uploading” would entail exactly, I can already clearly delegate some tasks, like long-term memory, information lookup, and data processing, to a computer. Computers already handle them really well.

What about those people who have perfect recall and wish they didn’t?

Sure, maybe it sucks. At the moment though, my recall is far from perfect, and this only annoys me. I want to have a choice at least, and digital tools give me this choice.

Would it suit me?

Probably, at least to some extent.

First, our lives are different, so our APIs might be different too. This is more of a demonstration of what I’m using, although I did spend effort towards making it as modular and extensible as possible, so other people could use it too. It’s easy to modify the code, add extra methods and modules. You can even keep all your modifications private.

But after all, we all share many similar activities and use the same products, so there is a huge overlap. I’m not sure how far we can stretch it and keep modules generic enough to be used by multiple people. But let’s give it a try, perhaps? :)

Second, interacting with your data through code is the central idea of the project. That kind of cuts off people without technical skills, and even many people capable of coding who dislike the idea of writing code outside of work.

It might be possible to expose some no-code interfaces, but I still feel that wouldn’t be enough.

I’m not sure whether it’s a solvable problem at this point, but happy to hear any suggestions!

What isn’t it?

  • It’s not vaporware

    The project is a little crude, but it’s real and working. I’ve been using it for a long time now, and find it fairly sustainable to keep using for the foreseeable future.

  • It’s not going to be another silo

    While I don’t have anything against commercial use (and I believe any work in this area will benefit all of us), I’m not planning to build a product out of it.

    I really hope it can grow into or inspire some mature open source system.

    Please take my ideas and code and build something cool from it!

HPI Repositories

One of HPI’s core goals is to be as extensible as possible. The goal here isn’t to become a monorepo that supports every possible data source/website to the point where it isn’t maintainable anymore, but hopefully you get a few modules ‘for free’.

If you want to write modules for personal use but don’t want to merge them into here, you’re free to maintain them locally in a separate directory to avoid any merge conflicts. Entire HPI repositories can even be published separately and installed into the single my python package (for more info on this, see MODULE_DESIGN).

Other HPI Repositories:

If you want to create your own repository with modules, or override something here, you can use the template.

Related links

Similar projects:

Other links:

Open to any feedback and thoughts!

Also, don’t hesitate to raise an issue, or reach out to me personally, if you want to try using it and find the instructions confusing. Your questions will help me make it simpler!

In some near future I will write more about:

  • specific technical decisions and patterns
  • challenges I had to solve
  • more use-cases and demos – it’s impossible to fit everything in one post!

But I’m happy to answer any questions on these topics now!

hpi's People

Contributors

alaq, almereyda, ddrone, jhermann, karlicoss, kianmeng, mreishus, rosano, seanbreckenridge, thetomcraig


hpi's Issues

module suggestion: firefox history/bookmarks/etc.

I think a module for accessing Firefox data would be very useful.
This documentation on the Mozilla website details how the data is stored.

I am thinking that a separate script could be used to walk through that database and generate a JSON dump (similarly to how rexport works), and then an HPI module could provide access to that data.

I'm new to this project so I am not yet really familiar with how modules work, but when I have time I will attempt it and submit a PR.
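
For what it's worth, a rough sketch of such a dump script (tables and columns follow the places.sqlite schema from the Mozilla documentation):

import json
import sqlite3

def dump_history(places_db: str) -> None:
    # open read-only so a running Firefox isn't disturbed
    conn = sqlite3.connect(f'file:{places_db}?immutable=1', uri=True)
    rows = conn.execute('''
        SELECT p.url, p.title, v.visit_date
        FROM moz_places p
        JOIN moz_historyvisits v ON v.place_id = p.id
    ''')
    # visit_date is in microseconds since the epoch
    print(json.dumps([
        {'url': url, 'title': title, 'visit_date_us': visit_date}
        for url, title, visit_date in rows
    ]))

dump_history('places.sqlite')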

Setting up HPI - config file

Hi, I'm struggling to install orger, for which I need HPI :)

I read:

After installing HPI, run hpi config create.

This will create an empty config file for you (usually, in ~/.config/my), which you can edit. Example configuration:

import pytz # yes, you can use any Python stuff in the config

class emfit:
    export_path = '/data/exports/emfit'
    tz = pytz.timezone('Europe/London')
    excluded_sids = []
    cache_path  = '/tmp/emfit.cache'

class instapaper:
    export_path = '/data/exports/instapaper'

class roamresearch:
    export_path = '/data/exports/roamresearch'
    username    = 'karlicoss'

(1) I have no my/config file under ~/.config
(2) Is the config file /Users/user/HPI/my/config.py?
(3) What exactly should I add in here? I don't see where to include my personal user data to access hypothesis.

Thanks so much!

Make logs configurable from CLI/mention HPI_LOGS

This is currently nowhere in the docs -- could probably just add a --debug flag here on the top-level click group which modifies HPI_LOGS so that when mklevel is called from elsewhere it's set to logging.DEBUG

Should also add something to the docs/setup that describes HPI_LOGS in case someone is trying to debug cachew

Will create a PR if that all sounds fine

Think about abandoning timezone abbreviations map in my.core.time

With the DST shift in the UK, BST fails to parse, because the current version of the provider depends on 'current time'.

pytz.timezone(x).localize(datetime.now()).tzname(): pytz.timezone(x)

While it's possible to get rid of this, TZ abbreviations are inherently ambiguous, so there is no way to guess this correctly without user overrides.

Perhaps the best option is to just treat them as local time, and rely on the location provider (#96)

mypy check fails in hpi config check: FileNotFoundError

Does hpi config check work in a virtual environment? I get an error when running hpi config check or checking a module, e.g. hpi doctor my.hypothesis:

» hpi config check                       
✅ import order: ['/Users/jussi/me/hpi/my']
✅ config file : /Users/jussi/me/hpi.config/my/config/__init__.py
✅ syntax check: /Users/jussi/me/hpi/bin/python3.9 -m compileall /Users/jhu/projects/mememo/hpi.config/my/config/__init__.py
❌ mypy check: failed
   Traceback (most recent call last):
     File "/usr/local/Cellar/[email protected]/3.9.9/Frameworks/Python.framework/Versions/3.9/lib/python3.9/runpy.py", line 197, in _run_module_as_main
...
...
   FileNotFoundError: [Errno 2] No such file or directory: '.mypy_cache/3.9/my/config/__init__.data.json'

I think this is not a mypy error because running mypy directly works:

» MYPYPATH=../hpi.config mypy --namespace-packages -p my.config 
Success: no issues found in 1 source file

I suspect the issue is related to how mypy is run using subprocess.run():
https://github.com/karlicoss/HPI/blob/master/my/core/__main__.py#L47

Maybe this method changes the current working directory unlike running mypy directly on command line?

If I patch my.core.__main__ to print the directory just before the failing run(), it points to a temp directory (I'm on macOS, using venv for HPI):

    print("My path", os.path.abspath(".mypy_cache"))
    # My path /private/var/folders/j0/bc42cwsn7b59s623gyrtywsw0000gn/T/hpi_temp_dir/.mypy_cache

The mentioned .../hpi_temp_dir/.mypy_cache/... exists, but the erroring file __init__.data.json is not found there. Instead, such a file is in the console's current working dir, under the .mypy_cache/... path.

Does hpi config check work in a virtual environment?

This issue doesn't prevent usage. Also, feel free to close this issue if you think this doesn't belong here (but maybe on https://github.com/python/mypy/issues). However, I like the HPI's idea of being able to check the configuration programmatically. It's a cool feature especially for new users trying to set up the (cool) framework.

Use click for arguments?

As mentioned in #138, we may want to switch to click for argument parsing, as it's more composable and offers lots of features that the stdlib doesn't. It also handles printing colored strings if colorama is installed, prompting/confirmation in a nice way, along with other things like progress bars if you want to mess with that.

Since this is decorator based, also means a user has the possibility to extend the CLI by adding a hook somewhere

Once #138 has been merged, I'd be willing to write an initial PR for this, as I've been using click consistently for a few years now.

Error "KeyError: 'date'" when testing my.pdfs.

I'm seeing this error when trying to test my.pdfs on a local PDF file. Here's the full traceback:

>>> pdfs.get_annots(Path('./Ousterhout_2018_A philosophy of software design.pdf'))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/tomcraig/.pyenv/versions/3.7.0/envs/hpi/lib/python3.7/site-packages/my/pdfs.py", line 94, in get_annots
    return [as_annotation(raw_ann=a, path=str(p)) for a in annots]
  File "/Users/tomcraig/.pyenv/versions/3.7.0/envs/hpi/lib/python3.7/site-packages/my/pdfs.py", line 94, in <listcomp>
    return [as_annotation(raw_ann=a, path=str(p)) for a in annots]
  File "/Users/tomcraig/.pyenv/versions/3.7.0/envs/hpi/lib/python3.7/site-packages/my/pdfs.py", line 61, in as_annotation
    dates = d['date']
KeyError: 'date'

It appears that as_annotation expects its input raw_ann to have a date key, which it does not have. I put in a print line and found that this is the value of raw_ann, coming from my input file:

{
    'page': <my.config.repos.pdfannots.pdfannots.Page object at 0x1113cd780>,
    'tagname': 'Highlight',
    'contents': None,
    'rect': [
        72, 334.555, 539.829, 420.955
    ],
    'author': 'Tom Craig',
    'text': 'In\t an\t ideal\t world,\t each\t module\t would\t be\t completely independent\tof\tthe\tothers:\ta\tdeveloper\tcould\twork\tin\tany\tof\tthe\tmodules\twithout knowing\tanything\tabout\tany\tof\tthe\tother\tmodules.\tIn\tthis\tworld,\tthe\tcomplexity of\ta\tsystem\twould\tbe\tthe\tcomplexity\tof\tits\tworst\tmodule. Unfortunately,\t this\t ideal\t is\t not\t achievable ',
    'boxes': [
        (220.556, 406.555, 539.753, 420.955),
        (72, 389.275, 539.829, 403.675),
        (72, 371.995, 539.741, 386.395),
        (72, 354.715, 400.028, 369.115),
        (93.6, 334.555, 340.65, 348.955)
    ]
}

EDIT: Looks like the Annotation object can be created successfully without the date attribute present; I just confirmed this by changing the dict access to use .get(). If there are no downstream issues I'm unaware of, this should be fine. I made a PR here: #68
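
The fix amounts to replacing the dict access in as_annotation along these lines (illustrative, not the exact diff from the PR):

dates = d.get('date')  # None instead of a KeyError when the annotation has no date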

my.media.youtube returning []

So I've set up HPI on Windows WSL2 Ubuntu 20.04.
But for the life of me I can't seem to get my.media.youtube to work

python3 -c 'import my.media.youtube as yt; print(yt.watched())'
Just returns []

Also:
python3 -c 'import my.media.youtube as yt; print(yt.stats())'
{'watched': {'count': 0, 'warning': 'THE ITERABLE RETURNED NO DATA'}}

python3 -c 'import my.google.takeout.paths as tk; print(tk.get_last_takeout())' does return the takeout zip file.
And that file does have \Takeout\My Activity\YouTube\My Activity.html

hpi doctor my.google.takeout.paths
Comes back all green

but
hpi modules --all lists [disabled: marked explicitly (via __NOT_HPI_MODULE__)] for my.google.takeout.paths

I'm not sure where to go from here!
Any ideas what I could try next? 😎

my.pdfs error; updates from pdfannots

pdfannots has made quite a lot of updates in the past few months, including Annotation and Document classes which replace the previous return values of process_file, and some of the kwargs.

Currently it just throws an error since the kwargs don't match the previous interface, but it also returns different data now:

$ hpi query my.pdfs.annotations | jq '.[0]'
{
  "error": "Exception: process_file() got an unexpected keyword argument 'emit_progress'\n"
}

I would try and update this myself, but I'm not very used to working with annotations/pdfs and don't want to remove something which may be useful.

Also, even though there is an Annotation class there, I think we should keep to a NT/dataclass-based one for cachew reasons. In addition, the fields there are nullable, which makes things a bit annoying.

This is why #179 is currently failing (also a reminder that the 3.6 macOS CI is still broken here; I fixed it in that PR, but you'd probably have to do it again before that's merged in; will merge on top of it to get the changes)

Fix spelling mistake for my.coding.commits

Has been mentioned in other issues, but thought I'd just create an issue to track this:

In lots of places in: https://github.com/karlicoss/HPI/blob/master/my/coding/commits.py#L60

commited should be spelled as committed

Comments from #76 (comment)

(me)

Perhaps a @property wrapper?

(karlicoss)

Yep, I think this is the way to go with such things. This particular module is probably used by very few people, but I guess best not to break it unless absolutely necessary.

Also a deprecation warning would be nice, I was thinking of using something like tantale/deprecated library, or just vendorizing a small, simple decorator to avoid an extra core dependency.

Possible feature: Parse binary data using Kaitai Struct

Parsing binary files has historically been a challenge for users who want access to the data contained within them, but I’ve recently come across (and successfully used) Kaitai Struct to do exactly that.

While of course HPI users can write their own binary parser, output JSON or some other human-friendly format, and then write an importer for that output, Kaitai has a library of parsable files already as well as a repo that accepts pull requests to add to that library.

Adding this as a suggestion, though it’s more of a way to open the discussion to HPI devs and community
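
For illustration, using a Kaitai-generated parser from Python looks roughly like this (my_format is a hypothetical module generated by kaitai-struct-compiler from a .ksy spec; it requires the kaitaistruct runtime):

from my_format import MyFormat  # hypothetical generated parser class

data = MyFormat.from_file('/path/to/export.bin')
# fields declared in the .ksy spec become attributes on the parsed object
print(data.header)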

Add core.query/serialize?

Not sure if this is something you'd be interested in having in the core here/in utils; these are the leftover helper modules I created while maintaining my fork.

As I was writing HPI_API I added a core.serialize and core.query to my HPI fork as well.

It is quite magical (as it just resolves the function name with a string) but it lets me do some simple queries pretty easily, and play around with pipelines in the shell without having to worry about how to interop with python/dumping something from the REPL

https://github.com/seanbreckenridge/HPI/blob/master/my/utils/query.py
https://github.com/seanbreckenridge/HPI/blob/master/my/utils/serialize.py

and then a script that exposes that info:

https://github.com/seanbreckenridge/HPI/blob/master/scripts/hpi_query

As some examples, 5 songs I listened to recently:

$ hpi_query my.mpv history | jq -r '.[] | .path' | grep -i 'music' | head -n 5
/home/sean/Music/Radiohead/1994 - The Bends/11 - Sulk.mp3
/home/sean/Music/Radiohead/1994 - The Bends/02 - The Bends.mp3
/home/sean/Music/Nujabes/Nujabes - Metaphorical Music (2003) [V0]/10. Next View.mp3
/home/sean/Music/Earth, Wind & Fire/Earth Wind And Fire - Greatest Hits - [MP3-V0]/16 - After The Love Has Gone.mp3
/home/sean/Music/Darren Korb/Darren Korb - Transistor Original Soundtrack[2013] (V0)/14 - Darren Korb - Apex Beat.mp3

I also use this in my menu bar, to print how many calories I've consumed / how much water I've drunk today:


Like:

#!/bin/bash
# how much water I've had today

((BLOCK_BUTTON == 3)) && notify-send "$("${REPOS}/HPI/scripts/water-recent")"

HPI_QUERY="${REPOS}/HPI/scripts/hpi_query"
{
	"${HPI_QUERY}" --days 1 'my.food' 'water' | jq -r '.[] | .glasses'
	echo 0
} | datamash sum 1

Have even had some fun creating graphs like this in the terminal:

hpi_query my.food food | jq -r '.[] | "\(.on)\t\(.calories)"' | datamash groupby 1 sum 2 | sort | termgraph | head

2020/09/26: ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 1380.00
2020/09/27: ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 1150.00
2020/09/28: ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 1155.00
2020/09/29: ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 2200.00
2020/09/30: ▇▇▇▇▇▇▇▇▇▇▇ 870.00
2020/10/01: ▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 1070.00
2020/10/02: ▇▇▇▇▇▇ 505.00
2020/10/03: ▇▇▇▇▇▇▇▇▇▇▇▇▇ 995.00
2020/10/04: ▇▇▇▇▇▇▇▇ 640.00

could probably remove the click/simplejson dependencies; they were just thrown in there because it was fast and I always have those installed anyway

Maybe query could be a subcommand of the hpi script, instead of installing it as a separate one?

Social Media - Aggregate Updates

I want to have daily updates aggregated from all the creators I am following on different social networks, including Youtube and Patreon. How can HPI help me to achieve this?

hpi install/update command

Had an idea to make the install process easier, or at least automate some things; thought I'd leave it here. Personally I feel the most convenient way to use HPI is through local editable installs, especially if someone is installing multiple repositories (e.g. yours and mine), but I can imagine that setting that up, and managing updates by git pulling, can feel tedious for some users

So, this could add (one or two commands, or perhaps behind a click.group) to help manage that

Of course this isn't required, but for people who aren't as familiar with pip, it would automate some of the process.

Would define a directory like ~/.local/share/HPI where the repos sit and then running something like:

hpi repo install karlicoss/HPI clones your repo and pip install -es it
hpi repo install seanbreckenridge/HPI installs mine into another directory there (could manage name conflicts and the like)

hpi repo install could also probably just default to karlicoss/HPI

This would, for the most part, just subprocess the pip/git commands.

And then, running hpi repo update would cd to those and git pull, printing updates to your console with git commit info

It does seem a bit like a monorepo command/plugin system, like oh-my-zsh or doom or something -- not advocating for that, this is just to manage the editable installs. I guess they have the update commands for a reason: it would decrease the barrier to entry by not requiring someone to manage the editable installs or git themselves

This would also be accessible by just doing pip install HPI (which would then be uninstalled and the editable version reinstalled, which is fine if we're just exec-ing a command like pip uninstall -y HPI && pip install -e /path/to/local/clone), and would let the user maintain local directories instead of the PyPI-installed HPI. The instructions for an editable install could then just be:

pip install HPI
hpi repo install

and then periodically

hpi repo update

which could loop over my.__path__, checking if a .git directory exists relative to it to update by git pulling

In [1]: import my

In [2]: my.__path__
Out[2]: _NamespacePath(['/home/sean/Repos/HPI/my', '/home/sean/Repos/HPI-fork/my'])
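
A sketch of that update loop:

import subprocess
from pathlib import Path

import my

for p in my.__path__:
    repo = Path(p).parent  # the my/ package dir sits inside the cloned repo
    if (repo / '.git').exists():
        subprocess.run(['git', '-C', str(repo), 'pull'], check=False)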

Can work on this if it's something you're interested in adding

NotImplementedError: cannot instantiate 'CPath' on your system

I get this error when importing my.reddit on Windows:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "c:\users\migue\documents\github\hpi\my\reddit.py", line 119, in <module>
    def _get_events(backups: Sequence[Path]=get_sources(), parallel: bool=True) -> Iterator[Event]:
  File "c:\users\migue\documents\github\hpi\my\reddit.py", line 20, in get_sources
    res = list(map(CPath, files)); assert len(res) > 0
  File "C:\Users\migue\Anaconda3\lib\pathlib.py", line 1013, in __new__
    % (cls.__name__,))
NotImplementedError: cannot instantiate 'CPath' on your system

Thoughts on adding email?

A way to sync email locally and search it nicely would be nice, since there are many things you get emails for that would be hard to get notifications for otherwise, purchases for instance.

In general I feel it would be really easy to hack stuff together if you have a nice API to work with your emails

module: Signal support

Hi,

I'd love to see this project support Signal database backups. These cannot be created through the Signal app, but it is possible to extract the DB from a desktop installation. I would be happy to provide the file formats and a couple rows of my database backups if that would be helpful.

Does RSS support only work with Feedbin and Feedly?

Hi, I'd like to keep track of links that I've submitted to Reddit and Hacker News, and I figured I could easily automate that using their RSS feeds. However, I looked at the rss code, and it seems like it only supports Feedly and Feedbin. Is that correct?

HPI local installation caches Reddit exported data and does not refresh

I am experimenting with HPI as I was looking for a system that would allow me to create a repository of my digital traces: cool stuff.

I've installed HPI as per the local/editable option.

I'm testing it with Reddit.
I've configured the path to the Reddit export file in $HOME/.config/my/my/__init__.py by adding:

export_path = "/home/ubuntu/hpi/reddit/*.json"

Rexport is using the information in secret.py to dump the Reddit data:
python3 -m rexport.export --secrets $HOME/git/rexport/secrets.py > ./reddit/"export-$(date -I).json"

This piece of code I found in the documentation should report the 4 subreddits with the most saved posts:

import my.reddit.all
from collections import Counter
print(Counter(s.subreddit for s in my.reddit.all.saved()).most_common(4))

But what happens is that the information processed by my.reddit gets cached in $HOME/.cache and does not update when I rerun the rexport script.

ubuntu@MARS:~/.cache/my$ ls -la
-rw-r--r-- 1 ubuntu ubuntu 1433600 Nov 21 15:35 my.reddit.rexport:comments
-rw-r--r-- 1 ubuntu ubuntu 1400832 Nov 21 15:34 my.reddit.rexport:saved
-rw-r--r-- 1 ubuntu ubuntu   94208 Nov 21 15:35 my.reddit.rexport:submissions
-rw-r--r-- 1 ubuntu ubuntu  561152 Nov 21 15:35 my.reddit.rexport:upvoted

To see the refreshed dump I must first delete the cached files.

What am I missing?

Thanks
s.

Allow spinning up a web server and JSON api?

This is more of a fun demonstration of what it's capable of, but it could also help with integrating with other programming languages.

Should be fairly easy and with almost no boilerplate because namedtuples/dataclass map nicely into JSON.
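
A minimal sketch with the stdlib HTTP server, assuming the provider yields namedtuples (the endpoint and data source are chosen arbitrarily):

import json
from http.server import BaseHTTPRequestHandler, HTTPServer

from my.lastfm import scrobbles  # any provider returning namedtuples would work

class Handler(BaseHTTPRequestHandler):
    def do_GET(self):
        # namedtuple -> dict -> JSON, with str() as a fallback for datetimes
        body = json.dumps([s._asdict() for s in scrobbles()], default=str).encode()
        self.send_response(200)
        self.send_header('Content-Type', 'application/json')
        self.end_headers()
        self.wfile.write(body)

HTTPServer(('127.0.0.1', 8000), Handler).serve_forever()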

Document config vars to be set for different modules

I have been starting to look at the repo and the project seems really cool, but I am having major problems with configuring things. For example, in the reddit module: I followed the instructions by searching for where the config was needed, and saw that you need to specify the export_path and also a my.config.repos object. This is not documented, and it's quite unclear what this module does. I think I am not the only one with this opinion, and it could be useful to document these types of configs.

Thanks for your work :)

Migrating module dependencies to proper PIP packages

Just figured I should document this.. At first, I tried to keep data access layers as minimal as possible (i.e. ideally in a single file), but it really seems to cause more trouble than convenience:

  • somewhat annoying to keep track of that in the config
  • dependencies need to be installed separately via requirements.txt
  • non-transparent to static analysis, and to check it with mypy, still need a proper environment

With a proper setup.py:

  • one can simply pip install the package straight from the github repo url, without cloning anything into a temporary location
  • you can use virtualenv, if you prefer, to avoid mixing HPI dependencies with the rest of your packages
  • with an --editable install, you can develop as if it was a symlink
  • and you can still manually symlink the code if you prefer, for some reason

Basically, the only downside is maintaining setup.py. I keep it very minimal: merely the package name, a py.typed file for mypy, and the dependencies, since I'm not planning to upload to PyPI (and no one really looks at the classifiers/reads documentation on PyPI anyway).
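
A sketch of such a minimal setup.py (the package name and dependency are invented for the example):

from setuptools import setup

setup(
    name='someexport',
    packages=['someexport'],
    package_data={'someexport': ['py.typed']},  # lets mypy pick up the annotations
    install_requires=['requests'],
)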

UPD: it also seems that for proper mypy integration it's necessary to have __init__.py (see my comment here https://github.com/karlicoss/endoexport/blob/be5084aa45aaac206ff86624244f40c08b439340/src/endoexport/__init__.py#L1-L5 ). If anyone knows how to get around this, please let me know!

Related discussions:

Migrated:

Tagging @seanbreckenridge as you were interested in that too, let me know if you have some thoughts!

Support different error handling policies

As described here.

E.g. might make sense to implement:

  • fail fast, i.e. throw instead of yielding exceptions
  • fully defensive, i.e. ignore exceptions instead of yielding

That would require cooperation from the underlying data providers, or we could decorate them or something. Perhaps it makes more sense to do it on the HPI level.
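
One possible shape for such policies, as iterator adapters over yielded values/exceptions (a sketch, not an existing HPI API):

from typing import Iterator, TypeVar, Union

T = TypeVar('T')

def raise_exceptions(it: Iterator[Union[T, Exception]]) -> Iterator[T]:
    # 'fail fast': re-raise instead of yielding exceptions
    for x in it:
        if isinstance(x, Exception):
            raise x
        yield x

def drop_exceptions(it: Iterator[Union[T, Exception]]) -> Iterator[T]:
    # 'fully defensive': silently skip exceptional values
    for x in it:
        if not isinstance(x, Exception):
            yield x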

Figure out where to get 'fake data'

It would be nice to have a public repository of raw data from different services, so it would be easy to test HPI and demonstrate without having to give up your own data. Does such a thing exist?

P.S. Maybe this issue rather belongs here, and I'll transfer it.

configuring `all` modules: catching AttributeErrors on missing blocks?

Something I've been thinking about for a while now -- currently in the process of splitting my mail module to have an all.py file

So previously there was no mbox class; it was just the nested:

class mail:
    class imap:
        mailboxes = []
    # would be added for new module:
    # class mbox:
    #   mailboxes = []

So, the all.py imports from both imap.py and the new mbox.py: https://github.com/seanbreckenridge/HPI/blob/ffbae2767b8c11e2b093b8cf28a941b0a8dfa4f6/my/mail/all.py

Then, when running the my.mail.all doctor, which imports from both:

✅ OK  : my.mail.all                                       
❗      - stats:                      computing failed
   Traceback (most recent call last):
     File "/home/sean/Repos/HPI-karlicoss/my/core/__main__.py", line 267, in modules_check
       res = stats()
     File "/home/sean/Repos/HPI/my/mail/all.py", line 45, in stats
       return {**stat(mail)}
     File "/home/sean/Repos/HPI-karlicoss/my/core/common.py", line 454, in stat
       res = _stat_iterable(fr)
     File "/home/sean/Repos/HPI-karlicoss/my/core/common.py", line 486, in _stat_iterable
       count = ilen(eit)
     File "/usr/lib/python3.10/site-packages/more_itertools/more.py", line 496, in ilen
       deque(zip(iterable, counter), maxlen=0)
     File "/home/sean/Repos/HPI-karlicoss/my/core/common.py", line 469, in funcit
       for x in it:
     File "/home/sean/Repos/HPI/my/mail/all.py", line 34, in mail
       yield from unique_mail(
     File "/home/sean/Repos/HPI/my/mail/common.py", line 158, in unique_mail
       yield from unique_everseen(
     File "/usr/lib/python3.10/site-packages/more_itertools/recipes.py", line 413, in unique_everseen
       for element in iterable:
     File "/home/sean/Repos/HPI-karlicoss/my/core/source.py", line 45, in wrapper
       res = factory_func(*args, **kwargs)
     File "/home/sean/Repos/HPI/my/mail/all.py", line 27, in _mail_mbox
       from . import mbox
     File "/home/sean/Repos/HPI/my/mail/mbox.py", line 24, in <module>
       class config(user_config.mbox):
   AttributeError: type object 'mail' has no attribute 'mbox'

That imports like:

from my.config import mail as user_config

class config(user_config.mbox)

which is what causes the error, since the user doesn't have an mbox defined on their class mail in their config.

They may not have set it up yet, or they may just not want to use it

In the latter case, there's no way to configure this without modifying the all.py file and removing the source. That is a solution, but I think it would be nice to catch this AttributeError and send a warning to add the module to their disabled_modules or something instead?

That way it doesn't fatally error when they're using the all.py -- the so-called entrypoint to the whole module

This only happens when there are no additional dependencies for a module. If there were, import_source would correctly catch the import error and return the default

I guess at the top of mbox could do:

user_config = getattr(user_config, "mbox", object)

but that feels not mypy-friendly and pretty hacky

Can create a PR if catching the AttributeError sounds good -- would probably try to match it with a regex more specifically, to prevent catching unrelated errors
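
A sketch of what catching it could look like in all.py (the warning text and the matching are illustrative):

import warnings

def _mail_mbox():
    try:
        from . import mbox  # the AttributeError propagates from mbox.py's config class
    except AttributeError as e:
        if "'mbox'" not in str(e):
            raise  # unrelated AttributeError, don't swallow it
        warnings.warn("my.config.mail has no 'mbox' block; add my.mail.mbox to disabled_modules to silence this")
        return iter(())
    return mbox.mail()  # or whatever the source module exposes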

docs: add instructions on how to setup google_takeout_parser

Just opening this here so I don't forget to add instructions/examples to the modules, since people will likely look for instructions here when setting up the module for promnesia

Likely also include some info on whether or not things can be zipped, the expected structure, and a link to the google_takeout_parser docs describing what to include in the takeout

Discussion: extending base HPI/overlays/overrides

Related issues: #12, #46; but I think worth a separate discussion.

From my experience, it's pretty hard to predict how other people want to use their data:

  • you might miss some attributes they care about
  • some people want to be more paranoid or more defensive (e.g. timezone handling/None safety/etc)
  • they might want to do some extra filtering
  • they might want to merge in extra data sources or suppress existing ones

The list is endless! So it would be nice if it was possible to easily override small bits of HPI modules/mix in other data, etc.

The main goals are:

  • low effort: ideally it should be a matter of a few lines of code to override something.
  • good interop: e.g. ability to keep with the upstream, use modules coming from separate repositories, etc.
  • ideally mypy friendly. This kind of means 'not too dynamic and magical', which is ultimately a good thing even if you don't care about mypy.

Once again, I see Emacs as a good role model. Everything is really decentralized, you have some core library, you have certain patterns that everyone follows... but apart from that the modules are mostly independent.
Many people still use 'monolith' base configurations (e.g. Doom/Spacemacs), because it's kinda convenient, as long as you have a maintainer. Arguably this is what this repository is at the moment, although it's obviously not as popular as Emacs distributions.

Emacs fits these goals well:

  • low effort: the simplest way to configure something is to override a variable in your config (thanks to dynamic scope, it 'just works').
    You can even literally override whole functions as a means of quickly getting the behaviour you want.
  • good interop: yes, unless the developer broke some APIs, usually you can safely update the upstream module.

How to achieve this within HPI:

For combining independent modules together (say, something like my/youtube.py and my/vimeo.py coming from different repositories), the easiest is to use:

  • symlinks (at least if you have just a few files/directories to mix in)
  • namespace packages (more on them later)

Now, the tricky case is when you want to partially override something.
The first option is: fork & apply your modifications on top. For example: https://github.com/seanbreckenridge/HPI

  • effort: very straightforward
  • interop: merging with the upstream is a bit manual, but if you use atomic commits & interactive rebase/cherry-pick, it should be manageable
  • at least not any more magical than the original repository

Not sure if there is much to discuss here, so straight to the second, more flexible option.

Once again, we rely on namespace packages! I'll just explain on a couple of examples, since it's easier.

  • example: mixing in a data source

    The main idea is that you can override all.py (also some discussion here), and remove/add extra data sources.
    Since all.py is tiny, it's not a big problem to just copy/paste it and apply your changes.

    Some existing modules implemented with this approach in mind:

    (I still haven't settled on the naming. all and main both kind of make sense as the entry point)

  • example: my.calendar.holidays

    As you can guess, this module is responsible for flagging days as holidays, by exposing an is_holiday function.
    As a reasonable default, it just uses the user's country of residence and flags national holidays.
    However, you might also want to mix in your work vacations; this is harder to make uniform for everyone, so it's a good candidate for a custom user override:

    import my.orig.my.calendar.holidays as M
    from   my.orig.my.calendar.holidays import *  # re-export everything else intact
    
    is_holiday_orig = M.is_holiday  # keep a reference to the original
    def is_holiday(d: DateIsh) -> bool:
        # if it's a public holiday, definitely a holiday
        if is_holiday_orig(d):
            return True
        # then check private data for days off work
        if is_day_off_work(d):
            return True
        return False
    M.is_holiday = is_holiday  # monkey patch the original module

    Thanks to namespace packages, when I import my.calendar.holidays it will hit my override first, monkey patch the is_holiday function, and expose the rest intact due to import *.
    For example, hpi doctor my.calendar.holidays will run against the override, reusing the stats function or any other original functions.

    My personal HPI override has more example code, and I'll gradually move some stuff from this repository there as well
    (for example most things in my.body don't make much sense for other people).

Things I'm not sure about with this approach:

  • To import the 'original' module and monkey patch it, you need some alternative way of referencing it.
    • for now, I'm using a symlink (/code/hpi-overlay/src/my/orig -> /code/hpi/src/my)

      This is simple enough, but maintaining the symlink manually and referencing the 'original' package through my.orig is.. meh.
      Also not sure what to do if there are multiple overrides, e.g. 'chaining' them (although this is probably a bit extreme).

    • it's probably possible to do something hacky and dynamic. E.g. take __path__, remove the first entry (which would be the 'override'), and then use importlib to import the 'original' module (see the sketch after this list).

      The downside is that it's gonna be unfriendly to mypy (and generally a bit too magical?).

    • another option is to have some sort of dynamic 'hook', which is imported before anything else.

      In the hook code, you import the original module and monkey patch. Same downsides, a bit too dynamic and not mypy friendly, but possible.
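
For what it's worth, the dynamic variant could look roughly like this (a sketch using the paths from the symlink example above; it sidesteps the normal import machinery, so relative imports inside the loaded module may misbehave):

import importlib.util

def import_original(name: str, path: str):
    # load a module from an explicit file path, bypassing the namespace
    # package resolution order (which would hit the override first)
    spec = importlib.util.spec_from_file_location(name, path)
    assert spec is not None and spec.loader is not None
    module = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(module)
    return module

# in the override my/calendar/holidays.py:
M = import_original('my.calendar.holidays', '/code/hpi/my/calendar/holidays.py')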

Caveats I know of:

  • packages can't contain __init__.py, otherwise the whole namespace package thing doesn't work

  • you need to be careful about the namespace package resolution order. It seems that the last installed package will be the last in the import order.

    • so you'd need to run pip install -e /path/to/override and then pip install -e /path/to/original (even if it's already installed).

    • another option is to reorder stuff in ~/.local/lib/python3.x/site-packages/easy-install.pth manually, but it's not very robust either (although at least it clearly shows the order)
      hpi doctor my.module displays some helpful info, but it's still easy to forget/mess it up by accident.

       $ hpi doctor my.calendar.holidays  
       ✅ import order: ['/code/hpi-overlay/src/my', '/code/hpi/my']
      
  • import * doesn't import functions that start with an underscore ('private').

    Possible to get around this dynamically, but would be nice to cooperate with mypy somehow..
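
    E.g. any private helper the override relies on would need an explicit extra import:

    from my.orig.my.calendar.holidays import *        # public names only
    from my.orig.my.calendar.holidays import _cached  # hypothetical private helper; must be named explicitly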

Happy to hear suggestions and thoughts on that. Once there's been some discussion, I'll move this to doc/, perhaps.


TODOS:

  • also thought that it should be possible to reuse the configuration in ~/.config/my as the 'default' overlay. In fact, treating it like a proper namespace package (at the moment it's a bit of dynamic hackery) might make everything even cleaner and simpler.
  • find some good tutorial on monkey patching and link it? Wouldn't want to duplicate the effort..
  • add some examples of motivation for overrides, just for documentation purposes
  • update docs here https://github.com/karlicoss/HPI/blob/master/doc/SETUP.org#addingmodifying-modules

Google Location data accessible through kml request

Hi! I'm working on setting this up for my purposes and reading over the code. I wanted to let you know that I was looking on this site to see if there was a link to an API: https://www.google.com/maps/timeline?t=0&authuser=0&pb=!1m2!1m1!1s2020-09-18
When I clicked on the gear icon near the bottom, a new window pops up and downloads the location data for the day. That url is this: https://www.google.com/maps/timeline/kml?authuser=0&pb=!1m8!1m3!**1i2020!2i8!3i18**!2m3!1i2020!2i8!3i18

I'm not sure what all those numbers mean, but I was able to make sense of a few of them! I bolded the "meaningful" numbers above - they seem to be the ones starting after the 3rd exclamation point and ending right before the 6th. The first is the year, the second is the month (zero-based, i.e. offset by one), and the last is the day. The following numbers were probably meant to be an "end date" judging by the content of the file, but it seems the endpoint only returns the data of one day.

For instance, the file https://www.google.com/maps/timeline/kml?authuser=0&pb=!1m8!1m3!1i2018!2i1!3i1!2m3!1i2020!2i8!3i18 shows "Location history from 2018-02-01 to 2020-09-18", but the data ends at the start of 2018-02-02. Still, I was able to get data back from my account.
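
Based on that decoding, constructing the URL for a given day might look like this (parameter meanings inferred from the examples above, not documented anywhere official):

from datetime import date

def timeline_kml_url(d: date, authuser: int = 0) -> str:
    # the month appears to be zero-based (offset by one)
    y, m, day = d.year, d.month - 1, d.day
    # the second !1i..!2i..!3i.. block was probably meant as an end date,
    # but the endpoint seems to return a single day regardless
    return (
        "https://www.google.com/maps/timeline/kml"
        f"?authuser={authuser}"
        f"&pb=!1m8!1m3!1i{y}!2i{m}!3i{day}!2m3!1i{y}!2i{m}!3i{day}"
    )

# timeline_kml_url(date(2020, 9, 18)) reproduces the example URL above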

So this definitely could get data from the account. I also downloaded my cookies from that page and tested them using wget, and the file saved with no issues - they must not be checking user-agent (luckily). Anyway, I wanted to make you aware of this. I've cloned this and I'm currently writing a module to make that work - I could submit a pull request when I'm done if you want.

merge rexport, pushshift and gdpr reddit data

Was able to get pushshift as mentioned in the README working to export old comments. Thought I'd mention it here.

It's possible to use pushshift to get data further back, but I'm not sure if it should be part of this project, since some of the older comments don't have the same JSON structure; I can only assume pushshift is getting data from multiple sources the further it goes back. It requires some normalization, like here and here.

The only data I was missing due to the 1000-item limit on rexport queries was comments. It exported the last 1000, but I have about 5000 on reddit in total.

Regarding HPI:

Wrote a simple package to request/save that data, with a DAL (whose PComment NamedTuple has @property attributes similar to rexport's DAL) and a merge function, and now:

In [15]: from my.reddit import comments, _dal, pushshift_comments

In [16]: len(list(_dal().comments())) # from dal.py in rexport
Out[16]: 999

In [17]: len(list(pushshift_comments())) # from pushshift
Out[17]: 4891

In [18]: len(list(comments())) # merged data, using utc_time to remove duplicates
Out[18]: 4893

In [19]: comments
Out[19]: <function my.reddit.comments() -> Iterator[Union[rexport.dal.Comment, pushshift_comment_export.dal.PComment]]>

It's possible that one could write enough @property wrappers to handle the differences in the JSON representations of old pushshift data; unsure if that's something you want to pursue here.
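
For reference, the merge itself can be a one-liner on top of unique_everseen (a sketch; names follow the DALs mentioned above):

from itertools import chain
from more_itertools import unique_everseen

def comments():
    # merge both sources, dropping duplicates by their UTC timestamp
    yield from unique_everseen(
        chain(_dal().comments(), pushshift_comments()),
        key=lambda c: c.utc_time,
    )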

TypeError when importing my.coding.github

When executing the short demo retrieving some github_data, I get
the following TypeError: __init__() missing 1 required positional argument: 'gdpr_dir'
occurring during
import my.coding.github as github_data

Here are the log traces:

In [7]: from orger import Mirror

In [8]: from orger.inorganic import node, link

In [9]: from orger.common import dt_heading

In [10]: import my.coding.github as gd
/usr/local/lib/python3.9/site-packages/HPI-0.2.20201004.dev40-py3.9.egg/my/coding/github.py:3: UserWarning: my.coding.github is deprecated! Please use my.github.all instead!
warnings.warn('my.coding.github is deprecated! Please use my.github.all instead!')

TypeError                                 Traceback (most recent call last)
in <module>
----> 1 import my.coding.github as gd

/usr/local/lib/python3.9/site-packages/HPI-0.2.20201004.dev40-py3.9.egg/my/coding/github.py in <module>
      4 # todo why aren't DeprecationWarning shown by default??
      5
----> 6 from ..github.all import events, get_events
      7
      8 # todo deprecate properly

/usr/local/lib/python3.9/site-packages/HPI-0.2.20201004.dev40-py3.9.egg/my/github/all.py in <module>
      3 """
      4
----> 5 from . import gdpr, ghexport
      6
      7 from .common import merge_events, Results

/usr/local/lib/python3.9/site-packages/HPI-0.2.20201004.dev40-py3.9.egg/my/github/gdpr.py in <module>
     26
     27 from ..core.cfg import make_config
---> 28 config = make_config(github)
     29
     30

/usr/local/lib/python3.9/site-packages/HPI-0.2.20201004.dev40-py3.9.egg/my/core/cfg.py in make_config(cls, migration)
     17     if k in {f.name for f in fields(cls)}
     18 }
---> 19 return cls(**params)  # type: ignore[call-arg]
     20
     21

TypeError: __init__() missing 1 required positional argument: 'gdpr_dir'

In [11]:

location fallback system

Am just posting/ideating this here to see if it's something you're interested in adding, since this is a bit more stateful than HPI typically is (this would include a CLI which saves data to disk based on prompting the user). I thought about making this a separate project, but I think dealing with all this messy data is sort of what HPI is good for, so it would be nice to have an easily extensible interface in here

I've been creating something similar for my where_db CLI, and we already have home locations as a fallback, so I thought it would be nice to formalize it into an API that's easier to extend, like:

my/location/fallback/all.py
my/location/fallback/via_home.py (rename locations/homes.py to here)
my/location/fallback/common.py
my/location/fallback/via_prompt.py
my/location/fallback/via_photos.py
my/location/fallback/via_ip.py

via_photos because I have lots of photos on disk, but I don't actually want to use all of them for location, so this would let me confirm which ones to use (could also maybe improve on photos.py somehow)

via_prompt being manual entry of the namedtuple described below

The NamedTuple would be something like:

from datetime import datetime, timedelta
from typing import NamedTuple, Optional

class FallbackLocation(NamedTuple):
    lat: float
    lon: float
    dt: datetime
    accuracy: Optional[float]
    elevation: Optional[float]
    duration: Optional[timedelta]
    end_dt: Optional[datetime]

    def to_location(self) -> Location:
        return Location(...)

Either duration (how long this location is valid for) or end_dt (when this location is no longer valid) has to be set.

And then all would call some function that prioritizes/picks which source to use, based on the order they're returned in, like:

def estimate_location(dt: datetime) -> FallbackLocation:
    return estimate_from(
        dt,
        fallbacks=(via_prompt, via_photos, via_ip, via_home),
    )

It would call each fallback function and then use the one with the best accuracy (which typically would be entered or estimated through a config block in each module)
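
A sketch of what estimate_from might look like, assuming accuracy is a radius where smaller means more precise (estimate_from and the fallback signature are hypothetical):

from datetime import datetime
from typing import Callable, Iterable, Optional

# each fallback takes a datetime and returns an estimate for it, or None if it has no data
Fallback = Callable[[datetime], Optional[FallbackLocation]]

def estimate_from(dt: datetime, fallbacks: Iterable[Fallback]) -> Optional[FallbackLocation]:
    estimates = [e for f in fallbacks if (e := f(dt)) is not None]
    if not estimates:
        return None
    # smallest accuracy radius wins; missing accuracy is treated as worst
    return min(estimates, key=lambda e: e.accuracy if e.accuracy is not None else float('inf'))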

This would require prompting the user, so could probably just have a main function and run it like python3 -m my.location.fallback.via_prompt? Or could hook it into the CLI, either way.

The general strategy I've thought of:

  • sort all valid locations by some accuracy threshold, and find dates which don't have accurate locations
  • use other sources (IPs, manual entry (via_prompt), homes) as fallbacks for each of those days
  • pick the one with the highest accuracy

The choices the user picks would be cached (so they aren't re-prompted), so this would need to save to a JSON file or something, e.g. to ~/data/locations/via_prompt.json, ~/data/locations/via_ip.json

I'm a bit split between actually saving data and saving a 'sourcemap/transform' of sorts -- I recently did something similar for my scramble_history project, which lets me define keys to look for on any object, and then it defines a mapping file which converts one model into another -- that way we're not actually duplicating data, it's a bit more programmatic, and you can sort of define 'behavior' to transform one piece of data (e.g. IPs) into locations

Not heavily attached to any of these ideas

--

As a sidenote, I had another idea for sorted locations, as we discussed in #237 -- create a custom class which just wraps the iterator; then in the merge in all.py, we can do an isinstance check on the iterator itself to check if it's that class, like

class SortedLocationIterator:
    """Marker class wrapping an iterator of locations already sorted by datetime."""

    def __init__(self, sorted_locations):
        self.location = iter(sorted_locations)

    def __iter__(self):
        return self

    def __next__(self):
        return next(self.location)


In [1]: import test

In [2]: x = test.SortedLocationIterator(range(5))

In [3]: x.__class__
Out[3]: test.SortedLocationIterator

In [4]: next(x)
Out[4]: 0

That way it's still all backwards compatible, and only if all sources are SortedLocationIterator do we do a mergesort of sorts.

This could all be added behind a flag as well, but would really speed up sorting locations

Then in tz.via_location for example, if the all.py return type is a SortedLocationIterator, we know it's been sorted by the user and we don't have to sort it ourselves
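
The merge in all.py could then special-case this, falling back to the current behaviour otherwise (a sketch; heapq.merge does the k-way merge lazily):

import heapq
from itertools import chain

def merge_locations(*sources):
    if all(isinstance(s, SortedLocationIterator) for s in sources):
        # every source is already sorted by datetime: k-way merge is
        # O(n log k) and doesn't materialize everything in memory
        yield from heapq.merge(*sources, key=lambda loc: loc.dt)
    else:
        # old behaviour: collect everything and sort
        yield from sorted(chain(*sources), key=lambda loc: loc.dt)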

Allow additional sources for my.location

Don't expect any changes on this problem any time soon; just creating this issue to track the problem I'm having

Currently I have to overlay time.tz.via_location, since I use a different data source (combined from gpslogger, IPs from Google, Facebook, Discord, etc.)

On this repo, it uses location.google to grab that info. Slightly unrelated, but I've also parsed the takeout using lxml instead, so my structure there is different

I would prefer if there was a common entrypoint (like my.location.all) that could take multiple entrypoints as input, falling back to empty iterators if they aren't enabled or fail to import, as that would localize my overlaid changes to the my.location package

You can see the current structure for my.location here

.
├── all.py
├── gpslogger.py
├── ip.py
└── models.py

I created this following the discussion we had regarding merging pushshift data

I've also slightly modified the Location NT so it can track whether this source was accurate (e.g. Google or gpslogger) or an estimate (geolocation based on IP)

from datetime import datetime
from typing import NamedTuple

class Location(NamedTuple):
    lng: float
    lat: float
    dt: datetime
    # approximate accuracy
    # True means exact, False means it's based on IP/auxiliary info
    accuracy: bool

Am a bit conflicted on how to handle this many data sources...

Would need some modifications, would probably create individual files for:

  • google
  • apple (locations from gdpr export)
  • ip-based (need individual empty fallbacks for blizzard, facebook, discord)
  • gpslogger

Some of those could stay on my branch if you're not interested in having them here; I think it's more important to have the following here:

  • a common.py file, including:
    • a Location NT which all other location providers would convert their NTs/DTs to
    • a merge_locations function, with the typical set/emitted behavior

Then, to enable additional location providers, I could just overlay the all.py file, including my additional imports -- which probably wouldn't ever have to be changed

Something like what was described in the comment here

If there are no issues you foresee here, I'm willing to implement this at some point in the future.

Will probably not touch the location.google file here, except to create a standard interface across all the submodules which all.py would then import from.

Also, unsure if you settled on using all.py or main.py; I tend to prefer all.py for namespace packages which merge multiple data sources

Figure out the 'core' and the module system

I can manage, say, 30 modules I'm using personally.

If there are 200 modules, I will be completely overwhelmed (i.e. see what happened to oh-my-zsh or spacemacs).

I guess I need to figure out the 'core' of the system and a good way to make 'plugins', so you can use third party modules without merging them in the main repository/maintaining a fork.

Python packages kind of work, but modules need to be more lightweight. Ideally you shouldn't need to make a proper Python package from a module, as long as you accept managing dependencies yourself.

How do I set up my.coding.github?

I did the following.

  • pip install --user HPI (python 3)
  • hpi config create

Now inside ~/.config/my, I have:

└── my
    └── config
        ├── __init__.py
        └── __pycache__
            └── __init__.cpython-38.pyc

(The __init__.py is empty and I didn't bother looking at the cache stuff).

Now I want to set up my.coding.github. I went to MODULES.org but didn't see anything about coding.github. Then I tried to read github.py... I saw it pulls in the config, but then it goes a few steps out to .export_dir, inputs(), _dal(), ghexport, and I got lost. How do I configure it?
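
(For context, the config block it expects presumably looks something like this; the field name is guessed from the .export_dir attribute mentioned above, so check the module source for the real one:)

# ~/.config/my/my/config/__init__.py
from pathlib import Path

class github:
    export_dir: Path = Path('~/data/github').expanduser()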

Encoding error on macOS

I am getting an error similar to the error discussed in karlicoss/rexport#5 . I just cloned this repo on macOS 10.14.6 using python 3.7 in a virtual env. I am trying to access the reddit exports using the simple example shown at the top of the HPI README.

Traceback (most recent call last):
  File "/Users/mtvogel/Documents/PythonScripts/HPI/main.py", line 11, in <module>
    for s in my.reddit.saved():
  File "/Users/mtvogel/Documents/PythonScripts/HPI/env/lib/python3.7/site-packages/cachew/__init__.py", line 764, in wrapper
    for chunk in ichunks(datas, n=chunk_by):
  File "/Users/mtvogel/Documents/PythonScripts/HPI/env/lib/python3.7/site-packages/cachew/__init__.py", line 55, in ichunks
    chunk: List[T] = list(islice(it, 0, n))
  File "/Users/mtvogel/Library/Application Support/my/my/config/repos/rexport/dal.py", line 157, in saved
    for s in self._accumulate(what='saved'):
  File "/Users/mtvogel/Library/Application Support/my/my/config/repos/rexport/dal.py", line 139, in _accumulate
    for f, r in self.raw():
  File "/Users/mtvogel/Library/Application Support/my/my/config/repos/rexport/dal.py", line 133, in raw
    yield f, json.load(fo)
  File "/usr/local/Cellar/python/3.7.3/Frameworks/Python.framework/Versions/3.7/lib/python3.7/json/__init__.py", line 293, in load
    return loads(fp.read(),
  File "/Users/mtvogel/Documents/PythonScripts/HPI/env/bin/../lib/python3.7/codecs.py", line 322, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x80 in position 3131: invalid start byte

Ideas: visualize HPI data in a dashboard

For quantified self stuff, some heavy manipulation with pandas is usually necessary (e.g. I'm doing this in my dashboard), so I doubt a general-purpose web dashboard would cover it (unless it supports JS snippets, which would be very helpful!). But it would be cool to have a quick way of overviewing/visualizing/aggregating data in the browser even if it's not perfect.

Quick googling results in:

Both of them rely on some database (e.g. sqlite). While that's a bit inconvenient, it's probably good enough as a first-order approximation. Since I'm extensively using NamedTuples/dataclasses, it's possible to adapt the data automatically without any boilerplate. In addition, cachew already dumps sqlite databases, which can be used as input data.

It would also be cool to have a native app (less hassle + better performance), but I'm not sure how to even start googling for that.

Related: KrauseFx/FxLifeSheet#34 (I think we have similar goals!)
