datamade / how-to

📚 Doing all sorts of things, the DataMade way

License: MIT License

Dockerfile 7.00% Shell 1.99% Python 43.44% CSS 4.66% JavaScript 11.38% HTML 31.36% SCSS 0.16%

how-to's Introduction

how-to

What's this?

Here at DataMade, we do a lot of computer programming. In the spirit of better living through documentation, we're preserving guides to how we do that here.

The guides are listed in alphabetical order and include links to external repository-based documentation.

Contributing

The process for making changes to the DataMade Stack, and by extension this repo, is documented in CONTRIBUTING.md.

Code of Conduct

The Code of Conduct in this repo follows the DataMade Anti-Harassment Policy and Procedures, and is documented in CODE_OF_CONDUCT.md.

how-to's Issues

Evaluate GitHub Actions as a replacement for Travis CI

I just got beta access to GitHub Actions, a repo automation service that includes (among other things) native CI/CD.

We've been thinking about alternatives to Travis for some time now, considering that the financial health of the parent company isn't great. GitHub Actions is a strong contender because it would remove an integration layer between source control and CI/CD.

Set up a GitHub Actions pipeline for testing and deployment on https://github.com/jeancochrane/pytest-flask-sqlalchemy and evaluate its feasibility for DataMade projects. Relevant questions include:

Pricing
Concurrency
Ease of setup
Modularity across different apps
Integration with our deployment providers (CodeDeploy, PyPI, Netlify, potentially Heroku)

Template containerization artifacts for local development

Documentation request

We've arrived at a functional setup for local development with containers. Let's template a Python development environment with an optional Postgres/PostGIS service:

Dockerfile
docker-compose.yml
tests/docker-compose.yml

PostGIS requires a few extra packages in the application image (see committee-oversight) – it might be a good idea to conditionally install them in the templated Dockerfile, since we use it quite frequently.
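One way to picture the conditional install is a small rendering function that only emits the extra `RUN` step when PostGIS is requested. This is a rough sketch, not an agreed template: the `render_dockerfile` helper, the base image tag, and the package names (`gdal-bin`, `libgdal-dev`) are all assumptions for illustration and should be checked against the committee-oversight image.

```python
# Hypothetical package list; confirm the real requirements (for example in
# the committee-oversight Dockerfile) before relying on these names.
POSTGIS_PACKAGES = ["gdal-bin", "libgdal-dev"]


def render_dockerfile(python_version: str = "3.8", use_postgis: bool = False) -> str:
    """Return Dockerfile text, adding GIS system packages only when requested."""
    lines = [
        f"FROM python:{python_version}",
        "",
        "WORKDIR /app",
    ]
    if use_postgis:
        packages = " ".join(POSTGIS_PACKAGES)
        lines += [
            "",
            "# Extra system libraries for apps that use PostGIS",
            "RUN apt-get update && \\",
            f"    apt-get install -y --no-install-recommends {packages} && \\",
            "    rm -rf /var/lib/apt/lists/*",
        ]
    lines += [
        "",
        "COPY requirements.txt /app/requirements.txt",
        "RUN pip install --no-cache-dir -r requirements.txt",
        "",
        "COPY . /app",
    ]
    return "\n".join(lines) + "\n"
```

Calling `render_dockerfile(use_postgis=True)` returns the longer variant, while the default call leaves the system package step out entirely. A templating tool could apply the same switch; plain Python is used here only to keep the sketch dependency-free.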

Adopt standard Python tooling for schema validation

Broken off from #35 (comment). Both @jeancochrane and I have been experimenting with marshmallow.

Complex queries with the Django ORM

Documentation request

We have several Django projects that include inline SQL for complex queries. Sometimes, this is the only option. But there are also downsides to mixing SQL with the Django ORM, not least that inline SQL is brittle to changes in the models. That brittleness has made changes like using OCD models in Councilmatic and adding fields to the BGA import more difficult and far-reaching than they could have been.

The Django ORM supplies functions for complicated querying, including aggregation and window functions. See datamade/bga-payroll#362. I've made some headway refactoring the raw SQL in Payroll to ORM operations, so it's easier to calculate additional summary statistics. Let's use lessons learned from that and other refactors (e.g., Councilmatic) to take a principled stance on when to write raw SQL in Django code, then document it with our Django practices.

Documentation for remote practices

Documentation request

See https://github.com/datamade/hr/issues/21#issuecomment-539699466 and https://github.com/datamade/hr/issues/21#issuecomment-540813023.

Named entity recognition with deep learning

Background

Recent advances have made deep learning much easier and more cost effective. These advances include:

Robust open source libraries like PyTorch and TensorFlow
Cheap and highly available GPUs on services like AWS EC2
A large and expanding marketplace of services focused specifically on document analysis with deep learning, including AWS SageMaker, AWS Textract, and Google Document Understanding

These developments have implications for a number of common problems that our clients face, including OCR and entity resolution. To get a better sense of the current landscape and what possibilities it offers for our business, I want to focus in on named entity recognition, a document analysis task that has proven challenging in the past.

Proposal

I propose a medium-size R&D project to evaluate the feasibility of deep learning for document analysis tasks for DataMade. I'll use named entity recognition on either Chicago Councilmatic data or WhoWasInCommand data as a starting point for evaluation.

As this project involves exploring a developing field, I'm proposing to do more reading and writing than I have for R&D projects in the past. I plan to start with a proposal for the specific task I want to accomplish with a particular set of documents. Then, I'll perform a field scan to identify possible solutions to the problem. Finally, I'll try as many solutions as I can, and produce a report evaluating the costs and benefits of each solution.

Deliverables

Deliverables for this R&D will include, in the following order:

A research outline for named entity recognition on either Chicago Councilmatic data or WhoWasInCommand data, detailing what specific task will be accomplished
A field scan
A document comparing a few different approaches

Timeline

I think I can get this R&D done in four R&D days (two months). My anticipated timeline is:

Day 1: Draft research outline, request a review, and start field scan
Day 2: Finalize research outline, finish drafting field scan, and request review
Day 3: Begin evaluating solutions
Day 4: Finish evaluating solutions and request review on a report

There's a good possibility that I won't have enough expertise in some solutions to be able to evaluate them completely. When this happens, I'll prioritize opening up an issue and moving on to another solution.

document expectations of work artifacts and practices

we have some scattered and/or implicit expectations of written artifacts of work. let's centralize those, and any helpful templates, here.

project collateral checklist: https://github.com/datamade/ops/issues/582
github issue template & maintenance practices
- no code until there's a relevant issue tracking progress
- link relevant exceptions: #15
- log progress with sufficient detail and concrete next steps in the comments (deals offers a good pattern here, imo)
pull request template & link to thoughtbot guide to code review
- description of the changes
- specific feedback?
- testing instructions
handoffs: some recommendations (a la "questions to ask" at the top of the sow template) and positive examples

potentially a home for this dusty old draft of git practices?

How to Recharts

Background

We have an old how-to card in which @jeancochrane wrote up some problems with our current go-to charting library, Highcharts:

Very large filesize
Restrictive license
Bloated API and docs (difficult to get a quick chart off the ground)
Difficult to customize charts
Doesn't play well with reactive frameworks like React

As DataMade now has several projects using Gatsby, a React-based framework, we have an opportunity to reevaluate our charting library. The courts project is using Recharts, for which I've already done some development and will need to do more, so it makes sense to start with that one.

Proposal

I'd like to take an investment day to step back and explore the possibilities of Recharts. To do this, I'll set up a simple Gatsby app in a separate repo and set up a series of visualizations using either generated sample data or an external API. I'll use that repo to track the documentation I find useful and any useful patterns I land on.

Integrating Django with Frontend JavaScript Frameworks

Background

Our R&D work on Gatsby (#7) has given us a taste of the power of contemporary JavaScript frameworks like React. Through their native support for ES6, their stateful component-oriented APIs, and their modern developer tooling, contemporary frontend frameworks make JavaScript development more fun while simultaneously expanding the horizon of possibility for complex user interfaces.

As we noted in our recommendation of adoption for Gatsby, however, these frameworks have a steep learning curve, and if we want to use them more extensively we need to adopt them incrementally. Most immediately, we need a way to integrate frontend frameworks like React into our standard Django stack in a way that allows us to continue to leverage as much of our Django expertise as possible while we get acquainted with the new paradigms offered by contemporary JavaScript frameworks.

Proposal

I propose to research approaches to integrating contemporary JavaScript frameworks with our standard Django app architecture. My goal will be to produce a clear path forward whereby we can use a frontend framework like React for views that require particularly complex interactivity, while falling back to standard Django views and templates for simpler views like List and Detail pages.

In sum, my focus will be on developing what many developers call a "hybrid" approach: one where we can isolate our use of the frontend framework only to specific views, instead of following the more common pattern of using Django exclusively as a data layer API while delegating all user-facing logic (like templating and routing) to the frontend framework.

Deliverables

This R&D project will proceed in two phases: research and development.

In the research phase, I plan to read articles and solicit advice from other developers about hybrid approaches to integrating Django and frontend frameworks. While my main focus will be on React, I expect I may open up my research to Vue.js as well, since it follows a similar conceptual paradigm to React and is advertised as being optimized for hybrid apps and incremental adoption.

In the development phase, I plan to produce a sample project that implements the most promising hybrid approach as identified in the research phase. Once this sample project has been approved by the R&D team, I plan to adapt it into a template that we can use for a future client app.

Timeline

I expect this R&D project to take somewhere between one and three months (two to six R&D days). The main reason for my uncertainty is that I don't yet have a good sense of how much prior work has gone into hybrid approaches like this one: if clear best practices already exist, this R&D project may be as simple as adapting an existing project based on a blog post; but if (as I suspect) there hasn't been much reusable work on this kind of approach, it will take longer to forge a new path.

Update onboarding docs

Documentation request

Go through https://github.com/datamade/ops/wiki/How-to-DataMade-(Resources-for-new-hires)#setting-up-your-environment and make sure it's up to date. Where possible, move docs that can be public from that Wiki to this repo.

Also: Make sure to add a note about installing Docker on local environments.

user research and testing, the datamade way

@jmithani commented on Mon Oct 01 2018

@reginafcompton and I talked about putting together a document of best practices for user research and testing after our focus groups with CSH.

I have a few books on the subject / knowledge I can synthesize. Would this be useful?

@jmithani commented on Tue Oct 16 2018

Consider some of the projects where we've done user research and how it has helped the success of the project
@hancush and Jean did work for dedupe
@reginafcompton and @jmithani will work on it, lower priority but try to work to have it done by the Just Spaces kickoff (which is ???)

@jmithani commented on Wed Nov 14 2018

Examples:
Erikson Risk & Reach User Research Summary
Erikson Risk & Reach User Research Notes

@derekeder commented on Sun Dec 02 2018

We've kicked off Just Spaces and are using the Erikson examples as a rough template. It would be good to document this process as a way to jump-start this resource

@jmithani commented on Mon Dec 03 2018

@derekeder "this" being Just Spaces or Erikson?

Recommended Resources

Listed in the order I would suggest reading.

In general, I recommend IDEO's The Field Guide to Human-Centered Design as a place to start for anyone new to user research / design
Needfinding: Design Research and Planning, which I can't find a PDF of online but is $10 on Amazon
This seven-page article defining needfinding by the author of the book above
This brief introduction to needfinding methods from Stanford's HCI group
This more nitty-gritty review of user research, more heavily focused on quantitative methods

@derekeder commented on Mon Dec 03 2018

I was thinking that we could start documenting our process for Just Spaces for a first draft of the user research and testing guide

@derekeder commented on Wed Dec 05 2018

I will draft out our overall user research process as a starting point and use it to guide the Just Spaces research component

@jmithani commented on Thu Dec 20 2018

@derekeder is there a draft i can help work on today?

@derekeder commented on Tue Jan 22 2019

Closing the loop on this - we have an overview guide on Project Research and Interviews at DataMade: https://docs.google.com/document/d/1NgmcCfBE3B9bTYF2TyTSY5m4QA6xLu2DxESeUTQYxUo/edit

Sorry for not following up on your offer to help on this @jmithani!

@derekeder commented on Fri Jan 25 2019

I think the next step on this is to create a new document for how we do user testing. I see this as separate and distinct from doing user interviews and reports as we have done for Erikson and Just Spaces.

This document on user testing should be framed around how we've done it on past projects, but informed by additional resources (like the ones from @jmithani's comment above https://github.com/datamade/ops/issues/567#issuecomment-443787532). I will admit that I have done some user testing, but not enough to have strong opinions about it, other than 'don't ask people what they want, see what they do (or try to do)'.

@jmithani is that enough for you to go on to get started on a first draft?

@jmithani commented on Wed Feb 27 2019

i haven't made any progress on this—i'll have something to present at next ops.

@jmithani commented on Wed Mar 20 2019

something to share, a titled blank document https://docs.google.com/document/d/1ofmxwOVIGJJ84pLGCQu7ALn7OfTpu5lT5Z0UbIbseXs/edit?usp=sharing

@derekeder commented on Wed Apr 17 2019

@fgregg I requested permission to share the Just Spaces User Research Report. It's attached here for reference
Just Spaces User Research Report.pdf

@derekeder commented on Thu Apr 18 2019

@fgregg we have permission to share this report.

R&D: Containerized deployments with docker-machine and AWS

If we do not identify a container orchestration service that feels right to us in #19, an intermediate step could be to leverage docker-machine to provision hosts and administer containerized applications.

Docker Machine is a tool that lets you install Docker Engine on virtual hosts, and manage the hosts with docker-machine commands. You can use Machine to create Docker hosts on your local Mac or Windows box, on your company network, in your data center, or on cloud providers like Azure, AWS, or DigitalOcean.

...

Typically, you install Docker Machine on your local system. Docker Machine has its own command line client docker-machine and the Docker Engine client, docker. You can use Machine to install Docker Engine on one or more virtual systems. These virtual systems can be local (as when you use Machine to install and run Docker Engine in VirtualBox on Mac or Windows) or remote (as when you use Machine to provision Dockerized hosts on cloud providers). The Dockerized hosts themselves can be thought of, and are sometimes referred to as, managed “machines”.

One immediate challenge is that docker-machine was created as a quick and easy way for a single developer to manage machines, i.e., there's no native way to support access by multiple developers. There are workarounds, but the first question I'd like to answer is: Do we want to move from one workaround to another?

If the answer is Yes, invest some R&D time to stand up an application using this pattern.

R&D: Advanced Netlify: Functions, Forms, and Identity

@jeancochrane commented on Mon Apr 22 2019

Motivation

From my perspective, Netlify provides a huge productivity boost for developing small static apps by eliminating backend infrastructure management. However, there are a few consistent features that clients ask for in these kinds of apps that have historically pushed us to use Django instead of developing a static app:

Admin login, along with admin-specific views
Advanced search
Data collection via forms
Any others that I'm missing?

Netlify provides "serverless" add-ons for accomplishing each of these tasks in the form of Functions, Forms, and Identity. I'd like to spend some time building a toy app that uses each of these add-ons in order to evaluate how feasible it would be for us to deploy similar projects without a backend server.

Key questions include:

Do the add-ons work as advertised?
How difficult are they to learn?
What are the limitations of the add-ons? Which small-to-medium sized projects would not be good candidates?
How reasonable is the pricing? At what scale would these add-ons cease to be free (or at least cheaper than two EC2 instances)?

Proposal

I propose a rapid R&D project to build a toy project that integrates each of the Netlify add-ons for admin login, search, and data collection.

I still haven't yet decided which project it would be best to develop against. I'll start by reading Forms, Auth, and Serverless Functions on Gatsby and Netlify, and I may use the app designed in that article as a sandbox for trying things. But ideally I'd like to pick a DataMade project that I could quickly port to a static app and test these features against. Recommendations are welcome.

Deliverables

A sample app hosted on Netlify with admin login, advanced search, and data collection via forms
A lunch&learn report-back answering the key questions above

Timeline

I plan to start this project on Friday 4/26. My first priority will be rapid prototyping to test whether these add-ons are useable at all. If the add-ons seem promising, I'll continue work on a custom project and finish it on Friday 5/10.

@jeancochrane commented on Thu May 02 2019

Follow-up blog post with my findings here: https://jeancochrane.com/blog/netlify-identity-dealbreakers. I focused mostly on Identity because it turned out to be pretty confusing and it was the add-on that I had the least experience with.

TL;DR: Identity isn't ready for prime-time yet, which drags the whole system down. Functions provide a nice and easy way to deploy simple AWS Lambda functions -- I would be interested in trying them out for dynamic search, but they don't yet integrate well with Identity in dev, which means we can't use them for admin views.

I think the next step in this research is to focus exclusively on deploying Functions for search, but I'll leave that for another R&D issue.

Best practices for monitoring

Document how we use monitoring services (primarily Sentry). See: https://github.com/datamade/ops/issues/619.

Document Heroku learning

Documentation request

Based on the work we do on our UofM pilot project, produce some documentation for Heroku to live in this repo.

Supersedes #19.

Headless Django CMS with Gatsby frontend

Our research in #18 revealed that CMSes for static site generators are not currently mature enough for our use cases. However, there's another popular strategy for managing content in static sites that we didn't consider: using a headless CMS to manage content, and consuming that content in a static site via an API.

Broadly, there are two ways of consuming the content in a frontend app:

Write the frontend app so that it retrieves all content dynamically from the headless CMS API on page load
Use the headless CMS API as a data source for the static site, and retrieve data at build time

I'm interested in doing some research to test out 2) above. Specifically, I'd like to update LISC to:

Install Wagtail CMS as a headless CMS
Hook into Wagtail signals to tell the Netlify API to restart a build of the frontend app when an editor publishes a page
Display a status on the Wagtail admin interface telling the user when the site is building, and whether it's deployed
Update the frontend app to consume content from the headless CMS

Part of my motivation here is that the authors of Wagtail recently implemented this stack for their own marketing site, so we'll have some foundation to build on.
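The "tell Netlify to restart a build" step can be pictured as a single POST to a Netlify build hook. The sketch below is an illustration under assumptions rather than the LISC implementation: the hook ID is a placeholder, the function names are hypothetical, and the Wagtail wiring is only described in a docstring so the code runs without Wagtail installed.

```python
import urllib.request

# Placeholder URL: a real build hook URL is generated in the Netlify site
# settings and should be kept out of source control.
BUILD_HOOK_URL = "https://api.netlify.com/build_hooks/EXAMPLE_HOOK_ID"


def build_hook_request(hook_url: str = BUILD_HOOK_URL) -> urllib.request.Request:
    """Prepare the POST request that asks Netlify to start a new build."""
    return urllib.request.Request(hook_url, data=b"{}", method="POST")


def trigger_build(hook_url: str = BUILD_HOOK_URL, timeout: float = 10.0) -> int:
    """Send the request and return the HTTP status code."""
    request = build_hook_request(hook_url)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.status


def on_page_published(sender, **kwargs):
    """Receiver with the signature that Django signals expect.

    In a Wagtail project this could be connected to the `page_published`
    signal; that connection is left out here so the sketch has no
    third-party dependencies.
    """
    return trigger_build()
```

Showing build status in the Wagtail admin would need a separate call to read deploy state back from Netlify, which this sketch does not attempt.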

Best practices docs for Leaflet maps

Adapted from https://github.com/datamade/ops/issues/578.

Let's put together some documentation for how to make Leaflet maps with jQuery.

Scheduling tasks on Heroku

Background

One of the last pieces that is keeping us from feeling confident in deploying complex apps on Heroku is that we don't yet know how to schedule jobs.

Proposal

I want to take a look at the Heroku Scheduler and related apps to define a way of scheduling tasks on Heroku.

Deliverables

I plan to update the Heroku documentation to include my guidance on how to do this.

Timeline

I expect this to take 2-3 R&D days.

Migrate active deploy-a-site documentation to how-to

When #46 lands, our AWS documentation in deploy-a-site will become obsolete. Archive that repository, then migrate any documentation with remaining relevance to this repo.

Containerize development environments

In 2019, containers are software canon. As DataMade's first structured foray into containerization, invest in building in-house expertise in local development with Docker containers.

This investment will make development easier and safer (via consistency) across machines. Also, DataMade apps tend to have a long shelf life. Containers will greatly reduce the pain of maintaining apps running on older versions of Python, Postgres, or other services.

To build this expertise, stand up Dedupe.io and its tests in a containerized environment.

Assuming this exercise goes well and we recommend containerizing applications as SOP, we will explore deployment patterns in a separate R&D project.

Related issues:

RMarkdown for literate analyses

Currently, we use PWeave for literate analyses. I would like to explore using RMarkdown instead.

Here are the advantages of RMarkdown:

Pretty good code caching. One recurrent pain point in working with PWeave is that every time you want to update the results from one code block, all code blocks are rerun. For longer analyses with expensive queries, this can lengthen feedback cycles to many minutes.
Very good editor support. RMarkdown is much, much more popular than PWeave, so text editors have much better support for it: Sublime Text, Emacs, and RStudio, to name a few.
Generally much better supported and widely used. RMarkdown is officially supported by RStudio, which is a big R company (RMarkdown is to the R ecosystem as Jupyter notebooks are to Python).
Easy to switch to other markdown authoring modes, like LaTeX.
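To make the caching point concrete, here is a conceptual sketch of content-based chunk caching: a chunk is re-executed only when its source text changes. It is not how RMarkdown or knitr is implemented (real tools track more than the source text), and `ChunkCache` and `slow_query` are hypothetical names used only for illustration.

```python
import hashlib
from typing import Callable, Dict


class ChunkCache:
    """Re-run a code chunk only when its source text has changed."""

    def __init__(self) -> None:
        self._results: Dict[str, object] = {}
        self.runs = 0  # how many times a chunk was actually executed

    def run(self, source: str, execute: Callable[[str], object]) -> object:
        # The hash of the chunk's source acts as the cache key
        key = hashlib.sha256(source.encode("utf-8")).hexdigest()
        if key not in self._results:
            self.runs += 1
            self._results[key] = execute(source)
        return self._results[key]


def slow_query(source: str) -> int:
    """Stand-in for an expensive chunk, such as a long-running query."""
    return sum(range(1000))


cache = ChunkCache()
chunks = ["SELECT * FROM payments", "plot(results)"]

for _ in range(2):  # render the document twice
    for chunk in chunks:
        cache.run(chunk, slow_query)

print(cache.runs)  # 2: each chunk ran once, and the second pass hit the cache
```

Editing one chunk changes only that chunk's key, so the other chunk's cached result is reused on the next render.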

Disadvantages of RMarkdown

It's in R, which is not part of our current stack.

Actually, that's the only disadvantage versus PWeave I can think of. It's a big one though.

Some mitigations of this disadvantage:

You can actually write Python (or even other languages) in the code blocks. You still need some R to get things off the ground, but it's pretty minimal. Code caching only partially works with non-R blocks. (rstudio/reticulate#167)
We are not in love with pandas as a data analysis option and have been considering R as a replacement.

Adopt testing standards for JavaScript

Background

We use flake8 and pytest for code style and testing in Python. Many moons ago, @reginafcompton and I researched and proposed using the Node.js style guide and JSHint for style and jasmine for testing in JavaScript, but they weren't broadly adopted by the team, in large part because we made limited use of JavaScript. As of 2020, that's changing!

Proposal

We're learning JavaScript as a team, and the language offers many ways to achieve the same thing. Let's get ahead of the proliferation of idiosyncratic code and formally adopt style and testing standards.

As we've moved on from ES5, we should pick a style guide that aligns with ES6, and update our default .jshintrc to reflect the change.

We should also revisit our chosen test framework. Does jasmine play well with Gatsby and React? If not, identify and adopt an alternative.

Deliverables

This research project should deliver a javascript/ directory containing pointers to our style guide and default linting config, and to our documentation on JavaScript testing.
It should also make a light revision to our JavaScript testing guidelines.
Ideally, it would also add basic JavaScript tests to at least one project. If we're willing to absorb the time spent to implement tests as research, we could also add tests to a new client project, e.g., the stormwater credit exchange site, which will be implemented in Gatsby.

Timeline

I expect this project to take 1-2 days, depending on whether we can stick with jasmine or if we need to select a new testing framework.

Connects #22.

How to Matplotlib

Continued from datamade/data-analysis-guidelines#13

After spending some hours with Matplotlib for https://github.com/datamade/nyu-journo, I don't absolutely love it but I see some paths forward for a standard DataMade approach to static charts. Two blog posts I found very helpful and would recommend adding to our documentation:

Effectively Using Matplotlib by Chris Moffitt
"Artist" in Matplotlib by skotaro

A couple of the high-level takeaways:

Stick with the object-oriented interface and its finer grain of control; avoid pyplot as much as possible, which makes simple things quicker but obscures too much
Before starting your own styling, look through the Matplotlib styles and see if any provide a good starting point for what you want
Make sure you're well acquainted with what Matplotlib names different parts of a chart. This can be unintuitive! The annotated graphic in "Effectively Using Matplotlib" is helpful here.

Something I could have done a better job with for the NYU project was separating data processing and charting into fully separate steps. This would allow us to create some handy chart helpers that can be used between projects (h/t @hancush!).
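As a rough illustration of these takeaways, the sketch below keeps data processing and charting in separate functions, builds the chart through the object-oriented interface without pyplot, and starts from a built-in style. The helper names, the sample data, and the choice of the `ggplot` style are assumptions made for the example, not an agreed DataMade pattern.

```python
import io

import matplotlib.style
from matplotlib.figure import Figure


def summarize(rows):
    """Data processing step: total the values for each category."""
    totals = {}
    for category, value in rows:
        totals[category] = totals.get(category, 0) + value
    return totals


def bar_chart(totals, title, style="ggplot"):
    """Charting step: build a Figure using the object-oriented interface."""
    with matplotlib.style.context(style):
        fig = Figure(figsize=(6, 4))
        ax = fig.subplots()
        ax.bar(list(totals.keys()), list(totals.values()))
        ax.set_title(title)
        ax.set_ylabel("Total")
    return fig


rows = [("A", 3), ("B", 5), ("A", 2)]  # made-up sample data
fig = bar_chart(summarize(rows), "Totals by category")

# Render to an in-memory PNG; pass a file path instead to write to disk
buffer = io.BytesIO()
fig.savefig(buffer, format="png")
```

Because `bar_chart` only needs a mapping of labels to totals, it could be reused across projects with different processing steps feeding it.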

I don't know if there are any immediate next steps for this, as static charting is not something we've been doing very often. I'd be happy to write the above points into a brief .md file for this repo, or someone else could pick it up for R&D.

Django GeoMultipleChoiceField plugin

Background

So far in two separate client apps (https://github.com/datamade/erikson-edi and https://github.com/datamade/just-spaces/) we've had to implement something I call a "GeoMultipleChoiceField" -- a Django form field and widget that allows the user to select multiple geometries on a map and store foreign keys to those geometries in the database.

In Just Spaces I tried a bit harder to modularize this functionality, as you can see in https://github.com/datamade/just-spaces/blob/master/surveys/widgets.py and https://github.com/datamade/just-spaces/blob/2c1b312e6e14692eaccadc272a278a04eb6a4b11/surveys/forms.py#L219-L234, but it's still pretty tightly coupled to the app and I don't think the API design is very good.

Since this is functionality that has come up twice, I think it's a great candidate for an open source library.

Proposal

I propose to develop an open source Django plugin providing a GeoMultipleChoiceField.

Deliverables

A repo and PyPI package for django-geomultiplechoicefield, providing both a form field and a widget for this functionality.

Timeline

Much of the logic here is already done, but it needs to be modularized and the API needs improvement. I think this would take somewhere around 20 hours to finish.

R&D: GatsbyJS + CPS SSCE onboarding

@jeancochrane commented on Tue Mar 26 2019

Motivation

GatsbyJS is a sorta-static-site-generator sorta-progressive-web-app-framework that uses React components for markup/interactivity and GraphQL as its data API. The basic pitch is that it gives you lots of the benefits of modern React development (reusable, data-driven components; highly interactive pages that load fast and work offline; integrated HTML/CSS/JS through the JSX paradigm) while being designed to integrate with a much wider array of data sources. It also comes recommended by David Eads.

On its surface, GatsbyJS seems like it could be a good choice of frontend framework for us, since it purports to have a tighter scope and to allow a wider range of source data than frameworks like React, Vue, or Angular. But some outstanding questions remain:

How well does it integrate with PostgreSQL and Django?
How steep is the learning curve? (Can you pick up the basics in an afternoon, like Django, or does it require intense training, like React?)
How does it handle our existing JS data viz toolkit (Highcharts and Leaflet)?

I want to answer these questions by experimenting with integrating Gatsby in a DataMade project. Since development on the CPS SSCE dashboard is gearing up soon, I figure it'd be a good opportunity to A) test drive a framework I've heard good things about while B) getting onboarded to a project I'll be working on soon.

Proposal

I propose a rapid R&D project to test drive GatsbyJS in the context of the CPS SSCE dashboard. I'll attempt to migrate the frontend of the app to GatsbyJS using the existing backend API.

These changes will be reflected in a pull request which I'll open up for comment. The changes will be purely experimental, and I won't pull them in when I'm finished.

Deliverables

A pull request against the CPS SSCE dashboard repo with a prototype environment
A short writeup documenting my answers to the questions above, as well as next steps for evaluation if Gatsby seems promising

Timeline

I plan to start and finish this R&D project on Friday, 3/29/19.

@jeancochrane commented on Tue Mar 26 2019

CC @fgregg @derekeder @hancush

@derekeder commented on Tue Mar 26 2019

I'd be curious what @beamalsky thinks of this and how it would impact the work we are planning. We are drafting a report to share with CPS that will, with their input, determine the work we do. I'd like to get some clarity on that direction before proceeding with this R&D on SSCE.

@jeancochrane commented on Tue Mar 26 2019

Just to be clear @derekeder, my end goal here isn't to do development on SSCE, since I don't plan on merging the changes. SSCE just provides a sandbox environment to play in that approximates a real-world client app. I could do this R&D with any other existing client app (e.g. EDI or SFM, although SFM is tougher since it's many degrees more complex) but it seemed like a good opportunity to get onboarded to SSCE.

I hear you saying that there's a chance we won't do more technical development on SSCE soon, in which case I think it makes sense to wait. If on the other hand we feel confident that we're going to do more development on SSCE, do you see a downside to onboarding me?

@derekeder commented on Tue Mar 26 2019

Here's the report we just sent to CPS: https://docs.google.com/document/d/17FJl78y8ALMRHGUAgVpEIDy5p6nk0xgVy6S5KhYN95I/edit#

If they proceed with our recommendations, there won't be much development for a while as we get them trained up and clean up the source spreadsheet.

I do think bringing you into the CPS project for this would be fine. The LISC CNDA map might also be a good candidate since you are already engaging on that project

@beamalsky commented on Tue Mar 26 2019

This sounds great to me! SSCE should provide a solid and not-too-complex backend to work with. I'll second what @derekeder said: we're not likely to make major additions to the frontend any time soon, so it makes sense to me to keep @jeancochrane's work experimental and unmerged.

@jeancochrane commented on Thu Apr 04 2019

Notes from my learning lunch on Gatsby: https://gist.github.com/jeancochrane/705dda18da74fafe4b8182d15284114d

Based on this R&D, we're going to trial Gatsby in the LISC CNDA project (https://github.com/datamade/lisc-cnda).

Evaluate static site CMSes

Background

Django's CMS solutions are not great. To boot, we don't have any CMS options at all for static sites.

Netlify CMS promises to be a lot better. It claims to be an open source static app that you can plug into any app built with a static site generator in order to expose content editing via Git. Content is staged for editing as GitHub PRs and can be reviewed with Netlify's preview builds. In essence, it's a frontend on top of GitHub, Markdown, and Netlify.

Relevant questions include:

What are our authentication options? Do clients need to know how to use GitHub?
Can we restrict permissions so that clients can only write to content, or do they have to have write access to the whole repo?
How hard is it to integrate Netlify CMS with a Gatsby app? Is it as plug-and-play as the docs claim?
What is the editing interface like? How extensible is it?
What kinds of content can you edit? Pages? Posts? Arbitrary data?
What sorts of changes to our standard static app deployment SOP would we need to enact to use Netlify CMS?
Is Netlify CMS promising enough to pilot on the next iteration of LISC CNDA?

Deliverables

I'd like to test Netlify CMS by plugging it into LISC CNDA to see if it would be a workable alternative to Django CMS for static sites. The deliverable will be a pull request incorporating Netlify CMS into LISC CNDA, either with a recommendation for further research, or with my reasons for abandoning it.

Timeline

I propose to start and finish this work on Friday, 6/21.

Best practices docs for search

Adapted from https://github.com/datamade/ops/issues/547.

Create a doc in this repo out of our guidelines: https://docs.google.com/document/d/139rP-FfNyf50VfW5DB1vNVwGrdrguRz1VhrFz-rmegg/edit#heading=h.lliyh751n4p4

R&D: Deploying Django and Postgres on Heroku

@jeancochrane commented on Thu May 02 2019

Motivation

In DataMade's DevOps endeavors, we currently face three related problems:

We spend more time than we would like to doing unpleasant server maintenance
Our zero-downtime deployment strategy is mature but brittle, which indicates to us that perhaps we're overengineering our solution
We would like to containerize our production apps, but we don't know where to start, and we've had bad experiences with trying to orchestrate containers on AWS ECS and on bare EC2 instances

These are three problems that the Heroku Platform promises to help solve. The Heroku Platform includes three services that I'd like to investigate: Runtime, Postgres, and Flow.

The Heroku Runtime service is supposed to build, deploy, and orchestrate application containers (Heroku calls them "dynos") from source code or from Dockerfiles with minimal configuration. The Heroku Postgres service is supposed to provide a managed database experience similar to AWS RDS that integrates with application dynos. The Heroku Flow service is supposed to provide a CI/CD solution that runs tests and builds preview apps using dynos for each PR.

The marketing materials for these services paint a picture of one possible future for our server-bound deployments: ephemeral, container-based, rebuilt on every push, with a GitHub collaboration experience similar to Netlify. However, since this is basically ad copy, I don't feel like I have enough information to make an informed decision about whether we should pilot deploying with Heroku.

Proposal

I'd like to stand up a simple app on the Heroku Platform in order to test out the services I listed above. Key questions will include:

How much of a conceptual leap is it from our current dev/deployment practices to Heroku? What would it take to convert an app?
What is monitoring like? Do we get shell access to services, or do we have to use a UI console? If we get shell access, how are permissions configured?
How do the CI/CD features work? Do we indeed get fully ephemeral applications out of the box for PR/staging/production? How much extra does this cost?
What is the integration like between Heroku Runtime and Heroku Postgres? Can apps easily communicate with a protected database as in a VPC, or does it require lots of custom configuration?
What is secrets management like? Can we use Blackbox? Can we download secrets from a remote source like S3? Or do we have to use a custom Heroku solution?
How is networking configured? Can we point DNS to a Heroku load balancer, as with Netlify? Or do we need to do more complicated DNS delegation?
What is dyno performance like at the pricing levels that are comparable to our current practices (two small EC2 instances, one for staging and one for production)?

I'd like to use Just Spaces as a toy app to stand up the testing stack. Just Spaces is a good candidate for a toy project because it incorporates most of the key elements of our stack -- a Django backend, a frontend served with Django over Nginx, and a Postgres database -- while still being simple enough to stand up quickly.

Deliverables

An instance of Just Spaces deployed on the Heroku Platform
Either a written report or a lunch&learn detailing the things I learned and my recommendations for further Heroku R&D

Timeline

I plan to start and end this project on Friday 5/10.

CC @hancush @fgregg

@hancush commented on Thu May 09 2019

this is great, @jeancochrane – thank you! i think that having firsthand experience with an alternative approach will really enrich our upcoming discussion of deployment dreams (https://github.com/datamade/devops/issues/90). it's a bonus that this approach seems to tick a lot of desirable boxes, namely ephemerality. we're pretty intimately familiar with the challenges of our current approach, and it will be nice to have that grounding context for a leading alternative, too.

@fgregg commented on Wed May 22 2019

R&D issues are going to be moved to the new R&D repo.

Dedupe.io upload refactor

The upload service for Dedupe.io is kind of a mess (see https://github.com/dedupeio/dedupe-service/issues/1449). Consider taking some R&D time to do a serious refactor of it.

Static app template

Background

So far we've produced two different static app templates for Gatsby apps: https://github.com/datamade/how-to-recharts and https://github.com/datamade/static-app-template/tree/jfc/init-app. We should consolidate these two efforts into an officially-sanctioned template so that it's clear to developers how they should start a static app.

Proposal

I propose we consolidate our static app templates into one Cookiecutter template that will live in how-to/docker/templates/new-gatsby-app.

Timeline

I expect this will probably take two R&D days, given we'll need a round of feedback.

Write template for Django applications

Related to #22, it would be nice to establish a template for new Django apps that includes ES6 support via django-compressor by default.

R&D: Single sign-on / external authorization services

We're plugging our suite of BGA news apps into single sign-on provider Auth0. More broadly, it's nice when you can leverage an existing account to log into new services. Log thoughts on Auth0, as well as any other external authorization services, here.

N.b., @jeancochrane actually kicked off this convo with "Four Dealbreakers in Netlify Identity", out of #6.

R&D: Enabling ES6 JavaScript development in Django

Continuation of https://github.com/datamade/ops/issues/593, related to https://github.com/datamade/ops/issues/599 & #12.

@jeancochrane has already enumerated the benefits of syntactic changes in ES6, as well as JavaScript compilation. In addition, they have identified ES6 as an important prerequisite skill for Gatsby, and it is our recommendation to:

Adopt ES6 as our new standard for JavaScript, and invest R&D time in creating a templated build environment for using it with Django projects

This issue is to track progress on that front.

Rapid R&D: Functional UI testing with Selenium

We tried this last summer for Dedupe.io and had a bad time trying to log in, then do stuff. It'd be great to figure out a good pattern for this.

Expand Docker documentation

Documentation request

Namely, add:

Links to useful Docker documentation on core concepts (images, containers, Dockerfile best practices, docker-compose)
The life cycle of images, containers, and networks as it occurs within our recommended practices
Troubleshooting tips, e.g., removing your database volume and starting again

docker-entrypoint.sh

Documentation request

@fgregg will describe how he used docker-entrypoint.sh in la metro

Revisit Gatsby Incremental Builds on Netlify

Background

Gatsby has a new feature called Incremental Builds where it can regenerate your app by only updating things that have changed. Netlify in turn supports this feature through its own new feature, Netlify Builds, that allows you to configure plugins for your builds.

Proposal

I'd like to test out Incremental Builds on Netlify by following Netlify's documentation. I'm going to do this by updating https://github.com/datamade/lisc-cnda/ and https://github.com/datamade/lisc-cnda-map, two projects that have had some trouble in the past with builds taking too long and erroring out.

Deliverables

Pull requests to LISC CNDA repos using Incremental Builds
Some documentation on how to configure Incremental Builds on Netlify

Timeline

I expect this to take one R&D day.

R&D: Advanced Netlify: Functions, Forms, and Identity

@jeancochrane commented on Mon Apr 22 2019

Motivation

Admin login, along with admin-specific views
Advanced search
Data collection via forms
Any others that I'm missing?

Key questions include:

Do the add-ons work as advertised?
How difficult are they to learn?
What are the limitations of the add-ons? Which small-to-medium sized projects would not be good candidates?
How reasonable is the pricing? At what scale would these add-ons cease to be free (or at least cheaper than two EC2 instances)?

Proposal

I propose a rapid R&D project to build a toy project that integrates each of the Netlify add-ons for admin login, search, and data collection.

Deliverables

A sample app hosted on Netlify with admin login, advanced search, and data collection via forms
A lunch&learn report-back answering the key questions above

Timeline

@jeancochrane commented on Thu May 02 2019

I think the next step in this research is to focus exclusively on deploying Functions for search, but I'll leave that for another R&D issue.

[ STUB ] Dedupe core R&D

I'm going to write up a more detailed issue about this on Friday, but I'm leaving this as a quick note that I plan to work on Dedupe core issues in roughly the following sequence:

Allowing 32-bit floats instead of 64-bit doubles in fastcluster
Improving the connected component search algorithm to make it less memory-intensive
Defining a test harness for testing different performance metrics
Using blocks as a feature for the classifier
Researching different approaches to sampling record pairs for active labelling
Researching different learning routines (connects #55)
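As a rough illustration of why the first item matters, the sketch below estimates the memory needed to hold every pairwise distance in single versus double precision. It assumes the clustering step consumes a condensed pairwise distance array (one entry per record pair); the record count and the helper names are hypothetical, and this is not Dedupe's or fastcluster's actual code.

```python
from array import array


def condensed_length(n_records: int) -> int:
    """Number of pairwise distances among n_records items: n * (n - 1) / 2."""
    return n_records * (n_records - 1) // 2


def condensed_bytes(n_records: int, typecode: str) -> int:
    """Bytes needed to store every pairwise distance in the given array type."""
    return condensed_length(n_records) * array(typecode).itemsize


# Hypothetical number of records entering the clustering step.
n = 10_000

double_bytes = condensed_bytes(n, "d")  # C double, typically 64-bit
float_bytes = condensed_bytes(n, "f")   # C float, typically 32-bit

print(f"doubles: {double_bytes / 1e6:.1f} MB, floats: {float_bytes / 1e6:.1f} MB")
```

Because the number of pairs grows quadratically with the number of records, halving the width of each entry halves the dominant memory cost of this step.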

Incremental Improvements to Dedupe Core

Background

There are a number of relatively small improvements that we would like to make to Dedupe, each of which requires a little bit of research. These improvements include:

Adjusting the clustering library to allow use of 32-bit floats instead of 64-bit
Replacing the connected components algorithm with one that uses less memory
Creating a performance testing suite

This represents the first and easier half of #60.
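One candidate for a less memory-hungry connected components search is a union-find (disjoint set) structure, which only stores one parent pointer per record and can consume record pairs as a stream. The sketch below is a minimal illustration of that technique, not the algorithm Dedupe currently uses; the function name and the sample pairs are hypothetical.

```python
from typing import Dict, Hashable, Iterable, List, Tuple


def connected_components(
    edges: Iterable[Tuple[Hashable, Hashable]]
) -> List[List[Hashable]]:
    """Group nodes into components using union-find with path compression."""
    parent: Dict[Hashable, Hashable] = {}

    def find(node: Hashable) -> Hashable:
        # Register unseen nodes as their own root.
        parent.setdefault(node, node)
        root = node
        while parent[root] != root:
            root = parent[root]
        # Path compression: point every node on the path straight at the root.
        while parent[node] != root:
            parent[node], node = root, parent[node]
        return root

    # Edges can be any iterable (e.g. a generator), so the full edge list
    # never needs to be held in memory at once.
    for a, b in edges:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    groups: Dict[Hashable, List[Hashable]] = {}
    for node in parent:
        groups.setdefault(find(node), []).append(node)
    return list(groups.values())


# Hypothetical record pairs that scored above a match threshold.
pairs = [(1, 2), (2, 3), (4, 5)]
components = connected_components(pairs)
```

Memory use here scales with the number of distinct records rather than the number of candidate pairs, which is the property the proposed replacement is after.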

Proposal

I propose to make incremental contributions to the Dedupe core library as a way of becoming more familiar with the library internals and developing my knowledge of C.

Deliverables

I plan to merge pull requests into Dedupe core, one for each item above.

Timeline

I expect to take roughly two months (four R&D days) to complete the issues above. The issues are in order of least to most complexity, and I expect them to take a roughly proportional amount of time.

Rapid R&D: Investigate Divio for Django deployments

I'm demoing Django CMS for the NOF folks using Divio's Django CMS demo page that lets you spin up a temporary Django CMS project to play with.

I was pretty impressed with the platform, and I noticed that they offer containerized Django deployments. On a first glance, it seems like it meets a lot of our needs (optimized for Django, AWS under the hood, fast setup).

I'd love to take an R&D day to do a rapid evaluation of Divio and see if we should consider it alongside Heroku and ECS as a possible containerized deployment provider.

Solr: Paths forward

Background

We currently recommend Solr for advanced search implementations. There's a lot to like about it: It's powerful, flexible, and infinitely configurable.

But there's also a catch. Collectively, we understand relatively little about Java and Solr's internals, which makes debugging difficult and, sometimes, quite scary. As we transition from EC2 to Heroku, there is the added downside that the most basic Solr instance costs $20/month, with production instances starting at $60/month. This is a huge increase from hosting our own instances on EC2.

Proposal

I'd like to learn more about how Solr works and expand our documentation to include key concepts and advice for troubleshooting and tuning Solr instances. I'd especially like to focus on settings that will allow us to operate within the constraints of Websolr. For example, the production Councilmatic Solr index uses almost 8GB of storage because we store huge text fields in the index -- the equivalent Websolr instance would cost $299/month.

Deliverables

Use Metro councilmatic to experiment with stored/indexed fields, with the goal of reducing the index size to < 1 GB (the limit for a small Websolr index).
Revise Solr setup documentation to impart key concepts and emphasize DIY, rather than rote copy/pasting.
Issue strong guidance on Haystack index setup, e.g., when to index/store a value.
Add section on tuning and troubleshooting, especially Java-specific issues, like heap size.

Timeline

1-2 days

How to manage a client project

Documentation request

It feels like we have some ad-hoc practices that have emerged for managing the full lifecycle of a client project, e.g. kickoff calls, deadline setting, weekly email updates, and post-project retrospectives. Let's immortalize this wisdom in a document in the project-management/ directory so that we can have some consistent standards for managing projects.

Content Management Systems: The DataMade Way

Documentation request

We've done a lot of CMS work beyond Django's default admin interface. Some recent ventures include:

feimcms3 to power the IHS website.
django CMS to power the NOF website.
Wagtail to power the upcoming Congressional Oversight Hearing Map.

Each of these comes with its own pros and cons, both for the developers and the clients. Let's think on those and either pick our default CMS stack or write a document, similar to Searching Data, that pairs options with their tradeoffs.

(One big blind spot is non-Django CMS stacks, FWIW.)

Semi-related to #35, definitely related to #18.

Alternatives to custom data management interfaces

Clients often ask us to build them a custom data management interface (CPS SSCE, SFM, Erikson, BGA, etc.). Very rarely does the client wind up using the interface successfully without substantial help from DataMade.

Investigate alternative approaches to this problem by choosing an approach and refactoring an existing project to use a different data CMS. Some possible approaches might be:

Google Sheets workflows
Airtable (https://airtable.com/)
Building a custom, pluggable solution

Slow R&D: Pilot container orchestration services

Ported from https://github.com/datamade/devops/issues/90:

Identify a service to orchestrate the deployment of containers. In addition to orchestrating container deployment, this service should at minimum provide a web interface to container logs and allow provisioning container access by application. Ideally, it would also allow managing access by user and centralize other key devops tasks, such as DNS and SSL management. We have identified two immediate candidate services: ECS and Heroku. We propose evaluating these services by standing up a containerized application in both and selecting the one we prefer. If we prefer ECS but desire further abstraction, we then propose trialing Fargate.

Depends on firming up containerization practices over in #17.

Add anti-harassment policy to this repo

Similar to a code of conduct, add our anti-harassment policy as a doc in this repo.

Do we want to continue to support Flask?

Flask and Django both use Python but are substantially different frameworks. Currently, we support documentation and templates for both types of apps, although the Flask docs are used much less frequently (as far as I can tell we only have two active projects that are written in Flask, Dedupe.io and Parserator, both of which are legacy apps).

I'd love to have a conversation about whether or not it's a good idea for the company to continue to support Flask. If so, when are times that we would prefer to use Flask over Django? If not, what's our path forward for minimizing the maintenance burden of our Flask apps?

Recommendations for HTML5 forms validation

Adapted from https://github.com/datamade/ops/issues/570.

We should put together some guidance on how to handle HTML5 form validation -- specifically, whether or not we recommend turning it off in Django with use_required_attribute.

Deep embeddings for schema matching and record linkage

Background

In their 2019 paper "Recognizing Variables from their Data via Deep Embeddings of Distributions", Alex Smola and Jonas Mueller use learned embeddings -- lower-dimensional representations of columns produced by neural networks -- to match columns across datasets and make predictions about which columns refer to the same attributes (a task often called "schema matching").

Since schema matching is a task that we've been interested in researching for a while, we could get some value out of investigating Mueller and Smola's approach. What's more, I suspect that the approach may also transfer to the deduplication domain, allowing us to deduplicate and link datasets of much larger sizes. (For more detailed thoughts on this possibility, see my blog post on the paper).

Proposal

I propose to investigate deep embeddings for schema matching and deduplication. I'd like to start by attempting to reproduce Mueller and Smola's results from their paper. Then, I'd like to see if learned embeddings can transfer to deduplication and record linkage in order to address the clustering problem.
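To make the matching step concrete, here is a minimal sketch of nearest-neighbor lookup over embeddings using cosine similarity. It is an exact brute-force search, shown only to illustrate the idea that an approximate nearest neighbor index would speed up at scale; it is not the method from the paper, and the column names and vectors are made-up toy values rather than learned embeddings.

```python
import math
from typing import Dict, Optional, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def nearest_neighbor(
    query: Sequence[float], candidates: Dict[str, Sequence[float]]
) -> Optional[Tuple[str, float]]:
    """Return the (name, similarity) of the candidate closest to the query."""
    best: Optional[Tuple[str, float]] = None
    for name, vector in candidates.items():
        score = cosine_similarity(query, vector)
        if best is None or score > best[1]:
            best = (name, score)
    return best


# Toy, hand-written vectors standing in for learned column embeddings.
known_columns = {
    "zip_code": [0.9, 0.1, 0.0],
    "birth_year": [0.0, 0.2, 0.9],
}
unlabeled_column = [0.8, 0.2, 0.1]

match = nearest_neighbor(unlabeled_column, known_columns)
```

The brute-force loop compares the query against every candidate, so its cost grows linearly with the number of candidates; that is the part an approximate index would replace.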

Deliverables

A blog post summarizing the paper: https://jeancochrane.com/blog/schema-matching-with-deep-embeddings
A repo that reproduces Mueller and Smola's results
A Dedupe extension that uses learned embeddings and approximate nearest neighbors for clustering instead of logistic regression and hierarchical clustering with centroid linkage

Timeline

There's enough uncertainty in this R&D that it's hard to give a good estimate of how long it will take. My best guess is somewhere between three and six months.

I feel much more confident about reproducing Mueller and Smola's paper than I do about trying to extend Dedupe; the paper at least is putatively a solved problem, and the challenge will be getting it to work according to spec. Accordingly, I'll need more support for the work on Dedupe, and I'll try to focus on validating the approach as quickly as possible to determine if we need to abandon it or not.
