artsy/README

:wave: The documentation for being an Artsy Engineer

License: Creative Commons Attribution 4.0 International
Example: Technical thread - Non-technical thread
What do you think @sweir27 ?
Drop the Team Updates agenda item in Engineering Open Standup.
I successfully got Reduce Duplication at Engineering Open Standup merged last month. Part of the reasoning behind that proposal was that Engineers would have just been in the Sprint Kickoff Meeting prior to Open Standup, so the Team Updates were pretty redundant from that fact alone. However, since that RFC was merged, the Sprint Kickoff Meeting morphed into Product Team Office Hours.
This sorta leaves us in a state where it's worth reviewing our choice here to make sure we're going in the right direction. I was chatting with @ashfurrow about this and his thought was that we should either drop these updates completely or bring them back every week. I opted for the former for this RFC.
I don't get much value out of the team updates at Open Standup, but would love to know how others feel. I've been happy with getting this type of info from skimming the Sprint Overview email that goes out. On that last RFC there was def some interest in dropping these updates completely or changing the focus so those ideas are certainly in scope as well.
One additional advantage of dropping these updates is that the meeting might have a slightly less Product vibe and more of a meeting where Engineering can just be Engineering. 🤓
There are times when teams have to share info with the rest of the group and Open Standup is a great avenue for this, so we should ensure there's still a good way for this to happen! If the info to share is covered by Cross-dependencies / Requests for Pairing or New Milestones / Repos / Blog posts sections, then great. Otherwise, the Closing Announcements section could be a good catch all.
The Reduce Duplication at Engineering Open Standup RFC from last month.
Our current RFC process is a fantastic mechanism for raising important conversations around large or controversial changes. I believe there are still optimizations we can implement in the process to make it even better.
I propose that we
Edit: I've updated the proposal to remove the second part. It seems like preliminary feedback is against that, and I'm not so invested in that part of this RFC to push for it. I would still like to tackle the problem of resolving RFCs being a little murky. Open to ideas on what that might look like.
(props to @orta for the suggestion on this one)
This just ensures we're doing our best to resolve RFCs in a timely manner.
To describe a sponsor's role I'll borrow a part of coming to consensus from Mark Shepard:
A chosen facilitator can help consensus by keeping the discussion on track, encouraging good process, and posing alternatives that may resolve differences. But a facilitator is a servant, not a director, and assumes a neutral role. If a facilitator wishes to take a stand on an issue, the task of facilitating is handed to someone else.
So a sponsor's role will be twofold:
1. Help guide the conversation and resolve differences
2. Take the responsibility for deciding the resolution state
The sponsor should be neutral in the conversation, open to any resolution, and willing to help facilitate. If an RFC loses its sponsor it's considered stalled until a new sponsor is found.
Metaphysics, Exchange, Convection, and Gravity (for GravQL), plus any others I can't think of (Impulse? Pulse?), have a `_schema.graphql` file in the root of their respective repos. We add a pre-commit hook that ensures the schema is always up-to-date.
Mainly to move PR review up a level of abstraction so we can easily discuss schema design; secondarily, to improve tooling.
`_schema.graphql` is a weird name, yes, but I want it to be at the top of the PR every time.

Nothing I can think of? Probably worth adding a CI check though, in case people decide to skip the pre-commit hook.
A hat-tip to @cjoudrey who brought this idea up and discussed its merits within GitHub's API at the NYC GraphQL meetup.
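The CI check mentioned above could boil down to a string comparison between the committed `_schema.graphql` and a freshly dumped schema. A minimal sketch; the dump step is stubbed here, since a real check would shell out to each app's own schema-dump task:

```ruby
# Sketch of the proposed CI check: fail the build when the committed
# _schema.graphql no longer matches the schema the app actually serves.
def schema_up_to_date?(committed, current)
  # Normalize trailing whitespace so formatting noise doesn't fail CI.
  committed.strip == current.strip
end

# In CI this would be something like:
#   committed = File.read('_schema.graphql')
#   current   = dump_schema   # hypothetical: invoke the app's dump task
current = "type Query {\n  artwork(id: ID!): Artwork\n}"
```

The check stays deliberately dumb: any semantic diffing can be left to humans reading the `_schema.graphql` diff at the top of the PR.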
More below, but this proposes a new process to keep track of incidents and communicate their status via tools including Jira Ops and Status Page.
As our on-call process has evolved, we’ve identified a few areas of improvement:
To make it easy for anyone at Artsy to raise incidents, keep track of which incidents are occurring/have occurred, and their potential impact and follow-up.
There are many ways to improve our fledgling on-call support process, and these changes do not attempt to solve all of them.
Our ultimate goal is to create a process that is transparent for both incident responders and incident reporters. Accomplishing this will at least involve changes to our process on engineering, changes to how stakeholders interact with our team, and potential additional tools and budget.
The changes proposed here focus only on the first part: how we can improve the Engineering team’s process.
Intentionally not included here are:
Some of these, especially the first two items in that list, are top-of-mind. They're not in this proposal because we felt it would be too much to introduce all at once and want to give this process a try first.
Join the #on-call-working-group slack channel to see discussions around this and other topics related to how we want to handle on-call.
See https://github.com/artsy/potential/pull/134 for the initial explanation behind our current process (which has since evolved!).
If we decide to move forward with this RFC, the next steps will be to:
If this is added to the master support playbook, we'd also like to include:
Our “open-source by default” approach has helped create a world class Engineering brand, but is being tested with open vs. closed onboarding and public vs. private documentation. There’s a clear need for writing down the definitions around this area beyond the short introduction of who we are.
Also #2
Share links to different teams retro boards publicly across PDDE.
Retros are great places to discuss our internal team processes and possible areas for improvement. What's possibly missing in our current approach is sharing the results of these retros and finding common pain points and improvement areas.
For example, I've been in retros on different teams where something like missing integration tests has come up a few times. Currently these stay within the team, but if we share these boards across PDDE, common patterns will emerge and we can act on them more easily.
At past workplaces, each team's retro boards were actual post-its hanging on the wall for the rest of the week, so you could just pass by, see what's going on in other teams, spot common patterns, and kick off a wider team effort to fix them.
None?
The only issue I can think of is that, knowing these retro boards currently stay within each team, some people may not feel comfortable opening them up across PDDE; but I think overall we'd benefit more from having them public.
Finding a proper place to share links to retro boards
Specify responsibilities for PR reviewers/assignees.
I propose that the following guidelines apply to all PRs created in the Artsy org:
People assigned to review a PR should ideally review before merge. It is not required that all reviewers complete reviews, but it is recommended. Authors should also do their best to limit the number of reviewers in order to keep PRs moving quickly.
Developers who should be notified of the PR but whose review is not required before merge should be cc'd in the PR body instead of assigned to review.
Reviewers should make a good-faith effort to review within one business day of being assigned, unless the PR's author specifies otherwise in the body of the PR. If a reviewer requires more time for any reason, they should communicate with the author of the PR so that the author can designate a different reviewer if need be.
PRs should be assigned to a single person. This helps to keep expectations clear and stops PRs from getting stuck ("oh, I thought the other person was the more important assignee").
The person assigned to the PR is only responsible for moving the PR forward (merging or re-assigning the PR), and not necessarily for reviewing. If the author wants the assignee to review, they should also assign them as a reviewer.
It is still the responsibility of the PR's author to get the PR to the point where it is ready to be merged. It is also the author's responsibility to notify the assignee when they believe the PR is ready to be merged, at which point the assignee should make a good-faith effort to merge or request changes within one business day.
Once an assignee has finished reviewing, they can:
These criteria should be fulfilled before a PR is merged:
PRs should not initially be self-assigned (though they may be assigned back to the author by the original assignee). This ensures that there is a final check from a fresh pair of eyes before the PR is merged.
Different engineers have different practices when it comes to PRs. Some ask 3 - 4 people to review, some only 1 person. Some self-assign their PRs, while others assign them to teammates.
This can create a lot of confusion, especially when engineers transition between teams and have to adjust to the styles of their new collaborators. It also slows the pace of progress, as it may be unclear who is responsible for making sure a PR is merged as fast as safely possible.
Ideally, this would be applied across teams and individual engineers so that everyone has the same expectations for what happens when they create a PR and designate reviewers and assignees. This would require close to 100% agreement on what these roles are and mean, so I'll do my best to make sure that everyone has a chance to voice their opinions or concerns.
The timelines may also change depending on the urgency of the PR. Ideally, authors will note things like "this is a high-priority PR and would ideally be merged within the next few hours" or "this is a low-priority PR and as such could wait until so-and-so gets back from vacation to review, since they're the most familiar with this system." Could be a good case for a Peril rule + auto-tagging, actually.
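To illustrate what such an auto-tagging rule might key off (a sketch only; the matched phrases and label names are assumptions, not an agreed convention, and in practice this would live in a Peril/Danger rule):

```ruby
# Hypothetical auto-tagging rule: scan the PR body for a priority note
# and derive a label for the PR.
def priority_label(pr_body)
  case pr_body
  when /high[- ]priority/i then 'priority: high'
  when /low[- ]priority/i  then 'priority: low'
  else                          'priority: normal'
  end
end
```

The label could then drive different review-turnaround expectations, e.g. hours for high priority versus "whenever the right reviewer is back" for low.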
If we're able to agree on guidelines, those guidelines would become part of Artsy's README and would be surfaced to new hires. They would also be raised in Engineering standups and in the #dev channel to ensure that the whole team is aware and on board.
cc @xtina-starr for the idea!
One of the items in our onboarding talks about taking some time to tour around Artsy.net in a staging environment, doing things like bidding in an auction and making an inquiry.
It would be cool if we had a doc that outlined some of these and provided some high-level instructions on how to actually do them. 😄
Remove any single line breaks in Markdown files.
It seems like many files hard-wrap at 80ish characters, which seems completely unnecessary as all editors do just fine wrapping text.
The extra CR/LFs are driving me bananas.
This could be automated via Danger.
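The core of that automation could be a small transform that joins hard-wrapped lines within each paragraph. A minimal sketch in plain Ruby (a real Danger rule would also need to skip code fences, lists, and tables, which this deliberately ignores):

```ruby
# Join single line breaks within paragraphs, keeping blank-line
# paragraph separators intact. Deliberately naive: fenced code blocks,
# lists, and tables would need to be left alone in a real rule.
def unwrap_paragraphs(markdown)
  markdown
    .split(/\n{2,}/)                            # paragraphs
    .map { |para| para.gsub("\n", ' ').strip }  # drop hard wraps
    .join("\n\n")
end
```

Danger could run this over changed `.md` files and warn (or auto-fix) when the output differs from the input.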
Add a license to the Artsy Engineering blog: https://github.com/artsy/artsy.github.io
This applies to over 200 existing blog posts, so I thought it was important enough to warrant an RFC.
I was surprised to learn that our blog repo currently lacks a license – maybe because it's one of our oldest OSS repos and we didn't know better? In any case, we should license it. Our standard license is MIT, but that license is generally used for code, and not necessarily for content. For this reason, I suggest we use CC BY 4.0 for the blog content itself. I've been doing this on my blog to make sure people know that they can use what they read there. (The Attribution clause would require them, if they modify the blog's contents and re-release them, to attribute the original post to us.)
We could, alternatively, omit the CC BY 4.0 license if we're feeling averse to getting into the weeds of licensing, but I think it's worth it.
None that I can think of.
This came up while developing the new Artsy Engineering blog. Our decision in this RFC would apply to that new repo, too.
Adding a `LICENSE` file with the following contents:
MIT License
Copyright (c) 2019 Artsy Engineering
Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:
The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
Make the following changes to `README.md`:

```diff
+
+## License
+
+The code in this repository is released under the MIT license. The contents of the blog itself (ie: the contents of
+the `_posts` directory) are released under
+[Creative Commons Attribution 4.0 International License](https://creativecommons.org/licenses/by/4.0/).
```
When creating or updating new apps, code changes often do not take into account SEO or accessibility. These often become work items much later on when surfaced as bugs.
A few examples:
`alt` text for images

Because many of these items are considered best practices, I believe it would be a good idea to start assembling a list of acceptance criteria for new and redesigned pages with respect to basic SEO and accessibility standards.
A simple baseline might be a minimum Lighthouse score; running this check will usually produce a list of straightforward improvements. I'd imagine the SEO team would have ideas on further points to double-check before launching new products.
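A baseline like that could be enforced in CI by parsing Lighthouse's JSON report (`lighthouse --output=json`) and failing when any category scores below the minimum. A sketch, where the 0.9 threshold is an assumption to be agreed on:

```ruby
require 'json'

# Lighthouse reports category scores (performance, seo, accessibility,
# best-practices, ...) in the 0..1 range; fail the check when any
# category dips below the agreed minimum.
MIN_SCORE = 0.9

def failing_categories(report_json, min: MIN_SCORE)
  JSON.parse(report_json)['categories']
      .select { |_name, category| category['score'] < min }
      .keys
end
```

CI would run this against the report for each new or redesigned page and list the failing categories in the build output.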
For someone interested in stretching in different ways, it would be nice to support that with resources and concrete ways of improving.
Growth could be in terms of technology (and we may have docs on this already in README).
Growth could also be in terms of skills that aren't strictly technical. We should have a high-level list of these + links to specific resources for each. Some examples are:
... there are probably more!
E.g. add these links into the slack doc
We add a new recurring meeting to the dev open calendar, "Front-End Engineering Office Hours"
A place where you can come in and ask for input/pairing on everything that powers front-end product work; from usage or setup of tooling, to help debugging that pesky situation, and/or working on features/bugs in libraries in our Omakase stack or beyond.
The aim being to provide a space where someone can dial-in or come to talk to @alloy and myself about all of the above.
We want a space so that people can have structured time to fix up their setup, or ask questions about how/why things work and the office hours format feels like a good mix.
We think we can get the 27 classroom right after Monday dev standup, so it can be a matter of staying in the room / zoom a little bit longer for people.
Nothing so far.
Take the Artsy Omakase (vid, vid2) and give it a name.
We always have to explain the word omakase. It totally fits, yeah, and it references Ruby on Rails in a way I particularly enjoy. Yet it's better to give the stack a real name rather than a referential one.
Studio reflects the idea that it's a setting / place for getting things done. Which kinda fits how we settled on this stack. We didn't create any of these major projects, but our studio is where we got them all to work really well with each other to help keep it focused.
I think once renamed from “the artsy omakase” we can do:

`ohm` / Ω as a reference to omakase, but we have a private repo called `ohm`, and that's going to be confusing.

Add:
Following the PDDE Q1 retro: while teams can operate and execute sprints differently, it'd be useful for the organization to align on some terminology and share best practices. The Galleries team had a discussion about what a technical spike is, and I feel it's a good example to get alignment on.
“A spike is an investment to make the story estimable or schedule-able." (1)
When we are unable to estimate a user story from a technical perspective, it usually indicates the story is too big or too uncertain. If it's too big, we should break it down into something reasonable and estimable. If it's uncertain because of technical unknowns, a technical spike can be used to reduce the uncertainty. The goal of a technical spike should be to unblock the team so it can confidently make priority and planning decisions.
Some examples when a technical spike can be used:
The team agrees on the terminology and can use it to facilitate planning on product teams.
Name: `react-intersection-observer` or `react-visibility-sensor`

URL: https://www.npmjs.com/package/react-intersection-observer and https://www.npmjs.com/package/react-visibility-sensor

EDIT: After digging into both libraries, I propose we only consider `react-intersection-observer`. While both libraries are actively maintained and in use, `react-intersection-observer` is written in TypeScript (which integrates nicely with Artsy's preferred stack) and includes a component with functionality similar to the `VisibilitySensor` in `react-visibility-sensor`.
Allows us to track impressions of components when they enter the viewport. Could also be used in the future to trigger other actions when components enter the viewport.
Roll our own implementation with intersection observers.
Trial Atlassian OpsGenie for Engineering On-Call rotation scheduling.
By adopting an automated system for on-call rotation scheduling, we will save our team time and unlock future process improvements.
The current process (documented in artsy/README) depends upon an organizer to create a sign-up sheet, manage participation, and translate the sign-up sheet to Google Calendar events. This is a time-consuming process which doesn't result in a resource that can easily integrate with other products or processes.
OpsGenie provides a scheduling system that focuses specifically on common requirements for engineering organizations. This system would allow us to group engineering teammates into two rotations, resulting in two engineers on-call simultaneously, aligning with our current process.
We will no longer have an open signup process. Teammates will be able to swap rotations with others as necessary via overrides (docs).
Many of our documented recommendations would remain relevant in the new system. For example, we have a policy of leaving teammates out of an upcoming round if they doubled up in the previous round. In OpsGenie, an organizer could look at the previous round's roster and overrides to determine which participants had doubled up.
OpsGenie has support for alerting on-call colleagues in response to incidents and/or alerts. This functionality addresses a concern discussed during our last on-call retrospective meeting, and we're planning a follow-up iteration that takes advantage of this feature.
The trial will be scoped to the next round of on-call scheduling, targeting rotations starting May 6th, 2019. We'll continue to use the current process for our active on-call schedule.
We evaluated four different services within the On-Call Working Group. Notes on the comparison can be found here.
Atlassian JiraOps (beta) will soon be folded into the OpsGenie product, giving us another reason to trial OpsGenie over other solutions.
Our last On-Call Retrospective took place on February 6th, 2019 (notes).
Following-up from my email on finding ways to reduce waiting time in-between tasks, I wanted to get a bit more concrete with all your input.
To give an example of what I could imagine: recently we visited the office of a large tech company, and they showed us dashboards that gave engineers insight into all sorts of metrics on where and for how long engineers were spending time, which means they can understand how to affect that themselves. For instance, they recorded how much time people had to wait for their tests to run, which made it very clear that spending a little time optimizing led to an immediate, huge impact on time saved per engineer. Working more efficiently made everybody happier; nobody likes sitting around idly.
I want you all to pitch in with your pet peeves of where you feel you are being blocked from getting the things done that you want to get done and then we can jointly discuss which of these things to tackle and how.
Somewhere in our onboarding docs, we should link out to where one can find a high-level view of what each product team is responsible for.
This may or may not already exist for those teams (and will likely be in Notion?). Possible topics include:
We've all agreed that any time we PR changes to MP schemas, we should have already merged and deployed the corresponding changes to staging in upstream applications. I think we should consider including some Ruby code in every MP PR that generates the stub data required to demonstrate all cases of the new MP code.
At its most basic it should just be some code that we can paste into the console, something like:

```ruby
user = User.where(email: '[email protected]').first
show = Show.create!(whatever)
show2 = Show.create!(whatever2)
follow = FollowShow.create(user: user, show: show)
user.id
```
That's a simple example - it creates everything necessary to make some assertions, and it ends by logging out the ID you'd look up in your root query in metaphysics. The PR would then include various GraphQL queries against that exact data as well as their results, and anyone can independently verify it.
It can be a nontrivial enterprise to put Gravity into a state that exposes the exact nuances that some new MP feature requires. Logging into the console and tweaking stuff until it works is fine, if you're willing to have your changes blown away over the weekend. This also requires any PR reviewers to figure out how to generate gravity state that'll demonstrate the PR's correctness.
If we're writing unit tests all of this just gets fabricated - we understand how important it is to actually test these things in a predictable, reliable way. This is specifically for those PRs that include features that cross service boundaries.
Perhaps we can have some template for the Ruby code that wraps it in a guard clause: regenerate the data, or return it if it already exists.
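Such a template could look like the following sketch in plain Ruby, with an in-memory hash standing in for Gravity's database (in a real console you'd reach for something like `find_or_create_by`; the email and record shapes here are made up for illustration):

```ruby
# Guard-clause template: return the stub data if it already exists,
# otherwise (re)generate it, so the block is safe to paste and re-run
# even after staging data has been blown away.
STORE = {}

def find_or_generate(key)
  STORE[key] ||= yield
end

# Hypothetical stub data, mirroring the console snippet idea:
user   = find_or_generate(:user)   { { email: '[email protected]' } }
show   = find_or_generate(:show)   { { name: 'Example Show' } }
follow = find_or_generate(:follow) { { user: user, show: show } }
```

Re-running the block is a no-op for anything that already exists, which is exactly what reviewers need when verifying a PR days after it was opened.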
Every metaphysics PR includes ruby code to (re)generate its upstream data.
I thought about potentially tracking these blocks of code in git, maybe adding a `data_sets` top-level folder, but that won't do well over time since we don't track full migrations.
I think I'd be happy with a comment in the PR containing a block of code?
Take the gap in dev standup from #56 and fill it with updates from the two major practices instead.
We recently removed team updates - but maybe practice updates are a good in-between? The platform updates were always the most globally useful and with #86, maybe we're in a better spot for that too. Maybe they're high level enough, and applicable overall that we can give it a shot.
None
This came out of discussion from the future of platform.
Insight: assigning someone who has been at Artsy for a while as a mentor to a new hire is good because they know lots of things, but bad because it's probably been a long time since they set certain things up. Another new hire is in a much better position to assist with these types of things, because they just solved these problems!
For example, @pepopowitz just started and @starsirius asked if I'd like to be a mentor for him and I was happy to. I was able to help with a couple questions he had. Later he was asking about creating AWS and Jira accounts - something I haven't really thought much about.
@javamonn (for example) would be better suited to help get someone set up on Jira, since he's just done it and I actually don't even remember how I did it. Similarly, @ashleyjelks has just set up her AWS creds and ran into gotchas; she's in a much better position to help him out.
As a bonus, it feels good when you're new to a company to be asked to help with things. Makes one feel like they know stuff and are being helpful!
So, I wonder if there's a way to connect new hires so that they can learn from each other. It could be as informal as a slack group or formalized by assigning buddies in some way. Open to ideas here!
This should likely be co-located with the other support docs, but we should make sure to link to it from the onboarding checklist.
This doc may include (brainstorming):
An example might be easier to think about:
Let's say you have made changes to Exchange (which Metaphysics depends on), adding a new field to a type in its API. You will need to make a PR to Metaphysics to merge those changes into the Metaphysics global schema. In making this PR, if the changes in Exchange haven't been deployed to production, then your PR to Metaphysics will fail.
If you want to make a change further down this list, the things above it need to already be deployed to production when you are making GraphQL schema changes.
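The rule can be made concrete with a small dependency map (the map below is illustrative, not a complete picture of Artsy's services): a schema change in a service is only safe to merge once everything upstream of it has shipped to production.

```ruby
# Which services each service's schema depends on (illustrative subset).
UPSTREAM = {
  'force'       => ['metaphysics'],
  'metaphysics' => ['gravity', 'exchange'],
}

# A schema change in `service` is safe once all of its upstream
# dependencies have been deployed to production.
def safe_to_merge?(service, deployed_to_production)
  (UPSTREAM.fetch(service, []) - deployed_to_production).empty?
end
```

A schema check running on Metaphysics could use exactly this shape of lookup to fail fast with a "deploy Exchange first" style message instead of a confusing diff error.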
This is currently happening on a Force deploy with artsy/force#3061, but we'd like to move that behavior to run on Metaphysics for its dependencies (and eventually in Emission/Reaction too).
This came up in the platform practice that the ability to be sure that all your dependencies are set up is valuable enough to introduce some friction to the process.
Doing this:
You can see our discussion in slack here
e.g. Heroku Bartender, the old gravity etc
https://github.com/artsy/README/blob/master/culture/highlights.md
TL;DR it would take a lot of developer time to truly own Peril. While Peril brings a lot to our team culture, it’s not critical to the success of Artsy’s business—and we can use third-party services to accomplish the same cultural automation that Peril has enabled.
Discussion of this issue in Platform Practice
Research document completed by Justin + Matt
For each commit in a repo's mainline branch (a.k.a. the `master` branch) to theoretically be deployable; as in, the test suite/linter/etc. are all green.
And for related changes to be a single commit, or next to each other in the commit history.
Some of us work more often with a repo’s commit history, either to debug a software issue or otherwise gain information about why certain changes were made. This is made difficult when something that could semantically be considered a single change is spread out over multiple commits.
Squashing before merging means that all the related changes exist in a single commit and thus in context of each other and the commit message.
Multiple commits are problematic when trying to find the commit that introduces a change you're looking for through tools such as `git bisect`, as there's the opportunity to come across commits that are known to be broken (which are typically followed up in the PR with a 'fix' commit) and that only interrupt the workflow.
A git merge may result in commits being interleaved with changes from other PRs, which is problematic when scanning the history for changes you may be interested in and having to keep a mental model of all the commits you came across. Especially those that have a commit message such as ‘fix’ only add noise and make this harder than it needs to be.
In some cases, a PR may contain a larger set of changes that may not all be directly related to each other. In these cases the author should take care to squash the directly related changes into single commits and then request that the assignee 'Rebase and merge' instead, which will ensure they exist next to each other in the destination branch but otherwise leaves them intact.
Some reading material related to topics mentioned:
Offer some form of training on git merge vs rebase basics. (Short L&L?)
Disable the 'Merge' button on repos.
By default, ‘Squash and merge’ PRs, unless otherwise requested by adding the ‘Rebase and merge’ label.
Additionally, an assignee may be able to tell at a glance from the commit history that a PR's few commits are each their own distinct change, and decide to 'Rebase and merge' without the author explicitly asking for it.
Peril’s ‘Merge on green’ feature uses the label to choose either ‘Squash and merge’ or ‘Rebase and merge’.
Provide definition and guidance around the point people for a project.
We want projects to be owned by teams overall, but lots of temporary ownership over projects makes it hard to address technical debt, or to ensure a long-term vision on a project. One solution to this is to define point people, who are responsible for the long-term technical direction of a project, especially as a project can change ownership between teams during its lifetime.
The point people for a project should change over time as people come and go, and also grow if the project evolves in being more significant to us.
This RFC is about trying to document a consensus on what point people’s responsibilities are, and how we can improve code ownership at Artsy.
Having a clear definition of what it means to be a point person on a project at Artsy removes ambiguity. Defining what this role means helps those new to Artsy get up to speed more quickly.
We want to be confident that someone is always looking out for the overall vision of our projects so having an organic way to change who these people are helps with this confidence.
Providing a path for new people to own things helps them feel like they are a part of the team.
The impact of our projects varies and this affects the role of point people and the number required to provide confidence that the long term direction of the project will be covered adequately.
Consider these projects:
One way to distinguish between these projects is the level of impact each has on Artsy's business - as a project has more impact it should be staffed by more point people.
The aim of this RFC is to provide:
A definition for those stages
A process for either bumping a project up/down
A process for requesting changes for a point person
We aim to have every active project at Artsy assigned to a team (see the linked project list below), and there's a reasonable argument to be made that this alone can be enough for some projects. We can address that when we get there.
The Artsy Project List in artsy/potential.
We want people to feel personal growth at Artsy and one way for engineers to do that is to leverage their business impact. By providing straightforward definitions and processes we can ensure a fair playing field regardless of anyone’s assertiveness.
Note: this RFC co-authored with @orta 👬
We'd like to create a public facing status page for our gallery partners.
We believe we could use Atlassian's StatusPage for this so hopefully the amount of dev work needed is low.
The most important things to monitor externally are likely the following to start:
Who actually will update this page in the event of an outage/major disruption is up for discussion still. It may make most sense for one of the two engineers on-call in #incidents to update the status page.
Having a public facing status page would greatly improve Gallery Relations' ability to handle outages or service disruptions by allowing partners to 'self-serve'.
Currently, if there is an outage or major service disruption (e.g. Conversations is down), Partner Support has to send at least 3 messages per user who writes in:
Generally, there is only one person on-shift at a time. If 25 galleries write in, that's ~75 messages for one person to send on top of the normal queue. Having a way for partners to check if something is resolved themselves could minimize this for Gallery Relations.
An added benefit could be having a history of issues as well.
There may be other external services we do not need to monitor. The Genomer applications come to mind but there may be more.
Our current status.artsy.net page uses pingdom uptime checks which may not give us the flexibility we need externally.
Alexander is happy to do as much of it as possible to minimize the amount of time needed from engineers if possible.
You can see our discussion in Slack here
... specifically in the context of the on-call process. When does an #incident turn into a post-mortem-worthy event?
We already have https://github.com/artsy/post_mortems which links to helpful docs but doesn't answer that ☝️question. This could be part of our on-call onboarding docs or something separate.
Start a new biweekly "Incident Review" meeting series.
Agenda:
Invite list:
Meeting should be cancelled if there's nothing on the agenda.
We want to maximize the learnings from past issues and reduce the likelihood of repeating the same mistakes. Incidents are an amazing opportunity to reflect and identify what we can do better.
¯\\\_(ツ)\_/¯
This was discussed previously in a Platform Practice meeting and we agreed to have that discussion here instead before moving forward.
Document the rationale for why Artsy's various closed source repositories aren't open in their respective readmes.
Artsy Engineering defines Open Source by Default as its first Engineering Guiding Principle. Consequently, closed source projects are – in a sense – an exception. As engineers, we generally try to document exceptions and edge cases, and so documenting why a certain repo is closed would be congruent with our documentation practices.
This would introduce a slight bit of friction when creating new repositories, forcing engineers to consider why or why not a repo should be closed. This also lets Artsy Engineering revisit the decision to close a project if and when circumstances change.
Some examples of rationales:
I can't think of any exceptions to this. The rationales would, definitionally, be secret, so there's no real risk. We could always whitelist individual repos from this requirement as we come across them.
We can probably include this in the repo's `meta` section and enforce it with Peril. See this issue for more context around `meta` sections.
A PR to peril-settings adding the check, run periodically.
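As a rough sketch, the core check that such a Peril task might run could look like the following. Note that the function name, the `## Meta` heading, and the "Closed source because" marker are all assumptions for illustration, not an existing Artsy convention:

```typescript
// Sketch: does a closed repo's README document why it's closed? The
// "## Meta" heading and "Closed source because" marker are assumed
// conventions for illustration, not an existing Artsy standard.
function hasClosedSourceRationale(readme: string): boolean {
  const metaSection = readme.split(/^## Meta$/m)[1]
  if (!metaSection) return false
  return /closed source because/i.test(metaSection)
}

// A scheduled Peril task could fetch each private repo's README via the
// GitHub API and fail (or notify) when this returns false.
```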
Define what "20% time" at Artsy is.
20% time is important in determining project estimation, in giving engineers time for career growth, in keeping our systems happy, and more. But it lacks a clear definition – you can ask five PDDE staff what 20% is and get five different answers. Its relationship to Future Fridays is also not well-understood.
This leads to some teams, like Sell, having a strong culture of 20% time. However, other teams frequently mention a lack of using 20% in their sprint retrospectives. This leads to unnecessary feelings of ambiguity and uncertainty.
PDDE Leadership should define 20% time so we can reference it, apply it consistently, and iterate on the practice. I don't want to see comments attempting to define 20% time on this RFC; I'd prefer to follow up after this RFC is accepted and define it somewhere authoritative.
Some key questions:
Feel free to add more questions in the comments – if the RFC is accepted, PDDE Leadership can address them.
None.
I brought this up in our Q3 retrospective.
Define a framework for system "criticality." For each system, evaluate how critical it is and where it fits in this framework. For each framework level and its corresponding systems, decide things like:
This is based on work @dleve123 started on a service health framework (PLATFORM-1098) as well as discussions with many of you. I also think it aligns nicely with the professionalization of our on-call process. In particular, we would like to identify gaps in the health and resilience of our platform as well as prioritize efforts to address them. However our platform comprises dozens of systems with very different needs, so we should recognize those differences with a shared vocabulary, and try to set expectations accordingly.
This implies several sets of decisions, all of which are up for discussion in this RFC and/or later:
But I'll try to provide a sensible starting point below:
Systems that are essential to basic business operations such as registration, authentication, browsing, inquiring, bidding, and buying. These systems understandably experience a relatively high throughput, and any disruptions can have sizable financial and brand impact.
Examples of systems: Gravity, Force, Volt, Causality, Positron, Exchange
Expectations include all of those from lower levels plus:
Systems with limited throughput or public-facing functions. Disruptions may interfere with certain business operations or have mild financial or brand impact.
Examples of systems: Induction, Kaws, Diffusion, Prediction, Pulse, Vibrations
Internal utilities or systems with only occasional usage. Experimentation should be cheap and easy, and some tools serve only a few specific individuals or roles.
Examples of systems: Candela, Phonon, Waves, Apogee, Doppler, Torque, Helix, APRB, Vanity
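The ladder structure above (each level inheriting the expectations of the levels below it) can be sketched in code. The level names and the example expectations below are illustrative assumptions only, not decisions:

```typescript
// Sketch: modeling the proposed criticality ladder. Level names and the
// example expectations are assumptions for illustration, not decisions.
type Criticality = "experimental" | "standard" | "critical"

const ladder: Criticality[] = ["experimental", "standard", "critical"]

const baseExpectations: Record<Criticality, string[]> = {
  experimental: ["README listing an owner"],
  standard: ["staging environment", "error tracking"],
  critical: ["on-call coverage", "uptime monitoring and alerting"],
}

// "Expectations include all of those from lower levels plus: ..." —
// i.e. each level inherits everything below it on the ladder.
function expectationsFor(level: Criticality): string[] {
  const index = ladder.indexOf(level)
  return ladder.slice(0, index + 1).flatMap(l => baseExpectations[l])
}
```

With this shape, `expectationsFor("critical")` accumulates the expectations of all three levels, while `expectationsFor("experimental")` returns only its own.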
Service health framework: https://artsyproduct.atlassian.net/browse/PLATFORM-1098
Service level indicators, objectives, and agreements: https://landing.google.com/sre/sre-book/chapters/service-level-objectives/
Maturity model: https://martinfowler.com/bliki/MaturityModel.html
Automate the rollout of configuration changes to Kubernetes yaml specifications along with code changes. Adopt ideas from Helm (or integrate it into Hokusai) to create a deployment manifest for each deployment, interpolating the deployment tag (Git commit SHA1) and timestamp. Save each manifest in Kubernetes as an audit log, which would facilitate rolling code and configuration forward and backward in concert.
Hokusai was originally designed to decouple rolling out code changes (i.e. `hokusai [staging|production] deploy {COMMIT_SHA1}`) from configuration changes pertaining to that application's environment (i.e. `hokusai [staging|production] update`). Initially the overhead of Helm seemed unnecessary given that our configuration changes were very infrequent compared to code changes, and Helm is designed for semantically versioned releases. However, this has caused us problems.
We have seen cases where this decoupling is a source of confusion: the `update` command runs `kubectl apply -f {path-to-local-hokusai/staging-or-production.yml}` against the local checkout of the repo, so in essence, the developer's local git state becomes the source of truth.
Also, maintaining the `staging` and `production` image tags as pointers to the currently deployed Git SHA1 tag has forced us to bust the Docker cache and always check the latest image for these tags, which has had downstream effects resulting in downtime due to ECR unavailability.
This proposal is to adopt an org-wide deployment strategy that is fully automated for code as well as configuration changes. However, it will be a breaking change and will require migrating projects as well as the version of Hokusai running against those projects. Release 0.5.1 introduces a compatibility check to facilitate this migration.
The `hokusai pipeline promote` deployment strategy will have to be abandoned, as it simply gets the deployed image on staging and promotes it to production, without regard to any configuration changes.
This change proposes interpolating variables when creating a new deployment, rather than when setting up a new project, so https://github.com/artsy/artsy-hokusai-templates would also change in structure (i.e. the templates would not be rendered when creating a project; instead, template variables would remain in the yaml configuration).
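A minimal sketch of that deploy-time interpolation step follows. The `{{VAR}}` placeholder syntax and the function name are assumptions for illustration; the actual syntax would be decided as part of restructuring the templates:

```typescript
// Sketch: rendering a deployment manifest from a yaml template at deploy
// time, substituting e.g. the Git SHA1 and a timestamp. The {{VAR}}
// placeholder syntax is an assumption for illustration.
function renderManifest(template: string, vars: Record<string, string>): string {
  return template.replace(/\{\{(\w+)\}\}/g, (_match, name: string) => {
    if (!(name in vars)) {
      // Fail loudly rather than deploying a manifest with holes in it.
      throw new Error(`Missing template variable: ${name}`)
    }
    return vars[name]
  })
}
```

For example, `renderManifest("image: artsy-app:{{GIT_SHA1}}", { GIT_SHA1: "abc123" })` yields `"image: artsy-app:abc123"`, and the rendered manifest could then be saved in Kubernetes as the audit-log entry described above.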
The `staging` and `production` tags would become redundant, and this might very well be a good thing, as we have seen the `staging` tag conflict with the `staging` branch in Positron's repo, and we currently don't push Git tags because of it. cc @dleve123 @eessex
@alloy has brought up issues of revision history and rolling back, which needs to be done across multiple deployments (i.e. Puma / Sidekiq for a given application)
@ansor4, @joeyAghion, @dleve123, and I recently encountered a strange situation where we released a config change for a Gravity Sidekiq bidding worker before the code change that referenced the Sidekiq config was checked into the project and built into the image.
Create a shared space to organize and track PDDE cross-team work.
Tasks within the Product, Design, Data, and Engineering (PDDE) organization are often organized and tracked within individual product teams. Product teams within the organization have largely standardized on using Jira Software.
Outside of product teams, there are other structures in which colleagues collaborate on important cross-cutting efforts. So far we have Practices, Working Groups, and Task Forces. These different groups organize and track their tasks in different places.
Examples
The reduced visibility of work within these cross-team groups may have an adverse effect on planning within individual product teams. It can be difficult to anticipate how work managed outside of product team Jira Software projects may affect a team’s bandwidth in current or future sprints.
A solution might:
None yet
I originally chatted about this casually with @iskounen. I've since brought it up in a couple different Practice / Working Group meetings and during the last Front-End Practice meeting.
Validate the problem and update the RFC with a specific proposal
Participation in this process will be opt-in for Practice, Working Group, and Task Force leaders. Much of the process for the new Jira project will be sourced from the discussions here but there will be another PR with its documentation to provide feedback on as well.
Document the engineering team's process for requesting training, conference attendance, and learning materials.
We want to have clarity in what types of professional development activities are reimbursed and how to request them. Clarifying this process will hopefully lead to increased requests to pursue opportunities for learning and growing in the engineering team, and a smoother request/approval process for both engineers and their managers.
A playbook document outlining the current process for requesting professional development. This playbook should address questions that engineers and managers have about the process.
I'd like to see us improve the process of software capitalization so that ICs double-check their allocations and start/end dates, with documentation added as we go. I'd also like to see us work on this monthly so that we aren't in a rush at the end of a quarter and blocking Rob and his team from closing things out.
Then, once improved we should document the process so that it's easier for new TLs to get up to speed and is a good reminder for existing ones.
Waiting until the end of the quarter to get Accounting what they need for closing the quarter puts added stress on things - moving to do this continuously where possible and monthly otherwise should spread things out in a way that makes everyone feel super cool.
Can't think of any!
The current process is documented here:
https://www.notion.so/artsy/Software-Capitalization-22c90e835422423399323b7835f353a3
We'd update that process section, but there's some good context at the top of that doc that's helpful if this is new to you!
Here are the concrete things I'd like to see happen:
Each team already has a weekly or bi-weekly meeting for the PM, TL, designer and data person to chat. We look at the project board in Notion - let's just take a second to make sure all start/end dates and evidence links are there or assign them if not.
Shell is here: https://github.com/artsy/README/blob/master/onboarding/mentors.md
Should/could include things like:
cc @xtina-starr for the idea! (feel free to edit this description, haha)
We should add a doc on how code goes from one's dev environment --> production. It could describe the process for a web app vs. iOS, or go in-depth into the deployment process for a system like Gravity.
Fail PRs which add new keys to the `dependencies` section of the `package.json` in Reaction / Emission.
The fail would then be removed when an RFC is referenced in the body of the PR. Perhaps it can look for a string like `RFC: [url]` inside the body. Also support a `#skip_rfc` tag to skip the fail.
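The decision logic for such a check could be sketched as a plain function; the function name and option shape are assumptions, and a real version would read the `package.json` diff and PR body via Peril's Danger DSL:

```typescript
// Sketch: decision logic for the proposed dependencies check. The name
// and option shape are assumptions; a real implementation would live in
// a Peril dangerfile reading the PR diff and body.
function shouldFailPR(opts: { addedDependencyKeys: string[]; prBody: string }): boolean {
  // Nothing added to `dependencies` — nothing to gate.
  if (opts.addedDependencyKeys.length === 0) return false
  // An RFC reference like "RFC: <url>" in the body lifts the fail.
  const referencesRFC = /RFC:\s*\S+/.test(opts.prBody)
  // An explicit #skip_rfc tag also skips the fail for trivial PRs.
  const skipsRFC = opts.prBody.includes("#skip_rfc")
  return !referencesRFC && !skipsRFC
}
```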
We'd like to find a better space to discuss additions to our two larger component libraries; it can be a shame to write a PR only to have it blocked on the discussion around adding a dependency.
This is mainly to:
Some recent examples:
Slack:
Some PRs are pretty trivial, and so it should be possible to skip the discussion.
You can see our discussion in Notion (it's a bit light on notes because I was the one taking notes and talking at the same time).
If OSS by Default means starting with open, what are the right arguments for saying something should be closed/private?
Re: https://github.com/artsy/potential/issues/166#issuecomment-412677219
Right now, this lives in platform: `practices/platform/technology-roadmap.md` - but ideally it is scoped to all teams.
Provide explicit recommendations in the PDDE playbook for when employees should take time off. Situations like high stress or after hours work.
A time off recommendation may be something like:
These are just examples.
At Artsy we believe people are paramount. We generally have a healthy appreciation for work-life balance. Leadership often encourages people to take a little time to recharge after they've put in a lot of effort. There's almost a mantra among managers of telling their direct reports to take time off. It's great that that mindset exists, but I propose we extend and formalize it a bit.
Sometimes when issues arise that require a lot of investment in time and energy it becomes really hard to disconnect again. Providing a clear, explicit policy gives us a way to socialize and reinforce the idea that taking time to recharge is an important step in continuing to deliver quality worthy of art. It gives managers tools to reach for to provide concrete recommendations. It provides transparency to the org so that we can take rest time into account for when we're planning product deliverables.
The team agrees, we come up with a set of recommendations, leadership and people ops buy in, and we add them to the PDDE time off recommendations.
Restructure the Platform practice to embrace getting platform-aligned work done throughout product teams. To achieve this, we should reset the practice's membership and cancel the inconsistent stand-up forum. Instead, self-selected engineers should come together more deliberately in a weekly hour-long session to review (1) active projects, (2) trends around performance, incidents, risks, and (3) an in-depth topic based on an evolving, openly-curated agenda. This is hopefully where opportunities, projects, or decisions can surface.
This would be a significant investment of time for its members, so we should re-evaluate the format and interval after a few iterations. By agreeing on each meeting's in-depth topic in advance, we could also support engineers opting in to some sessions without committing to all of them.
For reasons like that, we should encourage (but not enforce) that each product team be represented within the practice.
Finally, for clarity we've been talking about renaming the Platform team to Infrastructure. It's less ambiguous with the practice that way. It also more directly points to the specific KPIs for which the team can be accountable, as opposed to long-term platform-wide concerns that benefit from more collective ownership.
The practice hasn't settled on a productive form or forum since the reorganization of product teams. The current practice stand-up is attended inconsistently and doesn't allow us to discuss and resolve anything in-depth. Instead, these discussions happen in ad hoc design sessions, [too late] on pull requests, or not at all.
Important projects like BNMO and local discovery provide clear opportunities to build new capabilities for the entire platform, but in practice we've preferred to do that work in close collaboration with (or in) focused product teams. More minor projects also benefit from being executed in ways that leverage platform resources and advance the whole product.
Finally, we have a new, bigger team. The "practice" should be open to everyone rather than an artifact of the original Platform team.
A major challenge will be ensuring platform-wide priorities are adequately recognized and integrated with product work. At the very least, notes, resolutions, outcomes, or documents should be reported out to the rest of engineering. When significant projects need investment, these will have to be either prioritized by the Platform team itself (Infrastructure!), lobbied for in product teams by practice members or technical leads, or escalated to quarterly priorities such that they're staffed explicitly.
It's helpful to imagine examples of in-depth topics we might take on as a practice:
Could be the setup/teardown todo list, a touch of history, etc.