artsy/README

:wave: The documentation for being an Artsy Engineer

License: Creative Commons Attribution 4.0 International
Example: Technical thread - Non-technical thread
What do you think @sweir27 ?
Drop the Team Updates agenda item in Engineering Open Standup.
I successfully got Reduce Duplication at Engineering Open Standup merged last month. Part of the reasoning behind that proposal was that Engineers would have just been in the Sprint Kickoff Meeting prior to Open Standup, so the Team Updates were pretty redundant from that fact alone. However, since that RFC was merged, the Sprint Kickoff Meeting morphed into Product Team Office Hours.
This sorta leaves us in a state where it's worth reviewing our choice here to make sure we're going in the right direction. I was chatting with @ashfurrow about this and his thought was that we should either drop these updates completely or bring them back every week. I opted for the former for this RFC.
I don't get much value out of the team updates at Open Standup, but would love to know how others feel. I've been happy with getting this type of info from skimming the Sprint Overview email that goes out. On that last RFC there was def some interest in dropping these updates completely or changing the focus so those ideas are certainly in scope as well.
One additional advantage of dropping these updates is that the meeting might have a slightly less Product vibe and more of a meeting where Engineering can just be Engineering. 🤓
There are times when teams have to share info with the rest of the group and Open Standup is a great avenue for this, so we should ensure there's still a good way for this to happen! If the info to share is covered by Cross-dependencies / Requests for Pairing or New Milestones / Repos / Blog posts sections, then great. Otherwise, the Closing Announcements section could be a good catch all.
The Reduce Duplication at Engineering Open Standup RFC from last month.
Our current RFC process is a fantastic mechanism for raising important conversations around large or controversial changes. I believe there are still optimizations we can implement in the process to make it even better.
I propose that we
Edit: I've updated the proposal to remove the second part. It seems like preliminary feedback is against that, and I'm not so invested in that part of this RFC to push for it. I would still like to tackle the problem of resolving RFCs being a little murky. Open to ideas on what that might look like.
(props to @orta for the suggestion on this one)
This just ensures we're doing our best to resolve RFCs in a timely manner.
To describe a sponsor's role I'll borrow a part of coming to consensus from Mark Shepard:
A chosen facilitator can help consensus by keeping the discussion on track, encouraging good process, and posing alternatives that may resolve differences. But a facilitator is a servant, not a director, and assumes a neutral role. If a facilitator wishes to take a stand on an issue, the task of facilitating is handed to someone else.
So a sponsor's role will be twofold:
1. Help guide the conversation and resolve differences
2. Take the responsibility for deciding the resolution state
The sponsor should be neutral in the conversation, open to any resolution, and willing to help facilitate. If an RFC loses its sponsor it's considered stalled until a new sponsor is found.
Metaphysics, Exchange, Convection, and Gravity (for GravQL), plus any others I can't think of (Impulse? Pulse?), have a `_schema.graphql` file in the root of their respective repos. We add a pre-commit hook that ensures the schema is always up-to-date.
Mainly to move PR review up a level of abstraction so we can easily discuss schema design; secondarily, to improve tooling.
`_schema.graphql` is a weird name, yes, but I want it to be at the top of the PR every time.

Nothing I can think of? Probably worth adding a CI check though, in case people decide to skip the pre-commit hook.
A hat-tip to @cjoudrey who brought this idea up and discussed its merits within GitHub's API at the NYC GraphQL meetup.
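The CI check mentioned above could boil down to a string comparison between the committed `_schema.graphql` and a freshly dumped schema. A minimal sketch; the dump step is stubbed here, since a real check would shell out to each app's own schema-dump task:

```ruby
# Sketch of the proposed CI check: fail the build when the committed
# _schema.graphql no longer matches the schema the app actually serves.
def schema_up_to_date?(committed, current)
  # Normalize trailing whitespace so formatting noise doesn't fail CI.
  committed.strip == current.strip
end

# In CI this would be something like:
#   committed = File.read('_schema.graphql')
#   current   = dump_schema   # hypothetical: invoke the app's dump task
current = "type Query {\n  artwork(id: ID!): Artwork\n}"
```

The check stays deliberately dumb: any semantic diffing can be left to humans reading the `_schema.graphql` diff at the top of the PR.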
More below, but this proposes a new process to keep track of incidents and communicate their status via tools including Jira Ops and Status Page.
As our on-call process has evolved, we’ve identified a few areas of improvement:
To make it easy for anyone at Artsy to raise incidents, keep track of which incidents are occurring/have occurred, and their potential impact and follow-up.
There are many ways to improve our fledgling on-call support process, and these changes do not attempt to solve all of them.
Our ultimate goal is to create a process that is transparent for both incident responders and incident reporters. Accomplishing this will at least involve changes to our process on engineering, changes to how stakeholders interact with our team, and potential additional tools and budget.
The changes proposed here focus only on the first part: how we can improve the Engineering team’s process.
Intentionally not included here are:
Some of these, especially the first two items in that list, are top-of-mind. They're not in this proposal because we felt it would be too much to introduce all at once and want to give this process a try first.
Join the #on-call-working-group slack channel to see discussions around this and other topics related to how we want to handle on-call.
See https://github.com/artsy/potential/pull/134 for the initial explanation behind our current process (which has since evolved!).
If we decide to move forward with this RFC, the next steps will be to:
If this is added to the master support playbook, we'd also like to include:
Our “open-source by default” approach has helped create a world class Engineering brand, but is being tested with open vs. closed onboarding and public vs. private documentation. There’s a clear need for writing down the definitions around this area beyond the short introduction of who we are.
Also #2
Share links to different teams retro boards publicly across PDDE.
Retros are great places to discuss our internal team processes and possible areas for improvement. What's possibly missing in our current approach is sharing the results of these retros and finding common pain points and improvement areas.
For example, I've been in retros on different teams where something like missing integration tests has come up a few times. Currently these stay within the team, but if we share these boards across PDDE, common patterns will emerge and we can act on them more easily.
At past workplaces, each team's retro boards were actual post-its hanging on the wall for the rest of the week, so you could just pass by, see what's going on in other teams, spot common patterns, and kick off a wider team effort to fix them.
None?
The only issue I can think of is that, knowing these retro boards currently stay within each team, some people may not feel comfortable opening them up across PDDE; but I think overall we'd benefit more from having them public.
Finding a proper place to share links to retro boards
Specify responsibilities for PR reviewers/assignees.
I propose that the following guidelines apply to all PRs created in the Artsy org:
People assigned to review a PR should ideally review before merge. It is not required that all reviewers complete reviews, but it is recommended. Authors should also do their best to limit the number of reviewers in order to keep PRs moving quickly.
Developers who should be notified of the PR but whose review is not required before merge should be cc'd in the PR body instead of assigned to review.
Reviewers should make a good-faith effort to review within one business day of being assigned, unless the PR's author specifies otherwise in the body of the PR. If a reviewer requires more time for any reason, they should communicate with the author of the PR so that the author can designate a different reviewer if need be.
PRs should be assigned to a single person. This helps to keep expectations clear and stops PRs from getting stuck ("oh, I thought the other person was the more important assignee").
The person assigned to the PR is only responsible for moving the PR forward (merging or re-assigning the PR), and not necessarily for reviewing. If the author wants the assignee to review, they should also assign them as a reviewer.
It is still the responsibility of the PR's author to get the PR to the point where it is ready to be merged. It is also the author's responsibility to notify the assignee when they believe the PR is ready to be merged, at which point the assignee should make a good-faith effort to merge or request changes within one business day.
Once an assignee has finished reviewing, they can:
These criteria should be fulfilled before a PR is merged:
PRs should not initially be self-assigned (though they may be assigned back to the author by the original assignee). This ensures that there is a final check from a fresh pair of eyes before the PR is merged.
Different engineers have different practices when it comes to PRs. Some ask 3 - 4 people to review, some only 1 person. Some self-assign their PRs, while others assign them to teammates.
This can create a lot of confusion, especially when engineers transition between teams and have to adjust to the styles of their new collaborators. It also slows the pace of progress, as it may be unclear who is responsible for making sure a PR is merged as fast as safely possible.
Ideally, this would be applied across teams and individual engineers so that everyone has the same expectations for what happens when they create a PR and designate reviewers and assignees. This would require close to 100% agreement on what these roles are and mean, so I'll do my best to make sure that everyone has a chance to voice their opinions or concerns.
The timelines may also change depending on the urgency of the PR. Ideally, authors will note things like "this is a high-priority PR and would ideally be merged within the next few hours" or "this is a low-priority PR and as such could wait until so-and-so gets back from vacation to review, since they're the most familiar with this system." Could be a good case for a Peril rule + auto-tagging, actually.
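To illustrate what such an auto-tagging rule might key off (a sketch only; the matched phrases and label names are assumptions, not an agreed convention, and in practice this would live in a Peril/Danger rule):

```ruby
# Hypothetical auto-tagging rule: scan the PR body for a priority note
# and derive a label for the PR.
def priority_label(pr_body)
  case pr_body
  when /high[- ]priority/i then 'priority: high'
  when /low[- ]priority/i  then 'priority: low'
  else                          'priority: normal'
  end
end
```

The label could then drive different review-turnaround expectations, e.g. hours for high priority versus "whenever the right reviewer is back" for low.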
If we're able to agree on guidelines, those guidelines would become part of Artsy's README and would be surfaced to new hires. They would also be raised in Engineering standups and in the #dev channel to ensure that the whole team is aware and on board.
cc @xtina-starr for the idea!
One of the items in our onboarding talks about taking some time to tour around Artsy.net in a staging environment, doing things like bidding in an auction and making an inquiry.
It would be cool if we had a doc that outlined some of these and provided some high-level instructions on how to actually do them. 😄
Remove any single line breaks in Markdown files.
It seems like many files hard-wrap at 80ish characters, which seems completely unnecessary as all editors do just fine wrapping text.
The extra CR/LFs are driving me bananas.
This could be automated via Danger.
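The core of that automation could be a small transform that joins hard-wrapped lines within each paragraph. A minimal sketch in plain Ruby (a real Danger rule would also need to skip code fences, lists, and tables, which this deliberately ignores):

```ruby
# Join single line breaks within paragraphs, keeping blank-line
# paragraph separators intact. Deliberately naive: fenced code blocks,
# lists, and tables would need to be left alone in a real rule.
def unwrap_paragraphs(markdown)
  markdown
    .split(/\n{2,}/)                            # paragraphs
    .map { |para| para.gsub("\n", ' ').strip }  # drop hard wraps
    .join("\n\n")
end
```

Danger could run this over changed `.md` files and warn (or auto-fix) when the output differs from the input.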
Add a license to the Artsy Engineering blog: https://github.com/artsy/artsy.github.io
This applies to over 200 existing blog posts, so I thought it was important enough to warrant an RFC.
I was surprised to learn that our blog repo currently lacks a license – maybe because it's one of our oldest OSS repos and we didn't know better? In any case, we should license it. Our standard license is MIT, but that license is generally used for code, and not necessarily for content. For this reason, I suggest we use CC BY 4.0 for the blog content itself. I've been doing this on my blog to make sure people know that they can use what they read there. (The Attribution clause would require them, if they modify the blog's contents and re-release them, to attribute the original post to us.)
We could, alternatively, omit the CC BY 4.0 license if we're feeling averse to getting into the weeds of licensing, but I think it's worth it.
None that I can think of.
This came up while developing the new Artsy Engineering blog. Our decision in this RFC would apply to that new repo, too.
Adding a `LICENSE` file with the following contents:
MIT License
Copyright (c) 2019 Artsy Engineering
Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:
The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
Make the following changes to `README.md`:

```diff
+
+## License
+
+The code in this repository is released under the MIT license. The contents of the blog itself (ie: the contents of
+the `_posts` directory) are released under
+[Creative Commons Attribution 4.0 International License](https://creativecommons.org/licenses/by/4.0/).
```
When creating or updating new apps, code changes often do not take into account SEO or accessibility. These often become work items much later on when surfaced as bugs.
A few examples:
`alt` text for images

Because many of these items are considered best practices, I believe it would be a good idea to start assembling a list of acceptance criteria for new and redesigned pages with respect to basic SEO and accessibility standards.
A simple baseline might be a minimum Lighthouse score; running this check will usually produce a list of straightforward improvements. I'd imagine the SEO team would have ideas on further points to double-check before launching new products.
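A baseline like that could be enforced in CI by parsing Lighthouse's JSON report (`lighthouse --output=json`) and failing when any category scores below the minimum. A sketch, where the 0.9 threshold is an assumption to be agreed on:

```ruby
require 'json'

# Lighthouse reports category scores (performance, seo, accessibility,
# best-practices, ...) in the 0..1 range; fail the check when any
# category dips below the agreed minimum.
MIN_SCORE = 0.9

def failing_categories(report_json, min: MIN_SCORE)
  JSON.parse(report_json)['categories']
      .select { |_name, category| category['score'] < min }
      .keys
end
```

CI would run this against the report for each new or redesigned page and list the failing categories in the build output.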
For someone interested in stretching in different ways, it would be nice to support that with resources and concrete ways of improving.
Growth could be in terms of technology (and we may have docs on this already in README).
Growth could also be in terms of skills that aren't strictly technical. We should have a high-level list of these + links to specific resources for each. Some examples are:
... there are probably more!
E.g. add these links into the slack doc
We add a new recurring meeting to the dev open calendar, "Front-End Engineering Office Hours"
A place where you can come in and ask for input/pairing on everything that powers front-end product work; from usage or setup of tooling, to help debugging that pesky situation, and/or working on features/bugs in libraries in our Omakase stack or beyond.
The aim being to provide a space where someone can dial-in or come to talk to @alloy and myself about all of the above.
We want a space so that people can have structured time to fix up their setup, or ask questions about how/why things work and the office hours format feels like a good mix.
We think we can get the 27 classroom right after Monday dev standup, so it can be a matter of staying in the room / zoom a little bit longer for people.
Nothing so far.
Take the Artsy Omakase (vid, vid2) and give it a name.
We always have to explain the word omakase. It totally fits, yeah, and it references Ruby on Rails in a way I particularly enjoy. Yet it's better to give the stack a real name rather than a referential one.
Studio reflects the idea that it's a setting / place for getting things done. Which kinda fits how we settled on this stack. We didn't create any of these major projects, but our studio is where we got them all to work really well with each other to help keep it focused.
I think once renamed from “the artsy omakase” we can do:

`ohm` / Ω as a reference to omakase, but we have a private repo called `ohm`, and that's going to be confusing.

Add:
Following the PDDE Q1 retro: while teams can operate and execute sprints differently, it'd be useful for the organization to align on some terminology and share best practices. The Galleries team had a discussion about what a technical spike is, and I feel it's a good example to get alignment on.
“A spike is an investment to make the story estimable or schedule-able." (1)
When we are unable to estimate a user story from a technical perspective, it usually indicates the story is too big or too uncertain. If it's too big, we should break it down into something reasonable and estimable. If it's uncertain because of technical unknowns, a technical spike can be used to reduce the uncertainty. The goal of a technical spike should be to unblock the team so it can confidently make priority and planning decisions.
Some examples when a technical spike can be used:
The team agrees on the terminology and can use it to facilitate planning on product teams.
Name: `react-intersection-observer` or `react-visibility-sensor`

URL: https://www.npmjs.com/package/react-intersection-observer and https://www.npmjs.com/package/react-visibility-sensor

EDIT: After digging into both libraries, I propose we only consider `react-intersection-observer`. While both libraries are actively maintained and in use, `react-intersection-observer` is written in TypeScript (which integrates nicely with Artsy's preferred stack) and includes a component with functionality similar to the `VisibilitySensor` in `react-visibility-sensor`.
Allows us to track impressions of components when they enter the viewport. Could also be used in the future to trigger other actions when components enter the viewport.
Roll our own implementation with intersection observers.
Trial Atlassian OpsGenie for Engineering On-Call rotation scheduling.
By adopting an automated system for on-call rotation scheduling, we will save our team time and unlock future process improvements.
The current process (documented in artsy/README) depends upon an organizer to create a sign-up sheet, manage participation, and translate the sign-up sheet to Google Calendar events. This is a time-consuming process which doesn't result in a resource that can easily integrate with other products or processes.
OpsGenie provides a scheduling system that focuses specifically on common requirements for engineering organizations. This system would allow us to group engineering teammates into two rotations, resulting in two engineers on-call simultaneously, aligning with our current process.
We will no longer have an open signup process. Teammates will be able to swap rotations with others as necessary via overrides (docs).
Many of our documented recommendations would remain relevant in the new system. For example, we have a policy of leaving teammates out of an upcoming round if they doubled up in the previous round. In OpsGenie, an organizer could look at the previous round's roster and overrides to determine which participants had doubled up.
OpsGenie has support for alerting on-call colleagues in response to incidents and/or alerts. This functionality addresses a concern discussed during our last on-call retrospective meeting, and we're planning a follow-up iteration that takes advantage of this feature.
The trial will be scoped to the next round of on-call scheduling, targeting rotations starting May 6th, 2019. We'll continue to use the current process for our active on-call schedule.
We evaluated four different services within the On-Call Working Group. Notes on the comparison can be found here.
Atlassian JiraOps (beta) will soon be folded into the OpsGenie product, giving us another reason to trial OpsGenie over other solutions.
Our last On-Call Retrospective took place on February 6th, 2019 (notes).
Following-up from my email on finding ways to reduce waiting time in-between tasks, I wanted to get a bit more concrete with all your input.
To give an example of what I could imagine: recently we visited the office of a large tech company, and they showed us dashboards that gave engineers insight into all sorts of metrics on where and for how long engineers were spending time, which means they can understand how to affect that themselves. For instance, they recorded how much time people had to wait for their tests to run, which made it very clear that spending a little time optimizing led to an immediate, huge impact on time saved per engineer. Working more efficiently made everybody happier; nobody likes sitting around idly.
I want you all to pitch in with your pet peeves of where you feel you are being blocked from getting the things done that you want to get done and then we can jointly discuss which of these things to tackle and how.
Somewhere in our onboarding docs, we should link out to where one can find a high-level view of what each product team is responsible for.
This may or may not already exist for those teams (and will likely be in Notion?). Possible topics include:
We've all agreed that any time we PR changes to MP schemas, we should have already merged and deployed the corresponding changes to staging in upstream applications. I think we should consider including some Ruby code in every MP PR that generates the stub data required to demonstrate all cases of the new MP code.
At its most basic it should just be some code that we can paste into the console, something like:

```ruby
user = User.where(email: '[email protected]').first
show = Show.create!(whatever)
show2 = Show.create!(whatever2)
follow = FollowShow.create(user: user, show: show)
user.id
```
That's a simple example - it creates everything necessary to make some assertions, and it ends by logging out the ID you'd look up in your root query in metaphysics. The PR would then include various GraphQL queries against that exact data as well as their results, and anyone can independently verify it.
It can be a nontrivial enterprise to put Gravity into a state that exposes the exact nuances that some new MP feature requires. Logging into the console and tweaking stuff until it works is fine, if you're willing to have your changes blown away over the weekend. This also requires any PR reviewers to figure out how to generate gravity state that'll demonstrate the PR's correctness.
If we're writing unit tests all of this just gets fabricated - we understand how important it is to actually test these things in a predictable, reliable way. This is specifically for those PRs that include features that cross service boundaries.
Perhaps we can have some template for the Ruby code that wraps it in a guard clause: regenerate the data, or return it if it already exists.
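Such a template could look like the following sketch in plain Ruby, with an in-memory hash standing in for Gravity's database (in a real console you'd reach for something like `find_or_create_by`; the email and record shapes here are made up for illustration):

```ruby
# Guard-clause template: return the stub data if it already exists,
# otherwise (re)generate it, so the block is safe to paste and re-run
# even after staging data has been blown away.
STORE = {}

def find_or_generate(key)
  STORE[key] ||= yield
end

# Hypothetical stub data, mirroring the console snippet idea:
user   = find_or_generate(:user)   { { email: '[email protected]' } }
show   = find_or_generate(:show)   { { name: 'Example Show' } }
follow = find_or_generate(:follow) { { user: user, show: show } }
```

Re-running the block is a no-op for anything that already exists, which is exactly what reviewers need when verifying a PR days after it was opened.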
Every metaphysics PR includes ruby code to (re)generate its upstream data.
I thought about potentially tracking these blocks of code in git, maybe adding a `data_sets` top-level folder, but that won't do well over time since we don't track full migrations.
I think I'd be happy with a comment in the PR containing a block of code?
Take the gap in dev standup from #56 and fill it with updates from the two major practices instead.
We recently removed team updates - but maybe practice updates are a good in-between? The platform updates were always the most globally useful and with #86, maybe we're in a better spot for that too. Maybe they're high level enough, and applicable overall that we can give it a shot.
None
This came out of discussion from the future of platform.
Insight: assigning someone who has been at Artsy for a while as a mentor to a new hire is good because they know lots of things, but bad because it's probably been a long time since they set certain things up. Another new hire is in a much better position to assist with these types of things, because they just solved these problems!
For example, @pepopowitz just started and @starsirius asked if I'd like to be a mentor for him and I was happy to. I was able to help with a couple questions he had. Later he was asking about creating AWS and Jira accounts - something I haven't really thought much about.
@javamonn (for example) would be better suited to help get someone set up on Jira, since he's just done it and I actually don't even remember how I did it. Similarly, @ashleyjelks has just set up her AWS creds and ran into gotchas; she's in a much better position to help him out.
As a bonus, it feels good when you're new to a company to be asked to help with things. Makes one feel like they know stuff and are being helpful!
So, I wonder if there's a way to connect new hires so that they can learn from each other. It could be as informal as a slack group or formalized by assigning buddies in some way. Open to ideas here!
This should likely be co-located with the other support docs, but we should make sure to link to it from the onboarding checklist.
This doc may include (brainstorming):
An example might be easier to think about:
Let's say you have made changes to Exchange (which Metaphysics depends on), adding a new field to a type in its API. You will need to make a PR to Metaphysics to merge those changes into the Metaphysics global schema. In making this PR, if the changes in Exchange haven't been deployed to production, then your PR to Metaphysics will fail.
If you want to make a change further down this list, the things above it need to already be deployed to production when you are making GraphQL schema changes.
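The rule can be made concrete with a small dependency map (the map below is illustrative, not a complete picture of Artsy's services): a schema change in a service is only safe to merge once everything upstream of it has shipped to production.

```ruby
# Which services each service's schema depends on (illustrative subset).
UPSTREAM = {
  'force'       => ['metaphysics'],
  'metaphysics' => ['gravity', 'exchange'],
}

# A schema change in `service` is safe once all of its upstream
# dependencies have been deployed to production.
def safe_to_merge?(service, deployed_to_production)
  (UPSTREAM.fetch(service, []) - deployed_to_production).empty?
end
```

A schema check running on Metaphysics could use exactly this shape of lookup to fail fast with a "deploy Exchange first" style message instead of a confusing diff error.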
This is currently happening on a Force deploy with artsy/force#3061, but we'd like to move that behavior to run on Metaphysics for its dependencies (and eventually in Emission/Reaction too).
This came up in the platform practice that the ability to be sure that all your dependencies are set up is valuable enough to introduce some friction to the process.
Doing this:
You can see our discussion in slack here
e.g. Heroku Bartender, the old gravity etc
https://github.com/artsy/README/blob/master/culture/highlights.md
TL;DR it would take a lot of developer time to truly own Peril. While Peril brings a lot to our team culture, it’s not critical to the success of Artsy’s business—and we can use third-party services to accomplish the same cultural automation that Peril has enabled.
Discussion of this issue in Platform Practice
Research document completed by Justin + Matt
For each commit in a repo's mainline branch (a.k.a. the `master` branch) to theoretically be deployable; as in, the test suite/linter/etc. are all green.
And for related changes to be a single commit, or next to each other in the commit history.
Some of us work more often with a repo’s commit history, either to debug a software issue or otherwise gain information about why certain changes were made. This is made difficult when something that could semantically be considered a single change is spread out over multiple commits.
Squashing before merging means that all the related changes exist in a single commit and thus in context of each other and the commit message.
Multiple commits are problematic when trying to find the commit that introduces a change you're looking for through tools such as `git bisect`, as there's the opportunity to come across commits that are known to be broken (which are typically followed up in the PR with a 'fix' commit) and that only interrupt the workflow.
A git merge may result in commits being interleaved with changes from other PRs, which is problematic when scanning the history for changes you may be interested in and having to keep a mental model of all the commits you came across. Especially those that have a commit message such as ‘fix’ only add noise and make this harder than it needs to be.
In some cases, a PR may contain a larger set of changes that may not all be directly related to each other. In these cases the author should take care to squash the directly related changes into single commits and then request that the assignee 'Rebase and merge' instead, which will ensure they exist next to each other in the destination branch but otherwise leaves them intact.
Some reading material related to topics mentioned:
Offer some form of training on git merge vs rebase basics. (Short L&L?)
Disable the 'Merge' button on repos.
By default, ‘Squash and merge’ PRs, unless otherwise requested by adding the ‘Rebase and merge’ label.
Additionally, an assignee may be able to tell at a glance from the commit history that a PR's few commits are each their own distinct change, and decide to 'Rebase and merge' without the author explicitly asking for it.
Peril’s ‘Merge on green’ feature uses the label to choose either ‘Squash and merge’ or ‘Rebase and merge’.
Provide definition and guidance around the point people for a project.
We want projects to be owned by teams overall, but lots of temporary ownership over projects makes it hard to address technical debt, or to ensure a long-term vision on a project. One solution to this is to define point people, who are responsible for the long-term technical direction of a project, especially as a project can change ownership between teams during its lifetime.
The point people for a project should change over time as people come and go, and also grow if the project evolves in being more significant to us.
This RFC is about trying to document a consensus on what point people’s responsibilities are, and how we can improve code ownership at Artsy.
Having a clear definition of what it means to be a point person on a project at Artsy removes ambiguity. Defining what this role means helps those new to Artsy get up to speed more quickly.
We want to be confident that someone is always looking out for the overall vision of our projects so having an organic way to change who these people are helps with this confidence.
Providing a path for new people to own things helps them feel like they are a part of the team.
The impact of our projects varies and this affects the role of point people and the number required to provide confidence that the long term direction of the project will be covered adequately.
Consider these projects:
One way to distinguish between these projects is the level of impact each has on Artsy's business - as a project has more impact it should be staffed by more point people.
The aim of this RFC is to provide:
A definition for those stages
A process for either bumping a project up/down
A process for requesting changes for a point person
We aim to have every active project at Artsy assigned to a team (see the linked project list below), and there's a reasonable argument to be made that this alone can be enough for some projects. We can address that when we get there.
The Artsy Project List in artsy/potential.
We want people to feel personal growth at Artsy and one way for engineers to do that is to leverage their business impact. By providing straightforward definitions and processes we can ensure a fair playing field regardless of anyone’s assertiveness.
Note: this RFC co-authored with @orta 👬
We'd like to create a public facing status page for our gallery partners.
We believe we could use Atlassian's StatusPage for this so hopefully the amount of dev work needed is low.
The most important things to monitor externally are likely the following to start:
Who actually will update this page in the event of an outage/major disruption is up for discussion still. It may make most sense for one of the two engineers on-call in #incidents to update the status page.
Having a public facing status page would greatly improve Gallery Relations' ability to handle outages or service disruptions by allowing partners to 'self-serve'.
Currently, if there is an outage or major service disruption (e.g. Conversations is down), Partner Support has to send at least 3 messages per user who writes in:
Generally, there is only one person on-shift at a time. If 25 galleries write in, that's ~75 messages for one person to send on top of the normal queue. Having a way for partners to check if something is resolved themselves could minimize this for Gallery Relations.
An added benefit could be having a history of issues as well.
There may be other external services we do not need to monitor. The Genomer applications come to mind but there may be more.
Our current status.artsy.net page uses pingdom uptime checks which may not give us the flexibility we need externally.
Alexander is happy to do as much of it as possible to minimize the amount of time needed from engineers if possible.
You can see our discussion in Slack here
... specifically in the context of the on-call process. When does an #incident turn into a post-mortem-worthy event?
We already have https://github.com/artsy/post_mortems which links to helpful docs but doesn't answer that ☝️question. This could be part of our on-call onboarding docs or something separate.
Start a new biweekly "Incident Review" meeting series.
Agenda:
Invite list:
Meeting should be cancelled if there's nothing on the agenda.
We want to maximize the learnings from past issues and reduce the likelihood of repeating the same mistakes. Incidents are an amazing opportunity to reflect and identify what we can do better.
¯\\\_(ツ)\_/¯
This was discussed previously in a Platform Practice meeting and we agreed to have that discussion here instead before moving forward.
Document the rationale for why Artsy's various closed source repositories aren't open in their respective readmes.
Artsy Engineering defines Open Source by Default as its first Engineering Guiding Principle. Consequently, closed source projects are – in a sense – an exception. As engineers, we generally try to document exceptions and edge cases, and so documenting why a certain repo is closed would be congruent with our documentation practices.
This would introduce a slight bit of friction when creating new repositories, forcing engineers to consider why or why not a repo should be closed. This also lets Artsy Engineering revisit the decision to close a project if and when circumstances change.
Some examples of rationales:
I can't think of any exceptions to this. The rationales would, definitionally, be secret, so there's no real risk. We could always whitelist individual repos from this requirement as we come across them.
We can probably include this in the repo's `meta` section and enforce it with Peril. See this issue for more context around `meta` sections.
A PR to peril-settings adding the check, run periodically.
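As a rough sketch, the core check that such a Peril task might run could look like the following. Note that the function name, the `## Meta` heading, and the "Closed source because" marker are all assumptions for illustration, not an existing Artsy convention:

```typescript
// Sketch: does a closed repo's README document why it's closed? The
// "## Meta" heading and "Closed source because" marker are assumed
// conventions for illustration, not an existing Artsy standard.
function hasClosedSourceRationale(readme: string): boolean {
  const metaSection = readme.split(/^## Meta$/m)[1]
  if (!metaSection) return false
  return /closed source because/i.test(metaSection)
}

// A scheduled Peril task could fetch each private repo's README via the
// GitHub API and fail (or notify) when this returns false.
```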
Define what "20% time" at Artsy is.
20% time is important in determining project estimation, in giving engineers time for career growth, in keeping our systems happy, and more. But it lacks a clear definition – you can ask five PDDE staff what 20% is and get five different answers. Its relationship to Future Fridays is also not well-understood.
This leads to some teams, like Sell, having a strong culture of 20% time. However, other teams frequently mention a lack of using 20% in their sprint retrospectives. This leads to unnecessary feelings of ambiguity and uncertainty.
PDDE Leadership should define 20% time so we can reference it, apply it consistently, and iterate on the practice. I don't want to see comments attempting to define 20% time on this RFC; I'd prefer to follow up after this RFC is accepted and define it somewhere authoritative.
Some key questions:
Feel free to add more questions in the comments – if the RFC is accepted, PDDE Leadership can address them.
None.
I brought this up in our Q3 retrospective.
Define a framework for system "criticality." For each system, evaluate how critical it is and where it fits in this framework. For each framework level and its corresponding systems, decide things like:
This is based on work @dleve123 started on a service health framework (PLATFORM-1098) as well as discussions with many of you. I also think it aligns nicely with the professionalization of our on-call process. In particular, we would like to identify gaps in the health and resilience of our platform as well as prioritize efforts to address them. However our platform comprises dozens of systems with very different needs, so we should recognize those differences with a shared vocabulary, and try to set expectations accordingly.
This implies several sets of decisions, all of which are up for discussion in this RFC and/or later:
But I'll try to provide a sensible starting point below:
Systems that are essential to basic business operations such as registration, authentication, browsing, inquiring, bidding, and buying. These systems understandably experience a relatively high throughput, and any disruptions can have sizable financial and brand impact.
Examples of systems: Gravity, Force, Volt, Causality, Positron, Exchange
Expectations include all of those from lower levels plus:
Systems with limited throughput or public-facing functions. Disruptions may interfere with certain business operations or have mild financial or brand impact.
Examples of systems: Induction, Kaws, Diffusion, Prediction, Pulse, Vibrations
Internal utilities or systems with only occasional usage. Experimentation should be cheap and easy, and some tools serve only a few specific individuals or roles.
Examples of systems: Candela, Phonon, Waves, Apogee, Doppler, Torque, Helix, APRB, Vanity
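The ladder structure above (each level inheriting the expectations of the levels below it) can be sketched in code. The level names and the example expectations below are illustrative assumptions only, not decisions:

```typescript
// Sketch: modeling the proposed criticality ladder. Level names and the
// example expectations are assumptions for illustration, not decisions.
type Criticality = "experimental" | "standard" | "critical"

const ladder: Criticality[] = ["experimental", "standard", "critical"]

const baseExpectations: Record<Criticality, string[]> = {
  experimental: ["README listing an owner"],
  standard: ["staging environment", "error tracking"],
  critical: ["on-call coverage", "uptime monitoring and alerting"],
}

// "Expectations include all of those from lower levels plus: ..." —
// i.e. each level inherits everything below it on the ladder.
function expectationsFor(level: Criticality): string[] {
  const index = ladder.indexOf(level)
  return ladder.slice(0, index + 1).flatMap(l => baseExpectations[l])
}
```

With this shape, `expectationsFor("critical")` accumulates the expectations of all three levels, while `expectationsFor("experimental")` returns only its own.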
Service health framework: https://artsyproduct.atlassian.net/browse/PLATFORM-1098
Service level indicators, objectives, and agreements: https://landing.google.com/sre/sre-book/chapters/service-level-objectives/
Maturity model: https://martinfowler.com/bliki/MaturityModel.html
Automate the rollout of configuration changes to Kubernetes yaml specifications along with code changes. Adopt ideas from Helm (or integrate it into Hokusai) to create a deployment manifest for each deployment, interpolating the deployment tag (Git commit SHA1) and timestamp. Save each manifest in Kubernetes as an audit log, which would facilitate rolling code and configuration forward and backward in concert.
Hokusai was originally designed to decouple rolling out code changes (i.e. `hokusai [staging|production] deploy {COMMIT_SHA1}`) from configuration changes pertaining to that application's environment (i.e. `hokusai [staging|production] update`). Initially the overhead of Helm seemed unnecessary given that our configuration changes were very infrequent compared to code changes, and Helm is designed for semantically versioned releases. However, this has caused us problems.
We have seen cases where this decoupling is a source of confusion: the `update` command runs `kubectl apply -f {path-to-local-hokusai/staging-or-production.yml}` against the local checkout of the repo, so in essence, the developer's local git state becomes the source of truth.
Also, maintaining the `staging` and `production` image tags as pointers to the currently deployed Git SHA1 tag has forced us to bust the Docker cache and always check the latest image for these tags, which has had downstream effects resulting in downtime due to ECR unavailability.
This proposal is to adopt an org-wide deployment strategy that is fully automated for code as well as configuration changes. However, it will be a breaking change and will require migrating projects as well as the version of Hokusai running against those projects. Release 0.5.1 introduces a compatibility check to facilitate this migration.
The `hokusai pipeline promote` deployment strategy will have to be abandoned, as it simply gets the deployed image on staging and promotes it to production, without regard to any configuration changes.
This change proposes interpolating variables when creating a new deployment, rather than when setting up a new project, so https://github.com/artsy/artsy-hokusai-templates would also change in structure (i.e. the templates would not be rendered when creating a project; instead, template variables would remain in the yaml configuration).
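A minimal sketch of that deploy-time interpolation step follows. The `{{VAR}}` placeholder syntax and the function name are assumptions for illustration; the actual syntax would be decided as part of restructuring the templates:

```typescript
// Sketch: rendering a deployment manifest from a yaml template at deploy
// time, substituting e.g. the Git SHA1 and a timestamp. The {{VAR}}
// placeholder syntax is an assumption for illustration.
function renderManifest(template: string, vars: Record<string, string>): string {
  return template.replace(/\{\{(\w+)\}\}/g, (_match, name: string) => {
    if (!(name in vars)) {
      // Fail loudly rather than deploying a manifest with holes in it.
      throw new Error(`Missing template variable: ${name}`)
    }
    return vars[name]
  })
}
```

For example, `renderManifest("image: artsy-app:{{GIT_SHA1}}", { GIT_SHA1: "abc123" })` yields `"image: artsy-app:abc123"`, and the rendered manifest could then be saved in Kubernetes as the audit-log entry described above.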
The `staging` and `production` tags would become redundant, and this might very well be a good thing, as we have seen the `staging` tag conflict with the `staging` branch in Positron's repo, and we currently don't push Git tags because of it. cc @dleve123 @eessex
@alloy has brought up issues of revision history and rolling back, which needs to be done across multiple deployments (i.e. Puma / Sidekiq for a given application)
@ansor4, @joeyAghion, @dleve123, and I recently encountered a strange situation where we released a config change for a Gravity Sidekiq bidding worker before the code change that referenced the Sidekiq config was checked into the project and built into the image.
Create a shared space to organize and track PDDE cross-team work.
Tasks within the Product, Design, Data, and Engineering (PDDE) organization are often organized and tracked within individual product teams. Product teams within the organization have largely standardized on using Jira Software.
Outside of product teams, there are other structures in which colleagues collaborate on important cross-cutting efforts. So far we have Practices, Working Groups, and Task Forces. These different groups organize and track their tasks in different places.
Examples
The reduced visibility of work within these cross-team groups may have an adverse effect on planning within individual product teams. It can be difficult to anticipate how work managed outside of product team Jira Software projects may affect a team’s bandwidth in current or future sprints.
A solution might:
None yet
I originally chatted about this casually with @iskounen. I've since brought it up in a couple different Practice / Working Group meetings and during the last Front-End Practice meeting.
Validate the problem and update the RFC with a specific proposal
Participation in this process will be opt-in for Practice, Working Group, and Task Force leaders. Much of the process for the new Jira project will be sourced from the discussions here but there will be another PR with its documentation to provide feedback on as well.
Document the engineering team's process for requesting training, conference attendance, and learning materials.
We want to have clarity in what types of professional development activities are reimbursed and how to request them. Clarifying this process will hopefully lead to increased requests to pursue opportunities for learning and growing in the engineering team, and a smoother request/approval process for both engineers and their managers.
A playbook document outlining the current process for requesting professional development. This playbook should address questions that engineers and managers have about the process.
I'd like to see us improve the process of software capitalization so that ICs double-check their allocations and start/end dates, with documentation added as we go. I'd also like to see us work on this monthly so that we aren't in a rush at the end of a quarter and blocking Rob and his team from closing things out.
Then, once improved we should document the process so that it's easier for new TLs to get up to speed and is a good reminder for existing ones.
Waiting until the end of the quarter to get Accounting what they need for closing the quarter puts added stress on things - moving to do this continuously where possible and monthly otherwise should spread things out in a way that makes everyone feel super cool.
Can't think of any!
The current process is documented here:
https://www.notion.so/artsy/Software-Capitalization-22c90e835422423399323b7835f353a3
We'd update that process section, but there's some good context at the top of that doc that's helpful if this is new to you!
Here are the concrete things I'd like to see happen:
Each team already has a weekly or bi-weekly meeting for the PM, TL, designer and data person to chat. We look at the project board in Notion - let's just take a second to make sure all start/end dates and evidence links are there or assign them if not.
Shell is here: https://github.com/artsy/README/blob/master/onboarding/mentors.md
Should/could include things like:
cc @xtina-starr for the idea! (feel free to edit this description, haha)
We should add a doc on how code goes from one's dev environment --> production. It could describe the process for a web app vs. iOS, or go in-depth into the deployment process for a system like Gravity.
Fail PRs which add new keys to the `dependencies` section of the `package.json` in Reaction / Emission.
The fail would then be removed when an RFC is referenced in the body of the PR. Perhaps it can look for a string like `RFC: [url]` inside the body. Also support a `#skip_rfc` tag to skip the fail.
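The decision logic for such a check could be sketched as a plain function; the function name and option shape are assumptions, and a real version would read the `package.json` diff and PR body via Peril's Danger DSL:

```typescript
// Sketch: decision logic for the proposed dependencies check. The name
// and option shape are assumptions; a real implementation would live in
// a Peril dangerfile reading the PR diff and body.
function shouldFailPR(opts: { addedDependencyKeys: string[]; prBody: string }): boolean {
  // Nothing added to `dependencies` — nothing to gate.
  if (opts.addedDependencyKeys.length === 0) return false
  // An RFC reference like "RFC: <url>" in the body lifts the fail.
  const referencesRFC = /RFC:\s*\S+/.test(opts.prBody)
  // An explicit #skip_rfc tag also skips the fail for trivial PRs.
  const skipsRFC = opts.prBody.includes("#skip_rfc")
  return !referencesRFC && !skipsRFC
}
```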
We'd like to find a better space to discuss additions to our two larger component libraries; it can be a shame to write a PR only to have it blocked on the discussion around adding a dependency.
This is mainly to:
Some recent examples:
Slack:
Some PRs are pretty trivial, and so it should be possible to skip the discussion.
You can see our discussion in Notion (it's a bit light on notes because I was the one taking notes and talking at the same time).
If OSS by Default means starting with open, what are the right arguments for saying something should be closed/private?
Re: https://github.com/artsy/potential/issues/166#issuecomment-412677219
Right now, this lives in platform: `practices/platform/technology-roadmap.md` - but ideally it is scoped to all teams.
Provide explicit recommendations in the PDDE playbook for when employees should take time off. Situations like high stress or after hours work.
A time off recommendation may be something like:
These are just examples.
At Artsy we believe people are paramount. We generally have a healthy appreciation for work-life balance. Leadership often encourages people to take a little time to recharge after they've put in a lot of effort. There's almost a mantra among managers of telling their direct reports to take time off. It's great that that mindset exists, but I propose we extend and formalize it a bit.
Sometimes when issues arise that require a lot of investment in time and energy it becomes really hard to disconnect again. Providing a clear, explicit policy gives us a way to socialize and reinforce the idea that taking time to recharge is an important step in continuing to deliver quality worthy of art. It gives managers tools to reach for to provide concrete recommendations. It provides transparency to the org so that we can take rest time into account for when we're planning product deliverables.
The team agrees, we come up with a set of recommendations, leadership and people ops buy in, and we add them to the PDDE time off recommendations.
Restructure the Platform practice to embrace getting platform-aligned work done throughout product teams. To achieve this, we should reset the practice's membership and cancel the inconsistent stand-up forum. Instead, self-selected engineers should come together more deliberately in a weekly hour-long session to review (1) active projects, (2) trends around performance, incidents, risks, and (3) an in-depth topic based on an evolving, openly-curated agenda. This is hopefully where opportunities, projects, or decisions can surface.
This would be a significant investment of time for its members, so we should re-evaluate the format and interval after a few iterations. By agreeing on each meeting's in-depth topic in advance, we could also support engineers opting in to some sessions without committing to all of them.
For reasons like that, we should encourage (but not enforce) that each product team be represented within the practice.
Finally, for clarity we've been talking about renaming the Platform team to Infrastructure. It's less ambiguous with the practice that way. It also more directly points to the specific KPIs for which the team can be accountable, as opposed to long-term platform-wide concerns that benefit from more collective ownership.
The practice hasn't settled on a productive form or forum since the reorganization of product teams. The current practice stand-up is attended inconsistently and doesn't allow us to discuss and resolve anything in-depth. Instead, these discussions happen in ad hoc design sessions, [too late] on pull requests, or not at all.
Important projects like BNMO and local discovery provide clear opportunities to build new capabilities for the entire platform, but in practice we've preferred to do that work in close collaboration with (or in) focused product teams. More minor projects also benefit from being executed in ways that leverage platform resources and advance the whole product.
Finally, we have a new, bigger team. The "practice" should be open to everyone rather than an artifact of the original Platform team.
A major challenge will be ensuring platform-wide priorities are adequately recognized and integrated with product work. At the very least, notes, resolutions, outcomes, or documents should be reported out to the rest of engineering. When significant projects need investment, these will have to be either prioritized by the Platform team itself (Infrastructure!), lobbied for in product teams by practice members or technical leads, or escalated to quarterly priorities such that they're staffed explicitly.
It's helpful to imagine examples of in-depth topics we might take on as a practice:
Could be the setup/teardown todo list, a touch of history, etc.