dazedbear.github.io's Issues

Migrate infra from Vercel to AWS Amplify

Goal

Evaluate the effort and pros/cons of migrating the infrastructure from Vercel to AWS Amplify.

Logging: Sematext → AWS CloudWatch Logs Insights
Alert: Logalert → AWS CloudWatch Log Alarm

Note

https://v5.dazedbear.pro/ (served by AWS Amplify)

Pro

  • Deeper integration with the AWS ecosystem and more flexibility

Cons

  • DX is worse than Vercel's
  • Only basic features
  • GitHub integration is very basic
  • No Node 18 support (the Amazon Linux 2 image only supports Node 12, 14, and 16)
  • Everything has to be built from scratch

Other potential

INCIDENT-2021-12-21-001

Issue Info

  • Start Time: 2021/12/20 13:40:24 UTC
  • End Time: 2021/12/21 18:07:30 UTC
  • Is it detected automatically: yes

Issue Description

  1. Received non-200 alerts with ~20 504 events after the failsafe generation job.

Screenshot 2021-12-21 2:08:27 AM

  2. Users saw 504 Gateway Timeout from Vercel and couldn't access pages a few minutes after failsafe generation.

Root Cause

This was a regression introduced by the previous day's release of failsafe generation. The job fetched ~50 pages at the same time, so it hit the 30-connection limit of Redis Enterprise Cloud (Basic Plan).

In addition, Redis client connections stay idle for a while after a page fetch finishes, which is why users still saw 504 pages after failsafe generation had completed.

https://redis.io/topics/clients#client-timeouts

Sematext

2021-12-20T17:07:44.460Z	41511533-481f-41b5-b87d-5163467661ff	ERROR	[Mon, 20 Dec 2021 17:07:44 GMT][cacheClient] ReplyError: ERR max number of clients reached
    at parseError (/var/task/node_modules/redis-parser/lib/parser.js:179:12)
    at parseType (/var/task/node_modules/redis-parser/lib/parser.js:302:14) {
  command: 'AUTH',
  args: [ 'xxxxxxxx' ],
  code: 'ERR'
}
2021-12-20T17:07:44.461Z	41511533-481f-41b5-b87d-5163467661ff	ERROR	[Mon, 20 Dec 2021 17:07:44 GMT][cacheClient] AbortError: Ready check failed: Redis connection lost and command aborted. It might have been processed.
    at RedisClient.flush_and_error (/var/task/node_modules/redis/index.js:298:23)
    at RedisClient.connection_gone (/var/task/node_modules/redis/index.js:603:14)
    at Socket.<anonymous> (/var/task/node_modules/redis/index.js:231:14)
    at Object.onceWrapper (events.js:519:28)
    at Socket.emit (events.js:412:35)
    at endReadableNT (internal/streams/readable.js:1334:12)
    at processTicksAndRejections (internal/process/task_queues.js:82:21) {
  code: 'UNCERTAIN_STATE',
  command: 'INFO'

Screenshot 2021-12-21 2:01:57 AM

GitHub Actions

https://github.com/dazedbear/dazedbear.github.io/runs/4585459016?check_suite_focus=true
Screenshot 2021-12-21 2:12:41 AM

Redis

https://redis.com/redis-enterprise-cloud/pricing/

Screenshot 2021-12-21 2:30:42 AM

Screenshot 2021-12-21 2:01:22 AM

Screenshot 2021-12-21 2:14:35 AM

Mitigation

Set a fixed failsafe concurrency of 15 (max: 30) to mitigate this issue. #19

Failsafe generation then succeeded for all pages:
https://github.com/dazedbear/dazedbear.github.io/runs/4585959445?check_suite_focus=true
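
The concurrency cap described above can be sketched roughly as follows. This is a minimal illustration, not the code from #19: the names `toBatches` and `runWithConcurrency` are hypothetical, and batching is only one of several ways to bound the number of in-flight page fetches. It limits concurrent tasks only; how long idle Redis connections linger is still governed by client and server timeouts.

```typescript
// Split items into consecutive batches of at most `size` elements.
function toBatches<T>(items: T[], size: number): T[][] {
  if (!Number.isInteger(size) || size < 1) {
    throw new RangeError("size must be a positive integer");
  }
  const batches: T[][] = [];
  for (let i = 0; i < items.length; i += size) {
    batches.push(items.slice(i, i + size));
  }
  return batches;
}

// Run `task` over all items with at most `concurrency` calls in flight.
// Each batch must finish before the next one starts.
async function runWithConcurrency<T, R>(
  items: T[],
  concurrency: number,
  task: (item: T) => Promise<R>
): Promise<R[]> {
  const results: R[] = [];
  for (const batch of toBatches(items, concurrency)) {
    const batchResults = await Promise.all(batch.map(task));
    for (const r of batchResults) {
      results.push(r);
    }
  }
  return results;
}
```

With ~50 pages and a cap of 15, this yields batches of 15, 15, 15, and 5, which stays well below the 30-connection limit as long as each fetch holds a single connection (an assumption).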

Remediation Items

none

INCIDENT-2022-06-30-001

Issue Info

  • Start Time: 2022/04/04
  • End Time: 2022/06/30
  • Is it detected automatically: No

Issue Description

Screenshot 2022-06-30 12:09:40 PM

The score does not render any content and shows a gray block instead.

Screenshot 2022-06-30 11:49:42 AM

Root Cause

The iframe component fails to render.

Screenshot 2022-06-30 11:54:17 AM

Mitigation

Remove the iframe and re-upload the score file to fix this issue.

Remediation Items

  • Check whether react-notion-x supports the iframe component.
  • Add an automated mechanism to detect such UI issues.

Testing Strategy

Unit Test

| path | description | tech stack |
| --- | --- | --- |
| src/libs/server | helper, transformer | jest |
| src/libs/client/hook | react hook | jest, react hook |
| src/libs/client/slices | redux | jest, redux |

Component Test

| path | description | tech stack |
| --- | --- | --- |
| src/components/__tests__ | detailed component test | jest, @storybook/testing-react, @testing-library/react |
| src/stories | write story and interaction testing in all stories | jest, @storybook/addon-interactions |
| src/stories/__tests__/storybook.ts | automatic storybook snapshot testing for all stories | jest, storybook |

  • snapshot test is executed by ts-jest
  • interaction test and visual test are executed by Chromatic

E2E: Browser Test

| path | description | tech stack |
| --- | --- | --- |
| tests/e2e/pages | e2e test for static and notion pages (src/pages) | Playwright |
| tests/e2e/failsafe | e2e visual test for static and notion failsafe generated pages (src/pages) | Playwright |

  • check notion page content rendering

  • check meta tags

  • visual testing

  • basic user operation (header, navigation menu)

  • Checkly

E2E: API Test

| path | description | tech stack |
| --- | --- | --- |
| tests/e2e/api | e2e test for API routes (src/pages/api) | Playwright |

Reference

INCIDENT-2021-10-06-001

Issue Info

  • Start Time: 2021/10/06 12:47 GMT+8
  • End Time: 2021/10/06 23:51 GMT+8
  • Is it detected automatically: Yes, by 404 abuse alert

Screenshot 2021-10-07 12:31:36 AM

Issue Description

Users saw 404 pages and could not access any server-side rendered pages whose data comes from Notion.

Root Cause

Notion pushed breaking API changes this morning: both the accepted body parameters and part of the response data structure were updated. This unintentionally broke all Notion pages.

Mitigation

  1. Thanks to the community, notion-client 4.9.3 was released to align with the new version of the Notion API (see react-notion-x#140 PR). However, pages were still broken even after I triggered a redeployment around 14:00. The investigation was then interrupted by other work.
START RequestId: bbbf7b0e-c6e5-410f-9725-6760b1305a94 Version: $LATEST
2021-10-06T04:47:20.316Z	bbbf7b0e-c6e5-410f-9725-6760b1305a94	INFO	[Wed, 06 Oct 2021 04:47:20 GMT][cacheClient] cache MISS | getNotionPage | key: production_1728db20-8ddd-4388-b355-7132d35738e4
2021-10-06T04:47:21.085Z	bbbf7b0e-c6e5-410f-9725-6760b1305a94	INFO	[Wed, 06 Oct 2021 04:47:21 GMT][cacheClient] cache HIT | lqip | key: production_https://www.notion.so/image/https%3A%2F%2Fs3-us-west-2.amazonaws.com%2Fsecure.notion-static.com%2F40520727-7869-44cf-a65d-b61183a90975%2FASUS-eeepc-Logo.png?table=block&id=1728db20-8ddd-4388-b355-7132d35738e4&cache=v2
2021-10-06T04:47:21.231Z	bbbf7b0e-c6e5-410f-9725-6760b1305a94	INFO	[Wed, 06 Oct 2021 04:47:21 GMT][cacheClient] cache MISS | getNotionPostsFromTable - getPage | key: production_4fe6f6a7546709ee1be9d1e923127c63aef75ff3
2021-10-06T04:47:21.494Z	bbbf7b0e-c6e5-410f-9725-6760b1305a94	WARN	NotionAPI collectionQuery error Response code 400 (Bad Request)
2021-10-06T04:47:21.810Z	bbbf7b0e-c6e5-410f-9725-6760b1305a94	WARN	NotionAPI collectionQuery error Response code 400 (Bad Request)
2021-10-06T04:47:22.100Z	bbbf7b0e-c6e5-410f-9725-6760b1305a94	INFO	[Wed, 06 Oct 2021 04:47:22 GMT][cacheClient] cache MISS | getNotionPostsFromTable - queryPosts | key: production_3a649e6b52fba33d40c42d937fe0a41594b8104d
2021-10-06T04:47:22.229Z	bbbf7b0e-c6e5-410f-9725-6760b1305a94	ERROR	[Wed, 06 Oct 2021 04:47:22 GMT][singlePage] HTTPError: Response code 400 (Bad Request)
2021-10-06T04:47:22.229Z	bbbf7b0e-c6e5-410f-9725-6760b1305a94	WARN	[Wed, 06 Oct 2021 04:47:22 GMT][page] redirect to 404 page | notionPath: /article/eeepc-relife-plan-1728db208ddd4388b3557132d35738e4
END RequestId: bbbf7b0e-c6e5-410f-9725-6760b1305a94
REPORT RequestId: bbbf7b0e-c6e5-410f-9725-6760b1305a94	Duration: 4606.08 ms	Billed Duration: 4607 ms	Memory Size: 1024 MB	Max Memory Used: 156 MB	Init Duration: 421.33 ms	
  2. I resumed the investigation at 20:00. After a deep dive, I found some code still reading values from the old API response structure, so I pushed commits to align with the new response structure. All Notion pages recovered after the fix.
{
    "result": {
        "type": "reducer",
        "reducerResults": {
            "collection_group_results": {
                "type": "results",
                "blockIds": [
                    "231052a6-9034-45b4-b7de-b7d2b1f1962d",
                    "db0c4004-93bd-4a1a-83c2-fc83bb8c8c3a",
                    "5428cadf-17cc-408c-bbe4-bc11c982df6d"
                ]
            },
            "table:uncategorized:title:count": {
                "type": "aggregation",
                "aggregationResult": {
                    "type": "number",
                    "value": 3
                }
            }
        }
    },
    "recordMap": { ... }
}
  3. However, I found that some draft articles were also rendered on the page. After checking the fix PR react-notion-x#140, I saw that "query" had been removed accidentally, so the Notion API could not filter / sort as expected. As a short-term fix, I forked the repo, bundled a new package, and uploaded it to the CDN to ship the fix quickly (fix commit). The issues were resolved after triggering a new release.
npm i https://static.dazedbear.pro/notion-client-4.9.3-patch.tgz
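
Reading block IDs from the new response structure shown in the JSON above could look roughly like the sketch below. The fallback to `result.blockIds` is an assumption about the pre-change response shape, and `getBlockIds` is a hypothetical helper name, not a function from notion-client or this repository.

```typescript
// Only the fields used here are typed; everything is optional because the
// unofficial API can change shape without notice.
interface CollectionQueryResponse {
  result?: {
    // Assumed location of block IDs before the API change.
    blockIds?: string[];
    // Location of block IDs in the response shown above.
    reducerResults?: {
      collection_group_results?: { blockIds?: string[] };
    };
  };
}

// Prefer the new location, fall back to the assumed old one, and never
// throw on a missing branch.
function getBlockIds(res: CollectionQueryResponse): string[] {
  return (
    res.result?.reducerResults?.collection_group_results?.blockIds ??
    res.result?.blockIds ??
    []
  );
}
```

Returning an empty list instead of throwing lets the page render an empty collection rather than failing the whole request when the structure changes again.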

Remediation Items

Follow-up items:

  1. Send a PR to react-notion-x to fix the missing query support for filter / sort functionality
  2. Split 404 and 500 traffic for better monitoring. It's not ideal to discover serious issues from the 4xx alert.
  3. Add a static page and feature bar for website maintenance announcements
  4. Set up an hourly synthetic monitoring job that visits Notion pages for early error discovery
  5. Evaluate migrating from the unofficial to the official Notion API
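
As a rough illustration of splitting 404 and 500 traffic (item 2), response statuses can be bucketed before alert thresholds are applied. The bucket names and the helpers `bucketFor` and `countByBucket` are hypothetical; the actual alert rules would live in the logging service's configuration.

```typescript
type Bucket = "ok" | "not_found" | "client_error" | "server_error" | "other";

// Map an HTTP status code to a monitoring bucket. 404 gets its own bucket so
// that crawler noise does not hide genuine server failures.
function bucketFor(status: number): Bucket {
  if (status >= 200 && status < 300) return "ok";
  if (status === 404) return "not_found";
  if (status >= 400 && status < 500) return "client_error";
  if (status >= 500 && status < 600) return "server_error";
  return "other";
}

// Count how many responses fall into each bucket.
function countByBucket(statuses: number[]): Record<Bucket, number> {
  const counts: Record<Bucket, number> = {
    ok: 0,
    not_found: 0,
    client_error: 0,
    server_error: 0,
    other: 0,
  };
  for (const status of statuses) {
    counts[bucketFor(status)] += 1;
  }
  return counts;
}
```

An alert on `server_error` can then use a much lower threshold than one on `not_found`.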

INCIDENT-2021-12-01-001

Issue Info

  • Start Time: 2021/12/01 06:07:31 GMT+8
  • Mitigated Time: 2021/12/02 12:16:00 GMT+8
  • End Time: ongoing
  • Is it detected automatically: yes

Issue Description

Users see 500 pages or broken pages when they visit the Notion pages below.

Screenshot 2021-12-02 12:04:13 AM

Screenshot 2021-12-02 12:04:50 AM

Screenshot 2021-12-02 12:04:41 AM

Alert

3 events matched your lambda non 200, 404 alert alert in the past 5 minutes.

500 /music-notebook(no message)
--
500 /music-notebook(no message)
500 /music-notebook(no message)

Error logs

2021-11-30T22:07:37.007Z	3a836c75-051a-43a4-a723-13c44433a361	ERROR	TypeError: Cannot read property 'e938c2ce-3e70-4042-8c40-441d54b975b5' of undefined
    at /var/task/node_modules/react-notion-x/build/cjs/components/text.js:142:59
    at Array.reduce (<anonymous>)
    at /var/task/node_modules/react-notion-x/build/cjs/components/text.js:50:37
    at Array.map (<anonymous>)
    at Text (/var/task/node_modules/react-notion-x/build/cjs/components/text.js:36:133)
    at d (/var/task/node_modules/react-dom/cjs/react-dom-server.node.production.min.js:33:498)
    at bb (/var/task/node_modules/react-dom/cjs/react-dom-server.node.production.min.js:36:16)
    at a.b.render (/var/task/node_modules/react-dom/cjs/react-dom-server.node.production.min.js:42:43)
    at a.b.read (/var/task/node_modules/react-dom/cjs/react-dom-server.node.production.min.js:41:83)
    at Object.exports.renderToString (/var/task/node_modules/react-dom/cjs/react-dom-server.node.production.min.js:52:138) {
  page: '/[...notionPath]'
}
END RequestId: 3a836c75-051a-43a4-a723-13c44433a361
REPORT RequestId: 3a836c75-051a-43a4-a723-13c44433a361	Duration: 5075.82 ms	Billed Duration: 5076 ms	Memory Size: 1024 MB	Max Memory Used: 169 MB	Init Duration: 430.51 ms	
RequestId: 3a836c75-051a-43a4-a723-13c44433a361 Error: Runtime exited with error: exit status 1
Runtime.ExitError

Root Cause

The root cause is not yet known. I suspect it is related to the recent change in the Notion query collection response format.

Mitigation

  1. Created a new static maintenance page and added rewrite rules to keep users away from these Notion pages for now. (Mitigated Time: 2021/12/02 12:16:00 GMT+8)
    #15
  2. Will keep investigating the root cause and preparing a fix for this issue.
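
The rewrite rules from step 1 might be generated along these lines. This is a sketch under assumptions: the `/maintenance` destination and the helper name `maintenanceRewrites` are hypothetical, and the actual change in #15 may be configured differently (for example in the hosting platform's config rather than in Next.js).

```typescript
// Shape of a single rewrite entry as accepted by the Next.js `rewrites()`
// config function (only the two required fields are used here).
interface Rewrite {
  source: string;
  destination: string;
}

// Build rewrites that send each affected path, and everything beneath it,
// to the maintenance page. The URL in the browser stays unchanged.
function maintenanceRewrites(
  paths: string[],
  destination: string = "/maintenance" // hypothetical page route
): Rewrite[] {
  const rules: Rewrite[] = [];
  for (const path of paths) {
    rules.push({ source: path, destination });
    rules.push({ source: `${path}/:path*`, destination });
  }
  return rules;
}

// Possible usage inside next.config.js:
//   async rewrites() { return maintenanceRewrites(["/music-notebook"]); }
```

Keeping the affected paths in one list makes it easy to remove the rules again once the root cause is fixed.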

Remediation Items

TBD

Monitoring Strategy

Logging

  • Sematext

Monitoring / Alerting

  • Metric-based (4xx, 5xx): Logalert

Synthetic Monitoring

  • Browser Test
  • API Test

INCIDENT-2021-10-04-001

Issue Info

  • Start Time: 2021/10/04 06:07:24 GMT+8
  • End Time: 2021/10/04 15:38:00 GMT+8
  • Is it detected automatically: Yes, by the 404 abuse alert.

Screenshot 2021-10-04 4:02:46 PM

Screenshot 2021-10-04 4:17:21 PM

Issue Description

Users saw a 404 page when visiting https://www.dazedbear.pro/music-notebook/

Root Cause

The page linked below contained two broken images, which caused a TypeError during LQIP image optimization.

https://www.dazedbear.pro/music-notebook/transcription-%E8%B8%AE%E8%B5%B7%E8%85%B3%E5%B0%96%E6%84%9B-741fd1b104db4cddb8bb44a5791d228c

2021-10-04T07:10:29.023Z	3f1a3d4a-d93e-4e4b-870d-e678f5d13d48	INFO	[Mon, 04 Oct 2021 07:10:29 GMT][cacheClient] cache MISS | lqip | key: production_https://www.notion.so/image/https%3A%2F%2Fdazedbear-pro-assets.s3-ap-northeast-1.amazonaws.com%2Fwebsite%2F%E8%B8%AE%E8%B5%B7%E8%85%B3%E5%B0%96%E6%84%9B-1.jpg?table=block&id=194e322a-0c98-4fd6-923f-14e03be455b6&cache=v2
2021-10-04T07:10:29.025Z	3f1a3d4a-d93e-4e4b-870d-e678f5d13d48	INFO	[Mon, 04 Oct 2021 07:10:29 GMT][cacheClient] cache MISS | lqip | key: production_https://www.notion.so/image/https%3A%2F%2Fdazedbear-pro-assets.s3-ap-northeast-1.amazonaws.com%2Fwebsite%2F%E8%B8%AE%E8%B5%B7%E8%85%B3%E5%B0%96%E6%84%9B-2.jpg?table=block&id=60f17b2a-92c5-4413-9340-8d479bb4e840&cache=v2
2021-10-04T07:10:30.225Z	3f1a3d4a-d93e-4e4b-870d-e678f5d13d48	ERROR	[Mon, 04 Oct 2021 07:10:30 GMT][lqip]
2021-10-04T07:10:30.225Z	3f1a3d4a-d93e-4e4b-870d-e678f5d13d48	ERROR	[Mon, 04 Oct 2021 07:10:30 GMT][listPage] TypeError: Cannot read property 'metadata' of undefined
2021-10-04T07:10:30.225Z	3f1a3d4a-d93e-4e4b-870d-e678f5d13d48	WARN	[Mon, 04 Oct 2021 07:10:30 GMT][page] redirect to 404 page | notionPath: /music-notebook

An AWS infra update on 2021/09/21 introduced a new CDN host so that images are no longer fetched directly from S3 buckets, for cost optimization. I forgot to update these images at that time, and their image cache expired this morning, which triggered this issue.

Mitigation

Remove these broken images and upload new ones to resolve this issue.
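
To keep a single broken image from turning the whole page into a 404, the LQIP step could skip failed images instead of reading `metadata` from an undefined result. The sketch below is hypothetical: the `Preview` shape, the `generate` callback, and the helper names are assumptions for illustration, not the site's actual code or the LQIP library's API.

```typescript
// Hypothetical shape of an LQIP result; the real library's type may differ.
interface Preview {
  metadata: { width: number; height: number };
  dataUri: string;
}

// Type guard: true only when a result exists and carries metadata.
function isUsablePreview(p: Preview | undefined | null): p is Preview {
  return p != null && p.metadata != null;
}

// Generate previews for all URLs, skipping any image whose generation
// throws or returns nothing, so one bad image cannot fail the page.
async function buildPreviews(
  urls: string[],
  generate: (url: string) => Promise<Preview | undefined>
): Promise<Map<string, Preview>> {
  const results = await Promise.all(
    urls.map(async (url) => {
      try {
        return { url, preview: await generate(url) };
      } catch {
        return { url, preview: undefined };
      }
    })
  );
  const previews = new Map<string, Preview>();
  for (const { url, preview } of results) {
    if (isUsablePreview(preview)) {
      previews.set(url, preview);
    } else {
      console.warn(`[lqip] skipped broken image: ${url}`);
    }
  }
  return previews;
}
```

The warning still surfaces the broken image in the logs, so it can be replaced later without an outage.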
