
joldnine.github.io's Issues

First issue

This is the first issue, for development and testing.

System.out.println("Hello, world!");

Java: Knowledge Points Summary

Although I have been learning and using Java for many years, I have forgotten most of the data structure implementations and principles learned in school. Reviewing them occasionally keeps the knowledge fresh, so let's revise these fundamentals together!

Data Structures

1. String, StringBuilder, StringBuffer

How String works:

As a class, String stores its data in a char array marked final:
private final char value[];
One thing to note: before Java 7, the String constant pool generally lived in the permanent generation, the same treatment as statics. From Java 7 on, interned Strings moved to the heap, because PermGen is too small (4 MB by default) and there has been a trend toward removing it.
String s = new String("abc") creates 2 objects: one on the heap and one in the constant pool (on the heap since JDK 7).
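A minimal sketch illustrating the two-object point above (exact pool behavior varies by JDK version, so treat the comments as the common case):

public class StringPoolDemo {
    public static void main(String[] args) {
        String literal = "abc";              // reuses the pooled "abc"
        String created = new String("abc");  // a new heap object wrapping the pooled value

        System.out.println(literal == created);          // false: different references
        System.out.println(literal == created.intern()); // true: intern() returns the pooled instance
    }
}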

How StringBuilder works:

StringBuilder extends the abstract class AbstractStringBuilder, and its data is likewise stored in a char array (default initial capacity 16):
char[] value;
On every append of a new String, StringBuilder does 2 things:
(1) Check whether this StringBuilder has enough capacity; if not, grow it. Growing allocates a new char[] large enough to hold the appended String.
Core code:
ensureCapacityInternal(count + len);
Arrays.copyOf(value, newCapacity(minimumCapacity));

(2) Copy the chars of the appended String in, one by one.
Core code:
str.getChars(0, len, value, count);
System.arraycopy(value, srcBegin, dst, dstBegin, srcEnd - srcBegin);

How StringBuffer works:

StringBuffer is the interesting one. It is essentially a thread-safe StringBuilder: its methods are marked with the synchronized keyword.

Comparison

String is small and fast to create, but modifying it produces many objects, which increases GC load and hurts performance; it suits immutable data.
StringBuilder is slower to create than String but mutable, and modifications do not significantly increase GC load; it is the recommended default.
StringBuffer is slower to modify than StringBuilder, but it is synchronized and thread-safe.
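A small demonstration of the trade-off above; absolute numbers vary by JVM, so read it as an illustration rather than a rigorous benchmark:

public class ConcatDemo {
    public static void main(String[] args) {
        int n = 20_000;

        long t0 = System.nanoTime();
        String s = "";
        for (int i = 0; i < n; i++) s += i;        // each += allocates a new String
        long t1 = System.nanoTime();

        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < n; i++) sb.append(i);  // mutates one internal char array
        long t2 = System.nanoTime();

        System.out.printf("String: %d ms, StringBuilder: %d ms%n",
                (t1 - t0) / 1_000_000, (t2 - t1) / 1_000_000);
    }
}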

2. HashMap

We typically reach for a Map when we need key-value pairs (unique keys, possibly duplicated values).

How HashMap works

Class hierarchy

public class HashMap<K,V> extends AbstractMap<K,V> implements Map<K,V>, Cloneable, Serializable

Data storage

transient Node<K,V>[] table;
static class Node<K,V> implements Map.Entry<K,V> {
        final int hash;
        final K key;
        V value;
        Node<K,V> next;
}

Key methods
put implementation

  1. Hash the key.
  2. Null-check the table; if it is empty, resize() initializes it.
  3. If the bucket at the computed index is null, create a new node there and put the value in.
  4. Otherwise, if a node already exists at that index and its hash and key match, overwrite the value directly; otherwise go to step 5.
  5. If the index is occupied but the node does not match and the node is a list, walk node.next comparing hashes until a match is found and overwrite its value; if the end is reached without a match, append a new node at the tail of the list. If the node is a tree, do a tree lookup, and if not found, perform a balanced insertion.
  6. If the node in step 5 is a list and the list has grown past TREEIFY_THRESHOLD (default 8), convert the list into a red-black tree (treeifyBin).
  7. If the table's size exceeds the threshold, i.e. it is "full", resize the table to twice its size. The default threshold is reached when 75% of the table's buckets are occupied, because beyond a certain load, index collisions become severe (many hashes, after the bitwise AND with (n - 1), land on occupied buckets and have to "queue" in that bucket's list or tree).

As for why a faster-to-search tree is not used for nodes from the start, the source comment explains:

Because TreeNodes are about twice the size of regular nodes, we
use them only when bins contain enough nodes to warrant use
(see TREEIFY_THRESHOLD). And when they become too small (due to
removal or resizing) they are converted back to plain bins.  In
usages with well-distributed user hashCodes, tree bins are
rarely used.

get implementation
get is the reverse of put, but with much simpler logic.

  1. Hash the key.
  2. Find the node for this hash in the table, then search within the node (tree or list) using e.hash == hash and key.equals(): O(n) for a linked list, O(log n) for a tree.
    Red-black trees were only added in Java 8; before that, buckets were plain lists. A long list makes O(n) lookups slow, but tree nodes take more space, so TREEIFY_THRESHOLD was added as a compromise.

Index computation in get and put

static final int hash(Object key) {
    int h;
    return (key == null) ? 0 : (h = key.hashCode()) ^ (h >>> 16);
}

... 
    // n is the size of the table
    (n - 1) & hash
...
  1. If the key is null, return 0.
  2. Get the key's hashCode(); the default implementation typically derives it from the object's memory address.
  3. XOR the hash code with itself shifted right by 16 bits.
  4. Bitwise AND the value from step 3 with (n - 1) to get the final index (a runnable sketch follows this list).
    This still allows some probability of index collisions, but it keeps resizing cheap.
    In fact both the hash and the tree-node design are performance trade-offs: hashing keeps the table lookup O(1), and the amortized lookup within a bucket's list or tree also stays close to O(1) with a small constant factor.
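A runnable sketch of the spreading-and-masking steps above, mirroring the JDK snippet (the key "example" is just an illustrative choice):

public class IndexDemo {
    // Same spreading trick as the JDK code above: XOR the high 16 bits into the low 16.
    static int hash(Object key) {
        int h;
        return (key == null) ? 0 : (h = key.hashCode()) ^ (h >>> 16);
    }

    public static void main(String[] args) {
        int n = 16;                            // table size, always a power of two
        int index = (n - 1) & hash("example"); // mask instead of modulo, valid since n is a power of two
        System.out.println("bucket index = " + index);
    }
}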

Usage notes:

  1. Not suitable for concurrent use: modifications are not synchronized, so data can become inconsistent, and if two threads decide to resize at the same time, the map can end up in an infinite loop (see resize()). For concurrency you can use Hashtable.
    Hashtable is basically the same as HashMap, but its put and get are marked synchronized for thread safety; the locking makes it slower, its Node in Java 8 is a plain list rather than a tree, and it does not accept null keys. Nowadays ConcurrentHashMap is used far more, and Hashtable is effectively abandoned. ConcurrentHashMap is faster than Hashtable because its methods are not synchronized; instead its fields use transient and volatile, and its buckets are segmented, with a lock per segment.
    Alternatively, a HashMap can be made synchronized via Map m = Collections.synchronizedMap(hashMap); (see the sketch after this list).
  2. Keys should be immutable (e.g. String). If a key is mutated after insertion, its hashCode() changes and the value becomes unreachable.
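A minimal sketch of the two thread-safe options mentioned above; ConcurrentHashMap is usually the better default:

import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class SafeMapDemo {
    public static void main(String[] args) {
        // Option 1: wrap a HashMap; every call locks the whole wrapper object.
        Map<String, Integer> wrapped = Collections.synchronizedMap(new HashMap<>());
        wrapped.put("a", 1);

        // Option 2: ConcurrentHashMap; finer-grained locking, no null keys or values.
        Map<String, Integer> concurrent = new ConcurrentHashMap<>();
        concurrent.put("b", 2);

        System.out.println(wrapped + " " + concurrent);
    }
}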

Sorted:
HashMap is not sorted. TreeMap is sorted by key, and LinkedHashMap preserves insertion order.

3. Vector, List

Vector is synchronized and thread-safe.

4. LinkedList, ArrayList

LinkedList is essentially a chain of linked nodes. ArrayList is essentially an array with index access; once the array fills up, a larger one must be allocated and the contents copied over. Both waste some space: LinkedList needs extra space per node, while ArrayList keeps spare capacity at the tail.
LinkedList use cases: no random access by position; inserting and deleting in the middle or at the head; sequential traversal.
ArrayList use cases: frequent random access by position; appending at the tail.

Keywords

  1. extends
    Inheritance.
    Java allows only single inheritance.
  2. implements
    Java allows implementing multiple interfaces.
  3. protected
    The protected keyword allows subclasses to access the attributes directly.
  4. static
    Marks a variable, method, or nested class as belonging to the class itself rather than to instances.
  5. final
    The variable cannot be reassigned after initialization.
  6. abstract
    Declares a class that cannot be instantiated, or a method without a body that subclasses must implement.
  7. transient
    The variable will not be serialized.

Concurrent Programming

  1. Runnable and Thread
    Runnable is usually implemented, while Thread is usually extended. With Runnable, different threads can share the task's data (which then needs mutual exclusion via synchronized; note that you synchronize the code, not the data itself); with Thread, each thread's internal data is independent.
  2. volatile
    Can only be applied to variables.
    The optimizer must carefully re-read the variable's value on every use instead of using a copy cached in a register; a volatile variable will not be optimized away by the compiler.
    Guarantees that writes are flushed to main memory immediately, i.e. visibility, but does not guarantee atomicity.
    Does not block threads.
    Guarantees ordering for the variable itself (via memory barriers). Suitable for state variables (boolean) and Java's double-checked locking (a sketch follows the increment examples below).
  3. synchronized
    Guarantees that at any moment only one thread uses the variable or executes the code block, providing atomicity, visibility, and ordering over larger regions (beyond single reads and writes).
    May block threads.
    We know that increment is a non-atomic compound of atomic operations (a read plus a write), and there are three ways to guarantee atomicity:
// 1. synchronized method
private int inc = 0;
public synchronized void increase() {
    inc++;
}
// 2. explicit lock
Lock lock = new ReentrantLock();
public void increase() {
    lock.lock();
    try {
        inc++;
    } finally {
        lock.unlock();
    }
}
// 3. atomic wrapper classes
public AtomicInteger inc = new AtomicInteger();
public void increase() {
    inc.getAndIncrement();
}
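And the double-checked locking mentioned in the volatile notes above; a minimal singleton sketch, where volatile is what prevents another thread from observing a half-constructed instance:

public class Singleton {
    // volatile forbids the reordering that could publish a partially constructed object
    private static volatile Singleton instance;

    private Singleton() {}

    public static Singleton getInstance() {
        if (instance == null) {                  // first check: lock-free fast path
            synchronized (Singleton.class) {
                if (instance == null) {          // second check: under the lock
                    instance = new Singleton();
                }
            }
        }
        return instance;
    }
}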
  4. Thread methods for giving up the CPU: priority, yield, sleep, join
    Priority:
    1-10, default 5. Used to improve program efficiency; a higher-priority thread has a higher probability of winning CPU time.
    Thread t = new MyThread();
    t.setPriority(8);
    t.start();

yield:
The thread yields to other threads of the same priority.
Thread.yield()
sleep:
Pauses the thread, giving all threads a chance to run. The specified sleep time is a minimum.
Thread.sleep(123);
join:
Makes the current thread wait for another thread to finish (joins onto its tail).
Non-static.

Thread t = new MyThread();
t.start();
t.join();
  5. wait, notify, notifyAll
    Unlike sleep, wait is a method of Object. After wait, another thread must call notify or notifyAll before execution can continue (see the sketch below).
    sleep takes a thread from [running] -> [blocked], then on timeout/interrupt -> [runnable]
    wait takes a thread from [running] -> [wait set], then on notify -> [lock pool] -> [runnable]
    notify and notifyAll are also methods of Object; the Javadoc puts it this way: "Wakes up all threads that are waiting on this object's monitor."
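A minimal one-shot flag sketch of the wait/notify handshake described above; note that wait must be called while holding the object's monitor, and it sits in a loop to guard against spurious wakeups:

public class WaitNotifyDemo {
    private static final Object lock = new Object();
    private static boolean ready = false;

    public static void main(String[] args) throws InterruptedException {
        Thread waiter = new Thread(() -> {
            synchronized (lock) {
                while (!ready) {          // the loop guards against spurious wakeups
                    try {
                        lock.wait();      // releases the monitor and enters the wait set
                    } catch (InterruptedException e) {
                        Thread.currentThread().interrupt();
                        return;
                    }
                }
                System.out.println("woken up, ready = " + ready);
            }
        });
        waiter.start();

        Thread.sleep(100);
        synchronized (lock) {
            ready = true;
            lock.notifyAll();             // moves the waiter to the lock pool
        }
        waiter.join();
    }
}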

GC

How GC works

References live on the stack and objects on the heap; an object with no remaining references gets collected.
The collector runs as a lower-priority background process.

Young gen, Old gen, Permanent gen

The Young gen holds newly created objects and frequently, quickly GCs those that qualify.
The Young gen is further divided into Eden, From Survivor, and To Survivor (8:1:1); one survivor space is always empty. A huge object, however, can be moved into the Old gen after a single GC.
The Old gen holds objects that have survived many GCs (MaxTenuringThreshold, default 15).
The Young gen to Old gen space ratio is roughly 1:2.
The Permanent gen holds statics and interned Strings (to deduplicate String contents). The Major GC that targets the Permanent gen has strict trigger conditions.

Minor GC, Major GC = Full GC

A Minor GC happens in the Young gen.
A Major GC happens in the Old gen and Permanent gen, and takes longer.

Available collectors

Serial (UseSerialGC): serial collector, copying algorithm.
SerialOld (UseSerialGC): serial collector, mark-compact algorithm.
CMS (Concurrent Low Pause Collector): the classic Stop The World (STW) phases. STW initial mark -- concurrent marking -- concurrent precleaning -- STW remark -- concurrent sweeping -- concurrent reset.
Drawbacks: as a mark-sweep collector it leaves memory fragmentation; it needs more CPU; and it needs a larger heap, starting GC by default when the old gen is 68% full.
ParNew (UseParNewGC): parallel collector, copying algorithm.
G1 GC

Algorithms

Root searching (reachability analysis).
Mark-sweep: scan -- mark -- sweep; live objects are not compacted, leaving memory fragmentation.
Copying: move live objects into a free region, then update the references on the stack; costly, but it solves the fragmentation problem.
Minor GC in the Young gen generally uses the copying algorithm; Major GC generally uses mark-sweep.

System.gc() can trigger a full GC, though not necessarily immediately, and it hurts performance.

Reflection

Principle:

Exception

Throwable
Exception & Error
RuntimeException & other Exceptions

Npm: Notes

Here are some commonly used npm commands.

Install nodejs and npm (for Ubuntu)

  1. Download binaries from Nodejs download, eg. node-v6.11.5-linux-x64.tar.xz

  2. Extract to /home/{ ubuntu username }/Develop/Application

  3. Add export PATH=/home/{ ubuntu username }/Develop/Application/node-v6.11.5-linux-x64/bin:$PATH to the last line of the hidden file /home/{ ubuntu username }/.bashrc

  4. Execute
    source /home/{ ubuntu username }/.bashrc

  5. Execute
    node --version; npm --version
    It will show the versions if the installation is successful.

Get current global location of npm

$ npm config get prefix

Install a package

// global
$ npm i { package name } -g
// local
$ npm i { package name }
// update package.json
$ npm i { package name } -S

Install a package with the specified version

$ npm install { package name }@{ version }

Update a package

$ npm update axios

List packages

// global
$ npm list -g
$ npm list -g --depth=0
// local
$ npm list
$ npm list --depth=0

Uninstall

$ npm uninstall { package name }

Use package.json file to manage the dependencies

$ npm init

Update all packages in package.json.

$ npm install npm-update-all -g
// under the directory of package.json
$ npm-update-all

Run npm for different directory

$ npm { command } --prefix { path/to/another/directory }

Registry config

Set registry
$ npm config set registry { registry URL }
Get current registry
$ npm config get registry
Actually, npm config get/set simply gets/sets key-value pairs in the config map, i.e. registry is a key and its URL is its value in the config map.

Go (golang) Usage Notes

I recently picked golang back up to write some APIs; occasionally I run into something worth noting down. Keeping it simple.


HTTP Restful API server

Spinning up an HTTP Restful API server in Golang is fairly simple.

A comprehensive example:

package main

import (
	"encoding/json"
	"fmt"
	"log"
	"net/http"
)

type Greeting struct {
        // must be exported ("Content", not "content") so that encoding/json can marshal it
	Content string
}

func enableCors(w *http.ResponseWriter) {
	(*w).Header().Set("Access-Control-Allow-Origin", "http://localhost:8080")
}

func helloHandler(w http.ResponseWriter, r *http.Request) {
	enableCors(&w)
	name := r.URL.Query().Get("name")
	if name == "" {
		w.WriteHeader(422)
		fmt.Fprintf(w, "Invalid input.")
		return
	}
	greeting := Greeting{Content: "Hello, " + name}
	jsonResult, err := json.MarshalIndent(greeting, "", "\t")
	if err != nil {
		log.Fatal(err)
	}
	w.Header().Set("Content-Type", "application/json")
	w.Write(jsonResult)
}
func main() {
	http.HandleFunc("/hello", helloHandler)
	log.Fatal(http.ListenAndServe(":8080", nil))
}

Small but complete. The few lines above cover not only basic HTTP handling, but also the response body's data model, error handling, CORS, logging, and JSON marshalling. Moreover, reading the http package source shows that every HTTP request runs in its own goroutine, so high concurrency is supported natively. See Handling 1 Million Requests per Minute with Go.

Output

Once it is running, visit http://localhost:8080/hello?name=Steve in a browser, and you will see:

{
	"Content": "Hello, Steve"
}

Reference

https://golang.org/pkg/net/http/

defer

Marks code for deferred execution: it is called last, right before the enclosing function returns. The classic defer use case is the IO close operation, so the close code sits right after the open, which reads beautifully. Another use is pairing defer with panic to handle program errors.

An example

package main

import "fmt"

func main() {
	defer fmt.Println("4")
	defer fmt.Println("3")
	fmt.Println("1")
	fmt.Println("2")
}

Output

1
2
3
4

Reference

https://tour.golang.org/flowcontrol/12

parse XML string to struct

Golang's built-in xml.Unmarshal converts XML into a struct conveniently. json.Unmarshal is similar and used the same way.

A comprehensive example

package main

import (
	"encoding/xml"
	"fmt"
	"log"
)

// Department data model
type Department struct {
	Name    string `xml:"Name"`
	Members []struct {
		Name string `xml:"Name"`
	} `xml:"Members>Member"`
}

func parseXML(data interface{}, xmlContent string) (err error) {
	err = xml.Unmarshal([]byte(xmlContent), data)
	if nil != err {
		log.Fatal("Error unmarshalling from XML", err)
	}
	return
}

func main() {
	departmentXML := `
		<Department>
			<Name>Agents of S.H.I.E.L.D.</Name>
			<Members>
				<Member>
					<Name>Nick Fury</Name>
				</Member>
				<Member>
					<Name>Natalia Alianovna Romanova</Name>
				</Member>
			</Members>
		</Department>
	`
	department := &Department{}
	parseXML(department, departmentXML)
	for i := 0; i < len(department.Members); i++ {
		fmt.Println(department.Members[i].Name)
	}
}

Output

Nick Fury
Natalia Alianovna Romanova

If hand-writing the type struct feels tedious, a tool can convert a concrete XML document into a type struct in one click, e.g. XML to Go struct.

References

https://golang.org/pkg/encoding/xml/
https://golang.org/pkg/encoding/json/
https://www.onlinetool.io/xmltogo/

Memory allocation and value initialisation in Go

There are 4 ways: &T{}, &localVariable, new, and make. The differences become clear from the example below.

Examples

package main

import (
	"fmt"
	"reflect"
)

// User data model
type User struct {
	Name string
}

func main() {
	fmt.Println(reflect.TypeOf(&[]User{}))
	var localVar []User
	fmt.Println(reflect.TypeOf(&localVar)) // localVar has to be a local variable
	fmt.Println(reflect.TypeOf(new([]User)))
	fmt.Println(reflect.TypeOf(make([]User, 1)))
}

Output

*[]main.User
*[]main.User
*[]main.User
[]main.User

Reference

https://tour.golang.org/moretypes/13

Cross-compilation commands

linux
GOOS=linux GOARCH=amd64 go build -ldflags "-w -s" -o {name_of_target_file}
windows
GOOS=windows GOARCH=amd64 go build -ldflags "-w -s" -o {name_of_target_file}
mac
GOOS=darwin GOARCH=amd64 go build -ldflags "-w -s" -o {name_of_target_file}

Reference

https://gocn.vip/question/50

panic() vs log.fatal()

Golang encourages propagating errors up the call chain level by level, or handling them in place, rather than swallowing exceptions Java-style. Sometimes a serious error requires stopping the program or the current goroutine, and there are two ways to do it: panic and log.fatal.

panic()

log.fatal()

// TODO

General Approaches to Hive Data Skew

When doing ETL with Hive, you often run into data skew; here is a summary.

Categories

The cases most likely to come up:

  1. join
  2. group by
  3. count distinct

Solutions

join

Data skew in a join usually happens because certain keys carry a very large share of the rows. The general approaches are:

1. Deduplicate or prune the data.

2. For a large table joined to a small table, use a mapjoin.

3. If the specific hot keys are known, use a skewjoin.

4. Reconsider whether the business actually needs this join: is a Cartesian product over such heavy keys reasonable, and is there a workaround?

5. In special cases none of the above four methods work, and the situation needs case-by-case analysis. Here is a classic "big seller" example.

We have 2 tables: a page-view table and a seller table.
dwd_pv page-view table fields:
visit_time, product_id, seller_id
Assume 1 billion rows.
dim_seller seller table fields:
seller_id, seller_name
Assume 10 million rows.

Query:

SELECT visit_time, product_id, pv.seller_id, seller_name
FROM dwd_pv AS pv
LEFT OUTER JOIN dim_seller AS slr
ON pv.seller_id = slr.seller_id;

This is just an ordinary column-enriching SQL, but when certain sellers have exceptionally heavy traffic it causes data skew.
To solve it, let's first look at what happens in Hive's join (Shuffle Join):

  1. map:
  2. shuffle:
  3. reduce:
    (to be continued)

group by

Data skew caused by group by is similar to the join case: some value of a grouped-by column is too heavy. The general approaches are:

  1. Deduplicate or prune the data.
  2. The MRR approach: group by twice. First group at a finer granularity for a preliminary aggregation, scattering the heavy values to avoid hot spots; then group the pre-aggregated data at the target granularity.

count distinct

Approach:
key splitting.

Frequently Used Commands for Machine Learning

GPU

# Nvidia GPU information and status check
$ sudo nvidia-smi

# Nvidia GPU information and status check with trend and metrics (recommended)
$ sudo nvidia-smi --query-gpu=timestamp,pstate,temperature.gpu,utilization.gpu,utilization.memory,memory.total,memory.free,memory.used --format=csv -l 1

# Refresh GPU info every 5 seconds
$ watch -n 5 nvidia-smi

# Check version of CUDA
$ nvcc -V

Check if Torch is using GPU
In python shell:

import torch
torch.cuda.get_device_name(0)

Check if tensorflow is using GPU
In python shell:


Github: Deploy a Single-Page Web App on GitHub Pages

Develop the page with Vue, bundle it with Webpack, and deploy it to GitHub Pages.

1. Taking a Vue app as the example: initialize a blank webpack project with vue-cli and bundle it into static pages

  1. Install vue-cli first
  2. Initialize with the webpack template
    $ vue init webpack my-github-page
  3. Enter the new folder
    $ cd my-github-page
  4. Install dependencies
    $ npm install
  5. In config/index.js, under the build module, change assetsPublicPath to ./
  6. In build/webpack.prod.conf.js, set removeAttributeQuotes to false
  7. $ npm run build
    This generates the static pages under the dist/ directory

2. Create a GitHub Page

  1. On your GitHub, create a new repo named username.github.io, where username must be the same as your GitHub username.
  2. In the <username>.github.io repo, go to Settings, and under GitHub Pages set Source to the master branch.

3. Push the static pages to the username.github.io repo

$ git clone { git link of username.github.io }
  1. Copy all files under dist/ from step 1 into the username.github.io folder
$ cd username.github.io
$ git add .
$ git commit -m "init"
$ git push

Finally, after about 2 minutes, open https://username.github.io in a browser and you will see your page.

Automation script

#!/bin/bash 
set -x

BLOG_REPO="joldnine.github.io"
BLOG_DEV="vuejs-blog"

cd $BLOG_DEV
npm run build >/dev/null
cd ..
rm -rf $BLOG_REPO/static
rm -rf $BLOG_REPO/index.html
cp -r $BLOG_DEV/dist/* $BLOG_REPO/
cd $BLOG_REPO
git add .
git commit -a -m "deploy"
git push origin master

Cassandra: Note

  1. Populate all schemas in the node by cqlsh.
ks="$(cqlsh -e 'DESC keyspaces;')"
arr=(${ks// / })
mkdir schemas_temp_folder
for i in "${arr[@]}"
do
    if [[ ("$i" != "system_traces") && ("$i" != "system") ]]; then
        cqlsh -e "DESC keyspace $i;" | tee schemas_temp_folder/$i-schema.cql > /dev/null
    fi
done
  2. Import all cql files into Cassandra.
files="$(ls)"
arr=(${files// / })
for i in "${arr[@]}"
do
    echo $i
    cqlsh  --request-timeout=3600 -f $i
done
  3. Convert sstables into JSON
sstables_dir=$1
sstable2json='/home/ubuntu/Application/apache-cassandra-2.1.19/tools/bin/sstable2json'
output_dir=$2
cd $sstables_dir
files="$(ls)"
arr=(${files// / })
for i in "${arr[@]}"
do
	if [[ $i = *"-Data.db" ]]; then
		echo $i
		$sstable2json $sstables_dir/$i | tee $output_dir/$i.json > /dev/null
	fi
done

Build CNN Model with Python+Numpy only from Scratch

Today we are going to build a Convolutional Neural Network with Python+Numpy only, without any machine learning packages or frameworks. Eventually, we are going to train an MNIST model using our "home-made" CNN framework.

Structure

Model

Optimizer

Adam

CNN Operations

Convolutional

Pooling

Dropout

Kernel Initializer

One Sentence Concepts

I have always wanted to write a concepts list where every concept is explained in one sentence with strong intuition, so that it can be understood by a beginner or even an interested layman, while the veteran will applaud it. As we know, when explaining a concept it is easy to add more notes and details, but it is very challenging to shorten the explanation.

Here is the list. I will keep adding instances to the list and modify the concepts, occasionally :>

Front End

Flux, Redux, Vuex, etc.

The database for the engineered front-end project.

Immutable

A state type whose mutations can be detected and recovered easily, because any change to the object implies replacing the whole object (the reference, i.e. the object's memory address, is replaced).

Java

Spring Beans

The initialized objects managed by Spring container.

Spring Dependency Injection

Flexibly define the dependencies for a Spring Bean.

Java Reflection

Get the information of a Java class at runtime.

Data Science

Decision Tree

Data Transformation

MapReduce

Spark

Database

Functional Dependency

Normalization

CAP

nosql

How to install a specific version of a formula (package) in Homebrew

Step 1: Clone the Homebrew repository

git clone https://github.com/Homebrew/homebrew-core.git

Step 2: Log the formula history

cd homebrew-core
git log master -- Formula/git.rb

Step 3: From the history, find the commit hash of the version you want to install

commit 0c49ceffe4944b095da4d0c39a6b8499714d0df8
Author: BrewTestBot [email protected]
Date: Tue Aug 17 04:07:34 2021 +0000
git: update 2.33.0 bottle.

Note the first 10 characters of the commit hash, which will be 0c49ceffe4 in the example

Step 4: Install the specified version referenced by SHA

brew install https://github.com/Homebrew/homebrew-core/raw/0c49ceffe4/Formula/git.rb

You can also use the full SHA of the commit.

Types of Index

Revise types of index used in databases from the class CS4221, National University of Singapore.

(Data Structure) B+ Tree Index vs RB Tree Index, Hash, Full-text

A B+ tree has more nodes in one layer, so each search needs less IO.
An RB tree is too deep.

(Space Saving) Sparse vs Dense

Sparse: not all keys are indexed. Unstable search speed but less space.
Dense: all keys are indexed. Faster search but more space.

(Physical Storage) Clustered vs Unclustered

Clustered index: rows on disk are stored in the same order as the index, so there can be only one clustered index for the rows.
With a non-clustered index, there is a second list that has pointers to the physical rows.

(Logical) Primary vs Secondary, Composite

The primary key is unique.
A secondary key may be non-unique.

Frequently Used K8S Commands

Some frequently used kubectl commands.

Imperative and Declarative Commands

Imperative Commands

Create objects:
kubectl run --image nginx nginx
Example:
Create a pod with exposed clusterip service:
kubectl run webapp --image=webapp --port=80 --expose

kubectl create deployment --image nginx nginx
kubectl expose deployment nginx --port=80
Update objects:
kubectl edit deployment nginx
kubectl scale deployment nginx --replicas=5
kubectl set image deployment nginx nginx=nginx:1.18

Dry run and check the correctness:
xxx --dry-run=client -o yaml

Use YAML files:
kubectl create -f nginx.yaml
kubectl replace -f nginx.yaml
kubectl delete -f nginx.yaml

Declarative Command

kubectl apply -f nginx.yaml

Object related commands

Pod

List pods of a namespace
kubectl get pods -n=NAMESPACE

List pods
kubectl get pods --all-namespaces
kubectl get pods -n NAMESPACE --sort-by .metadata.creationTimestamp | grep Pending

Describe a pod
kubectl describe pod POD_NAME -n=NAMESPACE

Get namespaces
kubectl get namespaces

Get logs of the init container of a pod
kubectl logs POD_NAME -c INIT_CONTAINER_NAME

Go into bash shell of a pod
kubectl exec -it POD_NAME -n NAMESPACE -- /bin/bash

Service Account

List service accounts of a namespace
kubectl get serviceaccounts -n NAMESPACE

Describe the secret of a service account
kubectl describe secret SERVICE_ACCOUNT -n NAMESPACE

Nodes

kubectl get nodes --show-labels

kubectl describe node NODE_NAME

kubectl drain NODE_NAME --delete-local-data --ignore-daemonsets --force

PVC

kubectl get pvc -n NAMESPACE

Patch a hanging pvc
kubectl patch pvc PVC_NAME -p '{"metadata":{"finalizers":null}}' -n NAMESPACE

Configmap

kubectl describe configmap CONFIGMAP_NAME -n NAMESPACE

Resource

List the resources utilized by pods or nodes
kubectl top node/pod -n NAMESPACE

Proxy

Launch K8S proxy server in local to access k8s APIs
kubectl proxy --port=XXXX

Git: git squash

Sometimes we want to 'squash' multiple commits into one. We can always do it with an interactive rebase.
Example:

$ git log --oneline

We have:

5f2f5fb commit-3
2506e7a commit-2
2800c01 commit-1
12s21f commit-0

We hope to merge three commits (commit-1, commit-2 and commit-3) into one.

Steps:
$ git rebase -i 12s21f

We have:

pick 2800c01 commit-1
pick 2506e7a commit-2
pick 5f2f5fb commit-3

....

Modify it into:

pick 2800c01 commit-1
s 2506e7a commit-2
s 5f2f5fb commit-3

....

We save it by:
Press Esc
Type :wq
Press Enter

After that, we will be asked to edit the commit message; edit and save it the same way:
Press Esc
Type :wq
Press Enter

Done!

Vim Cheatsheet

While switching phones I came across the Vim commands cheatsheet I kept in my freshman year. I still use vim often at work, so let me record it here. Note: Linux environment.

Common basic operations

Enter insert mode
i
Return to normal mode (unless stated otherwise, the commands below are for normal mode)
Esc
Go to the first line of the file
1G
1gg
:1 + Enter
Go to the last line of the file
G
Save and quit
:wq or :x or ZZ
Save
:w
Quit
:q
Quit without saving
:q!
Undo
u
Open a new line
o or O
Delete a word
dw
Delete to the end of the line
d$
Delete a line
dd
Paste
p
Delete 3 lines
3dd
Copy a line
yy
Copy three lines
3yy

Common compound operations

Comment out lines in batch

Method 1

In normal mode:
ctrl + v
Select the lines to comment with up/down
shift + i
Type the comment symbol, # or //
esc

Method 2

In normal mode:
:start_line,end_lines/^/comment_symbol/g
Example:
If the comment symbol is #
:1,100s/^/#/g
If the comment symbol is //
:1,100s#^#//#g
The second method is uglier and harder to remember; personally I don't like it much.

Uncomment lines in batch

Method 1

In normal mode:
ctrl + v
Select the comment symbols of the lines to uncomment
d

Method 2

:start_line,end_lines/^comment_symbol//g
Example:
If the comment symbol is #
:1,100s/^#//g
If the comment symbol is //
:1,100s#^//##g

Batch replace

Global replace

:%s/source_regex/replacement
Example:
Replace all 111 with 222
:%s/111/222

Replace in the cursor's current line

:s/source_regex/replacement

Replace within a specified line range

:start_line,end_lines/source_regex/replacement
Example:
:1,100s/111/222

pretty print one line json

:%!python -m json.tool

AWS: Deploy a RESTful API with Lambda in 5 Minutes

A recent side project of mine uses Lambda to deploy the entire RESTful back end.
Taking a simple CRUD service as an example, the steps are as follows:

Create a Lambda function

  1. Write our api's CRUD business logic as a function:
    Create an api.py, with example content as follows:
import sys
import logging
import pymysql
import rds_config
import json
logger = logging.getLogger()
logger.setLevel(logging.INFO)
class DbAccess(object):
    def __init__(self):
        rds_host = rds_config.db_host
        db_username = rds_config.db_username
        db_password = rds_config.db_password
        db_name = rds_config.db_name
        # try catch is required in prod
        self.conn = pymysql.connect(rds_host, user=db_username, passwd=db_password, db=db_name, connect_timeout=5)
        logger.info("SUCCESS: Connection to RDS mysql instance succeeded")
def get(event, context):
    field_a = event['field_a']
    rows = []
    conn = DbAccess().conn
    with conn.cursor() as cur:
        # prevent SQL injection
        cur.execute('SELECT * FROM db_name.table_name WHERE field_a=%s;', field_a)
        for row in cur:
            rows.append(row)
    return {"body": json.dumps(rows), "statusCode": 200}

Create an rds_config.py to hold the database connection config:

db_host = 
db_username = 
db_password = 
db_name = 
  2. At runtime Lambda spins up a tiny Linux container with no dependencies installed, so before uploading to Lambda we must install the dependencies locally.
    In our example the only third-party dependency is pymysql, so in the same directory as the files from step 1, run:
$ pip install pymysql -t .

We now have 4 files/folders:

api.py
rds_config.py
pymysql/
PyMySQL-0.8.0.dist-info/

Note

When deploying a python script to Lambda, the third-party dependencies you pull in (eg. sklearn) must not depend on C libraries, because C libraries installed on a local non-Linux machine will not work inside Lambda's container. There are 2 workarounds: 1. install the third-party dependencies in a Linux environment (docker, a VM, or another machine); 2. switch the language from python to Node.js or Golang (note: on Lambda, Golang's performance is about the same as Java's).

  3. Pack the four files/folders above into one zip archive.

  4. Open

aws console -- Lambda -- Create function

Select Author from scratch,
fill in Name,
choose python3.6 as the Runtime,
pick any sufficiently permissive Role / Existing role,
and click Create function.

  5. Under Function Code - Code entry type (Upload a .ZIP file), upload the zip we just packed.
    Change the Handler to api.get
    Save

Create an api in API Gateway

  1. aws console -- API Gateway -- Create API
    Select New API, give it an API name, and click Create API.
  2. Resources -- Actions -- Create Method;
    for the method that newly appears in the list, select POST and confirm with the check mark.
  3. In the setup, fill in the name of the lambda we just created and click Save.

That completes a Lambda-backed API. Afterwards you may still need to Enable CORS and deploy the API; configure as needed.

Go Error Handling

I originally meant to put error handling into the "Go Usage Notes" post, but there is honestly a lot of it, so it gets its own article!

Background

Anyone who has used Go for a while will probably complain: why is nearly half the code handling err??
This is because Go has no Java-style exception catching; instead, funcs propagate err up level by level. The upside is that it encourages engineers to handle errors actively instead of ignoring them; the cost is a lot of err-handling code. The Go community has had plenty of heated discussion (and complaining) about err handling; roughly summarized, there are the following approaches:

TODO

Life: Eason Chan Concert

Just went to an Eason Chan concert.

To be honest, from the start until it was almost over, I felt quite disappointed. He sang only songs from the new album that nobody had heard, and without his old partner Lin Xi writing the lyrics, the new songs felt hollow and affected; by the time people were leaving, my feed was full of complaints that the songs were unfamiliar and unremarkable.

But after Eason and the crew had left the stage, even with the stage lights dimmed, everyone kept chanting "Eason Eason Eason" and "encore", refusing to give up. Then, miraculously, the stage lights suddenly came back on, and the intro of "让我留在你身边" ("Let Me Stay by Your Side") drifted out. In that instant the whole arena erupted. I was genuinely happy, because it is the favorite song of the two of us. This is our Eason and our youth!

Of course we all know it was part of the show, but it still felt wonderful.

Java: Using Reflection to Detect Generic Types and Objects

Java Reflection can obtain information about Java objects at runtime. There are many practical tutorials online, e.g. Java Reflection Example Tutorial.

A while ago I ran into an interesting problem: determine the type of every field of a class at runtime, the hard part being to decide whether a field is a generic type or a plain object (java.lang.Object). This post records the solution to that hard part.

The approach:

  1. Get the Class<?> object via reflection or a classloader.
  2. Get the fields with getDeclaredFields().
  3. field.getType().equals(Object.class) decides whether it is java.lang.Object.
  4. field.getGenericType() instanceof TypeVariable decides whether it is a generic type.
    Note: inside the JVM a generic type erases to java.lang.Object, which is why the extra check is needed.

The complete code:

// requires java.lang.reflect.Field and java.lang.reflect.TypeVariable
try {
    for (Field field : clazz.getDeclaredFields()) {
        try {
            if (field.getType().equals(Object.class)) {
                if (field.getGenericType() instanceof TypeVariable) {
                    System.out.printf("Class: '%s' has generic type field: '%s'%n", clazz.getName(), field.getName());
                } else {
                    System.out.printf("Class: '%s' has Object type field: '%s'%n", clazz.getName(), field.getName());
                }
            }
        } catch (TypeNotPresentException e) {
          // do something
        }
    }
} catch (NoClassDefFoundError e) {
  // do something
}

Python Knowledge Points

1. Python's max and min values

Max integer
sys.maxsize
Min integer
-sys.maxsize - 1

2. How to write a static method in Python

class A:
    @staticmethod
    def my_func0():
        pass

Note the difference from @classmethod:

class Person:
    age = 25

    @classmethod
    def printAge(cls):
        print('The age is:', cls.age)

# printAge is a class method: cls is bound to Person itself
Person.printAge()

AWS Certified Machine Learning Preparation Notes

Recently, in December 2020, I passed the AWS Certified Machine Learning exam with a score of 931/1000. Here are my preparation notes for the exam.

ML Domain Knowledge

Feature Engineering

  1. Imputing missing data
    Mean/median replacement, median is better when got outliers
    Dropping
    KNN, deep learning, regression (MICE, multiple imputation by chained equations)
    Just get more data

  2. Handling unbalanced data
    Oversampling
    Undersampling
    SMOTE, synthetic minority over-sampling technique
    Adjusting thresholds

  3. Handling outliers
    What is an outlier:
    a data point that lies more than a certain number (e.g. one) of standard deviations from the mean.
    How to resolve:
    remove outliers once you understand them.

  4. Techniques of feature engineering
    Binning
    Transforming with functions
    Encoding
    Scaling/normalization
    Shuffling

  5. TF-IDF Score
    Term Frequency and Inverse Document Frequency: figure out which terms are most relevant for a document (see the formula after this list)
    Term frequency: how often a word occurs in a document
    Document frequency: how often a word occurs in an entire set of documents --> common words that appear everywhere
    Relevancy of a word to a document: TF/DF == TF*IDF (IDF = 1/DF) --> how often the word appears in a document over how often it appears everywhere, i.e. how important and unique this word is for this document
    An extension: use uni-grams, bi-grams, tri-grams, n-grams
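A common formulation of the score sketched above (one of several variants; the note's TF/DF intuition corresponds to the log-scaled IDF here):

$$ \mathrm{tfidf}(t, d) = \mathrm{tf}(t, d) \times \log\frac{N}{\mathrm{df}(t)} $$

where tf(t, d) is the count of term t in document d, N is the total number of documents, and df(t) is the number of documents containing t.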

Modelling

Deep Learning

  1. Activation functions
    Define the output of a node given its input signals
    Linear activation function: no backpropagation
    Binary step function: no multi-class classification, not good for calculus
    Sigmoid, logistic, tanh (for RNN)
    Rectified Linear Unit (ReLU)
    Leaky ReLU
    Parametric ReLU, negative slope is learned via backpropagation
    Exponential Linear Unit (ELU)
    Swish for really deep NNs
    Softmax: final output layer, probability of each class

  2. CNN
    What are they for:
    Images, translation, sentence classification, sentiment analysis
    Feature-location invariant
    How do they work:
    Local receptive fields

  3. How to fix the vanishing gradient problem
    Multi-level hierarchy
    Long short-term memory
    Residual networks
    Better choice of activation function, e.g. ReLU

  4. Batch size
    Small --> escapes local minima more easily, better

  5. Regularization, dropout

Metrics

Recall (TP/(TP+FN))
Precision (TP/(TP+FP))
F1 2PR/(P+R)
ROC, AUC
RMSE

AWS

AWS Data Engineering

Glue: Data Catalog, Crawler, ETL (Spark),
Catalog, crawler

Athena
Serverless query from S3, columnar data formats (Parquet, Apache ORC) faster perf

QuickSight
Visualization, ML Insights, adhoc

Streaming
Kinesis: source —> Streams —> Analytics —> Firehose —> s3/redshift

Step Functions
Workflow

Batch
Resource and schedule

EMRFS
Access S3 as if it were HDFS

AWS Embedded Algorithms

XGBoost
Subsample, Eta prevent overfitting

The Amazon SageMaker Latent Dirichlet Allocation (LDA) algorithm
an unsupervised learning algorithm that attempts to describe a set of observations as a mixture of distinct categories.
Latent Dirichlet Allocation, unsupervised, topic modelling, use CPU

Random Cut Forest
Anomaly detection

Neural Topic Model
Classify or summarize documents based on topics, Unsupervised

Factorization Machines
Sparse data, click prediction, item recommendation, recordIO float32

IP Insights
Unsupervised, suspicious behavior from IP addresses, CSV only

Seq2seq

Tokens must be integers

BlazingText
Text classification, label
Continuous bag of words, order does not matter

SageMaker RL
Distributed training

Amazon Comprehend
NLP and text analytics, sentiment

Amazon Transcribe
Speech

Amazon Polly
Text to speech

Amazon Lex
Alexa chatbot engine

Amazon Forecast
Time series

Amazon Kendra
Search with natural language

DeepAR
Time series, RNN

Sagemaker

SageMaker Neo
Compiling models using TensorFlow and other frameworks to edge devices such as Nvidia Jetson

Sagemaker production variants
Production new model switching inside Sagemaker

Any bucket with “sagemaker” in the name is accessible with the default SageMaker notebook role.

Sagemaker inference
Container response to 8080 for /invocations and /ping
Artefacts in tar format

Pipe mode
Streams data directly to container, improve perf

Most Amazon SageMaker algorithms work best when you use the optimised protobuf recordIO format for the training data.

Python: Submit multi parameters function to Executor

In Python, ThreadPoolExecutor and ProcessPoolExecutor, subclasses of Executor, are easy-to-use modules for multitasking.

They are easy to use because we only need to submit a task, consisting of a function and its parameters, to the executor, and the executor will run the tasks asynchronously for us.
For example,

from concurrent.futures import ThreadPoolExecutor


def say_something(var):
    print(var)

pool = ThreadPoolExecutor(2)
pool.submit(say_something, 'hello')
pool.submit(say_something, 'hi')

Output (the order maybe different):

hello
hi

However, the function may have multiple parameters.

def say_something(var1, var2):
    print('{}: {}'.format(var1, var2))

For such cases, we can use a lambda to do the trick.

from concurrent.futures import ThreadPoolExecutor


def say_something (var1, var2):
    print('{}: {}'.format(var1, var2))

arr1 = ['name', 'Joldnine']
arr2 = ['email', '[email protected]']
pool = ThreadPoolExecutor(2)
pool.submit(lambda p: say_something(*p), arr1)
pool.submit(lambda p: say_something(*p), arr2)


Output:

name: Joldnine
email: [email protected]

Web Security

Outline

  1. Introduction to Certificates and HTTPS
  2. Session Cloning Attacks
  3. Same-origin Policy
  4. Cross-site Scripting (XSS) and defense
    4.1 Reflected XSS
    4.2 Persistent XSS
    4.3 DOM-based XSS
  5. Cross-site Request Forgery (CSRF)
  6. SQL Injection
  7. Password attacks
  8. Phishing
  9. Clickjacking
  10. Web SSO Attacks
  11. HTTP parameter pollution
  12. HTTP parameter tampering

Illustrated Load Balancing: Five Schemes, Four Implementations, AWS ELB

TODO

  • Even Task Distribution Scheme
  • Weighted Task Distribution Scheme
  • Sticky Session Scheme
  • Even Size Task Queue Distribution Scheme
  • Autonomous Queue Scheme

HTTP Redirect
DNS
Reverse Proxy
Direct Routing (DR)

Classification of User Category depending on Internet Usage.

Hello guys.

I have a 12GB .tgz file. Inside that file, there are .csv.gz files.
I want to use this data for machine learning to classify user categories.
Before I jump into this big file, I wanted to train on only one .csv file inside the archive (for learning). It is a 108MB file and it has data something like this:

(screenshot of sample data)

The output of the machine learning prediction will be a number that represents the category of the user.
Which ML algorithm do you suggest? I am not sure how I should proceed.

I learned SVM, Naive Bayes, KNN, and Decision Trees before, but those datasets were easy.
Like only two outputs: cancer (1) or not cancer (0).
For this kind of dataset, how should I approach it?

Thanks.

Frequently Used Linux Commands

Some common commands and command-line tools on Unix-like systems.

Basics

$ mkdir $HOME/testFolder

$ cd $HOME/testFolder

$ cd ../

$ mv $HOME/testFolder /var/tmp

$ rm -rf /var/tmp/testFolder

$ ls /etc

$ touch ~/testFile

$ ls ~

$ cp ~/testFile ~/testNewFile

$ rm ~/testFile

$ cat ~/.bash_history

# Find a string in a file.
grep 'root' /etc/passwd

# Find a string under the folder.
$ grep -r 'linux' /var/log/

$ grep -r "linux" /var/log/ --include="*.log"

$ cat /etc/passwd | grep 'root'

$ ls /etc | grep 'ssh'

$ echo 'Hello World' > ~/test.txt

$ ping -c 4 cloud.tencent.com

$ ps -aux | grep 'ssh'

netstat

netstat displays all kinds of network-related information: connections, routing tables, interface status, and so on.

# List all TCP ports in the listening state
$ netstat -lt

# Show all port information, including PID and process name
$ netstat -tulpn

refs: Tencent Cloud Lab

tar

tar is a simple archiving and (de)compression tool. A tar suffix means the files are merely packed together; a gz suffix means compressed.

# Compress
$ tar -cvzf <target_file> <source file/folder>

# Decompress
$ tar -xvzf <source_file>

# Pack without compressing
$ tar -cvf <target_file> <source file/folder>

c means create an archive;
z means a gzip-compressed archive;
x means extract;
v means verbose, printing progress;
f means the next argument is the file.

scp

# Download
$ scp -i <pem file> <username>@<ip>:<remote/path> <local/path>

# Upload
$ scp -i <pem file> <local/path> <username>@<ip>:<remote/path> 

Suppress all logs

$ {{ commands }} >/dev/null 2>&1
# `>/dev/null` redirects `stdout` to the void, and `2>&1` redirects `stderr` to `stdout`

find

# Find a file under a directory
$ find ./dir -name "*.h"

# Delete the folders that do not match name_pattern at a path depth of 1.
$ find { DIR } -mindepth 1 -maxdepth 1 -not -name '{ name_pattern }' -type d -exec rm -rf {} +

# For multiple patterns with `or`,
$ find { DIR } -mindepth 1 -maxdepth 1 \( -not -name "*.py" -o -name "*.html" \) -type d -exec rm -rf {} +

cut

TODO

alias

For example, I want to add an alias for the Linux WeChat app.
Add the following command to the last line of the file ~/.bashrc. (I have an NPM project for the electronic-wechat project.)
alias wechat="npm start --prefix ~/Develop/WorkSpace/electronic-wechat"

Some useful alias:

alias untar='tar -zxvf '
alias ping5='ping -c 5'
alias www='python -m SimpleHTTPServer 8000'
alias ipe='curl ipinfo.io/ip'
alias ipi='ipconfig getifaddr en0'
alias c='clear'

wc

# Count the lines.
$ ... | wc -l
# Count the bytes.
$ ... | wc -c
# Count the characters.
$ ... | wc -m
# Count the words.
$ ... | wc -w

type

Get the type of a command.

$ type cd
cd is a shell builtin
$ type type
type is a shell builtin

disk/fs related

$ lsblk
$ df -h
$ du -sh
$ du -sh *
$ sudo resize2fs /dev/xvdf

date

$ yesterday=`TZ=Singapore date --date="-1 day" +%Y%m%d`
$ echo $yesterday # 20190117

kill a process by name

$ kill $(ps aux | grep '[k]ill_me.py' | awk '{print $2}')

Fetch API's Error Handling

Error handling with the Fetch API is quite different from the ajax way.

Normally, when the back end returns a non-200 response, the front end may deal with either the response's statusText or the response's body.

Here are the snippets (in our example, the response body is json):

  1. response.statusText
fetch(query).then((response) => {
  if (!response.ok) {
    throw Error(response.statusText);
  }
  response.json().then((response) => {
    console.log(response)
  })
}).catch((error) => {
  // caution: error (which is response.statusText) is a ByteString, so we may need to convert it to string by error.toString()
  console.log(error.toString())
})
  2. response.body
fetch(query).then((response) => {
  if (!response.ok) {
    response.json().then((error) => {
      throw Error(error);
    }).catch(error => {
      console.log(error.message)
    })
  } else {
    response.json().then((response) => {
      console.log(response)
    })
  }
})  

Docker: Notes

Some frequently used Docker commands and scripts.
yum install docker-io -y
docker -v
service docker start
chkconfig docker on

echo "OPTIONS='--registry-mirror=https://mirror.ccs.tencentyun.com'" >> /etc/sysconfig/docker
systemctl daemon-reload
service docker restart
docker pull centos
docker images
docker build -t IMAGE_NAME:latest .

Print logs
docker logs CONTAINER_ID

Create a container from the centos image and use the bash shell inside it
docker run -it centos /bin/bash
exit

List all containers
docker ps -a
List all containers without truncating the output
docker ps -a --no-trunc

Run bash inside a docker container:
docker exec -it CONTAINER_ID /bin/bash

Create and start a new container
docker run -d -p HOST_PORT:CONTAINER_INTERNAL_PORT -e PARAM1=VPARAM1 IMAGENAME

Clean up unused resources
docker system prune

Docker: Use Java JRE from Another Docker

When writing shell scripts, you sometimes need to call a specific version or specifically configured Java from another docker (a standard environment), similar to a python virtual environment; you just call the docker interface.
For example:

function docker_java() {
    java_command=${@:?java command is not specified.}
    docker run --rm \
        -v ${HOST_WORKSPACE}:${CONTAINER_WORKSPACE} \
        ${BUILD_IMAGE} \
        /bin/bash -c "java ${java_command}"
}
docker_java -jar my-jar.jar

Nginx: Serving Static Files

Nginx is a powerful open-source server software supporting HTTP, reverse proxying, and even IMAP/POP3 proxying. See the Nginx Wiki. This article shows how to serve a static page with Nginx, i.e. deploy a Single Page Application. If all goes well it takes about 5 minutes.

Environment

ubuntu 14.04 或者 ubuntu 16.04

Installation

sudo apt-get update
sudo apt-get install nginx

firewall (16.04 only)

sudo ufw app list
sudo ufw allow 'Nginx HTTP'

config

At this point, visiting the IP address already shows Nginx's default welcome page.
Next, point nginx to our static page.

example 1

Edit nginx's default config file /etc/nginx/sites-available/default (the path may differ by version):
set the root entry to the directory containing your static files.

Finally, visit the ubuntu machine's IP address directly in a browser to see the static page we just deployed. Although we visit the bare IP, we are actually hitting port 80 of that IP.

example 2

If you prefer not to pollute nginx's default conf file, create a new conf file as follows.
In the http section of nginx's default config file, add include servers/*.
Create my-site.conf under the servers/ folder.
Configure my-site.conf as:

server {
  listen 8080; # put the port you want here
  location / {
    root /path_to_static_files/; # directory of the static files
    index index.html;
  }
}

SQL: Transaction Isolation Level

Recently, I encountered topics about isolation levels, so I write this article to revise some basic concepts in transaction and transaction isolation levels.

In a database data operation, a transaction is defined as a single unit of work, which may consist of one or multiple SQL statements. It guarantees the single unit work can be wholly committed to the database or rolled back if any statement in it fails. A transaction should be atomic, consistent, isolated, and durable (ACID).

ACID

Atomicity:

A transaction is a single unit of work that should not be divided into smaller units. It means there are only two results of a transaction: whole commit or whole failure (rollback).

Consistency:

The database is always in a consistent state, ie. the data in the DB can be in the state that the transaction is not committed or the state that the transaction is wholly committed. Before and after transactions, the rules (constraints, cascades, triggers, etc) of the database are always met.
It is tricky to place consistency within ACID. In fact, consistency correlates with the other three properties, and A, I, and D together are what guarantee consistency at a given level.

Isolation:

The concurrently running transactions are isolated. For example, suppose row row_a is being modified by transaction t_a (not yet committed, but a few update statements have executed). If transaction t_b tries to read row_a at the same time, it sees the data in the state it was in before the whole transaction t_a started.

Durability:

The data will remain after the commit of a transaction, even in the event of power loss, crashes, or errors. Transactions must be recorded in non-volatile memory, which guarantees that a committed transaction's result is written to disk immediately instead of sitting in the disk cache.

Types of Concurrency Problems in Data Accessing

Lost Update

Transactions overwrite each other's updates.

Dirty Reads

The data is changed by a transaction, but during this process another transaction reads the old value.

Non-repeatable Reads

In a transaction, there is more than one read action, but during these reads, changes are made to the data in other transactions. As a result, two read actions get different results, ie. the read actions are non-repeatable.

Phantom Reads

In a transaction, there is more than one read action, but during these reads, new rows are added in other transactions, which may cause different read results.

Types of Locks

To resolve the concurrency problems, databases usually use locks which will be stored in memory. The choices of lock types introduce several transaction isolation levels to achieve a performance trade-off.

Shared Lock

Exclusive Lock

Update Lock

Transaction Isolation Level

There are four common transaction isolation levels that prevent concurrency problems in data accessing.

Read Uncommitted

Dirty rows (not committed) are allowed to be returned. It provides good performance, but can cause dirty reads.

Read Committed

Read action will wait for the completion of the deletion, updating, or inserting by another transaction.

Repeatable Read

Read action blocks other transaction's update and delete.

Serializable

Read action blocks other transaction's update, delete and insert.

Isolation level     Dirty read   Non-repeatable read   Phantom read
Read Uncommitted    possible     possible              possible
Read Committed      prevented    possible              possible
Repeatable Read     prevented    prevented             possible
Serializable        prevented    prevented             prevented

Figure. Capabilities of Isolation Levels to Prevent Concurrency Problems

Ansible: Note

This note records some common usage of Ansible that may not appear in the Ansible official docs.

Use script module to run local scripts on a remote host.

We may need to use script to run a local script on a remote host, and our bash file test.sh may be as simple as:

echo $1

And our playbook is like:

- hosts: my_remote_host
  tasks:
    - script: ./test.sh first_arg
      register: output
    - debug:
        var: output

However, the output may be empty. A likely reason is that the bash interpreter on the remote host is configured differently, so we need to add an interpreter line to our test.sh, such as:

#!/bin/bash
echo $1

Use script module to run local ansible on a remote host.

With script module, we can also control a host that is connected through an intermediate host.
(diagram: localhost -> intermediate host -> actual target host)

The trick is to run a playbook on the intermediate host, but it requires the intermediate host to have the Ansible config to connect to our actual target host. With this method, we can keep all our scripts on localhost instead of uploading them to the intermediate host.
An example:

# playbook in the localhost
- hosts: intermediate_host
  vars_files:
    - ./vars/main.yml
  tasks:
    - script: './files/run-me-in-the-intermediate-host.yml'  # this local playbook runs on the intermediate host
      register: output
    - debug:
        var: output
      failed_when: '"FAILED! =>" in output.stdout'
  tags: [I-am-a-tag]
#! /usr/bin/env ansible-playbook
# run-me-in-the-intermediate-host.yml
# To be executed in the intermediate host
- hosts: actual-target-host
  tasks:
    - shell: ls
      register: output
    - debug:
        var: output
      failed_when: output.stderr != ''

Dynamically add a host (ubuntu) that uses a PEM.

- name: Add a host
  add_host:
    groups: "{{ GROUP_NAME }}"
    name: "{{ IP }}"
    ansible_user: ubuntu
    ansible_ssh_private_key_file: "{{ PEM_PATH }}"

Add a host that is connected through an intermediate host

- add_host:
    groups: "{{ HOST_GROUPS }}"
    name: "{{ HOST_IP }}"
    ansible_user: ubuntu
    ansible_ssh_private_key_file: "{{ PEM_FOR_HOST }}"
    ansible_ssh_common_args: '-o ProxyCommand="ssh -i {{ PEM_PATH_FOR_INTERMEDIATE }} -W %h:%p -q ubuntu@{{ INTERMEDIATE_HOST_IP }}"'

Confusion of tags on the role and an imported playbook

There is a usage of tags:

- name: A playbook.
  hosts: hostX
  roles:
    - { role: A, tags: [B] }
- import_playbook: a.yml
  tags: [B]

Intuitively, we may think it means running all the tasks tagged B in role A and in the imported playbook a.yml, but that is not true. It actually means adding a tag B to role A and to the playbook-import action.

Syntax to run a playbook with multiple tags or skipped tags

$ ansible-playbook {my playbook} --tags "{tag1}, {tag2}"
$ ansible-playbook {my playbook} --skip-tags "{tag1}, {tag2}"

Front-end: Vue and React Compared

I have done some front-end projects with Vue and some with React. Their core ideas are very similar, but the concrete implementations and usage differ in many ways.

TODO

Java: GC and Primitive Types

Primitive types do not store objects on the heap; their variables are automatically reclaimed when the method ends, so they need no GC.
If you insist on saying these variables are "GC'ed" on the stack, that "GC" just means the memory block is popped and the program counter returns to the address that originally called the function.

SQL: An example of FIRST_VALUE

In a page view table, get the user id of non-login users from their future login page views:

SELECT FIRST_VALUE(user_id, TRUE) -- set TRUE to ignore NULL values
    OVER (
        PARTITION BY device_id 
        ORDER BY visit_time ASC
        ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING -- only get the future
    )
FROM page_view
WHERE stat_date='20190510' -- the table is partitioned by stat_date
;

Redux: When to Render Loading Icon

There is always a scenario where, while data is being fetched from the server, the browser needs to display a spin icon. With react + redux, you might implement it this way:

class SampleComponent extends React.Component {
  componentDidMount() {
    // dispatch the action here to fetch new data from the server and update the store
  }
  render() {
    if (// data is empty) { // or use conditional rendering
      return(<renderSpin />);
    }
    return (<renderWithData />);
  }
}

The implementation looks good at first glance. However, the data may be incorrect in one case:

  1. SampleComponent mounts for the first time and the store is updated; everything is fine.
  2. SampleComponent unmounts.
  3. SampleComponent mounts for the second time and the data fetch fails or takes a long time. The spin is skipped because the data is not empty, but it is the stale data fetched during the first mount.

We can see an unexpected behavior.

To avoid this problem, there are 2 possible solutions.

  1. Destroy the data in componentWillUnmount:
class SampleComponent extends React.Component {
  componentDidMount() {
    // dispatch the action here to fetch new data from the server and update the store
  }
  componentWillUnmount() {
    // dispatch the action here to make the data in the store empty
  }
  render() {
    if (// data is empty) {
      return(<renderSpin />);
    }
    return (<renderWithData />);
  }
}

This method is only applicable when the data is not used by other components, or you are confident that destroying the data upon unmounting of this component is ok.
But if the data is not used in other components, why not use state instead of redux for this data?

  2. Dispatch empty data before fetching from the server in the action:
export const sampleData = () => {
  const url = `${ServerConst.SERVER_CONTEXT_PATH}/api/v1/xxx`;
  return (dispatch) => {
    dispatch({ // dispatch a default value
      type: ActionTypes.SAMPLE_DATA,
      data: [] // or null,
    });
    return fetchApi(url)
    .then(response => response.json())
    .then(data => dispatch({
      type: ActionTypes.SAMPLE_DATA,
      data,
    }));
  };
};

This method is applicable in most cases, but it dispatches twice on each data update.

Comment on this issue if you have other ideas.

Competitive Programming Study Notes

Introduction

Problem solving steps

  1. Read the problem statement
     • Check the input/output specification
  2. Make the problem abstraction
  3. Design the algorithm
  4. Implement and debug

Math

Algebra

  1. Sum of powers

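The standard power-sum identities (presumably what the missing figure showed):

$$ \sum_{k=1}^{n} k = \frac{n(n+1)}{2}, \qquad \sum_{k=1}^{n} k^2 = \frac{n(n+1)(2n+1)}{6}, \qquad \sum_{k=1}^{n} k^3 = \left( \frac{n(n+1)}{2} \right)^2 $$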

2. Fast exponentiation

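Likewise, the usual square-and-multiply recurrence, which computes a^n in O(log n) multiplications:

$$ a^n = \begin{cases} 1 & n = 0 \\ \left(a^{n/2}\right)^2 & n \text{ even} \\ a \cdot \left(a^{(n-1)/2}\right)^2 & n \text{ odd} \end{cases} $$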

3. Gaussian Elimination

It's just the elimination method for systems of equations that we learned back in school :|

Number Theory

  1. Greatest Common Divisor (GCD)
    gcd(a, b) = gcd(a, b − a)
    Running time: O(log(a + b))
    Take care with negative signs of a and b (see the sketch below).
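A minimal sketch of the Euclidean algorithm above, using the remainder form (which converges faster than repeated subtraction) and normalizing the signs as the note warns:

public class Gcd {
    // gcd(a, b) = gcd(b, a mod b); runs in O(log(a + b)) iterations
    static long gcd(long a, long b) {
        a = Math.abs(a); // normalize negative inputs first, per the note above
        b = Math.abs(b);
        while (b != 0) {
            long t = a % b;
            a = b;
            b = t;
        }
        return a;
    }

    public static void main(String[] args) {
        System.out.println(gcd(-12, 18)); // 6
    }
}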
