joldnine / joldnine.github.io
My github.io blog repo. https://joldnine.github.io
This is the first issue, used for development and testing.
System.out.println("Hello, world!");
I have used Java for many years, yet most of the data-structure implementations and principles I learned in school have faded away. It pays to review them once in a while; as the classic saying goes, renew yourself daily. So let's review these fundamentals together!
String, as a class, stores its data in a final char array:
private final char value[];
One thing to note: before Java 7, the String object pool lived in the permanent generation, with the same treatment as statics. Starting with Java 7, interned Strings moved to the heap, because the permanent generation is small and was on its way out (it was removed entirely in Java 8).
String s = new String("abc");
creates two objects: one on the heap and one in the constant pool (which itself lives on the heap since JDK 7).
StringBuilder extends the abstract class AbstractStringBuilder; its data is also stored in a char array (with a default initial capacity of 16).
char[] value;
Each time StringBuilder appends a String, it does two things:
(1) Check whether this StringBuilder's capacity is sufficient; if not, expand it. Expansion allocates a new char[] just large enough to hold the appended String.
Core code:
ensureCapacityInternal(count + len);
Arrays.copyOf(value, newCapacity(minimumCapacity));
(2) Copy the chars of the incoming String in one by one.
Core code:
str.getChars(0, len, value, count);
System.arraycopy(value, srcBegin, dst, dstBegin, srcEnd - srcBegin);
StringBuffer is the interesting one: it is essentially a thread-safe StringBuilder, with its methods marked synchronized.
String is small and quick to create, but modifying it spawns many objects, increasing GC work and hurting performance; it suits immutable data.
StringBuilder is slower to create than String and is not synchronized, but it is mutable, and modification does not significantly increase GC work; it is the default recommendation.
StringBuffer is slower to modify than StringBuilder, but it is synchronized and thread-safe.
We generally reach for a Map when we need key-value pairs (unique keys, repeatable values).
Inheritance
public class HashMap<K,V> extends AbstractMap<K,V> implements Map<K,V>, Cloneable, Serializable
Data storage
transient Node<K,V>[] table;
static class Node<K,V> implements Map.Entry<K,V> {
final int hash;
final K key;
V value;
Node<K,V> next;
}
Important methods
The put implementation
As for why a tree, which is faster to search, is not used for the nodes from the start, the source comment explains:
Because TreeNodes are about twice the size of regular nodes, we
use them only when bins contain enough nodes to warrant use
(see TREEIFY_THRESHOLD). And when they become too small (due to
removal or resizing) they are converted back to plain bins. In
usages with well-distributed user hashCodes, tree bins are
rarely used.
The get implementation
get is the reverse of put, with much simpler logic.
Index computation in get and put
static final int hash(Object key) {
int h;
return (key == null) ? 0 : (h = key.hashCode()) ^ (h >>> 16);
}
...
// n is the size of the table
(n - 1) & hash
...
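As a quick illustration (a Python sketch, not the Java source), the hash spreading XORs the high 16 bits into the low 16 bits so they still influence the bucket index when the table is small, and (n - 1) & hash works because the table size n is always a power of two:

def bucket_index(hash_code, table_size):
    # mimic HashMap.hash(): fold the high 16 bits into the low 16 bits
    h = (hash_code & 0xFFFFFFFF) ^ ((hash_code & 0xFFFFFFFF) >> 16)
    # table_size is a power of two, so (table_size - 1) is a bit mask
    return (table_size - 1) & h

print(bucket_index(0x12345678, 16))  # 12, an index within [0, 15]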
Usage notes:
Ordering:
HashMap is unordered; TreeMap is sorted by key, and LinkedHashMap preserves insertion order.
Vector is synchronized and thread-safe.
LinkedList is essentially a chain of linked nodes. ArrayList is essentially an array with index access; once the array fills up, a larger one has to be allocated and the contents copied over. Both waste some space: every LinkedList node needs extra room for its links, while ArrayList reserves spare capacity at the tail of its array.
LinkedList use cases: no random access to arbitrary positions; inserting or deleting elements in the middle or at the head; traversing elements in order.
ArrayList use cases: frequent random access at arbitrary positions; appending elements at the tail.
// synchronized method
private int inc;
public synchronized void increase() {
    inc++;
}
// explicit lock
private int inc;
private final Lock lock = new ReentrantLock();
public void increase() {
    lock.lock();
    try {
        inc++;
    } finally {
        lock.unlock();
    }
}
// atomic wrapper classes
public AtomicInteger inc = new AtomicInteger();
public void increase() {
inc.getAndIncrement();
}
Thread t = new MyThread();
t.setPriority(8);
t.start();
yield:
The thread yields to other threads of the same priority.
Thread.yield()
sleep:
Pauses the thread so that other threads get a chance to run. The specified sleep time is a minimum, not a guarantee.
Thread.sleep(123);
join:
Makes the current thread wait until another thread finishes (joins onto the tail of that thread).
Non-static.
Thread t = new MyThread();
t.start();
t.join();
References live on the stack and objects on the heap; an object with no remaining references gets collected.
GC runs as a low-priority background thread.
The young generation holds freshly created objects and is collected frequently and quickly.
The young generation is further split into Eden, From Survivor and To Survivor (8:1:1); one survivor space is always empty. An oversized object, though, may be promoted straight into the old generation by a single GC.
The old generation holds objects that have survived multiple GCs (MaxTenuringThreshold defaults to 15).
The young : old generation space ratio is roughly 1:2.
The permanent generation holds statics and interned Strings (to deduplicate String contents). Major GCs of the permanent generation have strict trigger conditions.
A Minor GC is a GC of the young generation.
A Major GC is a GC of the old and permanent generations, and it takes longer.
Serial (UseSerialGC): serial collector, copying algorithm.
SerialOld (UseSerialGC): serial collector, mark-compact algorithm.
CMS (Concurrent Low Pause Collector): the classic phased collector with short Stop The World (STW) pauses. STW initial mark -- concurrent marking -- concurrent precleaning -- STW remark -- concurrent sweeping -- concurrent reset.
Drawbacks: as a mark-sweep collector it leaves memory fragmentation; it needs more CPU; it needs a larger heap (by default GC starts once the old gen is 68% full).
ParNew (UseParNewGC): parallel collector, copying algorithm.
G1 GC
Root searching (tracing from GC roots).
Mark-sweep: scan -- mark -- sweep; live objects are not compacted, so memory fragmentation remains.
Copying: after mark-sweep, move the surviving objects toward one end of the free space and update the references on the stack; costlier, but it solves the fragmentation problem.
Minor GCs of the young generation usually use the copying algorithm; Major GCs usually use mark-sweep.
System.gc() can trigger a full GC, though not immediately, and it tends to hurt performance.
Hierarchy:
Throwable
Exception & Error
RuntimeException & other Exceptions
Here are some commonly used npm commands.
Download the binaries from the Node.js downloads page, e.g. node-v6.11.5-linux-x64.tar.xz
Extract to /home/{ ubuntu username }/Develop/Application
Add export PATH=/home/{ ubuntu username }/Develop/Application/node-v6.11.5-linux-x64/bin:$PATH
as the last line of the hidden file /home/{ ubuntu username }/.bashrc
Execute
source /home/{ ubuntu username }/.bashrc
Execute
node --version; npm --version
It will show the versions if the installation is successful.
$ npm config get prefix
// global
$ npm i { package name } -g
// local
$ npm i { package name }
// update package.json
$ npm i { package name } -S
$ npm install [email protected]
$ npm update axios
// global
$ npm list -g
$ npm list -g --depth=0
// local
$ npm list
$ npm list --depth=0
$ npm uninstall { package name }
$ npm init
$ npm install npm-update-all -g
// under the directory of package.json
$ npm-update-all
$ npm { command } --prefix { path/to/another/directory }
Set registry
$ npm config set registry { registry URL }
Get current registry
$ npm config get registry
Actually, npm config get/set simply reads/writes key-value pairs in npm's config map, i.e. registry is a key and its URL is the value.
I recently picked golang back up to write some APIs, and I occasionally run into things worth recording. Keeping everything simple.
Spinning up an HTTP RESTful API server in Golang is fairly easy.
package main
import (
"encoding/json"
"fmt"
"log"
"net/http"
)
type Greeting struct {
// must be "Content", not "content", so that the field is exported
Content string
}
func enableCors(w *http.ResponseWriter) {
(*w).Header().Set("Access-Control-Allow-Origin", "http://localhost:8080")
}
func helloHandler(w http.ResponseWriter, r *http.Request) {
enableCors(&w)
name := r.URL.Query().Get("name")
if name == "" {
w.WriteHeader(422)
fmt.Fprintf(w, "Invalid input.")
return
}
greeting := Greeting{Content: "Hello, " + name}
jsonResult, err := json.MarshalIndent(greeting, "", "\t")
if err != nil {
log.Fatal(err)
}
w.Header().Set("Content-Type", "application/json")
w.Write(jsonResult)
}
func main() {
http.HandleFunc("/hello", helloHandler)
log.Fatal(http.ListenAndServe(":8080", nil))
}
Small as the sparrow is, all its organs are there: these few lines cover not only basic HTTP handling but also a response-body data model, error handling, CORS, logging and JSON serialization. Reading the http package source also shows that each HTTP request runs in its own goroutine, so high concurrency is supported natively. See Handling 1 Million Requests per Minute with Go.
Once it is running, visit http://localhost:8080/hello?name=Steve in a browser and you will see:
{
"Content": "Hello, Steve"
}
https://golang.org/pkg/net/http/
defer marks code whose execution is delayed until just before the enclosing function returns. The classic use is closing an IO resource: the close sits right next to the open, which is a pleasure to read. It is also used together with panic to handle program errors.
package main
import "fmt"
func main() {
defer fmt.Println("4")
defer fmt.Println("3")
fmt.Println("1")
fmt.Println("2")
}
1
2
3
4
https://tour.golang.org/flowcontrol/12
golang's built-in xml.Unmarshal converts XML into a struct conveniently. json.Unmarshal is similar and is used the same way.
package main
import (
"encoding/xml"
"fmt"
"log"
)
// Department data model
type Department struct {
Name string `xml:"Name"`
Members []struct {
Name string `xml:"Name"`
} `xml:"Members>Member"`
}
func parseXML(data interface{}, xmlContent string) (err error) {
err = xml.Unmarshal([]byte(xmlContent), data)
if nil != err {
log.Fatal("Error unmarshalling from XML", err)
}
return
}
func main() {
departmentXML := `
<Department>
<Name>Agents of S.H.I.E.L.D.</Name>
<Members>
<Member>
<Name>Nick Fury</Name>
</Member>
<Member>
<Name>Natalia Alianovna Romanova</Name>
</Member>
</Members>
</Department>
`
department := &Department{}
parseXML(department, departmentXML)
for i := 0; i < len(department.Members); i++ {
fmt.Println(department.Members[i].Name)
}
}
Nick Fury
Natalia Alianovna Romanova
If hand-writing the type struct feels tedious, a tool can convert a concrete XML document into a type struct in one click, e.g. XML to Go struct.
https://golang.org/pkg/encoding/xml/
https://golang.org/pkg/encoding/json/
https://www.onlinetool.io/xmltogo/
There are four ways: &T{}, &localVariable, new, and make. The differences become clear from the example below.
package main
import (
"fmt"
"reflect"
)
// User data model
type User struct {
Name string
}
func main() {
fmt.Println(reflect.TypeOf(&[]User{}))
var localVar []User
fmt.Println(reflect.TypeOf(&localVar)) // localVar has to be a local variable
fmt.Println(reflect.TypeOf(new([]User)))
fmt.Println(reflect.TypeOf(make([]User, 1)))
}
*[]main.User
*[]main.User
*[]main.User
[]main.User
https://tour.golang.org/moretypes/13
linux
GOOS=linux GOARCH=amd64 go build -ldflags "-w -s" -o {name_of_target_file}
windows
GOOS=windows GOARCH=amd64 go build -ldflags "-w -s" -o {name_of_target_file}
mac
GOOS=darwin GOARCH=amd64 go build -ldflags "-w -s" -o {name_of_target_file}
Golang encourages propagating an error up level by level or handling it on the spot, rather than swallowing exceptions the way Java allows. When a serious error occurs and the program or the current goroutine must stop, the choice splits between panic and log.Fatal.
// TODO
When doing ETL with Hive, data skew comes up all the time; here are some notes.
The cases you are most likely to hit fall into a few categories:
Skew in a join is usually caused by certain keys carrying far more data than others. The general approach:
We have two tables: a page-view table and a seller table.
dwd_pv page-view table columns:
visit_time, product_id, seller_id
Assume 1 billion rows.
dim_seller seller table columns:
seller_id, seller_name
Assume 10 million rows.
Query:
SELECT visit_time, product_id, pv.seller_id, seller_name
FROM dwd_pv AS pv
LEFT OUTER JOIN dim_seller AS slr
ON pv.seller_id = slr.seller_id;
This is an ordinary column-enrichment SQL, but when a handful of sellers carry especially heavy traffic, it suffers data skew.
To solve this, first look at what happens inside a Hive join (shuffle join):
Skew caused by group by is similar to the join case: some key in the group by carries too much data. The general approach:
Approach:
Key splitting (salting), as sketched below.
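The salting idea is language-agnostic; here is a minimal Python sketch of the two-stage aggregation (all names are made up for illustration):

import random
from collections import Counter

rows = ['hot_seller'] * 1000 + ['small_seller'] * 3  # one heavily skewed key

# Stage 1: append a random salt so the hot key spreads over many reducers.
SALT_BUCKETS = 8
partial = Counter('%s#%d' % (key, random.randrange(SALT_BUCKETS)) for key in rows)

# Stage 2: strip the salt and merge the partial counts per original key.
final = Counter()
for salted_key, count in partial.items():
    final[salted_key.rsplit('#', 1)[0]] += count

print(final)  # Counter({'hot_seller': 1000, 'small_seller': 3})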
# Nvidia GPU information and status check
$ sudo nvidia-smi
# Nvidia GPU information and status check with trend and metrics (recommended)
$ sudo nvidia-smi --query-gpu=timestamp,pstate,temperature.gpu,utilization.gpu,utilization.memory,memory.total,memory.free,memory.used --format=csv -l 1
# Refresh GPU info every 5 seconds
$ watch -n 5 nvidia-smi
# Check version of CUDA
$ nvcc -V
Check if Torch is using GPU
In python shell:
import torch
torch.cuda.get_device_name(0)
Check if tensorflow is using GPU
In python shell:
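One common check (assuming TensorFlow 2.x, where tf.config is the API):

import tensorflow as tf
# an empty list means TensorFlow sees no GPU
print(tf.config.list_physical_devices('GPU'))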
Build the page with Vue, bundle it with Webpack, and deploy it to GitHub Pages.
$ vue init webpack my-github-page
$ cd my-github-page
$ npm install
In config/index.js, under the build module, set assetsPublicPath to ./
In build/webpack.prod.conf.js, set removeAttributeQuotes to false
$ npm run build
The build output lands in the dist/ directory.
Create a repo named username.github.io, where username must match your own GitHub username.
In that repo, go to Settings, and under GitHub Pages set Source to the master branch.
Clone it: $ git clone { git link of username.github.io }
Copy everything under dist/ into the username.github.io folder, then:
$ cd username.github.io
$ git add .
$ git commit -m "init"
$ git push
Finally, after about two minutes, open https://username.github.io in a browser and you will see your page.
#!/bin/bash
set -x
BLOG_REPO="joldnine.github.io"
BLOG_DEV="vuejs-blog"
cd $BLOG_DEV
npm run build >/dev/null
cd ..
rm -rf $BLOG_REPO/static
rm -rf $BLOG_REPO/index.html
cp -r $BLOG_DEV/dist/* $BLOG_REPO/
cd $BLOG_REPO
git add .
git commit -a -m "deploy"
git push origin master
Regex to find the last occurrence of "abcd" in a string
abcd(?!.*abcd)
Negative lookahead: (?!...)
Starting at the current position in the expression, ensures that the given pattern will not match. Does not consume characters.
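A quick Python check (the sample string is made up for illustration):

import re
# the lookahead asserts that no later 'abcd' follows, so only the last one matches
m = re.search(r'abcd(?!.*abcd)', 'xx abcd yy abcd zz')
print(m.start())  # 11, the position of the last 'abcd'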
ks="$(cqlsh -e 'DESC keyspaces;')"
arr=(${ks// / })
mkdir schemas_temp_folder
for i in "${arr[@]}"
do
if [[ ("$i" != "system_traces") && ("$i" != "system") ]]; then
cqlsh -e "DESC keyspace $i;" | tee schemas_temp_folder/$i-schema.cql > /dev/null
fi
done
files="$(ls)"
arr=(${files// / })
for i in "${arr[@]}"
do
echo $i
cqlsh --request-timeout=3600 -f $i
done
sstables_dir=$1
sstable2json='/home/ubuntu/Application/apache-cassandra-2.1.19/tools/bin/sstable2json'
output_dir=$2
cd $sstables_dir
files="$(ls)"
arr=(${files// / })
for i in "${arr[@]}"
do
if [[ $i = *"-Data.db" ]]; then
echo $i
$sstable2json $sstables_dir/$i | tee $output_dir/$i.json > /dev/null
fi
done
Today we are going to build a Convolutional Neural Network with Python + NumPy only, without any machine-learning packages or frameworks. Eventually, we are going to train an MNIST model using our "home-made" CNN framework.
TODO
I have always wanted to write a list of concepts in which every concept is explained in one sentence with strong intuition, so that a beginner or even an interested layman can understand it while a veteran would applaud it. As we know, when explaining a concept it is easy to add more notes and details, but it is very challenging to shorten the explanation.
Here is the list. I will keep adding entries and revising the explanations, occasionally :>
The database for the engineered front-end project.
A state type whose mutation can be detected and recovered easily because any changes to the object imply the replacement of the whole object (reference, ie. memory address of the object, is replaced).
The initialized objects managed by Spring container.
Flexibly define the dependencies for a Spring Bean.
Get the information of a Java Class in its runtime.
git clone https://github.com/Homebrew/homebrew-core.git
cd homebrew-core
git log master -- Formula/git.rb
commit 0c49ceffe4944b095da4d0c39a6b8499714d0df8
Author: BrewTestBot [email protected]
Date: Tue Aug 17 04:07:34 2021 +0000
git: update 2.33.0 bottle.
Note the first 10 chars of the commit hash, which is 0c49ceffe4 in this example.
brew install https://github.com/Homebrew/homebrew-core/raw/0c49ceffe4/Formula/git.rb
You can also use the full SHA of the commit.
Revising the types of index used in databases, from the class CS4221, National University of Singapore.
A B+ tree has more nodes in one layer, so each search needs less IO.
An RB tree is too deep.
Sparse: not all keys will be indexed. Unstable search speed but less space.
Dense: all keys will be indexed. Faster search but more space.
Clustered index: rows on disk are stored in the same order as the index, so there can be only one clustered index for the rows.
With a non-clustered index there is a second list that holds pointers to the physical rows.
Primary key is unique.
A secondary key need not be unique.
Some frequently used kubectl commands.
Create objects:
kubectl run --image nginx nginx
Example:
Create a pod with exposed clusterip service:
kubectl run webapp --image=webapp --port=80 --expose
kubectl create deployment --image nginx nginx
kubectl expose deployment nginx --port=80
Update objects:
kubectl edit deployment nginx
kubectl scale deployment nginx --replicas=5
kubectl set image deployment nginx nginx=nginx:1.18
Dry run and check the correctness (the bare --dry-run flag is deprecated in favor of --dry-run=client):
xxx --dry-run -o yaml
xxx --dry-run=client -o yaml
Use YAML files:
kubectl create -f nginx.yaml
kubectl replace -f nginx.yaml
kubectl delete -f nginx.yaml
kubectl apply -f nginx.yaml
List pods of a namespace
kubectl get pods -n NAMESPACE
List pods
kubectl get pods --all-namespaces
kubectl get pods -n NAMESPACE --sort-by .metadata.creationTimestamp | grep Pending
Describe a pod
kubectl describe pod POD_NAME -n NAMESPACE
Get namespaces
kubectl get namespaces
Get logs of the init container of a pod
kubectl logs POD_NAME -c INIT_CONTAINER_NAME
Go into bash shell of a pod
kubectl exec -it POD_NAME -n NAMESPACE -- /bin/bash
List service accounts of a namespace
kubectl get serviceaccounts -n NAMESPACE
Describe the secret of a service account
kubectl describe secret SERVICE_ACCOUNT -n NAMESPACE
kubectl get nodes --show-labels
kubectl describe node NODE_NAME
kubectl drain NODE_NAME --delete-local-data --ignore-daemonsets --force
kubectl get pvc -n NAMESPACE
Patch a hanging pvc
kubectl patch pvc PVC_NAME -p '{"metadata":{"finalizers":null}}' -n NAMESPACE
kubectl describe configmap CONFIGMAP_NAME -n NAMESPACE
List the resources utilized by pods or nodes
kubectl top node/pod -n NAMESPACE
Launch K8S proxy server in local to access k8s APIs
kubectl proxy --port=XXXX
Sometimes we want to 'squash' multiple commits into one. We can always do it with an interactive rebase.
Example:
$ git log --oneline
We have:
5f2f5fb commit-3
2506e7a commit-2
2800c01 commit-1
12s21f commit-0
We hope to merge three commits (commit-1, commit-2 and commit-3) into one.
Steps:
$ git rebase -i 12s21f
We have:
pick 2800c01 commit-1
pick 2506e7a commit-2
pick 5f2f5fb commit-3....
Modify it into:
pick 2800c01 commit-1
s 2506e7a commit-2
s 5f2f5fb commit-3....
We save it by:
Press Esc
Type :wq
Press Enter
After that, we will be asked to edit the commit message; edit it and save by:
Press Esc
Type :wq
Press Enter
Done!
While switching phones I dug up the Vim commands cheatsheet I kept in my first year of university. I still use vim at work all the time, so let me record it here. Note: Linux environment.
Enter insert mode
i
Back to normal mode (all commands below are in normal mode unless stated otherwise)
Esc
Go to the first line of the file
1G
1gg
:1
+ Enter
Go to the last line of the file
G
Save and quit
:wq
or :x
or ZZ
Save
:w
Quit
:q
Quit without saving
:q!
Undo
u
Open a new line
o
or O
Delete a word
dw
Delete to the end of the line
d$
Delete a line
dd
Paste
p
Delete 3 lines
3dd
Yank (copy) a line
yy
Yank three lines
3yy
In normal mode:
ctrl + v
Select the lines to comment with the up/down keys
shift + i
Type the comment marker, e.g. # or //
Esc
In normal mode:
:start_line,end_line s/^/COMMENT/g
Example:
If the comment marker is #
:1,100s/^/#/g
If the comment marker is //
:1,100s#^#//#g
The second method is ugly... and hard to remember; personally I don't like it.
In normal mode:
ctrl + v
Select the comment markers to remove
d
Or: :start_line,end_line s/^COMMENT//g
Example:
If the comment marker is #
:1,100s/^#//g
If the comment marker is //
:1,100s#^//##g
Replace across the whole file:
:%s/source_regex/replacement/g
Example:
Replace every 111 with 222
:%s/111/222/g
Replace on the current line only:
:s/source_regex/replacement/g
Replace within a line range:
:start_line,end_line s/source_regex/replacement/g
Example:
:1,100s/111/222/g
Format the whole file as JSON:
:%!python -m json.tool
Recently a side project of mine uses Lambda to deploy the whole RESTful back end.
Taking a simple CRUD service as an example, the steps are as follows:
import sys
import logging
import pymysql
import rds_config
import json
logger = logging.getLogger()
logger.setLevel(logging.INFO)
class DbAccess(object):
def __init__(self):
rds_host = rds_config.db_host
db_username = rds_config.db_username
db_password = rds_config.db_password
db_name = rds_config.db_name
# try catch is required in prod
self.conn = pymysql.connect(rds_host, user=db_username, passwd=db_password, db=db_name, connect_timeout=5)
logger.info("SUCCESS: Connection to RDS mysql instance succeeded")
def get(event, context):
field_a = event['field_a']
rows = []
conn = DbAccess().conn
with conn.cursor() as cur:
# prevent SQL injection
cur.execute('SELECT * FROM db_name.table_name WHERE field_a=%s;', field_a)
for row in cur:
rows.append(row)
return {"body": json.dumps(rows), "statusCode": 200}
Create an rds_config.py to hold the database connection configuration:
db_host =
db_username =
db_password =
db_name =
Lambda's Python runtime does not bundle pymysql, so in the same directory as the file from step 1, run:
$ pip install pymysql -t .
At this point we have four files/folders:
api.py
rds_config.py
pymysql/
PyMySQL-0.8.0.dist-info/
When deploying a Python script with Lambda, the third-party dependencies you pull in (e.g. sklearn) must not depend on C libraries, because C libraries installed on a local non-Linux machine will not work inside Lambda's container. There are two workarounds: 1. install the third-party dependencies in a Linux environment (Docker, a VM, or another machine); 2. switch the language from Python to Node.js or Golang (note: Golang on Lambda performs about the same as Java).
Package the four files/folders above into one zip archive.
Open
aws console -- Lambda -- Create function
Choose Author from scratch,
fill in Name,
choose python3.6 as the Runtime,
pick any sufficiently permissive Role / Existing role,
and click Create function.
That completes a Lambda-based API. Afterwards you may still need to enable CORS and deploy the API; configure as needed.
I originally meant to fold error handling into the "Go usage notes" post, but there is a bit too much of it, so it gets its own post!
Anyone who has used Go for a while will grumble: why is nearly half my code handling err??
That is because Go has no Java-style exception catching; instead, each func throws err upward level by level. This encourages engineers to handle errors actively rather than ignore them, at the cost of a lot of err-handling code. The Go community has had heated discussions (and complaints) about err handling; roughly, the proposals fall into a few camps:
TODO
I just came back from Eason Chan's concert.
To be honest, from the beginning until near the end I felt rather disappointed. He sang only songs from the new album that nobody had heard; his old partner Lin Xi was not brought in for the lyrics, and the new lyrics generally felt hollow and contrived, so much so that on the way out everyone on WeChat Moments was complaining that the songs were unfamiliar and unremarkable.
But after Eason and the crew had left and the stage lights had dimmed, the crowd kept stubbornly chanting "Eason Eason Eason", "one more", and then, miraculously, the stage lights snapped back on and the intro of "让我留在你身边" drifted in. At that instant the whole arena erupted. I was genuinely happy, because it is our favorite song. That is our Eason and our youth!
Of course we all know it was a staged encore, but it still felt wonderful.
Java Reflection lets you obtain the information of a Java object at runtime. There are many practical tutorials online, for example Java Reflection Example Tutorial.
A while ago I ran into an interesting problem: determining at runtime the type of every field in a class, where the hard part is telling whether a field is a generic type or a plain object (java.lang.Object). These notes target that difficulty.
The approach:
From the Class<?> object, call getDeclaredFields() to obtain the fields.
field.getType().equals(Object.class) tells whether the field's type is java.lang.Object.
field.getGenericType() instanceof TypeVariable tells whether the field is a generic type.
A generic field is erased to java.lang.Object, which is why the further check is needed. The complete code:
try {
    for (Field field : clazz.getDeclaredFields()) {
        try {
            if (field.getType().equals(Object.class)) {
                if (field.getGenericType() instanceof TypeVariable) {
                    System.out.printf("Class: '%s' has generic type field: '%s'%n", clazz.getName(), field.getName());
                } else {
                    System.out.printf("Class: '%s' has Object type field: '%s'%n", clazz.getName(), field.getName());
                }
            }
        } catch (TypeNotPresentException e) {
            // do something
        }
    }
} catch (NoClassDefFoundError e) {
    // do something
}
Max integer (Python 3 ints are unbounded; sys.maxsize is the practical bound)
sys.maxsize
Min integer
-sys.maxsize - 1
class A:
    @staticmethod
    def my_func0():
        pass
Note the distinction from @classmethod:
class Person:
    age = 25

    @classmethod
    def printAge(cls):
        print('The age is:', cls.age)

Person.printAge()  # The age is: 25
In December 2020, I passed the AWS Certified Machine Learning exam with a score of 931/1000. Here are my preparation notes for the exam.
Imputing missing data
Mean/median replacement; the median is better when there are outliers (see the sketch below)
Dropping
KNN, deep learning, regression (MICE, multiple imputation by chained equations)
Just get more data
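A minimal pandas sketch of mean/median imputation (the column and values are made up):

import pandas as pd

df = pd.DataFrame({'age': [22, None, 35, None, 41]})
df['age_mean'] = df['age'].fillna(df['age'].mean())      # sensitive to outliers
df['age_median'] = df['age'].fillna(df['age'].median())  # more robust to outliers
print(df)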
Handling unbalanced data
Oversampling
Undersampling
SMOTE, synthetic minority over-sampling technique
Adjusting thresholds
Handling outlier
What is an outlier
Data points that lie more than a certain number of (e.g. one) standard deviations from the mean.
How to resolve
Remove outliers once you understand them (see the sketch below).
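A small numpy sketch of the one-standard-deviation rule above (toy data):

import numpy as np

data = np.array([10, 11, 9, 10, 12, 95])  # 95 is the outlier
mean, std = data.mean(), data.std()
kept = data[np.abs(data - mean) <= std]   # drop points beyond 1 std
print(kept)                               # [10 11  9 10 12]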
Techniques of feature engineering
Binning
Transforming with functions
Encoding
Scaling/normalization
Shuffling
TF-IDF Score
Term Frequency and Inverse Document Frequency: figure out what terms are most relevant for a document
Term frequency: how often a word occurs in a document
Document frequency: how often a word occurs across the entire set of documents --> captures common words that appear everywhere
Relevancy of a word to a document: TF/DF == TF*IDF (with IDF = 1/DF) --> how often the word appears in this document relative to how often it appears everywhere, i.e. how important and unique the word is for this document (see the toy example below)
An extension: use uni-grams, bi-grams, tri-grams, n-grams
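A toy Python illustration of the TF*IDF intuition (the documents are made up; real implementations add smoothing):

import math

docs = [['apple', 'banana', 'apple'], ['banana', 'cherry']]

def tf_idf(term, doc):
    tf = doc.count(term) / len(doc)    # term frequency in this document
    df = sum(term in d for d in docs)  # number of documents containing the term
    idf = math.log(len(docs) / df)     # inverse document frequency
    return tf * idf

print(tf_idf('apple', docs[0]))   # high: frequent here, absent elsewhere
print(tf_idf('banana', docs[0]))  # 0.0: appears in every document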
Activation functions
Defines the output of a node given its input signals (a few are sketched below)
Linear activation function: no backpropagation
Binary step function: no multiple classification, not good for calculus
Sigmoid, logistic, tanH (for RNN)
Rectified Linear Unit (ReLU)
Leaky ReLU
Parametric ReLU, negative slope is learned via backpropagation
Exponential Linear Unit (ELU)
Swish for really deep NN
Softmax: final output layer, prob of each class
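Minimal numpy sketches of a few of these (softmax uses the usual max-subtraction for numerical stability):

import numpy as np

def relu(x):
    return np.maximum(0, x)

def leaky_relu(x, alpha=0.01):
    return np.where(x > 0, x, alpha * x)

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def softmax(x):
    e = np.exp(x - np.max(x))  # subtract the max for numerical stability
    return e / e.sum()

print(softmax(np.array([1.0, 2.0, 3.0])))  # probabilities summing to 1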
CNN
What are they for:
Images, translation, sentence classification, sentiment analysis
Feature-location invariant
How do they work:
Local receptive fields
How to fix vanishing gradient problem
Multi-level hierarchy
Long short-term memory
Residual networks
Better choice of activation function, ReLU
Batch Size
Small --> can jump out of local minima, better
Regularization, dropout
Recall (TP/(TP+FN))
Precision (TP/(TP+FP))
F1 = 2PR/(P+R) (checked below)
ROC, AUC
RMSE
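A quick Python check of these formulas on a made-up confusion matrix:

tp, fp, fn = 8, 2, 4  # hypothetical counts for illustration
recall = tp / (tp + fn)       # 0.667
precision = tp / (tp + fp)    # 0.8
f1 = 2 * precision * recall / (precision + recall)
print(recall, precision, f1)  # 0.667 0.8 0.727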
Glue: Data Catalog, Crawler, ETL (Spark),
Catalog, crawler
Athena
Serverless query from S3, columnar data formats (Parquet, Apache ORC) faster perf
QuickSight
Visualization, ML Insights, adhoc
Streaming
Kinesis: source —> Streams —> Analytics —> Firehose —> s3/redshift
Step Functions
Workflow
Batch
Resource and schedule
EMRFS
Access S3 as if it were HDFS
XGBoost
Subsample, Eta prevent overfitting
The Amazon SageMaker Latent Dirichlet Allocation (LDA) algorithm: an unsupervised learning algorithm that attempts to describe a set of observations as a mixture of distinct categories.
Latent Dirichlet Allocation, unsupervised, topic modelling, use CPU
Random Cut Forest
Anomaly detection
Neural Topic Model
Classify or summarize documents based on topics, Unsupervised
Factorization Machines
Sparse data, click prediction, item recommendation, recordIO float32
IP Insights
Unsupervised, suspicious behavior from IP addresses, CSV only
Seq2seq
Tokens must be integers
BlazingText
Text classification, label
Continuous bag of words, order does not matter
SageMaker RL
Distributed training
Amazon Comprehend
NLP and text analytics, sentiment
Amazon Transcribe
Speech
Amazon Polly
Text to speech
Amazon Lex
Alexa chatbot engine
Amazon Forecast
Time series
Amazon Kendra
Search with natural language
DeepAR
Time series, RNN
SageMaker Neo
Compiling models using TensorFlow and other frameworks to edge devices such as Nvidia Jetson
Sagemaker production variants
Gradually switching a new model into production inside SageMaker
Any bucket with "sagemaker" in the name is accessible with the default SageMaker notebook role.
Sagemaker inference
The container must respond on port 8080 to /invocations and /ping
Artefacts in tar format
Pipe mode
Streams data directly to container, improve perf
Most Amazon SageMaker algorithms work best when you use the optimised protobuf recordIO format for the training data.
In Python, ThreadPoolExecutor and ProcessPoolExecutor, subclasses of Executor, are easy-to-use modules for multitasking.
They are easy to use because we only need to submit a task, consisting of a function and its parameters, to the executor; the executor then runs the tasks asynchronously for us.
For example,
from concurrent.futures import ThreadPoolExecutor
def say_something(var):
print(var)
pool = ThreadPoolExecutor(2)
pool.submit(say_something, 'hello')
pool.submit(say_something, 'hi')
Output (the order may differ):
hello
hi
However, the function may have multiple parameters.
def say_something(var1, var2):
print('{}: {}'.format(var1, var2))
For such cases, we can use lambda to work around it.
from concurrent.futures import ThreadPoolExecutor
def say_something (var1, var2):
print('{}: {}'.format(var1, var2))
arr1 = ['name', 'Joldnine']
arr2 = ['email', '[email protected]']
pool = ThreadPoolExecutor(2)
pool.submit(lambda p: say_something(*p), arr1)
pool.submit(lambda p: say_something(*p), arr2)
Output:
name: Joldnine
email: [email protected]
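As a side note, concurrent.futures also offers Executor.map, which takes one element from each iterable per call, so the lambda trick is not the only option (the values below are just for illustration):

from concurrent.futures import ThreadPoolExecutor

def say_something(var1, var2):
    print('{}: {}'.format(var1, var2))

with ThreadPoolExecutor(2) as pool:
    # map(fn, iter1, iter2) calls fn(iter1[i], iter2[i]) for each i;
    # results come back in order, though the prints may interleave
    list(pool.map(say_something, ['name', 'email'], ['Joldnine', 'a@example.com']))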
Outline
Sigmoid
Tanh
Relu, Leaky Relu, PRelu, Randomized Leaky ReLU, Maxout
TODO
HTTP Redirect
DNS
Reverse Proxy
Direct Routing (DR)
Hello guys.
I have a 12GB .tgz file. Inside of that file, there are .csv.gz files.
I want to use this data for machine learning to classify user category.
Before I jump into this big file, I wanted to train on only one .csv file inside this archive (for learning), a 108MB file with data like this >
The output of the machine-learning prediction will be a number that represents the category of the user.
Which ML algorithm would you suggest? I am not sure how I should proceed.
I have learned SVM, Naive Bayes, KNN and Decision Tree before, but those datasets were easy.
Like only two outputs > Cancer(1) or not cancer(0)
For this kind of dataset, how should I approach it?
Thanks.
Some common commands and command tools in the Unix-like system.
$ mkdir $HOME/testFolder
$ cd $HOME/testFolder
$ cd ../
$ mv $HOME/testFolder /var/tmp
$ rm -rf /var/tmp/testFolder
$ ls /etc
$ touch ~/testFile
$ ls ~
$ cp ~/testFile ~/testNewFile
$ rm ~/testFile
$ cat ~/.bash_history
# Find a string in a file.
grep 'root' /etc/passwd
# Find a string under the folder.
$ grep -r 'linux' /var/log/
$ grep -r "linux" /var/log/ --include="*.log"
$ cat /etc/passwd | grep 'root'
$ ls /etc | grep 'ssh'
$ echo 'Hello World' > ~/test.txt
$ ping -c 4 cloud.tencent.com
$ ps -aux | grep 'ssh'
The netstat command shows all kinds of network information: connections, routing tables, interface status, and so on.
# List all TCP ports in the listening state
$ netstat -lt
# Show all port information, including PID and process name
$ netstat -tulpn
refs: Tencent Cloud Lab
tar is a simple archiving and compression tool. The tar suffix means the files are merely packed together; the gz suffix means they are compressed.
# Compress
$ tar -cvzf <target_file> <source file/folder>
# Extract
$ tar -xvzf <source_file>
# Pack without compressing
$ tar -cvf <target_file> <source file/folder>
c means create an archive;
z means a gzip-compressed archive;
x means extract;
v means verbose output;
f means the next argument is the file.
# Download
$ scp -i <pem file> <username>@<ip>:<remote/path> <local/path>
# Upload
$ scp -i <pem file> <local/path> <username>@<ip>:<remote/path>
$ {{ commands }} >/dev/null 2>&1
# `>/dev/null` sends stdout to the void; `2>&1` redirects stderr to stdout
# Find a file under a directory
$ find ./dir -name "*.h"
# Delete the folders that do not match name_pattern at a path depth of 1.
$ find { DIR } -mindepth 1 -maxdepth 1 -not -name '{ name_pattern }' -type d -exec rm -rf {} +
# For multiple patterns with `or`,
$ find { DIR } -mindepth 1 -maxdepth 1 \( -not -name "*.py" -o -name "*.html" \) -type d -exec rm -rf {} +
TODO
For example, I want to add an alias for the Linux WeChat app.
Add the following command as the last line of the file ~/.bashrc (I have an NPM project for the electronic-wechat project).
alias wechat="npm start --prefix ~/Develop/WorkSpace/electronic-wechat"
alias untar='tar -zxvf '
alias ping5='ping -c 5'
alias www='python -m SimpleHTTPServer 8000'
alias ipe='curl ipinfo.io/ip'
alias ipi='ipconfig getifaddr en0'
alias c='clear'
# Count the lines.
$ ... | wc -l
# Count the bytes.
$ ... | wc -c
# Count the characters.
$ ... | wc -m
# Count the words.
$ ... | wc -w
Get the type of a command.
$ type cd
cd is a shell builtin
$ type type
type is a shell builtin
$ lsblk
$ df -h
$ du -sh
$ du -sh *
$ sudo resize2fs /dev/xvdf
$ yesterday=`TZ=Singapore date --date="-1 day" +%Y%m%d`
$ echo $yesterday # 20190117
$ kill $(ps aux | grep '[k]ill_me.py' | awk '{print $2}')
Error handling with the Fetch API is quite different from the ajax way.
Normally, when the back end returns a non-200 response, the front end may deal with either the response's statusText or the response's body.
Here are the snippets (in our example, the response body is JSON):
fetch(query).then((response) => {
if (!response.ok) {
throw Error(response.statusText);
}
response.json().then((response) => {
console.log(response)
})
}).catch((error) => {
// caution: error (which is response.statusText) is a ByteString, so we may need to convert it to string by error.toString()
console.log(error.toString())
})
fetch(query).then((response) => {
if (!response.ok) {
response.json().then((error) => {
throw Error(error);
}).catch(error => {
console.log(error.message)
})
} else {
response.json().then((response) => {
console.log(response)
})
}
})
Some commonly used Docker commands and scripts.
yum install docker-io -y
docker -v
service docker start
chkconfig docker on
echo "OPTIONS='--registry-mirror=https://mirror.ccs.tencentyun.com'" >> /etc/sysconfig/docker
systemctl daemon-reload
service docker restart
docker pull centos
docker images
docker build -t IMAGE_NAME:latest .
Print logs
docker logs CONTAINER_ID
Create a container from the centos image and use the bash shell inside it
docker run -it centos /bin/bash
exit
List all containers
docker ps -a
List all containers without truncating the output
docker ps -a --no-trunc
Run bash inside a container:
docker exec -it CONTAINER_ID /bin/bash
Create and start a container
docker run -d -p HOST_PORT:CONTAINER_INTERNAL_PORT -e PARAM1=VPARAM1 IMAGENAME
Clean up unused resources
docker system prune
When writing shell scripts, you sometimes need to call a specific version or specifically configured Java that lives in another docker image (a standard environment), similar to a Python virtual environment; you just call through docker.
Example:
function docker_java() {
java_command=${@:?java command is not specified.}
docker run --rm \
-v ${HOST_WORKSPACE}:${CONTAINER_WORKSPACE} \
${BUILD_IMAGE} \
/bin/bash -c "java ${java_command}"
}
docker_java -jar my-jar.jar
Nginx is a powerful open-source server that supports HTTP, reverse proxying and even IMAP/POP3 proxying (see the Nginx Wiki). This post shows how to serve a static page with Nginx, in other words how to deploy a single-page application (SPA). If all goes smoothly, it takes about 5 minutes.
ubuntu 14.04 or ubuntu 16.04
sudo apt-get update
sudo apt-get install nginx
sudo ufw app list
sudo ufw allow 'Nginx HTTP'
At this point, visiting the IP address already shows Nginx's built-in welcome page.
Next, point nginx at our static page.
Edit nginx's default config file /etc/nginx/sites-available/default
(the path may differ between versions)
and set the root entry to the directory containing your static files.
Finally, open this ubuntu machine's IP address in a browser to see the page we just deployed. Although we visit the bare IP, we are actually hitting port 80 of that IP.
If you would rather not pollute nginx's default conf file, you can create a new conf file as follows.
Add include servers/* to the http block of nginx's default config file.
Create my-site.conf under the servers/ folder.
Configure my-site.conf as:
server {
listen 8080; # pick the port you want
location / {
root /path_to_static_files/; # directory of your static files
index index.html;
}
}
Recently, I encountered topics about isolation levels, so I am writing this article to revise some basic concepts of transactions and transaction isolation levels.
In a database data operation, a transaction is defined as a single unit of work, which may consist of one or multiple SQL statements. It guarantees the single unit work can be wholly committed to the database or rolled back if any statement in it fails. A transaction should be atomic, consistent, isolated, and durable (ACID).
A transaction is a single unit of work that should not be divided into smaller units. It means there are only two results of a transaction: whole commit or whole failure (rollback).
The database is always in a consistent state, i.e. the data in the DB is either in the state before the transaction started or in the state after the transaction wholly committed. Before and after transactions, the rules (constraints, cascades, triggers, etc.) of the database always hold.
It is tricky to place consistency within ACID. In fact, consistency correlates with the other three properties, and A, I and D together are applied to guarantee consistency to a certain degree.
The concurrently running transactions are isolated. For example, suppose a row row_a is being modified by transaction t_a (not committed yet, though a few update statements have been executed). If, at the same time, transaction t_b tries to read row_a, it always sees the data as it was before the whole transaction t_a started.
The data remains after the commit of a transaction, even across power loss, crashes or errors. Transactions must be recorded in non-volatile storage, which guarantees that a committed transaction's result is written to the disk immediately instead of lingering in the disk cache.
Lost update: transactions overwrite each other's updates.
Dirty read: the data is being changed by one transaction, and during this process another transaction reads the uncommitted value.
Non-repeatable read: in a transaction there is more than one read action, but between these reads other transactions change the data. As a result, the read actions get different results, i.e. they are non-repeatable.
Phantom read: in a transaction there is more than one read action, but between these reads other transactions insert new rows, which may cause different read results.
To resolve the concurrency problems, databases usually use locks which will be stored in memory. The choices of lock types introduce several transaction isolation levels to achieve a performance trade-off.
There are four common transaction isolation levels that prevent concurrency problems in data accessing.
Read uncommitted: dirty rows (not committed) are allowed to be returned. Good performance, but dirty reads can occur.
Read committed: a read action waits for the completion of the deletion, updating, or inserting by another transaction.
Repeatable read: a read action blocks other transactions' updates and deletes.
Serializable: a read action blocks other transactions' updates, deletes and inserts.
Figure. Capabilities of Isolation Levels to Prevent Concurrency Problems
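As a concrete illustration, here is a hedged pymysql sketch of choosing an isolation level per session (MySQL syntax; the connection parameters are placeholders):

import pymysql

conn = pymysql.connect(host='localhost', user='user', password='pass', db='test')
with conn.cursor() as cur:
    # applies to subsequent transactions on this session (MySQL syntax)
    cur.execute('SET SESSION TRANSACTION ISOLATION LEVEL REPEATABLE READ')
    # MySQL 8.0 exposes the level as @@transaction_isolation (older versions: @@tx_isolation)
    cur.execute('SELECT @@transaction_isolation')
    print(cur.fetchone())  # ('REPEATABLE-READ',)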
This note records some common usage of Ansible that may not appear in the Ansible official docs.
Using the script module to run local scripts on a remote host
We may need to use script to run a local script on a remote host, and our bash file test.sh may be as simple as:
echo $1
And our playbook is like:
- hosts: my_remote_host
tasks:
- script: ./test.sh first_arg
register: output
- debug:
var: output
However, the output may be empty. The possible reason is that on the remote host the bash interpreter is specifically configured, so we need to edit our test.sh to add an interpreter line for our script, such as:
#!/bin/bash
echo $1
Using the script module to run local Ansible against a remote host
With the script module, we can also control a host that is connected through an intermediate host.
The trick is to run a playbook on the intermediate host, but it requires the intermediate host to have the Ansible config to connect to our actual target host. With this method, we can keep all our scripts on the localhost instead of uploading them to the intermediate host.
An example:
# playbook in the localhost
- hosts: intermediate_host
vars_files:
- ./vars/main.yml
tasks:
- script: './files/run-me-in-the-intermediate-host.yml' #the file
register: output
- debug:
var: output
failed_when: '"FAILED! =>" in output.stdout'
tags: [I-am-a-tag]
#! /usr/bin/env ansible-playbook
# run-me-in-the-intermediate-host.yml
# To be executed in the intermediate host
- hosts: actual-target-host
tasks:
- shell: ls
register: output
- debug:
var: output
failed_when: output.stderr != ''
- name: Add a host
add_host:
groups: "{{ GROUP_NAME }}"
name: "{{ IP }}"
ansible_user: ubuntu
ansible_ssh_private_key_file: "{{ PEM_PATH }}"
- add_host:
groups: "{{ HOST_GROUPS }}"
name: "{{ HOST_IP }}"
ansible_user: ubuntu
ansible_ssh_private_key_file: "{{ PEM_FOR_HOST }}"
ansible_ssh_common_args: '-o ProxyCommand="ssh -i {{ PEM_PATH_FOR_INTERMEDIATE }} -W %h:%p -q ubuntu@{{ INTERMEDIATE_HOST_IP }}"'
There is a usage of tags:
- name: A playbook.
hosts: hostX
roles:
- { role: A, tags: [B] }
- import_playbook: a.yml
tags: [B]
Intuitively, we may think it means running all the tasks tagged with B in the role A and the imported playbook a.yml, but that is not true. It actually means adding a tag B to the role A and to the playbook-import action.
$ ansible-playbook {my playbook} --tags "{tag1}, {tag2}"
$ ansible-playbook {my playbook} --skip-tags "{tag1}, {tag2}"
I have built some front-end projects, some with Vue and some with React. Their overall ideas are very similar, but when it comes to concrete implementation, the usage differs in many ways.
TODO
Primitive types store no object on the heap; their variables are cleaned up automatically to free memory when the method ends, so no GC is needed.
If you insist that these variables are "GC-ed" on the stack, that "GC" just means the memory block is popped off and the program counter returns to the address where the function was called.
In a page view table, get the user id of non-login users from their future login page views:
SELECT FIRST_VALUE(user_id, TRUE) -- set TRUE to ignore NULL values
OVER (
PARTITION BY device_id
ORDER BY visit_time ASC
ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING -- only get the future
)
FROM page_view
WHERE stat_date='20190510' -- the table is partitioned by stat_date
;
There is a common scenario: while data is being fetched from the server, the browser needs to display a spinner. With react + redux, you would probably implement it this way:
class SampleComponent extends React.Component {
componentDidMount() {
// dispatch the action here to fetch new data from the server and update the store
}
render() {
if (// data is empty) { // or use conditional rendering
return(<renderSpin />);
}
return (<renderWithData />);
}
}
The implementation looks good at first glance. However, the data may be incorrect in one case: if the component unmounts and remounts while stale data is still in the store, the old data renders before the new fetch completes, and we see unexpected behavior.
To avoid such a problem, there are two solutions.
Solution 1. Empty the data in componentWillUnmount:
class SampleComponent extends React.Component {
componentDidMount() {
// dispatch the action here to fetch new data from the server and update the store
}
componentWillUnmount() {
// dispatch the action here to make the data in the store empty
}
render() {
if (// data is empty) {
return(<renderSpin />);
}
return (<renderWithData />);
}
}
This method is only applicable when the data is not used by other components, or when you are confident that destroying the data upon unmounting of this component is OK.
But if the data is not used in other components, why not use state instead of redux for this data?
Solution 2. Dispatch a default value at the start of the fetch action:
export const sampleData = () => {
const url = `${ServerConst.SERVER_CONTEXT_PATH}/api/v1/xxx`;
return (dispatch) => {
dispatch({ // dispatch a default value
type: ActionTypes.SAMPLE_DATA,
data: [] // or null,
});
return fetchApi(url)
.then(response => response.json())
.then(data => dispatch({
type: ActionTypes.SAMPLE_DATA,
data,
}));
};
};
This method is applicable in most cases, but it dispatches twice for each data update.
Comment on this issue if you have other ideas.
It is just the elimination method for systems of equations that we learned in primary school :|