
my_blog's Introduction


2019-01-12

The blog has moved to Netlify and the posts here have been migrated over; this repo will no longer be updated. New address: Jackie's Blog

Hopefully this is the last migration! Famous last words...


About

This is my personal blog, kept in GitHub Issues. It will mostly be about Linux, R, bioinformatics, databases, and the like.

Contents

2018

December

September

August

July

June

February

2017

December

June

May

April


This used to be a GitHub Pages + Hexo blog, but that felt like too much hassle, so I just moved it into Issues.

2018-03-09: migration complete


my_blog's Issues

Collinearity in Regression Analysis

Main reference post:

Introduction

Suppose we want to predict the total 2018 tourism revenue (in USD) of some country. The dependent variable Y is that country's 2018 tourism revenue, and suppose we have the following two candidate sets of independent variables X:

  1. X1 = the total number of tourists who visited the country in 2018
  2. X2 = the country's government spending on tourism marketing in 2018
  3. X3 = a * X1 + b * X2 + c, where a, b, and c are constants

The other set:

  1. X1 = the total number of tourists who visited the country in 2018
  2. X2 = the country's government spending on tourism marketing in 2018
  3. X3 = the average RMB-to-USD exchange rate in 2018

In which of the two cases would the prediction of Y be more accurate? Most people would pick the second set of X's: intuitively, the second set contains 3 distinct variables, each contributing some different information for predicting Y, and none of them is derived directly from the others. In other words, no variable there can be written as a linear combination of the other variables.

In the first set, by contrast, only two variables provide useful information; the third is just a linear combination of the first two. Even if we built the model without that third variable, the final model would effectively contain that combination anyway.

What happens in the first set is multicollinearity: within the set, some variable is strongly correlated with the others (not necessarily every pair, but at least two variables). A model built from the first set will be less accurate than one built from the second, because the second set carries more information. This is why, when doing regression analysis, it is worth studying how to detect and handle collinearity.
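
To make the "linear combination" point concrete, here is a tiny illustration of my own (not from the referenced post): when one predictor is an exact linear combination of the others, lm() cannot estimate its coefficient at all and reports NA for it.

set.seed(42)
x1 <- rnorm(50); x2 <- rnorm(50)
x3 <- 2 * x1 + 3 * x2 + 1      # exact linear combination of x1 and x2
y  <- x1 + x2 + rnorm(50)
coef(lm(y ~ x1 + x2 + x3))     # the coefficient of x3 comes back NA (aliased)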

Concepts and Basics

The Wikipedia entry for Multicollinearity reads:

In statistics, multicollinearity (also collinearity) is a phenomenon in which one predictor variable in a multiple regression model can be linearly predicted from the others with a substantial degree of accuracy. In this situation the coefficient estimates of the multiple regression may change erratically in response to small changes in the model or the data. Multicollinearity does not reduce the predictive power or reliability of the model as a whole, at least within the sample data set; it only affects calculations regarding individual predictors. That is, a multivariate regression model with collinear predictors can indicate how well the entire bundle of predictors predicts the outcome variable, but it may not give valid results about any individual predictor, or about which predictors are redundant with respect to others.

In multiple regression, collinearity means that one predictor can be linearly predicted from the others with substantial accuracy. When that happens, small changes in the model or in the data can cause erratic changes in the regression coefficients. Collinearity does not reduce the predictive power or reliability of the model as a whole (for the same sample data), but it does affect inferences about individual predictors: the model still tells us how well the whole set of predictors explains the outcome, yet conclusions about any single predictor, or about which predictors are redundant, are no longer trustworthy.

Collinearity can arise for many reasons. Introducing dummy variables incorrectly can cause it, and so can using variables derived from other variables, as in the example above. Variables that are inherently correlated, or that carry similar information, can also produce collinearity (an example appears below). Collinearity is not much of a problem for the model as a whole, but it strongly affects individual variables and their estimated effects: it may make it impossible to tell which variables are significant, a group of variables may give nearly identical predictions, or some variables may turn out to be completely redundant with respect to others. In summary, collinearity can lead to the following consequences:

  • We cannot tell which variables are truly significant, because collinearity makes the model very sensitive to the particular sample: different samples give different sets of significant variables.
  • The standard errors tend to be abnormally large, so the regression coefficients cannot be estimated accurately; with different samples, the values and even the signs of the coefficients change (the small simulation right after this list illustrates this).
  • The model becomes unusually sensitive to adding or removing individual variables. Adding a variable orthogonal to the existing ones can produce a completely different result, and removing a variable can also change the model substantially.
  • Confidence intervals become very wide, so we may fail to reject the null hypothesis that, in the population, a regression coefficient is 0 (i.e. that the observed relationship could be due to chance).
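
Here is the small simulation promised above (my own sketch, not from the referenced post): x2 is almost a copy of x1, and across repeated samples the estimated coefficients of x1 and x2 are large, unstable, and flip sign, even though the true model only involves x1.

set.seed(1)
sim_fit <- function() {
  x1 <- rnorm(100)
  x2 <- x1 + rnorm(100, sd = 0.01)   # nearly collinear with x1
  y  <- 1 + 2 * x1 + rnorm(100)
  coef(lm(y ~ x1 + x2))[c("x1", "x2")]
}
t(replicate(5, sim_fit()))           # coefficients jump around and change sign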

So now we know collinearity is not a good thing. How do we detect it? There are many ways:

  • The first and simplest is to look at the pairwise correlations between variables. Some correlation is almost always present, but highly correlated variables easily lead to collinearity problems.
  • Abnormally large changes in the regression coefficients when variables are added or removed, or when the sample changes, also point to collinearity, as does obtaining different significant variables from different samples.
  • Another approach is the variance inflation factor (VIF). VIF > 10 suggests collinearity among the variables; generally, VIF < 4 is taken to mean the model is stable (a by-hand VIF calculation is sketched right after this list).
  • A high overall R-squared combined with mostly non-significant coefficients also suggests collinearity among the predictors.
  • The Farrar–Glauber test is a statistical procedure for detecting collinearity. It comprises three further tests: first a chi-square test to determine whether collinearity exists in the system, then an F test to find which variables are collinear, and finally a t test to determine the pattern and type of the collinearity.
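
As promised in the VIF bullet, here is a by-hand sketch of the definition, VIF_j = 1 / (1 - R_j^2), where R_j^2 is the R-squared from regressing predictor j on all the other predictors (this assumes the CPS85 data frame loaded further below; the packages used later compute the same quantity for us):

# regress AGE on the other predictors (everything except the response WAGE)
r2_age <- summary(lm(AGE ~ . - WAGE, data = CPS85))$r.squared
1 / (1 - r2_age)   # the VIF of AGE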

Example

Let's walk through an example of how to detect collinearity in a data set and deal with it in a simple way.

The data set is called CPS_85_Wages. Source and description:

These data consist of a random sample of 534 persons from the CPS, with information on wages and other characteristics of the workers, including sex, number of years of education, years of work experience, occupational status, region of residence and union membership. Source: Berndt, ER. The Practice of Econometrics. 1991. NY: Addison-Wesley. (Therese.A.Stukel_AT_Dartmouth.EDU) (MS Word format) [21/Jul/98] (23 kbytes)

It can be downloaded here: https://www.economicswebinstitute.org/data/wagesmicrodata.xls

I have also uploaded it to the GitHub repo: wagesmicrodata.xls

R's mosaic package ships this data set too: library(mosaic) followed by data("CPS85") will load it. That version differs slightly in format from the file downloaded above, so to keep things simple I just used the xls file, but ?CPS85 is still worth reading for details about the data.

In short, the data are a sample of 534 people with their wages and other information such as age, sex, race, years of education, years of work experience, occupational status, region of residence, union membership, marital status, and so on. The goal is to predict wage from these variables.

First, let's see what the data look like:

# In the original file the data are in Sheet2 (named "Data"); the first row and first column are not needed, so I deleted them by hand
CPS85 <- readxl::read_xlsx('CPS85', sheet = 1)

str(CPS85)

The data look like this:

Classes 'tbl_df', 'tbl' and 'data.frame':	534 obs. of  11 variables:
 $ WAGE      : num  5.1 4.95 6.67 4 7.5 ...
 $ OCCUPATION: num  6 6 6 6 6 6 6 6 6 6 ...
 $ SECTOR    : num  1 1 1 0 0 0 0 0 1 0 ...
 $ UNION     : num  0 0 0 0 0 1 0 0 0 0 ...
 $ EDUCATION : num  8 9 12 12 12 13 10 12 16 12 ...
 $ EXPERIENCE: num  21 42 1 4 17 9 27 9 11 9 ...
 $ AGE       : num  35 57 19 22 35 28 43 27 33 27 ...
 $ SEX       : num  1 1 0 0 0 0 0 0 0 0 ...
 $ MARR      : num  1 1 0 0 1 0 0 0 1 0 ...
 $ RACE      : num  2 3 3 3 3 3 3 3 3 3 ...
 $ SOUTH     : num  0 0 0 0 0 0 1 0 0 0 ...

A more intuitive look:

1.data.png

First we fit a linear model with all the variables. Since wages vary widely (and so does their variance), we log-transform WAGE.

fit1 = lm(log(CPS85$WAGE) ~., data = CPS85)

Let's see how it does: summary(fit1)

Call:
lm(formula = log(CPS85$WAGE) ~ ., data = CPS85)

Residuals:
    Min      1Q  Median      3Q     Max 
-2.1625 -0.2916 -0.0047  0.2998  1.9825 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  1.07860    0.68751    1.57  0.11729    
OCCUPATION  -0.00742    0.01311   -0.57  0.57176    
SECTOR       0.09146    0.03874    2.36  0.01859 *  
UNION        0.20048    0.05247    3.82  0.00015 ***
EDUCATION    0.17937    0.11076    1.62  0.10595    
EXPERIENCE   0.09582    0.11080    0.86  0.38753    
AGE         -0.08544    0.11073   -0.77  0.44067    
SEX         -0.22200    0.03991   -5.56  4.2e-08 ***
MARR         0.07661    0.04193    1.83  0.06826 .  
RACE         0.05041    0.02853    1.77  0.07787 .  
SOUTH       -0.10236    0.04282   -2.39  0.01719 *  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.44 on 523 degrees of freedom
Multiple R-squared:  0.318,	Adjusted R-squared:  0.305 
F-statistic: 24.4 on 10 and 523 DF,  p-value: <2e-16

The R-squared is 0.318, which is acceptable for a data set of only 534 samples. The F statistic is highly significant, indicating that several variables in the model explain the outcome in a statistically meaningful way. Looking more closely, though, 4 variables (occupation, education, experience, age) are not significant, and 2 (marital status and south) are significant only at the 0.1 level.

Next, the diagnostic plots for residual normality, homogeneity of variance, and so on:

par(mfrow=c(2,2))
plot(fit1)

Which gives:

2.fit.plot.png

The plots look fine as well, so the question is rather why so few variables come out significant.

Going one step further, let's look at the relationships between the variables:

library(GGally)
ggpairs(CPS85)

Which gives:

3.ggairs.png

Or:

library(corrplot)

cor1 <- cor(CPS85)
corrplot.mixed(cor1, lower.col = 'black', cl.cex = 0.8, tl.cex = 0.8)

Which gives:

4.corplot.png

Both plots show that AGE, EXPERIENCE, and EDUCATION are strongly related to each other. Let's look at the partial correlation coefficients:

library(corpcor)

corpcor::cor2pcor(cov(CPS85[,-1]))
           [,1]      [,2]      [,3]      [,4]     [,5]     [,6]      [,7]      [,8]      [,9]     [,10]
 [1,]  1.000000  0.314747  0.212996  0.029437  0.04206 -0.04414 -0.142751 -0.018581  0.057539  0.008431
 [2,]  0.314747  1.000000 -0.013531 -0.021253 -0.01326  0.01457 -0.112147  0.036495  0.006412 -0.021519
 [3,]  0.212996 -0.013531  1.000000 -0.007479 -0.01024  0.01224 -0.120088  0.068918 -0.107706 -0.097549
 [4,]  0.029437 -0.021253 -0.007479  1.000000 -0.99756  0.99726  0.051510 -0.040303  0.017231 -0.031750
 [5,]  0.042059 -0.013262 -0.010244 -0.997562  1.00000  0.99988  0.054977 -0.040977  0.010888 -0.022314
 [6,] -0.044140  0.014566  0.012239  0.997262  0.99988  1.00000 -0.053698  0.045090 -0.010803  0.021525
 [7,] -0.142751 -0.112147 -0.120088  0.051510  0.05498 -0.05370  1.000000  0.004163  0.020017 -0.030152
 [8,] -0.018581  0.036495  0.068918 -0.040303 -0.04098  0.04509  0.004163  1.000000  0.055646  0.030418
 [9,]  0.057539  0.006412 -0.107706  0.017231  0.01089 -0.01080  0.020017  0.055646  1.000000 -0.111198
[10,]  0.008431 -0.021519 -0.097549 -0.031750 -0.02231  0.02153 -0.030152  0.030418 -0.111198  1.000000

colnames(CPS85[,-1])
 [1] "OCCUPATION" "SECTOR"     "UNION"      "EDUCATION"  "EXPERIENCE" "AGE"        "SEX"        "MARR"      
 [9] "RACE"       "SOUTH

The same picture again: AGE, EXPERIENCE, and EDUCATION are highly related to each other.

Next, the Farrar–Glauber test. The omcdiag (Overall Multicollinearity Diagnostics Measures) function in the mctest package computes several indicators of overall collinearity:

library(mctest)

omcdiag(CPS85[, -1], CPS85$WAGE)

Call:
omcdiag(x = CPS85[, -1], y = CPS85$WAGE)


Overall Multicollinearity Diagnostics

                       MC Results detection
Determinant |X'X|:          0.000         1
Farrar Chi-Square:       4833.575         1
Red Indicator:              0.198         0
Sum of Lambda Inverse:  10068.844         1
Theil's Method:             1.226         1
Condition Number:         739.734         1

1 --> COLLINEARITY is detected by the test 
0 --> COLLINEARITY is not detected by the test

The results indicate that collinearity is present in the model. Next, the F test to see which variables are responsible:

imcdiag(CPS85[, -1], CPS85$WAGE)

Call:
imcdiag(x = CPS85[, -1], y = CPS85$WAGE)


All Individual Multicollinearity Diagnostics Result

                VIF   TOL        Wi        Fi Leamer     CVIF Klein
OCCUPATION    1.298 0.770 1.736e+01 1.957e+01  0.878    1.328     0
SECTOR        1.199 0.834 1.157e+01 1.304e+01  0.913    1.226     0
UNION         1.121 0.892 7.037e+00 7.931e+00  0.945    1.146     0
EDUCATION   231.196 0.004 1.340e+04 1.511e+04  0.066  236.473     1
EXPERIENCE 5184.094 0.000 3.018e+05 3.401e+05  0.014 5302.419     1
AGE        4645.665 0.000 2.704e+05 3.048e+05  0.015 4751.700     1
SEX           1.092 0.916 5.335e+00 6.013e+00  0.957    1.117     0
MARR          1.096 0.912 5.597e+00 6.309e+00  0.955    1.121     0
RACE          1.037 0.964 2.162e+00 2.437e+00  0.982    1.061     0
SOUTH         1.047 0.955 2.726e+00 3.073e+00  0.977    1.071     0

1 --> COLLINEARITY is detected by the test 
0 --> COLLINEARITY is not detected by the test

OCCUPATION , SECTOR , EDUCATION , EXPERIENCE , AGE , MARR , RACE , SOUTH , coefficient(s) are non-significant may be due to multicollinearity

R-square of y on all x: 0.28 

* use method argument to check which regressors may be the reason of collinearity
===================================

The VIF, TOL, and Wi columns are the variance inflation factor, the tolerance, and the Farrar–Glauber F statistic, respectively.

The test shows that EDUCATION, EXPERIENCE, and AGE are indeed collinear, and their VIFs are indeed very large.

Finally, the t test to see what the relationships look like:

library(ppcor)

pcor(CPS85[,-1], method = "pearson")
$estimate
           OCCUPATION    SECTOR     UNION EDUCATION EXPERIENCE      AGE       SEX      MARR      RACE     SOUTH
OCCUPATION   1.000000  0.314747  0.212996  0.029437    0.04206 -0.04414 -0.142751 -0.018581  0.057539  0.008431
SECTOR       0.314747  1.000000 -0.013531 -0.021253   -0.01326  0.01457 -0.112147  0.036495  0.006412 -0.021519
UNION        0.212996 -0.013531  1.000000 -0.007479   -0.01024  0.01224 -0.120088  0.068918 -0.107706 -0.097549
EDUCATION    0.029437 -0.021253 -0.007479  1.000000   -0.99756  0.99726  0.051510 -0.040303  0.017231 -0.031750
EXPERIENCE   0.042059 -0.013262 -0.010244 -0.997562    1.00000  0.99988  0.054977 -0.040977  0.010888 -0.022314
AGE         -0.044140  0.014566  0.012239  0.997262    0.99988  1.00000 -0.053698  0.045090 -0.010803  0.021525
SEX         -0.142751 -0.112147 -0.120088  0.051510    0.05498 -0.05370  1.000000  0.004163  0.020017 -0.030152
MARR        -0.018581  0.036495  0.068918 -0.040303   -0.04098  0.04509  0.004163  1.000000  0.055646  0.030418
RACE         0.057539  0.006412 -0.107706  0.017231    0.01089 -0.01080  0.020017  0.055646  1.000000 -0.111198
SOUTH        0.008431 -0.021519 -0.097549 -0.031750   -0.02231  0.02153 -0.030152  0.030418 -0.111198  1.000000

$p.value
           OCCUPATION    SECTOR     UNION EDUCATION EXPERIENCE    AGE      SEX   MARR    RACE   SOUTH
OCCUPATION  0.000e+00 1.467e-13 8.220e-07    0.5005     0.3357 0.3123 0.001027 0.6707 0.18764 0.84704
SECTOR      1.467e-13 0.000e+00 7.569e-01    0.6267     0.7616 0.7389 0.010051 0.4035 0.88336 0.62243
UNION       8.220e-07 7.569e-01 0.000e+00    0.8641     0.8147 0.7794 0.005823 0.1144 0.01345 0.02527
EDUCATION   5.005e-01 6.267e-01 8.641e-01    0.0000     0.0000 0.0000 0.238259 0.3563 0.69338 0.46745
EXPERIENCE  3.357e-01 7.616e-01 8.147e-01    0.0000     0.0000 0.0000 0.208090 0.3483 0.80325 0.60963
AGE         3.123e-01 7.389e-01 7.794e-01    0.0000     0.0000 0.0000 0.218884 0.3020 0.80476 0.62233
SEX         1.027e-03 1.005e-02 5.823e-03    0.2383     0.2081 0.2189 0.000000 0.9241 0.64692 0.49016
MARR        6.707e-01 4.035e-01 1.144e-01    0.3563     0.3483 0.3020 0.924111 0.0000 0.20260 0.48635
RACE        1.876e-01 8.834e-01 1.345e-02    0.6934     0.8033 0.8048 0.646920 0.2026 0.00000 0.01071
SOUTH       8.470e-01 6.224e-01 2.527e-02    0.4675     0.6096 0.6223 0.490163 0.4863 0.01071 0.00000

$statistic
           OCCUPATION  SECTOR   UNION EDUCATION EXPERIENCE       AGE     SEX    MARR    RACE   SOUTH
OCCUPATION     0.0000  7.5907  4.9902    0.6741     0.9636   -1.0114 -3.3015 -0.4254  1.3193  0.1930
SECTOR         7.5907  0.0000 -0.3098   -0.4866    -0.3036    0.3335 -2.5835  0.8360  0.1468 -0.4927
UNION          4.9902 -0.3098  0.0000   -0.1712    -0.2345    0.2802 -2.7690  1.5814 -2.4799 -2.2437
EDUCATION      0.6741 -0.4866 -0.1712    0.0000  -327.2105  308.6803  1.1807 -0.9233  0.3945 -0.7272
EXPERIENCE     0.9636 -0.3036 -0.2345 -327.2105     0.0000 1451.9092  1.2604 -0.9388  0.2493 -0.5109
AGE           -1.0114  0.3335  0.2802  308.6803  1451.9092    0.0000 -1.2310  1.0332 -0.2473  0.4928
SEX           -3.3015 -2.5835 -2.7690    1.1807     1.2604   -1.2310  0.0000  0.0953  0.4583 -0.6905
MARR          -0.4254  0.8360  1.5814   -0.9233    -0.9388    1.0332  0.0953  0.0000  1.2758  0.6966
RACE           1.3193  0.1468 -2.4799    0.3945     0.2493   -0.2473  0.4583  1.2758  0.0000 -2.5613
SOUTH          0.1930 -0.4927 -2.2437   -0.7272    -0.5109    0.4928 -0.6905  0.6966 -2.5613  0.0000

$n
[1] 534

$gp
[1] 8

$method
[1] "pearson"

Consistent with the earlier results, the p values for EDUCATION, EXPERIENCE, and AGE are significant, and the partial correlations among the three are close to 1.

We also notice that some variable pairs with rather low correlations are nevertheless significantly correlated.

Now that we understand what is going on, what do we do about it? There are many remedies, such as principal component regression, ridge regression, and stepwise regression; a quick sketch of two of them follows.
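
The sketch below is only a pointer, assuming the CPS85 data frame from above: stepwise selection with base R's step(), and ridge regression with the glmnet package (alpha = 0 selects the ridge penalty).

# stepwise selection starting from the full model
fit_step <- step(lm(log(WAGE) ~ ., data = CPS85), trace = FALSE)
summary(fit_step)

# ridge regression with glmnet
library(glmnet)
x <- as.matrix(CPS85[, setdiff(names(CPS85), "WAGE")])
fit_ridge <- cv.glmnet(x, log(CPS85$WAGE), alpha = 0)
coef(fit_ridge, s = "lambda.min")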

Here we will keep it simple and just drop one of the variables with VIF > 10. Age and work experience are obviously highly correlated, and there is no need to include both, since age largely reflects work experience. Let's remove EXPERIENCE and see how the model looks:

fit2<- lm(log(WAGE)~OCCUPATION+SECTOR+UNION+EDUCATION+AGE+SEX+MARR+RACE+SOUTH, data = CPS85)
summary(fit2)

Call:
lm(formula = log(WAGE) ~ OCCUPATION + SECTOR + UNION + EDUCATION + 
    AGE + SEX + MARR + RACE + SOUTH, data = CPS85)

Residuals:
    Min      1Q  Median      3Q     Max 
-2.1602 -0.2908 -0.0051  0.2999  1.9793 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  0.50136    0.16479    3.04  0.00247 ** 
OCCUPATION  -0.00694    0.01309   -0.53  0.59631    
SECTOR       0.09101    0.03872    2.35  0.01912 *  
UNION        0.20002    0.05246    3.81  0.00015 ***
EDUCATION    0.08381    0.00773   10.85  < 2e-16 ***
AGE          0.01031    0.00175    5.91  6.3e-09 ***
SEX         -0.22010    0.03984   -5.52  5.2e-08 ***
MARR         0.07512    0.04189    1.79  0.07346 .  
RACE         0.05067    0.02852    1.78  0.07621 .  
SOUTH       -0.10319    0.04280   -2.41  0.01626 *  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.44 on 524 degrees of freedom
Multiple R-squared:  0.318,	Adjusted R-squared:  0.306 
F-statistic: 27.1 on 9 and 524 DF,  p-value: <2e-16

Now most of the 9 variables are significant, and the F test also looks fine. Let's check the VIFs again:

car::vif(fit2)
OCCUPATION     SECTOR      UNION  EDUCATION        AGE        SEX       MARR       RACE      SOUTH 
     1.296      1.198      1.121      1.126      1.154      1.088      1.094      1.037      1.046 

All variables now have VIF < 4; the collinearity problem is gone.


Although I could follow the blog post, reproduce the code, and indeed resolve the collinearity, I am honestly only half-familiar with the statistics involved. I should find time to study the underlying statistical foundations properly, and the mctest manual is also worth reading.

RNA-Seq Data Processing Notes

This is really from June; I kept putting off writing it up until the end of the year, alas...

My thesis defense is done, so the biggest hurdle is behind me, and it is time to process the RNA-Seq data I have in hand. Consider this my first real-world run.

The experimental design is 4 rats in the intervention group and 4 in the control group: a simple 2 x 4 setup.

My first thought was to see whether 生信菜鸟团 had a hands-on workflow to follow, and it happened to have a HISAT2 + HT-Seq walkthrough, so I started by following that step by step.

I assumed the whole process would go fairly smoothly, since that post is already very detailed and I also had a detailed Nature Protocols reference at hand: Transcript-level expression analysis of RNA-seq experiments with HISAT, StringTie and Ballgown. It turned out I was still too young, too simple.

So my strongest takeaway is this: **following a tutorial always looks easy; when you actually do it yourself, you fall into pits in unexpected places.** Real understanding only comes from practice.

OK, let's begin.

First, the raw data, already cleaned by the sequencing company:

➜ ls -lh
total 42G
-rwxrwxrwx 1 adam adam 2.6G May 19 18:01 CLP1.R1.clean.fastq.gz
-rwxrwxrwx 1 adam adam 2.8G May 19 17:57 CLP1.R2.clean.fastq.gz
-rwxrwxrwx 1 adam adam 2.4G May 19 17:53 CLP2.R1.clean.fastq.gz
-rwxrwxrwx 1 adam adam 2.6G May 19 18:00 CLP2.R2.clean.fastq.gz
-rwxrwxrwx 1 adam adam 2.8G May 19 16:06 CLP3.R1.clean.fastq.gz
-rwxrwxrwx 1 adam adam 3.1G May 19 18:03 CLP3.R2.clean.fastq.gz
-rwxrwxrwx 1 adam adam 2.6G May 19 21:22 CLP4.R1.clean.fastq.gz
-rwxrwxrwx 1 adam adam 2.8G May 19 21:33 CLP4.R2.clean.fastq.gz
-rwxrwxrwx 1 adam adam 2.6G May 19 16:57 NC1.R1.clean.fastq.gz
-rwxrwxrwx 1 adam adam 2.8G May 19 17:01 NC1.R2.clean.fastq.gz
-rwxrwxrwx 1 adam adam 2.3G May 19 16:55 NC2.R1.clean.fastq.gz
-rwxrwxrwx 1 adam adam 2.5G May 19 16:57 NC2.R2.clean.fastq.gz
-rwxrwxrwx 1 adam adam 2.6G May 19 16:01 NC3.R1.clean.fastq.gz
-rwxrwxrwx 1 adam adam 2.8G May 19 18:02 NC3.R2.clean.fastq.gz
-rwxrwxrwx 1 adam adam 2.4G May 19 21:24 NC5.R1.clean.fastq.gz
-rwxrwxrwx 1 adam adam 2.6G May 19 21:28 NC5.R2.clean.fastq.gz

The data are fairly large: each compressed file is 2-3 GB.

The first pit

Let's just align:

➜ reference=/home/adam/Bioinformatics/References/grcm38_tran/genome_tran
➜ hisat2 -p 3 -x $reference \ 
  -1 /media/adam/DATA/RNA-Seq.20180410/raw/CLP1.R1.clean.fastq.gz \
  -2 /media/adam/DATA/RNA-Seq.20180410/raw/CLP1.R2.clean.fastq.gz \
  -S /media/adam/DATA/RNA-Seq.20180410/sam/CLP1.sam \
  2> /media/adam/DATA/RNA-Seq.20180410/sam/CLP1.log

Alignment results:

39332695 reads; of these:
39332695 (100.00%) were paired; of these:
37598182 (95.59%) aligned concordantly 0 times
1650296 (4.20%) aligned concordantly exactly 1 time
84217 (0.21%) aligned concordantly >1 times
---------------------
37598182 pairs aligned concordantly 0 times; of these:
23151 (0.06%) aligned discordantly 1 time
---------------------
37575031 pairs aligned 0 times concordantly or discordantly; of these:
75150062 mates make up the pairs; of these:
71576233 (95.24%) aligned 0 times
3405235 (4.53%) aligned exactly 1 time
168594 (0.22%) aligned >1 times
9.01% overall alignment rate

An overall alignment rate of 9.01% felt wrong, so I aligned a control sample to check:

➜ reference=/home/adam/Bioinformatics/References/grcm38_tran/genome_tran
➜ hisat2 -p 3 -x $reference \ 
  -1 /media/adam/DATA/RNA-Seq.20180410/raw/NC1.R1.clean.fastq.gz \
  -2 /media/adam/DATA/RNA-Seq.20180410/raw/NC1.R2.clean.fastq.gz \
  -S /media/adam/DATA/RNA-Seq.20180410/sam/NC1.sam \
  2> /media/adam/DATA/RNA-Seq.20180410/sam/NC1.log

And the results:

38339990 reads; of these:
38339990 (100.00%) were paired; of these:
36607018 (95.48%) aligned concordantly 0 times
1650547 (4.31%) aligned concordantly exactly 1 time
82425 (0.21%) aligned concordantly >1 times
----
36607018 pairs aligned concordantly 0 times; of these:
23932 (0.07%) aligned discordantly 1 time
----
36583086 pairs aligned 0 times concordantly or discordantly; of these:
73166172 mates make up the pairs; of these:
69327494 (94.75%) aligned 0 times
3644535 (4.98%) aligned exactly 1 time
194143 (0.27%) aligned >1 times
9.59% overall alignment rate

Overall alignment rate 9.59%.
No, no, something is definitely wrong.

First I re-read the HISAT2 manual to make sure I had not misused any parameters.
Then I googled "HISAT2 low overall alignment rate" and found that this situation is almost always caused by using the wrong reference genome.

So I checked. I had downloaded the M. musculus GRCm38 genome from the HISAT2 website; scrolling up and down the list of genomes there (R. norvegicus UCSC rn6, D. melanogaster, and so on), the only one I recognized was the worm.
Out of curiosity I looked up which organism each name refers to, and it turns out that M. musculus is the mouse while R. norvegicus is the rat. My data are from rats, which means I really had used the wrong reference genome.

Fine, my fault. Download the rat genome, decompress it, and go again:

➜ REF=/home/adam/Bioinformatics/References/Rattus.Norvegicus.6/genome
➜ RAW_DIR=/media/adam/DATA/RNA-Seq.20180410/raw
➜ SAM_DIR=/media/adam/DATA/RNA-Seq.20180410/sam

➜ hisat2 -p 3 -x $REF -1 $RAW_DIR/CLP1.R1.clean.fastq.gz -2 $RAW_DIR/CLP1.R2.clean.fastq.gz -S $SAM_DIR/CLP_1.SAM 2> $SAM_DIR/CLP_1.LOG 

The results:

39332695 reads; of these:
39332695 (100.00%) were paired; of these:
1514532 (3.85%) aligned concordantly 0 times
34474411 (87.65%) aligned concordantly exactly 1 time
3343752 (8.50%) aligned concordantly >1 times
----
1514532 pairs aligned concordantly 0 times; of these:
181725 (12.00%) aligned discordantly 1 time
----
1332807 pairs aligned 0 times concordantly or discordantly; of these:
2665614 mates make up the pairs; of these:
1480066 (55.52%) aligned 0 times
1020001 (38.27%) aligned exactly 1 time
165547 (6.21%) aligned >1 times
98.12% overall alignment rate

Overall alignment rate above 98%, as expected.
This time the resulting sam file is also noticeably larger.
Then it's the same command for all the other samples.

# I really should have written a loop here
➜ hisat2 -p 3 -x $REF -1 $RAW_DIR/CLP2.R1.clean.fastq.gz -2 $RAW_DIR/CLP2.R2.clean.fastq.gz -S $SAM_DIR/CLP_2.SAM 2> $SAM_DIR/CLP_2.LOG

➜ hisat2 -p 3 -x $REF -1 $RAW_DIR/CLP3.R1.clean.fastq.gz -2 $RAW_DIR/CLP3.R2.clean.fastq.gz -S $SAM_DIR/CLP_3.SAM 2> $SAM_DIR/CLP_3.LOG

➜ hisat2 -p 3 -x $REF -1 $RAW_DIR/CLP4.R1.clean.fastq.gz -2 $RAW_DIR/CLP4.R2.clean.fastq.gz -S $SAM_DIR/CLP_4.SAM 2> $SAM_DIR/CLP_4.LOG

➜ hisat2 -p 3 -x $REF -1 $RAW_DIR/NC1.R1.clean.fastq.gz -2 $RAW_DIR/NC1.R2.clean.fastq.gz -S $SAM_DIR/NC_1.SAM 2> $SAM_DIR/NC_1.LOG 

➜ hisat2 -p 3 -x $REF -1 $RAW_DIR/NC2.R1.clean.fastq.gz -2 $RAW_DIR/NC2.R2.clean.fastq.gz -S $SAM_DIR/NC_2.SAM 2> $SAM_DIR/NC_2.LOG

➜ hisat2 -p 3 -x $REF -1 $RAW_DIR/NC3.R1.clean.fastq.gz -2 $RAW_DIR/NC3.R2.clean.fastq.gz -S $SAM_DIR/NC_3.SAM 2> $SAM_DIR/NC_3.LOG

➜ hisat2 -p 3 -x $REF -1 $RAW_DIR/NC5.R1.clean.fastq.gz -2 $RAW_DIR/NC5.R2.clean.fastq.gz -S $SAM_DIR/NC_5.SAM 2> $SAM_DIR/NC_5.LOG

Next, use samtools to sort the sam files and convert them to binary bam files, which greatly reduces the disk footprint: my sam files are around 35-40 GB each, while the bam files come out at roughly 7-9 GB.

Loop over them in one go (note that samtools sort orders by position, i.e. genomic coordinate, by default; to sort by read name you need the -n flag):

for i in *.SAM; do
	samtools sort -@ 3 -o bam/${i%.*}.sorted.BAM $i
done

Finally, quantify genes with HT-Seq. This step requires the reference genome's GTF annotation file, and here I fell into the second pit of the whole analysis.

The second pit

The error here happened because I did not yet understand the difference between the genomes and the annotations that these databases provide. At first I could not find a GTF download on UCSC, so I went and downloaded the GTF annotation from Ensembl. But the index provided on the HISAT2 website for the alignment step is built from the UCSC genome, so the quantification step must also use the UCSC annotation. Because I mixed the two, the quantification step simply refused to produce results. I spent a whole day digging before pinning the problem on the annotation file, and in the end asked on Biostars (Question: Help with rat RNA-Seq data with the HISAT-StringTie workflow), where helpful people pointed me to the UCSC annotation.

Specifically, UCSC does not provide a ready-made GTF for the rat genome. Instead, we download the ensGene.txt.gz file and build the GTF ourselves with their genePredToGtf tool:

cut -f 2- ensGene.txt > Ens.Gene.txt
genePredToGtf file Ens.Gene.txt Rn6.Ensembl.Gene.GTF

The point made earlier about samtools sorting by position by default now matters. The htseq-count help says:

-r {pos,name}, --order {pos,name}
               'pos' or 'name'. Sorting order of <alignment_file>
               (default: name). Paired-end sequencing data must be
               sorted either by position or by read name, and the
               sorting order must be specified. Ignored for single-
               end data.

In other words, for paired-end data you must state explicitly how the bam file is sorted, so we just pass -r pos:

➜ for i in *.BAM; do
htseq-count -f bam -s no -r pos -i gene_id $i  ~/Bioinformatics/References/Rattus.Norvegicus.6/UCSC.rn6.GTF  1> ../counts/${i%%.*}.geneCounts 2> ../counts/${i%%.*}.htseq.log
done

Each sample then produces an xxx.geneCounts file. Once we have the pile of geneCounts files, we can read them into R, combine them into a single data.frame, and move on to downstream analysis with DESeq2 or edgeR.

A small hint on reading multiple files and combining them (htseq-count writes two tab-separated columns, gene id and count, with no header):

files <- list.files("/path/to/files", pattern = "\\.geneCounts$", full.names = TRUE)
counts <- do.call("cbind", lapply(files, read.delim, header = FALSE, row.names = 1))
colnames(counts) <- sub("\\.geneCounts$", "", basename(files))

Summary

  1. The reference genome matters.
  2. For a long pipeline like this, keep notes at each intermediate step; it makes summarizing and troubleshooting much easier.
  3. When you are not familiar with a pipeline yet, try one or two samples first instead of going all-in with a loop right away.
  4. Practice! Practice! Practice!

Exploring MIMIC Data with mimic-code: tutorials (part 2)

The mimic-code tutorials also include sql-crosstab, which is short and did not look very useful to me, so I set it aside. using_r_with_jupyter.ipynb just shows how to use Jupyter with R, nothing special. explore-items.Rmd is MySQL + R, but I never quite figured out what it was doing, and I don't have MySQL anyway; porting the code to Postgres would not be hard, but I was too lazy. So straight to the last one, cohort-selection.ipynb: Postgres + Python, with some nice tricks for selecting a patient cohort, and it looks well written. That's the one. Let's start.

The original notebook uses Python, which I'm not fond of. R is nicer, so I simply reuse the SQL statements from the notebook.


Cohort selection

The aim of this tutorial is to describe how patients are tracked in the MIMIC-III database. By the end of this notebook you should:

  • Understand what subject_id, hadm_id, and icustay_id represent
  • Know how to set up a cohort table for subselecting a patient population
  • Understand the difference between service and physical location

Requirements:

  • MIMIC-III in a PostgreSQL database
  • Python packages installable with:
    pip install numpy pandas matplotlib psycopg2 jupyter

The notebook's goal is to show how patients are tracked in MIMIC: what subject_id, hadm_id, and icustay_id represent, how to extract a study cohort, and the difference between the service a patient receives and the patient's physical location (honestly, I did not even know what that last one meant).

I am using RStudio + PostgreSQL myself, so the code differs a little from the original notebook.

First, the database connection and basic options:

library(RPostgreSQL)
library(tidyverse)

# connect to PostgresSQL
drv <- dbDriver("PostgreSQL")
con <- dbConnect(
  drv = drv,
  dbname = "mimic",
  user = "postgres",
  .rs.askForPassword("Enter password for user postgres:")
)

# set the search path to the mimiciii schema
dbSendQuery(con, "SET search_path TO mimiciii, public;")

# a little helper so queries are less verbose
query <- function(query = query) {
  con %>%
    dbGetQuery(sql(query)) %>%
    as_tibble()
}

Cohort selection usually starts from three tables: patients, admissions, and icustays:

  • patients: information about a patient that does not change - e.g. date of birth, genotypical sex
  • admissions: information recorded on hospital admission - admission type (elective, emergency), time of admission
  • icustays: information recorded on intensive care unit admission - primarily admission and discharge time

MIMIC-III is above all an ICU database, so we usually want to look at when patients enter and leave the ICU. For that reason, cohort selection usually does not start from the patient (i.e. subject_id) but from the ICU stay, i.e. from icustay_id in the icustays table.
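
As a quick side check of this hierarchy (my own addition, using the query() helper defined above): one subject_id can have several hospital admissions (hadm_id), and one admission can have several ICU stays (icustay_id), which shows up directly when counting distinct IDs.

query("SELECT COUNT(DISTINCT subject_id) AS n_subjects
            , COUNT(DISTINCT hadm_id)    AS n_admissions
            , COUNT(DISTINCT icustay_id) AS n_icustays
       FROM icustays;")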

query("SELECT subject_id, hadm_id, icustay_id
	     FROM icustays
         LIMIT 10;")
# when just poking around the data, LIMIT 10 is the usual trick

#-----

# A tibble: 10 x 3
   subject_id hadm_id icustay_id
 *      <int>   <int>      <int>
 1        268  110404     280836
 2        269  106296     206613
 3        270  188028     220345
 4        271  173727     249196
 5        272  164716     210407
 6        273  158689     241507
 7        274  130546     254851
 8        275  129886     219649
 9        276  135156     206327
10        277  171601     272866

Compute the ICU length of stay:

query("SELECT subject_id, hadm_id, icustay_id
      , outtime - intime as icu_length_of_stay_interval
      , EXTRACT(EPOCH FROM outtime - intime) as icu_length_of_stay
      FROM icustays LIMIT 10;")

#----

# A tibble: 10 x 5
   subject_id hadm_id icustay_id icu_length_of_stay_interval icu_length_of_stay
 *      <int>   <int>      <int> <chr>                                    <dbl>
 1        268  110404     280836 3 days 05:58:33                         280713
 2        269  106296     206613 3 days 06:41:28                         283288
 3        270  188028     220345 2 days 21:27:09                         250029
 4        271  173727     249196 2 days 01:26:22                         177982
 5        272  164716     210407 1 day 14:53:09                          139989
 6        273  158689     241507 1 day 11:40:06                          128406
 7        274  130546     254851 8 days 19:32:32                         761552
 8        275  129886     219649 7 days 03:09:14                         616154
 9        276  135156     206327 1 day 08:06:29                          115589
10        277  171601     272866 17:33:02                                 63182

EXTRACT(EPOCH FROM ...) turns the TIMESTAMP difference (an INTERVAL) into seconds, so to get the length of stay in days we still need to divide by (60 * 60 * 24):

query("SELECT subject_id, hadm_id, icustay_id
      , EXTRACT(EPOCH FROM outtime - intime)/60.0/60.0/24.0 as icu_length_of_stay 
      FROM icustays LIMIT 10;")

#---

# A tibble: 10 x 4
   subject_id hadm_id icustay_id icu_length_of_stay
 *      <int>   <int>      <int>              <dbl>
 1        268  110404     280836              3.25 
 2        269  106296     206613              3.28 
 3        270  188028     220345              2.89 
 4        271  173727     249196              2.06 
 5        272  164716     210407              1.62 
 6        273  158689     241507              1.49 
 7        274  130546     254851              8.81 
 8        275  129886     219649              7.13 
 9        276  135156     206327              1.34 
10        277  171601     272866              0.731

If we also want to filter on the ICU length of stay, say keep only stays of at least a given length, we first need to build a temporary table. For example:

query("WITH co AS
      (
        SELECT subject_id, hadm_id, icustay_id
        , EXTRACT(EPOCH FROM outtime - intime)/60.0/60.0/24.0 as icu_length_of_stay 
        FROM icustays LIMIT 10 
      ) 
      SELECT  co.subject_id, co.hadm_id, co.icustay_id, co.icu_length_of_stay 
      FROM co WHERE icu_length_of_stay >= 2;")

#---

# A tibble: 6 x 4
  subject_id hadm_id icustay_id icu_length_of_stay
*      <int>   <int>      <int>              <dbl>
1        268  110404     280836               3.25
2        269  106296     206613               3.28
3        270  188028     220345               2.89
4        271  173727     249196               2.06
5        274  130546     254851               8.81
6        275  129886     219649               7.13

This keeps only the stays longer than 2 days.

Many studies on MIMIC focus on a specific population. For example, MIMIC contains ICU stays for both adults and neonates, and a study is rarely run on both groups at once. So the first step of many studies is to select the study population from the icustays table, i.e. to pick the appropriate icustay_id values. The example above selects ICU stays longer than 2 days.

When selecting the study population, good practice is to build a cohort table. It should contain every icustay_id in the database, together with binary flags indicating whether each stay should be excluded from the study population. Using the same "ICU stay > 2 days" example:

query("WITH co AS 
      (
        SELECT subject_id, hadm_id, icustay_id
        , EXTRACT(EPOCH FROM outtime - intime)/60.0/60.0/24.0 as icu_length_of_stay 
        FROM icustays LIMIT 10 
      )
      SELECT co.subject_id, co.hadm_id, co.icustay_id, co.icu_length_of_stay,
      CASE
        WHEN co.icu_length_of_stay < 2 then 1 
      ELSE 0 END
      as exclusion_los FROM co;")

#---

# A tibble: 10 x 5
   subject_id hadm_id icustay_id icu_length_of_stay exclusion_los
 *      <int>   <int>      <int>              <dbl>         <int>
 1        268  110404     280836              3.25              0
 2        269  106296     206613              3.28              0
 3        270  188028     220345              2.89              0
 4        271  173727     249196              2.06              0
 5        272  164716     210407              1.62              1
 6        273  158689     241507              1.49              1
 7        274  130546     254851              8.81              0
 8        275  129886     219649              7.13              0
 9        276  135156     206327              1.34              1
10        277  171601     272866              0.731             1

In the earlier example only 6 rows came back, because 4 rows were filtered out. Here all 10 rows are kept, and the last column marks the 4 stays that should not be part of the study population.
The advantage of this approach is that, at the end of the study, it is easy to summarize the exclusions across the whole population, and easy to adjust them as needed.

Now recall another exclusion criterion mentioned earlier: flag non-adult patients for exclusion. For that we need the patient's age at ICU admission, computed from the date of birth and the ICU admission time. intime in icustays records when the patient entered the ICU, so we also need the date of birth (dob) from patients.

query("WITH co AS
      (
        SELECT icu.subject_id, icu.hadm_id, icu.icustay_id
        , EXTRACT(EPOCH FROM outtime - intime)/60.0/60.0/24.0 as icu_length_of_stay
        ,icu.intime - pat.dob AS age FROM icustays icu
        INNER JOIN patients pat ON
          icu.subject_id = pat.subject_id LIMIT 10
      )
      SELECT co.subject_id, co.hadm_id, co.icustay_id, co.icu_length_of_stay, co.age,
      CASE
        WHEN co.icu_length_of_stay < 2 then 1
        ELSE 0 END
      as exclusion_los FROM co;")
      
#---

# A tibble: 10 x 6
   subject_id hadm_id icustay_id icu_length_of_stay age                 exclusion_los
 *      <int>   <int>      <int>              <dbl> <chr>                       <int>
 1          2  163353     243653             0.0918 21:20:07                        1
 2          3  145834     211552             6.06   27950 days 19:10:11             0
 3          4  185777     294638             1.68   17475 days 00:29:31             1
 4          5  178980     214757             0.0844 06:04:24                        1
 5          6  107064     228232             3.67   24084 days 21:30:54             0
 6          7  118037     278444             0.268  15:35:29                        1
 7          7  118037     236754             0.739  2 days 03:26:01                 1
 8          8  159514     262299             1.08   12:36:10                        1
 9          9  150750     220597             5.32   15263 days 13:07:02             0
10         10  184167     288409             8.09   11:39:05                        0

Once again, the computed age comes back as an INTERVAL, so it has to be converted. There are three options:

  • use EXTRACT() on the INTERVAL, which has the form days + hours:minutes:seconds, and divide to get years (the approach used earlier);
  • use PostgreSQL's AGE() to get the exact age, then DATE_PART() to extract the number of years;
  • likewise use AGE() for the exact age, then DATE_PART() to extract years, months, and days separately and compute a precise age.

Let's try all three:

query("WITH co AS 
      (
      SELECT icu.subject_id, icu.hadm_id, icu.icustay_id
      , EXTRACT(EPOCH FROM outtime - intime)/60.0/60.0/24.0 as icu_length_of_stay
      , icu.intime - pat.dob AS age FROM icustays icu
      INNER JOIN patients pat ON
        icu.subject_id = pat.subject_id LIMIT 10
      )
      SELECT co.subject_id, co.hadm_id, co.icustay_id, co.icu_length_of_stay, co.age
      , EXTRACT('year' FROM co.age) as age_extract_year
      , EXTRACT('year' FROM co.age) 
        + EXTRACT('months' FROM co.age) / 12.0
        + EXTRACT('days' FROM co.age) / 365.242
        + EXTRACT('hours' FROM co.age) / 24.0 / 364.242 as age_extract_precise
      , EXTRACT('epoch' from co.age) / 60.0 / 60.0 / 24.0 / 365.242 as age_extract_epoch,
      CASE WHEN
        co.icu_length_of_stay < 2 then 1
      ELSE 0 END
      as exclusion_los FROM co;")

#---
# A tibble: 10 x 7
   subject_id icu_length_of_stay age                 age_extract_year age_extract_precise age_extract_epoch exclusion_los
 *      <int>              <dbl> <chr>                          <dbl>               <dbl>             <dbl>         <int>
 1          2             0.0918 21:20:07                           0            0.00240           0.00243              1
 2          3             6.06   27950 days 19:10:11                0           76.5              76.5                  0
 3          4             1.68   17475 days 00:29:31                0           47.8              47.8                  1
 4          5             0.0844 06:04:24                           0            0.000686          0.000693             1
 5          6             3.67   24084 days 21:30:54                0           65.9              65.9                  0
 6          7             0.268  15:35:29                           0            0.00172           0.00178              1
 7          7             0.739  2 days 03:26:01                    0            0.00582           0.00587              1
 8          8             1.08   12:36:10                           0            0.00137           0.00144              1
 9          9             5.32   15263 days 13:07:02                0           41.8              41.8                  0
10         10             8.09   11:39:05                           0            0.00126           0.00133              0

The last two methods give essentially the same age. The first one does not work here: what gets extracted is an INTERVAL expressed in days, so extracting the year just returns 0. The conclusion is that the different methods give basically the same answer, so pick whichever you prefer; below we will stick with the simplest, EXTRACT(EPOCH FROM ...).

Now we can drop the neonates by requiring age >= 16 (this also drops adolescents, but MIMIC only contains neonates and adults anyway):

query("WITH co AS
      (
      SELECT icu.subject_id, icu.hadm_id, icu.icustay_id
      , EXTRACT(EPOCH FROM outtime - intime)/60.0/60.0/24.0 as icu_length_of_stay
      , EXTRACT('epoch' from icu.intime - pat.dob) / 60.0 / 60.0 / 24.0 / 365.242 as age
      FROM icustays icu INNER JOIN patients pat ON
        icu.subject_id = pat.subject_id LIMIT 10
      )
      SELECT co.subject_id, co.hadm_id, co.icustay_id, co.icu_length_of_stay, co.age,
      CASE WHEN
        co.icu_length_of_stay < 2 then 1
      ELSE 0 END as exclusion_los
      ,CASE WHEN co.age < 16 then 1
      ELSE 0 END as exclusion_age FROM co;")
 
#---

# A tibble: 10 x 7
   subject_id hadm_id icustay_id icu_length_of_stay       age exclusion_los exclusion_age
 *      <int>   <int>      <int>              <dbl>     <dbl>         <int>         <int>
 1          2  163353     243653             0.0918  0.00243              1             1
 2          3  145834     211552             6.06   76.5                  0             0
 3          4  185777     294638             1.68   47.8                  1             0
 4          5  178980     214757             0.0844  0.000693             1             1
 5          6  107064     228232             3.67   65.9                  0             0
 6          7  118037     278444             0.268   0.00178              1             1
 7          7  118037     236754             0.739   0.00587              1             1
 8          8  159514     262299             1.08    0.00144              1             1
 9          9  150750     220597             5.32   41.8                  0             0
10         10  184167     288409             8.09    0.00133              0             1

Six rows are flagged for exclusion because the age is under 16, and most of them overlap with the earlier length-of-stay exclusions.

Next, another common exclusion criterion: repeat ICU admissions, whether within the same hospital stay or across stays. The rationale is that removing them gives the between-sample independence that many statistical analyses require. If we kept multiple ICU stays per patient, we would have to model the strong correlation among them (the same patient admitted repeatedly for the same condition), which adds unnecessary complications. So we use RANK() to number each patient's ICU stays in order of admission:

query("WITH co AS
  (
  SELECT icu.subject_id, icu.hadm_id, icu.icustay_id
  , EXTRACT(EPOCH FROM outtime - intime)/60.0/60.0/24.0 as icu_length_of_stay
  , EXTRACT('epoch' from icu.intime - pat.dob) / 60.0 / 60.0 / 24.0 / 365.242 as age
  , RANK() OVER (PARTITION BY icu.subject_id ORDER BY icu.intime) AS icustay_id_order
  FROM icustays icu INNER JOIN patients pat ON
    icu.subject_id = pat.subject_id LIMIT 10
  )
  SELECT co.subject_id, co.hadm_id, co.icustay_id, co.icu_length_of_stay, co.age, co.icustay_id_order,
  CASE WHEN 
    co.icu_length_of_stay < 2 then 1
  ELSE 0 END as exclusion_los,
  CASE WHEN
    co.age < 16 then 1
  ELSE 0 END as exclusion_age FROM co;")

#---

# A tibble: 10 x 8
   subject_id hadm_id icustay_id icu_length_of_stay       age icustay_id_order exclusion_los exclusion_age
 *      <int>   <int>      <int>              <dbl>     <dbl>            <dbl>         <int>         <int>
 1          2  163353     243653             0.0918  0.00243                 1             1             1
 2          3  145834     211552             6.06   76.5                     1             0             0
 3          4  185777     294638             1.68   47.8                     1             1             0
 4          5  178980     214757             0.0844  0.000693                1             1             1
 5          6  107064     228232             3.67   65.9                     1             0             0
 6          7  118037     278444             0.268   0.00178                 1             1             1
 7          7  118037     236754             0.739   0.00587                 2             1             1
 8          8  159514     262299             1.08    0.00144                 1             1             1
 9          9  150750     220597             5.32   41.8                     1             0             0
10         10  184167     288409             8.09    0.00133                 1             0             1

The patient with subject_id 7 indeed has two ICU stays. So we add another CASE WHEN to flag such stays (even though this particular one would be excluded by the other criteria anyway):

query("WITH co AS
  (
  SELECT icu.subject_id, icu.hadm_id, icu.icustay_id
  , EXTRACT(EPOCH FROM outtime - intime)/60.0/60.0/24.0 as icu_length_of_stay
  , EXTRACT('epoch' from icu.intime - pat.dob) / 60.0 / 60.0 / 24.0 / 365.242 as age
  , RANK() OVER (PARTITION BY icu.subject_id ORDER BY icu.intime) AS icustay_id_order
  FROM icustays icu INNER JOIN patients pat ON
    icu.subject_id = pat.subject_id LIMIT 10
  )
  SELECT co.subject_id, co.hadm_id, co.icustay_id, co.icu_length_of_stay, co.age, co.icustay_id_order,
  CASE WHEN
    co.icu_length_of_stay < 2 then 1
  ELSE 0 END
  AS exclusion_los,
  CASE WHEN
    co.age < 16 then 1
  ELSE 0 END AS exclusion_age,
  CASE WHEN
    co.icustay_id_order != 1
  THEN 1 ELSE 0 END AS exclusion_first_stay FROM co;")

#---

# A tibble: 10 x 9
   subject_id hadm_id icustay_id icu_length_of_stay       age icustay_id_order exclusion_los exclusion_age exclusion_first_stay
 *      <int>   <int>      <int>              <dbl>     <dbl>            <dbl>         <int>         <int>                <int>
 1          2  163353     243653             0.0918  0.00243                 1             1             1                    0
 2          3  145834     211552             6.06   76.5                     1             0             0                    0
 3          4  185777     294638             1.68   47.8                     1             1             0                    0
 4          5  178980     214757             0.0844  0.000693                1             1             1                    0
 5          6  107064     228232             3.67   65.9                     1             0             0                    0
 6          7  118037     278444             0.268   0.00178                 1             1             1                    0
 7          7  118037     236754             0.739   0.00587                 2             1             1                    1
 8          8  159514     262299             1.08    0.00144                 1             1             1                    0
 9          9  150750     220597             5.32   41.8                     1             0             0                    0
10         10  184167     288409             8.09    0.00133                 1             0             1                    0

The second ICU stay of the patient with subject_id 7 is now indeed flagged for exclusion.

Finally, we may also want to exclude some patients based on the kind of care they received on admission: patients admitted to different services differ a lot at baseline, and excluding particular groups makes the study population more homogeneous. The services table records which service a patient was admitted under:

query("SELECT subject_id, hadm_id, transfertime, prev_service, curr_service
       FROM services LIMIT 10;")

#---

# A tibble: 10 x 5
   subject_id hadm_id transfertime        prev_service curr_service
 *      <int>   <int> <dttm>              <chr>        <chr>       
 1        471  135879 2122-07-22 14:07:27 TSURG        MED         
 2        471  135879 2122-07-26 18:31:49 MED          TSURG       
 3        472  173064 2172-09-28 19:22:15 NA           CMED        
 4        473  129194 2201-01-09 20:16:45 NA           NB          
 5        474  194246 2181-03-23 08:24:41 NA           NB          
 6        474  146746 2181-04-04 17:38:46 NA           NBB         
 7        475  139351 2131-09-16 18:44:04 NA           NB          
 8        476  161042 2100-07-05 10:26:45 NA           NB          
 9        477  191025 2156-07-20 11:53:03 NA           MED         
10        478  137370 2194-07-15 13:55:21 NA           NB 

As shown above, curr_service is short for current service, and prev_service records the previous service when the patient was transferred (otherwise it is null). For example, the patient with subject_id 471 had at least two service changes: one from TSURG to MED and one from MED to TSURG (there may be more records hidden by our LIMIT 10; SELECT * FROM services WHERE subject_id = 471 shows them all).

All the services in the table are documented on the MIMIC website: http://mimic.physionet.org/mimictables/services/. In short:

Service Description
CMED Cardiac Medical - for non-surgical cardiac related admissions
CSURG Cardiac Surgery - for surgical cardiac admissions
DENT Dental - for dental/jaw related admissions
ENT Ear, nose, and throat - conditions primarily affecting these areas
GU Genitourinary - reproductive organs/urinary system
GYN Gynecological - female reproductive systems and breasts
MED Medical - general service for internal medicine
NB Newborn - infants born at the hospital
NBB Newborn baby - infants born at the hospital
NMED Neurologic Medical - non-surgical, relating to the brain
NSURG Neurologic Surgical - surgical, relating to the brain
OBS Obstetrics - concerned with childbirth and the care of women giving birth
ORTHO Orthopaedic - surgical, relating to the musculoskeletal system
OMED Orthopaedic medicine - non-surgical, relating to musculoskeletal system
PSURG Plastic - restoration/reconstruction of the human body (including cosmetic or aesthetic)
PSYCH Psychiatric - mental disorders relating to mood, behaviour, cognition, or perceptions
SURG Surgical - general surgical service not classified elsewhere
TRAUM Trauma - injury or damage caused by physical harm from an external source
TSURG Thoracic Surgical - surgery on the thorax, located between the neck and the abdomen
VSURG Vascular Surgical - surgery relating to the circulatory system

If we want to exclude patients who received surgical care, we need to drop these services:

  • CSURG
  • NSURG
  • ORTHO
  • PSURG
  • SURG
  • TSURG
  • VSURG

This can be handled with the wildcard pattern %SURG plus an explicit match on ORTHO:

query("SELECT hadm_id, curr_service,
       CASE WHEN
        curr_service like '%SURG' then 1
         WHEN curr_service = 'ORTHO' then 1
       ELSE 0 END AS surgical
       FROM services se LIMIT 10;")

#---

# A tibble: 10 x 3
   hadm_id curr_service surgical
 *   <int> <chr>           <int>
 1  135879 MED                 0
 2  135879 TSURG               1
 3  173064 CMED                0
 4  129194 NB                  0
 5  194246 NB                  0
 6  146746 NBB                 0
 7  139351 NB                  0
 8  161042 NB                  0
 9  191025 MED                 0
10  137370 NB                  0

OK, the exclusions are flagged. But we only have hadm_id here, while our cohort is built around icustay_id, so we JOIN with the icustays table on hadm_id to get icustay_id:

query("SELECT icu.hadm_id, icu.icustay_id, curr_service,
      CASE WHEN
        curr_service like '%SURG' then 1
      WHEN curr_service = 'ORTHO' then 1
      ELSE 0 END AS surgical
      FROM icustays icu LEFT JOIN services se ON
        icu.hadm_id = se.hadm_id LIMIT 10;")

#----

# A tibble: 10 x 4
   hadm_id icustay_id curr_service surgical
 *   <int>      <int> <chr>           <int>
 1  100001     275225 MED                 0
 2  100003     209281 MED                 0
 3  100006     291788 MED                 0
 4  100006     291788 OMED                0
 5  100007     217937 SURG                1
 6  100009     253656 CSURG               1
 7  100010     271147 GU                  0
 8  100011     214619 TRAUM               0
 9  100012     239289 SURG                1
10  100016     217590 MED                 0

Now a new problem appears: one icustay_id maps to several services; which one do we pick? This is really a question of cohort definition rather than of how to write the code. Say we decide to exclude patients who were on a surgical service before arriving in the ICU; then the JOIN above has to change:

query("SELECT icu.hadm_id, icu.icustay_id, se.curr_service,
      CASE WHEN curr_service like '%SURG' then 1
        WHEN curr_service = 'ORTHO' then 1
      ELSE 0 END AS surgical FROM icustays icu
      LEFT JOIN services se ON
        icu.hadm_id = se.hadm_id
      AND se.transfertime < icu.intime + interval '12' hour LIMIT 10;")

#----

# A tibble: 10 x 4
   hadm_id icustay_id curr_service surgical
 *   <int>      <int> <chr>           <int>
 1  100001     275225 MED                 0
 2  100003     209281 MED                 0
 3  100006     291788 MED                 0
 4  100007     217937 SURG                1
 5  100009     253656 CSURG               1
 6  100010     271147 GU                  0
 7  100011     214619 TRAUM               0
 8  100012     239289 SURG                1
 9  100016     217590 MED                 0
10  100017     258320 MED                 0

Compared with the previous result, the row with hadm_id = 100006 and service OMED is gone: that patient's OMED service started after the ICU stay, so we do not consider it (even though OMED is in fact non-surgical). Note the + interval '12' hour in the JOIN, which adds some tolerance to the exclusion criterion. The timestamps in the database are entered by different people, in different places, at different times, so some inconsistency is inevitable: an ICU patient may be transferred for surgery, yet the recorded transfer time can fall an hour after the ICU admission time. That is administrative "noise", and the extra 12 hours helps absorb it. Again, this is a cohort-definition choice; you might find 12h too long and prefer 2-4h, but in our case it makes little difference, because 80% of the patients have no transfer at all.

Finally, we collapse the result so that each ICU stay keeps exactly one service record; as before, RANK() does the job:

query("WITH serv AS
      (
      SELECT icu.hadm_id, icu.icustay_id, se.curr_service,
      CASE WHEN
        curr_service like '%SURG' then 1
      WHEN curr_service = 'ORTHO' then 1
      ELSE 0 END AS surgical,
      RANK() OVER (PARTITION BY icu.hadm_id ORDER BY se.transfertime DESC) as rank
      FROM icustays icu LEFT JOIN services se ON
        icu.hadm_id = se.hadm_id
      AND se.transfertime < icu.intime + interval '12' hour LIMIT 10
      )
      SELECT hadm_id, icustay_id, curr_service, surgical FROM serv
      WHERE rank = 1;")

#----

# A tibble: 10 x 4
   hadm_id icustay_id curr_service surgical
 *   <int>      <int> <chr>           <int>
 1  100001     275225 MED                 0
 2  100003     209281 MED                 0
 3  100006     291788 MED                 0
 4  100007     217937 SURG                1
 5  100009     253656 CSURG               1
 6  100010     271147 GU                  0
 7  100011     214619 TRAUM               0
 8  100012     239289 SURG                1
 9  100016     217590 MED                 0
10  100017     258320 MED                 0

And then, last of all, JOIN this with the cohort we built earlier:

query("WITH co AS
      (
      SELECT icu.subject_id, icu.hadm_id, icu.icustay_id
      , EXTRACT(EPOCH FROM outtime - intime)/60.0/60.0/24.0 as icu_length_of_stay
      , EXTRACT('epoch' from icu.intime - pat.dob) / 60.0 / 60.0 / 24.0 / 365.242 as age
      , RANK() OVER (PARTITION BY icu.subject_id ORDER BY icu.intime) AS icustay_id_order
      FROM icustays icu INNER JOIN patients pat ON
        icu.subject_id = pat.subject_id LIMIT 10),
      serv AS
      (
      SELECT icu.hadm_id, icu.icustay_id, se.curr_service
      , CASE WHEN
          curr_service like '%SURG' then 1
      WHEN
          curr_service = 'ORTHO' then 1
      ELSE 0 END as surgical
      , RANK() OVER (PARTITION BY icu.hadm_id ORDER BY se.transfertime DESC) as rank
      FROM icustays icu LEFT JOIN services se ON
        icu.hadm_id = se.hadm_id
        AND se.transfertime < icu.intime + interval '12' hour
      )
      SELECT co.subject_id, co.hadm_id, co.icustay_id
      , co.icu_length_of_stay, co.age, co.icustay_id_order
      , CASE WHEN
          co.icu_length_of_stay < 2 then 1
      ELSE 0 END AS exclusion_los
      , CASE WHEN
          co.age < 16 then 1
      ELSE 0 END AS exclusion_age
      , CASE WHEN
          co.icustay_id_order != 1 THEN 1
      ELSE 0 END AS exclusion_first_stay
      , CASE WHEN serv.surgical = 1 THEN 1
      ELSE 0 END as exclusion_surgical
      FROM co LEFT JOIN serv ON
          co.icustay_id = serv.icustay_id AND serv.rank = 1;")

#----

# A tibble: 10 x 10
   subject_id hadm_id icustay_id icu_length_of_stay       age icustay_id_order exclusion_los exclusion_age exclusion_first_stay exclusion_surgical
 *      <int>   <int>      <int>              <dbl>     <dbl>            <dbl>         <int>         <int>                <int>              <int>
 1          6  107064     228232             3.67   65.9                     1             0             0                   0                 1
 2          7  118037     278444             0.268   0.00178                 1             1             1                   0                 0
 3          7  118037     236754             0.739   0.00587                 2             1             1                   1                 0
 4          3  145834     211552             6.06   76.5                     1             0             0                   0                 1
 5          9  150750     220597             5.32   41.8                     1             0             0                   0                 0
 6          8  159514     262299             1.08    0.00144                 1             1             1                   0                 0
 7          2  163353     243653             0.0918  0.00243                 1             1             1                   0                 0
 8          5  178980     214757             0.0844  0.000693                1             1             1                   0                 0
 9         10  184167     288409             8.09    0.00133                 1             0             1                   0                 0
10          4  185777     294638             1.68   47.8                     1             1             0                   0                 0

With that we have the patient cohort we need and can start extracting data.

To finish, a summary of the whole selection process (this last step could also be written in R, but I couldn't be bothered and just pasted it into Python):

import pandas as pd
import numpy as np
import psycopg2
from IPython.display import display, HTML
sqluser='postgres'
dbname='mimic'
schema_name='mimiciii'

con = psycopg2.connect(dbname=dbname,user=sqluser, password='not_shown_here')

query_schema = 'set search_path to ' + schema_name + ';'

query = query_schema + """
WITH co AS
(
SELECT icu.subject_id, icu.hadm_id, icu.icustay_id, first_careunit
, EXTRACT(EPOCH FROM outtime - intime)/60.0/60.0/24.0 as icu_length_of_stay
, EXTRACT('epoch' from icu.intime - pat.dob) / 60.0 / 60.0 / 24.0 / 365.242 as age
, RANK() OVER (PARTITION BY icu.subject_id ORDER BY icu.intime) AS icustay_id_order
FROM icustays icu
INNER JOIN patients pat
  ON icu.subject_id = pat.subject_id
LIMIT 10
)
, serv AS
(
SELECT icu.hadm_id, icu.icustay_id, se.curr_service
, CASE
    WHEN curr_service like '%SURG' then 1
    WHEN curr_service = 'ORTHO' then 1
    ELSE 0 END
  as surgical
, RANK() OVER (PARTITION BY icu.hadm_id ORDER BY se.transfertime DESC) as rank
FROM icustays icu
LEFT JOIN services se
 ON icu.hadm_id = se.hadm_id
AND se.transfertime < icu.intime + interval '12' hour
)
SELECT
  co.subject_id, co.hadm_id, co.icustay_id, co.icu_length_of_stay
  , co.age
  , co.icustay_id_order
  , serv.curr_service
  , co.first_careunit
  , CASE
        WHEN co.icu_length_of_stay < 2 then 1
    ELSE 0 END
    AS exclusion_los
  , CASE
        WHEN co.age < 16 then 1
    ELSE 0 END
    AS exclusion_age
  , CASE 
        WHEN co.icustay_id_order != 1 THEN 1
    ELSE 0 END 
    AS exclusion_first_stay
  , CASE
        WHEN serv.surgical = 1 THEN 1
    ELSE 0 END
    as exclusion_surgical
FROM co
LEFT JOIN serv
  ON  co.icustay_id = serv.icustay_id
  AND serv.rank = 1
"""

df = pd.read_sql_query(query, con)

print('{:20s} {:5d}'.format('Observations', df.shape[0]))
idxExcl = np.zeros(df.shape[0],dtype=bool)
for col in df.columns:
    if "exclusion_" in col:
        print('{:20s} {:5d} ({:2.2f}%)'.format(col, df[col].sum(), df[col].sum()*100.0/df.shape[0]))
        idxExcl = (idxExcl) | (df[col]==1)

print('')
print('{:20s} {:5d} ({:2.2f}%)'.format('Total excluded', np.sum(idxExcl), np.sum(idxExcl)*100.0/df.shape[0]))

# --------

Observations            10
exclusion_los            6 (60.00%)
exclusion_age            6 (60.00%)
exclusion_first_stay     1 (10.00%)
exclusion_surgical       2 (20.00%)

Total excluded           9 (90.00%)

Because we built the cohort table with exclusion flags up front, summarizing the whole selection process at the end becomes very easy.
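
For completeness, the same tally can also be done in R; a quick sketch, assuming the final query above has been run through the query() helper into a tibble called cohort (tidyverse is already loaded):

excl <- cohort %>% select(starts_with("exclusion_"))
nrow(cohort)             # observations
colSums(excl)            # how many stays each criterion excludes
sum(rowSums(excl) > 0)   # stays excluded by at least one criterion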


I found this notebook genuinely useful: first, it shows how to write the query at each step, with detailed explanations; second, and most importantly, it lays out the general philosophy of choosing a study cohort.

THE END

All the R code is pasted below as a backup.

library(RPostgreSQL)
library(tidyverse)

query <- function(query = query) {
  con %>%
    dbGetQuery(sql(query)) %>%
    as_tibble()
}

# connect to DB -----------------------------------------------------------
drv <- dbDriver("PostgreSQL")
con <- dbConnect(
  drv = drv,
  dbname = "mimic",
  user = "postgres",
  .rs.askForPassword("Enter password for user postgres:")
)
# set the search path to the mimiciii schema
dbSendQuery(con, "SET search_path TO mimiciii, public;")
# being lazy
query("SELECT subject_id, hadm_id, icustay_id
       FROM icustays
       LIMIT 10")

query("SELECT subject_id, hadm_id, icustay_id
      , outtime - intime as icu_length_of_stay_interval
      , EXTRACT(EPOCH FROM outtime - intime) as icu_length_of_stay
      FROM icustays
      LIMIT 10;")

query("SELECT subject_id, hadm_id, icustay_id
      , EXTRACT(EPOCH FROM outtime - intime)/60.0/60.0/24.0 as icu_length_of_stay
      FROM icustays LIMIT 10;")

query("WITH co AS
      (
        SELECT subject_id, hadm_id, icustay_id
        , EXTRACT(EPOCH FROM outtime - intime)/60.0/60.0/24.0 as icu_length_of_stay
        FROM icustays LIMIT 10
      )
      SELECT  co.subject_id, co.hadm_id, co.icustay_id, co.icu_length_of_stay
      FROM co WHERE icu_length_of_stay >= 2;")

query("WITH co AS
      (
        SELECT subject_id, hadm_id, icustay_id
        , EXTRACT(EPOCH FROM outtime - intime)/60.0/60.0/24.0 as icu_length_of_stay
        FROM icustays LIMIT 10
      )
      SELECT co.subject_id, co.hadm_id, co.icustay_id, co.icu_length_of_stay,
      CASE
        WHEN co.icu_length_of_stay < 2 then 1
      ELSE 0 END
      as exclusion_los FROM co;")

# age ---------------------------------------------------------------------


query("WITH co AS
      (
      SELECT icu.subject_id, icu.hadm_id, icu.icustay_id
      , EXTRACT(EPOCH FROM outtime - intime)/60.0/60.0/24.0 as icu_length_of_stay
      , icu.intime - pat.dob AS age FROM icustays icu
      INNER JOIN patients pat ON
        icu.subject_id = pat.subject_id LIMIT 10
      )
      SELECT co.subject_id, co.icu_length_of_stay, co.age
      , EXTRACT('year' FROM co.age) as age_extract_year
      , EXTRACT('year' FROM co.age)
        + EXTRACT('months' FROM co.age) / 12.0
        + EXTRACT('days' FROM co.age) / 365.242
        + EXTRACT('hours' FROM co.age) / 24.0 / 364.242 as age_extract_precise
      , EXTRACT('epoch' from co.age) / 60.0 / 60.0 / 24.0 / 365.242 as age_extract_epoch,
      CASE WHEN
        co.icu_length_of_stay < 2 then 1
      ELSE 0 END
      as exclusion_los FROM co;")

query("WITH co AS
      (
      SELECT icu.subject_id, icu.hadm_id, icu.icustay_id
      , EXTRACT(EPOCH FROM outtime - intime)/60.0/60.0/24.0 as icu_length_of_stay
      , EXTRACT('epoch' from icu.intime - pat.dob) / 60.0 / 60.0 / 24.0 / 365.242 as age
      FROM icustays icu INNER JOIN patients pat ON
        icu.subject_id = pat.subject_id LIMIT 10
      )
      SELECT co.subject_id, co.hadm_id, co.icustay_id, co.icu_length_of_stay, co.age,
      CASE WHEN
        co.icu_length_of_stay < 2 then 1
      ELSE 0 END as exclusion_los
      ,CASE WHEN co.age < 16 then 1
      ELSE 0 END as exclusion_age FROM co;")



# readmission -------------------------------------------------------------

query("WITH co AS
  (
  SELECT icu.subject_id, icu.hadm_id, icu.icustay_id
  , EXTRACT(EPOCH FROM outtime - intime)/60.0/60.0/24.0 as icu_length_of_stay
  , EXTRACT('epoch' from icu.intime - pat.dob) / 60.0 / 60.0 / 24.0 / 365.242 as age
  , RANK() OVER (PARTITION BY icu.subject_id ORDER BY icu.intime) AS icustay_id_order
  FROM icustays icu INNER JOIN patients pat ON
    icu.subject_id = pat.subject_id LIMIT 10
  )
  SELECT co.subject_id, co.hadm_id, co.icustay_id, co.icu_length_of_stay, co.age, co.icustay_id_order,
  CASE WHEN
    co.icu_length_of_stay < 2 then 1
  ELSE 0 END as exclusion_los,
  CASE WHEN
    co.age < 16 then 1
  ELSE 0 END as exclusion_age FROM co;")

query("WITH co AS
  (
  SELECT icu.subject_id, icu.hadm_id, icu.icustay_id
  , EXTRACT(EPOCH FROM outtime - intime)/60.0/60.0/24.0 as icu_length_of_stay
  , EXTRACT('epoch' from icu.intime - pat.dob) / 60.0 / 60.0 / 24.0 / 365.242 as age
  , RANK() OVER (PARTITION BY icu.subject_id ORDER BY icu.intime) AS icustay_id_order
  FROM icustays icu INNER JOIN patients pat ON
    icu.subject_id = pat.subject_id LIMIT 10
  )
  SELECT co.subject_id, co.hadm_id, co.icustay_id, co.icu_length_of_stay, co.age, co.icustay_id_order,
  CASE WHEN
    co.icu_length_of_stay < 2 then 1
  ELSE 0 END
  AS exclusion_los,
  CASE WHEN
    co.age < 16 then 1
  ELSE 0 END AS exclusion_age,
  CASE WHEN
    co.icustay_id_order != 1
  THEN 1 ELSE 0 END AS exclusion_first_stay FROM co;")



# service -----------------------------------------------------------------

query("SELECT subject_id, hadm_id, transfertime, prev_service, curr_service
       FROM services LIMIT 10;")
query("SELECT * FROM services WHERE subject_id = 471;")

query("SELECT hadm_id, curr_service,
       CASE WHEN
        curr_service like '%SURG' then 1
         WHEN curr_service = 'ORTHO' then 1
       ELSE 0 END AS surgical
       FROM services se LIMIT 10;")

query("SELECT icu.hadm_id, icu.icustay_id, curr_service,
      CASE WHEN
        curr_service like '%SURG' then 1
      WHEN curr_service = 'ORTHO' then 1
      ELSE 0 END AS surgical
      FROM icustays icu LEFT JOIN services se ON
        icu.hadm_id = se.hadm_id LIMIT 10;")

query("SELECT * FROM services WHERE hadm_id=100006;")
query("SELECT * FROM icustays WHERE hadm_id=100006;")

query("SELECT icu.hadm_id, icu.icustay_id, se.curr_service,
      CASE WHEN curr_service like '%SURG' then 1
        WHEN curr_service = 'ORTHO' then 1
      ELSE 0 END AS surgical FROM icustays icu
      LEFT JOIN services se ON
        icu.hadm_id = se.hadm_id
      AND se.transfertime < icu.intime + interval '12' hour LIMIT 10;")

query("WITH serv AS
      (
      SELECT icu.hadm_id, icu.icustay_id, se.curr_service,
      CASE WHEN
        curr_service like '%SURG' then 1
      WHEN curr_service = 'ORTHO' then 1
      ELSE 0 END AS surgical,
      RANK() OVER (PARTITION BY icu.hadm_id ORDER BY se.transfertime DESC) as rank
      FROM icustays icu LEFT JOIN services se ON
        icu.hadm_id = se.hadm_id
      AND se.transfertime < icu.intime + interval '12' hour LIMIT 10
      )
      SELECT hadm_id, icustay_id, curr_service, surgical FROM serv
      WHERE rank = 1;")


# together ----------------------------------------------------------------

query("WITH co AS
      (
      SELECT icu.subject_id, icu.hadm_id, icu.icustay_id
      , EXTRACT(EPOCH FROM outtime - intime)/60.0/60.0/24.0 as icu_length_of_stay
      , EXTRACT('epoch' from icu.intime - pat.dob) / 60.0 / 60.0 / 24.0 / 365.242 as age
      , RANK() OVER (PARTITION BY icu.subject_id ORDER BY icu.intime) AS icustay_id_order
      FROM icustays icu INNER JOIN patients pat ON
        icu.subject_id = pat.subject_id LIMIT 10),
      serv AS
      (
      SELECT icu.hadm_id, icu.icustay_id, se.curr_service
      , CASE WHEN
          curr_service like '%SURG' then 1
      WHEN
          curr_service = 'ORTHO' then 1
      ELSE 0 END as surgical
      , RANK() OVER (PARTITION BY icu.hadm_id ORDER BY se.transfertime DESC) as rank
      FROM icustays icu LEFT JOIN services se ON
        icu.hadm_id = se.hadm_id
        AND se.transfertime < icu.intime + interval '12' hour
      )
      SELECT co.subject_id, co.hadm_id, co.icustay_id
      , co.icu_length_of_stay, co.age, co.icustay_id_order
      , CASE WHEN
          co.icu_length_of_stay < 2 then 1
      ELSE 0 END AS exclusion_los
      , CASE WHEN
          co.age < 16 then 1
      ELSE 0 END AS exclusion_age
      , CASE WHEN
          co.icustay_id_order != 1 THEN 1
      ELSE 0 END AS exclusion_first_stay
      , CASE WHEN serv.surgical = 1 THEN 1
      ELSE 0 END as exclusion_surgical
      FROM co LEFT JOIN serv ON
          co.icustay_id = serv.icustay_id AND serv.rank = 1;")

df <- query("WITH co AS
      (
            SELECT icu.subject_id, icu.hadm_id, icu.icustay_id
            , EXTRACT(EPOCH FROM outtime - intime)/60.0/60.0/24.0 as icu_length_of_stay
            , EXTRACT('epoch' from icu.intime - pat.dob) / 60.0 / 60.0 / 24.0 / 365.242 as age
            , RANK() OVER (PARTITION BY icu.subject_id ORDER BY icu.intime) AS icustay_id_order
            FROM icustays icu INNER JOIN patients pat ON
            icu.subject_id = pat.subject_id LIMIT 10),
            serv AS
            (
            SELECT icu.hadm_id, icu.icustay_id, se.curr_service
            , CASE WHEN
            curr_service like '%SURG' then 1
            WHEN
            curr_service = 'ORTHO' then 1
            ELSE 0 END as surgical
            , RANK() OVER (PARTITION BY icu.hadm_id ORDER BY se.transfertime DESC) as rank
            FROM icustays icu LEFT JOIN services se ON
            icu.hadm_id = se.hadm_id
            AND se.transfertime < icu.intime + interval '12' hour
            )
            SELECT co.subject_id, co.hadm_id, co.icustay_id
            , co.icu_length_of_stay, co.age, co.icustay_id_order
            , CASE WHEN
            co.icu_length_of_stay < 2 then 1
            ELSE 0 END AS exclusion_los
            , CASE WHEN
            co.age < 16 then 1
            ELSE 0 END AS exclusion_age
            , CASE WHEN
            co.icustay_id_order != 1 THEN 1
            ELSE 0 END AS exclusion_first_stay
            , CASE WHEN serv.surgical = 1 THEN 1
            ELSE 0 END as exclusion_surgical
            FROM co LEFT JOIN serv ON
            co.icustay_id = serv.icustay_id AND serv.rank = 1;")

跟着 mimic-code 探索 MIMIC 数据之 notebooks CRRT (二)

0.cover.BingWallpaper-2018-08-15

书接上回。
上回说到,我们首先自己搜索 d_items 表格找到对应的 item_id,然后运用专业知识对这些 item_id 进行筛选和归类,最后得到真正能够定义 CRRT 时间的那些 item_id

下面,我们就要通过最后得到的这些 item_id 来制定规则,看看到底 CRRT 的时间是如何对应到这些数据上的。

(上一篇放到 GitHub 上之后发现由于代码基本上都是 SQL,只是通过懒人函数 query() 套壳到 R 了,所以粘贴时按照 R 来高亮。导致高亮这么重要的功能基本上算是废了。所以这一篇决定还是直接贴 SQL 代码好了,要用的时候要么放到 psql 终端要么再套 query()的壳就行了。这样有个问题就是和上一篇其实代码上不统一,加之这一篇我也是边写边改,写到后面才决定整体干脆用 SQL 代码又回来改掉的,所以代码风格可能也不一致,写在这里算了,有空再改。反正长长的 Todo list 也不差再多一个了...)

To do:

  • 本篇与上篇代码统一
  • 检查本篇代码是否有误以及代码风格是否统一

Step 4: definition of concept using rules

我们再回头想想这个笔记本最开始的目的。我们是想得到每个病人 CRRT 的时间段,就是说对于每个 icustay_id 我们都要得到:

  • 一个 starttime
  • 一个 endtime

因为在一个病人住院期间,CRRT 有可能会有中断,因此对于一个 icustay_id 来说可能不止有一对 starttime 和 endtime,而且多个时段之间应该互不重叠。

回想一下,CHARTEVENTS 就是存储各种事件的时刻(charttime)用的,而且所有东西都会记录为一个单独的时间点(即哪天几点几分做了什么)。因此对于 CHARTEVENTS 来说,我们现在的主要任务就是把一系列的 charttime 时间点转成一对一对的 starttime 和 endtime(即从哪天几点几分到哪天几点几分这样的时间段)。乍一看似乎直接按时间顺序一条一条地查看参数记录然后把它们组合起来就行了:第一个出现的 charttime 就作为 starttime,最后一个就是 endtime。但是,事实上这些数据不仅仅存储在 CHARTEVENTS 里,INPUTEVENTS_MV 和 PROCEDUREEVENTS_MV 也有。所以为了提高准确性,在 CHARTEVENTS 的基础上,我们必须得把这两个表的数据也考虑进来。对于 INPUTEVENTS_MV 这个还不是很复杂,因为 INPUTEVENTS 里对每个观测记录也有一个 charttime,所以我们只需要在处理之前把这张表格和 CHARTEVENTS 组合起来就行了(可能用 SQL 的 UNION 语句吧)。

但是 PROCEDUREEVENTS_MV 就略微复杂点了,因为这个表格本来就有 starttimeendtime 这两列。所以我们怎么做呢,首先把 CHARTEVENTSINPUTEVENTS_MV 的数据提取合并完,之后再和 PROCEDUREEVENTS_MV 合并。

任务明确了我们就可以开始了。我们需要做这些:

  1. INPUTEVENTS_MV 的时间点数据聚合成时间段数据
  2. CHARTEVENTS 的时间点数据也聚合成时间段数据
  3. 把 1、2 得到的数据和 PROCEDUREEVENTS_MV 的比较,想办法把 1、2 得到的这两个数据合并起来
  4. 最后把 PROCEDUREEVENTS_MV 和上面 CHARTEVENTS + INPUTEVENTS_MV 得到的数据再合并起来得到所有 MetaVision 数据的时间段。

这个记事本本来作为示例,为了代码运行效率我们把查询限制为一个 icustay_id,我们这里使用 icustay_id = 246866(从前面可以看到这其实就是第一个 icustay_id)。所以每次代码 WHERE 里最后都要记得加上 AND icustay_id = 246866
原文这里还定义了函数用来在输出里去掉 icustay_id 和年月。

Aggregating INPUTEVENTS_MV

我们先来看 INPUTEVENTS_MV。这个表的每条记录都有一对 starttime 和 endtime。注意我们查询时要加上 statusdescription != 'Rewritten' 来排除被重写的医嘱,因为这些医嘱并没有实际执行(保留下来只是用作审计用途,而且并没有说明实际用的药物是什么)。

SELECT linkorderid
  , orderid
  , case when itemid = 227525 then 'Calcium' else 'KCl' end as label
  , starttime, endtime
  , rate, rateuom
  , statusdescription
FROM inputevents_mv
WHERE itemid IN
(
  --227525,-- Calcium Gluconate (CRRT)
  227536 -- KCl (CRRT)
)
AND statusdescription != 'Rewritten'
AND icustay_id = '246866'
ORDER BY starttime, endtime;

得到:

* linkorderid orderid label starttime endtime rate rateuom statusdescription
0 KCl Day 11, 21:30 Day 12, 02:30 NaT 1 4.000000 mEq./hour FinishedRunning
1 KCl Day 11, 23:45 Day 12, 02:41 Day 12, 02:30 1 10.002273 mEq./hour FinishedRunning
2 KCl Day 12, 02:41 Day 12, 05:36 Day 12, 02:41 0 9.997713 mEq./hour FinishedRunning
3 KCl Day 12, 05:36 Day 12, 08:31 Day 12, 05:36 0 10.285715 mEq./hour FinishedRunning
4 KCl Day 12, 08:31 Day 12, 11:29 Day 12, 08:31 0 10.112360 mEq./hour FinishedRunning
5 KCl Day 12, 11:29 Day 12, 14:28 Day 12, 11:29 0 10.055866 mEq./hour FinishedRunning
6 KCl Day 12, 14:28 Day 12, 17:25 Day 12, 14:28 0 10.169492 mEq./hour FinishedRunning
7 KCl Day 12, 17:25 Day 12, 20:24 Day 12, 17:25 0 10.055866 mEq./hour FinishedRunning
8 KCl Day 12, 20:24 Day 12, 20:30 Day 12, 20:24 0 10.000000 mEq./hour Paused
9 KCl Day 12, 21:30 Day 12, 21:35 Day 12, 20:30 1 9.997634 mEq./hour Changed
10 KCl Day 12, 21:35 Day 13, 02:08 Day 12, 21:35 0 6.190549 mEq./hour FinishedRunning
11 KCl Day 13, 02:08 Day 13, 07:06 Day 13, 02:08 0 6.040268 mEq./hour FinishedRunning
12 KCl Day 13, 07:06 Day 13, 12:03 Day 13, 07:06 0 6.006060 mEq./hour FinishedRunning
13 KCl Day 13, 12:03 Day 13, 16:29 Day 13, 12:03 0 6.005904 mEq./hour Stopped
14 KCl Day 13, 18:15 Day 13, 23:15 Day 13, 16:29 1 6.000000 mEq./hour FinishedRunning
15 KCl Day 14, 15:28 Day 14, 18:47 Day 13, 23:15 1 6.000000 mEq./hour FinishedRunning
16 KCl Day 14, 18:47 Day 14, 19:01 Day 14, 18:47 0 6.000000 mEq./hour Changed
17 KCl Day 14, 19:01 Day 14, 23:04 Day 14, 19:01 0 4.007380 mEq./hour Changed
18 KCl Day 14, 23:04 Day 14, 23:18 Day 14, 23:04 0 5.871428 mEq./hour FinishedRunning
19 KCl Day 14, 23:18 Day 15, 02:28 Day 14, 23:18 0 5.905264 mEq./hour FinishedRunning
20 KCl Day 15, 02:28 Day 15, 05:44 Day 15, 02:28 0 5.969388 mEq./hour FinishedRunning
21 KCl Day 15, 05:44 Day 15, 08:57 Day 15, 05:44 0 5.906736 mEq./hour FinishedRunning
22 KCl Day 15, 08:57 Day 15, 12:08 Day 15, 08:57 0 5.905759 mEq./hour FinishedRunning
23 KCl Day 15, 12:08 Day 15, 15:17 Day 15, 12:08 0 5.904762 mEq./hour FinishedRunning
24 KCl Day 15, 15:17 Day 15, 18:34 Day 15, 15:17 0 5.908629 mEq./hour FinishedRunning
25 KCl Day 15, 18:34 Day 15, 21:46 Day 15, 18:34 0 5.906250 mEq./hour FinishedRunning
26 KCl Day 15, 21:46 Day 16, 01:01 Day 15, 21:46 0 5.907693 mEq./hour FinishedRunning
27 KCl Day 16, 01:01 Day 16, 04:18 Day 16, 01:01 0 5.908629 mEq./hour FinishedRunning
28 KCl Day 16, 04:18 Day 16, 07:36 Day 16, 04:18 0 5.903030 mEq./hour FinishedRunning
29 KCl Day 16, 07:36 Day 16, 10:54 Day 16, 07:36 0 5.903030 mEq./hour FinishedRunning
30 KCl Day 16, 10:54 Day 16, 14:12 Day 16, 10:54 0 5.903030 mEq./hour FinishedRunning
31 KCl Day 16, 14:12 Day 16, 16:04 Day 16, 14:12 0 5.905613 mEq./hour Stopped

(事实上我没有改代码把年月去掉,我得到的结果和这个在格式上有一点点差别,我懒得再去改了,直接用了原文的输出。但我得到的结果和这个在内容上没有差别)

正常情况下 linkorderid 会把同属于一个医嘱、但输液速度可能发生过改动的多行关联起来,但是效果不是很好。8-10 行和 16-18 行连到一起了(即它们应该是相连续的,只是输液速度可能变了或者没变),但是我们发现很多应该属于同一事件的行并没有连起来。我们想要的是把连续的事件都合并起来简化成时间段,看来我们得检查数据,然后把一行的 starttime 等于上一行的 endtime 的这些行合并起来。大概步骤是:

  1. 创建一个二进制的 flag 用来标记新的 “event”,“event” 认为是多个时间上相连续的用药(不知道 administrations 到底翻译成什么比较好...),即如果下一行与上一行在时间上不连续就认为是新的 “event” 并标记为 1,下一行在时间上与上一行直接连续则标记为 0
  2. 聚合得到的二进制 flag,然后给每个事件分配一个唯一的整数值(需要在 event 上 PARTITION
  3. 对于每个事件再创建一个用来标识最后一行的整数值(用来从最后一行获得有用的信息)
  4. 在每个 event 的 partition 上直接分组聚合,把多个连续的用药信息合并得到一个 starttimeendtime

我们一步一步来看代码怎么写。

Step 4.1: create a binary flag for new events

先看这段代码(这是不完整示例,不可运行):

WITH t1 AS
(
SELECT icustay_id
  , CASE WHEN itemid = 227525 then 'Calcium' else 'KCl' END AS label
  , starttime
  , endtime
  , CASE WHEN LAG(endtime) 
		  OVER (PARTITION BY icustay_id, itemid 
			  ORDER BY starttime, endtime) = starttime 
		THEN 0 ELSE 1 END
    AS new_event_flag
  , rate, rateuom
  , statusdescription
FROM inputevents_mv
WHERE itemid IN
(
  --227525,-- Calcium Gluconate (CRRT)
  227536 -- KCl (CRRT)
)
AND statusdescription != 'Rewritten'
AND icustay_id = '246866';

这段代码就可以把一个病人的 KCl 使用情况从 INPUTEVENTS_MV 中提取出来(因为仅仅是示例所以才加了针对一个病人的限制条件,把这个限制条件去掉就能查询所有人了)。

关键的代码是:

, CASE WHEN LAG(endtime) OVER
	(PARTITION BY icustay_id, itemid 
		ORDER BY starttime, endtime) = starttime
THEN 0 ELSE 1 END AS new_event_flag

这段代码的作用是生成一个布尔值,在当前行的 starttime 和上一行的 endtime 不等时为 1。即,这个布尔值 flag 标记了新的“event”。看看实际运行的效果:

WITH t1 AS
(
  SELECT icustay_id
  , CASE WHEN 
      itemid = 227525 THEN 'Calcium' 
    ELSE 'KCl' END AS label
  , starttime, endtime
  , LAG(endtime) OVER 
      (PARTITION BY icustay_id, itemid ORDER BY starttime, endtime)
      AS endtime_lag
  , CASE WHEN LAG(endtime) OVER 
      (PARTITION BY icustay_id, itemid ORDER BY starttime, endtime) = starttime THEN 0
    ELSE 1 END AS new_event_flag
  , rate, rateuom
  , statusdescription
  FROM inputevents_mv
  WHERE itemid IN
	  (
  --227525,-- Calcium Gluconate (CRRT)
  227536 -- KCl (CRRT)
	  )
  AND statusdescription != 'Rewritten'
  AND icustay_id = '246866'
)
SELECT 
label
, starttime, endtime, endtime_lag
, new_event_flag
, rate, rateuom
, statusdescription
FROM t1;

得到:

* label starttime endtime endtime_lag new_event_flag rate rateuom statusdescription
0 KCl Day 11, 21:30 Day 12, 02:30 NaT 1 4.000000 mEq./hour FinishedRunning
1 KCl Day 11, 23:45 Day 12, 02:41 Day 12, 02:30 1 10.002273 mEq./hour FinishedRunning
2 KCl Day 12, 02:41 Day 12, 05:36 Day 12, 02:41 0 9.997713 mEq./hour FinishedRunning
3 KCl Day 12, 05:36 Day 12, 08:31 Day 12, 05:36 0 10.285715 mEq./hour FinishedRunning
4 KCl Day 12, 08:31 Day 12, 11:29 Day 12, 08:31 0 10.112360 mEq./hour FinishedRunning
5 KCl Day 12, 11:29 Day 12, 14:28 Day 12, 11:29 0 10.055866 mEq./hour FinishedRunning
6 KCl Day 12, 14:28 Day 12, 17:25 Day 12, 14:28 0 10.169492 mEq./hour FinishedRunning
7 KCl Day 12, 17:25 Day 12, 20:24 Day 12, 17:25 0 10.055866 mEq./hour FinishedRunning
8 KCl Day 12, 20:24 Day 12, 20:30 Day 12, 20:24 0 10.000000 mEq./hour Paused
9 KCl Day 12, 21:30 Day 12, 21:35 Day 12, 20:30 1 9.997634 mEq./hour Changed
10 KCl Day 12, 21:35 Day 13, 02:08 Day 12, 21:35 0 6.190549 mEq./hour FinishedRunning
11 KCl Day 13, 02:08 Day 13, 07:06 Day 13, 02:08 0 6.040268 mEq./hour FinishedRunning
12 KCl Day 13, 07:06 Day 13, 12:03 Day 13, 07:06 0 6.006060 mEq./hour FinishedRunning
13 KCl Day 13, 12:03 Day 13, 16:29 Day 13, 12:03 0 6.005904 mEq./hour Stopped
14 KCl Day 13, 18:15 Day 13, 23:15 Day 13, 16:29 1 6.000000 mEq./hour FinishedRunning
15 KCl Day 14, 15:28 Day 14, 18:47 Day 13, 23:15 1 6.000000 mEq./hour FinishedRunning
16 KCl Day 14, 18:47 Day 14, 19:01 Day 14, 18:47 0 6.000000 mEq./hour Changed
17 KCl Day 14, 19:01 Day 14, 23:04 Day 14, 19:01 0 4.007380 mEq./hour Changed
18 KCl Day 14, 23:04 Day 14, 23:18 Day 14, 23:04 0 5.871428 mEq./hour FinishedRunning
19 KCl Day 14, 23:18 Day 15, 02:28 Day 14, 23:18 0 5.905264 mEq./hour FinishedRunning
20 KCl Day 15, 02:28 Day 15, 05:44 Day 15, 02:28 0 5.969388 mEq./hour FinishedRunning
21 KCl Day 15, 05:44 Day 15, 08:57 Day 15, 05:44 0 5.906736 mEq./hour FinishedRunning
22 KCl Day 15, 08:57 Day 15, 12:08 Day 15, 08:57 0 5.905759 mEq./hour FinishedRunning
23 KCl Day 15, 12:08 Day 15, 15:17 Day 15, 12:08 0 5.904762 mEq./hour FinishedRunning
24 KCl Day 15, 15:17 Day 15, 18:34 Day 15, 15:17 0 5.908629 mEq./hour FinishedRunning
25 KCl Day 15, 18:34 Day 15, 21:46 Day 15, 18:34 0 5.906250 mEq./hour FinishedRunning
26 KCl Day 15, 21:46 Day 16, 01:01 Day 15, 21:46 0 5.907693 mEq./hour FinishedRunning
27 KCl Day 16, 01:01 Day 16, 04:18 Day 16, 01:01 0 5.908629 mEq./hour FinishedRunning
28 KCl Day 16, 04:18 Day 16, 07:36 Day 16, 04:18 0 5.903030 mEq./hour FinishedRunning
29 KCl Day 16, 07:36 Day 16, 10:54 Day 16, 07:36 0 5.903030 mEq./hour FinishedRunning
30 KCl Day 16, 10:54 Day 16, 14:12 Day 16, 10:54 0 5.903030 mEq./hour FinishedRunning
31 KCl Day 16, 14:12 Day 16, 16:04 Day 16, 14:12 0 5.905613 mEq./hour Stopped

上面的例子里为了清楚地展示这一查询的工作原理,我们特意添加了 endtime_lag 这一列。可以看到第一行的 endtime_lag 是 NaT(即 null),所以 new_event_flag = 1。而下一行 endtime_lag != starttime,所以 new_event_flag 又是 1。再然后,第 2 行(最左边标记为 2,下同)的 endtime_lag == starttime,所以 new_event_flag = 0(时间上连续,所以不是新事件,记 0)。这一连续事件一直持续到第 9 行再一次出现 endtime_lag != starttime(即一个新的事件,0 变为 1)。可以看到第 8 行甚至告诉我们原因:因为用药 “Paused”(暂停)了。这就是我们上面提到的,一个事件的最后一行可能会提供有用信息的意思。

Step 4.2: create an integer partition for each event

在 SQL 里,要把行通过分组聚合起来要用 partition。要用 partition,那我们得借助这些分组的某种唯一键值(一般是整数)。一旦有了这个键值就能用 SQL 标准的聚合操作,比如 MAX()MIN()等等这些(不指定特定列的情况下,SQL 的窗口函数运行的原则与这些函数相同)。

这样来说,我们下一步就是要用上面得到的 new_event_flag 在我们想要合并的行分组上再得到一个整数键值了。因为我们想要的是把新事件合并起来,那可以通过在 new_event_flag 上累加:当有新的事件时(new_event_flag = 1)这个值就会加 1,这样同属一个事件的行的这个和会一样,直到下一事件这个值才会再加 1。这样就很巧妙地为每一个事件分配了一个唯一的整数键值了。代码大概是:

SUM(new_event_flag) OVER
    (PARTITION BY icustay_id, label
    ORDER BY starttime, endtime)
AS time_partition

看看实际效果:

WITH t1 AS
  (
    SELECT icustay_id
    , CASE WHEN
        itemid = 227525
        THEN 'Calcium' ELSE 'KCl' END AS label
    , starttime, endtime
    , CASE WHEN
        LAG(endtime) OVER
          (PARTITION BY icustay_id, itemid ORDER BY starttime, endtime) = starttime
        THEN 0 ELSE 1 END AS new_event_flag
    , rate, rateuom, statusdescription
    FROM inputevents_mv
    WHERE itemid IN
      (
        --227525,-- Calcium Gluconate (CRRT)
        227536 -- KCl (CRRT)
      )
    AND statusdescription != 'Rewritten'
    AND icustay_id = '246866'
  )
  ,t2 AS
  (
    SELECT icustay_id
    , label, starttime, endtime, new_event_flag
    , SUM(new_event_flag) OVER
        (PARTITION BY icustay_id, label ORDER BY starttime, endtime) AS time_partition
    , rate, rateuom, statusdescription
    FROM t1
  )
SELECT
label
, starttime, endtime
, new_event_flag
, time_partition
, rate, rateuom, statusdescription
FROM t2
ORDER BY starttime, endtime;

得到:

* label starttime endtime new_event_flag time_partition rate rateuom statusdescription
0 KCl Day 11, 21:30 Day 12, 02:30 1 1 4.000000 mEq./hour FinishedRunning
1 KCl Day 11, 23:45 Day 12, 02:41 1 2 10.002273 mEq./hour FinishedRunning
2 KCl Day 12, 02:41 Day 12, 05:36 0 2 9.997713 mEq./hour FinishedRunning
3 KCl Day 12, 05:36 Day 12, 08:31 0 2 10.285715 mEq./hour FinishedRunning
4 KCl Day 12, 08:31 Day 12, 11:29 0 2 10.112360 mEq./hour FinishedRunning
5 KCl Day 12, 11:29 Day 12, 14:28 0 2 10.055866 mEq./hour FinishedRunning
6 KCl Day 12, 14:28 Day 12, 17:25 0 2 10.169492 mEq./hour FinishedRunning
7 KCl Day 12, 17:25 Day 12, 20:24 0 2 10.055866 mEq./hour FinishedRunning
8 KCl Day 12, 20:24 Day 12, 20:30 0 2 10.000000 mEq./hour Paused
9 KCl Day 12, 21:30 Day 12, 21:35 1 3 9.997634 mEq./hour Changed
10 KCl Day 12, 21:35 Day 13, 02:08 0 3 6.190549 mEq./hour FinishedRunning
11 KCl Day 13, 02:08 Day 13, 07:06 0 3 6.040268 mEq./hour FinishedRunning
12 KCl Day 13, 07:06 Day 13, 12:03 0 3 6.006060 mEq./hour FinishedRunning
13 KCl Day 13, 12:03 Day 13, 16:29 0 3 6.005904 mEq./hour Stopped
14 KCl Day 13, 18:15 Day 13, 23:15 1 4 6.000000 mEq./hour FinishedRunning
15 KCl Day 14, 15:28 Day 14, 18:47 1 5 6.000000 mEq./hour FinishedRunning
16 KCl Day 14, 18:47 Day 14, 19:01 0 5 6.000000 mEq./hour Changed
17 KCl Day 14, 19:01 Day 14, 23:04 0 5 4.007380 mEq./hour Changed
18 KCl Day 14, 23:04 Day 14, 23:18 0 5 5.871428 mEq./hour FinishedRunning
19 KCl Day 14, 23:18 Day 15, 02:28 0 5 5.905264 mEq./hour FinishedRunning
20 KCl Day 15, 02:28 Day 15, 05:44 0 5 5.969388 mEq./hour FinishedRunning
21 KCl Day 15, 05:44 Day 15, 08:57 0 5 5.906736 mEq./hour FinishedRunning
22 KCl Day 15, 08:57 Day 15, 12:08 0 5 5.905759 mEq./hour FinishedRunning
23 KCl Day 15, 12:08 Day 15, 15:17 0 5 5.904762 mEq./hour FinishedRunning
24 KCl Day 15, 15:17 Day 15, 18:34 0 5 5.908629 mEq./hour FinishedRunning
25 KCl Day 15, 18:34 Day 15, 21:46 0 5 5.906250 mEq./hour FinishedRunning
26 KCl Day 15, 21:46 Day 16, 01:01 0 5 5.907693 mEq./hour FinishedRunning
27 KCl Day 16, 01:01 Day 16, 04:18 0 5 5.908629 mEq./hour FinishedRunning
28 KCl Day 16, 04:18 Day 16, 07:36 0 5 5.903030 mEq./hour FinishedRunning
29 KCl Day 16, 07:36 Day 16, 10:54 0 5 5.903030 mEq./hour FinishedRunning
30 KCl Day 16, 10:54 Day 16, 14:12 0 5 5.903030 mEq./hour FinishedRunning
31 KCl Day 16, 14:12 Day 16, 16:04 0 5 5.905613 mEq./hour Stopped

上面的例子(希望是)清楚地展示了如何在 KCl 的用药记录上,通过窗口函数的 PARTITION 加上对 new_event_flag 的累加,得到一个新的列 time_partition。

Step 4.3: create an integer to mark the last row of an event

从前面我们知道,每个事件的最后一个 statusdescription 可能会提供关于事件为何停止的有用信息,所以我们应该为每个事件的最后一行加上一个 flag,即当某一行是本事件的最后一行时把它标记为 1:

WITH t1 AS
  (
    SELECT icustay_id
    , CASE WHEN 
        itemid = 227525 THEN 'Calcium' 
      ELSE 'KCl' END AS label
    , starttime, endtime
    , CASE WHEN LAG(endtime) OVER 
        (PARTITION BY icustay_id, itemid ORDER BY starttime, endtime) = starttime
        THEN 0
      ELSE 1 END AS new_event_flag
    , rate, rateuom, statusdescription
    FROM inputevents_mv
    WHERE itemid IN
    (
      --227525,-- Calcium Gluconate (CRRT)
      227536 -- KCl (CRRT)
    )
    AND statusdescription != 'Rewritten'
    AND icustay_id = '246866'
  )
  , t2 AS
  (
    SELECT 
    icustay_id, label
    , starttime, endtime
    , SUM(new_event_flag) OVER 
        (PARTITION BY icustay_id, label 
	        ORDER BY starttime, endtime)
	    AS time_partition 
    , rate, rateuom, statusdescription
    FROM t1
  )
  , t3 AS
  (
    SELECT
    icustay_id, label
    , starttime, endtime, time_partition 
    , rate, rateuom, statusdescription
    , ROW_NUMBER() OVER 
        (PARTITION BY icustay_id, label, time_partition 
            ORDER BY starttime DESC, endtime DESC) 
        AS lastrow
    FROM t2
  )
SELECT 
label, starttime, endtime, time_partition
, rate, rateuom
, statusdescription, lastrow
FROM t3
ORDER BY starttime, endtime;

得到:

* label starttime endtime time_partition rate rateuom statusdescription lastrow
0 KCl Day 11, 21:30 Day 12, 02:30 1 4.000000 mEq./hour FinishedRunning 1
1 KCl Day 11, 23:45 Day 12, 02:41 2 10.002273 mEq./hour FinishedRunning 8
2 KCl Day 12, 02:41 Day 12, 05:36 2 9.997713 mEq./hour FinishedRunning 7
3 KCl Day 12, 05:36 Day 12, 08:31 2 10.285715 mEq./hour FinishedRunning 6
4 KCl Day 12, 08:31 Day 12, 11:29 2 10.112360 mEq./hour FinishedRunning 5
5 KCl Day 12, 11:29 Day 12, 14:28 2 10.055866 mEq./hour FinishedRunning 4
6 KCl Day 12, 14:28 Day 12, 17:25 2 10.169492 mEq./hour FinishedRunning 3
7 KCl Day 12, 17:25 Day 12, 20:24 2 10.055866 mEq./hour FinishedRunning 2
8 KCl Day 12, 20:24 Day 12, 20:30 2 10.000000 mEq./hour Paused 1
9 KCl Day 12, 21:30 Day 12, 21:35 3 9.997634 mEq./hour Changed 5
10 KCl Day 12, 21:35 Day 13, 02:08 3 6.190549 mEq./hour FinishedRunning 4
11 KCl Day 13, 02:08 Day 13, 07:06 3 6.040268 mEq./hour FinishedRunning 3
12 KCl Day 13, 07:06 Day 13, 12:03 3 6.006060 mEq./hour FinishedRunning 2
13 KCl Day 13, 12:03 Day 13, 16:29 3 6.005904 mEq./hour Stopped 1
14 KCl Day 13, 18:15 Day 13, 23:15 4 6.000000 mEq./hour FinishedRunning 1
15 KCl Day 14, 15:28 Day 14, 18:47 5 6.000000 mEq./hour FinishedRunning 17
16 KCl Day 14, 18:47 Day 14, 19:01 5 6.000000 mEq./hour Changed 16
17 KCl Day 14, 19:01 Day 14, 23:04 5 4.007380 mEq./hour Changed 15
18 KCl Day 14, 23:04 Day 14, 23:18 5 5.871428 mEq./hour FinishedRunning 14
19 KCl Day 14, 23:18 Day 15, 02:28 5 5.905264 mEq./hour FinishedRunning 13
20 KCl Day 15, 02:28 Day 15, 05:44 5 5.969388 mEq./hour FinishedRunning 12
21 KCl Day 15, 05:44 Day 15, 08:57 5 5.906736 mEq./hour FinishedRunning 11
22 KCl Day 15, 08:57 Day 15, 12:08 5 5.905759 mEq./hour FinishedRunning 10
23 KCl Day 15, 12:08 Day 15, 15:17 5 5.904762 mEq./hour FinishedRunning 9
24 KCl Day 15, 15:17 Day 15, 18:34 5 5.908629 mEq./hour FinishedRunning 8
25 KCl Day 15, 18:34 Day 15, 21:46 5 5.906250 mEq./hour FinishedRunning 7
26 KCl Day 15, 21:46 Day 16, 01:01 5 5.907693 mEq./hour FinishedRunning 6
27 KCl Day 16, 01:01 Day 16, 04:18 5 5.908629 mEq./hour FinishedRunning 5
28 KCl Day 16, 04:18 Day 16, 07:36 5 5.903030 mEq./hour FinishedRunning 4
29 KCl Day 16, 07:36 Day 16, 10:54 5 5.903030 mEq./hour FinishedRunning 3
30 KCl Day 16, 10:54 Day 16, 14:12 5 5.903030 mEq./hour FinishedRunning 2
31 KCl Day 16, 14:12 Day 16, 16:04 5 5.905613 mEq./hour Stopped 1

Step 4.4: aggregate to merge together contiguous start/end times

现在我们在 time_partition (一个 time_partition 对应一个事件)的基础上对 starttimeendtime进行聚合(即在每个事件的基础上聚合):

  • 想要的是第一个 starttime,因此用 MIN(starttime)
  • 想要的是最后一个 endtime,因此用 MAX(endtime)
  • 想要的是最后一个 statusdescription,所以我们在仅有最后一行不是 null 的列上聚合

最后一步不是很直观,我们来看代码:

, MIN(CASE WHEN lastrow = 1 THEN statusdescription ELSE null END) AS statusdescription

聚合函数会忽略 null 值,而我们新建的这一列仅在 lastrow = 1 时不为 null,因此保证了聚合函数最终只会返回 lastrow = 1 那一行的值。而这个聚合函数其实用 MIN() 或 MAX() 都可以,因为这个聚合操作最终只会作用在一个值上(每个事件里 lastrow = 1 的行只有一个)。

综合一下,我们最终的查询长这样:

WITH t1 AS
  (
    SELECT icustay_id
    , CASE WHEN
        itemid = 227525
      THEN 'Calcium' ELSE 'KCl' END AS label
    , starttime, endtime
    , CASE WHEN
        LAG(endtime) OVER
          (PARTITION BY icustay_id, itemid 
	          ORDER BY starttime, endtime) = starttime
      THEN 0
      ELSE 1 END AS new_event_flag
    , rate, rateuom, statusdescription
    FROM inputevents_mv
    WHERE itemid IN
      (
      227525,-- Calcium Gluconate (CRRT)
      227536 -- KCl (CRRT)
      )
    AND statusdescription != 'Rewritten'
    AND icustay_id = '246866'
  )
  , t2 AS
  (
    SELECT
    icustay_id, label
    , starttime, endtime
    , SUM(new_event_flag) OVER
        (PARTITION BY icustay_id, label 
	        ORDER BY starttime, endtime)
	    AS time_partition
    , rate, rateuom, statusdescription
    FROM t1
  )
  , t3 AS
  (
    SELECT
    icustay_id, label
    , starttime, endtime, time_partition
    , rate, rateuom, statusdescription
    , ROW_NUMBER() OVER
        (PARTITION BY icustay_id, label, time_partition
          ORDER BY starttime DESC, endtime DESC)
	    AS lastrow
    FROM t2
  )
SELECT
label
--, time_partition
, MIN(starttime) AS starttime
, MAX(endtime) AS endtime
, MIN(rate) AS rate_min
, MAX(rate) AS rate_max
, MIN(rateuom) AS rateuom
, MIN(CASE WHEN
        lastrow = 1 THEN statusdescription
      ELSE null END)
  AS statusdescription
FROM t3
GROUP BY icustay_id, label, time_partition
ORDER BY starttime, endtime;

得到:

* label starttime endtime rate_min rate_max rateuom statusdescription
0 KCl Day 11, 21:30 Day 12, 02:30 4.000000 4.000000 mEq./hour FinishedRunning
1 KCl Day 11, 23:45 Day 12, 20:30 9.997713 10.285715 mEq./hour Paused
2 Calcium Day 11, 23:45 Day 12, 20:30 1.201625 2.002708 grams/hour Paused
3 Calcium Day 12, 21:30 Day 13, 15:54 1.206690 1.805171 grams/hour FinishedRunning
4 KCl Day 12, 21:30 Day 13, 16:29 6.005904 9.997634 mEq./hour Stopped
5 KCl Day 13, 18:15 Day 13, 23:15 6.000000 6.000000 mEq./hour FinishedRunning
6 Calcium Day 13, 18:15 Day 13, 23:15 1.602136 1.602136 grams/hour Paused
7 KCl Day 14, 15:28 Day 16, 16:04 4.007380 6.000000 mEq./hour Stopped
8 Calcium Day 14, 15:28 Day 16, 16:05 1.196013 1.990426 grams/hour Stopped

结果看起来没什么问题。所以现在就可以去掉 icustay_id = '246866'这个限制条件查询所有病人数据了(注意这时候是用 R 了,作用是把这个查询所有 INPUTEVENTS_MV 病人的完整的查询语句暂时记下来,后面要用到所有病人数据的时候直接套壳 query() 就能迅速拿到数据而不用再把长长的查询语句再复制粘贴一遍了):

query_inputevents <- "
WITH t1 AS
  (
    SELECT icustay_id
    , CASE WHEN
        itemid = 227525 THEN 'Calcium'
      ELSE 'KCl' END AS label
    , starttime, endtime
    , CASE WHEN LAG(endtime) OVER
        (PARTITION BY icustay_id, itemid
	        ORDER BY starttime, endtime) = starttime
      THEN 0
	  ELSE 1 END AS new_event_flag
    , rate, rateuom
    , statusdescription
    FROM inputevents_mv
    WHERE itemid IN
      (
      227525,-- Calcium Gluconate (CRRT)
      227536 -- KCl (CRRT)
      )
    AND statusdescription != 'Rewritten'
  )
  , t2 as
  (
    SELECT
    icustay_id, label
    , starttime, endtime
    , SUM(new_event_flag) OVER
        (PARTITION BY icustay_id, label ORDER BY starttime, endtime)
        AS time_partition
    , rate, rateuom, statusdescription
    FROM t1
  )
  , t3 as
  (
    SELECT
    icustay_id, label
    , starttime, endtime
    , time_partition
    , rate, rateuom, statusdescription
    , ROW_NUMBER() OVER
        (PARTITION BY icustay_id, label, time_partition
          ORDER BY starttime DESC, endtime DESC)
      AS lastrow
    FROM t2
  )
SELECT
icustay_id
, time_partition AS num
, MIN(starttime) AS starttime
, max(endtime) AS endtime
, label
--, MIN(rate) AS rate_min
--, max(rate) AS rate_max
--, MIN(rateuom) AS rateuom
--, MIN(CASE WHEN
--			lastrow = 1 THEN statusdescription
--		ELSE null END)
--	AS statusdescription
FROM t3
GROUP BY icustay_id, label, time_partition
ORDER BY starttime, endtime;
"

Conclusion

现在我们对于合并 INPUTEVENTS_MV 里的连续事件有了一个很好的方法。但注意,一般情况下没有必要这么做,因为 linkorderid 本来就是为了帮我们把同一事件(同一个医嘱)的多行连接起来的。举个例子,我们看一看 ICU 里一个非常常用的镇静药物丙泊酚:

WITH t1 AS
  (
    SELECT
      icustay_id, di.label
      , mv.linkorderid, mv.orderid
      , starttime, endtime
      , rate, rateuom
      , amount, amountuom
    FROM inputevents_mv mv
    INNER JOIN d_items di ON
      mv.itemid = di.itemid
    AND statusdescription != 'Rewritten'
    AND icustay_id = '246866'
    AND mv.itemid = 222168
  )
SELECT 
  label
  , linkorderid, orderid
  , starttime, endtime
  , rate, rateuom
  , amount, amountuom
FROM t1
ORDER BY starttime, endtime;

可以得到:

* label linkorderid orderid starttime endtime rate rateuom amount amountuom
0 Propofol 1405816 1405816 Day 09, 18:29 Day 10, 00:14 50.002502 mcg/kg/min 17.250863 mg
1 Propofol 1405816 2101314 Day 10, 01:01 Day 10, 01:05 50.002502 mcg/kg/min 0.200010 mg
2 Propofol 1405816 7312240 Day 10, 01:05 Day 10, 08:05 40.001221 mcg/kg/min 16.800513 mg
3 Propofol 1405816 7169415 Day 10, 08:15 Day 10, 12:00 40.001221 mcg/kg/min 9.000275 mg
4 Propofol 1405816 5852722 Day 10, 12:05 Day 10, 12:40 40.001221 mcg/kg/min 1.400043 mg
5 Propofol 1405816 3365285 Day 10, 12:40 Day 10, 14:00 20.000627 mcg/kg/min 1.600050 mg
6 Propofol 522225 522225 Day 10, 14:00 Day 10, 14:01 None 10.000001 mg
7 Propofol 1405816 5245063 Day 10, 14:00 Day 10, 14:07 40.001254 mcg/kg/min 0.280009 mg
8 Propofol 2703553 2703553 Day 10, 14:05 Day 10, 14:06 None 10.000001 mg
9 Propofol 1405816 6687581 Day 10, 14:07 Day 11, 08:45 30.001253 mcg/kg/min 33.541401 mg
10 Propofol 4912696 4912696 Day 10, 16:10 Day 10, 16:11 None 10.000001 mg
11 Propofol 3838086 3838086 Day 10, 16:55 Day 10, 16:56 None 10.000001 mg
12 Propofol 5665808 5665808 Day 11, 01:51 Day 11, 01:52 None 10.000001 mg
13 Propofol 1405816 3755617 Day 11, 09:10 Day 11, 13:36 30.001253 mcg/kg/min 7.980333 mg

可以看到 linkorderid 也可以很好地把连续的时间组合到了一起,而不需要我们上面辛辛苦苦这么多步骤。它同时也区分了不同的用药,上述第 6 行可以看到有一个 “1 分钟” 的用药。这其实是 MetaVision 系统的表格(结尾带有 _mv 的)标记瞬间事件的方法——具体到用药上来说,这是使用了丸剂(相对于静滴来说,口服丸剂是瞬间完成的)。(2018-10-11 更新:这里其实是推注用药,英语用 Bolus 表示一次性推注,用以和静脉滴注相区别)

用这个数据我们可以像之前那样在每个事件 partition 进行聚合,但是现在我们根本就不需要创建 partition 了,因为这其实就是 linkorderid

WITH t1 AS
  (
    SELECT icustay_id
      , di.itemid, di.label
      , mv.linkorderid, mv.orderid
      , starttime, endtime
      , amount, amountuom
      , rate, rateuom
    FROM inputevents_mv mv
    INNER JOIN d_items di ON
      mv.itemid = di.itemid
    AND statusdescription != 'Rewritten'
    AND icustay_id = '246866'
    AND mv.itemid = 222168
  )
    SELECT icustay_id
      , label, linkorderid
      , MIN(starttime) AS starttime
      , max(endtime) AS endtime
      , MIN(rate) AS rate_min
      , MAX(rate) AS rate_max
      , MAX(rateuom) AS rateuom
      , MIN(amount) AS amount_min
      , MAX(amount) AS amount_max
      , MAX(amountuom) AS amountuom
    FROM t1
    GROUP BY icustay_id, itemid, label, linkorderid
    ORDER BY starttime, endtime;

得到:

* label linkorderid starttime endtime rate_min rate_max rateuom amount_min amount_max amountuom
0 Propofol 1405816 Day 09, 18:29 Day 11, 13:36 20.000627 50.002502 mcg/kg/min 0.200010 33.541401 mg
1 Propofol 522225 Day 10, 14:00 Day 10, 14:01 None 10.000001 10.000001 mg
2 Propofol 2703553 Day 10, 14:05 Day 10, 14:06 None 10.000001 10.000001 mg
3 Propofol 4912696 Day 10, 16:10 Day 10, 16:11 None 10.000001 10.000001 mg
4 Propofol 3838086 Day 10, 16:55 Day 10, 16:56 None 10.000001 10.000001 mg
5 Propofol 5665808 Day 11, 01:51 Day 11, 01:52 None 10.000001 10.000001 mg

丸剂那一行没有 rate(用药速度),这很正常,丸剂只有剂量没有用药速度。


又这么长了,奇怪。再分一篇吧。Peace。

让 R 在完成任务时发送通知或者叮一声

2017-06-22


Is there a way to make R beep/play a sound at the end of a script?

  1. Throw a Beep

    install.packages("beepr")
    library(beepr)
    beep()

    The package is developed on GitHub.

    Usage

    beep(sound = 1, expr = NULL)

    Arguments

    sound character string or number specifying what sound to be played by either specifying one of the built in sounds or specifying the path to a wav file. The default is 1. Possible sounds are:

    1. random

    2. "ping"

    3. "coin"

    4. "fanfare"

    5. "complete"

    6. "treasure"

    7. "ready"

    8. "shotgun"

    9. "mario"

    10. "wilhelm"

    11. "facebook"

    12. "sword"

  2. To get a message, you can use notify-send command:

system("notify-send \"R script finished running\"")

Shell 下字符串取子集和重定向

screenshot_2018-06-15_10-22-22

最近处理数据经常碰到需要取某个字符串的一部分用来重命名的情况,比如 sample1.fastq.gz 比对到基因组之后想取出 sample1 用来命名生成的 sam 文件或者 log;PC 跑起来太慢,log 又不能一直盯着看,输出太多还必须重定向到文件。每次都要查一下取子集以及输出和错误怎么重定向,自己都烦了,干脆写在这儿了。

字符串取子集

Shell 里字符串取子集用到 ${} 这样的命令形式。下面通过例子来说明。

我们先定义一个变量:
file=/dir1/dir2/dir3/my.file.txt

下面就用 ${ } 分别获得不同的值:

${file#*/}:拿掉第一个 / 及其左边的字串:dir1/dir2/dir3/my.file.txt
${file##*/}:拿掉最后一个 / 及其左边的字串:my.file.txt
${file#*.}:拿掉第一个 . 及其左边的字串:file.txt
${file##*.}:拿掉最后一个. 及其左边的字串:txt

${file%/*}:拿掉最后一个 / 及其右边的字串:/dir1/dir2/dir3
${file%%/*}:拿掉第一个 / 及其右边的字串:(空值)
${file%.*}:拿掉最后一个 . 及其右边的字串:/dir1/dir2/dir3/my.file
${file%%.*}:拿掉第一个 . 及其右边的字串:/dir1/dir2/dir3/my

关于 %# 谁是左谁是右,简单的记法就是看这两个键位在普通 QWERTY 键盘上的位置:
# 是去掉左边(在键盘上 #% 的左边)
% 是去掉右边(在键盘上 %# 的右边)
符号用一次是最小匹配﹔两个连用就是最大匹配。

${file:0:5}:提取最左边的 5 个字符:/dir1
${file:5:5}:跳过前面 5 个字符后,提取接下来的 5 个字符:/dir2

对变量里的字串作替换:
${file/dir/path}:把第一个 dir 替换为 path/path1/dir2/dir3/my.file.txt
${file//dir/path}:把全部 dir 替换为 path/path1/path2/path3/my.file.txt
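
回到开头说的 fastq 重命名的场景,上面这些规则串起来就够用了。下面是一个小示意(文件名、参考基因组 ref.fa 和注释里的比对命令都只是假设,仅演示取子集的写法):

# 假设当前目录下有 sample1.fastq.gz、sample2.fastq.gz 等文件
for fq in *.fastq.gz; do
    sample=${fq%%.*}    # 拿掉第一个 . 及其右边的部分:sample1.fastq.gz -> sample1
    echo "比对 ${fq},结果将写入 ${sample}.sam,日志写入 ${sample}.log"
    # bwa mem ref.fa "${fq}" > "${sample}.sam" 2> "${sample}.log"   # 实际比对命令的示意
done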

重定向

Linux 中有三种标准输入输出,它们分别是STDINSTDOUTSTDERR,分别对应的数字是012

  • STDIN是标准输入,默认从键盘读取信息;
  • STDOUT是标准输出,默认将输出结果输出至终端;
  • STDERR是标准错误,默认将输出结果输出至终端。

由于STDOUTSTDERR都会默认显示在终端上,为了区分二者的信息,就有了编号1表示STDOUT2表示STDERR

搞清楚标准输入输出和错误输出再来看具体的命令形式就简单多了:

  1. 终端执行 $ command 后,默认情况下,执行结果 STDOUT(标准输出)和 STDERR(错误输出,如果有的话)都会直接在终端打印出来,或者说由终端直接显示出来。
  2. 终端执行$ command 1> out.txt 2> err.txt 后,会将STDOUTSTDERR分别存放至out.txterr.txt 中。该命令也可以写成下面三种形式
 $ command > out.txt 2> err.txt
 $ command 2> err.txt >out.txt
 $ command 2> err.txt 1> out.txt

即顺序谁前谁后无所谓,而且默认输出就是 1,所以它是可以直接省略掉的。

  3. $ command > file 2>&1 命令里,& 并不是后台或者 AND 的意思。放在 > 后面的 &,表示重定向的目标不是一个普通文件,而是一个文件描述符(也就是标准输入输出这些)。所以 2> 1 代表将 STDERR 即错误输出重定向到当前路径下文件名为 1 的普通文件中,而 2>&1 则代表将错误输出重定向到标准输出。而由于标准输出已经被重定向到 file 中,因此最终的结果是标准输出和错误输出都被重定向到 file 中。
    &> file 是一种特殊的用法,也可以写成 >& file,二者的意思完全相同,都等价于 > file 2>&1,这里 &> 或者 >& 都应该视作整体,分开没有单独的含义。
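
上面这些可以用一个小实验验证一下(用 ls 一个存在的文件加一个不存在的文件,同时制造标准输出和错误输出,文件名只是示例):

touch exists.txt                                   # 造一个确实存在的文件
ls exists.txt no_such_file > out.txt 2> err.txt    # 标准输出和错误输出分开存
cat out.txt    # 只有 exists.txt
cat err.txt    # 只有 ls 关于 no_such_file 不存在的报错

ls exists.txt no_such_file > all.txt 2>&1          # 两种输出都进 all.txt
ls exists.txt no_such_file &> all2.txt             # 等价写法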

Tmux 入门和初步配置

cover

今天决定学一下 Tmux 怎么用,因为经常发现要开几个 Terminal Tabs 处理不同的东西,然后要对照的时候来回切真的很累。手按快捷键累,眼睛盯着来回在跳动的文本也累。Vim 本身其实也很容易可以左右上下的切分视图,但是切出来的都是 Vim。虽然想要运行 Shell 命令也不是不可以,但是终究没有 Tmux 直接切分终端来得方便。

当然 Tmux 好处还有很多,比如 ssh 连接到服务器打开 session,detach 到后台,断开 ssh 之后再连上去 attach 回之前的 session,东西都还在。但基本上我看上的就是切割 pane 这个了,也懒得说其他了,用得上再去学。

1. What is TMUX

由于我用的 Debian sid,所以安装肯定是没什么说的:

sudo apt install tmux

This APT has Super Cow Powers.

首先 Tmux 到底是什么呢?Yet another terminal?我们 man tmux看看:

tmux — terminal multiplexer

哦,名字原来这么来的。但是这个 multiplexer 是个什么东西?这单词我都不认识。谷歌翻译告诉我,复用器的意思。所以,tmux 的中文名就是终端复用器。那复用器又是个什么东西.....不知道。算了,往下看吧:


DESCRIPTION
     tmux is a terminal multiplexer: it enables a number of terminals to be created, 
     accessed, and controlled from a single screen.  tmux may be detached from a 
     screen and continue running in the background, then later reattached.

     When tmux is started it creates a new session with a single window and displays
     it on screen.  A status line at the bottom of the screen shows information on the 
     current session and is used to enter interactive commands.

     A session is a single collection of pseudo terminals under the management of tmux.
     Each session has one or more windows linked to it.  A window occupies the entire 
     screen and may be split into rectangular panes, each of which is a separate pseudo 
     terminal (the pty(4) manual page documents the technical details of pseudo terminals).
     Any number of tmux instances may connect to the same session, and any number of 
     windows may be present in the same session.  Once all sessions are killed, tmux exits.

     Each session is persistent and will survive accidental disconnection (such as ssh(1) 
     connection timeout) or intentional detaching (with the ‘C-b d’ key strokes).
     tmux may be reattached using:

           $ tmux attach

     In tmux, a session is displayed on screen by a client and all sessions are managed by 
     a single server.  The server and each client are separate processes which communicate 
     through a socket in /tmp.

我就一边看一边随便翻译一下,中间夹杂我的个人想法。可能有错误。

tmux 是个终端复用器,终端复用器是什么呢?它允许在一个屏幕(screen)里创建、使用和控制多个终端。这个屏按我们现在的用法来说,其实就是一个 terminal 窗口了。

看到这里我就稍稍懂了一点了。因为和早期计算机(感觉这里叫计算机比叫电脑合适)其实是服务器+终端这样的模式,只有通过终端连接进行操作。而这个终端不一定和服务器直接通过物理网线什么的接起来,可能根本就不在一个地方,而是通过远程 ssh 之类的这样连接的。说到底其实就是个远程连上去的显示器 + 键盘之类的外设,因为服务器自己是没有任何界面供人进行交互操作的。然后连上去一登录就是个现在的 Linux tty 这样的命令行界面。可以看到 tty 就是直接只有一屏,不存在窗口的概念,所以就说屏幕(screen)。其实这样翻译不是很准确,屏幕给人的感觉是屏幕这个实体的东西,但是这里应该指显示的一屏内容。这时候我们用来连计算机这个终端是物理的实体存在的东西。但因为我们现在大都是图形化的一个窗口来模拟这个终端,所以现在我们用的 terminal 的都叫做 Terminal Emulator 这就很好理解了,我们的这个软件窗口就是模拟当初连计算机的那个实体终端的。tmux 就是把一个终端复用,相当于一个变成了多个了。所以 tmux 的 screen 对应到我们 terminal 里使用时其实我们的 terminal emulator 的一整个窗口。tmux 下面出现的 server - client 概念其实也从这儿来,我们终端连计算机就是服务器-客户端(终端)这样的模型。

tmux 能从当前屏 detach 掉,但是它还会在后台运行,并且可以随时 attach 回去。tmux 启动时会在屏幕上创建并显示一个只有一个 window 的新 session。屏幕底部会显示一个状态栏。状态栏展示了当前 session 的一些信息,并可以用来输入交互式的命令。

session 是 tmux 管理的多个伪终端的集合。每个 session 下可以有一个或者多个 window,而每个 window 都会直接占据当前屏幕全部并且可以进一步分割成多个矩形 pane。每个 pane 里都是一个单独的伪终端。任意多个 tmux 实例可以连接到同一个 session,一个 session 里也可以有任意多个 window。所有 session 都关掉的时候 tmux 会退出。

每个 session 都是可以保持的,所以意外的 ssh 连接断掉或者按到 detach 的快捷键时都可以再 attach 回去。

tmux 由 client 展示在屏幕上,而多个 session 都由同一个 server 管理。server 和每个 client 之间的连接都是单独的进程,并可以通过 /tmp 下的一个 socket 来互相沟通。

2. First look

嗯,终于看完了。大概对 tmux 也有个了解了。层级关系:

服务器(server) + 客户端(client) --> 会话(session) --> 窗口(window) --> 窗格(pane)

所以基本上我们打开一个 terminal 进入 tmux 的同时就生成了一个 server-client 连接,并且同时创建一个 session。这个 session 会直接占据当前 terminal 整个显示空间。然后一个 session 下面可以有多个 window,就相当于我们在 terminal 里开了几个标签。一个 window 还能切分成 pane,这就是我想用的切分终端窗口的功能了。关系捋清楚了,再看我们的封面图:

cover

这是一个 session 打开了之后的样子,下面状态栏显示这个 session 名字叫做 MySn,里面有 0 -10 一共 11 个 window 并且每个 window 都有对应的名字(编号从 0 开始。然后我也不知道为啥我啥也没配置打开默认就这么多 window 了...)。当前在第一个(编号 0)叫做 zsh 的 window 里,下划线和星号就是表示的当前所在 window 了。然后当前这个 window 被分成了 3 个 pane,左上的 vim 打开了 tmux 的配置文件 ~/.tmux.conf,左下运行的 screenfetch,右边大 pane 打开了 man tmux。嗯,概念基本清晰了,开始配置了。
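
顺带一提,这个层级关系也可以直接用 tmux 的 list 系列命令在命令行里查看(下面的 MySn 就是上图里那个 session 的名字,仅作示例):

tmux ls                       # 列出所有 session,等同于 tmux list-sessions
tmux list-windows -t MySn     # 列出 MySn 这个 session 里的所有 window
tmux list-panes -t MySn:0     # 列出 MySn 的 0 号 window 里的所有 pane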

3. 配置 tmux

搞清楚概念了我们就要开始配置然后弄得自己顺手了。

首先得知道 tmux 和 vim 一样,一切操作靠键盘快捷键。类似于快捷键前面要按 Ctrl、Alt、Super 这些前缀一样,tmux 内部快捷键也有一个 prefix,默认是 Ctrl + b。但是我手小,这个键位有挑战性,所以和网上很多人一样第一时间改掉了它,我改为了 Ctrl + x。所以下面快捷键都会以 Ctrl + x 开头,但是我写做 <prefix>。(其实开始用的是 Ctrl + a,这也是网上我看别人用得最多的。但是用了发现和 Shell 本身的 Ctrl + a(跳到命令行首)冲突,所以又改为 Ctrl + x 😢)

一旦进入 tmux,由于几乎全是快捷键操作,会有一点第一次接触 vim 一样的手足无措感。所以得先学会进入、退出和管理当前的 session:

3.1 tmux 进入退出

终端 tmux 就能直接进去了。其他:

  • tmux new -s:建立新 session(-s 其实就是 session 咯),后面可以接名字
  • <prefix> d: 退出会话,回到 Shell 的终端环境。这个和 Shell 类似,Ctrl + d 退出。但是这里其实是 detach 的意思

这下平安了,知道怎么进怎么出,稍微安心点。继续:

  • tmux ls :查看当前后台 session。同时也会列出 session 名和里面有多少 window
  • tmux a -t Sn 进入后台的名为 Sn 的会话,a 是 attach(写 attach 也可以),t 是 target session
  • tmux rename -t OldName NewName:重命名 session。<prefix> $ 一样进入重命名 session 状态,<prefix> , 则重命名当前的 window
  • <prefix> s:S 应该是 status,显示当前 session 的信息。会详细显示所有 window 及其 pane 的信息,而且可以用方向键选择切换
  • tmux kill-session -t Sn:结束名为 Sn 的 session。<prefix> ::进入命令模式(状态栏颜色会变)。此时输入 kill-session -t Sn一样可以结束 Sn 这个 session,直接 kill-session 结束当前这个 session,此时终端 tmux ls 再看后台的这个 session 就没了。
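
把上面这些串起来,一个典型的进出流程大概是下面这样(session 名 demo 只是示例):

tmux new -s demo            # 新建一个名为 demo 的 session
# 在里面干活……然后 <prefix> d detach 回到原来的 Shell
tmux ls                     # 确认 demo 还挂在后台
tmux attach -t demo         # 再回到 demo 接着干
tmux kill-session -t demo   # 用完了就杀掉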

3.2 其他快捷键和配置

全局快捷键

知道怎么进入退出和简单管理 sesion 之后,现在我们可以开始用了。然后就是了解快捷键和配置快捷键了。上面已经说过我们改了默认的前缀 Ctrl + bCtrl + x。在~/.tmux.conf配置文件里就是:

# Change the prefix key to C-x
unbind C-b
set -g prefix C-x
bind C-x send-prefix

然后由于开始一直在改配置文件,想直接生效而不用一直退出重新打开,我们定义一个 prefix + r 来重加载配置文件:

# <prefix>-r ro reload config
bind r source-file ~/.tmux.conf \; display-message "Config reloaded"

然后按照网上的推荐,把切换 pane 的快捷键改成了 vim style:

# vim style switching panes
#up
bind-key k select-pane -U
#down
bind-key j select-pane -D
#left
bind-key h select-pane -L
#right
bind-key l select-pane -R
# 以及大写用来调节大小
bind L resize-pane -L 10    # 向左扩展
bind R resize-pane -R 10    # 向右扩展
bind K resize-pane -U 5     # 向上扩展
bind J resize-pane -D 5     # 向下扩展

但是我还嫌麻烦,最后 Google 了终于找到了鼠标选定 pane 和调节 pane 大小的选项了:

# Turn the mouse on, but without copy mode dragging
set -g mouse on
unbind -n MouseDrag1Pane
unbind -Tcopy-mode MouseDrag1Pane

这下就爽了,鼠标选 pane 和 window 以及调节 pane 大小,少记好几个快捷键。

我还把默认配置文件里一打开就建好 11 个 window 的那些行直接删掉了。所以现在我杀掉所有 session 再重新建一个进去的话,就只有默认 session,而且只有一个窗口。我们先 <prefix> $ 给 session 取个名字,然后 <prefix> , 给这个 window 取名字。

现在只有一个 window,想新开一个呢?新开了一个我想要切回去呢?一大波快捷键来了:

window

  • <prefix> c 就会创建一个新的 window,c 大概是 create 吧
  • <prefix> p:切换到上一个window,p 就是 previous
  • <prefix> n: next,下一个window
  • <prefix> 0: 切换到 0 号window,依次类推可切到任意窗口
  • <prefix> w :window,会列出当前 session 所有window,通过上、下键切换窗口
  • <prefix> &: 关闭当前 window,会有确认提示

pane

好了,终于到了怎么使用 pane 了:

  • <prefix> %:创建垂直切割的 pane(垂直的分割线形成左右 pane)
  • <prefix> ":创建水平分割的 pane(水平线形成上下 pane)
  • <prefix> o:在 pane 之间循环切换。当前活动 pane 四周切割线为绿色
  • <prefix> ArrowKey:方向键切 pane。当然上面也定义了 vim style 的切换键
  • <prefix> z:zoom,最大化当前 pane,再按一次 <prefix> z 恢复原样
  • <prefix> t:在 pane 里显示一个数字时钟,t 就是 time 咯
  • <prefix> q:会显示当前 window 所有 pane 的编号,在编号消失之前(要眼疾手快!)按数字就能切过去了。q 大概是 query吧
  • <prefix> x:关闭当前 pane,会有确认提示

4. 我的自定义配置

前面说了很多我改的选项,其实有一个最不爽的我没说。默认 <prefix> %/” 切分 pane 这个我是真的无力吐槽。感觉一点道理没有记不住不说,% 还得按 Shift,累死人。所以我按照形象记忆,把键盘的 |- 设置为切分。由于 | 需要 Shift + \,所以最终直接绑 \,反正键盘上能看到就 OK。

以及其他一些选项不细说,贴上我的配置文件吧:

#################################
##########    Options    ########
#################################

# Turn the mouse on, but without copy mode dragging
# this also enable mouse to choose or resize a pane,  as well as to choose window
set -g mouse on
unbind -n MouseDrag1Pane
unbind -Tcopy-mode MouseDrag1Pane
# tweak status line
set -g status-right "%H:%M"
set -g window-status-current-attr "underscore"
# 提示信息的持续时间;设置足够的时间以避免看不清提示,单位为毫秒
set-option -g display-time 5000
# 控制台激活后的持续时间;设置合适的时间以避免每次操作都要先激活控制台
set-option -g repeat-time 1000
set-window-option -g display-panes-time 1500
# enable utf-8
set -gq status-utf8 on
# use 256 colors
set-option -g default-terminal "screen-256color"
# Enable RGB colour if running in xterm(1)
set-option -sa terminal-overrides ",xterm*:Tc"
# Change the default $TERM to tmux-256color
set -g default-terminal "tmux-256color"
# scrollback buffer n lines
set-option -g history-limit 100000                 
# 窗口的初始序号默认为 0 开始,这里设置为1
set-option -g base-index 1
# pane 一样设置为 1 开始
set-window-option -g pane-base-index 1
# No bells at all
set -g bell-action none
# Keep windows around after exit?
set -g remain-on-exit off

# If running inside tmux ($TMUX is set), then change the status line to red
%if #{TMUX}
set -g status-bg red
%endif

##############################################
############ keyboard shortcuts ##############
##############################################

# Change the prefix key to C-x
set -g prefix C-x
unbind C-b
bind C-x send-prefix

# <prefix>-r ro reload config
bind r source-file ~/.tmux.conf \; display-message "Config reloaded"

# vim style switching panes
#up
bind-key k select-pane -U
#down
bind-key j select-pane -D
#left
bind-key h select-pane -L
#right
bind-key l select-pane -R
# 向左扩展
bind L resize-pane -L 10
# 向右扩展
bind R resize-pane -R 10
# 向上扩展
bind K resize-pane -U 5
 # 向下扩展
bind J resize-pane -D 5

# select last window with <prefix> + C-l
bind-key C-l select-window -l

# [prefix |] / [prefix -] to split panes
unbind '"'
unbind %
bind-key \ split-window -h
bind-key - split-window -v

# ESC to start vim style copy and paste
bind Escape copy-mode
bind-key -Tcopy-mode-vi 'v' send -X begin-selection
bind-key -Tcopy-mode-vi 'y' send -X copy-selection
unbind p
bind p pasteb
setw -g mode-keys vi      # Vi风格选择文本

好了,暂时这样用。用一阵子快捷键肯定会调整的。到时候就不再更新这里了,反正最后都传到我的 Linux-config-bak 这个 Github repo 备份了。

参考:

解决 Debian 中 RStudio 和 Mendeley 下 Fcitx 输入法不能使用的问题

2018-12-08 更新

最新的 RStudio 版本为 1.2.1114,Qt 版本为 Qt-5.11.1。RStudio 自带的 libQt5* 文件保存在 /usr/lib/rstudio/lib 下,按照之前的方法移除这些文件的办法又失效了。不得已只能又一次自己编译了。
简单记录如下:

  • 下载 qt-opensource-linux-x64-5.11.1.run,安装
  • Terminal 里临时 export 好 PATH 和 LD_LIBRARY_PATH
  • 编译 fcitx-qt5
  • 得到的 platforminputcontexts/libfcitxplatforminputcontextplugin.so 复制到 /usr/lib/rstudio/plugins/platforminputcontexts
export PATH=/opt/qt5/5.11.1/gcc_64/bin:$PATH
export LD_LIBRARY_PATH=/opt/qt5/5.11.1/gcc_64/lib:$LD_LIBRARY_PATH

# double check
echo $PATH
echo $LD_LIBRARY_PATH

cd /path/to/fcitx-qt5
cmake .
make -j 4

最新的 libfcitxplatforminputcontextplugin.so 已经同步更新到我的 repo,不会或者懒得编译的人自己去下载吧。


2018-08-16 重要更新

目前发现一种最最最最最最简单的办法。Debian 下进入 /usr/lib/rstudio/bin 目录,直接删掉所有 libQt5 开头的文件和 qt.conf 即可(测试时不要直接删掉,重命名备份就行了)。
我的做法是:

cd /usr/lib/rstudio/bin
sudo mkdir Qt
sudo mv libQt5* Qt
sudo mv qt.conf Qt

然后再打开 RStudio 测试 Fcitx 输入法是否可用。

这个方法的原理在于,Fcitx 在 RStudio 里不能用是因为 RStudio 使用的 Qt 库版本与系统版本不同,而我们系统的 Fcitx 在编译时是链接到系统的 Qt 库版本的。而我们把 RStudio 自带的 Qt 库删掉之后会迫使 RStudio 调用系统的 Qt 库,即迫使它调用了 Fcitx 一样的版本库,所以这时候 RStudio 和其他 Qt 程序一样就能直接调用 Fcitx 本来的插件了( libfcitxplatforminputcontextplugin.so 这个插件在 Debian 就是 fcitx-frontend-qt5 这个包)。
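
如果想验证这个说法,可以在移走自带 Qt 库前后分别用 ldd 看看 RStudio 实际解析到的 Qt 库路径(下面的路径按本文 Debian 下的安装位置写,其他发行版可能不同):

# 移走 libQt5* 之前,libQt5Core 等应该解析到 /usr/lib/rstudio/bin 下自带的版本
# 移走之后,应该转而解析到系统的 Qt 库(类似 /usr/lib/x86_64-linux-gnu 下的版本)
ldd /usr/lib/rstudio/bin/rstudio | grep -i qt5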

说明

这个方法并不是我想到的,来自统计之都论坛的一篇帖子: win10下Rstudio切换中文输入法问题 。感谢 @linjinzhen

以下为更新前原文。


2017-12-20

一直以来 QT-based App 下 Fcitx 无法输入中文的问题都让我很恼火,用的比较多的 RStudio 先后两次去 Support 发帖无果,在他们的 GitHub 也发了 Issue,他们标记了 bug 之后就啥也没干。Mendeley 也去发帖过一次,官方回复大概意思是“知道了,但是目前这个问题优先级很低”.....

网上看到过几次有人说自己编译 QT 和 fcitx-qt 得到 libfcitxplatforminputcontextplugin.so 放到指定位置就可以。我下载过别人编译好的,试了都没成功;自己不是很会编译这些,怕把系统组件搞乱,又懒得编译.....直到前天又看到 Mendeley Fcitx Problem 这个帖子,在下面留言,po 主表示多几个 QT 也不会搞坏系统,我才决定尝试下。

其实以前用的是 Zotero 用来管理文献,也写过另一篇博文 #5 ,后来某一次升级之后 Zotero 就打不开了.....终端打开没有任何提示信息。可惜我整理得好好的文献库也没了。然后我就转到 Mendeley 了。

好的,废话少说,Let's get started!

Qt 编译安装

发现 Mendeley Desktop 使用的是 Qt 5.5.1。

下载 qt-everywhere-opensource-src-5.5.1.tar.xz 并解压,进入目录。

# 准备安装 QT 的目录
sudo mkdir /opt/qt.5.5.1
./configure --prefix=/opt/qt.5.5.1 -no-openssl

碰到一大堆报错 XCB 啥的,查了下直接加 -qt-xcb 就行了,我也不知道 XCB 干嘛的,这不是重点。
(configure --help 可以获得编译 Qt 详尽的选项说明)

./configure --prefix=/opt/qt.5.5.1 -no-openssl -qt-xcb

顺利通过。然后三部曲的后两步:

make -j4
# 燃烧吧 CPU。
# 我的 Intel Core i5-6300HQ @ 4x 3.2GHz 大概编了 20~30min。
sudo make install

fcitx-qt5

接下来是 fcitx-qt5。在编译它之前要让刚刚编译好的 Qt 发挥作用,所以要改路径,我的做法也是临时export一下,只要这个终端不关都能起作用,但是要记得后面的过程都在这个终端完成。

export PATH="/opt/qt5.5.1/bin/:$PATH"

git clone https://github.com/fcitx/fcitx-qt5.git
cd fcitx-qt5
cmake .

是的,又有问题了。

CMake Error at CMakeLists.txt:8 (find_package):
  Could not find a package configuration file provided by "ECM" (requested
  version 1.4.0) with any of the following names:

    ECMConfig.cmake
    ecm-config.cmake

  Add the installation prefix of "ECM" to CMAKE_PREFIX_PATH or set "ECM_DIR"
  to a directory containing one of the above files.  If "ECM" provides a
  separate development package or SDK, be sure it has been installed.


-- Configuring incomplete, errors occurred!
See also "/path/to/fcitx-qt5/CMakeFiles/CMakeOutput.log".

Google 一下,哦,sudo apt install extra-cmake-modules 就行了。继续:

cmake .

# 错误又来了
-- Could NOT find XKBCommon_XKBCommon (missing: XKBCommon_XKBCommon_LIBRARY XKBCommon_XKBCommon_INCLUDE_DIR) 
CMake Error at /usr/share/cmake-3.9/Modules/FindPackageHandleStandardArgs.cmake:137 (message):
  Could NOT find XKBCommon (missing: XKBCommon_LIBRARIES XKBCommon) (Required
  is at least version "0.5.0")
Call Stack (most recent call first):
  /usr/share/cmake-3.9/Modules/FindPackageHandleStandardArgs.cmake:377 (_FPHSA_FAILURE_MESSAGE)
  cmake/FindXKBCommon.cmake:30 (find_package_handle_standard_args)
  CMakeLists.txt:33 (find_package)


-- Configuring incomplete, errors occurred!
See also "/path/to/fcitx-qt5/CMakeFiles/CMakeOutput.log".

WTF???....不要急不要急,Google 一下,哦,sudo apt install libxkbcommon-dev。继续:

cmake .

# 呵呵
-- Found XKBCommon_XKBCommon: /usr/lib/x86_64-linux-gnu/libxkbcommon.so (found version "0.7.1") 
-- Found XKBCommon: /usr/lib/x86_64-linux-gnu/libxkbcommon.so (found suitable version "0.7.1", minimum required is "0.5.0") found components:  XKBCommon 
CMake Error at CMakeLists.txt:36 (find_package):
  By not providing "FindFcitx.cmake" in CMAKE_MODULE_PATH this project has
  asked CMake to find a package configuration file provided by "Fcitx", but
  CMake did not find one.

  Could not find a package configuration file provided by "Fcitx" (requested
  version 4.2.8) with any of the following names:

    FcitxConfig.cmake
    fcitx-config.cmake

  Add the installation prefix of "Fcitx" to CMAKE_PREFIX_PATH or set
  "Fcitx_DIR" to a directory containing one of the above files.  If "Fcitx"
  provides a separate development package or SDK, be sure it has been
  installed.


-- Configuring incomplete, errors occurred!
See also "/path/to/fcitx-qt5/CMakeFiles/CMakeOutput.log".

哦,知道了,Google。哦,sudo apt install fcitx-libs-dev。好,继续:

cmake .
#  过了..............
make -j4

手别抖不要惯性 sudo make install,不需要。
现在platforminputcontext目录下应该已经有了新鲜出炉的libfcitxplatforminputcontextplugin.so了,然后就好了:

sudo cp platforminputcontext/libfcitxplatforminputcontextplugin.so /opt/mendeleydesktop/plugins/qt/plugins/platforminputcontexts

终端打开 Mendeley 试试 Fcitx 已经可以用了。不保险,直接鼠标点点点菜单找到 Mendeley 打开输入法还没挂,OK。

RStudio

接下来一样,在 RStudio 菜单的关于里看了下,基于 Qt-5.4.0,那就下载 qt-everywhere-opensource-src-5.4.0.tar.xz好了。
以为可以收工了?怎么可能,Naive。

./configure --prefix=/opt/qt.5.4.0 -no-openssl -qt-xcb 直接报错:

ln -s libQt5Widgets.so.5.4.0 libQt5Widgets.so
ln -s libQt5Widgets.so.5.4.0 libQt5Widgets.so.5
ln -s libQt5Widgets.so.5.4.0 libQt5Widgets.so.5.4
rm -f ../../lib/libQt5Widgets.so.5.4.0
mv -f libQt5Widgets.so.5.4.0  ../../lib/ 
rm -f ../../lib/libQt5Widgets.so
rm -f ../../lib/libQt5Widgets.so.5
rm -f ../../lib/libQt5Widgets.so.5.4
mv -f libQt5Widgets.so ../../lib/ 
mv -f libQt5Widgets.so.5 ../../lib/ 
mv -f libQt5Widgets.so.5.4 ../../lib/ 
make[3]: Leaving directory '/home/adam/Downloads/Persepolis/qt-everywhere-opensource-src-5.4.0/qtbase/src/widgets'
make[2]: Leaving directory '/home/adam/Downloads/Persepolis/qt-everywhere-opensource-src-5.4.0/qtbase/src'
Makefile:45: recipe for target 'sub-src-make_first' failed
make[1]: *** [sub-src-make_first] Error 2
make[1]: Leaving directory '/home/adam/Downloads/Persepolis/qt-everywhere-opensource-src-5.4.0/qtbase'
Makefile:70: recipe for target 'module-qtbase-make_first' failed
make: *** [module-qtbase-make_first] Error 2

一头雾水,连报错信息都基本没有。二话不说,Google,靠谱的办法试试看,比如这个帖子:Build Qt Static Make Error - [SOLVED], 官方论坛官方回答,看着靠谱。哦:

./configure --prefix=/opt/qt.5.4.0 -release -opensource -confirm-license -static -qt-xcb -no-openssl -no-glib -no-pulseaudio -no-alsa -opengl desktop -nomake examples -nomake tests

# 然后真的过了
make -j4
# 燃烧吧 CPU。Winter is Coming!!!!!!


rm -f ../../lib/libQt5Widgets.a
mv -f libQt5Widgets.a ../../lib/ 
make[3]: Leaving directory '/home/adam/Downloads/Persepolis/qt-everywhere-opensource-src-5.4.0/qtbase/src/widgets'
make[2]: Leaving directory '/home/adam/Downloads/Persepolis/qt-everywhere-opensource-src-5.4.0/qtbase/src'
Makefile:45: recipe for target 'sub-src-make_first' failed
make[1]: *** [sub-src-make_first] Error 2
make[1]: Leaving directory '/home/adam/Downloads/Persepolis/qt-everywhere-opensource-src-5.4.0/qtbase'
Makefile:70: recipe for target 'module-qtbase-make_first' failed
make: *** [module-qtbase-make_first] Error 2

还是上面那个报错....我也不知道为啥了,好吧老实点先把不知道的选项拿掉,本着对官方论坛官方回答的相信,那一堆复制过来的选项我都没看。重新来过:

./configure --prefix=/opt/qt.5.4.0 -release -opensource -confirm-license -no-openssl -qt-xcb -nomake examples -nomake tests

...........

Makefile:45: recipe for target 'sub-src-make_first' failed
make[1]: *** [sub-src-make_first] Error 2
make[1]: Leaving directory '/home/adam/Downloads/Persepolis/qt-everywhere-opensource-src-5.4.0/qtbase'
Makefile:70: recipe for target 'module-qtbase-make_first' failed
make: *** [module-qtbase-make_first] Error 2

报错依然,上网一顿查,Google 看了 N 多都是交叉编译的问题,感觉很奇怪而且错误和我不完全一样。百度,各种论坛都是提问题的没有回答的。

N 久无果,中间 2~3个小时过去了。

我开始思索是不是我哪里做法有问题。

这时我突然记起来之前尝试编译 RStudio 的时候,从 RStudio 的 GitHub 里安装依赖的脚本里看到过:编译 RStudio 时会依照里面的设置,从他们自己的 AWS 服务器上下载他们精(魔)简(改)过的 QT binary。这样一想,我直接去用他们的 QT 来编译岂不是更好。二话不说去 GitHub 看他们的 QT 放在哪儿。你看他们的 rstudio/dependencies/linux/install-qt-sdk 里写的:

# presume 5.4.0
QT_VERSION=5.4.0

# test for libgstreamer
which apt-cache > /dev/null
if [ $? == 0 ]; then
  # debian (currently no test for CentOS based systems)
  apt-cache show libgstreamer1.0 > /dev/null
  if [ $? == 0 ]; then
    QT_VERSION=5.4.2
  fi
fi

QT_SDK_BINARY=QtSDK-${QT_VERSION}-${QT_ARCH}.tar.gz
QT_SDK_URL=https://s3.amazonaws.com/rstudio-buildtools/$QT_SDK_BINARY

# set Qt SDK dir if not already defined
if [ -z "$QT_SDK_DIR" ]; then
  QT_SDK_DIR=~/Qt${QT_VERSION}
fi

if ! test -e $QT_SDK_DIR
then
   # download and install
   wget $QT_SDK_URL -O /tmp/$QT_SDK_BINARY
   cd `dirname $QT_SDK_DIR`
   tar xzf /tmp/$QT_SDK_BINARY
   rm /tmp/$QT_SDK_BINARY
else
   echo "Qt $QT_VERSION SDK already installed"
fi

暴力暴力,够社会。

直接自己拼接出 QtSDK-5.4.0 的地址下下来了。由于这个已经是 binary 了就不需要我再编译了,直接用就行。
然后就是跟前面差不多了,十分顺利,没出错。解压他们的 QT 放到 /opt/qt.5.4.0,然后重新编译 fcitx-qt5,得到 libfcitxplatforminputcontextplugin.so。

刚刚是 Mendeley 所以最后libfcitxplatforminputcontextplugin.so就拷贝到/opt/mendeleydesktop/plugins/qt/plugins/platforminputcontexts/,即谁要给谁。同理,RStudio 就应该拷贝到/usr/lib/rstudio/bin/plugins/platforminputcontexts/了。

然后试了下 RStudio 终于,Fcitx 起来了。

总结

看起来我这个基本上一个小时就能解决,事实上我从昨晚搞到现在:昨晚起码用了 3+ 小时,今天早上到下午又起码 4+ 小时。中间我为了记录过程开的一个记事本现在已经 600+ 行了.....庭有枇杷树,吾妻死之年手植也,今已亭亭如盖矣。
但是我仍然很开心,我觉得我知道了一些新东西,踩了一些新的坑,问题最后也解决了(大概下次谁更新了可能会再来一次,但下次应该就轻车熟路了)。

哦对了,我自己编译的 libfcitxplatforminputcontextplugin.so 我建了一个 repo,也许谁要用的话可以试一试,在知乎上碰到以为用 Ubuntu 16.04 的知友用了我编译的文件解决了 ta 的输入法问题,我表示很开心。

说了这么多,总结:

  1. Google;
  2. 耐心;
  3. 尝试。

放在最后不代表不重要

  1. 编译 Qt 的 configure
    在 /opt 建立相应文件夹后,再建立一个指向这个文件夹的软链接 qt5。这么做的理由在 BLFS 的 HandBook 中编译 Qt5 的部分有说明:Qt-5.4.2,深以为然。
../configure -v -prefix /opt/qt5 -shared -largefile -accessibility -no-qml-debug -force-pkg-config \
-release -opensource -confirm-license -optimized-qmake \
-system-zlib -no-mtdev -system-libpng -system-libjpeg -system-freetype -fontconfig -system-harfbuzz \
-no-compile-examples -icu -qt-xcb -qt-xkbcommon -xinput2 -glib \
-no-pulseaudio -no-alsa -gtkstyle -no-openssl \
-nomake examples -nomake tests -no-compile-examples -skip qtdoc

具体参数的含义还是去看 help 输出。

  2. 编译安装完 Qt 后,首先应该把 Qt 的 bin 目录加到 PATH 里,这里我的建议还是用 export 临时设置。
    比较重要的是 LD_LIBRARY_PATH 的问题。
    首先看看最终我们需要的libfcitxplatforminputcontextplugin.so到底需要些什么:
➜  ~ ldd /opt/mendeleydesktop/plugins/qt/plugins/platforminputcontexts/libfcitxplatforminputcontextplugin.so
	linux-vdso.so.1 (0x00007ffc89d4a000)
	libQt5Gui.so.5 => /opt/qt.5.5.1/lib/libQt5Gui.so.5 (0x00007faee03c0000)
	libQt5DBus.so.5 => /opt/qt.5.5.1/lib/libQt5DBus.so.5 (0x00007faee0d24000)
	libxkbcommon.so.0 => /usr/lib/x86_64-linux-gnu/libxkbcommon.so.0 (0x00007faee0180000)
	libQt5Core.so.5 => /opt/qt.5.5.1/lib/libQt5Core.so.5 (0x00007faedfcc6000)
	libstdc++.so.6 => /usr/lib/x86_64-linux-gnu/libstdc++.so.6 (0x00007faedf941000)
	libm.so.6 => /lib/x86_64-linux-gnu/libm.so.6 (0x00007faedf5ae000)
	libgcc_s.so.1 => /lib/x86_64-linux-gnu/libgcc_s.so.1 (0x00007faedf396000)
	libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x00007faedefdc000)
	libpthread.so.0 => /lib/x86_64-linux-gnu/libpthread.so.0 (0x00007faededbe000)
	libpng16.so.16 => /usr/lib/x86_64-linux-gnu/libpng16.so.16 (0x00007faedeb8b000)
	libz.so.1 => /lib/x86_64-linux-gnu/libz.so.1 (0x00007faede971000)
	libGL.so.1 => /usr/lib/x86_64-linux-gnu/libGL.so.1 (0x00007faede6e5000)
	libicui18n.so.57 => /usr/lib/x86_64-linux-gnu/libicui18n.so.57 (0x00007faede271000)
	libicuuc.so.57 => /usr/lib/x86_64-linux-gnu/libicuuc.so.57 (0x00007faeddecc000)
	libicudata.so.57 => /usr/lib/x86_64-linux-gnu/libicudata.so.57 (0x00007faedc44f000)
	libpcre16.so.3 => /usr/lib/x86_64-linux-gnu/libpcre16.so.3 (0x00007faedc1e8000)
	libdl.so.2 => /lib/x86_64-linux-gnu/libdl.so.2 (0x00007faedbfe4000)
	libgthread-2.0.so.0 => /usr/lib/x86_64-linux-gnu/libgthread-2.0.so.0 (0x00007faedbde2000)
	libglib-2.0.so.0 => /lib/x86_64-linux-gnu/libglib-2.0.so.0 (0x00007faedbace000)
	librt.so.1 => /lib/x86_64-linux-gnu/librt.so.1 (0x00007faedb8c6000)
	/lib64/ld-linux-x86-64.so.2 (0x00007faee0b79000)
	libGLX.so.0 => /usr/lib/x86_64-linux-gnu/libGLX.so.0 (0x00007faedb695000)
	libGLdispatch.so.0 => /usr/lib/x86_64-linux-gnu/libGLdispatch.so.0 (0x00007faedb3df000)
	libpcre.so.3 => /lib/x86_64-linux-gnu/libpcre.so.3 (0x00007faedb16d000)
	libX11.so.6 => /usr/lib/x86_64-linux-gnu/libX11.so.6 (0x00007faedae2d000)
	libXext.so.6 => /usr/lib/x86_64-linux-gnu/libXext.so.6 (0x00007faedac1b000)
	libxcb.so.1 => /usr/lib/x86_64-linux-gnu/libxcb.so.1 (0x00007faeda9f3000)
	libXau.so.6 => /usr/lib/x86_64-linux-gnu/libXau.so.6 (0x00007faeda7ef000)
	libXdmcp.so.6 => /usr/lib/x86_64-linux-gnu/libXdmcp.so.6 (0x00007faeda5e9000)
	libbsd.so.0 => /lib/x86_64-linux-gnu/libbsd.so.0 (0x00007faeda3d4000)

发现对应 Qt 的部分,需要 libQt5Gui.so.5、libQt5DBus.so.5 和 libQt5Core.so.5 这三个库。
看看系统到底有没有这 3 个库呢:

➜  ~ locate libQt5Core.so.5
/home/adam/.aspera/connect/lib/libQt5Core.so.5
/home/adam/Downloads/Persepolis/qt-everywhere-opensource-src-5.5.1/qtbase/lib/libQt5Core.so.5
/home/adam/Downloads/Persepolis/qt-everywhere-opensource-src-5.5.1/qtbase/lib/libQt5Core.so.5.5
/home/adam/Downloads/Persepolis/qt-everywhere-opensource-src-5.5.1/qtbase/lib/libQt5Core.so.5.5.1
/home/adam/Programs/foxitsoftware/lib/libQt5Core.so.5
/home/adam/Programs/foxitsoftware/lib/libQt5Core.so.5.3
/home/adam/Programs/foxitsoftware/lib/libQt5Core.so.5.3.2
/home/adam/miniconda3/envs/Python_27/lib/libQt5Core.so.5
/home/adam/miniconda3/envs/Python_27/lib/libQt5Core.so.5.6
/home/adam/miniconda3/envs/Python_27/lib/libQt5Core.so.5.6.2
/home/adam/miniconda3/lib/libQt5Core.so.5
/home/adam/miniconda3/lib/libQt5Core.so.5.6
/home/adam/miniconda3/lib/libQt5Core.so.5.6.2
/home/adam/miniconda3/pkgs/qt-5.6.2-2/lib/libQt5Core.so.5
/home/adam/miniconda3/pkgs/qt-5.6.2-2/lib/libQt5Core.so.5.6
/home/adam/miniconda3/pkgs/qt-5.6.2-2/lib/libQt5Core.so.5.6.2
/home/adam/miniconda3/pkgs/qt-5.6.2-3/lib/libQt5Core.so.5
/home/adam/miniconda3/pkgs/qt-5.6.2-3/lib/libQt5Core.so.5.6
/home/adam/miniconda3/pkgs/qt-5.6.2-3/lib/libQt5Core.so.5.6.2
/home/adam/miniconda3/pkgs/qt-5.6.2-4/lib/libQt5Core.so.5
/home/adam/miniconda3/pkgs/qt-5.6.2-4/lib/libQt5Core.so.5.6
/home/adam/miniconda3/pkgs/qt-5.6.2-4/lib/libQt5Core.so.5.6.2
/home/adam/miniconda3/pkgs/qt-5.6.2-5/lib/libQt5Core.so.5
/home/adam/miniconda3/pkgs/qt-5.6.2-5/lib/libQt5Core.so.5.6
/home/adam/miniconda3/pkgs/qt-5.6.2-5/lib/libQt5Core.so.5.6.2
/opt/mendeleydesktop/lib/qt/libQt5Core.so.5
/opt/qt.5.5.1/lib/libQt5Core.so.5
/opt/qt.5.5.1/lib/libQt5Core.so.5.5
/opt/qt.5.5.1/lib/libQt5Core.so.5.5.1
/usr/lib/rstudio/bin/libQt5Core.so.5
/usr/lib/rstudio/bin/libQt5Core.so.5.4.2
/usr/lib/x86_64-linux-gnu/libQt5Core.so.5
/usr/lib/x86_64-linux-gnu/libQt5Core.so.5.9
/usr/lib/x86_64-linux-gnu/libQt5Core.so.5.9.2
➜  ~ locate ibQt5DBus.so.5 
/home/adam/.aspera/connect/lib/libQt5DBus.so.5
/home/adam/Downloads/Persepolis/qt-everywhere-opensource-src-5.5.1/qtbase/lib/libQt5DBus.so.5
/home/adam/Downloads/Persepolis/qt-everywhere-opensource-src-5.5.1/qtbase/lib/libQt5DBus.so.5.5
/home/adam/Downloads/Persepolis/qt-everywhere-opensource-src-5.5.1/qtbase/lib/libQt5DBus.so.5.5.1
/home/adam/Programs/foxitsoftware/lib/libQt5DBus.so.5
/home/adam/Programs/foxitsoftware/lib/libQt5DBus.so.5.3
/home/adam/Programs/foxitsoftware/lib/libQt5DBus.so.5.3.2
/home/adam/miniconda3/envs/Python_27/lib/libQt5DBus.so.5
/home/adam/miniconda3/envs/Python_27/lib/libQt5DBus.so.5.6
/home/adam/miniconda3/envs/Python_27/lib/libQt5DBus.so.5.6.2
/home/adam/miniconda3/lib/libQt5DBus.so.5
/home/adam/miniconda3/lib/libQt5DBus.so.5.6
/home/adam/miniconda3/lib/libQt5DBus.so.5.6.2
/home/adam/miniconda3/pkgs/qt-5.6.2-2/lib/libQt5DBus.so.5
/home/adam/miniconda3/pkgs/qt-5.6.2-2/lib/libQt5DBus.so.5.6
/home/adam/miniconda3/pkgs/qt-5.6.2-2/lib/libQt5DBus.so.5.6.2
/home/adam/miniconda3/pkgs/qt-5.6.2-3/lib/libQt5DBus.so.5
/home/adam/miniconda3/pkgs/qt-5.6.2-3/lib/libQt5DBus.so.5.6
/home/adam/miniconda3/pkgs/qt-5.6.2-3/lib/libQt5DBus.so.5.6.2
/home/adam/miniconda3/pkgs/qt-5.6.2-4/lib/libQt5DBus.so.5
/home/adam/miniconda3/pkgs/qt-5.6.2-4/lib/libQt5DBus.so.5.6
/home/adam/miniconda3/pkgs/qt-5.6.2-4/lib/libQt5DBus.so.5.6.2
/home/adam/miniconda3/pkgs/qt-5.6.2-5/lib/libQt5DBus.so.5
/home/adam/miniconda3/pkgs/qt-5.6.2-5/lib/libQt5DBus.so.5.6
/home/adam/miniconda3/pkgs/qt-5.6.2-5/lib/libQt5DBus.so.5.6.2
/opt/mendeleydesktop/lib/qt/libQt5DBus.so.5
/opt/mendeleydesktop/lib/qt/libQt5DBus.so.5.5
/opt/mendeleydesktop/lib/qt/libQt5DBus.so.5.5.1
/opt/qt.5.5.1/lib/libQt5DBus.so.5
/opt/qt.5.5.1/lib/libQt5DBus.so.5.5
/opt/qt.5.5.1/lib/libQt5DBus.so.5.5.1
/usr/lib/rstudio/bin/libQt5DBus.so.5
/usr/lib/rstudio/bin/libQt5DBus.so.5.4.2
/usr/lib/x86_64-linux-gnu/libQt5DBus.so.5
/usr/lib/x86_64-linux-gnu/libQt5DBus.so.5.9
/usr/lib/x86_64-linux-gnu/libQt5DBus.so.5.9.2
➜  ~ locate libQt5Gui.so.5
/home/adam/.aspera/connect/lib/libQt5Gui.so.5
/home/adam/Downloads/Persepolis/qt-everywhere-opensource-src-5.5.1/qtbase/lib/libQt5Gui.so.5
/home/adam/Downloads/Persepolis/qt-everywhere-opensource-src-5.5.1/qtbase/lib/libQt5Gui.so.5.5
/home/adam/Downloads/Persepolis/qt-everywhere-opensource-src-5.5.1/qtbase/lib/libQt5Gui.so.5.5.1
/home/adam/Programs/foxitsoftware/lib/libQt5Gui.so.5
/home/adam/Programs/foxitsoftware/lib/libQt5Gui.so.5.3
/home/adam/Programs/foxitsoftware/lib/libQt5Gui.so.5.3.2
/home/adam/miniconda3/envs/Python_27/lib/libQt5Gui.so.5
/home/adam/miniconda3/envs/Python_27/lib/libQt5Gui.so.5.6
/home/adam/miniconda3/envs/Python_27/lib/libQt5Gui.so.5.6.2
/home/adam/miniconda3/lib/libQt5Gui.so.5
/home/adam/miniconda3/lib/libQt5Gui.so.5.6
/home/adam/miniconda3/lib/libQt5Gui.so.5.6.2
/home/adam/miniconda3/pkgs/qt-5.6.2-2/lib/libQt5Gui.so.5
/home/adam/miniconda3/pkgs/qt-5.6.2-2/lib/libQt5Gui.so.5.6
/home/adam/miniconda3/pkgs/qt-5.6.2-2/lib/libQt5Gui.so.5.6.2
/home/adam/miniconda3/pkgs/qt-5.6.2-3/lib/libQt5Gui.so.5
/home/adam/miniconda3/pkgs/qt-5.6.2-3/lib/libQt5Gui.so.5.6
/home/adam/miniconda3/pkgs/qt-5.6.2-3/lib/libQt5Gui.so.5.6.2
/home/adam/miniconda3/pkgs/qt-5.6.2-4/lib/libQt5Gui.so.5
/home/adam/miniconda3/pkgs/qt-5.6.2-4/lib/libQt5Gui.so.5.6
/home/adam/miniconda3/pkgs/qt-5.6.2-4/lib/libQt5Gui.so.5.6.2
/home/adam/miniconda3/pkgs/qt-5.6.2-5/lib/libQt5Gui.so.5
/home/adam/miniconda3/pkgs/qt-5.6.2-5/lib/libQt5Gui.so.5.6
/home/adam/miniconda3/pkgs/qt-5.6.2-5/lib/libQt5Gui.so.5.6.2
/opt/mendeleydesktop/lib/qt/libQt5Gui.so.5
/opt/qt.5.5.1/lib/libQt5Gui.so.5
/opt/qt.5.5.1/lib/libQt5Gui.so.5.5
/opt/qt.5.5.1/lib/libQt5Gui.so.5.5.1
/usr/lib/rstudio/bin/libQt5Gui.so.5
/usr/lib/rstudio/bin/libQt5Gui.so.5.4.2
/usr/lib/x86_64-linux-gnu/libQt5Gui.so.5
/usr/lib/x86_64-linux-gnu/libQt5Gui.so.5.9
/usr/lib/x86_64-linux-gnu/libQt5Gui.so.5.9.2

发现一个很有意思的事情:我们要的库文件在系统的 /usr/lib/x86_64-linux-gnu/ 下有一份,Miniconda 里有一份,我们自己编译出的 /opt/qt.5.5.1/lib/ 里有一份,还有哪里呢?/usr/lib/rstudio/bin/ 和 /opt/mendeleydesktop/lib/qt/。这就有意思了:同样的库其实有好多份。系统的和 Miniconda 的版本不对,先不说;除了我们自己编译的那份,软件自己竟然也各带了一份。那就有个便利了:理论上我们编译这些库完全是多此一举,因为完全可以直接链接到软件自带的库,这样库版本绝对不会有问题。

所以我们需要干嘛呢?export LD_LIBRARY_PATH,要么指向自己编译出来的 Qt 库,要么指向软件自带的那份。我试验了下,两种办法都可以。
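比如像下面这样(示意,二选一,路径按自己机器的实际情况改):

# 1) 用自己编译出来的 Qt 库
export LD_LIBRARY_PATH=/opt/qt.5.5.1/lib:$LD_LIBRARY_PATH
# 2) 或者直接用软件自带的那份,比如 Mendeley 的
# export LD_LIBRARY_PATH=/opt/mendeleydesktop/lib/qt:$LD_LIBRARY_PATH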

我仔细看了之前的博客,不知道为什么竟然没有提到LD_LIBRARY_PATH的事情,但是最后libfcitxplatforminputcontextplugin.so链接到了/opt下我自己编译的库文件,想来我中间可能export过但是自己忘了,现在才发现这个才是最重要的步骤啊。惭愧。

Linux 下的共享库

2018-02-28


通过上次那个 RStudio 输入法的事情 #12 ,我越来越觉得编译啊共享库啊什么的很有趣,同时也发现自己懂的太少。所以补课看了一些东西。这一篇很基础,也很有启发性,把这篇和之前的 #6 #7 #12 放在一起看、理解一下很重要。

下文原文来自博客园上的一篇博文 在 Linux 使用 GCC 编译C语言共享库,有删改。

这是一篇很基础的博文,通过一个小例子说明 Linux 下共享库的创建和使用。明白这些对于软件的编译会有很多帮助。

正式开始前,我们先看看源代码到运行程序之间发生了什么:

  1. 预处理:这个阶段处理所有预处理指令。基本上就是源代码中所有以 ‘#’ 开始的行,例如 #define 和 #include。
  2. 编译:一旦源文件预处理完毕,接下来就是编译。因为许多人提到编译时都是指整个程序构建过程,因此本步骤也称作“compilation proper”。本步骤将“.c”文件转换为“.o”文件。
  3. 连接:这一步将所有的对象文件和库文件串联起来使之成为最后的可运行程序。需要注意的是,静态库实际上已经植入到你的程序中,而共享库,只是在程序中包含了对它们的引用。现在你有了一个完整的程序,随时可以运行。当你从 shell 中启动它,它就被传递给了加载器。
  4. 加载:本步骤发生在程序启动时。首先程序需要被扫描以便引用共享库。程序中所有被发现的引用都立即生效,对应的库也被映射到程序。

第3步和第4步就是共享库的奥秘所在。

下面通过一个例子来说明这个过程。

首先我们在工作目录下有三个文件:foo.h、foo.c 和 main.c。
foo.h文件的内容为:

#ifndef foo_h__
#define foo_h__
 
extern void foo(void);
 
#endif  // foo_h__

foo.c文件的内容为:

#include <stdio.h>
 
void foo(void)
{
    puts("Hello, I'm a shared library");
}

main.c文件的内容为:

#include <stdio.h>
#include "foo.h"
 
int main(void)
{
    puts("This is a shared library test...");
    foo();
    return 0;
}

foo.h 定义了这个库的接口,这个库里只有一个简单的函数 foo()。foo.c 是这个函数的实现,main.c 则是一个用到我们这个库的驱动程序。
接下来我们看看怎么在编译过程中使用共享库生成最终的可执行程序。

Step 1: 编译位置无关代码(PIC)

我们需要把库的源文件编译成位置无关代码(position independent code,PIC)。这种机器码加载到内存后,执行时不依赖于任何绝对地址。

$ gcc -c -Wall -Werror -fpic foo.c

这一步会得到对象文件foo.o

Step 2: 从一个对象文件创建共享库

现在让我们将对象文件变成共享库。我们将其命名为libfoo.so

$ gcc -shared -o libfoo.so foo.o

现在就得到了libfoo.so文件了。

Step 3: 链接共享库

现在我们得到共享库了,下一步就是编译main.c并让它链接到我们创建的这个共享库上。我们将最终的运行程序命名为test
注意:-lfoo选项并不是搜寻foo.o,而是libfoo.so。GCC 编译器会假定所有的库都是以“lib”开头,以“.so”或“.a”结尾(“.so”是指 shared object 共享对象或者 shared libraries 共享库,“.a”是指 archive 档案,或者静态连接库)。

$ gcc -Wall -o test main.c -lfoo -lc

会出现报错:

/usr/bin/ld: cannot find -lfoo
collect2: ld returned 1 exit status

即编译器没有找到我们的共享库libfoo.so,链接器并不知道该去哪里找libfoo.so(事实上是不会去标准系统路径以外的地方去找共享库)。我们要指定 GCC 去哪找共享库。

GCC有一个默认的搜索列表,但我们的工作目录并不在那个列表中。我们需要告诉 GCC 去哪里找到libfoo.so。这就要用到-L选项。
在本例中,我们将使用当前目录.

$ gcc -Wall -o test main.c -L. -lfoo -lc

这样就能顺利编译出可执行文件test。我们执行看看:

$ ./test 
./test: error while loading shared libraries: libfoo.so: cannot open shared object file: No such file or directory

报错了,出错原因还是找不到libfoo.so文件。虽然链接的时候我们通过指定路径链接成功了,但是运行时libfoo.so一样找不到。
那要怎么指定呢?两个办法:

  • 把需要的库文件(本例中的libfoo.so)移动到系统标准路径去;
  • 通过LD_LIBRARY_PATH环境变量或者rpath选项临时启用非标准路径中的库文件。

重点看看第二个方法是怎么做的。

使用 LD_LIBRARY_PATH 环境变量

先看看目前的LD_LIBRARY_PATH是什么:

$ echo $LD_LIBRARY_PATH

这个环境变量内容目前为空,即没有存储任何路径。
现在把当前工作目录添加到LD_LIBRARY_PATH中:

$ LD_LIBRARY_PATH=.:$LD_LIBRARY_PATH
$ ./test
./test: error while loading shared libraries: libfoo.so: cannot open shared object file: No such file or directory

为什么还报错呢?
虽然我们的目录在LD_LIBRARY_PATH中,但是我们还没有导出它。在 Linux 中,如果你不将修改导出到一个环境变量,这些修改是不会被子进程继承的。加载器和我们的测试程序没有继承我们所做的修改。要修复这个问题很简单,export一下就行了:

$ export LD_LIBRARY_PATH=.:$LD_LIBRARY_PATH
$ ./test
This is a shared library test...
Hello, I'm a shared library
$ unset LD_LIBRARY_PATH

这下终于可以了。

LD_LIBRARY_PATH很适合做快速测试,尤其在没有权限将需要的库放到系统标准路径或者只是想临时做测试的情况下。
另一方面,导出LD_LIBRARY_PATH变量意味着可能会造成其他依赖LD_LIBRARY_PATH的程序出现问题,因此在做完测试后最好将LD_LIBRARY_PATH恢复成之前的样子。

使用 rpath 选项

再来看看 rpath 选项的用法:

# make sure LD_LIBRARY_PATH is set to default
$ unset LD_LIBRARY_PATH
$ gcc -Wall -o test main.c -L. -Wl,-rpath=. -lfoo -lc
$ ./test
This is a shared library test...
Hello, I'm a shared library

也没问题。

rpath方法有一个优点,对于每个程序编译时我们都可以通过这个选项单独罗列它自己的共享库位置,因此不同的程序可以在指定的路径去加载需要的库文件,而不需要一次次的去指定LD_LIBRARY_PATH环境变量。
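顺便,想确认 rpath 有没有真的写进可执行文件,可以用 readelf 看一下动态段(一个小示意):

# 查看 test 里记录的 RPATH / RUNPATH 条目
$ readelf -d ./test | grep -iE 'rpath|runpath'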

附:

  1. Shared Libraries(共享库) 和 Static Libraries(静态库)区别
  • 共享库是以“.so”(Windows 平台为“.dll”,Mac OS 平台为“.dylib”)作为后缀的文件。所有和库有关的代码都在这一个文件中,程序在运行时引用它。使用共享库的程序只会引用共享库中它要用到的那段代码。

    静态库是以“.a”(Windows平台为“.lib”)作为后缀的文件。所有和库有关的代码都在这一个文件中,静态库在编译时就被直接链接到了程序中。使用静态库的程序从静态库拷贝它要使用的代码到自身当中。

  • 两种库各有千秋。
    使用共享库可以减少程序中重复代码的数量,让程序体积更小,而且可以直接用一个功能相同的共享对象替换原来的那个,在提升性能的同时不用重新编译那些用到该库的程序。代价是共享库会略微增加函数调用的开销,也会增加运行时的加载开销,因为共享库中的符号需要在运行时解析、关联到使用它们的地方。共享库可以在运行时加载进程序,这是二进制插件系统最常见的一种实现机制。
    静态库总体上增加了程序体积,但它也意味着你无需随时随地都携带一份要用到的库的拷贝。因为代码在编译时就已经被关联在一起,因此在运行时没有额外的消耗。

  2. GCC 首先在 /usr/local/lib 搜索库文件,其次是 /usr/lib,然后按 -L 参数指定的路径搜索,顺序与 -L 参数给出路径的顺序一致。

  3. 默认的 GNU 加载器 ld.so,按以下顺序搜索库文件:

  • 首先搜索程序中DT_RPATH区域,除非还有DT_RUNPATH区域。
  • 其次搜索LD_LIBRARY_PATH。如果程序是setuid/setgid,出于安全考虑会跳过这步。
  • 搜索DT_RUNPATH区域,除非程序是setuid/setgid
  • 搜索缓存文件 /etc/ld.so.cache(停用该步可以使用 -z nodeflib 参数);
  • 搜索默认目录/lib,然后/usr/lib(停用该步请使用-z nodeflib参数)。
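想实际观察上面这个搜索过程,可以借助 glibc 提供的 LD_DEBUG 环境变量(示意,输出会比较多):

# 打印加载器查找各个共享库的过程
$ LD_DEBUG=libs ./test 2>&1 | head -n 20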

MIMIC III 数据 + PostgreSQL

2018-06-20

申请的 MIMIC III 数据库 今天终于通过了,下载下来发现是一堆 csv.gz,大小也很惊人。所以想自己一个一个用表格软件打开怕是不可能了,只能用他们推荐的数据库来管理了。

安装和配置

首先第一步就是用官方提供的 mimic-code 来构建数据库了。官方推荐 PostgreSQL,那就用这个好了。
Debian 的话安装 postgresql 倒是没什么,直接 sudo apt install postgresql 就搞定了。看了一下版本:

~ psql --version
psql (PostgreSQL) 10.4 (Debian 10.4-2)

还是挺新的。很多基础的东西文档都已经覆盖,但是人懒就是没看,碰到问题查,感觉浪费的时间也不少。

安装完之后默认会创建postgres这个用户,然后我就psql -U postgres 打算登录,结果报错:

psql: FATAL:  Peer authentication failed for user "postgres"

上网查了一下就解决了。解决办法是
编辑/etc/postgresql/10/main/pg_hba.conf文件,将

# Database administrative login by Unix domain socket
local   all             postgres                                peer

peer改成trust,然后 systemctl restart postgresql.service 重启下服务重载设置应该就能登录了。确保posgres用户可用之后就能直接用 mimic-code 提供的脚本构建数据库了,由于这个数据很大,构建数据库需要一会儿,可以先坐和放宽。

简单使用

现在只有 postgres 这个用户,而数据是我自己要用的,显然希望自己的用户也是 superuser。所以下面就是给我自己授权了。

首先 psql -U postgres 登录 postgres 用户,然后:

CREATE USER XXX;
ALTER USER XXX SUPERUSER CREATEDB;
\du

就行了。这样我自己也能管理数据库了。

要进入 mimic 数据库,直接psql mimic就行了。

值得注意的是,mimic-code 提供了很多concepts,就是已经定义好的一些疾病和数据提取方法。但是按照 README 里写的直接make concepts并不能直接生成这些数据,我自己就看了Makefile,发现根本就没写concepts的规则,也难怪直接 make 不行。所以需要把mimic-code/buildmimic/postgres/Makefile复制一份到mimic-code/concepts/,然后自己编辑加入concepts规则。具体做法倒很简单,直接把上面的那些规则复制一份然后改动具体调用的sql文件为当前目录下的make-concepts.sql就行了。
这个也不知道是 mimic-code 本来设置如此还是我搞错,但是反正黑猫白猫吧,那些物化视图我倒是都顺利生成了。
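我加的规则大致就是下面这个意思(只是示意,不是 mimic-code 官方写法,数据库名按自己构建时的来):

# 在 mimic-code/concepts/ 目录下,仿照 buildmimic 的 Makefile 加一条 concepts 规则,
# 核心其实就是用 psql 执行当前目录下的 make-concepts.sql:
cd mimic-code/concepts/
psql -d mimic -f make-concepts.sql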

最后来看看MIMIC III里的数据的样子:

(图:patients 表)

嗯,很好,数据库建立完毕,剩下的就是怎么用数据库和导入 R 分析的问题了。
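顺便留一个以后在 R 里连数据库的最小示意(假设用 DBI + RPostgres 这两个包,schema 名按构建时的设置,这里假设是 mimiciii):

library(DBI)
# 连接本机的 mimic 数据库(用户名、端口等按自己的配置补全)
con <- dbConnect(RPostgres::Postgres(), dbname = "mimic")
# 随便查一下 patients 表的行数,schema 名假设为 mimiciii,按实际情况改
dbGetQuery(con, "SELECT COUNT(*) FROM mimiciii.patients;")
dbDisconnect(con)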

简单的 Conda 入门

2017-05-10 

我选择了 Miniconda,因为不喜欢 Anaconda 那种巨无霸全家桶。

官方文档地址:Managing environments


基础

安装 Miniconda 时,默认自带一个名为root的环境,可以直接使用

source activate root

即可激活。在环境内执行 pip install foo 或 conda install foo,都会为当前的 root 环境装包。

添加 conda 的 TUNA 镜像

conda config --add channels https://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/free/

# 搜索时显示 channel 的 URL 地址
conda config --set show_channel_urls yes

该命令会生成~/.condarc文件,记录对 conda 的配置,直接手动创建、编辑该文件是相同的效果。

为某个环境装包:

conda install --name bunnies beautiful-soup

查看某个环境中已经安装的所有包:

conda list -n snowflakes

删除某个环境中的某个包:

conda remove --name bunnies iopro

更新已安装的包(conda 和 python 本身也可以这样更新):

conda update biopython
# update all pkgs:
conda update --all

conda 环境管理

新建一个环境,使用2.7版本的 Python 并且命名为Python_27:

# create an env
conda create --name Python_27 python=2.7

激活/关闭环境:

source activate snowflakes
source deactivate snowflakes

查看当前已经存在的所有环境:

# list all envs
conda info --envs
# or
conda env list

此时已经激活的环境前面带有 * 标识。

克隆一个已存在的环境 org_env 为 env_copy:

# clone an env
conda create --name env_copy --clone org_env

删除一个环境:

# remove an env
conda remove --name flowers --all

导出环境到文件:为了方便其他人可以获得与你完全相同的环境,可以导出环境到文件。

  1. 激活这一环境:
source activate env_name
  2. 导出环境到文件:
conda env export > environment.yml

导出的文件会包含 pip 和 conda 安装的包。

  3. 根据 environment.yml 新建环境:
conda env create -f environment.yml

Bioconda

1. Install conda

Bioconda requires the conda package manager to be installed. If you have an Anaconda Python installation, you already have it. Otherwise, the best way to install it is with the Miniconda package. The Python 3 version is recommended.

2. Set up channels

After installing conda you will need to add the bioconda channel as well as the other channels bioconda depends on. It is important to add them in this order so that the priority is set correctly (that is, bioconda is highest priority).

The conda-forge channel contains many general-purpose packages not already found in the defaults channel. The r channel contains common R packages used as dependencies for bioconda packages.

conda config --add channels conda-forge
conda config --add channels defaults
conda config --add channels r
conda config --add channels bioconda

R 启动时自动加载包的正确姿势

2017-12-21

今天看 Hadley Wickham 大大的《R for Data Science》的时候无意踩坑了,记录一下。

坑是在看到章节 4. Workflow: basics 的 4.4 节做练习的时候踩的。本来这一章十分简单,5 分钟就能看完,练习也简单,基本上就是找拼写错误啥的。然后第二题:

Tweak each of the following R commands so that they run correctly:

library(tidyverse)

ggplot(dota = mpg) + 
  geom_point(mapping = aes(x = displ, y = hwy))

fliter(mpg, cyl = 8)
filter(diamond, carat > 3

ggplot 里面 data 写成了 dota,嗯好的,原来 Hadley 大大也是宅男。

改过来之后 OK 了。

filter写成了fliter= 应该是 ==,口亨,so easy。

然后, 然后,还是报错:

R >>> filter(mpg, cyl == 8)
Error in stats::filter(mpg, cyl == 8) : object 'cyl' not found
In addition: Warning messages:
1: In data.matrix(data) : NAs introduced by coercion
2: In data.matrix(data) : NAs introduced by coercion
3: In data.matrix(data) : NAs introduced by coercion
4: In data.matrix(data) : NAs introduced by coercion
5: In data.matrix(data) : NAs introduced by coercion
6: In data.matrix(data) : NAs introduced by coercion

我瞬间炸了。

把眼睛凑近点看以为是不是哪个 l 其实是个 1 之类的,发现没问题啊。

实在不行,我觉得可能是代码是复制粘贴的,有看不见的字符之类的问题,决定手打一遍,然后, 然后,然后:

R >>> filter(mpg, cyl == 8)
Error in stats::filter(mpg, cyl == 8) : object 'cyl' not found
In addition: Warning messages:
1: In data.matrix(data) : NAs introduced by coercion
2: In data.matrix(data) : NAs introduced by coercion
3: In data.matrix(data) : NAs introduced by coercion
4: In data.matrix(data) : NAs introduced by coercion
5: In data.matrix(data) : NAs introduced by coercion
6: In data.matrix(data) : NAs introduced by coercion

我一度以为我眼睛瞎了。

算了,可能环境乱了,我重新开一个 R 试试,然后, 然后,然后, 然后:

R >>> filter(mpg, cyl == 8)
Error in stats::filter(mpg, cyl == 8) : object 'cyl' not found
In addition: Warning messages:
1: In data.matrix(data) : NAs introduced by coercion
2: In data.matrix(data) : NAs introduced by coercion
3: In data.matrix(data) : NAs introduced by coercion
4: In data.matrix(data) : NAs introduced by coercion
5: In data.matrix(data) : NAs introduced by coercion
6: In data.matrix(data) : NAs introduced by coercion

。。。卒 。。。

只能 Google 了,结果还真有悲摧的人碰到这个问题你别说,Unable to run examples ,直接在 Hadley 的 GitHub repo 里提问了。大大不愧是大大,一语道破真相:

Are you loading dplyr in your .Rprofile?

可不是嘛,我偷懒在 ~/.Rprofile 里加载了好几个常用的包。这个在我另一篇文里写了: R启动设置。 当时还只加载了 colorout 这个包。之后我加了几个,其中就包括 tidyverse,然后 dplyr 作为光荣的 tidyverse 全家桶的一员当然也就一起加载了。

Hadley 下面解释了原因,并且再下面还有人直接提出了解决方案:

That's a bad idea for exactly this reason. It gets loaded before stats, so stats::filter() overrides dplyr::filter()

A better way to handle this is to set the defaultPackages option, and ensure the packages are set in the order you wish to load them. E.g. in your .Rprofile you could have:

.First <- function() {
   autoloads <- c("dplyr", "ggplot2", "Hmisc")
   options(defaultPackages = c(getOption("defaultPackages"), autoloads))
}

就是说因为 dplyr 加载太早,早于 stats,所以最后 stats::filter 覆盖了 dplyr::filter。也就是说上面报错是 stats::filter 在报错(细心一点其实早就应该看到啊)。验证一下:

R >>> stats::filter(mpg, cyl == 8)
Error in stats::filter(mpg, cyl == 8) : object 'cyl' not found
In addition: Warning messages:
1: In data.matrix(data) : NAs introduced by coercion
2: In data.matrix(data) : NAs introduced by coercion
3: In data.matrix(data) : NAs introduced by coercion
4: In data.matrix(data) : NAs introduced by coercion
5: In data.matrix(data) : NAs introduced by coercion
6: In data.matrix(data) : NAs introduced by coercion

R >>> dplyr::filter(mpg, cyl == 8)
# A tibble: 70 x 11
   manufacturer              model displ  year   cyl      trans   drv   cty   hwy    fl   class
          <chr>              <chr> <dbl> <int> <int>      <chr> <chr> <int> <int> <chr>   <chr>
 1         audi         a6 quattro   4.2  2008     8   auto(s6)     4    16    23     p midsize
 2    chevrolet c1500 suburban 2wd   5.3  2008     8   auto(l4)     r    14    20     r     suv
 3    chevrolet c1500 suburban 2wd   5.3  2008     8   auto(l4)     r    11    15     e     suv
 4    chevrolet c1500 suburban 2wd   5.3  2008     8   auto(l4)     r    14    20     r     suv
 5    chevrolet c1500 suburban 2wd   5.7  1999     8   auto(l4)     r    13    17     r     suv
 6    chevrolet c1500 suburban 2wd   6.0  2008     8   auto(l4)     r    12    17     r     suv
 7    chevrolet           corvette   5.7  1999     8 manual(m6)     r    16    26     p 2seater
 8    chevrolet           corvette   5.7  1999     8   auto(l4)     r    15    23     p 2seater
 9    chevrolet           corvette   6.2  2008     8 manual(m6)     r    16    26     p 2seater
10    chevrolet           corvette   6.2  2008     8   auto(s6)     r    15    25     p 2seater
# ... with 60 more rows

小样儿,果不其然啊。

然后按这个方式改了加载设置之后,defaultPackages 本身就包含 stats,把 dplyr 追加在后面,就保证了 stats 先加载、dplyr 后加载,dplyr::filter 不会再被 stats::filter 覆盖。
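重启 R 之后可以顺手验证一下现在的 filter 到底解析到哪个包(示意):

# filter 应该来自 dplyr 的命名空间,而不是 stats
environmentName(environment(filter))
# 也可以看搜索路径,package:dplyr 应该排在 package:stats 前面
grep("package:(dplyr|stats)", search(), value = TRUE)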

至此问题圆满解决。

人生何处不踩坑。

顺便,现在我的.Rprofile 升级了:

# customized options
options(prompt="\033[0;36mR >>> \033[0m", continue="... ")
options(editor="vim", menu.graphics=FALSE)
options(stringsAsFactors = FALSE, show.signif.stars = TRUE, digits = 4)

# launch Bioconductor and set Bioconductor mirror at startup
#source("http://bioconductor.org/biocLite.R")
#options(BioC_mirror="http://mirrors.ustc.edu.cn/bioc/")   # mighty USTC
#options(BioC_mirror="https://mirrors.tuna.tsinghua.edu.cn/bioconductor")   # TUNA then
#source("https://bioc.ism.ac.jp/biocLite.R")   # an alternative Japanese mirror

# set CRAN mirror. Better add at least two in case that one of them stops working
options(repos=c("http://mirrors.ustc.edu.cn/CRAN/", "https://mirrors.tongji.edu.cn/CRAN/",
                "http://mirrors.tuna.tsinghua.edu.cn/CRAN/", "https://mirrors.aliyun.com/CRAN/"))

# launching Bioconductor at startup may take too long, so disable auto start and define a DIY func
# to start it when needed
source.bio <- function(){
	source("http://bioconductor.org/biocLite.R")
	options(BioC_mirror="http://mirrors.ustc.edu.cn/bioc/")
}


# useful little customized functions
cd <- setwd
pwd <- getwd
hh <- function(d) {
  row_num <- min(5,nrow(d))
  col_num <- min(5,ncol(d))
  return(d[1:row_num,1:col_num])
}

# load favorite packages automatically at startup
options(defaultPackages=c(getOption("defaultPackages"), 'beepr',
       "colorout"))

# display greeting message at startup
.First <- function(){
	message("Welcome back, ", Sys.getenv("USER"),"!\n","Current working directory: ", getwd(),
                "\nDate and time: ", format(Sys.time(), "%Y-%m-%d %H:%M"), "\r\n")
	# display a message when all above loaded successfully
	message("###### SUCCESSFULLY LOADED. LET'S DO THIS! ######")
}

# goodbye at closing
.Last <- function() {
	cat("\nGoodbye at ", date(), "\n")
}

决定把配置文件也存在 GitHub 备份了。

Microsoft R Open 的安装与配置

昨天偶然在网上看到关于不同版本 R 的速度对比的文章 R, R with Atlas, R with OpenBLAS and Revolution R Open: which is fastest?,被结果惊到了:最快的 Revolution R Open 碾压 Vanilla R,而且相比链接 OpenBLAS 和 ATLAS 的 R 都有优势,简直是独孤求败。然后我搜了一下,发现 Revolution R Open 已经变成 Microsoft R Open 了。虽然是开源,但对微软家的东西还是有点不太喜欢。看了一下,它还把和 Intel 搞的 MKL 直接打包一起下下来了,这简直就是搞黑科技垄断啊。

算了,吐槽到此为止,安装上看一下。

下载安装

首先我是 Debian sid,没什么好说的,直接用提供的 Ubuntu 版本就行了,2018-07-14 最新版本为 3.5.0

安装呢没啥好说的,文档 简单得很,解压,运行 shell 脚本就完了。

值得一提的是,微软始终还是那个微软,看到这个提示:

Important!
After installing, the default R path is updated to point to R installed with Microsoft R Open 3.5.0, which is under lib64/R/bin/R.
The CRAN repository points to a snapshot from Jan 01, 2018. This means that every user of Microsoft R Open has access to the same set of CRAN package versions. To get packages from another date, use the checkpoint package, installed with Microsoft R Open.

我就知道微软出品的本色,霸道。还记得重装系统时会被 Windows 覆盖掉的大名湖畔的 grub2 吗哈哈哈哈?

启动和配置

按照官方文档的说法,装完后 MRO 会自动设置为默认,所以 Terminal 直接 R 启动就好:

➜  ~ R

R version 3.5.0 (2018-04-23) -- "Joy in Playing"
Copyright (C) 2018 The R Foundation for Statistical Computing
Platform: x86_64-pc-linux-gnu (64-bit)

R is free software and comes with ABSOLUTELY NO WARRANTY.
You are welcome to redistribute it under certain conditions.
Type 'license()' or 'licence()' for distribution details.

  Natural language support but running in an English locale

R is a collaborative project with many contributors.
Type 'contributors()' for more information and
'citation()' on how to cite R or R packages in publications.

Type 'demo()' for some demos, 'help()' for on-line help, or
'help.start()' for an HTML browser interface to help.
Type 'q()' to quit R.


 *** caught segfault ***
address 0x50, cause 'memory not mapped'

Traceback:
 1: dyn.load(libPath)
 2: doTryCatch(return(expr), name, parentenv, handler)
 3: tryCatchOne(expr, names, parentenv, handlers[[1L]])
 4: tryCatchList(expr, classes, parentenv, handlers)
 5: tryCatch(expr, error = function(e) {    call <- conditionCall(e)    if (!is.null(call)) {        if (identical(call[[1L]], quote(doTryCatch)))             call <- sys.call(-4L)        dcall <- deparse(call)[1L]        prefix <- paste("Error in", dcall, ": ")        LONG <- 75L        sm <- strsplit(conditionMessage(e), "\n")[[1L]]        w <- 14L + nchar(dcall, type = "w") + nchar(sm[1L], type = "w")        if (is.na(w))             w <- 14L + nchar(dcall, type = "b") + nchar(sm[1L],                 type = "b")        if (w > LONG)             prefix <- paste0(prefix, "\n  ")    }    else prefix <- "Error : "    msg <- paste0(prefix, conditionMessage(e), "\n")    .Internal(seterrmessage(msg[1L]))    if (!silent && isTRUE(getOption("show.error.messages"))) {        cat(msg, file = outFile)        .Internal(printDeferredWarnings())    }    invisible(structure(msg, class = "try-error", condition = e))})

....


Possible actions:
1: abort (with core dump, if enabled)
2: normal R exit
3: exit R without saving workspace
4: exit R saving workspace
Selection: 

Great!:(

我不知道啥错误,反正看着挺严重。选 3 吧,退出不保存。然后就发现了一条算是比较熟悉的报错:

Warning message:
In doTryCatch(return(expr), name, parentenv, handler) :
  unable to load shared object '/opt/microsoft/ropen/3.5.0/lib64/R/modules//R_X11.so':
  libpng12.so.0: cannot open shared object file: No such file or directory

这个用 Linux 久了都知道,缺 libpng12.so.0 这个库文件嘛。第一反应是看看系统到底有没有这个呢?

➜  ~ locate libpng12.so.0
/home/adam/.aspera/connect/lib/libpng12.so.0
/opt/kingsoft/wps-office/office6/libpng12.so.0
/opt/kingsoft/wps-office/office6/libpng12.so.0.46.0

有点意思,WPS 带了一个,后续就简单了:

➜  ~ ll /opt/kingsoft/wps-office/office6/libpng12.so.0 
lrwxrwxrwx 1 root root 18 Jun  5 03:22 /opt/kingsoft/wps-office/office6/libpng12.so.0 -> libpng12.so.0.46.0
➜  ~ sudo ln -s /opt/kingsoft/wps-office/office6/libpng12.so.0.46.0 /opt/microsoft/ropen/3.5.0/lib64/R/lib/libpng12.so.0

然后再 R 启动看看发现没问题了。RStudio 打开看了一下,也是 MRO 了。library("limma")没问题。

当然,如果你的系统没有 libpng12.so.0,那也可以装:DebianCN 源里有 libpng12,直接 sudo apt install libpng12 就行了,测试了一下是可以的。其他发行版的话可能就得自己编译了。

嗯,这些基本没问题了

吗?

还没完

我为什么上面说 基本没问题了 呢?

因为 MRO 自动变成我的默认 R 了,这太不没问题了好吗!这是 Linux,充满自由,选择的 Linux 世界。凭什么装上就设置默认,我的选择呢?官方说法十分轻描淡写:

Tip: You can also manage multiple side-by-side installations of any application using the alternatives command (or update-alternatives on Ubuntu). This command allows you create and manage symbolic links to the different installations, and thus easily refer to the installation of your choice.

里面还假惺惺地给了 alternatives 命令的帮助页面链接而不是直接提供具体做法,可以这很微软。
正确的做法不应该是安装时候不设置默认,然后下面给出如果想设置默认要怎么办然后给 alternatives 帮助链接吗?

吐槽再次完毕,我们下面来自己掌控怎么设置到底谁才是系统默认的 R 版本。

  • 我之前装的是 R 3.5.1 (2018-07-02) -- "Feather Spray"R 可执行文件路径为 /usr/lib/R/bin/R
  • 而 MRO 刚刚看到了,装在 /opt/ 下,具体可执行文件路径为 /opt/microsoft/ropen/3.5.0/lib64/R/bin/R;
  • 我们在终端直接敲 R,执行的其实是 PATH 里能找到的 R 命令,而上述两个路径显然都不在 PATH 里;
  • whereis R 看一下,发现实际执行的是 /usr/bin/R 这个命令,而它本身是一个软链接:/usr/bin/R -> /opt/microsoft/ropen/3.5.0/lib64/R/bin/R

所以基本上真相大白了,系统默认用哪个 R 就是通过 /usr/bin/R这个软链接来控制的。那我们想要哪个默认直接改这个软链接的指向就行了。

这当然是最直观的办法,而 Debian 里呢,我们可以通过 update-alternatives来配置,参考博文 Alternative Versions of R 。我们要做的就是让 update-alternatives 知道我们这两个 R 都在哪里,然后用 update-alternatives --install <link> <name> <path> <priority> 设置它们各自的优先级就行了,priority 大的就是默认。

sudo rm /usr/bin/R
sudo update-alternatives --install /usr/bin/R R /usr/lib/R/bin/R 200
sudo update-alternatives --install /usr/bin/R R /opt/microsoft/ropen/3.5.0/lib64/R/bin/R 100

这样我们就重新把原来的 R 设置为默认了。终端打开或者 RStudio 都没问题。而且现在由系统 update-alternatives 接管了版本管理,以后我们要更改也十分简单:

➜  ~ update-alternatives --list R  
/opt/microsoft/ropen/3.5.0/lib64/R/bin/R
/usr/lib/R/bin/R
➜  ~ sudo update-alternatives --config R
There are 2 choices for the alternative R (providing /usr/bin/R).

  Selection    Path                                      Priority   Status
------------------------------------------------------------
* 0            /usr/lib/R/bin/R                           200       auto mode
  1            /opt/microsoft/ropen/3.5.0/lib64/R/bin/R   100       manual mode
  2            /usr/lib/R/bin/R                           200       manual mode

Press <enter> to keep the current choice[*], or type selection number: 

list 能看到可选的 R 版本,而 config 就能自己选择哪个作为默认了。

THE END.


2018-11-10 更新

今天发现 MRO-3.5.1 已经出来了,下下来解压直接安装会报错:

(Reading database ... 188397 files and directories currently installed.)
Preparing to unpack .../microsoft-r-open-mro-3.5.1.deb ...
dpkg-divert: error: 'diversion of /usr/bin/R to /usr/bin/R.distrib by microsoft-r-open-mro-3.5.1' clashes with 'diversion of /usr/bin/R to /usr/bin/R.distrib by microsoft-r-open-mro-3.5.0'
dpkg-divert: error: 'diversion of /usr/bin/Rscript to /usr/bin/Rscript.distrib by microsoft-r-open-mro-3.5.1' clashes with 'diversion of /usr/bin/Rscript to /usr/bin/Rscript.distrib by microsoft-r-open-mro-3.5.0'
dpkg: error processing archive /home/adam/Downloads/microsoft-r-open/deb/microsoft-r-open-mro-3.5.1.deb (--install):
 new microsoft-r-open-mro-3.5.1 package pre-installation script subprocess returned error exit status 2
Errors were encountered while processing:
 /home/adam/Downloads/microsoft-r-open/deb/microsoft-r-open-mro-3.5.1.deb

好像是和已经安装的 MRO-3.5.0 有冲突,所以就直接先 sudo apt purge microsoft-r-open-mro-3.5.0,然后再安装就 OK 了。之后启动 R 一样会报错 libpng12.so.0 not found,解决办法同前。然后 MRO 依然会自动成为系统默认,解决办法依然同前。

R 启动设置

2017-04-27

在 Linux 系统中,R 启动时默认加载 ~/.Rprofile 文件,这就为自定义多种 R 选项提供了方便。

我的~/.Rprofile文件内容:

# 设置启动时工作目录
setwd("/home/adam/Bioinformatics")

# 设置一些选项
options("pdfviewer"="evince")
options(prompt="R>", digits=4, show.signif.stars=TRUE)
options(menu.graphics=FALSE)
options(stringsAsFactors = FALSE)

# 设置默认镜像源
source("http://bioconductor.org/biocLite.R")
options(BioC_mirror="http://mirrors.ustc.edu.cn/bioc/")
options(repos=c("http://mirrors.ustc.edu.cn/CRAN/","https://mirrors.aliyun.com/CRAN/","http://mirrors.tuna.tsinghua.edu.cn/CRAN/"))

# 有用的小功能
cd <- setwd
pwd <- getwd
hh <- function(d) d[1:5,1:5]

# 经常需要的包
library("colorout")

# 加载完成后打印信息提示
message("###### SUCCESSFULLY LOADED. LET'S DO THIS! ######")

SQLBolt 课程学习笔记四(13-18 + X 课)

SQL Lesson 13: Inserting rows

第 13 课,添加行

We've spent quite a few lessons on how to query for data in a database, so it's time to start learning a bit about SQL schemas and how to add new data.
前面都是在学怎么查询数据库,现在该了解一下数据库 Schema(模式)和如何向数据中添加新的行(即观测)了。

What is a Schema?

We previously described a table in a database as a two-dimensional set of rows and columns, with the columns being the properties and the rows being instances of the entity in the table. In SQL, the database schema is what describes the structure of each table, and the datatypes that each column of the table can contain.
Schema 是什么呢?前面我们描述表格时,说它是由行和列组成的二维数据,行代表观测对象(实体的实例),列代表属性。在数据库中,Schema 用来描述每张表格的结构,以及表格中每列能存放的数据类型。

Example
For example, in our Movies table, the values in the Year column must be an Integer, and the values in the Title column must be a String.
比如在前面用到的 Movies 表格中,Year 这一列的值必须是 Integer 类型,Title 列的值必须是 String。

This fixed structure is what allows a database to be efficient, and consistent despite storing millions or even billions of rows.
这种固定的结构使得数据库十分高效,并且在存储百万甚至上亿行数据时仍然十分稳定。(又是一波吹啊)

Inserting new data

When inserting data into a database, we need to use an INSERT statement, which declares which table to write into, the columns of data that we are filling, and one or more rows of data to insert. In general, each row of data you insert should contain values for every corresponding column in the table. You can insert multiple rows at a time by just listing them sequentially.
向数据库中添加数据时需要使用 INSERT 语句声明所要添加数据的表格以及我们添加数据到哪些列,以及我们要添加的一行或者多行数据。一般的,我们每添加一行都应该包含每一列对应的值。连续写多行数据就可以一次性添加多行。语法:

-- Insert statement with values for all columns
INSERT INTO mytable
VALUES (value_or_expr, another_value_or_expr, …),
       (value_or_expr_2, another_value_or_expr_2, …),
       …;

In some cases, if you have incomplete data and the table contains columns that support default values, you can insert rows with only the columns of data you have by specifying them explicitly.
有时候我们手上只有不完整的数据,或者表格中有的列支持默认值,这时添加行可以显式指定列名,只填我们有数据的那些列。语法:

-- Insert statement with specific columns
INSERT INTO mytable
(column, another_column, …)
VALUES (value_or_expr, another_value_or_expr, …),
      (value_or_expr_2, another_value_or_expr_2, …),
      …;

In these cases, the number of values need to match the number of columns specified. Despite this being a more verbose statement to write, inserting values this way has the benefit of being forward compatible. For example, if you add a new column to the table with a default value, no hardcoded INSERT statements will have to change as a result to accommodate that change.
这种情况下,每行给出的值的数目必须和指定的列数相匹配。虽然这种写法更啰嗦一些,但好处是向前兼容。举个例子,如果以后给表格新增一个带默认值的列,这种写明了列名的 INSERT 语句不需要任何改动就能继续用。

In addition, you can use mathematical and string expressions with the values that you are inserting.
This can be useful to ensure that all data inserted is formatted a certain way.
另外,添加数据行时也可以使用数学和字符表达式。这可以用来确保添加的都是以某种方式格式化过的数据。比如:

-- Example Insert statement with expressions
INSERT INTO boxoffice
(movie_id, rating, sales_in_millions)
VALUES (1, 9.9, 283742034 / 1000000);

练习题

In this exercise, we are going to play studio executive and add a few movies to the Movies to our portfolio. In this table, the Id is an auto-incrementing integer, so you can try inserting a row with only the other columns defined.
这次练习我们将向前面用到的电影数据中添加更多电影。这个表格中 Id 列是一个自增整数,所以添加新行的时候可以只添加其他列的值就行了。数据:

(图:Movies 表)

  1. Add the studio's new production, Toy Story 4 to the list of movies (you can use any director)
    向表格中添加新电影《Toy Story 4》,导演是谁无所谓。

既然 Id 不用指定,那么添加的就是非完整数据,必须显式指定列名咯

INSERT INTO Movies (Title, Director, Year, Length_minutes)
    VALUES ("Toy Story 4", "John Lasseter", 2018, 100)
  2. Toy Story 4 has been released to critical acclaim! It had a rating of 8.7, and made 340 million domestically and 270 million internationally. Add the record to the BoxOffice table.
    《Toy Story 4》很火,评分和国内外票房分别是这么多这么多以及这么多,添加到 BoxOffice 中。

有一点点小陷阱,就是票房都是百万为单位,得乘以百万算回去然后添加:

INSERT INTO Boxoffice (Movie_id, Rating, Domestic_sales, International_sales)
    VALUES (15, 8.7, 340*1000000, 270*1000000)

收工。


SQL Lesson 14: Updating rows

第 14 课,更新行。其实就是改数据嘛。

In addition to adding new data, a common task is to update existing data, which can be done using an UPDATE statement. Similar to the INSERT statement, you have to specify exactly which table, columns, and rows to update. In addition, the data you are updating has to match the data type of the columns in the table schema.
需要更新数据的时候很常见,这时候就要用到 UPDATE 语句了。和 INSERT 语句类似,这时候需要指定哪个表、哪些列和哪些行。另外,改动也必须符合 Schema 中规定好的数据类型。语法:

-- Update statement with values
UPDATE mytable
SET column = value_or_expr, 
    other_column = another_value_or_expr, 
    …
WHERE condition;

The statement works by taking multiple column/value pairs, and applying those changes to each and every row that satisfies the constraint in the WHERE clause.
这个语句把多个列/值数据更改应用到满足 WHERE 语句的行中。

Taking care

Most people working with SQL will make mistakes updating data at one point or another. Whether it's updating the wrong set of rows in a production database, or accidentally leaving out the WHERE clause (which causes the update to apply to all rows), you need to be extra careful when constructing UPDATE statements.

One helpful tip is to always write the constraint first and test it in a SELECT query to make sure you are updating the right rows, and only then writing the column/value pairs to update.
注意:大多数人在用 SQL 改数据的时候都难免时不时出点错,比如在生产数据库里改错了一批行,或者不小心把 WHERE 从句给漏了(结果所有行都被改了),所以写 UPDATE 语句的时候要多留个心眼。一个有用的小技巧是:先把约束条件写出来,放进 SELECT 查询里测试一遍,确认选中的确实是要改的行,然后再写要更新的列/值。
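举个小例子(示意,借用后面练习里要改的那一行):

-- 先用同样的 WHERE 条件 SELECT 一遍,确认命中的确实是要改的行
SELECT * FROM Movies WHERE Title = "A Bug's Life";
-- 确认无误后再执行 UPDATE
UPDATE Movies SET Director = "John Lasseter" WHERE Title = "A Bug's Life";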

哈哈哈哈写掉 WHERE,想想都好激动呢



练习题

It looks like some of the information in our Movies database might be incorrect, so go ahead and fix them through the exercises below.
数据中有些错误,修改一下吧。

(图:Movies 表)

  1. The director for A Bug's Life is incorrect, it was actually directed by John Lasseter
    《程序猿的一生》的导演错了,应该是 John Lasseter
UPDATE Movies SET
	Director = "John Lasseter"
WHERE Title = "A Bug's Life";
  2. The year that Toy Story 2 was released is incorrect, it was actually released in 1999
    《Toy Story 2》年份错了,改成 1999:
UPDATE Movies SET
	Year = 1999
WHERE Title = "Toy Story 2";
  3. Both the title and director for Toy Story 8 is incorrect! The title should be "Toy Story 3" and it was directed by Lee Unkrich
    一个数据错了俩,故意的吧。
UPDATE Movies SET
    Title = "Toy Story 3",  -- 开始掉了这个逗号一直报错不知道错在哪儿,注意!
    Director = "Lee Unkrich"
WHERE Title = "Toy Story 8";

撒花。


SQL Lesson 15: Deleting rows

第 15 课,删除行。

When you need to delete data from a table in the database, you can use a DELETE statement, which describes the table to act on, and the rows of the table to delete through the WHERE clause.
想从表格里删除行的话需要用 DELETE 语句,语句里指定操作的表格并用 WHERE 来指定删除哪些行。语法:

-- Delete statement with condition
DELETE FROM mytable
WHERE condition;

If you decide to leave out the WHERE constraint, then all rows are removed, which is a quick and easy way to clear out a table completely (if intentional).
如果不写 WHERE 的话,所有行就都会被删掉,这是一种快速清空整张表数据的办法(前提是你是故意的)。
跑路跑路....

Taking extra care

Like the UPDATE statement from last lesson, it's recommended that you run the constraint in a SELECT query first to ensure that you are removing the right rows. Without a proper backup or test database, it is downright easy to irrevocably remove data, so always read your DELETE statements twice and execute once.
和上节课说的 UPDATE 一样,最好首先用 SELECT 看一下会被删除的行是否正确。在没有备份或者测试数据库的时候,非常容易一不小心把数据给搞没了,所以执行 DELETE 之前一定要仔细检查一下。


练习题

The database needs to be cleaned up a little bit, so try and delete a few rows in the tasks below.
下面的表格需要清理一下:

(图:Movies 表)

  1. This database is getting too big, lets remove all movies that were released before 2005.
    数据库太大了,删掉 2005 年之前的电影。
    好任性的理由....
DELETE FROM Movies WHERE Year < 2005;
  2. Andrew Stanton has also left the studio, so please remove all movies directed by him.
    Andrew Stanton 不在这儿干了,把所有他导演的电影删掉。
DELETE FROM Movies WHERE Director = "Andrew Stanton";

SQL Lesson 16: Creating tables

第 16 课,创建表格

When you have new entities and relationships to store in your database, you can create a new database table using the CREATE TABLE statement.
当有新数据要储存到数据库时就要使用 CREATE TABLE 来创建新的表格。语法:

-- Create table statement w/ optional table constraint and default value
CREATE TABLE IF NOT EXISTS mytable (
    column DataType TableConstraint DEFAULT default_value,
    another_column DataType TableConstraint DEFAULT default_value,
    …
);

The structure of the new table is defined by its table schema, which defines a series of columns. Each column has a name, the type of data allowed in that column, an optional table constraint on values being inserted, and an optional default value.
新表格的结构由 Schema 定义,它指定了一系列的列。每一列有列名,允许存储的数据类型,可选的对于插入数据的限制性条件,以及可选的默认值。

If there already exists a table with the same name, the SQL implmentation will usually throw an error, so to suppress the error and skip creating a table if one exists, you can use the IF NOT EXISTS clause.
如果数据库中已经存在相同的表格名,SQL 通常会报错,所以一般为了避免报错和表格已经存在情况下创建同名表格可以使用 IF NOT EXISTS 语句。

Table data types

表格的数据类型

Different databases support different data types, but the common types support numeric, string, and other miscellaneous things like dates, booleans, or even binary data. Here are some examples that you might use in real code.
不同的数据库支持不同的数据类型,但是常见的有数字,字符串和其他类型,比如日期,布尔值甚至是二进制数据。下面是一些可能会用到的常见的例子:

| Data type | Description |
| --- | --- |
| INTEGER, BOOLEAN | The integer datatypes can store whole integer values like the count of a number or an age. In some implementations, the boolean value is just represented as an integer value of just 0 or 1. |
| FLOAT, DOUBLE, REAL | The floating point datatypes can store more precise numerical data like measurements or fractional values. Different types can be used depending on the floating point precision required for that value. |
| CHARACTER(num_chars), VARCHAR(num_chars), TEXT | The text based datatypes can store strings and text in all sorts of locales. The distinction between the various types generally amounts to the underlying efficiency of the database when working with these columns. Both the CHARACTER and VARCHAR (variable character) types are specified with the max number of characters that they can store (longer values may be truncated), so they can be more efficient to store and query with big tables. |
| DATE, DATETIME | SQL can also store date and time stamps to keep track of time series and event data. They can be tricky to work with especially when manipulating data across timezones. |
| BLOB | Finally, SQL can store binary data in blobs right in the database. These values are often opaque to the database, so you usually have to store them with the right metadata to requery them. |

Table constraints

表格限制条件

We aren't going to dive too deep into table constraints in this lesson, but each column can have additional table constraints on it which limit what values can be inserted into that column. This is not a comprehensive list, but will show a few common constraints that you might find useful.
我们不打算在这节课深入讲这个,但是要知道,每一列都可以通过限制条件来限制哪些值可以填进这一列。下面列出的仅仅是部分很有用的:

| Constraint | Description |
| --- | --- |
| PRIMARY KEY | This means that the values in this column are unique, and each value can be used to identify a single row in this table. |
| AUTOINCREMENT | For integer values, this means that the value is automatically filled in and incremented with each row insertion. Not supported in all databases. |
| UNIQUE | This means that the values in this column have to be unique, so you can't insert another row with the same value in this column as another row in the table. Differs from the PRIMARY KEY in that it doesn't have to be a key for a row in the table. |
| NOT NULL | This means that the inserted value can not be NULL. |
| CHECK (expression) | This allows you to run a more complex expression to test whether the values inserted are valid. For example, you can check that values are positive, or greater than a specific size, or start with a certain prefix, etc. |
| FOREIGN KEY | This is a consistency check which ensures that each value in this column corresponds to another value in a column in another table. For example, if there are two tables, one listing all Employees by ID, and another listing their payroll information, the FOREIGN KEY can ensure that every row in the payroll table corresponds to a valid employee in the master Employee list. |

An example

Here's an example schema for the Movies table that we've been using in the lessons up to now.
下面是我们上课一直用的 Movies 这个表的 Schema:

-- Movies table schema
CREATE TABLE movies (
    id INTEGER PRIMARY KEY,
    title TEXT,
    director TEXT,
    year INTEGER, 
    length_minutes INTEGER
);

练习

In this exercise, you'll need to create a new table for us to insert some new rows into.
这次练习需要自己建表了:

  1. Create a new table named Database with the following columns:
  • Name A string (text) describing the name of the database
  • Version A number (floating point) of the latest version of this database
  • Download_count An integer count of the number of times this database was downloaded
    This table has no constraints.
CREATE TABLE Database (
	Name TEXT,
	Version FLOAT,
	Download_count INTEGER
);

睡个午觉,起来继续。zzzzzzzzz....


下一课,继续吧。

SQL Lesson 17: Altering tables

第 17 课,改表格。

As your data changes over time, SQL provides a way for you to update your corresponding tables and database schemas by using the ALTER TABLE statement to add, remove, or modify columns and table constraints.
随着时间我们的数据会变化,SQL 提供了 ALTER TABLE 语句用来通过增删和改动数据列或者表格属性来更新相应的表格和数据库 Schema。

Adding columns 添加列

The syntax for adding a new column is similar to the syntax when creating new rows in the CREATE TABLE statement. You need to specify the data type of the column along with any potential table constraints and default values to be applied to both existing and new rows. In some databases like MySQL, you can even specify where to insert the new column using the FIRST or AFTER clauses, though this is not a standard feature.
添加新的列的语法和 CREATE TABLE 类似,需要指定数据类型和可选的对于已有的和新的行的限制条件及默认值。有的数据库比如 MySQL,我们还能通过 FIRSTAFTER 指定新的列添加到哪里,当然这不是数据库的标准特性。语法:

-- Altering table to add new column(s)
ALTER TABLE mytable
ADD column DataType OptionalTableConstraint 
    DEFAULT default_value;

Removing columns 删除列

Dropping columns is as easy as specifying the column to drop, however, many databases (including Postgres, and SQLite) don't support this feature. Instead you may have to create a new table and migrate the data over.
删除列只需要简单地指定要删除的列就行了。不过原文说很多数据库(包括 Postgres 和 SQLite)不支持这一特性,只能新建表再迁移数据。(其实 Postgres 是支持 ALTER TABLE ... DROP COLUMN 的,倒是老版本的 SQLite 确实不支持,原文这里的说法有点过时。)语法:

-- Altering table to remove column(s)
ALTER TABLE mytable
DROP column_to_be_deleted;

Renaming the table 重命名表格

If you need to rename the table itself, you can also do that using the RENAME TO clause of the statement.
想重命名表格只需要 RENAME TO 就行了。语法:

-- Altering table name
ALTER TABLE mytable
RENAME TO new_table_name;

练习时间到:

Our exercises use an implementation that only support adding new columns, so give that a try below.
练习题只支持添加新的列,试试吧。

还是那个表格:

(图:Movies 表)

  1. Add a column named Aspect_ratio with a FLOAT data type to store the aspect-ratio each movie was released in.
    添加 FLOAT 类型的列 Aspect_ratio
ALTER TABLE Movies
ADD column Aspect_ratio FLOAT;
  2. Add another column named Language with a TEXT data type to store the language that the movie was released in. Ensure that the default for this language is English.
    添加新列 Language,类型为 TEXT,默认值为 English(注意默认值要加引号,当作字符串字面量):
ALTER TABLE Movies
ADD column Language TEXT
	DEFAULT "English";

SQL Lesson 18: Dropping tables

第 18 课,删除表格。

最后一节课了,从入门到删库,终于到了删库跑路了哈哈哈哈。

In some rare cases, you may want to remove an entire table including all of its data and metadata, and to do so, you can use the DROP TABLE statement, which differs from the DELETE statement in that it also removes the table schema from the database entirely.
有时候我们想要删除整个表格及其元数据(然后离职跑路?),这时候就要用 DROP TABLE 了。它和 DELETE 的区别在于表格 Schema 也会同时删掉。语法:

-- Drop table statement
DROP TABLE IF EXISTS mytable;

Like the CREATE TABLE statement, the database may throw an error if the specified table does not exist, and to suppress that error, you can use the IF EXISTS clause.
CREATE TABLE 类似,这时候如果表格不存在数据库会报错,解决的办法还是顺手加个 IF EXISTS

In addition, if you have another table that is dependent on columns in table you are removing (for example, with a FOREIGN KEY dependency) then you will have to either update all dependent tables first to remove the dependent rows or to remove those tables entirely.
另外,如果有其他表格依赖于你想删除的表格中的列(比如存在 FOREIGN KEY 依赖),那么你得先更新所有依赖表、把依赖的行去掉,要么就把那些表也一起删掉。


练习

We've reached the end of our exercises, so lets clean up by removing all the tables we've worked with.
练习题接近尾声了,直接把表全删了吧。
啊,这就结束了么,有点小伤感呢。

还是那两张表格:

(图:Movies 和 Boxoffice 两张表)

  1. We've sadly reached the end of our lessons, lets clean up by removing the Movies table
    好桑心,课程要结束了,把 Movies 删了吧
DROP TABLE IF EXISTS Movies;
  2. And drop the BoxOffice table as well
    Boxofice 也删掉吧
DROP TABLE IF EXISTS Boxoffice;

难忘,今宵,难忘今宵,无论....


等等,还没完——


SQL Lesson X: To infinity and beyond!

(图:SQLBolt 课程完成页面)

You've finished the tutorial!

We hope the lessons have given you a bit more experience with SQL and a bit more confidence to use SQL with your own data.

We've just brushed the surface of what SQL is capable of, so to get a better idea of how SQL can be used in the real world, we'll be adding more articles in the More Topics part of the site. If you have the time, we recommend that you continue to dive deeper into SQL!

If you need further details, it's also recommended that you read the documentation for the specific database that you are using, especially since each database has its own set of features and optimizations.

If you have any suggestions on how to make the site better, you can get in touch using one of the links in the footer below.

And if you found the lessons useful, please consider donating ($4) via Paypal to support our site. Your contribution will help keep the servers running and allow us to improve and add even more material in the future.

后面竟然还有番外 More Topics!

这篇先到这里吧。

SQLBolt 课程学习笔记二(6-8 课)


昨天看了大火的《工作细胞》,挺有趣的。血小板太可爱了!

然后睡觉前看了一下 Todo list:

(图:课程目录 / Todo list)

任重而道远啊,后面的课程还多着呢。继续继续——

SQL Lesson 6: Multi-table queries with JOINs

第六课,JOINs 多表格查询

Up to now, we've been working with a single table, but entity data in the real world is often broken down into pieces and stored across multiple orthogonal tables using a process known as normalization.

In order to answer questions about an entity that has data spanning multiple tables in a normalized database, we need to learn how to write a query that can combine all that data and pull out exactly the information we need.
前面我们都是在单个表格里操作,但真实世界中,实体的数据往往会经过所谓规范化(normalization)的过程,被拆分存储到多个相互独立的表格里。这时要回答关于某个实体的问题,就得学会写能把这些数据组合起来、只提取出我们需要的信息的查询。

Multi-table queries with JOINs

Tables that share information about a single entity need to have a primary key that identifies that entity uniquely across the database. One common primary key type is an auto-incrementing integer (because they are space efficient), but it can also be a string, hashed value, so long as it is unique.

Using the JOIN clause in a query, we can combine row data across two separate tables using this unique key. The first of the joins that we will introduce is the INNER JOIN.
不同表格含有关于同一观测对象的信息需要通过唯一的主键相关联。最常见的主键类型就是递增的整数,这个做法省空间。但是主键也有可能是字符串,哈希值,只要是个唯一性的东西就行。
JOIN 从句可以通过唯一的主键把不同的表格整合到一起。我们首先要学习的是 INNER JOIN。语法:

SELECT column, another_table_column, …
FROM mytable
INNER JOIN another_table 
    ON mytable.id = another_table.id
WHERE condition(s)
ORDER BY column, … ASC/DESC
LIMIT num_limit OFFSET num_offset;

The INNER JOIN is a process that matches rows from the first table and the second table which have the same key (as defined by the ON constraint) to create a result row with the combined columns from both tables. After the tables are joined, the other clauses we learned previously are then applied.
INNER JOIN 会按照 ON 约束指定的键,在第一张表和第二张表之间匹配键值相同的行,把两张表的列合并成结果行。我们之前学的那些从句都是在表格合并之后才执行的。

You might see queries where the INNER JOIN is written simply as a JOIN. These two are equivalent, but we will continue to refer to these joins as inner-joins because they make the query easier to read once you start using other types of joins, which will be introduced in the following lesson.
INNER JOIN 可以简写成 JOIN,但是为了代码的可读性,大家还是该怎样怎样吧,多打一个单词而已。

We've added a new table to the Pixar database so that you can try practicing some joins. The BoxOffice table stores information about the ratings and sales of each particular Pixar movie, and the Movie_id column in that table corresponds with the Id column in the Movies table 1-to-1. Try and solve the tasks below using the INNER JOIN introduced above.
这次练习题有两张表。BoxOffice 存储每部电影的评分和票房情况,通过 Movie_id 和另一张表格 Movies 里的 Id 一一对应。两张表格大概长这样:

(图:Movies 和 BoxOffice 两张表)


练习题:

  1. Find the domestic and international sales for each movie
    找出每部电影的国内外票房情况,
SELECT Title, Domestic_sales, International_sales FROM movies m INNER JOIN Boxoffice b 
	ON m.Id=b.Movie_id;

偷懒用了缩写 : )

  2. Show the sales numbers for each movie that did better internationally rather than domestically
    国际票房好过国内的,就是加一个限制条件,WHERE 一下
SELECT Title, Domestic_sales, International_sales FROM movies m INNER JOIN Boxoffice b 
	ON m.Id=b.Movie_id
    WHERE b.International_sales > b.Domestic_sales;
  3. List all the movies by their ratings in descending order
    所有电影的评分降序排列
SELECT Title, Rating FROM movies m INNER JOIN Boxoffice b 
	ON m.Id=b.Movie_id
    ORDER BY Rating DESC;

INNER JOIN 还算简单的吧,下一个

SQL Lesson 7: OUTER JOINs

Depending on how you want to analyze the data, the INNER JOIN we used last lesson might not be sufficient because the resulting table only contains data that belongs in both of the tables.
根据查询任务不同,我们会经常发现 INNER JOIN 不够用的情况,因为它只能取两个表之间共有的行。

If the two tables have asymmetric data, which can easily happen when data is entered in different stages, then we would have to use a LEFT JOIN, RIGHT JOIN or FULL JOIN instead to ensure that the data you need is not left out of the results.
现实中的数据往往是分阶段录入的,各表格之间的数据常常并不对称,这时候就得用 LEFT JOIN、RIGHT JOIN 或者 FULL JOIN,保证需要的数据不会被落在结果之外。语法和前面的很类似:

SELECT column, another_column, …
FROM mytable
INNER/LEFT/RIGHT/FULL JOIN another_table 
    ON mytable.id = another_table.matching_id
WHERE condition(s)
ORDER BY column, … ASC/DESC
LIMIT num_limit OFFSET num_offset;

Like the INNER JOIN these three new joins have to specify which column to join the data on.
When joining table A to table B, a LEFT JOIN simply includes rows from A regardless of whether a matching row is found in B. The RIGHT JOIN is the same, but reversed, keeping rows in B regardless of whether a match is found in A. Finally, a FULL JOIN simply means that rows from both tables are kept, regardless of whether a matching row exists in the other table.
INNER JOIN 一样,这些不同的 JOIN 都需要指定我们通过哪一列来组合不同的表格数据。
举个例子,我们想把表格 A、B 组合起来, LEFT JOIN 结果会包括所有 A 里有的行,不论 B 里是否存在。RIGHT JOIN 类似,是反过来的,保留 B 里所有的行,不管 A 里有没有。FULL JOIN 就是包括两个表里所有的行,不管每一行是否存在匹配行。

When using any of these new joins, you will likely have to write additional logic to deal with NULLs in the result and constraints (more on this in the next lesson).
用到这些 JOIN 的时候,我们很有可能需要增加逻辑判断来处理结果中的 NULLs 值或者限制条件。下一节课会讲到。

You might see queries written these joins written as LEFT OUTER JOIN, RIGHT OUTER JOIN, or FULL OUTER JOIN, but the OUTER keyword is really kept for SQL-92 compatibility and these queries are simply equivalent to LEFT JOIN, RIGHT JOIN, and FULL JOIN respectively.
LEFT OUTER JOIN、RIGHT OUTER JOIN、FULL OUTER JOIN 这些写法里的 OUTER 只是为了兼容 SQL-92 保留的,分别和 LEFT JOIN、RIGHT JOIN、FULL JOIN 等价。

In this exercise, you are going to be working with a new table which stores fictional data about Employees in the film studio and their assigned office Buildings. Some of the buildings are new, so they don't have any employees in them yet, but we need to find some information about them regardless.
Since our browser SQL database is somewhat limited, only the LEFT JOIN is supported in the exercise below.
这次的练习题是两张新的表格 Employees 和 Buildings:

(图:Employees 和 Buildings 两张表)

前者存储一个电影工作室的雇员信息,后者是雇员的工作地点信息。有的地点是新建的因此没有雇员入住。练习题只支持LEFT JOIN


练习题:

  1. Find the list of all buildings that have employees
    找到所有有雇员入住的楼栋,出现在 Employees 这个表格的楼栋那就肯定是入住的,所以 LEFT JOIN 前面是 Employees 就好了。然后用 DISTINCT 去重(想了半天,差点把这个忘了)。
SELECT DISTINCT Building_name FROM Employees e LEFT JOIN Buildings b ON
    e.Building=b.Building_name;
  2. Find the list of all buildings and their capacity
    列出所有的楼栋及其容量,这不就是 Buildings 这个表格么?认真的么?
    SELECT * FROM Buildings; 哈士奇狗头??

  3. List all buildings and the distinct employee roles in each building (including empty buildings)
    列出所有楼栋以及各楼栋里不重复的雇员身份,空的楼栋也要,那肯定 LEFT JOIN 前面是 Buildings 就好了。

SELECT DISTINCT Building_name, Role FROM Buildings b LEFT JOIN Employees e ON b.Building_name=e.Building;

SQL Lesson 8: A short note on NULLs

NULL 值简单介绍

As promised in the last lesson, we are going to quickly talk about NULL values in an SQL database. It's always good to reduce the possibility of NULL values in databases because they require special attention when constructing queries, constraints (certain functions behave differently with null values) and when processing the results.
按照上节课的预告,这里快速说一下 SQL 数据库里的 NULL 值。尽量减少数据库里出现 NULL 的可能总是好事,因为写查询、写约束(某些函数遇到 NULL 的行为会不一样)以及处理结果的时候,NULL 都需要特别对待。

An alternative to NULL values in your database is to have data-type appropriate default values, like 0 for numerical data, empty strings for text data, etc. But if your database needs to store incomplete data, then NULL values can be appropriate if the default values will skew later analysis (for example, when taking averages of numerical data).
一种替代 NULL 的办法是按数据类型设一个合适的默认值,比如数值列用 0,文本列用空字符串等。但如果本来就是要存储不完整的数据,而这些默认值又会歪曲后续分析(比如对数值列求平均),那存成 NULL 反而更合适。(原文这段有点绕,用下面的小例子体会一下就清楚了。)
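用求平均这个场景体会一下(示意,表名、列名都是随便举的):

-- 假设 Employees 表里有些员工的 Years_employed 没有数据:
-- 如果缺失值被填成默认值 0,AVG 会把这些 0 也算进去,平均值被拉低;
-- 如果缺失值存成 NULL,AVG 会自动跳过这些行,结果只基于真正有值的行。
SELECT AVG(Years_employed) FROM Employees;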

Sometimes, it's also not possible to avoid NULL values, as we saw in the last lesson when outer-joining two tables with asymmetric data. In these cases, you can test a column for NULL values in a WHERE clause by using either the IS NULL or IS NOT NULL constraint.
有时候,NULL 是无法避免的,比如我们上节课看到的把两个不对称的数据表外连接起来的时候。这时候,对于某一列数据中的 NULL 值我们可以用 WHERE 从句配合 IS NULLIS NOT NULL 来做判断。语法:

SELECT column, another_column, …
FROM mytable
WHERE column IS/IS NOT NULL
AND/OR another_condition
AND/OR …;

This exercise will be a sort of review of the last few lessons. We're using the same Employees and Buildings table from the last lesson, but we've hired a few more people, who haven't yet been assigned a building.
这次的练习题是对前面课程的复习。用到的数据还是 Employees 和 Buildings 这两张表:

(图:Employees 和 Buildings 两张表)

但是可以看到招聘的雇员多了一些,而且一些人还没有分配到入住楼栋。


练习题:

  1. Find the name and role of all employees who have not been assigned to a building

找出没有分配到住所的员工及其岗位信息(要分房了??),这个不是就用 Employees 一张表找到没有住房信息的人就行了么: SELECT Role, Name, Building FROM Employees WHERE Building IS NULL;

  2. Find the names of the buildings that hold no employees
    找到没有雇员入住的楼栋,这就新房啊,这个问题我想了一会儿,最后发现其实 LEFT JOIN 然后找没雇员的楼栋就行:
SELECT Building_name, Name FROM Buildings b LEFT JOIN Employees e ON
    b.Building_name = e.Building
    WHERE Name IS NULL;

发现第 8 课开始是新的内容,那到这里第二篇笔记结束吧。

sed 和 awk 学习

2017-05-20

sed 、awk 和 grep 并称 Linux 系统下文本处理三剑客,三者都是非交互式的文本编辑器。

sed 的基本处理单位为记录 (record) ,即文件的行;awk的基本处理单位为域 (field),即文件的逻辑列。

以下内容大部分整理自小明明s Github,有改动。

sed

语法1

格式: sed [options] {sed-commands} {input-file}

系统里/etc/passwd文件的内容:

root:x:0:0:root:/root:/bin/zsh
daemon:x:1:1:daemon:/usr/sbin:/usr/sbin/nologin
bin:x:2:2:bin:/bin:/usr/sbin/nologin
sys:x:3:3:sys:/dev:/usr/sbin/nologin
sync:x:4:65534:sync:/bin:/bin/sync
games:x:5:60:games:/usr/games:/usr/sbin/nologin
man:x:6:12:man:/var/cache/man:/usr/sbin/nologin
lp:x:7:7:lp:/var/spool/lpd:/usr/sbin/nologin
mail:x:8:8:mail:/var/mail:/usr/sbin/nologin
news:x:9:9:news:/var/spool/news:/usr/sbin/nologin
uucp:x:10:10:uucp:/var/spool/uucp:/usr/sbin/nologin
proxy:x:13:13:proxy:/bin:/usr/sbin/nologin
www-data:x:33:33:www-data:/var/www:/usr/sbin/nologin
backup:x:34:34:backup:/var/backups:/usr/sbin/nologin
list:x:38:38:Mailing List Manager:/var/list:/usr/sbin/nologin
irc:x:39:39:ircd:/var/run/ircd:/usr/sbin/nologin
gnats:x:41:41:Gnats Bug-Reporting System (admin):/var/lib/gnats:/usr/sbin/nologin
nobody:x:65534:65534:nobody:/nonexistent:/usr/sbin/nologin
systemd-timesync:x:100:102:systemd Time Synchronization,,,:/run/systemd:/bin/false
systemd-network:x:101:103:systemd Network Management,,,:/run/systemd/netif:/bin/false
systemd-resolve:x:102:104:systemd Resolver,,,:/run/systemd/resolve:/bin/false
systemd-bus-proxy:x:103:105:systemd Bus Proxy,,,:/run/systemd:/bin/false
_apt:x:104:65534::/nonexistent:/bin/false
rtkit:x:105:109:RealtimeKit,,,:/proc:/bin/false
dnsmasq:x:106:65534:dnsmasq,,,:/var/lib/misc:/bin/false
avahi-autoipd:x:107:110:Avahi autoip daemon,,,:/var/lib/avahi-autoipd:/bin/false
messagebus:x:108:111::/var/run/dbus:/bin/false
usbmux:x:109:46:usbmux daemon,,,:/var/lib/usbmux:/bin/false
lightdm:x:111:115:Light Display Manager:/var/lib/lightdm:/bin/false
......(省略)......

例子:

# -n表示取消默认输出(默认输出将会打印出整个文件),p表示打印行
➜ sed -n 'p' /etc/passwd
root:x:0:0:root:/root:/bin/zsh
daemon:x:1:1:daemon:/usr/sbin:/usr/sbin/nologin
bin:x:2:2:bin:/bin:/usr/sbin/nologin
sys:x:3:3:sys:/dev:/usr/sbin/nologin
sync:x:4:65534:sync:/bin:/bin/sync
games:x:5:60:games:/usr/games:/usr/sbin/nologin
man:x:6:12:man:/var/cache/man:/usr/sbin/nologin
lp:x:7:7:lp:/var/spool/lpd:/usr/sbin/nologin
mail:x:8:8:mail:/var/mail:/usr/sbin/nologin
news:x:9:9:news:/var/spool/news:/usr/sbin/nologin
......(省略)......
# 只打印第三行
➜ sed -n '3p' /etc/passwd
bin:x:2:2:bin:/bin:/usr/sbin/nologin
# 打印1,3行
➜ sed -n '1,3p' /etc/passwd
root:x:0:0:root:/root:/bin/zsh
daemon:x:1:1:daemon:/usr/sbin:/usr/sbin/nologin
bin:x:2:2:bin:/bin:/usr/sbin/nologin

语法2

格式:sed [options] -f {sed-commands-in-a-file} {input-file}

例子:

# 打印以root开头或者nobody开头的行
➜ cat sed_example_1.sed
/^root/ p
/^nobody/ p
➜ sed -n -f sed_example_1.sed /etc/passwd
root:x:0:0:root:/root:/bin/zsh
nobody:x:65534:65534:nobody:/nonexistent:/usr/sbin/nologin

语法3

格式: sed [options] -e {sed-command-1} -e {sed-command-2} {input-file}

例子:

# 打印以root开头或者nobody开头的行
➜ sed -n -e '/^root/ p' -e '/^nobody/ p' /etc/passwd
root:x:0:0:root:/root:/bin/zsh
nobody:x:65534:65534:nobody:/nonexistent:/usr/sbin/nologin
#或者
➜ sed -n \
-e '/^root/ p' \
-e '/^nobody/ p' \
/etc/passwd

语法4

格式:

sed [options] '{
sed-command-1
sed-command-2
}' input-file

例子:

# 打印以root开头或者sync结尾的行
sed -n '{
/^root/ p
/sync$/ p
}' /etc/passwd
root:x:0:0:root:/root:/bin/zsh
sync:x:4:65534:sync:/bin:/bin/sync

sed 流

利用 sed 流可以实现文件操作:

  1. 执行
  2. 打印
  3. 重复

源文件source.txt内容如下:

101,Ian Bicking,Mozilla
102,Hakim El Hattab,Whim
103,Paul Irish,Google
104,Addy Osmani,Google
105,Chris Wanstrath,Github
106,Mattt Thompson,Heroku
107,Ask Solem Hoel,VMware

范围

➜ sed -n '1~2 p' source.txt
# 从第1行开始,步长为2,即奇数行
101,Ian Bicking,Mozilla
103,Paul Irish,Google
105,Chris Wanstrath,Github
107,Ask Solem Hoel,VMware
➜ sed -n '2~3 p' source.txt
# 从第2行开始,步长为3
102,Hakim El Hattab,Whim
105,Chris Wanstrath,Github

模式匹配

# 寻找包含Paul的行
➜ sed -n '/Paul/ p' source.txt
103,Paul Irish,Google
# 在第一行开始到第五行中, 从找到Paul开始打印到第五行
➜ sed -n '/Paul/,5 p' source.txt
103,Paul Irish,Google
104,Addy Osmani,Google
105,Chris Wanstrath,Github

# 从匹配Paul行打印达匹配Addy的行
➜ sed -n '/Paul/,/Addy/ p' source.txt
103,Paul Irish,Google
104,Addy Osmani,Google
# 匹配Paul行再多输出2行
➜ sed -n '/Paul/,+2 p' source.txt
103,Paul Irish,Google
104,Addy Osmani,Google
105,Chris Wanstrath,Github

删除行

# 删除所有行
➜ sed 'd' source.txt
(无输出)
# 只删除第二行
➜ sed '2 d' source.txt 
...
# 删除第一到第四行
➜ sed '1,4 d' source.txt
105,Chris Wanstrath,Github
106,Mattt Thompson,Heroku
107,Ask Solem Hoel,VMware
# 删除奇数行
➜ sed '1~2 d' source.txt
102,Hakim El Hattab,Whim
104,Addy Osmani,Google
106,Mattt Thompson,Heroku
# 删除符合Paul到Addy的行
➜ sed '/Paul/,/Addy/d' source.txt
101,Ian Bicking,Mozilla
102,Hakim El Hattab,Whim
105,Chris Wanstrath,Github
106,Mattt Thompson,Heroku
107,Ask Solem Hoel,VMware
# 删除空行
➜ sed '/^$/ d' source.txt
# 删除用#注释的行
➜ sed '/^#/ d' source.txt

重定向

# 将source.txt内容重定向写到output.txt
➜ sed 'w output.txt' source.txt
# 和上面一样,但是没有在终端显示
➜ sed -n 'w output.txt' source.txt
# 只写第二行
➜ sed -n '2 w output.txt' source.txt
# 写一到四行到output.txt
➜ sed -n '1,4 w output.txt'
# 写匹配Ask的行到结尾行到output.txt
➜ sed -n '/Ask/,$ w output.txt'

替换

格式为:

sed '[address-range|pattern-range] s/original-string/replacement-string/[substitute-flags]' inputfile

例子:

# 替换Google为Github
➜ sed 's/Google/Github/' source.txt
101,Ian Bicking,Mozilla
102,Hakim El Hattab,Whim
103,Paul Irish,Github
104,Addy Osmani,Github
105,Chris Wanstrath,Github
106,Mattt Thompson,Heroku
107,Ask Solem Hoel,VMware
# 替换匹配Addy的行里面的Google为Github
➜ sed '/Addy/s/Google/Github/' source.txt
101,Ian Bicking,Mozilla
102,Hakim El Hattab,Whim
103,Paul Irish,Google
104,Addy Osmani,Github
105,Chris Wanstrath,Github
106,Mattt Thompson,Heroku
107,Ask Solem Hoel,VMware
# 默认s只会替换一行中的第一个匹配项
➜ sed '1s/a/A/'  source.txt|head -1
101,IAn Bicking,Mozilla
# g可以替换每行的全部符合
➜ sed '1s/a/A/g'  source.txt|head -1
101,IAn Bicking,MozillA
# 可以直接指定想要替换的第N个匹配项,这里是第二个
➜ sed '1s/a/A/2'  source.txt|head -1
101,Ian Bicking,MozillA

# 使用w将能够替换的行重定向写到output.txt
➜ sed -n 's/Mozilla/Github/w output.txt' source.txt
➜ cat output.txt 
101,Ian Bicking,Github

# 还可以使用i忽略匹配的大小写,看来freebsd的不能用
➜ sed '1s/iaN/IAN/i'  source.txt|head -1
101,IAN Bicking,Mozilla

➜ cat files.txt 
/etc/passwd
/etc/group
# 给每行前和后都添加点字符
➜ sed 's/\(.*\)/ls -l \1|head -1/' files.txt
ls -l /etc/passwd|head -1
ls -l /etc/group|head -1
# 用sed执行这个字符串命令
➜ sed 's/^/ls -l /e' files.txt
-rw-r--r-- 1 root root 1627 Oct 14 14:30 /etc/passwd
-rw-r--r-- 1 root root 807 Oct 14 14:30 /etc/group

# sed分隔符不只可以使用'/'
$sed 's|/usr/local/bin|/usr/bin|' path.txt
$sed 's^/usr/local/bin^/usr/bin^' path.txt
$sed 's@/usr/local/bin@/usr/bin@' path.txt
$sed 's!/usr/local/bin!/usr/bin!' path.txt

替换覆盖

➜ sed '{
s/Google/Github/
s/Git/git/ 
}' source.txt
101,Ian Bicking,Mozilla
102,Hakim El Hattab,Whim
103,Paul Irish,github
104,Addy Osmani,github
105,Chris Wanstrath,github
106,Mattt Thompson,Heroku
107,Ask Solem Hoel,VMware

& 代表匹配到的内容

➜ sed 's/^[0-9][0-9][0-9]/[&]/g' source.txt
[101],Ian Bicking,Mozilla
[102],Hakim El Hattab,Whim
[103],Paul Irish,Google
[104],Addy Osmani,Google
[105],Chris Wanstrath,Github
[106],Mattt Thompson,Heroku
[107],Ask Solem Hoel,VMware
➜ sed 's/^.*/[&]/' source.txt
[101,Ian Bicking,Mozilla]
[102,Hakim El Hattab,Whim]
[103,Paul Irish,Google]
[104,Addy Osmani,Google]
[105,Chris Wanstrath,Github]
[106,Mattt Thompson,Heroku]
[107,Ask Solem Hoel,VMware]
➜ sed 's/^.*/<<&>>/g' source.txt
<<101,Ian Bicking,Mozilla>>
<<102,Hakim El Hattab,Whim>>
<<103,Paul Irish,Google>>
<<104,Addy Osmani,Google>>
<<105,Chris Wanstrath,Github>>
<<106,Mattt Thompson,Heroku>>
<<107,Ask Solem Hoel,VMware>>

正则

# ^表示匹配以什么开头
➜ sed -n '/^101/ p' source.txt      
101,Ian Bicking,Mozilla
# $表示匹配以什么结尾
➜ sed -n '/Github$/ p' source.txt 
105,Chris Wanstrath,Github
# .表示单个字符,下面的匹配一个逗号然后I然后2个单字符
➜ sed -n '/,I../ p' source.txt
101,Ian Bicking,Mozilla
# *表示匹配0个或者多个, \+表示匹配一个或者多个, \?表示匹配0个或者1个
# [0-9]表示匹配数字,下面匹配包含3或者4的行
➜ sed -n '/[34]/ p ' source.txt      
103,Paul Irish,Google
104,Addy Osmani,Google
# -表示范围,这里匹配3,4,5
➜ sed -n '/[3-5]/ p ' source.txt
103,Paul Irish,Google
104,Addy Osmani,Google
105,Chris Wanstrath,Github
# |表示或者的关系
➜ sed -n '/102\|103/ p ' source.txt
102,Hakim El Hattab,Whim
103,Paul Irish,Google

➜ cat numbers.txt 
1
12
123
1234
12345
123456
# {m} 表示前面的匹配的重复次数
➜ sed -n '/^[0-9]\{5\}$/ p' numbers.txt
12345
# {m,n} 表示匹配m-n的次数都算
sed -n '/^[0-9]\{3,5\}$/ p' numbers.txt
123
1234
12345
# 删除所有注释行和空行
➜ sed -e 's/#.*//' -e '/^$/ d' /etc/profile							
# 转化windows文件到unix格式
➜ sed 's/.$//' filename								
							
#\1表示第一个正则匹配到的数据
➜ sed 's/\([^,]*\).*/\1/g' source.txt |head -1
101
# 给每个单词第一个字母加括号
➜ echo "Dong Wei Ming" | sed 's/\(\b[A-Z]\)/\(\1\)/g'
(D)ong (W)ei (M)ing
➜ sed 's/\(^\|[^0-9.]\)\([0-9]\+\)\([0-9]\{3\}\)/\1\2,\3/g' numbers.txt
1
12
123
1,234
12,345
123,456
# 只取第一和第三列,并且换了他们的位置
$sed 's/\([^,]*\),\([^,]*\),\([^,]*\).*/\3,\1/g' source.txt
Mozilla,101
Whim,102
Google,103
Google,104
Github,105
Heroku,106
VMware,107

其他

# \l能将后面的一个字符变成小写
➜ sed 's/Ian/IAN/' source.txt|head -1               
101,IAN Bicking,Mozilla
➜ sed 's/Ian/IA\lN/' source.txt|head -1 
101,IAn Bicking,Mozilla
# \L能将后面的字符都变成小写
➜ sed 's/Ian/I\LAN/' source.txt|head -1
101,Ian Bicking,Mozilla
# \u能将后面的一个字符变成大写
➜ sed 's/Ian/IA\un/' source.txt|head -1
101,IAN Bicking,Mozilla
# \U能将后面的字都变成大写
➜ sed 's/Ian/\Uian/' source.txt|head -1 
101,IAN Bicking,Mozilla
# \E能打断\L或者\U改变大小写
➜ sed 's/Ian/\Uia\En/' source.txt|head -1
101,IAn Bicking,Mozilla
# 使用以上功能:调换前2列,把名字列全部大写,公司列全部小写
➜ sed 's/\([^,]*\),\([^,]*\),\(.*\).*/\U\2\E,\1,\L\3/g' source.txt
IAN BICKING,101,mozilla
HAKIM EL HATTAB,102,whim
PAUL IRISH,103,google
ADDY OSMANI,104,google
CHRIS WANSTRATH,105,github
MATTT THOMPSON,106,heroku
ASK SOLEM HOEL,107,vmware

sed 可执行脚本

➜ cat testscript.sed
#!/bin/sed -nf
/root/ p
/nobody/ p
➜ chmod u+x testscript.sed
➜ ./testscript.sed /etc/passwd 
root:x:0:0:root:/root:/bin/zsh
nobody:x:65534:65534:nobody:/nonexistent:/usr/sbin/nologin

sed 修改源文件和备份

#-i会修改源文件,但是可以同时使用bak备份
➜ sed -i.bak 's/Ian/IAN/' source.txt 
# or
➜ sed --in-place=.bak 's/Ian/IAN/' source.txt 
# 这样备份一个修改前的文件为source.txt.bak

行后增加和行前插入

语法格式:

  • 行后增加:sed '[address] a the-line-to-append' input-file
  • 行前插入:sed '[address] i the-line-to-insert' input-file

例子:

➜ sed '2 a 108,Donald Stufft, Nebula' source.txt
101,IAN Bicking,Mozilla
102,Hakim El Hattab,Whim
108,Donald Stufft, Nebula
103,Paul Irish,Google
104,Addy Osmani,Google
105,Chris Wanstrath,Github
106,Mattt Thompson,Heroku
107,Ask Solem Hoel,VMware
➜ sed '2 i 108,Donald Stufft, Nebula' source.txt
101,IAN Bicking,Mozilla
108,Donald Stufft, Nebula
102,Hakim El Hattab,Whim
103,Paul Irish,Google
104,Addy Osmani,Google
105,Chris Wanstrath,Github
106,Mattt Thompson,Heroku
107,Ask Solem Hoel,VMware

修改行

格式:sed '[address] c the-line-to-insert' input-file

例子:

# 修改含有Paul的行
➜ sed '/Paul/ c 108,Donald Stufft, Nebula' source.txt
101,IAN Bicking,Mozilla
102,Hakim El Hattab,Whim
108,Donald Stufft, Nebula
104,Addy Osmani,Google
105,Chris Wanstrath,Github
106,Mattt Thompson,Heroku
107,Ask Solem Hoel,VMware

其他用法:

# = 可以显示行号
➜ sed = source.txt
1
101,Ian Bicking,Mozilla
2
102,Hakim El Hattab,Whim
3
103,Paul Irish,Google
4
104,Addy Osmani,Google
5
105,Chris Wanstrath,Github
6
106,Mattt Thompson,Heroku
7
107,Ask Solem Hoel,VMware
# y或翻译你要转换的字符,这里I会转化成i,B转换成b
➜ sed 'y/IB/ib/' source.txt |head -1
101,iAN bicking,Mozilla

awk

示例文件 items.txt,列分别是 id、描述、类别、价钱和库存:

101,HD Camcorder,Video,210,10
102,Refrigerator,Appliance,850,2
103,MP3 Player,Audio,270,15
104,Tennis Racket,Sports,190,20
105,Laser Printer,Office,475,5

示例文件items-sold.txt, 列分别是id和1-6月的销售情况

101 2 10 5 8 10 12
102 0 1 4 3 0 2
103 10 6 11 20 5 13
104 2 3 4 0 6 5
105 10 2 5 7 12 6

语法1

awk -Fs '/pattern/ {action}' input-file
#or
awk -Fs '{action}' intput-file
# -F表示设置分隔符,不指定时默认为空白字符;-Fs 里的 s 即 field separator

例子:

# 用:分割,查找匹配systemd的行并且打印冒号分割后的第一部分
➜ awk -F: '/systemd/ {print $1}' /etc/passwd
systemd-timesync
systemd-network
systemd-resolve
systemd-bus-proxy

awk数据结构

# 1 BEGIN { awk-commands } 在执行awk body之前执行这个awk-commands,而且只一次
# 2 /pattern/ {action} body部分,也就是awk要执行的主体,比如十行,那么这个主体就调用10次
# 3 END { awk-commands } 在执行完body之后执行,也是只一次
➜ awk -F: 'BEGIN {print "----header----"} /systemd/ {print $1} \
END {print "----footer----"}' /etc/passwd
----header----
systemd-timesync
systemd-network
systemd-resolve
systemd-bus-proxy
----footer----
# 当然可以只用其中一种或者几种结构
➜ awk -F: 'BEGIN {print "UID"} {print $3}' /etc/passwd | head -3
UID
0
1
➜ awk 'BEGIN {print "Hello World!"}'
Hello World!

print

# 默认print就是打印文件全文到终端
➜ awk '{print}' source.txt 
101,Ian Bicking,Mozilla
102,Hakim El Hattab,Whim
103,Paul Irish,Google
104,Addy Osmani,Google
105,Chris Wanstrath,Github
106,Mattt Thompson,Heroku
107,Ask Solem Hoel,VMware
# 下面是通过,分割, 输出第二段。$0表示全行,类似shell用法
$awk -F ',' '{print $2}' source.txt
Ian Bicking
Hakim El Hattab
Paul Irish
Addy Osmani
Chris Wanstrath
Mattt Thompson
Ask Solem Hoel
# or
$awk -F "," '{print $2}' source.txt
$awk -F, '{print $2}' source.txt
# 一个格式化更好看些的效果
➜ awk -F "," 'BEGIN {print "--------------\nName\tComp\n--------------"} \
{print $2,"\t",$3}\
END {print "--------------"}' source.txt
--------------
Name	Comp
--------------
Ian Bicking 	 Mozilla
Hakim El Hattab 	 Whim
Paul Irish 	 Google
Addy Osmani 	 Google
Chris Wanstrath 	 Github
Mattt Thompson 	 Heroku
Ask Solem Hoel 	 VMware
--------------

模式匹配

# 用逗号做分隔符, 打印第二和第三列
➜ awk -F ',' '/Whim/ {print $2, $3}' source.txt
Hakim El Hattab Whim
# 可以加点格式化语句
➜ awk -F, '/Whim/ {print "Whim\"s full name:",$2}' source.txt
Whim"s full name: Hakim El Hattab

awk内置变量

1. FS:输入字段分隔符

设置分隔符:输入字段分隔符FS

➜ awk -F, '{print $2,$3}' source.txt 
Ian Bicking Mozilla
Hakim El Hattab Whim
Paul Irish Google
Addy Osmani Google
Chris Wanstrath Github
Mattt Thompson Heroku
Ask Solem Hoel VMware
# 可以使用内置的FS - 输入字段分隔符 实现相同的功能
➜awk 'BEGIN {FS=","} {print $2,$3}' source.txt            
Ian Bicking Mozilla
Hakim El Hattab Whim
Paul Irish Google
Addy Osmani Google
Chris Wanstrath Github
Mattt Thompson Heroku
Ask Solem Hoel VMware
➜ cat source-multiple-fs.txt
101,Ian Bicking:Mozilla%
102,Hakim El Hattab:Whim%
103,Paul Irish:Google%
104,Addy Osmani:Google%
105,Chris Wanstrath:Github%
106,Mattt Thompson:Heroku%
107,Ask Solem Hoel:VMware%
# 发现上面的分隔符有三种:逗号分号和百分号,这样就可以这样使用:
➜ awk 'BEGIN {FS="[,:%]"} {print $2,$3}' source-multiple-fs.txt
Ian Bicking Mozilla
Hakim El Hattab Whim
Paul Irish Google
Addy Osmani Google
Chris Wanstrath Github
Mattt Thompson Heroku
Ask Solem Hoel VMware

2. OFS:输出字段分隔符

设置输出时的分隔符:输出字段分隔符OFS

➜ awk -F, '{print $2":"$3}' source.txt
Ian Bicking:Mozilla
Hakim El Hattab:Whim
Paul Irish:Google
Addy Osmani:Google
Chris Wanstrath:Github
Mattt Thompson:Heroku
Ask Solem Hoel:VMware
➜ awk -F, 'BEGIN {OFS=":"} {print $2":"$3}' source.txt
Ian Bicking:Mozilla
Hakim El Hattab:Whim
Paul Irish:Google
Addy Osmani:Google
Chris Wanstrath:Github
Mattt Thompson:Heroku
Ask Solem Hoel:VMware

3. RS:输入记录分隔符

现在有一个文件 source-one-line.txt 内容为:

1,one:2,two:3,three:4,four

想输出

one
two

这样的效果。

借用记录分隔符 RS,先把单行内容分割,然后再按 -F 分割输出:

awk -F, 'BEGIN {RS=":"} {print $2}' source-one-line.txt
one
two
three
four

4. ORS:输出记录分隔符

完成一个输出记录后由 ORS 进行分隔(一个输出记录record一般就是一行,在 awk 中打印输出时,{print A, B, C} 相当于 3 个field组成一个record输出)。

直接看例子吧:

➜ awk 'BEGIN {FS=","; OFS="\n"; ORS="\n------\n"} \
{print $1"\t"$2"\t"$3}' source.txt
101	Ian Bicking	Mozilla
------
102	Hakim El Hattab	Whim
------
103	Paul Irish	Google
------
104	Addy Osmani	Google
------
105	Chris Wanstrath	Github
------
106	Mattt Thompson	Heroku
------
107	Ask Solem Hoel	VMware

➜ awk 'BEGIN {FS=","; OFS="\n"; ORS="\n------\n"} \
{print $1,$2,$3}' source.txt | head -12
101
Ian Bicking
Mozilla
------
102
Hakim El Hattab
Whim
------
103
Paul Irish
Google
------

5. NR:记录的数目

➜ awk 'BEGIN {FS=","} {print "Id of record", NR, "is", $1}' source.txt
Id of record 1 is 101
Id of record 2 is 102
Id of record 3 is 103
Id of record 4 is 104
Id of record 5 is 105
Id of record 6 is 106
Id of record 7 is 107
➜ awk 'BEGIN {FS=","} {print "Id of record", NR, "is", $1} END {print "Total number of records is", NR}' source.txt
Id of record 1 is 101
Id of record 2 is 102
Id of record 3 is 103
Id of record 4 is 104
Id of record 5 is 105
Id of record 6 is 106
Id of record 7 is 107
Total number of records is 7

6. FILENAME 和 FNR

FILENAME显示了当前文件, FNR关联到当前文件的记录数。

➜ awk -F, '{print "In file", FILENAME, ": record number", FNR, "is", $1}  END {print "Toltal num of records is", NR}' source.txt source-multiple-fs.txt 
In file source.txt : record number 1 is 101
In file source.txt : record number 2 is 102
In file source.txt : record number 3 is 103
In file source.txt : record number 4 is 104
In file source.txt : record number 5 is 105
In file source.txt : record number 6 is 106
In file source.txt : record number 7 is 107
In file source-multiple-fs.txt : record number 1 is 101
In file source-multiple-fs.txt : record number 2 is 102
In file source-multiple-fs.txt : record number 3 is 103
In file source-multiple-fs.txt : record number 4 is 104
In file source-multiple-fs.txt : record number 5 is 105
In file source-multiple-fs.txt : record number 6 is 106
In file source-multiple-fs.txt : record number 7 is 107
Toltal num of records is 14

awk变量

变量支持数字,字符和下划线

一个文件source-star.txt内容为:

101,Ian Bicking,Mozilla,1204
102,Hakim El Hattab,Whim,4029
103,Paul Irish,Google,7200
104,Addy Osmani,Google,2201
105,Chris Wanstrath,Github,1002
106,Mattt Thompson,Heroku,890
107,Ask Solem Hoel,VMware,2109

这个文件多加了最后一列star数, 现在统计整个文件的star:

➜ awk -F, 'BEGIN {total=0} {print $2, "got",$4, "star"; total=total + $4} END {print "Total star is "total}'  source-star.txt
Ian Bicking got 1204 star
Hakim El Hattab got 4029 star
Paul Irish got 7200 star
Addy Osmani got 2201 star
Chris Wanstrath got 1002 star
Mattt Thompson got 890 star
Ask Solem Hoel got 2109 star
Total star is 18635

自增/减

使用++或者--,但是注意符号位置

➜ awk -F, '{print --$4}' source-star.txt 
1203
4028
7199
2200
1001
889
2108
➜ awk -F, '{print $4--}' source-star.txt
1204
4029
7200
2201
1002
890
2109
➜ awk -F, '{$4--; print $4}' source-star.txt
1203
4028
7199
2200
1001
889
2108

字符串操作

字符串直接 print 会连接起来, 字符串相加会自动转化成数字相加

➜ awk 'BEGIN {
	FS=",";
    OFS=",";
    string1="GO";    
    string2="OGLE";    
    numberstring="100";
    string3=string1 string2;
    print "Concatenate string is:" string3;
    numberstring=numberstring+1;
    print "String to number:" numberstring;
}'
Concatenate string is:GOOGLE
String to number:101

复合运算

加减乘除和余数除计算,

文件assignment.awk内容如下:

BEGIN {
FS=",";
OFS=",";
total1 = total2 = total3 = total4 = total5 = 10;
total1 += 5; print total1;
total2 -= 5; print total2;
total3 *= 5; print total3;
total4 /= 5; print total4;
total5 %= 5; print total5;

}

➜ awk -f assignment.awk 
15
5
50
2
0

比较操作

# 只会显示小于1500的行
➜ awk -F, '$4 < 1500' source-star.txt
101,Ian Bicking,Mozilla,1204
105,Chris Wanstrath,Github,1002
106,Mattt Thompson,Heroku,890
➜ awk -F, '$1 == 103 {print $2}' source-star.txt
Paul Irish
# ||表示或者  && 表示和
➜ awk -F, '$4 >= 1000 && $4 <= 2000' source-star.txt 
101,Ian Bicking,Mozilla,1204
105,Chris Wanstrath,Github,1002
➜ awk -F, '$4 >= 1000 && $4 <= 2000 {print $0}' source-star.txt
101,Ian Bicking,Mozilla,1204
105,Chris Wanstrath,Github,1002
# star 少于1000或多于5000的项目的作者和对应star数
➜ awk -F, '$4 >= 5000 || $4 <= 1000 {print $2":"$4}' source-star.txt
Paul Irish:7200
Mattt Thompson:890

正则

~ 表示匹配, !~ 表示不匹配

➜ awk -F, '$3 ~ "Github"' source.txt 
105,Chris Wanstrath,Github
➜ awk -F, '$3 !~ "Google"' source.txt
101,Ian Bicking,Mozilla
102,Hakim El Hattab,Whim
105,Chris Wanstrath,Github
106,Mattt Thompson,Heroku
107,Ask Solem Hoel,VMware

NF (number of fields) 是分割项数目, $NF表示最后一个分割项

# 计数/etc/passwd中以/bin/zsh作为最后一个字段的行
➜ awk -F: '$NF ~ /\/bin\/zsh/ {n++}; END {print n}' /etc/passwd
2

if 条件判断

if 条件判断语法为

if (conditional-expression)
    action
	
if (conditional-expression)
{
action1;
action2; }

例子:

# 和上面的一个例子一样
➜ awk -F "," '{ if ($3 ~ "Github") print $0}' source.txt
105,Chris Wanstrath,Github

if-else语法为:

if (conditional-expression)
    action1
else
    action2
# or
conditional-expression ? action1 : action2 ;

例子:

# 奇数行后面接逗号(和下一行连起来),偶数行后面换行,相当于每两行并成一行输出
➜ awk 'ORS = NR % 2?",":"\n"' source.txt
101,Ian Bicking,Mozilla,102,Hakim El Hattab,Whim
103,Paul Irish,Google,104,Addy Osmani,Google
105,Chris Wanstrath,Github,106,Mattt Thompson,Heroku
107,Ask Solem Hoel,VMware,

while

语法为:

while(condition)
    actions

例子,一个文件 while.awk 的内容如下:

{
    i=2; total=0;
    while (i <= NF) {
        total = total + $i;
        i++;
 }
    print "Item", $1, ":", total, "quantities sold";
}

用 -f 指定这个脚本来处理 items-sold.txt:

➜ awk -f while.awk items-sold.txt 
Item 101 : 47 quantities sold
Item 102 : 10 quantities sold
Item 103 : 65 quantities sold
Item 104 : 20 quantities sold
Item 105 : 42 quantities sold

do-while

do-while 至少会执行一次,语法格式为:

do
	action
while(condition)

用 do-while 实现上一次的例子看看会得到什么结果,文件dowhile.awk 内容为:

{
    i=2; total=0;
 do
 {
     total = total + $i;
     i++;
 } while (i <= NF)
 print "Item", $1, ":", total, "quantities sold";
}

同样用 -f 指定脚本来处理 items-sold.txt:

➜ awk -f dowhile.awk items-sold.txt 
Item 101 : 47 quantities sold
Item 102 : 10 quantities sold
Item 103 : 65 quantities sold
Item 104 : 20 quantities sold
Item 105 : 42 quantities sold

(好吧,一开始我也没看出这个例子的意义:do-while 和 while 的区别在于先执行一次循环体、再判断条件。这份数据每行都有 7 个字段,循环条件一开始就成立,所以两个版本的结果完全一样;差别只有在条件一开始就不成立时才看得出来,下面用 Python 粗略示意一下。)
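
Python 本身没有 do-while,这里用 while True + break 模拟一下(数字纯属编的),对比两种循环在"条件一开始就不成立"时的行为:

# 模拟 awk 脚本里 i 从 2 开始、条件为 i <= NF 的循环,统计循环体执行了几次
def while_loop(nf):
    """普通 while:先判断条件,再执行循环体"""
    count, i = 0, 2
    while i <= nf:
        count += 1
        i += 1
    return count

def do_while_loop(nf):
    """模拟 do-while:先执行一次循环体,再判断条件"""
    count, i = 0, 2
    while True:
        count += 1
        i += 1
        if not (i <= nf):
            break
    return count

print(while_loop(1), do_while_loop(1))  # 0 1:字段数不足时 do-while 仍会执行一次循环体
print(while_loop(7), do_while_loop(7))  # 6 6:对每行 7 个字段的示例文件,两者没有区别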

for

for 循环语法格式:

for(initialization;condition;increment/decrement)
	actions

例子:

echo '1,2,3,4' | awk -F, '{for(i = 1; i <= NF; i++) total = total + i} END {print total}'
10

break continue exit

直接看例子:

# 程序一直运行打印Iteration,并且累加x,直到x等于10停止程序-break
➜ awk 'BEGIN{
x=1;
while(1)
	{
	print "Iteration";
	if ( x==10 )
		break;
		x++;
	}
}'
Iteration
Iteration
Iteration
Iteration
Iteration
Iteration
Iteration
Iteration
Iteration
Iteration
# x从1到10, 如果x等于5 直接直接累加x而不打印
➜ awk 'BEGIN{
	x=1;
	while(x<=10){
		if(x==5){
			x++;
			continue;
		}
	print "Value of x",x;x++;
	}
}'
Value of x 1
Value of x 2
Value of x 3
Value of x 4
Value of x 6
Value of x 7
Value of x 8
Value of x 9
Value of x 10
# x从1到10,当x等于5的时候程序直接退出
➜ awk 'BEGIN{
	x=1;
	while(x<=10){
		if(x==5){
			exit;
		}
print "Value of x",x;x++;
	}
}'
Value of x 1
Value of x 2
Value of x 3
Value of x 4

关联数组

# awk 的关联数组中item[101]和item["101"]意义一样
➜ awk 'BEGIN { item[101]="Github"; print item["101"]}' 
Github
# 可以用in检验是否包含本项
➜ awk 'BEGIN { item[101]="a"; if ( 101 in item ) print "Has 101"}'
Has 101
# 还可以使用for循环读取列表
➜ awk 'BEGIN {
			item[101]="Github";
			item[21]="Google";
			for (x in item)
        		print item[x]}'
Google
Github

# 多维数组,delete 可以删除元素。PS:注意 item[2,1] 这种写法的下标
# 实际会被转换成用 SUBSEP(默认是不可见字符 "\034")连接的字符串 "2\0341",和字符串下标 "2,1" 不是同一个键;
# 想用逗号做连接符可以设置 SUBSEP=","
➜ awk 'BEGIN {item["1,1"]="Github"; item["1,2"]="Google"; \
		item["2,1"]="Whim"; delete item["2,1"];
		for (x in item)
			print "Index",x,"contains",item[x]}'
Index 1,1 contains Github
Index 1,2 contains Google

格式化打印

# \n是换行
➜ awk 'BEGIN {printf "Line 1\nLine 2\n"}'
Line 1
Line 2
# \t是tab
➜ awk 'BEGIN {printf "Field 1\t\tField 2\tField 3\tField 4\n" }' 
Field 1		Field 2	Field 3	Field 4
➜ awk 'BEGIN {printf "Field 1\t\tField 2\t\tField 3\tField 4\n" }'
Field 1		Field 2		Field 3	Field 4
# \v是垂直tab
➜ awk 'BEGIN {printf "Field 1\vField 2\vField 3\vField 4\n"}'  
Field 1
       Field 2
              Field 3
                     Field 4

# %s字符串; %c单个字符; %d数字; %f浮点数......
➜ cat printf-width.awk 
BEGIN {
	FS=","
	printf "%3s\t%10s\t%10s\t%5s\t%3s\n",
    "Num","Description","Type","Price","Qty"
	printf "-----------------------------------------------------\n"
}
{
    printf "%3d\t%10s\t%10s\t%g\t%d\n", $1,$2,$3,$4,$5
}
➜ awk -f printf-width.awk items.txt 
Num	Description	      Type	Price	Qty
-----------------------------------------------------
101	HD Camcorder	     Video	210	10
102	Refrigerator	 Appliance	850	2
103	MP3 Player	         Audio	270	15
104	Tennis Racket	    Sports	190	20
105	Laser Printer	    Office	475	5

内置函数

# int - 将数字转换成整形, 类似的函数还有sqrt, sin, cos...
➜ awk 'BEGIN {print int(4.1);print int(-6.22);print int(strings)}'
4
-6
0

# rand - 随机0-1的数字; srand -初始化随机数的初始值
➜ cat srand.awk 
BEGIN {
    srand(5);
    count=0;
    max=30;
    while (count < 5) {
        # 随机数范围为5-30
        rnd = int(rand() * max);
        print rnd;
        count++;
    }
}
# 使用osx的awk随机范围不对
➜ awk -f srand.awk
19
9
21
8
13

# index - 所查字符在字符串中的位置,没找到会返回0
➜ awk 'BEGIN{str="This is a test"; print index(str, "a"); print index(str, "y")}'
9
0
# length - 字符串的长度
➜ awk -F, '{print length($0)}' source.txt
23
24
21
22
26
25
25
# split - 分片 PS:使用awk分片的顺序有问题;
# split第一个参数是要分割的内容,第二个是分割后的结果保存的数组,第三个是使用的分隔符
➜ echo "101 arg1:arg2:arg3" | awk '{split($2,out,":"); for (x in out) print out[x]}'
arg1
arg2
arg3
# substr - 取字符串范围内容;
# 第一个参数是要取的内容, 第二个是开始位置(从1开始),第三个是要取的长度
➜ echo "This is test"|awk '{print substr($3,2,2);}'
es
# sub - 替换原来的字符串,但是只替换第一个符合项; gsub - 替换全部选择项
➜ awk 'BEGIN{str="ThIs is test"; sub("[Ii]s","e", str); print str;}' 
The is test
➜ awk 'BEGIN{str="ThIs is test"; gsub("[Ii]s","e", str); print str;}'
The e test
# match - 返回某子字符串是否匹配了某字符串;
# RSTART - awk 自带变量,返回匹配的开始位置
# RLENGTH - awk 自带变量,返回匹配串的长度
➜ awk 'BEGIN{str="This is test"; if (match(str, "test")) {print substr(str,RSTART,RLENGTH)}}'
test
# tolower/toupper - 把字符串都变成小写/大写
➜ awk 'BEGIN{str="This is test"; print tolower(str); print toupper(str);}'
this is test
THIS IS TEST

# ARGC - 参数的数量; ARGV参数的数组
➜ cat arguments.awk
BEGIN {
    print "ARGC=",ARGC
    for (i = 0; i < ARGC; i++)
  print ARGV[i]
}
➜ awk -f arguments.awk 
ARGC= 1
awk
➜ awk -f arguments.awk source.txt 
ARGC= 2
awk
source.txt
➜ awk -f arguments.awk source.txt source-star.txt 
ARGC= 3
awk
source.txt
source-star.txt

内置变量

# ENVIRON - 系统环境变量
➜ cat environ.awk
BEGIN {
 OFS="="
 for(x in ENVIRON)
     print x,ENVIRON[x];
}
➜ awk -f environ.awk 
SHLVL=1
DBUS_SESSION_BUS_ADDRESS=unix:path=/run/user/1000/bus
UPDATE_ZSH_DAYS=13
XDG_SESSION_PATH=/org/freedesktop/DisplayManager/Session0
PWD=/home/adam/Learning/sed&awk
GDMSESSION=lightdm-xsession
XDG_CONFIG_DIRS=/etc/xdg
XDG_CURRENT_DESKTOP=XFCE
JAVA_HOME=/usr/lib/jvm/oracle-java8-jdk-amd64/jre
XDG_GREETER_DATA_DIR=/var/lib/lightdm/data/adam
XDG_DATA_DIRS=/usr/share/xfce4:/usr/local/share/:/usr/share/:/usr/share
ZSH=/home/adam/.oh-my-zsh
SHELL=/bin/zsh
ALLOW_WGCNA_THREADS=4
QT_LINUX_ACCESSIBILITY_ALWAYS_ON=1
COLORTERM=truecolor
(....部分省略.....)

# IGNORECASE - 设置为1 忽略大小写
➜ awk 'BEGIN{IGNORECASE=1} /github/{print}' source.txt
105,Chris Wanstrath,Github

自定义函数

自定义一个函数写入文件function-debug.awk

function mydebug (message) {
    print ("Debug Time:" strftime("%a %b %d %H:%M:%S %Z %Y", systime()))
    print (message)
}
{
    mydebug($NF)
}
# 函数的位置不重要

然后调用这个函数:

➜ awk -f function-debug.awk source.txt
Debug Time:Sat May 20 20:56:40 HKT 2017
Bicking,Mozilla
Debug Time:Sat May 20 20:56:40 HKT 2017
Hattab,Whim
Debug Time:Sat May 20 20:56:40 HKT 2017
Irish,Google
Debug Time:Sat May 20 20:56:40 HKT 2017
Osmani,Google
Debug Time:Sat May 20 20:56:40 HKT 2017
Wanstrath,Github
Debug Time:Sat May 20 20:56:40 HKT 2017
Thompson,Heroku
Debug Time:Sat May 20 20:56:40 HKT 2017
Hoel,VMware

系统调用

使用 system 函数可以调用 shell 命令:

➜ awk 'BEGIN {system("date")}' 
Sat May 20 20:58:54 HKT 2017
# systime 和 strftime上面见过了.处理时间和格式化时间
➜ awk 'BEGIN {print strftime("%c",systime())}' 
Sat 20 May 2017 09:04:12 PM HKT

awk 高级话题

getline

直接看例子:

# awk首先读入一行,接着处理 getline 函数再获得一行....所以最后print得到的就是所有奇数行
➜ awk -F, '{print $0;getline;}' source.txt  
101,Ian Bicking,Mozilla
103,Paul Irish,Google
105,Chris Wanstrath,Github
107,Ask Solem Hoel,VMware
# 我们使用getline 并把这行变量赋值给tmp,这个例子清晰的显示出了上一例的处理过程
➜ awk -F, '{getline tmp; print "$0->", $0; print "tmp->", tmp}' source.txt 
$0-> 101,Ian Bicking,Mozilla
tmp-> 102,Hakim El Hattab,Whim
$0-> 103,Paul Irish,Google
tmp-> 104,Addy Osmani,Google
$0-> 105,Chris Wanstrath,Github
tmp-> 106,Mattt Thompson,Heroku
$0-> 107,Ask Solem Hoel,VMware
tmp-> 106,Mattt Thompson,Heroku
# 执行外部程序;close 用来关闭管道,它的参数必须和 |getline 前面的命令字符串完全一样
➜ awk 'BEGIN{"date"| getline;close("date");print "Timestamp:" $0}'
Timestamp:Sat May 20 21:12:30 HKT 2017
# or
➜ awk 'BEGIN{"date"| getline timestamp;close("date");print "Timestamp:" timestamp}'
Timestamp:Sat May 20 21:13:41 HKT 2017
# 理解:字符串 "date" 被当成 shell 命令执行,它的输出通过管道交给 getline,存进 $0(或指定的变量);
# close("date") 关掉这个管道,之后再对同一个命令用 getline 时才会重新执行它
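
用 Python 的 subprocess 粗略类比一下这个模式可能更好理解(只是类比示意,不是 awk 的实现):"date" | getline timestamp 做的事大致相当于"运行一个外部命令,把它输出的第一行读进一个变量"。

import subprocess

# 大致相当于 awk 里的:"date" | getline timestamp; close("date")
timestamp = subprocess.run(["date"], capture_output=True, text=True).stdout.strip()
print("Timestamp:" + timestamp)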

跟着 mimic-code 探索 MIMIC 数据之 notebooks CRRT (三)

感觉必应随便搞个图下来当封面不错的。再来一次

0.BingWallpaper-2018-08-19.jpg

这一篇为什么隔了好几天才出来呢,因为代码的理解难度突然、陡然、猝不及防的上了个 90 度的坡。我看了好几天没看懂。在 RStudio 里光是调代码缩进方便看代码眼睛都要瞎了。结果我的 1080p 屏幕还是无法很好的显示代码,因为一段代码太长了。最后实在没办法还是用 vim 调,顺便学了下 vim 里代码折叠,然后就可以愉快的把那种括号内的东西折叠起来,然后再调代码缩进方便很多,然后代码格式调好了,但是我不是很懂......还得看。

通过前面的两篇,我们用尽心思,千辛万苦,翻雪山过草地,四渡赤水河,用了七七四十九步,历经九九八十一难,终于,finally,at last 可以把同一个事件的多个时间段合并得到一个完整的时间段。但是不要高兴得太早了,还记得我们最开始的时候是 INPUTEVENTS_MVCHARTEVENTSPROCEDUREVENTS_MV 一共三张表格吗?现在我们刚刚把 INPUTEVENTS_MV 表格处理完,而已。我们在上一篇 Step 4 定下的步骤还记得吗?

1.WhereWeR

有没有很惊喜?有没有很意外?我们做了这么久其实才做完 Step 4 的第 1 条哈哈哈哈。

好吧,乖乖继续按流程走吧。


Convert CHARTEVENTS into durations

(我已经连这应该是几级标题都搞不清楚了)

INPUTEVENTS_MV 处理好了,轮到下一个 CHARTEVENTS 。我们直接复用之前写好的代码就行了(一样的作为示例我们只看一个病人的):

WITH crrt_settings AS
  (
  SELECT ce.icustay_id, ce.charttime
  , MAX(CASE WHEN ce.itemid IN
            (
            224149, -- Access Pressure
            224144, -- Blood Flow (ml/min)
            228004, -- Citrate (ACD-A)
            225183, -- Current Goal
            225977, -- Dialysate Fluid
            224154, -- Dialysate Rate
            224151, -- Effluent Pressure
            224150, -- Filter Pressure
            225958, -- Heparin Concentration (units/mL)
            224145, -- Heparin Dose (per hour)
            224191, -- Hourly Patient Fluid Removal
            228005, -- PBP (Prefilter) Replacement Rate
            228006, -- Post Filter Replacement Rate
            225976, -- Replacement Fluid
            224153, -- Replacement Rate
            224152, -- Return Pressure
            226457  -- Ultrafiltrate Output
            )
        THEN 1 ELSE 0 END) AS RRT
    -- Below indicates that a new instance of CRRT has started
  , MAX(CASE
    -- System Integrity
        WHEN ce.itemid = 224146 AND
             value IN ('New Filter','Reinitiated')
        THEN 1 ELSE 0 END) AS RRT_start
    -- Below indicates that the current instance of CRRT has ended
  , MAX(CASE
    -- System Integrity
        WHEN
          ce.itemid = 224146 AND
          value IN ('Discontinued','Recirculating') THEN 1
        WHEN ce.itemid = 225956
        THEN 1 ELSE 0 END ) AS RRT_end
  FROM chartevents ce
  WHERE ce.itemid IN
    (
      -- MetaVision ITEMIDs
      -- Below require special handling
      224146, -- System Integrity
      225956,  -- Reason for CRRT Filter Change

      -- Below are settings which indicate CRRT is started/continuing
      224149, -- Access Pressure
      224144, -- Blood Flow (ml/min)
      228004, -- Citrate (ACD-A)
      225183, -- Current Goal
      225977, -- Dialysate Fluid
      224154, -- Dialysate Rate
      224151, -- Effluent Pressure
      224150, -- Filter Pressure
      225958, -- Heparin Concentration (units/mL)
      224145, -- Heparin Dose (per hour)
      224191, -- Hourly Patient Fluid Removal
      228005, -- PBP (Prefilter) Replacement Rate
      228006, -- Post Filter Replacement Rate
      225976, -- Replacement Fluid
      224153, -- Replacement Rate
      224152, -- Return Pressure
      226457  -- Ultrafiltrate Output
    )
    AND ce.value is not null
    AND icustay_id = 246866
    GROUP BY icustay_id, charttime
  )

  -- create the durations for each CRRT instance
  SELECT icustay_id
  , ROW_NUMBER() OVER (PARTITION BY icustay_id order BY num) AS num
  , MIN(charttime) AS starttime
  , MAX(charttime) AS endtime
  FROM
  (
  SELECT vd1.*
  -- create a cumulative sum of the instances of new CRRT
  -- this results in a monotonically increasing integer assigned to each CRRT
  , CASE WHEN
      RRT_start = 1 OR RRT=1 OR RRT_end = 1
    THEN SUM(NewCRRT) OVER
      (PARTITION BY icustay_id ORDER BY charttime )
    ELSE null END AS num
  --- now we convert CHARTTIME of CRRT settings into durations
  FROM
    ( -- vd1
      SELECT
      icustay_id
      -- this carries over the previous charttime
      , CASE WHEN RRT=1 THEN
          LAG(CHARTTIME, 1) OVER (PARTITION BY icustay_id, RRT ORDER BY charttime)
      ELSE null END AS charttime_lag
      , charttime
      , RRT, RRT_start, RRT_end
      -- calculate the time since the last event
      , CASE
      -- non-null iff the current observation indicates settings are present
        WHEN RRT=1 THEN
          CHARTTIME -
            (
              LAG(CHARTTIME, 1) OVER
                  (PARTITION BY icustay_id, RRT
                  ORDER BY charttime)
            )
      ELSE null END AS CRRT_duration

      -- now we determine if the current event is a new instantiation
      , CASE
        WHEN RRT_start = 1 THEN 1
        -- if there is an end flag, we mark any subsequent event as new
        WHEN RRT_end = 1 THEN 0
        -- note the end is *not* a new event, the *subsequent* row is
        -- so here we output 0
        WHEN LAG(RRT_end,1) OVER
          (
            PARTITION BY icustay_id,
                         CASE WHEN RRT=1 OR RRT_end=1
                         THEN 1 ELSE 0 END
            ORDER BY charttime
          ) = 1 THEN 1
        -- if there is less than 2 hours between CRRT settings, we do not treat this as a new CRRT event
        WHEN (CHARTTIME - (LAG(CHARTTIME, 1) OVER
                              (
                                PARTITION BY icustay_id, CASE WHEN RRT=1 OR RRT_end=1
                                                            THEN 1 ELSE 0 END
                                ORDER BY charttime
                              )
                          )
              ) <= INTERVAL '2' hour
        THEN 0 ELSE 1 END AS NewCRRT
        -- use the temp table with only settings from chartevents
        FROM crrt_settings
      ) AS vd1
    -- now we can isolate to just rows with settings
    -- (before we had rows with start/end flags)
    -- this removes any null values for NewCRRT
  WHERE RRT_start = 1 OR RRT = 1 OR RRT_end = 1
) AS vd2
GROUP BY icustay_id, num
HAVING MIN(charttime) != MAX(charttime)
ORDER BY icustay_id, num;

得到:

* num starttime endtime
0 1 Day 11, 23:43 Day 12, 20:00
1 2 Day 12, 22:00 Day 13, 16:30
2 3 Day 13, 18:15 Day 13, 23:00
3 4 Day 14, 15:27 Day 16, 16:00

看看应该没问题,然后就可以去掉那个 AND icustay_id = 246866来查询所有病人了(猝不及防地又来了一段 Python,这是为了把查询 CHARTEVENTS 所有病人的查询语句记下来,后面就能直接用了。本来是应该用 R 的,但是我看了一下后面主要是作图。ggplot2 应该画同样的图没问题,但是我懒得查了):

# happy with above query
# now remove the one patient constraints
query_chartevents = query_schema + """
WITH crrt_settings AS(
SELECT ce.icustay_id, ce.charttime,
MAX(CASE WHEN ce.itemid IN
      (
        224149, -- Access Pressure
        224144, -- Blood Flow (ml/min)
        228004, -- Citrate (ACD-A)
        225183, -- Current Goal
        225977, -- Dialysate Fluid
        224154, -- Dialysate Rate
        224151, -- Effluent Pressure
        224150, -- Filter Pressure
        225958, -- Heparin Concentration (units/mL)
        224145, -- Heparin Dose (per hour)
        224191, -- Hourly Patient Fluid Removal
        228005, -- PBP (Prefilter) Replacement Rate
        228006, -- Post Filter Replacement Rate
        225976, -- Replacement Fluid
        224153, -- Replacement Rate
        224152, -- Return Pressure
        226457  -- Ultrafiltrate Output
      ) THEN 1 ELSE 0
    END) AS RRT
-- Below indicates that a new instance of CRRT has started
, MAX(
  CASE
    -- System Integrity
    WHEN ce.itemid = 224146 AND value IN ('New Filter','Reinitiated')
      THEN 1 ELSE 0
  END) AS RRT_start
-- Below indicates that the current instance of CRRT has ended
, MAX(
  CASE
    -- System Integrity
    WHEN ce.itemid = 224146 AND value IN ('Discontinued','Recirculating')
      THEN 1
    WHEN ce.itemid = 225956
      THEN 1
  ELSE 0
  END) AS RRT_end
FROM chartevents ce
WHERE ce.itemid IN
  (
    -- MetaVision ITEMIDs
    -- Below require special handling
    224146, -- System Integrity
    225956,  -- Reason fOR CRRT Filter Change

    -- Below are settings which indicate CRRT is started/continuing
    224149, -- Access Pressure
    224144, -- Blood Flow (ml/min)
    228004, -- Citrate (ACD-A)
    225183, -- Current Goal
    225977, -- Dialysate Fluid
    224154, -- Dialysate Rate
    224151, -- Effluent Pressure
    224150, -- Filter Pressure
    225958, -- Heparin Concentration (units/mL)
    224145, -- Heparin Dose (per hour)
    224191, -- Hourly Patient Fluid Removal
    228005, -- PBP (Prefilter) Replacement Rate
    228006, -- Post Filter Replacement Rate
    225976, -- Replacement Fluid
    224153, -- Replacement Rate
    224152, -- Return Pressure
    226457  -- Ultrafiltrate Output
  )
AND ce.value IS NOT null
GROUP BY icustay_id, charttime
)

-- create the durations fOR each CRRT instance
SELECT icustay_id
  , ROW_NUMBER() OVER (PARTITION BY icustay_id ORDER BY num) AS num
  , MIN(charttime) AS starttime
  , MAX(charttime) AS endtime
FROM
(
  SELECT vd1.*
  -- create a cumulative sum of the instances of new CRRT
  -- this results in a monotonically increasing integer assigned to each CRRT
  , CASE WHEN RRT_start = 1 OR RRT=1 OR RRT_end = 1
	THEN SUM(NewCRRT)
      OVER (PARTITION BY icustay_id ORDER BY charttime) ELSE null
	END AS num
  --- now we convert CHARTTIME of CRRT settings into durations
  FROM ( -- vd1
      SELECT
          icustay_id
          -- this carries over the previous charttime
          , CASE
              WHEN RRT=1 THEN
                LAG(CHARTTIME, 1) OVER (PARTITION BY icustay_id, RRT ORDER BY charttime)
              ELSE null
            END AS charttime_lag
          , charttime
          , RRT
          , RRT_start
          , RRT_end
          -- calculate the time since the last event
          , CASE
              -- non-null iff the current observation indicates settings are present
              WHEN RRT=1 THEN
                CHARTTIME -
                (
                  LAG(CHARTTIME, 1) OVER
                  (
                    PARTITION BY icustay_id, RRT
                    ORDER BY charttime
                  )
                )
              ELSE null
            END AS CRRT_duration

          -- now we determine if the current event is a new instantiation
          , CASE
              WHEN RRT_start = 1
                THEN 1
            	-- if there is an end flag, we mark any subsequent event as new
              WHEN RRT_end = 1
                -- note the end is *not* a new event, the *subsequent* row is
                -- so here we output 0
                THEN 0
              WHEN
                LAG(RRT_end,1)
                OVER
                (
                	PARTITION BY icustay_id, CASE WHEN RRT=1 OR RRT_end=1 THEN 1 ELSE 0 END
                	ORDER BY charttime
                ) = 1
 								THEN 1
              -- if there is less than 2 hours between CRRT settings, we do not treat this as a new CRRT event
              WHEN (CHARTTIME - (LAG(CHARTTIME, 1)
              OVER
              (
                PARTITION BY icustay_id, CASE WHEN RRT=1 OR RRT_end=1 THEN 1 ELSE 0  END
                ORDER BY charttime
              ))) <= interval '2' hour
              	THEN 0
            ELSE 1
          END AS NewCRRT
      -- use the temp table with only settings from chartevents
      FROM crrt_settings
  ) AS vd1
  -- now we can isolate to just rows with settings
  -- (befORe we had rows with start/end flags)
  -- this removes any null values fOR NewCRRT
  WHERE
    RRT_start = 1 OR RRT = 1 OR RRT_end = 1
) AS vd2
GROUP BY icustay_id, num
HAVING MIN(charttime) != MAX(charttime)
ORDER BY icustay_id, num;
"""

Extract durations from PROCEDUREEVENTS_MV

PROCEDUREEVENTS_MV 里也有透析的记录。估计你们也忘了前面选的那些了。再列一次我们挑出来 itemid

  • 225802 -- Dialysis - CRRT
  • 225803 -- Dialysis - CVVHD
  • 225809 -- Dialysis - CVVHDF
  • 225955 -- Dialysis - SCUF

提取这些数据就很直接了。每个 CRRT 事件也只记录了一对 starttime 和 endtime,也就不需要我们再去合并了。

-- extract the durations from PROCEDUREEVENTS_MV
-- NOTE: we only look at a single patient as an exemplar
SELECT icustay_id
  , ROW_NUMBER() OVER (
      PARTITION BY icustay_id
      ORDER BY starttime, endtime) AS num
  , starttime, endtime
FROM procedureevents_mv
WHERE itemid IN
(
    225802 -- Dialysis - CRRT
  , 225803 -- Dialysis - CVVHD
  , 225809 -- Dialysis - CVVHDF
  , 225955 -- Dialysis - SCUF
)
AND icustay_id = 246866
ORDER BY icustay_id, num;

得到:

* num starttime endtime
0 1 Day 11, 23:45 Day 12, 20:30
1 2 Day 12, 21:30 Day 13, 23:15
2 3 Day 14, 15:27 Day 16, 16:02

可以看到上面的记录很勤:第 1 行与第 2 行这两条记录之间间隔了一个小时,这是现实中 CRRT 治疗暂停了一个小时的反映。上面的代码没问题的话,现在又要去掉一个病人的限制条件了(和上面一样,这是后面 Python 需要用到的查询语句):

# happy with above query
# now remove the one patient constraints
query_procedureevents = query_schema + """
SELECT icustay_id
  , ROW_NUMBER() OVER (PARTITION BY icustay_id
                       ORDER BY starttime, endtime) AS num
  , starttime, endtime
FROM procedureevents_mv
WHERE itemid IN
(
    225802 -- Dialysis - CRRT
  , 225803 -- Dialysis - CVVHD
  , 225809 -- Dialysis - CVVHDF
  , 225955 -- Dialysis - SCUF
)
ORDER BY icustay_id, num;
"""

Roundup: data from INPUTEVENTS_MV, CHARTEVENTS, and PROCEDUREEVENTS_MV

好了,现在 3 个表都处理完了。综合一下 3 个结果,但首先我们得把三个结果都存储到变量里方便后面比较(这就要用到上一篇最后那个和本篇里上面两个存储在 Python 里的一共 3 个长长的查询语句了。由于我们已经把它们存在 Python 变量里了,所以现在只需要把它们交给 pd.read_sql_query() 就行了。

上面已经有了 query_charteventsquery_procedureevents ,干脆再贴一下 query_inputevents 的,免得回去翻:

query_inputevents = query_schema + """
WITH t1 AS
  (
    SELECT icustay_id
    , CASE WHEN
        itemid = 227525 THEN 'Calcium'
      ELSE 'KCl' END AS label
    , starttime, endtime
    , CASE WHEN LAG(endtime) OVER
        (PARTITION BY icustay_id, itemid ORDER BY starttime, endtime) = starttime
      THEN 0
    ELSE 1 END AS new_event_flag
    , rate, rateuom
    , statusdescription
    FROM inputevents_mv
    WHERE itemid IN
      (
      227525,-- Calcium Gluconate (CRRT)
      227536 -- KCl (CRRT)
      )
    AND statusdescription != 'Rewritten'
  )
  , t2 as
  (
    SELECT
    icustay_id, label
    , starttime, endtime
    , SUM(new_event_flag) OVER
        (PARTITION BY icustay_id, label ORDER BY starttime, endtime)
        AS time_partition
    , rate, rateuom, statusdescription
    FROM t1
  )
  , t3 as
  (
    SELECT
    icustay_id, label
    , starttime, endtime
    , time_partition
    , rate, rateuom, statusdescription
    , ROW_NUMBER() OVER
        (PARTITION BY icustay_id, label, time_partition
          ORDER BY starttime DESC, endtime DESC)
      AS lastrow
    FROM t2
  )
SELECT
icustay_id
, time_partition AS num
, MIN(starttime) AS starttime
, max(endtime) AS endtime
, label
--, MIN(rate) AS rate_min
--, max(rate) AS rate_max
--, MIN(rateuom) AS rateuom
--, MIN(CASE WHEN lastrow = 1 THEN statusdescription ELSE null END) AS statusdescription
FROM t3
GROUP BY icustay_id, label, time_partition
ORDER BY starttime, endtime;
"""

而且这一次也不再是简简单单查询一下看一下数据,而是把结果存下来后面再比较分析)。一样的,先把环境搞起来,载入包:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import psycopg2
import getpass
from IPython.display import HTML, display
import matplotlib.dates as dates
import matplotlib.lines as mlines

简单设置并且连上数据库:

%matplotlib inline
plt.style.use("ggplot")

dbname = 'mimic'
user = 'postgres'
schema_name = 'mimiciii'

# ln -s /var/run/postgresql/.s.PGSQL.5432 /tmp/.s.PGSQL.5432
con = psycopg2.connect(dbname="mimic", user="postgres", password=getpass.getpass(prompt='Password:'.format(user)))

query_schema = 'SET search_path to ' + schema_name + ';'

然后得到那三个数据:

print("Durations from INPUTEVENTS...")
ie = pd.read_sql_query(query_inputevents,con)

print("Durations from CHARTEVENTS...")
ce = pd.read_sql_query(query_chartevents,con)

print("Durations from PROCEDUREEVENTS...")
pe = pd.read_sql_query(query_procedureevents,con)

进行下一步之前我们先看看得到的这三个数据到底长什么样子:

print("First 5 lines of ie...")
ie.head()

表格 ie

* icustay_id num starttime endtime label
0 205508 1 2101-07-09 18:10:00 2101-07-13 15:44:00 Calcium
1 280550 1 2101-08-02 21:20:00 2101-08-04 16:05:00 Calcium
2 280550 1 2101-08-03 08:56:00 2101-08-04 16:05:00 KCl
3 217315 1 2101-09-21 01:00:00 2101-09-21 09:00:00 Calcium
4 217315 2 2101-09-21 11:00:00 2101-09-27 11:00:00 Calcium

表格 ce

print("First 5 lines of ce...")
ce.head()
* icustay_id num starttime endtime
0 200347 1 2116-06-10 15:00:00 2116-06-11 01:00:00
1 200347 2 2116-06-11 04:20:00 2116-06-11 18:00:00
2 200347 3 2116-06-11 19:00:00 2116-06-12 08:00:00
3 200347 4 2116-06-12 10:02:00 2116-06-13 10:26:00
4 200699 1 2105-04-30 00:19:00 2105-04-30 08:00:00

表格 pe:

print("First 5 lines of pe...")
pe.head()
* icustay_id num starttime endtime
0 200347 1 2116-06-10 15:00:00 2116-06-11 00:07:00
1 200347 2 2116-06-11 04:20:00 2116-06-12 07:27:00
2 200347 3 2116-06-12 10:00:00 2116-06-12 12:22:00
3 200347 4 2116-06-12 13:15:00 2116-06-13 10:29:00
4 200699 1 2105-04-30 00:19:00 2105-04-30 09:00:00

可以看到除了 ie 多出一列 label 用来表示输注的是钙还是钾之外,三个表格都有 icustay_id、num、starttime、endtime 这 4 列,其中 num 用来区分同一个人的多次治疗。

Compare durations

现在呢,就把三个数据合起来。而且为了让合并起来的数据知道是来自于这三个表格中的哪个,我们还要加上一列 source。对于 ie 我们还得区分这是 KCl 还是 Ca:

def display_df(df):
    col = [x for x in df.columns if x != 'icustay_id']
    df_tmp = df[col].copy()
    for c in df_tmp.columns:
        if '[ns]' in str(df_tmp[c].dtype):
            df_tmp[c] = df_tmp[c].dt.strftime('Day %d, %H:%M')
    
    display(HTML(df_tmp.to_html().replace('NaN', '')))

# compare the above durations
ce['source'] = 'chartevents'
ie['source'] = 'inputevents_kcl'
ie.loc[ie['label']=='Calcium','source'] = 'inputevents_ca' 
pe['source'] = 'procedureevents'
df = pd.concat([ie[['icustay_id','num','starttime','endtime','source']], ce, pe])

df.head()

然后合并后数据长这样:

* icustay_id num starttime endtime source
0 205508 1 2101-07-09 18:10:00 2101-07-13 15:44:00 inputevents_ca
1 280550 1 2101-08-02 21:20:00 2101-08-04 16:05:00 inputevents_ca
2 280550 1 2101-08-03 08:56:00 2101-08-04 16:05:00 inputevents_kcl
3 217315 1 2101-09-21 01:00:00 2101-09-21 09:00:00 inputevents_ca
4 217315 2 2101-09-21 11:00:00 2101-09-27 11:00:00 inputevents_ca

然后单独拎出一个病人的数据,来看一下这几个不同来源的数据之间是否相互重叠:

iid = 205508

idxDisplay = df['icustay_id'] == iid
display_df(df.loc[idxDisplay, :])

得到:

* num starttime endtime source
0 1 Day 09, 18:10 Day 13, 15:44 inputevents_ca
136 1 Day 09, 18:00 Day 12, 15:15 chartevents
137 2 Day 12, 16:02 Day 12, 19:01 chartevents
138 3 Day 12, 21:00 Day 13, 14:03 chartevents
147 1 Day 09, 18:00 Day 13, 15:04 procedureevents

看表还不够直观,我们画图:

# set a color palette
col_dict = {'chartevents': [247,129,191],
           'inputevents_kcl': [255,127,0],
           'inputevents_ca': [228,26,28],
           'procedureevents': [55,126,184]}

for c in col_dict:
    col_dict[c] = [x/256.0 for x in col_dict[c]]


fig, ax = plt.subplots(figsize=[16,10])
m = 0.
M = np.sum(idxDisplay)

# dummy plots for legend
legend_handle = list()
for c in col_dict:
    legend_handle.append(mlines.Line2D([], [], color=col_dict[c], marker='o',
                              markersize=15, label=c))

for row in df.loc[idxDisplay,:].iterrows():
    # row is a tuple: [index, actual_data], so we use row[1]
    plt.plot([row[1]['starttime'].to_pydatetime(), row[1]['endtime'].to_pydatetime()], [0+m/M,0+m/M],
            'o-',color=col_dict[row[1]['source']],
            markersize=15, linewidth=2)
    m=m+1
    
ax.xaxis.set_minor_locator(dates.HourLocator(byhour=[0,12],interval=1))
ax.xaxis.set_minor_formatter(dates.DateFormatter('%H:%M'))
ax.xaxis.grid(True, which="minor")
ax.xaxis.set_major_locator(dates.DayLocator(interval=1))
ax.xaxis.set_major_formatter(dates.DateFormatter('\n%d\n%a'))

ax.set_ylim([-0.1,1.0])

plt.legend(handles=legend_handle,loc='best')
plt.savefig('0-crrt_' + str(iid) + '.png')
plt.show()

得到图:

2.data.overlap.png

可以发现三个数据基本上对于起止时间记录相差不大,差别仅仅在于数据是否是分段记录的(治疗间的暂停如何记录和定义的问题)。

这是一个病人的数据。我们现在来直接看 10 个:

# print out the above for 10 examples

# compare the above durations
ce['source'] = 'chartevents'
ie['source'] = 'inputevents_kcl'
ie.loc[ie['label']=='Calcium','source'] = 'inputevents_ca' 
pe['source'] = 'procedureevents'
df = pd.concat([ie[['icustay_id','num','starttime','endtime','source']], ce, pe])

for iid in np.sort(df.icustay_id.unique()[0:10]):
    iid = int(iid)
    # how many PROCEDUREEVENTS_MV dialysis events encapsulate CHARTEVENTS/INPUTEVENTS_MV?
    # vice-versa?
    idxDisplay = df['icustay_id'] == iid
    
    # no need to display here
    #display_df(df.loc[idxDisplay, :])
    
    # 2) how many have no overlap whatsoever?
    col_dict = {'chartevents': [247,129,191],
               'inputevents_kcl': [255,127,0],
               'inputevents_ca': [228,26,28],
               'procedureevents': [55,126,184]}

    for c in col_dict:
        col_dict[c] = [x/256.0 for x in col_dict[c]]


    fig, ax = plt.subplots(figsize=[16,10])
    m = 0.
    M = np.sum(idxDisplay)

    # dummy plots for legend
    legend_handle = list()
    for c in col_dict:
        legend_handle.append(mlines.Line2D([], [], color=col_dict[c], marker='o',
                                  markersize=15, label=c))

    for row in df.loc[idxDisplay,:].iterrows():
        # row is a tuple: [index, actual_data], so we use row[1]
        plt.plot([row[1]['starttime'].to_pydatetime(), row[1]['endtime'].to_pydatetime()], [0+m/M,0+m/M],
                'o-',color=col_dict[row[1]['source']],
                markersize=15, linewidth=2)
        m=m+1

    ax.xaxis.set_minor_locator(dates.HourLocator(byhour=[0,6,12,18],interval=1))
    ax.xaxis.set_minor_formatter(dates.DateFormatter('%H:%M'))
    ax.xaxis.grid(True, which="minor")
    ax.xaxis.set_major_locator(dates.DayLocator(interval=1))
    ax.xaxis.set_major_formatter(dates.DateFormatter('\n%d-%m-%Y'))

    ax.set_ylim([-0.1,1.0])

    plt.legend(handles=legend_handle,loc='best')
    
    # if you want to save the figures, uncomment the line below
    #plt.savefig('crrt_' + str(iid) + '.png')

依次得到 10 个人的图:

crrt_202837

crrt_203641

crrt_205508

crrt_206253

crrt_214522

crrt_217315

crrt_257445

crrt_261439

crrt_265724

crrt_280550

看了这些图,好像 INPUTEVENTS 和 PROCEDUREEVENTS_MV 里的数据对于 CHARTEVENTS 来说基本上是冗余的。而且,CHARTEVENTS 的记录似乎更好地反映了 CRRT 治疗过程中因为输液管阻塞和治疗暂停等导致的记录中断。综合一下,其实要反映 CRRT 的治疗时间,仅仅用 CHARTEVENTS 的数据就够了。concepts/durations/crrt-durations.sql 这里放的查询脚本包含了最终加入 CareVue 的 itemid,查找这些数据的方法和这个记事本讲的方法一样。


最难一根骨头终于啃完了。但是其实代码还不是特别熟悉,还要仔细看。然后后面再看哪一个再说。

发觉还是要好好学一下用 Python 分析数据了,因为我发现 Python 查询 PostgreSQL 的速度好像要快很多,虽然我还是不算很喜欢 Jupyter-Notebook 这种工作方式。

尝试在这里建立一个博客

y4igm7t

今天突然看到在 GitHub Issues 写博客这种操作,顿时觉得这比之前的 Pages + Hexo 更方便,而且之前一直觉得麻烦的迁移问题迎刃而解。好吧,其实是我不懂 Hexo ,不知道怎么迁移。

决定试试把 Hexo 中的内容迁移过来试试看。

  • 发一篇带图片的博客
  • 解决怎么加标签的问题
  • 找一个好的显示日期的解决方法
  • 剩下的就是体力活啦

突然想到 README 应该拿来做目录,暂时可以以月份归档。
所以剩下的就是怎么按标签分类显示的问题了....

以及这个时候 repo 应该怎样合理利用也是个问题。

当然还有怎么防止其他人开 Issue 呢?虽然这个对我的影响可能为零。

Python 里 NumPy 的 axis 参数的理解

0.BingWallpaper-2018-08-25.jpg

最近学学 Python 做数据分析,主要就是 Python 基本语法 + NumPy + pandas 咯。

发现很好的一些教程:

果然人生苦短,大家都在用 Python。好教程都一搜一大把。然后今天在 B 站看莫烦的视频,前面都是讲 NumPy 的,array 这个东西其实对于我来说没那么重要,所以我就 1.5x 倍速的看。然后一边刷酷安和饭否啥的,基本没怎么操作,想着泛泛地听一听得了,后面 pandas 再认真听跟着操作。印象中 Python 里对于二维数据就是 0 是行 1 是列。因为我记忆的方法一直是我们平时都会说行列行列,那 0101 不就是行列行列。看到对 array 讲求均数、最大最小值以及后面 np.split() 发现一直有人在弹幕刷什么 axis = 0 是行有没有错或者怎么理解之类的。然后我就决定试一下看一看(我都是在 conda 环境 IPython3 下面操作,所以前面都有 INOUT):

In [1]: import numpy as np

In [2]: a = np.arange(12).reshape(3,4)
In [3]: b = np.arange(12).reshape(4,3)

这样 a、b 分别是 3 行 4 列和 4 行 3 列的两个 array。print 看一下心里有底:

In [4]: print(a)
[[ 0  1  2  3]
 [ 4  5  6  7]
 [ 8  9 10 11]]

In [5]: print(b)
[[ 0  1  2]
 [ 3  4  5]
 [ 6  7  8]
 [ 9 10 11]]

我们来看看求均数的时候 axis 参数怎么工作:

# axis = 0 是行
In [6]: np.mean(a, axis=0)
Out[6]: array([4., 5., 6., 7.])

In [7]: np.mean(b, axis=0)
Out[7]: array([4.5, 5.5, 6.5])

# axis = 1 是列
In [8]: np.mean(a, axis=1)
Out[8]: array([1.5, 5.5, 9.5])

In [9]: np.mean(b, axis=1)
Out[9]: array([ 1.,  4.,  7., 10.])

对 a 进行"行求平均"得到 4 个值,b 同样进行"行求平均"得到 3 个值。这不是列求平均吗?
对 a 进行"列求平均"得到 3 个值,b 同样进行"列求平均"得到 4 个值。这不是行求平均吗?

然后我就开始查了,“Python numpy axis” 拿去 Google 一下,果然问这个问题的不少:

我们首先看 StackOverflow 的回答:Ambiguity in Pandas Dataframe / Numpy Array “axis” definition:

1.question

这个人几乎问了和我一模一样的问题,NumPy 的 axis 到底咋回事?

下面的回答解释得很详细:

It's perhaps simplest to remember it as 0=down and 1=across.

This means:

  • Use axis=0 to apply a method down each column, or to the row labels (the index).
  • Use axis=1 to apply a method across each row, or to the column labels.

It's also useful to remember that Pandas follows NumPy's use of the word axis. The usage is explained in NumPy's glossary of terms:

Axes are defined for arrays with more than one dimension. A 2-dimensional array has two corresponding axes: the first running vertically downwards across rows (axis 0), and the second running horizontally across columns (axis 1). [my emphasis]

So, concerning the method in the question, df.mean(axis=1), seems to be correctly defined. It takes the mean of entries horizontally across columns, that is, along each individual row. On the other hand, df.mean(axis=0) would be an operation acting vertically downwards across rows.

Similarly, df.drop(name, axis=1) refers to an action on column labels, because they intuitively go across the horizontal axis. Specifying axis=0 would make the method act on rows instead.

什么意思呢?其实简单的理解办法就是:axis = 0 就是在列上上下方向应用一个方法,或者说是对 row index 作用;而 axis = 1 就是在行上左右方向作用,或者说是对列名。在 NumPy 的文档里也说了,axis = 0 是垂直方向上在行上下进行操作,axis = 1 在水平方向上对列操作。所以呢,这就能理解了,我们说行列其实是说在哪个维度上来操作,0 在行上操作,那么列不动,行上下压缩没了;反之,1 在列上左右方向操作,那行不动,列没了。

再回头看开头的例子:

In [10]: print(a)
[[ 0  1  2  3]
 [ 4  5  6  7]
 [ 8  9 10 11]]

In [11]: np.mean(a,axis=0)
Out[11]: array([4., 5., 6., 7.])
# 在 0 也就是行上上下操作,行全部压缩没了,上下求均数,所以剩下每一列一个均数

In [12]: np.mean(a,axis=1)
Out[12]: array([1.5, 5.5, 9.5])
# 在 1 也就是列上左右操作,所以列压缩没了,左右求均数,所以剩下每一行一个均数

能理解了吧。

再来看开头提到的 np.split 。这个函数接受 3 个参数,对谁做切割操作,分成几块,以及 axis 即怎么切。

现在 a 三行四列,b 四行三列。要切成两块的话,a 的 3 行没法平分,只能沿列方向左右切成两个 3×2,所以 axis 是 1;b 的 3 列没法平分,只能沿行方向上下切成两个 2×3,所以 axis 是 0。验证一下:

In [13]: np.split(a, 2, axis=1)
Out[13]: 
[array([[0, 1],
        [4, 5],
        [8, 9]]), 
 array([[ 2,  3],
        [ 6,  7],
        [10, 11]])]

In [14]: np.split(b, 2, axis=0)
Out[14]: 
[array([[0, 1, 2],
        [3, 4, 5]]), 
 array([[ 6,  7,  8],
        [ 9, 10, 11]])]
# 为了排版便于阅读我在结果部分加了换行,但是内容没有改动

可以看到,代码确实如我们所想的那样工作。

最后,我们再看 StackOverflow 上那个问题里提到的 df.drop("col4", axis=1) 也就能理解了:我们指定 axis = 1 即在列上左右方向操作,所以被 drop 掉的肯定是列,然后用参数指定删掉哪一列就行了。
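
顺手用 pandas 验证一下(DataFrame 是随手造的,列名只是为了对应那个例子):

import pandas as pd

df = pd.DataFrame({"col1": [1, 2], "col2": [3, 4], "col4": [5, 6]})

print(df.drop("col4", axis=1))  # axis=1:沿列方向操作,删掉名为 col4 的列
print(df.drop(0, axis=0))       # axis=0:沿行方向操作,删掉索引为 0 的那一行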

嗯,收工。图书馆刚好关门,晚安。


2018-08-25
医学部图书馆二楼 094 号座位

SQLBolt 课程学习笔记三(9-12 课)

继续上课。

SQL Lesson 9: Queries with expressions

第 9 课,表达式查询。

In addition to querying and referencing raw column data with SQL, you can also use expressions to write more complex logic on column values in a query. These expressions can use mathematical and string functions along with basic arithmetic to transform values when the query is executed, as shown in this physics example.

除了直接查询和引用原始的列数据之外,我们还可以在查询里对列的值用表达式写更复杂的逻辑。表达式可以使用数学函数、字符串函数以及基本的算术运算,在查询执行时对值做转换。比如下面这个物理学的例子:

SELECT particle_speed / 2.0 AS half_particle_speed
FROM physics_data
WHERE ABS(particle_position) * 10.0 > 500;

Each database has its own supported set of mathematical, string, and date functions that can be used in a query, which you can find in their own respective docs.
数据库一般都有自己支持的一套数学、字符和日期相关的函数功能用于查询,可以查看文档。

The use of expressions can save time and extra post-processing of the result data, but can also make the query harder to read, so we recommend that when expressions are used in the SELECT part of the query, that they are also given a descriptive alias using the AS keyword.
使用表达式可以节省时间,避免对结果的再处理,但同时也会使查询语句可读性降低。因而推荐的做法是,在 SELECT 部分使用表达式时,用 AS 给表达式的结果起一个描述性的别名。比如:

SELECT col_expression AS expr_description, …
FROM mytable;

In addition to expressions, regular columns and even tables can also have aliases to make them easier to reference in the output and as a part of simplifying more complex queries.
另外,数据的列和表格都可以有别称,这样有助于引用,也可以简化复杂的查询语句。比如:

SELECT column AS better_column_name, …
FROM a_long_widgets_table_name AS mywidgets
INNER JOIN widget_sales
  ON mywidgets.id = widget_sales.widget_id;

练习时间

You are going to have to use expressions to transform the BoxOffice data into something easier to understand for the tasks below.
使用表达式转换 BoxOffice 数据使得其更好理解。用到的表格还是前面那个:

1.tables.png

  1. List all movies and their combined sales in millions of dollars

列出所有电影及其总票房(百万为单位),就是数学计算咯

SELECT Title, 
       (Domestic_sales + International_sales ) / 1000000 AS Total_sales
FROM Boxoffice b LEFT JOIN movies m ON
    b.Movie_id = m.Id 
ORDER BY Total_sales DESC;

(为了结果好看一点我按总共票房降序排列了。)

  2. List all movies and their ratings in percent
    列出所有电影及其百分比评分,感觉和第 1 题一样啊:
SELECT Title, 
       Rating * 10 AS Rating_pct
FROM Boxoffice b LEFT JOIN movies m ON
    b.Movie_id = m.Id 
ORDER BY Rating_pct DESC;

而且我开始写的 (Rating / 10) * 100 AS Rating_pct 不知道为什么就是不行。

  3. List all movies that were released on even number years
    所有偶数年发行的电影,判断余数是否是 0 咯:
SELECT Title, Year FROM movies
WHERE Year % 2 = 0
ORDER BY Title ASC;

搞定收工。


SQL Lesson 10: Queries with aggregates (Pt. 1)

第 10 课,聚合(一)。预感从这里开始会有点小难。

In addition to the simple expressions that we introduced last lesson, SQL also supports the use of aggregate expressions (or functions) that allow you to summarize information about a group of rows of data. With the Pixar database that you've been using, aggregate functions can be used to answer questions like, "How many movies has Pixar produced?", or "What is the highest grossing Pixar film each year?".
除了上节课介绍的简单表达式之外,SQL 也支持聚合表达式(或函数),用来对成组的多行数据做归纳汇总。比如之前的皮克斯电影数据,通过聚合我们可以回答"皮克斯一共出品了多少部电影?"、"皮克斯每年票房最高的电影是什么?"之类的问题。语法:

SELECT AGG_FUNC(column_or_expression) AS aggregate_description, …
FROM mytable
WHERE constraint_expression;

Without a specified grouping, each aggregate function is going to run on the whole set of result rows and return a single value. And like normal expressions, giving your aggregate functions an alias ensures that the results will be easier to read and process.
没有特定的分组的时候,聚合功能会直接作用于所有行并返回单个值作为结果。跟上节课提到的一样,给聚合函数起个别名也会使结果更易读和更易于后续处理。


Common aggregate functions 常见的聚合函数

  • COUNT(*), COUNT(column):A common function used to count the number of rows in the group if no column name is specified. Otherwise, count the number of rows in the group with non-NULL values in the specified column.
  • MIN(column):Finds the smallest numerical value in the specified column for all rows in the group.
  • MAX(column):Finds the largest numerical value in the specified column for all rows in the group.
  • AVG(column):Finds the average numerical value in the specified column for all rows in the group.
  • SUM(column):Finds the sum of all numerical values in the specified column for the rows in the group.

Grouped aggregate functions 分组聚合

In addition to aggregating across all the rows, you can instead apply the aggregate functions to individual groups of data within that group (ie. box office sales for Comedies vs Action movies).
This would then create as many results as there are unique groups defined as by the GROUP BY clause.
除了对所有行做聚合之外,我们还可以先分组、再对每组分别聚合(比如分别统计喜剧片和动作片的票房)。这样得到的结果数量与 GROUP BY 定义的组数相同。通俗点说就是,分组一共分了几组,聚合就得到几个结果。语法:

SELECT AGG_FUNC(column_or_expression) AS aggregate_description, …
FROM mytable
WHERE constraint_expression
GROUP BY column;

The GROUP BY clause works by grouping rows that have the same value in the column specified.
GROUP BY 的分组依据是:指定列中取值相同的行会被分到同一组。


练习

For this exercise, we are going to work with our Employees table. Notice how the rows in this table have shared data, which will give us an opportunity to use aggregate functions to summarize some high-level metrics about the teams. Go ahead and give it a shot.
这次练习 又用到前面那个 Employees 表格:

2.empl

可以看到表格中有的行有重复数据(比如 Role 和 Building),这就提供了聚合归纳得到更高级别数据的机会。试试吧:

  1. Find the longest time that an employee has been at the studio
    找到雇佣时间最长的雇员。
    嗯,雇员分组,时间求和,时间降序排列取第一个就是最长了:
SELECT Name, SUM(Years_employed) AS Empl_time FROM employees 
GROUP BY Name ORDER BY Empl_time DESC
LIMIT 1;
  2. For each role, find the average number of years employed by employees in that role
    对于每种工种计算平均工作年限,那就是 Role 分组,时间平均咯:
SELECT Role, AVG(Years_employed) AS Avg_empl_time FROM employees 
GROUP BY Role;
  3. Find the total number of employee years worked in each building
    计算每栋楼里所有雇员的总工作时间。楼分组,时间加和咯:
SELECT Building, SUM(Years_employed) AS Total_empl_time FROM employees 
GROUP BY Building;

竟然出奇的简单,开三。


SQL Lesson 11: Queries with aggregates (Pt. 2)

第 11 课,聚合(二)。

Our queries are getting fairly complex, but we have nearly introduced all the important parts of a SELECT query. One thing that you might have noticed is that if the GROUP BY clause is executed after the WHERE clause (which filters the rows which are to be grouped), then how exactly do we filter the grouped rows?
现在我们的查询语句已经有点小复杂了,但是其实 SELECT 语句的重要部分还没完全讲完。可以发现 GROUP BY 是在 WHERE 从句的后面执行的(即对 WHERE 筛选过的行分组),那我们还想对分组后的行再筛选一遍该如何是好捏?

Luckily, SQL allows us to do this by adding an additional HAVING clause which is used specifically with the GROUP BY clause to allow us to filter grouped rows from the result set.
所幸 SQL 提供了 HAVING ,它可以对 GROUP BY 从句分组后的结果进行筛选。语法:

SELECT group_by_column, AGG_FUNC(column_expression) AS aggregate_result_alias, …
FROM mytable
WHERE condition
GROUP BY column
HAVING group_condition;

The HAVING clause constraints are written the same way as the WHERE clause constraints, and are applied to the grouped rows. With our examples, this might not seem like a particularly useful construct, but if you imagine data with millions of rows with different properties, being able to apply additional constraints is often necessary to quickly make sense of the data.
HAVING 从句的写法和 WHERE 一样,并作用于分组后的行。在我们的例子里这个可能看起来没什么大用,但是如果放到一个几百万行的有不同性质的数据中,能额外对数据再进行一些限制性操作往往就很有必要了。

If you aren't using the GROUP BY clause, a simple WHERE clause will suffice.
在不使用 GROUP BY 的情况下,一个简单的 WHERE 就足够了。


For this exercise, you are going to dive deeper into Employee data at the film studio. Think about the different clauses you want to apply for each task.
练习题还是用前面的 Employee 数据:

3.empl

我们会再深入挖掘一下这个数据。做题的时候想一想你想用的那些从句。看题:

  1. Find the number of Artists in the studio (without a HAVING clause)
    不使用 HAVING 计算影楼里的 Artist 的数量,把 Artist 都选出来然后 COUNT 一下呗:
SELECT Role, COUNT(*) FROM employees WHERE Role = 'Artist';

这就是上面那个说的,没有 GROUP BY 的时候,HAVING 可以靠 WHERE 实现。

我能说我没想到是怎么用 HAVING 的么....上课没认真吗...
思考了一下,首先肯定是 Role 分组。然后 HAVING 分组后只要 Artist,然后 COUNT:

SELECT Role, COUNT(*) FROM employees GROUP BY Role HAVING Role = 'Artist';
  2. Find the number of Employees of each role in the studio
    每个工种雇员数量,那就是 Role 分组,雇员求和咯:
SELECT Role, Count(Name) FROM employees GROUP BY Role;
  3. Find the total number of years employed by all Engineers
    计算所有 Engineer 的工作时间。
    首先 Role 分组跑不掉,只要 Engineer 的话 HAVING 跑不掉,然后时间求和咯:
SELECT Role, SUM(Years_employed) AS Total_empl_time FROM employees
GROUP BY Role
HAVING Role = 'Engineer';

K.O.


SQL Lesson 12: Order of execution of a Query

第 12 课,查询语句的执行顺序。

Now that we have an idea of all the parts of a query, we can now talk about how they all fit together in the context of a complete query.
现在我们基本了解了一个查询的各个部分,可以来聊一聊在一个完整的查询中这些部分是如何组合到一起的了。比如下面这个查询:

SELECT DISTINCT column, AGG_FUNC(column_or_expression), …
FROM mytable
    JOIN another_table
      ON mytable.column = another_table.column
    WHERE constraint_expression
    GROUP BY column
    HAVING constraint_expression
    ORDER BY column ASC/DESC
    LIMIT count OFFSET COUNT;

Each query begins with finding the data that we need in a database, and then filtering that data down into something that can be processed and understood as quickly as possible. Because each part of the query is executed sequentially, it's important to understand the order of execution so that you know what results are accessible where.
每个查询都是从在数据库中找到我们需要的数据开始,然后尽快把它过滤成能够被处理和理解的结果。因为查询的各个部分是按顺序执行的,理解执行顺序很重要,只有这样我们才知道在哪一步能访问到什么结果。

Query order of execution 查询的执行顺序

1. FROM and JOINs

The FROM clause, and subsequent JOINs are first executed to determine the total working set of data that is being queried. This includes subqueries in this clause, and can cause temporary tables to be created under the hood containing all the columns and rows of the tables being joined.
FROM 和后续的 JOIN(包括其中的子查询)最先执行,以此确定本次查询要用到的全部数据。这一步可能会在底层生成由参与合并的各个表的行和列组成的临时表。

2. WHERE

Once we have the total working set of data, the first-pass WHERE constraints are applied to the individual rows, and rows that do not satisfy the constraint are discarded. Each of the constraints can only access columns directly from the tables requested in the FROM clause. Aliases in the SELECT part of the query are not accessible in most databases since they may include expressions dependent on parts of the query that have not yet executed.
一旦确定了要用的数据,接下来就是第一轮的 WHERE 条件逐行判断,去掉不满足条件的行。每个限制条件都只能直接访问 FROM 里引入的表的列。此时通过 SELECT 起的别名在多数数据库里还无法使用,因为它们可能依赖于查询中尚未执行的部分。

3. GROUP BY

The remaining rows after the WHERE constraints are applied are then grouped based on common values in the column specified in the GROUP BY clause. As a result of the grouping, there will only be as many rows as there are unique values in that column. Implicitly, this means that you should only need to use this when you have aggregate functions in your query.
WHERE 执行完后剩下的行会按 GROUP BY 指定列中相同的值进行分组。分组后,结果的行数就等于该列唯一值的个数。说白了,一般只有在查询里用到聚合函数时才需要分组。

4. HAVING

If the query has a GROUP BY clause, then the constraints in the HAVING clause are then applied to the grouped rows, discard the grouped rows that don't satisfy the constraint. Like the WHERE clause, aliases are also not accessible from this step in most databases.
查询中有 GROUP BY 的时候,HAVING 的限制条件会在分组之后应用于分组后的行,去掉不满足条件的分组。和 WHERE 类似,此时别名在多数数据库里仍然不可用。

5. SELECT

Any expressions in the SELECT part of the query are finally computed.
SELECT 会在最后执行。

(这也就是前面说由 SELECT 生成的别名一直不可用的原因)

6. DISTINCT

Of the remaining rows, rows with duplicate values in the column marked as DISTINCT will be discarded.
剩下的行中,DISTINCT 作用的列中的重复行会被去掉。

7. ORDER BY

If an order is specified by the ORDER BY clause, the rows are then sorted by the specified data in either ascending or descending order. Since all the expressions in the SELECT part of the query have been computed, you can reference aliases in this clause.
如果 ORDER BY 指定了排序,那么行会升序或降序排列。由于这个时候查询中的 SELECT 部分已经全部执行完了,我们可以引用别名了。(哇,终于可以了么)

8. LIMIT / OFFSET

Finally, the rows that fall outside the range specified by the LIMIT and OFFSET are discarded, leaving the final set of rows to be returned from the query.
最后的最后,LIMIT 和 OFFSET 指定范围之外的行会被去掉,最后剩下的就是本次查询返回的结果了。

Conclusion 结论

Not every query needs to have all the parts we listed above, but a part of why SQL is so flexible is that it allows developers and data analysts to quickly manipulate data without having to write additional code, all just by using the above clauses.
并不是所有的查询都会包含上面说到的这些部分,但是 SQL 就是这么的灵活,以至于仅仅靠上面提到的这些查询从句,开发者和数据分析师就可以在不需要写其他的代码的情况下迅速操纵数据。(这个牛吹得可以!)


练习时间又到了。

Here ends our lessons on SELECT queries, congrats of making it this far! This exercise will try and test your understanding of queries, so don't be discouraged if you find them challenging. Just try your best.
关于 SELECT 查询的课程到这里就结束了,恭喜你已经走了这么远!本次练习会考察我们对于查询的理解,觉得有点难的话不要灰心,尽力做!

你们这么一说我还有点小忐忑呢....

数据还是我们滚瓜烂熟的电影票房数据:

4.movie

  1. Find the number of movies each director has directed
    查询每个导演的作品数。
    嗯,导演分组,电影求和:
SELECT Director, COUNT(Title) as Total FROM movies GROUP BY Director;

开始一直把 COUNT 写成 SUM 导致卡了好久 23333333。想想其实不是电影求和,而是计数。

  2. Find the total domestic and international sales that can be attributed to each director
    每个导演的总国内外票房。
    导演分组,票房分别求和:
SELECT Director, SUM(Domestic_sales + International_sales) as Total_sales FROM movies m
INNER JOIN Boxoffice b ON
    m.Id = b.Movie_id
GROUP BY Director;

好吧,英语理解的问题,我开始以为是国内外分别求和,又瞎浪费了一小会儿。

R 基础知识——数据类型

内容来自于看《R 语言实战》时做的笔记
2017-04-24

向量

向量是一个一维数组,用于存储数值型、字符型或逻辑型数据。执行组合功能的函数c()可用来创建向量;

> a<-c(1, 2, 5, 3, 6, 2, 4)
> b<-c("one", "two", "three")
> c<-c(TRUE,TRUE,TRUE,FALSE,TRUE,FALSE)

a 是数值型向量,b 是字符型向量,c 是逻辑向量。

  • 单个向量中的数据必须拥有相同的类型或模式(数值型、字符型或逻辑型)。同一向量中无法混杂不同模式的数据。
  • 通常在方括号中给定元素所处位置的数值,我们可以访问向量中的元素。
> a <- c(1, 2, 3, 4, 5, 6)
> a[3]
[1] 3
> a[c(1, 3, 5)]
[1] 1 3 5
> a[2:6]
[1] 2 3 4 5 6

矩阵

  • 矩阵是一个二维数组,只是每个元素都拥有相同的模式(数值型、字符型或逻辑型)
  • 通过matrix创建矩阵,一般使用格式为mymatrix <- matrix(vector,nrow=number_of_rows,ncol=number_of_columns, byrow=logical_value, dimnames=list(char_vector_rownames, char_vector_colnames)) 其中vector包含了矩阵的元素,nrowncol用以指定行和列的维数,dimnames包含了可选的、以字符型向量表示的行名和列名。选项byrow则表明矩阵应当按行填充(byrow=TRUE)还是按列填充(byrow=FALSE),默认情况下按列填充
  • 使用下标和方括号来选择矩阵中的行、列或元素。x[i,]指矩阵X中的第i行,x[,j]指矩阵X中的第j列,x[i,j]指第i行第j个元素。选择多行或多列时,下标i和j可为数值型向量
> cells <- c(1, 2, 3, 4)
> rnames <- c("R1","R2")
> cnames <- c("C1","C2")
> mymatrix <- matrix(cells,nrow=2,ncol=2,byrow=TRUE,dimnames=list(rnames,cnames))
> mymatrix
    C1 C2
R1   1  2
R2   3  4
> y <- matrix(1:20,nrow=4,ncol=5)
> y
       [,1]  [,2]  [,3]  [,4]  [,5]
[1,]      1     5     9    13    17
[2,]      2     6    10    14    18
[3,]      3     7    11    15    19
[4,]      4     8    12    16    20
> y[,1]
[1] 1 2 3 4
> y[2,2]
[1] 6
> y[1,c(4,5)]
[1] 13 17

数组

  • 数组(array)与矩阵类似,但是维度可以大于2。数组可通过array函数创建: myarray <- array(vector, dimensions, dimnames)。 其中vector包含了数组中的数据,dimensions是一个数值型向量,给出了各个维度下标的最大值,而dimnames是可选的、各维度名称标签的列表。
  • z<-array(1:24,c(2,3,4), dimnames=list(dim1,dim2,dim3)), c(2,3,4)表示二行三列四组
    数组与矩阵一样,只能拥有一种模式。
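
上面 z 的例子里 dim1、dim2、dim3 需要事先定义好,这里补一个可以直接跑的小例子(维度名称是我随手起的,仅作演示):

> dim1 <- c("A1","A2")
> dim2 <- c("B1","B2","B3")
> dim3 <- c("C1","C2","C3","C4")
> z <- array(1:24, c(2,3,4), dimnames=list(dim1,dim2,dim3))   # 2 行 3 列 4 组
> dim(z)
[1] 2 3 4
> z["A1","B2","C3"]   # 第 3 组里第 1 行第 2 列的元素
[1] 15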

数据框

  • 数据框(data frame)是R中用于储存数据的一种结构:列表示变量,行表示观测。在同一个数据框中可以储存不同类型的(如数值型、字符型)变量。数据框是用来存储数据集的主要数据结构。
  • 由于不同的列可以包含不同模式(数值型,字符型)的数据,数据框的概念更为符合现实情况,数据框是R中最常处理的数据结构。
  • 数据框可通过函数data.frame()创建,mydata <- data.frame(col1, col2, col3,...)其中的列向量col1,col2,col3...可为任何类型(如字符型、数值型或逻辑型), 每一列的名称可由函数 names 指定。每一列数据的模式必须唯一,不过可以将多个模式的不同列放到一起组成数据框。
> patientID <- c(1,2,3,4)
> age <- c(23,24,25,26)
> diabetes <- c("Type1","Type2","Type1","Type2")
> status <- c("Poor","Improved","Excellent","Poor")
> patientdata <- data.frame(patientID, age, diabetes,status)
> patientdata
    patientID age diabetes    status
1           1  23    Type1      Poor
2           2  24    Type2  Improved
3           3  25    Type1 Excellent
4           4  26    Type2      Poor
  • 选数据框中的元素的方式有若干种,可以使用前述的下标记号,或者直接指定列名;可以用$选取一个给定数据框中的某个特定变量;还可以生成糖尿病类型变量diabetes和病情变量status的列联表:
> patientdata[1:2]
    patientID age
1           1  23
2           2  24
3           3  25
4           4  26
> patientdata[,3:4]
    diabetes    status
1      Type1      Poor
2      Type2  Improved
3      Type1 Excellent
4      Type2      Poor
> patientdata[c("diabetes","status")
    diabetes    status
1      Type1      Poor
2      Type2  Improved
3      Type1 Excellent
4      Type2      Poor
> patientdata$age
[1] 23 24 25 26
> patienttable <- table(patientdata$diabetes,patientdata$status)
> patienttable
          Excellent Improved Poor
    Type1         1        0    1
    Type2         0        1    1
  • 在病例数据中,病人编号(patientID)用于区分数据集中不同的个体。在R中,实例标识符(case identifier)可通过数据框操作函数中的row.names选项指定。
> patientdata <- data.frame(patientID, age, diabetes, status, row.names = patientID)
> patientdata
    patientID age diabetes    status
1           1  23    Type1      Poor
2           2  24    Type2  Improved
3           3  25    Type1 Excellent
4           4  26    Type2      Poor
> patientdata <- data.frame(patientID, age, diabetes, status, row.names = age)
> patientdata
     patientID age diabetes    status
    23         1  23    Type1      Poor
    24         2  24    Type2  Improved
    25         3  25    Type1 Excellent
    26         4  26    Type2      Poor

因子

  • 变量可归结为名义型、有序型或连续型变量。名义型变量是没有顺序之分的类别变量。糖尿病类型Diabetes(Types1,Type2)是名义型变量的一例;
  • 有序型变量表示一种顺序关系,而非数量关系,病情Status(poor,improved,excellent)是顺序型变量的一个佳例,病情为poor的病人状态不如improved的病人,但并不知道相差多少;
  • 连续型变量可以呈现为某个范围内的任意值,并同时表示了顺序和数量。年龄Age就是一个连续型变量。
  • 类别(名义型)变量和有序类别(有序型)变量在R中称为因子(factor)。因子在R中非常重要,因为它决定了数据的分析方式以及如何进行视觉呈现。
  • 函数factor()以一个整数向量的形式存储类别值,整数的取值范围1....k(其中k是名义型变量中唯一值的个数),同时一个由字符串(原始值)组成的内部向量将映射到这些整数上。
  • 举例:假设有向量:diabetes <- c("Type1", "Type2", "Type1", "Type1")语句diabetes <- factor(diabetes)将此向量储存为(1,2,1,1),并在内部将其关联为1=Type12=Type2(具体赋值根据字母顺序而定)。针对向量diabetes进行的任何分析都会将其作为名义型变量,并自动选择合适的统计方法;
  • 表示有序型变量,需要为函数factor()指定参数ordered=TRUE。给定向量 status <- c("Poor", "Improved", "Excellent", "Poor"),语句status <- factor(status, ordered=TRUE)会将向量编码为(3,2,1,3),并在内部将这些值关联为1=Excellent、2=Improved,以及3=Poor。另外,针对此向量进行的任何分析都会将其作为有序型变量对待,并自动选择合适的统计方法;对于字符型变量,因子的水平默认依字母的顺序创建,但按默认的字母顺序排序的因子很少能够让人满意。可以通过指定levels选项来覆盖默认排序,status <- factor(status, ordered=TRUE, levels=c("Poor","Improved","Excellent")),各水平的赋值将为1=Poor、2=Improved、3=Excellent。请保证指定的水平与数据中真实值相匹配,因为任何在数据中出现而未在参数中列举的数据都将被设为缺失值。
> diabetes <- factor(diabetes)
> diabetes
    [1] Type1 Type2 Type1 Type2
    Levels: Type1 Type2
    > str(patientdata)
    'data.frame':    4 obs. of  4 variables:
    $ patientID: num  1 2 3 4
    $ age      : num  23 24 25 26
    $ diabetes : Factor w/ 2 levels "Type1","Type2": 1 2 1 2
    $ status   : Factor w/ 3 levels "Excellent","Improved",..: 3 2 1 3
> status <- factor(status,ordered = TRUE)
> status
    [1] Poor Improved  Excellent Poor
    Levels: Excellent < Improved < Poor
    > str(patientdata)
    'data.frame':    4 obs. of  4 variables:
    $ patientID: num  1 2 3 4
    $ age      : num  23 24 25 26
    $ diabetes : Factor w/ 2 levels "Type1","Type2": 1 2 1 2
    $ status   : Factor w/ 3 levels "Excellent","Improved",..: 3 2 1 3
> status <- factor(status,ordered = TRUE,levels = c("Poor","Improved","Excellent"))
> status
    [1] Poor  Improved  Excellent Poor
    Levels: Poor < Improved < Excellent
> str(patientdata)
    'data.frame':    4 obs. of  4 variables:
    $ patientID: num  1 2 3 4
    $ age      : num  23 24 25 26
    $ diabetes : Factor w/ 2 levels "Type1","Type2": 1 2 1 2
    $ status   : Factor w/ 3 levels "Excellent","Improved",..: 3 2 1 3
> summary(patientdata)
     patientID         age         diabetes       status
    Min.   :1.00   Min.   :23.00   Type1:2   Excellent:1
    1st Qu.:1.75   1st Qu.:23.75   Type2:2   Improved :1
    Median :2.50   Median :24.50             Poor     :2
    Mean   :2.50   Mean   :24.50
    3rd Qu.:3.25   3rd Qu.:25.25
    Max.   :4.00   Max.   :26.00

列表

  • 列表(list)是 R 的数据类型中最为复杂的一种。一般来说,列表就是一些对象(或成分, component)的有序集合。列表允许整合若干(可能无关的)对象到单个对象名下。例如,列表中可能是若干向量,矩阵,数据框,甚至是其他列表的组合。可以使用函数list()创建列表。mylist <- list(object1,object2,...);其中的对象可以是目前为止讲到的任何结构,还可以为列表中的对象命名:mylist <-list(name1=object1, name2=object2, ...)
  • 可以通过在双重括号中指明代表某个成分的数字或名称来访问列表中的元素。此例中的mylist[[2]]mylist[["name2"]]均指第二个元素。
  • 由于两个原因,列表成为 R 中的重要数据结构。首先,列表允许以一种简单的方式组织和重新调用不相干的信息;其次,许多 R 函数的运行结果都是以列表的形式返回的。需要取出其中哪些成分由分析人员决定。
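
这一节没有贴示例代码,按上面的说明补一个小例子(对象内容是随手编的,仅作演示):

> g <- "My First List"
> h <- c(25, 26, 18, 39)
> j <- matrix(1:10, nrow = 5)
> k <- c("one", "two", "three")
> mylist <- list(title = g, ages = h, j, k)   # 前两个成分有名字,后两个没有
> mylist[[2]]
[1] 25 26 18 39
> mylist[["ages"]]
[1] 25 26 18 39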

SQLBolt 课程学习笔记五(番外篇)

SQL Topic: Subqueries 子查询

You might have noticed that even with a complete query, there are many questions that we can't answer about our data without additional post, or pre, processing. In these cases, you can either make multiple queries and process the data yourself, or you can build a more complex query using SQL subqueries.
一次查询回答不了很多问题,这时候要么多次查询,要么搞个复杂的查询。例子:

Example: General subquery

Lets say your company has a list of all Sales Associates, with data on the revenue that each Associate brings in, and their individual salary. Times are tight, and you now want to find out which of your Associates are costing the company more than the average revenue brought per Associate.

First, you would need to calculate the average revenue all the Associates are generating:

SELECT AVG(revenue_generated)
FROM sales_associates;

And then using that result, we can then compare the costs of each of the Associates against that value. To use it as a subquery, we can just write it straight into the WHERE clause of the query:

SELECT *
FROM sales_associates
WHERE salary > 
   (SELECT AVG(revenue_generated)
    FROM sales_associates);

As the constraint is executed, each Associate's salary will be tested against the value queried from the inner subquery.

例子懒得翻译了。

A subquery can be referenced anywhere a normal table can be referenced. Inside a FROM clause, you can JOIN subqueries with other tables, inside a WHERE or HAVING constraint, you can test expressions against the results of the subquery, and even in expressions in the SELECT clause, which allow you to return data directly from the subquery. They are generally executed in the same logical order as the part of the query that they appear in, as described in the last lesson.
能引用表格的地方就可以使用子查询。FROM 里子查询可以和其他表格 JOIN;在 WHERE 或者 HAVING 语句里,可以针对子查询的结果做判断;甚至 SELECT 里的表达式也可以直接返回子查询的数据。子查询基本上按它所在的那部分查询的逻辑顺序执行,也就是上一课讲到的顺序。

Because subqueries can be nested, each subquery must be fully enclosed in parentheses in order to establish proper hierarchy. Subqueries can otherwise reference any tables in the database, and make use of the constructs of a normal query (though some implementations don't allow subqueries to use LIMIT or OFFSET).
子查询可以嵌套,因此每个子查询都必须用括号完整包起来以保证正确的层次关系。除此之外,子查询可以引用数据库中的任何表,并使用普通查询的各种写法(不过有些实现不允许子查询使用 LIMIT 或 OFFSET)。

Correlated subqueries

A more powerful type of subquery is the correlated subquery in which the inner query references, and is dependent on, a column or alias from the outer query. Unlike the subqueries above, each of these inner queries need to be run for each of the rows in the outer query, since the inner query is dependent on the current outer query row.
更为强大的是关联子查询(correlated subquery):内查询引用并依赖于外查询的某一列或别名。和上面的子查询不同,由于内查询依赖于外查询的当前行,内查询必须对外查询的每一行都执行一次。

说得有点绕,意思大概清楚,翻译得有点词不达意。看例子吧:

Example: Correlated subquery

Instead of the list of just Sales Associates above, imagine if you have a general list of Employees, their departments (engineering, sales, etc.), revenue, and salary. This time, you are now looking across the company to find the employees who perform worse than average in their department.

For each employee, you would need to calculate their cost relative to the average revenue generated by all people in their department. To take the average for the department, the subquery will need to know what department each employee is in:

SELECT *
FROM employees
WHERE salary > 
   (SELECT AVG(revenue_generated)
    FROM employees AS dept_employees
    WHERE dept_employees.department = employees.department);

These kinds of complex queries can be powerful, but also difficult to read and understand, so you should take care using them. If possible, try and give meaningful aliases to the temporary values and tables. In addition, correlated subqueries can be difficult to optimize, so performance characteristics may vary across different databases.
这种复杂的查询十分强大,但同时也降低了可读性、提高了理解难度,所以用的时候应该多加小心,并且尽量给临时的值和表格起上有意义的别名。另外,关联子查询很难优化,因此在不同数据库上的性能表现可能也不尽相同。

Existence tests

When we introduced WHERE constraints in Lesson 2: Queries with constraints, the IN operator was used to test whether the column value in the current row existed in a fixed list of values. In complex queries, this can be extended using subqueries to test whether a column value exists in a dynamic list of values.
在第 2 课介绍带限制条件的查询时,我们用 IN 来判断某一列中当前行的值是否在一个固定的列表中。在复杂的查询中,可以借助子查询把这个判断扩展到一个动态生成的列表上。语法:

-- Select query with subquery constraint
SELECT *, …
FROM mytable
WHERE column
    IN/NOT IN (SELECT another_column
               FROM another_table);

When doing this, notice that the inner subquery must select for a column value or expression to produce a list that the outer column value can be tested against. This type of constraint is powerful when the constraints are based on current data.
可以注意到,此时内层子查询需要选出某一列的值或者一个表达式,为外层的列值提供判断依据。当限制条件需要基于当前数据动态生成时,这种写法就非常有用了。

SQL Topic: Unions, Intersections & Exceptions

When working with multiple tables, the UNION and UNION ALL operator allows you to append the results of one query to another assuming that they have the same column count, order and data type. If you use the UNION without the ALL, duplicate rows between the tables will be removed from the result.
同时操作多个表格的时候,UNION 和 UNION ALL 可以在多个查询的结果具有相同的列数、列的顺序和数据类型时把它们拼接到一起。只使用 UNION 不加 ALL 的时候,重复的行会被移除。

-- Select query with set operators
SELECT column, another_column
   FROM mytable
UNION / UNION ALL / INTERSECT / EXCEPT
SELECT other_column, yet_another_column
   FROM another_table
ORDER BY column DESC
LIMIT n;

In the order of operations as defined in Lesson 12: Order of execution, the UNION happens before the ORDER BY and LIMIT. It's not common to use UNIONs, but if you have data in different tables that can't be joined and processed, it can be an alternative to making multiple queries on the database.
在第 12 节课提到的执行顺序里,UNION 的执行早于 ORDER BY 和 LIMIT。使用 UNION 并不是很常见,但是如果你的数据分散在无法合并和处理的多个表格里的话,这确实是一种避免多次查询的办法。

Similar to the UNION, the INTERSECT operator will ensure that only rows that are identical in both result sets are returned, and the EXCEPT operator will ensure that only rows in the first result set that aren't in the second are returned. This means that the EXCEPT operator is query order-sensitive, like the LEFT JOIN and RIGHT JOIN.
与 UNION 类似,INTERSECT 只会返回两个结果集中都存在的行,EXCEPT 则只会返回第一个结果中有而第二个结果中没有的行。也就是说 EXCEPT 是查询顺序敏感的操作,和 LEFT JOIN、RIGHT JOIN 一样。

Both INTERSECT and EXCEPT also discard duplicate rows after their respective operations, though some databases also support INTERSECT ALL and EXCEPT ALL to allow duplicates to be retained and returned.
INTERSECT 和 EXCEPT 也都会去掉结果中的重复行,但有的数据库支持通过 INTERSECT ALL 和 EXCEPT ALL 保留重复行。

THE END


这次真的结束了,终于。
但是这一篇翻译得很不走心,几乎是字面翻译。有空再改吧。

《R Graphic Cookbook》第二章学习笔记

2017-06-25 15:20:06

设置各种元素的颜色

Setting colors for text elements: axis annotations, labels, plot titles, and legends

plot(rnorm(100), 
     main="Plot Title",
     col.axis="blue",
     col.lab="red",
     col.main="darkblue",
     col='darkgreen')

得到下图:

1

R 内置的默认颜色组:

palette()
[1] "black"   "red"     "green3"  "blue"    "cyan"    "magenta" "yellow"  "gray" 
palette(c("red","blue","green","orange"))
palette()
[1] "red"    "blue"   "green"  "orange"
# To revert back to the default palette type:
palette("default")

设置字体

字体一般通过 par() 来设置,例如:

par(family="serif",font=2)

A font is specified in two parts: a font family (such as Helvetica or Arial) and a font face within that family (such as bold or italic).
The available font families vary by operating system and graphics devices. So R provides some proxy values which are mapped on to the relevant available fonts irrespective of the system. Standard values for family are "serif", "sans", and "mono".

The font argument takes numerical values: 1 corresponds to plain text (the default), 2 to bold face, 3 to italic, and 4 to bold italic.

The fonts for axis annotations, labels, and plot main title can be set separately using the font.axis, font.lab, and font.main arguments respectively.
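
按上面说的,可以用 par() 把标题、坐标轴标签和刻度的字体分别设好再画图,比如(字体编号只是示例):

par(font.main = 4, font.lab = 3, font.axis = 2, family = "serif")  # 标题粗斜体,标签斜体,刻度粗体
plot(rnorm(100), main = "Plot Title", xlab = "Index", ylab = "Value")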

点和线的种类和样式

点的种类通过pch参数设置,共25种:

par(mfrow = c(5, 5))
for(i in 1:5){
  if(i < 5){
    for(j in 1:5){plot(1, pch = (i-1)*5 + j, cex = 2, col = 'black')}} # pch设置点的样式,cex设置点的大小
  else
    for(j in 1:5){plot(1, pch = (i-1)*5 + j, cex = 2, col = 'darkgreen', bg = 'red')}
}

4

线的样式和粗细分别通过ltylwd来设置:

plot(rain$Tokyo,
     ylim=c(0,250),
     main="Monthly Rainfall in major cities",
     xlab="Month of Year",
     ylab="Rainfall (mm)",
     type="l",
     lty=1,
     lwd=2)
lines(rain$NewYork,lty=2,lwd=2)
lines(rain$London,lty=3,lwd=2)
lines(rain$Berlin,lty=4,lwd=2)
legend("top",
       legend=c("Tokyo","New York","London","Berlin"),
       ncol=4,
       cex=0.8,
       bty="n",
       lty=1:4,
       lwd=2)

得到下图:

2

Line type number values correspond to types of lines:

  • 0: blank
  • 1: solid (default)
  • 2: dashed
  • 3: dotted
  • 4: dotdash
  • 5: longdash
  • 6: twodash

We can also use the character strings instead of numbers, for example, lty="dashed" instead of lty=2.

设置坐标轴标签和刻度

可以分别通过xaxpyaxp设置坐标轴的范围和间距,格式为c(min,max,n)

plot(rnorm(100),xaxp=c(0,100,10),yaxp=c(-2,2,4))

得到:

3

las参数可以设置坐标轴刻度标识与轴的方向关系:

  • 0: always parallel to the axis (default)

  • 1: always horizontal

  • 2: always perpendicular to the axis

  • 3: always vertical
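
比如想让刻度数字始终水平显示,可以这样(las 的取值只是示例):

par(las = 1)          # 刻度数字始终水平
plot(rnorm(100))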

SQLBolt 课程学习笔记一(1-5 课)

终究还是要回来好好补课,看文档太无聊。看网上很多人都推荐 SQLBolt 这个在线网站学习课程,所以今天就打算来看这个了。

toc

整个课程包括介绍 + 18 节课 + 结束课程。网站不需要注册,每节课包括简单的知识点介绍和练习题。做练习题时有实时的命令错误提示和结果预览,只有答对才能继续下一题。实在不会做也有 Solution 放在旁边,非常好。

好的,开始吧。

Introduction to SQL

Welcome to SQLBolt, a series of interactive lessons and exercises designed to help you quickly learn SQL right in your browser.
在浏览器里学习 SQL,好。

SQL, or Structured Query Language, is a language designed to allow both technical and non-technical users query, manipulate, and transform data from a relational database. And due to its simplicity, SQL databases provide safe and scalable storage for millions of websites and mobile applications.
才知道 SQL 是 Structured Query Language 的缩写,好吧。

Since most users will be learning SQL to interact with an existing database, the lessons begin by introducing you to the various parts of an SQL query. The later lessons will then show you how to alter a table (or schema) and create new tables from scratch.
课程首先教我们怎么查询既有数据库,后面再教怎么创建和修改数据库、表格和 Schema。

By the end, we hope you will be able to have a strong foundation for using SQL in your own projects and beyond.
好的,开始上课吧。

SQL Lesson 1: SELECT queries 101

第一节课学取数据,语法就很简单咯:

SELECT column, another_column, …
FROM mytable;

练习题是一个关于电影数据的表格:

movies

来看题目:

  1. Find the title of each film
    简单:SELECT Title FROM movies;

  2. Find the director of each film
    一样,SELECT Director FROM movies;

  3. Find the title and director of each film
    1 + 2 咯:SELECT Title, Director FROM movies;

  4. Find the title and year of each film
    和 3 差不多:SELECT Title, Year FROM movies;

  5. Find all the information about each film
    全部,SELECT * FROM movies;

第一课轻松收工,撒花。

SQL Lesson 2: Queries with constraints (Pt. 1)

第二课主要是学习带限制性语句的查询。

Now we know how to select for specific columns of data from a table, but if you had a table with a hundred million rows of data, reading through all the rows would be inefficient and perhaps even impossible.
第一节课学习的是怎么查询表的特定列。但是当一个表极其大,有 N 多行的时候,我们可能只想得到某些特定的行,这时候就要用限制性查询语句了。

In order to filter certain results from being returned, we need to use a WHERE clause in the query. The clause is applied to each row of data by checking specific column values to determine whether it should be included in the results or not.
限制性查询通过 WHERE 从句实现,它会在数据的每一行执行判断来决定结果应该包含哪些行。基本语法为:

SELECT column, another_column, …
FROM mytable
WHERE condition
    AND/OR another_condition
    AND/OR …;

More complex clauses can be constructed by joining numerous AND or OR logical keywords.
复杂的限制条件可以通过使用 AND 或者 OR 组成。以及还有其他一些:

  • =, !=, <, <=, >, >=:Standard numerical operators。例:col_name != 4
  • BETWEEN … AND …:Number is within range of two values (inclusive)。例:col_name BETWEEN 1.5 AND 10.5
  • NOT BETWEEN … AND …:Number is not within range of two values (inclusive)。例:col_name NOT BETWEEN 1 AND 10
  • IN (…):Number exists in a list。例:col_name IN (2, 4, 6)
  • NOT IN (…):Number does not exist in a list。例:col_name NOT IN (1, 3, 5)

As you might have noticed by now, SQL doesn't require you to write the keywords all capitalized, but as a convention, it helps people distinguish SQL keywords from column and tables names, and makes the query easier to read.
SQL 不要求命令大写,但是习惯上大家为了可读性都惯例性的把命令大写。好的,当然。

又到了紧张刺激的练习题时间。

表格还是那张表格:

movies

  1. Find the movie with a row id of 6
    row id 是 6 的电影,简单,SELECT Id, Title FROM movies WHERE Id=6;

  2. Find the movies released in the years between 2000 and 2010
    BETWEEN 嘛,好说,SELECT Title, Year FROM movies WHERE Year Between 2000 and 2010;

  3. Find the movies not released in the years between 2000 and 2010
    NOT BETWEEN 嘛,也好说,SELECT Title, Year FROM movies WHERE Year NOT Between 2000 and 2010;

  4. Find the first 5 Pixar movies and their release year
    前 5 个,LIMIT 超纲了啊老师,SELECT Title, Year FROM movies LIMIT 5;

学完走人

SQL Lesson 3: Queries with constraints (Pt. 2)

第三课继续学习限制性查询语句。

When writing WHERE clauses with columns containing text data, SQL supports a number of useful operators to do things like case-insensitive string comparison and wildcard pattern matching.
在 WHERE 从句中处理含文本数据的列时,SQL 支持大小写敏感/不敏感的字符串比较以及通配符模式匹配,一些例子:

  • =:Case sensitive exact string comparison (notice the single equals)。例:col_name = "abc"
  • != or <>:Case sensitive exact string inequality comparison。例:col_name != "abcd"
  • LIKE:Case insensitive exact string comparison。例:col_name LIKE "ABC"
  • NOT LIKE:Case insensitive exact string inequality comparison。例:col_name NOT LIKE "ABCD"
  • %:Used anywhere in a string to match a sequence of zero or more characters (only with LIKE or NOT LIKE)。例:col_name LIKE "%AT%"(matches "AT", "ATTIC", "CAT" or even "BATS")
  • _:Used anywhere in a string to match a single character (only with LIKE or NOT LIKE)。例:col_name LIKE "AN_"(matches "AND", but not "AN")
  • IN (…):String exists in a list。例:col_name IN ("A", "B", "C")
  • NOT IN (…):String does not exist in a list。例:col_name NOT IN ("D", "E", "F")

趁热打铁练习一下:

  1. Find all the Toy Story movies
    Toy Story 系列电影有 2、3,所以就要匹配 Toy Story* 这样的模式,SQL 用 %,即 SELECT * FROM movies WHERE Title LIKE "Toy Story%";

  2. Find all the movies directed by John Lasseter
    John Lasseter 的电影,得用 =,保不齐有个 John Lasseter Jr 之类的,所以等于靠谱点,SELECT * FROM movies WHERE Director = "John Lasseter";

  3. Find all the movies (and director) not directed by John Lasseter
    SELECT * FROM movies WHERE Director != "John Lasseter";

  4. Find all the WALL-* movies
    和第 1 题很像,SELECT * FROM movies WHERE Title LIKE "WALL-%";

NEXT -->>

SQL Lesson 4: Filtering and sorting Query results

对查询结果作筛选和排序

Even though the data in a database may be unique, the results of any particular query may not be – take our Movies table for example, many different movies can be released the same year. In such cases, SQL provides a convenient way to discard rows that have a duplicate column value by using the DISTINCT keyword.
虽然数据库里的数据每行可能都是唯一的,但是查询结果就不一定了,比如我们用的电影这个表格里有很多电影都是在相同年份发行的。这时候要对某列取唯一值就得用 DISTINCT 关键字了。语法:

SELECT DISTINCT column, another_column, …
FROM mytable
WHERE condition(s);

Since the DISTINCT keyword will blindly remove duplicate rows, we will learn in a future lesson how to discard duplicates based on specific columns using grouping and the GROUP BY clause.
DISTINCT 是简单粗暴的直接移除重复行,后面我们会学习通过 GROUP BY 从句来处理重复值。

Ordering results

Unlike our neatly ordered table in the last few lessons, most data in real databases are added in no particular column order. As a result, it can be difficult to read through and understand the results of a query as the size of a table increases to thousands or even millions rows.

To help with this, SQL provides a way to sort your results by a given column in ascending or descending order using the ORDER BY clause.
现实世界的数据往往一团糟没有很好的排序,我们经常需要针对某一列排序来更好地组织结果,这就要用到 ORDER BY 从句了。语法:

SELECT column, another_column, …
FROM mytable
WHERE condition(s)
ORDER BY column ASC/DESC;

When an ORDER BY clause is specified, each row is sorted alpha-numerically based on the specified column's value. In some databases, you can also specify a collation to better sort data containing international text.
ORDER BY 会按指定列的值以字母数字顺序(升序或降序)排序。

Limiting results to a subset

Another clause which is commonly used with the ORDER BY clause are the LIMIT and OFFSET clauses, which are a useful optimization to indicate to the database the subset of the results you care about.
The LIMIT will reduce the number of rows to return, and the optional OFFSET will specify where to begin counting the number rows from.
LIMIT 和 OFFSET 经常和 ORDER BY 搭配使用。前者指定取多少行,后者指定从第几行开始数。语法:

SELECT column, another_column, …
FROM mytable
WHERE condition(s)
ORDER BY column ASC/DESC
LIMIT num_limit OFFSET num_offset;

If you think about websites like Reddit or Pinterest, the front page is a list of links sorted by popularity and time, and each subsequent page can be represented by sets of links at different offsets in the database. Using these clauses, the database can then execute queries faster and more efficiently by processing and returning only the requested content.
想想 Reddit、 Pinterest 之类的网站,首页一般就是根据热度和时间排序的一堆链接(ORDER BY + LIMIT 的结果),后续页的链接就是在前面的页面基础上 OFFSET 出来的。用这些从句使得数据库查询每一次都只处理需要的结果,应而查询速度更快效率更高。

If you are curious about when the LIMIT and OFFSET are applied relative to the other parts of a query, they are generally done last after the other clauses have been applied. We'll touch more on this in Lesson 12: Order of execution after introducting a few more parts of the query.
你可能很好奇 LIMIT 和 OFFSET 相对于查询其他部分的执行先后顺序:事实上它们基本上是在其他从句执行之后才执行的。后面的第 12 课会专门讲查询语句的执行顺序。

一课一练时间到。

表格还是那张表格,这样子:

movies

题目:

  1. List all directors of Pixar movies (alphabetically), without duplicates
    不重复的列出所有导演并排序,SELECT DISTINCT Director FROM movies ORDER BY Director;

  2. List the last four Pixar movies released (ordered from most recent to least)
    按时间从近到远列出最新的 4 部电影,SELECT * FROM movies ORDER BY Year DESC LIMIT 4;

  3. List the first five Pixar movies sorted alphabetically
    排序后列出前 5 部电影,SELECT * FROM movies ORDER BY Title LIMIT 5;

  4. List the next five Pixar movies sorted alphabetically
    3 的基础上下 5 部,那就是 OFFSET 了,SELECT * FROM movies ORDER BY Title LIMIT 5 OFFSET 5;

SQL Review: Simple SELECT Queries

第五节课是复习。前面学的:

SELECT column, another_column, …
FROM mytable
WHERE condition(s)
ORDER BY column ASC/DESC
LIMIT num_limit OFFSET num_offset;

In the exercise below, you will be working with a different table. This table instead contains information about a few of the most populous cities of North America including their population and geo-spatial location in the world.
这次练习会用到跟前面不同的一张表格,这张表格是北美一些最大的城市的地理位置和人口情况:

movies

Positive latitudes correspond to the northern hemisphere, and positive longitudes correspond to the eastern hemisphere. Since North America is north of the equator and west of the prime meridian, all of the cities in the list have positive latitudes and negative longitudes.
经度和纬度的正值表示东经和北纬。由于北美在西、北半球,所以表格里的城市都是负经度和正纬度的。

Try and write some queries to find the information requested in the tasks you know. You may have to use a different combination of clauses in your query for each task. Once you're done, continue onto the next lesson to learn about queries that span multiple tables.
这次的练习需要组合使用前面学过的东西了。有点小紧张呢,来吧。

  1. List all the Canadian cities and their populations
    所有 Canada 城市及其人口,SELECT City, Country, Population FROM north_american_cities WHERE Country='Canada';

  2. Order all the cities in the United States by their latitude from north to south
    美国城市按纬度从北到南,由于北美纬度都是正值,北到南那就是从大到小即降序咯,SELECT * FROM north_american_cities WHERE Country='United States' ORDER BY Latitude DESC;

  3. List all the cities west of Chicago, ordered from west to east
    Chicago 以西的城市从西到东排。经度全是负值,越往西负值越大(负数越小),那就是比 Chicago 经度更小的从小往大排(升序)咯,SELECT * FROM north_american_cities WHERE Longitude < -87.629798 ORDER BY Longitude ASC;。Chicago 的经度得自己手动查询输入,想起了高中考试不给原子质量表和分子质量表得自己死记,差评。

  4. List the two largest cities in Mexico (by population)
    Mexico 人口最大的两个城市,SELECT * FROM north_american_cities WHERE Country='Mexico' ORDER BY Population DESC LIMIT 2;

  5. List the third and fourth largest cities (by population) in the United States and their population
    美国第 3、4 大人口市,嗯,考点 OFFSET,SELECT * FROM north_american_cities WHERE Country='United States' ORDER BY Population DESC LIMIT 2 OFFSET 2;

1 - 5 课上完,大部分东西之前接触过,还比较轻松。这一篇先写到这里吧。

关于源码编译的基础知识 via LinuxSir (下篇)

2017-05-18 21:41:38

首先说下 /etc/ld.so.conf:
这个文件记录了编译时使用的动态链接库的路径。
默认情况下,编译器只会使用/lib/usr/lib这两个目录下的库文件
如果你安装了某些库,比如在安装gtk+-2.4.13时它会需要glib-2.0 >= 2.4.0, 辛苦的安装好glib后
没有指定--prefix=/usr这样glib库就装到了/usr/local下,而又没有在/etc/ld.so.conf中添加/usr/local/lib
这个搜索路径,所以编译gtk+-2.4.13就会出错了
对于这种情况有两种方法解决:

  1. 在编译glib-2.4.x时,指定安装到/usr下,这样库文件就会放在/usr/lib中,gtk就不会找不到需要的库文件了
    对于安装库文件来说,这是个好办法,这样也不用设置PKG_CONFIG_PATH了 (稍后说明)
  2. /usr/local/lib加入到/etc/ld.so.conf中,这样安装gtk时就会去搜索/usr/local/lib, 同样可以找到需要的库
    /usr/local/lib加入到/etc/ld.so.conf也是必须的,这样以后安装东东到local下,就不会出现这样的问题了。
    将自己可能存放库文件的路径都加入到/etc/ld.so.conf中是明智的选择

再来看看 ldconfig 是个什么东东吧:
它是一个程序,通常它位于/sbin下,是root用户使用的东东。具体作用及用法可以man ldconfig查到
简单的说,它的作用就是将/etc/ld.so.conf列出的路径下的库文件 缓存到/etc/ld.so.cache以供使用
因此当安装完一些库文件(例如刚安装好 glib),或者修改 ld.so.conf 增加了新的库路径后,需要运行一下 /sbin/ldconfig,使所有的库文件都被缓存到 ld.so.cache 中。如果没做,即使库文件明明就在 /usr/lib 下,也是不会被使用的,结果编译过程中报错说缺少 xxx 库,去查看却发现明明就在那放着。所以
切记改动库文件后一定要运行一下ldconfig,在任何目录下运行都可以。

再来说说 PKG_CONFIG_PATH 这个变量吧:

经常在论坛上看到有人问"为什么我已经安装了glib-2.4.x, 但是编译gtk+-2.4.x还是提示glib版本太低阿?
为什么我安装了glib-2.4.x,还是提示找不到阿?。。。。。。"都是这个变量搞的鬼。
先来看一个编译过程中出现的错误 (编译gtk+-2.4.13):

checking for pkg-config... /usr/bin/pkg-config checking for glib-2.0 >= 2.4.0 atk >= 1.0.1 pango >= 1.4.0... Package glib-2.0 was not found in the pkg-config search path. 
Perhaps you should add the directory containing `glib-2.0.pc' to the PKG_CONFIG_PATH environment  variable
No package 'glib-2.0' found 

configure: error: Library requirements (glib-2.0 >= 2.4.0 atk >= 1.0.1 pango >= 1.4.0) not met; consider adjusting the PKG_CONFIG_PATH environment variable if your libraries are in a nonstandard prefix so pkg-config can find them. 
[root@NEWLFS gtk+-2.4.13]# 

很明显,上面这段说明,没有找到glib-2.4.x, 并且提示应该将glib-2.0.pc加入到PKG_CONFIG_PATH下。
究竟这个pkg-config目录PKG_CONFIG_PATH变量glib-2.0.pc文件 是做什么的呢?
先说说它是哪冒出来的,当安装了pkgconfig-x.x.x这个包后,就多出了pkg-config,它就是需要PKG_CONFIG_PATH的东东
来看一段说明:

The pkgconfig package contains tools for passing the include path and/or library paths to build tools during the make file execution.
pkg-config is a function that returns meta information for the specified library.
The default setting for PKG_CONFIG_PATH is /usr/lib/pkgconfig because of the prefix we use to install pkgconfig. You may add to PKG_CONFIG_PATH by exporting additional paths on your system where pkgconfig files are installed. Note that PKG_CONFIG_PATH is only needed when compiling packages, not during run-time.

我想看过这段说明后,你已经大概了解了它是做什么的吧。
其实pkg-config就是向configure程序提供系统信息的程序,比如软件的版本、库的版本啦、库的路径,等等
这些信息只是在编译其间使用。你可以 ls /usr/lib/pkgconfig 下,会看到许多的*.pc, 用文本编辑器打开
会发现类似下面的信息:

prefix=/usr 
exec_prefix=${prefix} 
libdir=${exec_prefix}/lib 
includedir=${prefix}/include 

glib_genmarshal=glib-genmarshal 
gobject_query=gobject-query 
glib_mkenums=glib-mkenums 

Name: GLib 
Description: C Utility Library 
Version: 2.4.7 
Libs: -L${libdir} -lglib-2.0 
Cflags: -I${includedir}/glib-2.0 -I${libdir}/glib-2.0/include 

明白了吧,编译期间configure就是靠这些信息判断你的软件版本是否符合要求。并且得到这些东东所在的位置,要不去哪里找呀。
不用我说你也知道为什么会出现上面那些问题了吧。 解决的办法很简单,设定正确的PKG_CONFIG_PATH,假如将
glib-2.x.x装到了/usr/local/下,那么glib-2.0.pc就会在 /usr/local/lib/pkgconfig下, 将这个路径添加
PKG_CONFIG_PATH下就可以了。并且确保configure找到的是正确的glib-2.0.pc, 将其他的lib/pkgconfig目录glib-2.0.pc干掉就是啦 (如果有的话 ) 。
设定好后可以加入到~/.bashrc中,例如:

PKG_CONFIG_PATH=/opt/kde-3.3.0/lib/pkgconfig:/usr/lib/pkgconfig:/usr/local/pkgconfig: /usr/X11R6/lib/pkgconfig 
[root@NEWLFS ~]#echo $PKG_CONFIG_PATH 
/opt/kde-3.3.0/lib/pkgconfig:/usr/lib/pkgconfig:/usr/local/pkgconfig:/usr/X11R6/lib/pkgconfig 

另外./configure通过,make出错,遇到这样的问题比较难办,只能凭经验查找原因,比如某个头文件没有找到,
这时候要顺着出错的位置一行一行往上找错,比如显示 xxxx.h: No such file or directory 说明缺少头文件,
然后去google搜。
或者找到感觉有价值的错误信息,拿到google去搜,往往会找到解决的办法。还是开始的那句话,要仔细看README, INSTALL
一:编译完成后,输入echo $? 如果返回结果为0,则表示正常结束,否则就出错了 :(
echo $? 表示 检查上一条命令的退出状态,程序正常退出 返回0,错误退出返回非0。
二:编译时,可以用&&连接命令, && 表示"当前一条命令正常结束,后面的命令才会执行",就是"与"啦。
这个办法很好,既节省时间,又可防止出错。例:

./configure --prefix=/usr && make && make install 

实例:
编译DOSBOX时出现cdrom.h:20:23: SDL_sound.h: No such file or directory
于是下载安装 SDL_sound,很顺利,没有指定安装路径,于是默认安装到了/usr/local/
当编译DOSBOX make时,出现如下错误:

if g++ -DHAVE_CONFIG_H -I. -I. -I../.. -I../../include -I/usr/include/SDL -D_REENTRANT -march=pentium4 -O3 -pipe -fomit-frame-pointer -MT dos_programs.o -MD -MP -MF ".deps/dos_programs.Tpo" -c -o dos_programs.o dos_programs.cpp; \ 
then mv -f ".deps/dos_programs.Tpo" ".deps/dos_programs.Po"; else rm -f ".deps/dos_programs.Tpo"; exit 1; fi 
In file included from dos_programs.cpp:30: 
cdrom.h:20:23: SDL_sound.h: No such file or directory <------错误的原因在这里 
In file included from dos_programs.cpp:30: 
cdrom.h:137: error: ISO C++ forbids declaration of `Sound_Sample' with no type 
cdrom.h:137: error: expected `;' before '*' token 
make[3]: *** [dos_programs.o] Error 1 
make[3]: Leaving directory `/root/software/dosbox-0.63/src/dos' 
make[2]: *** [all-recursive] Error 1 
make[2]: Leaving directory `/root/software/dosbox-0.63/src' 
make[1]: *** [all-recursive] Error 1 
make[1]: Leaving directory `/root/software/dosbox-0.63' 
make: *** [all] Error 2 
[root@NEWLFS dosbox-0.63]# 

看来是因为cdrom.h没有找到SDL_sound.h这个头文件
所以出现了下面的错误,但是我明明已经安装好了SDL_sound阿?
经过查找,在/usr/local/include/SDL/下找到了SDL_sound.h
看来dosbox没有去搜寻/usr/local/include/SDL下的头文件,既然找到了原因,就容易解决啦

[root@NEWLFS dosbox-0.63]#ln -s /usr/local/include/SDL/SDL_sound.h /usr/include 

做个链接到/usr/include下,这样DOSBOX就可以找到了,顺利编译成功。

缺失值的简单处理—— MICE 和 Amelia 篇

缺失值的简单处理—— MICE 和 Amelia 篇

0.cover

参考资料:

因为最近一直在学数据库和处理病例数据的相关的东西,病例数据嘛,有缺失值太正常了。同时,也因为写的是实际碰到问题的时候找解决办法用的记录,所以都偏向实战而非理论。这篇博文主要针对连续型数据缺失值,其他类型数据可能还需要仔细看文档。

以往碰到的数据缺失值都不是很多,所以对于缺失值的处理都比较简单粗暴——先用 Amelia::missmap() 简单看下,大概没问题之后 complete.cases() 一把梭哈。但是这次实际处理病例数据发现,病例数据本来就算是比较宝贵的,随手 complete.cases() 会损失很多信息。而且文献里当然也大都是直接用 mean/median 之类来代替,所以就觉得应该好好研究下怎么处理缺失值。

MICE

先来看大名鼎鼎的 MICE 包,这个包全名就叫“Multivariate Imputation by Chained Equations”,从名字和介绍就可以看出来人家就是为处理各种类型的数据里的缺失值的:

Multiple imputation using Fully Conditional Specification (FCS) implemented by the MICE algorithm as described in Van Buuren and Groothuis-Oudshoorn (2011) doi:10.18637/jss.v045.i03. Each variable has its own imputation model. Built-in imputation models are provided for continuous data (predictive mean matching, normal), binary data (logistic regression), unordered categorical data (polytomous logistic regression) and ordered categorical data (proportional odds). MICE can also impute continuous two-level data (normal model, pan, second-level variables). Passive imputation can be used to maintain consistency between variables. Various diagnostic plots are available to inspect the quality of the imputations.

缺失值总体来说分为两类:

  1. MAR: Missing at random. 随机缺失。这一般也是我们希望的理想情况。
  2. MNAR: Missing NOT at random. 非随机缺失。

非随机缺失就比较麻烦了,数据不是随机缺失时,填补的结果自然更容易有偏。但就算数据是随机缺失,缺失值太多肯定也不太好。一般来说,约定俗成地认为 5% 以内的缺失值可以接受。如果哪个变量或者观测的缺失超过 5% 了,我们可能就需要考虑要不要把这个变量或者观测删掉了。
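
要检查每个变量(或每个观测)的缺失比例有没有超过 5%,最简单的办法大概是下面这样(df 泛指你手上的数据框):

# 每个变量的缺失比例
sapply(df, function(x) mean(is.na(x)))

# 每个观测(行)的缺失比例
rowMeans(is.na(df))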

MICE 假定数据缺失是 MAR,随机缺失意味着某个值缺失的可能性是依赖于其他的值的,所以也就可以通过其他的值来预测这个缺失的值了。MICE 对缺失值的模拟是通过对一个一个变量的模拟模型进行的。比如我们有 X1、X2 ... Xk 一个 k 个变量。如果 X1 有缺失值,那就用剩下的 X2 ~ Xk 变量对 X1 进行回归,X1 的缺失值的模拟值就用回归的结果来代替。依此类推,只要哪个变量有缺失值就用剩余其他变量来回归模拟缺失值进行填补。

默认情况下,对连续型数据的模拟采用线性回归,分类变量就用逻辑回归。所以模拟完成的时候,我们会得到好几套数据,这些数据的不同仅仅在于模拟填补的缺失值部分。一般来说,后面最好对这些数据分别建模然后合并结果。MICE 包用到的方法有:

  1. PMM (Predictive Mean Matching) – 数值型变量
  2. logreg (Logistic Regression) – 二分类变量
  3. polyreg (Bayesian polytomous regression) – 类别超过 2 的分类变量
  4. Proportional odds model - 有序的分类变量

下面我们就用 MICE 和一个随机添加了缺失值的 iris 数据作为实例来看 MICE 是怎么用的。

library(mice)
library(missForest)
library(dplyr)      # 后面会用到 %>% 和 select()
data(iris)

summary(iris)
  Sepal.Length    Sepal.Width     Petal.Length    Petal.Width          Species  
 Min.   :4.300   Min.   :2.000   Min.   :1.000   Min.   :0.100   setosa    :50  
 1st Qu.:5.100   1st Qu.:2.800   1st Qu.:1.600   1st Qu.:0.300   versicolor:50  
 Median :5.800   Median :3.000   Median :4.350   Median :1.300   virginica :50  
 Mean   :5.843   Mean   :3.057   Mean   :3.758   Mean   :1.199                  
 3rd Qu.:6.400   3rd Qu.:3.300   3rd Qu.:5.100   3rd Qu.:1.800                  
 Max.   :7.900   Max.   :4.400   Max.   :6.900   Max.   :2.500

然后我们随机在数据里产生 10% 的 缺失值。同时,这里我们先来看连续型数据缺失值的处理,所以我们把 Species 这个分类变量也去掉了。

set.seed(1234)
iris.mis <- missForest::prodNA(iris, noNA = 0.1) %>% 
    select(-Species)
summary(iris.mis)
  Sepal.Length    Sepal.Width     Petal.Length    Petal.Width   
 Min.   :4.300   Min.   :2.200   Min.   :1.000   Min.   :0.100  
 1st Qu.:5.100   1st Qu.:2.800   1st Qu.:1.600   1st Qu.:0.300  
 Median :5.800   Median :3.000   Median :4.400   Median :1.300  
 Mean   :5.854   Mean   :3.063   Mean   :3.773   Mean   :1.219  
 3rd Qu.:6.400   3rd Qu.:3.300   3rd Qu.:5.100   3rd Qu.:1.800  
 Max.   :7.900   Max.   :4.400   Max.   :6.900   Max.   :2.500  
 NA's   :16      NA's   :16      NA's   :14      NA's   :18 

MICE 也提供了可视化缺失值的函数 md.pattern()

md.pattern(iris.mis)

   Petal.Length Sepal.Length Sepal.Width Petal.Width   
99            1            1           1           1  0
11            1            1           1           0  1
10            1            1           0           1  1
2             1            1           0           0  2
10            1            0           1           1  1
2             1            0           1           0  2
2             1            0           0           1  2
7             0            1           1           1  1
3             0            1           1           0  2
2             0            1           0           1  2
2             0            0           1           1  2
             14           16          16          18 64

1.md.pattern

或者 Ameliamissmap() 其实更加直观一点,当然没有那么多信息:

Amelia::missmap(iris.mis)

2.missmap

下面我们就可以开始模拟填补缺失值了。

imputed_Data <- mice(iris.mis, m=5, maxit = 50, method = 'pmm', seed = 123)
summary(imputed_Data)

Class: mids
Number of multiple imputations:  5 
Imputation methods:
Sepal.Length  Sepal.Width Petal.Length  Petal.Width 
       "pmm"        "pmm"        "pmm"        "pmm" 
PredictorMatrix:
             Sepal.Length Sepal.Width Petal.Length Petal.Width
Sepal.Length            0           1            1           1
Sepal.Width             1           0            1           1
Petal.Length            1           1            0           1
Petal.Width             1           1            1           0

上面的代码的意思是:

  • m = 5 ,表示生成 5 个填补好的数据
  • maxit = 50,每次产生填补数据的迭代次数,这里取 50 次
  • method = ‘pmm’,上面介绍的连续型数据采用 Predictive Mean Matching 的方法

来看看我们刚刚生成的 Sepal.Width 值:

imputed_Data$imp$Sepal.Width
     1   2   3   4   5
9  3.1 3.0 3.3 3.3 2.8
14 3.0 3.2 3.0 3.0 3.3
20 3.3 3.0 4.1 3.5 3.7
22 3.3 3.7 3.5 3.5 3.5
23 3.0 3.5 3.4 3.4 3.0
35 3.1 3.0 3.6 3.0 3.2
41 3.4 3.8 3.0 3.1 3.2
46 3.4 3.4 3.4 3.0 3.6
59 3.0 3.1 3.2 3.6 2.8
61 2.8 2.9 3.7 2.6 2.8
66 3.8 3.2 4.1 3.3 3.0
67 2.8 2.3 2.4 2.8 2.7
69 3.8 3.1 2.3 2.6 3.4
71 2.7 3.8 2.8 2.5 2.7
82 2.7 3.0 2.9 2.9 3.2
83 2.8 3.2 2.8 2.9 3.4

dim(imputed_Data$imp$Sepal.Width)
[1] 16  5

我们一共生成了 5 组数据,前面我们看到 Sepal.Width 里有 16 个 NA,所以这里我们就得到一个 16 * 5 的数据。

或者我们只想要生成数据里的某一个:

# get complete data ( 2nd out of 5)
completeData.2 <- mice::complete(imputed_Data,2)

sum(is.na(completeData.2))
[1] 0

head(completeData.2)
  Sepal.Length Sepal.Width Petal.Length Petal.Width
1          5.1         3.5          1.4         0.2
2          4.9         3.0          1.4         0.2
3          4.7         3.2          1.3         0.2
4          4.6         3.1          1.5         0.2
5          5.0         3.6          1.4         0.2
6          5.4         3.9          1.7         0.2

要建模做统计分析的时候,前面也提到,我们可以对每套填补好的数据分别建模然后合并结果:

# build predictive model
fit <- with(data = imputed_Data, exp = lm(Sepal.Width ~ Sepal.Length + Petal.Width)) 

# combine results of all 5 models
combine <- pool(fit)

summary(combine)
               estimate  std.error statistic       df      p.value
(Intercept)   1.8677059 0.36106651  5.172748 36.47085 8.567029e-06
Sepal.Length  0.3028491 0.07477405  4.050190 33.00981 2.560708e-04
Petal.Width  -0.4761426 0.08182634 -5.818941 28.54299 1.160180e-06

我们把原始数据的模型和这个对比一下:

summary(lm(Sepal.Width ~ Sepal.Length + Petal.Width, data = iris))
             Estimate Std. Error t value Pr(>|t|)    
(Intercept)   1.92632    0.32094   6.002 1.45e-08 ***
Sepal.Length  0.28929    0.06605   4.380 2.24e-05 ***
Petal.Width  -0.46641    0.07175  -6.501 1.17e-09 ***

还行,结果还是比较接近的。

其他类型数据

这里简单看一个其他类型数据的处理的例子。

library(readr)      # read_csv() 来自 readr
(dat <- read_csv("/path/to/dt_simulated.csv"))
Parsed with column specification:
cols(
  Age = col_double(),
  Gender = col_character(),
  Cholesterol = col_double(),
  SystolicBP = col_double(),
  BMI = col_double(),
  Smoking = col_character(),
  Education = col_character()
)
# A tibble: 250 x 7
     Age Gender Cholesterol SystolicBP   BMI Smoking Education
   <dbl> <chr>        <dbl>      <dbl> <dbl> <chr>   <chr>    
 1  67.9 Female        236.       130.  26.4 Yes     High     
 2  54.8 Female        256.       133.  28.4 No      Medium   
 3  68.4 Male          199.       158.  24.1 Yes     High     
 4  67.9 Male          205        136   19.9 No      Low      
 5  60.9 Male          208.       145.  26.7 No      Medium   
 6  44.9 Female        222.       131.  30.6 No      Low      
 7  49.9 Male          202.       152.  27.3 No      Medium   
 8  55.1 Female        206.       151.  27.5 No      Low      
 9  57.5 Male          202.       142.  28.3 No      High     
10  77.2 Male          240.       161.  29.1 No      High     
# ... with 240 more rows

sapply(dat, function(x) sum(is.na(x)))
        Age      Gender Cholesterol  SystolicBP         BMI     Smoking   Education 
          0           0           0           0           0           0           0 

(这个数据是 dt_simulated.csv,但是我在 R 里直接没读进来,大概是网络原因我也懒得去找了。下载到本地自己读的,然后现在也放在这个 Repo 里了:dt_simulated.csv)

数据一共 250 行 × 7 列,列分别为年龄、性别、胆固醇、血压、BMI、是否抽烟以及教育程度,原始数据是没有缺失值的。所以我们先随机地加一些 NA 进去,随后把字符型变量转换成因子:

original <- dat

set.seed(10)
dat[sample(1:nrow(dat), 5), "Age"] <- NA
dat[sample(1:nrow(dat), 20), "Cholesterol"] <- NA
dat[sample(1:nrow(dat), 5), "BMI"] <- NA
dat[sample(1:nrow(dat), 20), "Smoking"] <- NA
dat[sample(1:nrow(dat), 20), "Education"] <- NA

sapply(dat, function(x) sum(is.na(x)))
        Age      Gender Cholesterol  SystolicBP         BMI     Smoking   Education 
          5           0          20           0           5          20          20 

dat <- dat %>%
    mutate(
        Smoking = as.factor(Smoking),
        Education = as.factor(Education),
        Cholesterol = as.numeric(Cholesterol)
    )

为了自定义 MICE 的整个填补过程,我们先构建一个 mice 对象,然后:

init = mice(dat, maxit=0)
init

Class: mids
Number of multiple imputations:  5 
Imputation methods:
        Age      Gender Cholesterol  SystolicBP         BMI     Smoking   Education 
      "pmm"          ""       "pmm"          ""       "pmm"    "logreg"   "polyreg" 
PredictorMatrix:
            Age Gender Cholesterol SystolicBP BMI Smoking Education
Age           0      0           1          1   1       1         1
Gender        1      0           1          1   1       1         1
Cholesterol   1      0           0          1   1       1         1
SystolicBP    1      0           1          0   1       1         1
BMI           1      0           1          1   0       1         1
Smoking       1      0           1          1   1       0         1
Education     1      0           1          1   1       1         0
Number of logged events:  1 
  it im dep     meth    out
1  0  0     constant Gender


meth = init$method
meth
        Age      Gender Cholesterol  SystolicBP         BMI     Smoking   Education 
      "pmm"          ""       "pmm"          ""       "pmm"    "logreg"   "polyreg"



predM = init$predictorMatrix
predM

            Age Gender Cholesterol SystolicBP BMI Smoking Education
Age           0      0           1          1   1       1         1
Gender        1      0           1          1   1       1         1
Cholesterol   1      0           0          1   1       1         1
SystolicBP    1      0           1          0   1       1         1
BMI           1      0           1          1   0       1         1
Smoking       1      0           1          1   1       0         1
Education     1      0           1          1   1       1         0

可以看到,这个对象里包含了填补缺失值的方法、使用的变量和其他参数等。我们把 method(用来定义每个变量模拟填补缺失值的方法)和 predictorMatrix(名字就很明显了,用来定义模拟填补每个变量时用到的变量矩阵)单独取出来,这样就可以对模拟过程进行自定义了。

比如有的时候,数据里可能有一列是 ID 值,这时候显然把它用来帮助模拟其他变量的缺失值完全没有意义。以我们这个数据里的 BMI 为例,假设 BMI 是某种编号信息,我们想在预测填补其他变量的缺失值时不使用这种变量,可以用 predM[, c("BMI")] = 0 把预测矩阵里对应 BMI 的这一列全部改成 0,这样预测其他变量的时候就不会再使用 BMI 这一列了。

但是上面的方法有一个问题,现在给其他变量预测缺失值的时候不会使用 BMI这一列,但 MICE 仍然会对 BMI 的缺失值进行填补,这显然也没什么意义。meth["BMI"] ="" 会把预测 BMI 时使用的方法变成空值,即不对 BMI 进行预测了。
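
把上面两段说的合在一起写出来大概就是下面这样(只是示意,这里假设 BMI 是类似 ID 的没有分析意义的列):

predM[, c("BMI")] <- 0   # 预测其他变量时不再使用 BMI 这一列
meth["BMI"] <- ""        # 同时也不再对 BMI 本身做填补

# 之后把自定义好的 meth 和 predM 传给 mice() 即可,比如:
# imputed <- mice(dat, method = meth, predictorMatrix = predM, m = 5)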

下面我们把 Age 排除在填补范围外,并且分别针对连续型变量 Cholesterol、二分类变量 Smoking 和有序变量 Education 自定义模拟方法,然后进行缺失值模拟:

meth[c("Age")] = ""
meth[c("Cholesterol")] = "norm" 
meth[c("Smoking")] = "logreg" 
meth[c("Education")] = "polyreg"

set.seed(1234)
imputed <- mice(dat, method = meth, predictorMatrix = predM, m = 5)
imputed <- mice::complete(imputed)

然后我们检查一下是不是缺失值都没了:

sapply(imputed, function(x) sum(is.na(x)))
        Age      Gender Cholesterol  SystolicBP         BMI     Smoking   Education 
          5           0           0           0           0           0           1 

Age 里的 5 个缺失值都没有处理,但是 Education 里还剩下一个缺失值也没有处理。似乎与 Age 变量没有处理有关,因为一旦把 Age 的缺失值也处理掉,Education 里全部的缺失值也能得到处理。此处原因待更新。

最后我们来看看与原始数据相比,填补缺失值的效果怎么样:

# Cholesterol
actual <- original$Cholesterol[is.na(dat$Cholesterol)]
predicted <- imputed$Cholesterol[is.na(dat$Cholesterol)]

mean(actual)
[1] 231.07
mean(predicted)
[1] 223.4087

# Smoking
actual <- original$Smoking[is.na(dat$Smoking)] 
predicted <- imputed$Smoking[is.na(dat$Smoking)] 

table(actual)
 No Yes 
 11   9 
table(predicted)
 No Yes 
 16   4 

效果还行吧。Cholesterol 实际均值 231.07, 预测值为 223.4087;Smoking 缺失值里 15/20 预测是对的。

Amelia

Amelia 包前面就出现过了。我一般用 missmap 来迅速看一下数据里的缺失值的分布情况。

Amelia 这个名字来源于 Amelia Earhart,美国航空先驱、作家。她是世界上第一个独立飞行穿越大西洋的女飞行员。但在 1937 年一次环球飞行中,她在途径太平洋上空时神秘失踪(missing)了。所以这个专门用来处理 missing value 的包以 Amelia 命名。

Amelia 对缺失值假设为:

  • 缺失值随机
  • 数据中所有变量都满足多元正态分布(Multivariate Normal Distribution, MVN),可以使用均值和协方差来描述数据。

Amelia 利用 bootstrap,同样也是生成多组填补值。但相比 MICE,MVN 还是有一些局限:

  1. MICE 对缺失值的模拟是一个一个变量进行的,而 MVN 依赖整体数据的多元正态分布
  2. MICE 可以处理多种类型数据的缺失值,而 MVN 只能处理正态分布或经转换后近似正态分布的变量
  3. MICE 能在数据子集的基础上处理缺失值,MVN 则不能

Amelia 适合用于符合多元正态分布的数据。如果数据不符合条件,可能需要事先将数据转换为近似正态分布。
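
比如可以先对偏态的变量做个对数变换再丢给 amelia(),填补完再换算回去。下面是个只作示意的小例子(要不要变换、变换哪个变量还是得看数据本身的分布):

set.seed(1234)
iris.log <- missForest::prodNA(iris, noNA = 0.1)      # 还是带 10% 缺失值的 iris
iris.log$Petal.Length <- log(iris.log$Petal.Length)   # 先取对数

amelia_log <- Amelia::amelia(iris.log, m = 5, noms = "Species")
# 填补完成后用 exp() 把 Petal.Length 换算回原始量纲即可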

我们一样还用前面那个数据:

library(Amelia)      # amelia() 来自 Amelia 包

set.seed(1234)
iris.mis <- missForest::prodNA(iris, noNA = 0.1)
amelia_fit <- amelia(iris.mis, m=5, parallel = "multicore", noms = "Species")

# 1st of the rsults
head(amelia_fit$imputations$imp1)
  Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1          5.1         3.5          1.4   0.2000000  setosa
2          4.9         3.0          1.4   0.2000000  setosa
3          4.7         3.2          1.3   0.2000000  setosa
4          4.6         3.1          1.5   0.2000000  setosa
5          5.0         3.6          1.4   0.2000000  setosa
6          5.4         3.9          1.7   0.2200646  setosa

然后看看回归分析结果:

fit2 <- Zelig::zelig(Sepal.Width ~ Sepal.Length + Petal.Width, data = amelia_fit, model = "ls")

summary(fit2)
               estimate  std.error statistic       df      p.value
(Intercept)   1.8677059 0.36106651  5.172748 36.47085 8.567029e-06
Sepal.Length  0.3028491 0.07477405  4.050190 33.00981 2.560708e-04
Petal.Width  -0.4761426 0.08182634 -5.818941 28.54299 1.160180e-06

均值或者中位值填补

在文献里可以大量看到直接用均值/中位值来填补缺失数据的。但这样做的前提应该也是数据里缺失值很少。

我在网上搜了几个办法:

How to fill NA with median?:

library(dplyr)
df %>% 
   mutate_all(~ifelse(is.na(.), median(., na.rm = TRUE), .))

# to replace a subset of columns:
df %>% 
  mutate_at(vars(value), ~ifelse(is.na(.), median(., na.rm = TRUE), .))

下面这一点很重要:均值/中位值填补只适用于连续型(数值型)变量,而真实数据往往混着多种类型的变量,所以实际上我们通常只需要对其中的数值型列做这种处理。

综合一下,我把上面的代码小小改动了一下,直接对变量进行筛选,遇到数值型变量就应用:

df %>%
  mutate_if(., is.numeric, .funs = ~ifelse(is.na(.), median(., na.rm = TRUE), .))
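
套在前面 Amelia 例子里造过缺失值的 iris 数据上试一下(示意):

library(dplyr)

set.seed(1234)
iris.mis <- missForest::prodNA(iris, noNA = 0.1)          # 人为制造 10% 缺失
iris.fill <- iris.mis %>%
  mutate_if(is.numeric, ~ifelse(is.na(.), median(., na.rm = TRUE), .))
sapply(iris.fill, function(x) sum(is.na(x)))              # 数值列的 NA 都补上了,Species 列的还在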

在 Debian 中使用 Zotero 文献管理软件

2017-05-14 20:51:47

我在 Debian 中用 Zotero 管理文献时发现 PDF 导入后获取不到文件信息(元数据)。以下是解决过程的记录。

Install the pdftotext and pdfinfo programs:

sudo apt-get install poppler-utils

Find the kernel and architecture:

uname --kernel-name --machine

In the Zotero data directory create a symbolic link to the installed programs. The printed kernel-name and machine is part of the link's name:

cd ~/.zotero
ln -s $(which pdftotext) pdftotext-$(uname -s)-$(uname -m)
ln -s $(which pdfinfo) pdfinfo-$(uname -s)-$(uname -m)

Install a small helper script to alter pdftotext parameters:

cd ~/.zotero
wget -O pdfinfo.sh https://raw.githubusercontent.com/zotero/zotero/4.0/resource/redirect.sh
chmod a+x pdfinfo.sh

Create some files named *.version containing the version numbers of the utilities. The version number appears in the third field of the first line on stderr:

cd ~/.zotero
pdftotext -v 2>&1 | head -1 | cut -d ' ' -f3 > pdftotext-$(uname -s)-$(uname -m).version
pdfinfo -v 2>&1 | head -1 | cut -d ' ' -f3 > pdfinfo-$(uname -s)-$(uname -m).version

Start Zotero. Under the gear icon, "Preferences" - "Search" should report something like:

PDF indexing
  pdftotext version 0.26.5 is installed
  pdfinfo version 0.26.5 is installed

Do not press "check for update". The usual maintenance of the operating system will keep those utilities up to date.

更多

如果上述完成还不可用,运行这个脚本:

#!/bin/bash

version=$(dpkg-query -W -f='${Version}' poppler-utils || echo "please_install_poppler-utils")

totextbinary='pdftotext-Linux-x86_64'
infobinary='pdfinfo-Linux-x86_64'
infohack='pdfinfo.sh'

for zoteropath in $(find $HOME/.zotero $HOME/.mozilla -name zotero.sqlite -exec dirname {} \;)
do
	echo $version > $zoteropath/"$totextbinary.version"
	echo $version > $zoteropath/"$infobinary.version"

	ln -s /usr/bin/pdftotext "$zoteropath/$totextbinary"
	ln -s /usr/bin/pdfinfo "$zoteropath/$infobinary"

	cat > $zoteropath/$infohack << EOF
#!/bin/sh
if [ -z "\$1" ] || [ -z "\$2" ] || [ -z "\$3" ]; then
    echo "Usage: $0 cmd source output.txt"
    exit 1
fi
"\$1" "\$2" > "\$3"
EOF

	chmod +x $zoteropath/$infohack

done

脚本来自 bugs.debian.org

跟着 mimic-code 探索 MIMIC 数据之 tutorials (一)


SQL 算是学完了,结果回去看 mimic-code 发现大多数脚本根本看不懂!想起来小学做数学习题:

  • 课本例题: 小明有 3 个苹果,吃了 1 个,请问小明还有几个苹果 ?

  • 课后习题:小华前天买了 5 个橘子,昨天吃了 1 个梨,请问小红今天还剩下几个苹果?

  • :卒.....

没有办法,我就把 mimic-code 翻来覆去地看,看看有没有什么我能看懂的。果然,MIMIC 很良心的,mimic-code/tutorials/ 里面就放了针对新人的几个简单的小课程,小课程搭配习题,答案也有,可以说是很好了。

一个个来看吧。


1. sql-intro

这个文档基本就是教我们 SQL 的了。基本上我就是泛泛地看了看。有几个值得记下来的:

How can we use temporary tables to help manage queries?

临时的表格可以用 WITH foo AS (...) 这样的语法来存放。比如我们想从 patients 表格算出年龄,再把结果另作他用:

WITH patient_dates AS (
SELECT p.subject_id, p.dob, a.hadm_id, a.admittime,
    ( (cast(a.admittime as date) - cast(p.dob as date)) / 365.2 ) as age
FROM patients p
INNER JOIN admissions a
ON p.subject_id = a.subject_id
ORDER BY subject_id, hadm_id
)
SELECT *
FROM patient_dates;

另一个办法是使用 materialised views,即物化视图:

-- we begin by dropping any existing views with the same name
DROP MATERIALIZED VIEW IF EXISTS patient_dates_view;
CREATE MATERIALIZED VIEW patient_dates_view AS
SELECT p.subject_id, p.dob, a.hadm_id, a.admittime,
    ( (cast(a.admittime as date) - cast(p.dob as date)) / 365.2 ) as age
FROM patients p
INNER JOIN admissions a
ON p.subject_id = a.subject_id
ORDER BY subject_id, hadm_id;

CASE statement for if/else logic

CASE WHEN 是简单的逻辑判断语句。比如我们想对 icustays 中 ICU 住院时间长短 (los) 分组:

-- Use if/else logic to categorise length of stay
-- into 'short', 'medium', and 'long'
SELECT subject_id, hadm_id, icustay_id, los,
    CASE WHEN los < 2 THEN 'short'
         WHEN los >=2 AND los < 7 THEN 'medium'
         WHEN los >=7 THEN 'long'
         ELSE NULL END AS los_group
FROM icustays;
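
顺手可以把 CASE WHEN 和 GROUP BY 组合起来,统计每组各有多少次 ICU 入住(示意;这里假设已经像后面 MIMIC 笔记里那样定义了能把 SQL 发给数据库并取回结果的 query() 辅助函数,或者把引号里的 SQL 直接贴到 psql 里跑也一样):

query("
  SELECT CASE WHEN los < 2 THEN 'short'
              WHEN los >= 2 AND los < 7 THEN 'medium'
              WHEN los >= 7 THEN 'long'
              ELSE NULL END AS los_group
       , COUNT(*) AS n
  FROM icustays
  GROUP BY 1;")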

Window functions

Window functions 中文好像翻译为窗口函数,这个"窗口"大概就是"窗口期"(比如艾滋病感染检测的窗口期)的那个意思,在 Bowtie2 之类做序列比对的软件计算比对质量的时候也会用到这个概念。
不知道为什么 SQLBolt 竟然没有涉及到 Window functions,感觉很实用的功能。

Window functions 和 Aggregate 很像,但 Aggregate 是聚合,会按照我们的要求把相同的行合并起来,而 Window functions 不会合并行。用例子来看会很清楚:比如我们想对同一个病人的多次 ICU 入住按时间先后编号。这种情况下直接用 GROUP BY subject_id 会把同一个病人的信息合并成一行,而我们想要的是每个病人每次入 ICU 的信息仍然单独占一行,只是按入科时间 intime 的先后给出编号。这里用来 PARTITION BY 的列就是 subject_id,即每个病人是一个处理单位,RANK() 在每个单位内部按 ORDER BY intime 生成顺序编号。代码:

-- find the order of admissions to the ICU for a patient
SELECT subject_id, icustay_id, intime,
    RANK() OVER (PARTITION BY subject_id ORDER BY intime)
FROM icustays;

有了这样一个编号我们就可以很方便的筛选只住过一次 ICU 的病例了(这个在文献里经常看到):

-- select patients from icustays who've stayed in ICU for only once
WITH icustayorder AS (
SELECT subject_id, icustay_id, intime,
  RANK() OVER (PARTITION BY subject_id ORDER BY intime)
FROM icustays
)
SELECT *
FROM icustayorder
WHERE rank = 1;

Multiple temporary views

多个临时视图,这个在 mimic-code 里简直不要太常见。

services 表格包含了病人接受治疗的情况(比如是在外科还是内科这种):

-- find the care service provided to each hospital admission
SELECT subject_id, hadm_id, transfertime, prev_service, curr_service
FROM services;

但是这个表格里没有 icustay_id,我们只能通过 hadm_idJOIN

WITH serv as (
  SELECT subject_id, hadm_id, transfertime, prev_service, curr_service
  FROM services
)
, icu as
(
  SELECT subject_id, hadm_id, icustay_id, intime, outtime
  FROM icustays
)
SELECT icu.subject_id, icu.hadm_id, icu.icustay_id, icu.intime, icu.outtime
, serv.transfertime, serv.prev_service, serv.curr_service
FROM icu
INNER JOIN serv
ON icu.hadm_id = serv.hadm_id

但是,这个过程其实中间是有一些猫腻的。INNER JOIN 是取交集的:

(图:INNER JOIN 取两表交集的韦恩图示意)

那么取完后的结果的行数肯定不多于之前的数据。但是我们看看我们的数据:

WITH serv as (
  SELECT subject_id, hadm_id, transfertime, prev_service, curr_service
  FROM services
)
, icu as
(
  SELECT subject_id, hadm_id, icustay_id, intime, outtime
  FROM icustays
)
SELECT COUNT(*)
FROM icu
INNER JOIN serv
ON icu.hadm_id = serv.hadm_id

这个在我电脑上显示 78840 行,那我们再看 icustays 数据:

SELECT count(*)
FROM icustays;

61532行。哈哈,INNER JOIN 之后行数变多了,刺激!

下面很快给出了解释:

事实是,每个 hadm_id 可能对应了好几个 service 和好几个 icustay_id ,即一个病人院内转科和多次住 ICU 的情况。所以当通过 hadm_idJOIN 两个表的时候,在 hadm_id 相同而 icustay_idservices 不同时每种组合都会在结果里作为单独的一行。专业的解释:

More technically, the first query joined two tables on non-unique keys: there may be multiple hadm_id with the same value in the services table, and there may be multiple hadm_id with the same value in the admissions table. For example, if the services table has hadm_id = 100001 repeated N times, and the admissions table has hadm_id = 100001 repeated M times, then joining these two on hadm_id will result in a table with NxM rows: one for every pair. With MIMIC, it is generally very bad practice to join two tables on non-unique columns: at least one of the tables should have unique values for the column, otherwise you end up with duplicate rows and the query results can be confusing.

所以最后,我们可以通过在 services 里对相同的hadm_id 利用窗口函数排序,只留下第一个 service 记录,这样 hadm_id 也就变成了 unique key 了。

 WITH serv as (
   SELECT subject_id, hadm_id, transfertime, prev_service, curr_service,
    RANK() OVER (PARTITION BY hadm_id ORDER BY transfertime) as rank
   FROM services
   )
   , icu as
   (
   SELECT subject_id, hadm_id, icustay_id, intime, outtime
   FROM icustays
   )
   SELECT COUNT(*)
   FROM icu
   INNER JOIN serv
   ON icu.hadm_id = serv.hadm_id
   AND serv.rank = 1;

本来打算只写一点点做个笔记,没想到已经这么长了,那干脆分篇好了。



THE END

《R Graphics Cookbook》第一章学习笔记

2017-05-25

书中涉及的示例数据可以在 GitHub 上找到。

散点图

使用自带的 cars 数据。数据分两列,分别为速度和路程。

Let's use one of R's inbuilt datasets called cars to look at the relationship between the speed of cars and the distances taken to stop (recorded in the 1920s).

plot(cars$dist~cars$speed)

得到基本的关于汽车行驶路程和行驶速度的散点图:

(图:汽车行驶路程与速度的基本散点图)

稍微修饰下图片,添加一些参数:

plot(cars$dist~cars$speed, # y~x
     main="Relationship between car distance & speed", # Plot Title
     type='p', ##Specify type of plot as p for point(default option)
     xlab="Speed (miles per hour)", #X axis title
     ylab="Distance travelled (miles)", #Y axis title
     xlim=c(0,30), #Set x axis limits from 0 to 30
     ylim=c(0,140), #Set y axis limits from 0 to 140
     xaxs="i", #Set x axis style as internal
     yaxs="i", #Set y axis style as internal
     col="red", #Set the color of plotting symbol to red
     pch=19) #Set the plotting symbol to filled dots)

然后图片变为:

(图:添加标题、坐标轴标签等参数后的散点图)

pch 参数

pch 是 plotting character 的缩写,缺省情况下数据点显示为空心圆点。pch 可以取 0:25,共 26 种符号;其中 21:25 这几个符号还可以用 bg = "颜色" 参数设置填充色,而 col 参数用来设置这些符号(边框)的颜色。作一张图看看 pch 都有哪些:

par(mfrow = c(5, 5))
for(i in 1:5){
  if(i < 5){
    for(j in 1:5){plot(1, pch = (i-1)*5 + j, cex = 2, col = 'black')}}
  else
    for(j in 1:5){plot(1, pch = (i-1)*5 + j, cex = 2, col = 'darkgreen', bg = 'red')}
}

(图:pch 1~25 各符号的样式)

折线图

示例数据 dailysales.csv。数据有两列,第 1 列为日期,第 2 列为销售量。

直接上图:

plot(sales$units~as.Date(sales$date,"%d/%m/%y"),
type="l", #Specify type of plot as l for line
main="Unit Sales in the month of January 2010",
xlab="Date",
ylab="Number of units sold",
col="blue")

生成一个显示每日销售量的折线图:

(图:2010 年 1 月每日销售量折线图)

与散点图相比,基本上就是多了 type="l" 这个参数:l 代表 line,即折线图;type 默认为 p 即 point(点图),所以画散点图时不指定也可以,画折线图则需要显式地把它指定为 l。

还可以在图上用另一个数据添加一条线:

sales$units2 <- sales$units -  1000
lines(sales$units2~as.Date(sales$date,"%d/%m/%y"), col="red")

(图:叠加第二条折线后的折线图)

条形图

使用示例数据 citysales.csv 。数据分 4 列, 第一列为城市名,后三列分别产品 A、B、C的销售量。

画条形图展示产品 A 在不同城市的销量:

barplot(sales$ProductA,
        names.arg= sales$City,
        col="black")

得到如下条形图:

(图:产品 A 在各城市销量的条形图)

默认画出的条形图是竖直的,也可以通过参数 horiz = TRUE 来画水平的条形图:

barplot(sales$ProductA, names.arg = sales$City, horiz = TRUE, col = 'black')

(图:水平方向的条形图)

The labels for the bars are specified by the names.arg argument, but we use this argument only when plotting single bars. In the example with sales figures for multiple products, we didn't specify names.arg . R automatically used the product names as the labels and we had to instead specify the city names as the legend.

参数 names.arg 用来指定数据条的名称,但只有在画单个数据的条形图才需要这个参数, 多个数据的条形图不需要指定这个参数。

多个数据组的条形图

实际中我们经常需要对多组数据、每组多个指标画条形图,例如一共三组数据(比如某个指标 0h、24h 和 48h 分别测得的数值),每组数据包含三个指标(比如红细胞计数、血小板和血红蛋白)。这个时候就要用到多个数据组的条形图了。依然延续上个例子的数据,不过现在每个城市的销售产品都有 A、B、C 三种:

barplot(as.matrix(sales[,2:4]), legend = sales$City, col = heat.colors(5), beside = TRUE, border = 'white')

得到的图形为:

(图:各城市三种产品销量的分组条形图)

The beside argument is used to specify whether we want the bars in a group of data to be stacked or adjacent to each other. By default, beside is set to FALSE , which produces a stacked bar graph. To make the bars adjacent, we set beside to TRUE .

直方图和密度图

一个包含 1000 个数据的正态分布的直方图的例子:

hist(rnorm(1000), col = heat.colors(5), border = 'white')

(图:1000 个正态随机数的频数直方图)

As you may have noticed in the preceding examples, the default setting for histograms is to display the frequency or number of occurrences of values in a particular range on the Y axis. We can also display probabilities instead of frequencies by setting the prob (for probability) argument to TRUE or the freq (for frequency) argument to FALSE .

直方图默认以数据的频数作图,如要以频率作图,可以指定参数 freq = FALSE 或者 prob = TRUE

如:

hist(rnorm(1000),col = heat.colors(5), border = 'white', freq = FALSE)
hist(rnorm(1000),col = heat.colors(5), border = 'white', probability = TRUE)

都会得到:

(图:以频率为纵轴的直方图)

画密度分布图则需要额外使用 density() 函数:

plot(density(rnorm(1000)), col = 'red')

(图:正态随机数的密度分布图)
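
也可以把密度曲线直接叠加到以频率为纵轴的直方图上(一个小示意):

x <- rnorm(1000)
hist(x, freq = FALSE, col = heat.colors(5), border = 'white')
lines(density(x), col = 'red', lwd = 2)   # 叠加密度曲线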


箱式图

示例数据 metals.csv。数据为伦敦不同地点空气中金属离子含量。数据第一列为不同地点,从第二列起为不同金属离子浓度值。

画箱式图展示金属离子含量:

boxplot(metals[,2:ncol(metals)],
        xlab="Metals",
        ylab="Atmospheric Concentration in ng per cubic metre",
        main="Atmospheric Metal Concentrations in London")

(图:伦敦空气中各金属离子浓度的箱式图)

The dark line inside the box for each metal represents the median of values for that metal. The bottom and top edges of the box represent the first and third quartiles respectively. Thus, the length of the box is equal to the interquartile range (IQR, difference between first and third quartiles). The maximum length of a whisker is a multiple of the IQR (default multiplier is approximately 1.5). The ends of the whiskers are at data points closest to the maximum length of the whisker.
All the points lying beyond these whiskers are considered outliers.
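
如果想把这些被判为离群点的值取出来看看,可以用 boxplot.stats()(小示意,假设 metals 已经读入,这里取第二列即某一种金属的浓度来演示):

vals <- metals[[2]]            # 某一种金属的浓度
st   <- boxplot.stats(vals)
st$stats                       # 下须、下四分位、中位数、上四分位、上须
st$out                         # 落在须之外的离群点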

还可以对数据分组之后再画箱式图,例如用 copper 这个数据(包含铜离子浓度 Cu 和采样点 Source 两列)显示铜离子在不同地点的含量情况:

boxplot(copper$Cu ~ copper$Source, 
        xlab="Measurement Site",
        ylab="Atmospheric Concentration of Copper in ng per cubic metre",
        main="Atmospheric Copper Concentrations in London")

(图:铜离子浓度按采样点分组的箱式图)

关于源码编译的基础知识 via LinuxSir (上篇)

2017-05-18 

这一篇博文很长,我分成上下两篇。

内容来自已经不存在的 LinuxSir 社区(RIP)的 LFS 板块,当时很多东西我并不懂,只是觉得很有趣所以复制粘贴到我的笔记软件里了。
几年之后我对 Linux 熟悉了一些,也用过 Gentoo 很久,并一直认为这是我最喜欢的 Linux 发行版。虽然我对源码编译很感兴趣,但是知之甚少。偶然再看到这篇博文,很多东西有种豁然开朗的感觉。再想回 LinuxSir 社区看看,发现它早已不复存在。

我现在仍然觉得这两篇博文很有趣,打算继续保留它们。如果原作者看到的话,有什么问题请和我联系。

以下是原文,有少量改动。


如果不出意外的话,会出现say.so => not found. 这时的./test是不能运行的. 但至少说明程序运行时是需要这个库的. 那为什么找不到这个库呢? 那就让我们看看系统是怎样寻找这些库的吧.

首先是ld-linux.so.2这个不能不说,它太重要了,以至于也决定了后面的搜索方式.

先是程序内部决定的.

strings test 

还好我们这个test程序不大,不用过滤输出,好,你看见什么, /lib/ld-linux.so.2, say.so, libc.so.6, 对, 用到的库!

但我们发现不同,有的有路径,有的没有,先不管没有路径的怎么寻找,有路径的肯定是能找到了,那好,我们让say.so也有了路径.

gcc test.c ./say.so -o test2 
strings test2 

我们发现原来的输出中原来的say.so已经变成了./say.so. 运行一下./test2, 可以运行了! 好,找到库了,这里用的相对路径,无疑,我们将say.so移动到非当前文件夹.那test就又不能运行了.这样无疑是把我们用到的库硬编码进了程序里.我不喜欢硬编码,太死板.那不硬编码系统怎么找到我们需要的文件呢.

在程序没有把库地址硬编码经进去的前提下,系统会寻找LD_LIBRARY_PATH环境变量中的地址.
如果系统在这一步也没发现我们需要的库呢.
/etc/ld.so.cache这个由ldconfig生成的文件,记载着在/etc/ld.so.conf文件中指明的所有库路径加上/lib, /usr/lib里的所有库的信息.
其实以上这句话只是在大多数情况下是正确的, 是否是这个文件由ld-linux.so.2决定. 如果你的LFS中的第一遍工具链/tools还在的话,

strings /tools/lib/ld-linux.so.2 |grep etc 

输出很可能是/tools/etc/ld.so.cache. 那么它用的哪个文件我们就清楚了吧.
可这个路径前面的/tools到底和什么有关呢?首先我们可能会想到与ld-linux所在的位置有关. 还好我们有3套glibc, 感谢LFS, 现在我们拿第二遍的工具链下手. 假设我们的LFS在/lfsroot

strings /lfsroot/lib/ld-linux.so.2 

很奇怪的是输出竟然是/etc/ld.so.cache! 那这到底和什么有关呢,没错就是我们编译时候的--prefix有关.
现在再看这个/etc/ld.so.conf, 和/lib, /usr/lib这些默认ldconfig路径. 也都要加上个这个prefix了.

strings /tools/sbin/ldconfig |grep etc 
strings /tools/sbin/ldconfig |grep /lib 

验证一下吧.
那要是ld.so.cache里也没有记载这个库的地址怎么办呢.
最后在默认路径里找.这个路径一般是/lib, /usr/lib, 但也不全是.

strings /tools/lib/ld-linux.so.2 |grep /lib 

还是要加个prefix.
现在我们反过来思考,不用程序中硬编码的/lib/ld-linux.so.2做动态加载器了.这也可以?!是的!虽然不一定成功.

LD_TRACE_LOADED_OBJECTS=y /tools/lib/ld-linux.so.2 /bin/test 
LD_TRACE_LOADED_OBJECTS=y /lib/ld-linux.so.2 /bin/test 
LD_TRACE_LOADED_OBJECTS=y /lfsroot/lib/ld-linux.so.2 /bin/test 

为了说明顺序,我们做如下很危险的实验:

ldconfig /lfsroot/lib; 
ldconfig -p 

会出现很多内容,但不要试着过滤,因为这时的系统应该很多程序不能运行了.先踏下心来观察.你会发现很多库出现两次/lfsroot/lib, 和/lib而且/lfsroot/lib在前, 说明ldconfig先处理参数给出的地址,最后是默认地址.但顺序也不一定,应该还和编译glibc时我们的参数--enable-kernel有关(我根据种种表现猜测).
加上export LD_LIBRARY_PATH=/lib 环境变量在前面,不能运行的程序又能运行了,说明LD_LIBRARY_PATH变量的优先级优于ld.so.cache

unset LD_LIBRARY_PATH 
echo >/etc/ld.so.cache 
ldconfig -p 

应该什么都不出现,可大部分程序能运行.说明ld-linux.so.2决定的默认路径起了作用(注意,这里的ldconfig的默认路径没有作用)

ldconfig 

恢复系统正常.

PostgreSQL 入门

接触 MIMIC 数据库一阵子了,勉强一边 Google 一边看 mimic-code 提供的脚本,搞定了本地数据库,并且把所有提供的 concepts 都建立好了。

过程中用 R 配合 RPostgreSQL 来连接和操作数据已经相对容易了,再加上 tidyverse 强大的管道 + 数据清洗功能,但每每涉及到要去看 mimic-code 没有现成提供的数据时,就觉得对数据库操作力不从心。

所以说落下的课终究是要补的,天道好轮回,苍天饶过谁。

1. 一些基础概念

PostgreSQL,或者说关系型数据库,有几个很重要的概念:Schema(模式)、View(视图)和 Materialized View(物化视图)。

Schema,模式

Schema 类似于分组,它可以将数据库对象组织到一起形成逻辑组,方便管理。

我们在 PostgreSQL 数据库中创建的任何对象(表、索引、视图和物化视图)都会在一个模式下被创建。如果未指定模式,这些对象将会在默认的模式下被创建,这个模式叫做 public。每一个数据库在创建的时候就会有一个这样的模式。

创建一个新的 schema 就是 CREATE SCHEMA my_schema;,要在这个指定的 schema 里建立表格:

CREATE TABLE my_schema.mytable (
...
);

删除一个空 schema 是 DROP SCHEMA my_schema;,如果不是空的就得 DROP SCHEMA myschema CASCADE; 了。

假如我们进入一个数据库并执行一个命令操作一个叫 my_table 的表格的时候,默认情况下数据库会在 public 这个模式中找,找不到就报错,哪怕这个 my_table 本身在另一个模式(比如 my_schema)里已经存在。这个时候我们就要设置搜索路径了:

SET search_path TO my_schema, public;

这样就把 my_schema 放到了搜索路径里 public 的前面。这个有点像 Linux 的用户 PATH 这个环境变量。设置了这个之后我们在建立数据库不指定模式的建立对象时默认都会放到 my_schema。但是需要注意,SET search_path 这个设置不是永久的,只在当前会话有效。这有点像 Linux 下终端里 export 一个变量,关掉终端之后就没了。

View & Materialized View,视图与物化视图

视图和物化视图就没那么好解释了,我 Google 了一下找到这个博客我觉得比较好理解:It's a view, it's a table... no, it's a materialized view!,节选下重点 :

Let's start with TABLE – it's basically an organized storage for your data - columns and rows. You can easily query the TABLE using predicates on the columns. To simplify your queries or maybe to apply different security mechanisms on data being accessed you can use VIEWs – named queries – think of them as glasses through which you can look at your data.

So if TABLE is storage, a VIEW is just a way of looking at it, a projection of the storage you might say. When you query a TABLE, you fetch its data directly. On the other hand, when you query a VIEW, you are basically querying another query that is stored in the VIEW's definition. But the query planner is aware of that and can (and usually does) apply some "magic" to merge the two together.

Between the two there is MATERIALIZED VIEW - it's a VIEW that has a query in its definition and uses this query to fetch the data directly from the storage, but it also has it's own storage that basically acts as a cache in between the underlying TABLE(s) and the queries operating on the MATERIALIZED VIEW. It can be refreshed, just like an invalidated cache - a process that would cause its definition's query to be executed again against the actual data. It can also be truncated, but then it wouldn't behave like a TABLE nor a VIEW. It's worth noting that this dual nature has some interesting consequences; unlike simple "nominal" VIEWs their MATERIALIZED cousins are "real", meaning you can - for example - create indices on them. On the other hand, you should also take care of removing bloat from them.

视图的本质是查询语句,而不是实在的表格。物化视图则介于两者之间:它的定义也是一条查询,但会把查询结果真正存一份在磁盘上(相当于底层表和查询之间的一层缓存),可以刷新,也可以在上面建索引。
对我这个需求来说,大概理解到这里就够了。实际上 mimic-code 提供的代码里基本上也都是在导入的原始数据上建立物化视图的。

因为视图和物化视图都是建立在查询上的,所以在创建时也就必须得有查询语句:

CREATE [MATERIALIZED] VIEW view_name AS
	SELECT column1, column2..... FROM table_name
		WHERE [condition]; 

删除视图类似删除表:DROP VIEW IF EXISTS view_name;,如果有其他对象依赖这个视图,就要 DROP VIEW IF EXISTS view_name CASCADE;(物化视图则对应 DROP MATERIALIZED VIEW)。
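
这些语句除了在 psql 里敲,也可以在 R 里通过 RPostgreSQL 直接发给数据库执行(小示意;数据库名、视图名请换成自己的,my_view 是假设已经存在的一个物化视图):

library(RPostgreSQL)

con <- dbConnect(dbDriver("PostgreSQL"), dbname = "mimic")
dbSendQuery(con, "SET search_path TO mimiciii, public;")   # 只在本次连接内有效
dbSendQuery(con, "REFRESH MATERIALIZED VIEW my_view;")     # 刷新物化视图
dbDisconnect(con)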

2. PostgreSQL 中的数据类型

参考这一篇介绍 PostgreSQL 数据类型的博文:PostgreSQL数据类型

数据类型指定要在表格中每一列存储哪种类型的数据。
创建表格时每列都必须指定数据类型。PostgreSQL 中常用的数据类型主要有三大类:

  • 数值数据类型
  • 字符串数据类型
  • 日期/时间数据类型

数值

常见数值类型包括:

  • smallint:小范围整数;
  • integer:典型的整数类型;
  • bigint:可以存储大范围整数;
  • decimal,numeric:用户指定精度的精确数字;
  • real,double precision:可变精度的浮点数,前者约 6 位有效数字,后者约 15 位;

字符串

字符串类型包括

  • char(size),character(size):固定长度字符串,size 规定了需存储的字符数,由右边的空格补齐;
  • varchar(size),character varying(size):可变长度字符串,size 规定了需存储的字符数;
  • text:可变长度字符串。

日期/时间

表示日期或时间的数据类型有:

  • timestamp:日期和时间,有或无时区;
  • date:日期,无时间;
  • time:时间,有或无时区;
  • interval:时间间隔。

其他

其他数据类型类型还有布尔值 boolean (true 或 false),货币数额 money 和 几何数据等。

3. 入门命令

标准的进入数据库的命令是 psql -U USER -d DB -h HOST -p PORT,这样会要求输入密码然后进入 DB 数据库。但是我的数据库只在本地用,而且当前用户也已经加成了数据库超级用户,所以用户、主机、端口都可以省掉,最后直接 psql -d DB 甚至 psql DB 就行了。

极其常用命令列表一下:

命令        功能
\?          命令列表
\h cmd      获取命令解释
\l          列举所有数据库
\c db       连接到另一数据库
\d          列举当前数据库的所有对象
\d+         列举当前数据库的所有对象及其额外信息
\d table    列出表格的元数据
\du         列出所有用户
\dn         列出所有 schema
\e          编辑器
\r          重置当前的 query
\i file     执行文件

基本表格操作

创建一个含 name 和 date 这样两列的表格,而且指定了列的数据类型:

CREATE TABLE tab (name VARCHAR, date DATE);

在上述表格中插入一行数据:

INSERT INTO tab (name, date) VALUES ('Jackie', '20180708');

改动:

UPDATE tab SET name='Jack' WHERE name='Jackie';

删除行:

DELETE FROM tab WHERE name='Jack';

添加列:

ALTER TABLE tab ADD email VARCHAR;

更改列名:

ALTER TABLE tab RENAME COLUMN date TO date_add;

删除列:

ALTER TABLE tab DROP COLUMN email;

更改表格名:

ALTER TABLE tab RENAME TO my_tab;

删除表格:

DROP TABLE IF EXISTS my_tab;
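
另外,如果数据本来就在 R 的数据框里,用 DBI 的 dbWriteTable() 可以一步写成一张表,省去手写 CREATE TABLE 和 INSERT(示意;con 沿用上面用 RPostgreSQL 建好的连接,tab2 是随便起的表名):

df <- data.frame(name = c("Jackie", "Jack"),
                 date = as.Date(c("2018-07-08", "2018-07-09")))
dbWriteTable(con, "tab2", df, row.names = FALSE)   # 直接把数据框写成表 tab2
dbReadTable(con, "tab2")                           # 读回来检查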

跟着 mimic-code 探索 MIMIC 数据之 notebooks CRRT (一)


花了几天时间把 mimic-code/notebooks/crrt-notebook.ipynb 从头到尾看了一遍。虽然消化得还不是很好,但是觉得这一篇教程真的是干货满满,决定还是花点时间仔细再整理一下。和前面一样,我还是尽量放到 R 里做,R 不好做的我再到 Jupyter 里做。R 的设置在上一篇里写过,这里我就只写 Python 里的准备工作了。需要的东西有:

  • PostgreSQL 运行,本地建立好 MIMIC-III 数据库
  • Python,我是 conda 环境的 Python 3.6。使用 Jupyter 的话当然还得搭配浏览器
  • R,最好搭配 RStudio

这个记事本(因为教程以 Jupyter Notebook 的形式存在,所以一直称为记事本)总体讲述如何在 MIMIC 数据中定义 CRRT。CRRT,Continuous renal replacement therapy,中文作连续性肾脏替代治疗,也被称作连续血液净化治疗 (continuous blood purification, CBP)。

CRRT 是临床上出现的一种新的肾脏替代治疗方法,即每天持续 24 小时或接近 24 小时的长时间、连续的体外血液净化疗法。

【邓青志,余阶洋,彭佳华.连续性肾脏替代治疗对ICU脓毒症患者的临床研究进展[J]. **医学工程, 2018, 26(04): 30-32.】

以及

【马帅,丁峰.连续性肾脏替代治疗的过去、现在与未来[J].上海医药,2018,39(09):3-5+11.】

这个记事本主要目的是在 MIMIC-III v1.4 数据中定义病人 CRRT 的开始和结束时间;次要目的是展示如何从 MIMIC-III 数据中提取和整理临床数据。

框架

在 MIMIC-III 数据库中,定义一个临床概念包含以下几个关键步骤:

  1. 找出描述这一临床概念的关键词和语句
  2. 在 d_items 表格中搜索这些关键词(如果是实验室检查的话要看 d_labitems 表格)
  3. 从 d_items 表格的 linksto 这一列所指向的表格中提取数据
  4. 用提取数据的规则在数据库中定义这一临床概念
  5. 通过逐个查看和聚合操作做验证

这整个过程是迭代进行的,也没有上面描述的那么清晰——验证时你可能又要回去修改数据提取的规则,等等。而且对于 MIMIC-III 数据,这整个过程还必须重复一次:一次是 MetaVision,一次是 CareVue。

MetaVision 和 CareVue

MIMIC-III 中的数据来自两个不同的 ICU 数据库系统。其结果就是,同一个临床概念的数据可能对应到多个不同的 itemid 。比如,病人心率数据算是一个比较容易提取的临床概念了,但是在 d_items 表格中匹配“heart rate”却可以发现至少两个 itemid

SELECT itemid, label, abbreviation, dbsource, linksto
FROM mimiciii.d_items
WHERE label='Heart Rate';

得到:

itemid   label        abbreviation   dbsource     linksto
211      Heart Rate                  carevue      chartevents
220045   Heart Rate   HR             metavision   chartevents

可以看到两个 itemid 都对应心率——但是一个是 CareVue 数据库系统使用的(dbsource = 'carevue')而另一个是 MetaVision 系统使用的(dbsource = 'metavision')。这也就是上面提到的,数据提取过程必须重复一次。通常来说,推荐先提取 MetaVision 数据,因为其数据组织形式更好,并且可以为后续到底需要纳入哪些数据提供了一些十分有用的信息。比如,MetaVision 里的 itemid 的每一个 label 都有一个相应的缩写,而这些缩写可以在后面用来在 CareVue 中搜索用。

Step 0: import libraries, connect to the database

由于是 Python 来做的,所以首先是载入包和一些设置。首先是所有要用到的包:

# Import libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import psycopg2
import getpass  # 下面连接数据库时用来交互式读取密码
from IPython.display import display, HTML # used to print out pretty pandas dataframes
import matplotlib.dates as dates
import matplotlib.lines as mlines

然后一些简单的设置和连接数据库:

%matplotlib inline
plt.style.use('ggplot') 

# specify user/password/where the database is
sqluser = 'postgres'
dbname = 'mimic'
schema_name = 'mimiciii'
host = 'localhost'

query_schema = 'SET search_path to ' + schema_name + ';'

# connect to the database
con = psycopg2.connect(dbname=dbname, user=sqluser, password=getpass.getpass(prompt='Password:'), host=host)

我自己在连接数据库的时候每次都会出现报错:

OperationalError: could not connect to server: No such file or directory
	Is the server running locally and accepting
	connections on Unix domain socket "/tmp/.s.PGSQL.5432"?

Google 了一下就是这个文件放在不同的位置了,建立一个软链接就好:

ln -s /var/run/postgresql/.s.PGSQL.5432 /tmp/.s.PGSQL.5432

Step 1: Identification of key terms

我们感兴趣的是 CRRT,那么首先我们直接在 MetaVision 数据中搜索”CRRT“看看:

query = query_schema + """
select itemid, label, category, linksto
from d_items
where dbsource = 'metavision'
and lower(label) like '%crrt%'
"""
df = pd.read_sql_query(query,con)

df

可以得到:

# A tibble: 6 x 4
  itemid label                         category    linksto           
*  <int> <chr>                         <chr>       <chr>             
1 227290 CRRT mode                     Dialysis    chartevents       
2 225436 CRRT Filter Change            Dialysis    procedureevents_mv
3 227525 Calcium Gluconate (CRRT)      Medications inputevents_mv    
4 225802 Dialysis - CRRT               Dialysis    procedureevents_mv
5 227536 KCl (CRRT)                    Medications inputevents_mv    
6 225956 Reason for CRRT Filter Change Dialysis    chartevents  

然后我们就可以通过结果拓展我们开始的搜索方法了:

  • category = ‘Dialysis’
  • lower(label) like '%crrt%'

Step 2: Extraction of ITEMIDs from tables

Get list of itemid related to CRRT


(从这里开始为了贴结果方便我还是切到 R 里做了)


首先我们根据刚刚改进的搜索词来找到对应的 itemid

query("SELECT itemid, label, category, linksto FROM d_items di
       WHERE dbsource = 'metavision' 
          AND (lower(label) LIKE '%dialy%'
          OR category = 'Dialysis'
          OR lower(label) LIKE '%crrt%')
       ORDER BY linksto, category, label;")

# -------

# A tibble: 65 x 4
   itemid label                                          category                linksto    
 *  <int> <chr>                                          <chr>                   <chr>      
 1 225740 Dialysis Catheter Discontinued                 Access Lines - Invasive chartevents
 2 227357 Dialysis Catheter Dressing Occlusive           Access Lines - Invasive chartevents
 3 225776 Dialysis Catheter Dressing Type                Access Lines - Invasive chartevents
 4 226118 Dialysis Catheter placed in outside facility   Access Lines - Invasive chartevents
 5 227753 Dialysis Catheter Placement Confirmed by X-ray Access Lines - Invasive chartevents
 6 225323 Dialysis Catheter Site Appear                  Access Lines - Invasive chartevents
 7 225725 Dialysis Catheter Tip Cultured                 Access Lines - Invasive chartevents
 8 227124 Dialysis Catheter Type                         Access Lines - Invasive chartevents
 9 225126 Dialysis patient                               Adm History/FHPA        chartevents
10 224149 Access Pressure                                Dialysis                chartevents
# ... with 55 more rows

Manually label above itemid

上面得到的是所有有可能用来提取 CRRT 数据的数据元素。所以下一步就是鉴别哪些元素可以用来定义治疗的开始和结束时间。这个工作得依靠专业知识进行(而不是单纯的编程问题)。

通过 linksto 列把表格分开,人工查看所有 itemid 后我们得到下面这张表格,初步筛选后把所有 itemid 标记为 "consider for further review"(待商榷) 或者 "not relevant"(无关)。

Links to CHARTEVENTS

itemid label category linksto Included/comment
225740 Dialysis Catheter Discontinued Access Lines - Invasive chartevents No - access line
227357 Dialysis Catheter Dressing Occlusive Access Lines - Invasive chartevents No - access line
225776 Dialysis Catheter Dressing Type Access Lines - Invasive chartevents No - access line
226118 Dialysis Catheter placed in outside facility Access Lines - Invasive chartevents No - access line
227753 Dialysis Catheter Placement Confirmed by X-ray Access Lines - Invasive chartevents No - access line
225323 Dialysis Catheter Site Appear Access Lines - Invasive chartevents No - access line
225725 Dialysis Catheter Tip Cultured Access Lines - Invasive chartevents No - access line
227124 Dialysis Catheter Type Access Lines - Invasive chartevents No - access line
225126 Dialysis patient Adm History/FHPA chartevents No - admission information
224149 Access Pressure Dialysis chartevents Yes - CRRT setting
224404 ART Lumen Volume Dialysis chartevents Yes - CRRT setting
224144 Blood Flow (ml/min) Dialysis chartevents Yes - CRRT setting
228004 Citrate (ACD-A) Dialysis chartevents Yes - CRRT setting
227290 CRRT mode Dialysis chartevents Yes - CRRT setting
225183 Current Goal Dialysis chartevents Yes - CRRT setting
225977 Dialysate Fluid Dialysis chartevents Yes - CRRT setting
224154 Dialysate Rate Dialysis chartevents Yes - CRRT setting
224135 Dialysis Access Site Dialysis chartevents No - access line
225954 Dialysis Access Type Dialysis chartevents No - access line
224139 Dialysis Site Appearance Dialysis chartevents No - access line
225810 Dwell Time (Peritoneal Dialysis) Dialysis chartevents No - peritoneal dialysis
224151 Effluent Pressure Dialysis chartevents Yes - CRRT setting
224150 Filter Pressure Dialysis chartevents Yes - CRRT setting
226499 Hemodialysis Output Dialysis chartevents No - hemodialysis
225958 Heparin Concentration (units/mL) Dialysis chartevents Yes - CRRT setting
224145 Heparin Dose (per hour) Dialysis chartevents Yes - CRRT setting
224191 Hourly Patient Fluid Removal Dialysis chartevents Yes - CRRT setting
225952 Medication Added \#1 (Peritoneal Dialysis) Dialysis chartevents No - peritoneal dialysis
227638 Medication Added \#2 (Peritoneal Dialysis) Dialysis chartevents No - peritoneal dialysis
225959 Medication Added Amount \#1 (Peritoneal Dialysis) Dialysis chartevents No - peritoneal dialysis
227639 Medication Added Amount \#2 (Peritoneal Dialysis) Dialysis chartevents No - peritoneal dialysis
225961 Medication Added Units \#1 (Peritoneal Dialysis) Dialysis chartevents No - peritoneal dialysis
227640 Medication Added Units \#2 (Peritoneal Dialysis) Dialysis chartevents No - peritoneal dialysis
228005 PBP (Prefilter) Replacement Rate Dialysis chartevents Yes - CRRT setting
225965 Peritoneal Dialysis Catheter Status Dialysis chartevents No - peritoneal dialysis
225963 Peritoneal Dialysis Catheter Type Dialysis chartevents No - peritoneal dialysis
225951 Peritoneal Dialysis Fluid Appearance Dialysis chartevents No - peritoneal dialysis
228006 Post Filter Replacement Rate Dialysis chartevents Yes - CRRT setting
225956 Reason for CRRT Filter Change Dialysis chartevents Yes - CRRT setting
225976 Replacement Fluid Dialysis chartevents Yes - CRRT setting
224153 Replacement Rate Dialysis chartevents Yes - CRRT setting
224152 Return Pressure Dialysis chartevents Yes - CRRT setting
225953 Solution (Peritoneal Dialysis) Dialysis chartevents No - peritoneal dialysis
224146 System Integrity Dialysis chartevents Yes - CRRT setting
226457 Ultrafiltrate Output Dialysis chartevents Yes - CRRT setting
224406 VEN Lumen Volume Dialysis chartevents Yes - CRRT setting
225806 Volume In (PD) Dialysis chartevents No - peritoneal dialysis
227438 Volume not removed Dialysis chartevents No - peritoneal dialysis
225807 Volume Out (PD) Dialysis chartevents No - peritoneal dialysis

Links to DATETIMEEVENTS

itemid label category linksto Included/comment
225318 Dialysis Catheter Cap Change Access Lines - Invasive datetimeevents No - access lines
225319 Dialysis Catheter Change over Wire Date Access Lines - Invasive datetimeevents No - access lines
225321 Dialysis Catheter Dressing Change Access Lines - Invasive datetimeevents No - access lines
225322 Dialysis Catheter Insertion Date Access Lines - Invasive datetimeevents No - access lines
225324 Dialysis CatheterTubing Change Access Lines - Invasive datetimeevents No - access lines
225128 Last dialysis Adm History/FHPA datetimeevents No - admission information

Links to INPUTEVENTS_MV

itemid label category linksto Included/comment
227525 Calcium Gluconate (CRRT) Medications inputevents_mv Yes - CRRT setting
227536 KCl (CRRT) Medications inputevents_mv Yes - CRRT setting

Links to PROCEDUREEVENTS_MV

itemid label category linksto Included/comment
225441 Hemodialysis 4-Procedures procedureevents_mv No - hemodialysis
224270 Dialysis Catheter Access Lines - Invasive procedureevents_mv No - access lines
225436 CRRT Filter Change Dialysis procedureevents_mv Yes - CRRT setting
225802 Dialysis - CRRT Dialysis procedureevents_mv Yes - CRRT setting
225803 Dialysis - CVVHD Dialysis procedureevents_mv Yes - CRRT setting
225809 Dialysis - CVVHDF Dialysis procedureevents_mv Yes - CRRT setting
225955 Dialysis - SCUF Dialysis procedureevents_mv Yes - CRRT setting
225805 Peritoneal Dialysis Dialysis procedureevents_mv No - peritoneal dialysis

Reasons for inclusion/exclusion

筛选时的纳入和排除标准为:

  • CRRT Setting - 纳入,因为只有在病人正在接受 CRRT 治疗时才会记录。
  • Access lines - 排除,这些 itemid 被排除的原因是有 access line 并不一定保证病人正在接受 CRRT 治疗。虽然对于 CRRT 治疗 access line 确实必不可少,但是病人并未正在透析时也会有这些记录。(这一段不是很懂,原文:Access lines- no (excluded) - these ITEMIDs are not included as the presence of an access line does not guarantee that CRRT is being delivered. While having an access line is a requirement of performing CRRT, these lines are present even when a patient is not actively being hemodialysed. 主要问题在于 Access line 到底指的什么。是指数据中的记录呢?还是指做透析用的输液管留置管之类的什么东西?大概后者可能性更大)
  • Peritoneal dialysis - 排除,腹膜透析是另一种类型的透析,不是 CRRT。
  • Hemolysis - 排除,和腹膜透析类似,血液透析也是另一种类型的透析而不是 CRRT。

Step 3: Define rules based upon ITEMIDs

我们已经初步筛选得到应该纳入哪些数据了,现在就可以通过对应的 itemid 筛选到的数据来进一步定义 CRRT 的治疗时间了:这些数据表示 CRRT 开始、停止、继续还是其他什么呢?

我们直接根据上面的表格按照 CHARTEVENTS, INPUTEVENTS_MV, 以及 PROCEDUREEVENTS_MV 的顺序再来看看这些数据到底代表着 CRRT 的什么过程。注意这些 _MV 后缀就是表示这些表格数据来自于 MetaVision,而 _CV 就代表来自 CareVue。所以就像之前说的,等我们把 MetaVision 数据提取完了,还必须针对 CareVue 再做一次。

table 1 of 3: itemid from CHARTEVENTS

CHARTEVENTS 表格里纳入的 CRRT 有关的数据元素有:

itemid label param_type
224144 Blood Flow (ml/min) Numeric
224145 Heparin Dose (per hour) Numeric
224146 System Integrity Text
224149 Access Pressure Numeric
224150 Filter Pressure Numeric
224151 Effluent Pressure Numeric
224152 Return Pressure Numeric
224153 Replacement Rate Numeric
224154 Dialysate Rate Numeric
224191 Hourly Patient Fluid Removal Numeric
224404 ART Lumen Volume Numeric
224406 VEN Lumen Volume Numeric
225183 Current Goal Numeric
225956 Reason for CRRT Filter Change Text
225958 Heparin Concentration (units/mL) Text
225976 Replacement Fluid Text
225977 Dialysate Fluid Text
226457 Ultrafiltrate Output Numeric
227290 CRRT mode Text
228004 Citrate (ACD-A) Numeric
228005 PBP (Prefilter) Replacement Rate Numeric
228006 Post Filter Replacement Rate Numeric

我们先看看这些数字型的数据。根据专业人士的意见,这些数据应该是 CRRT 的关键参数并且接受 CRRT 的病人会每小时都有记录。

query("SELECT ce.icustay_id, di.label, ce.charttime, ce.value, ce.valueuom
       FROM chartevents ce INNER JOIN d_items di ON
          ce.itemid = di.itemid
       WHERE ce.icustay_id = 246866
       AND ce.itemid in
       (
          224404, -- | ART Lumen Volume
          224406, -- | VEN Lumen Volume
          228004, -- | Citrate (ACD-A)
          224145, -- | Heparin Dose (per hour)
          225183, -- | Current Goal
          224149, -- | Access Pressure
          224144, -- | Blood Flow (ml/min)
          224154, -- | Dialysate Rate
          224151, -- | Effluent Pressure
          224150, -- | Filter Pressure
          224191, -- | Hourly Patient Fluid Removal
          228005, -- | PBP (Prefilter) Replacement Rate
          228006, -- | Post Filter Replacement Rate
          224153, -- | Replacement Rate
          224152, -- | Return Pressure
          226457  -- | Ultrafiltrate Output
      )
      ORDER BY ce.icustay_id, ce.charttime, di.label;")

得到:

* icustay_id label charttime value valueuom
1 246866 ART Lumen Volume 2161-12-11 20:00:00 1.3 mL
2 246866 VEN Lumen Volume 2161-12-11 20:00:00 1.2 mL
3 246866 Access Pressure 2161-12-11 23:43:00 -87 mmHg
4 246866 Blood Flow (ml/min) 2161-12-11 23:43:00 200 ml/min
5 246866 Citrate (ACD-A) 2161-12-11 23:43:00 0 ml/hr
6 246866 Current Goal 2161-12-11 23:43:00 0 mL
7 246866 Dialysate Rate 2161-12-11 23:43:00 500 ml/hr
8 246866 Effluent Pressure 2161-12-11 23:43:00 118 mmHg
9 246866 Filter Pressure 2161-12-11 23:43:00 197 mmHg
10 246866 Heparin Dose (per hour) 2161-12-11 23:43:00 0 units

从结果中可以看到 ART Lumen VolumeVEN Lumen Volume 的记录时间和其它数据记录时间差别很大。和专业人员讨论后他们认为这是合理的,这些液体流速参数意味着输液管是开着的,但是这并不代表 CRRT 正在进行(这一句不知道翻译是否正确,原文:as these volumes indicate settings to keep open the line and are not directly relevant to the administration of CRRT)—— 最好的情况是这些数据是冗余的,最坏的情况是引起对判断 CRRT 开始和停止的误判。因此最后我们把这两项去掉了。剩下的 itemid 有:

224149, -- Access Pressure
224144, -- Blood Flow (ml/min)
228004, -- Citrate (ACD-A)
225183, -- Current Goal
224154, -- Dialysate Rate
224151, -- Effluent Pressure
224150, -- Filter Pressure
224145, -- Heparin Dose (per hour)
224191, -- Hourly Patient Fluid Removal
228005, -- PBP (Prefilter) Replacement Rate
228006, -- Post Filter Replacement Rate
224153, -- Replacement Rate
224152, -- Return Pressure
226457 -- Ultrafiltrate Output

再来看剩下的字符型数据:

itemid label param_type
224146 System Integrity Text
225956 Reason for CRRT Filter Change Text
225958 Heparin Concentration (units/mL) Text
225976 Replacement Fluid Text
225977 Dialysate Fluid Text
227290 CRRT mode Text

我们一个一个 itemid 往下看。首先,为了查看方便,我们再来定义一个简单的函数:

query_item <- function(item_id){
  qur <- stringr::str_replace_all(paste("
         SELECT value
         , COUNT(distinct icustay_id) AS number_of_patients
         , COUNT(icustay_id) AS number_of_observations
         FROM chartevents
         WHERE itemid = '",item_id,
         "' GROUP BY value ORDER BY value;", sep = ""), "[\n]", "")

  query(qur)
}

224146 - System Integrity

用上面定义的偷懒函数直接 query_item(224146)得:

   value                      number_of_patients number_of_observations
 * <chr>                                   <dbl>                  <dbl>
 1 Active                                    539                  48072
 2 Clots Increasing                          245                   1419
 3 Clots Present                             427                  16836
 4 Clotted                                   233                    441
 5 Discontinued                              339                    771
 6 Line pressure inconsistent                127                    431
 7 New Filter                                357                   1040
 8 No Clot Present                           275                   2615
 9 Recirculating                             172                    466
10 Reinitiated                               336                   1207

和专业人员讨论后,我们得知这每一项都代表 CRRT 治疗的不同阶段。我们简单地分为三类:started,stopped 或者 active(即已开始,已停止和进行中)。既然 active 表明 CRRT 进行中,那么 active 首次出现也有可能指开始,因此我们直接归类为 ”active/started“。所以人工整理后得到:

value count interpretation
Active 539 CRRT active/started
Clots Increasing 245 CRRT active/started
Clots Present 427 CRRT active/started
Clotted 233 CRRT stopped
Discontinued 339 CRRT stopped
Line pressure inconsistent 127 CRRT active/started
New Filter 357 CRRT started
No Clot Present 275 CRRT active/started
Recirculating 172 CRRT stopped
Reinitiated 336 CRRT started

后面我们再写代码来合并这些 itemid

225956 - Reason for CRRT Filter Change

query_item(225956)

  value        number_of_patients number_of_observations
* <chr>                     <dbl>                  <dbl>
1 Clotted                      50                     69
2 Line changed                  9                     11
3 Procedure                    20                     31

这三项是 stop(即 CRRT 停止),因为这时候要更换滤器。随后的 CRRT 则为 restart(重新开始),而不是当前 CRRT 的延续。(这一段不是很懂是要表示什么,按理来说更换滤器之后开始应该是算作一次啊)

225958 - Heparin Concentration (units/mL)

query_item(225958)

  value          number_of_patients number_of_observations
* <chr>                       <dbl>                  <dbl>
1 100                            16                    995
2 1000                           41                     94
3 Not applicable                120                   8796

这些是 CRRT 的常规参数,可以和其他数字型字段放到一起。(这什么意思??)

225976 - Replacement Fluid

query_item(225976):

  value                   number_of_patients number_of_observations
* <chr>                                <dbl>                  <dbl>
1 None                                    14                     19
2 Normal Saline 0.9%                       1                     12
3 Prismasate K0                           78                    201
4 Prismasate K2                          459                  27603
5 Prismasate K4                          387                  30872
6 Sodium Bicarb 150/D5W                    2                      8
7 Sodium Bicarb 75/0.45NS                  6                     48

CRRT 的常规参数,可以和其他数字型字段放到一起。

225977 - Dialysate Fluid

query_item(225977):

  value         number_of_patients number_of_observations
* <chr>                      <dbl>                  <dbl>
1 None                          97                   6025
2 Normal Saline                 32                    695
3 Prismasate K0                 89                    231
4 Prismasate K2                438                  24271
5 Prismasate K4                357                  27320

CRRT 的常规参数,可以和其他数字型字段放到一起。

227290 - CRRT mode

query_item(227290):

  value  number_of_patients number_of_observations
* <chr>               <dbl>                  <dbl>
1 CVVH                   40                   1280
2 CVVHD                  24                    583
3 CVVHDF                498                  25533
4 SCUF                    1                      7

虽然看起来不错,但是有可能 CRRT mode(CRRT 模式)和真正 CRRT 治疗不是同时记录的。我们来看看是不是所有有 CRRT 参数记录的病人都记录了 CRRT mode

query("WITH t1 AS
(
  SELECT icustay_id,
  MAX(CASE WHEN
          itemid = 227290 THEN 1
      ELSE 0 END) AS HasMode
  FROM chartevents ce
  WHERE itemid IN
  (
  227290, --  CRRT mode
  228004, --  Citrate (ACD-A)
  225958, --  Heparin Concentration (units/mL)
  224145, --  Heparin Dose (per hour)
  225183, --  Current Goal -- always there
  224149, --  Access Pressure
  224144, --  Blood Flow (ml/min)
  225977, --  Dialysate Fluid
  224154, --  Dialysate Rate
  224151, --  Effluent Pressure
  224150, --  Filter Pressure
  224191, --  Hourly Patient Fluid Removal
  228005, --  PBP (Prefilter) Replacement Rate
  228006, --  Post Filter Replacement Rate
  225976, --  Replacement Fluid
  224153, --  Replacement Rate
  224152, --  Return Pressure
  226457  --  Ultrafiltrate Output
  )
  GROUP BY icustay_id
)
  SELECT COUNT(icustay_id) AS Num_ICUSTAY_ID
  , SUM(hasmode) AS Num_With_Mode
  FROM t1;")

结果:

num_icustay_id num_with_mode
784 533

或者现在进一步查询,有多少人没有记录其他 CRRT 参数而仅有 CRRT mode 呢?

query("
WITH t1 AS 
  (
    SELECT icustay_id, charttime
    , MAX(CASE WHEN
            itemid = 227290 THEN 1
          ELSE 0 END) AS HasCRRTMode
    , MAX(CASE WHEN
            itemid != 227290 THEN 1
          ELSE 0 END) AS OtherITEMID
    FROM chartevents ce
    WHERE itemid in
    (
      227290, --  CRRT mode
      228004, --  Citrate (ACD-A)
      225958, --  Heparin Concentration (units/mL)
      224145, --  Heparin Dose (per hour)
      225183, --  Current Goal -- always there
      224149, --  Access Pressure
      224144, --  Blood Flow (ml/min)
      225977, --  Dialysate Fluid
      224154, --  Dialysate Rate
      224151, --  Effluent Pressure
      224150, --  Filter Pressure
      224191, --  Hourly Patient Fluid Removal
      228005, --  PBP (Prefilter) Replacement Rate
      228006, --  Post Filter Replacement Rate
      225976, --  Replacement Fluid
      224153, --  Replacement Rate
      224152, --  Return Pressure
      226457  --  Ultrafiltrate Output
    )
    GROUP BY icustay_id, charttime
  )
  SELECT count(icustay_id) AS NumObs
  , SUM(CASE WHEN HasCRRTMode = 1 AND OtherITEMID = 1 THEN 1 ELSE 0 END) AS Both
  , SUM(CASE WHEN HasCRRTMode = 1 AND OtherITEMID = 0 THEN 1 ELSE 0 END) AS OnlyCRRTMode
  , SUM(CASE WHEN HasCRRTMode = 0 AND OtherITEMID = 1 THEN 1 ELSE 0 END) AS NoCRRTMode
  FROM t1;"
)

得到:

- numobs both onlycrrtmode nocrrtmode
0 81162 27446 1 53778

可以看到 CRRT mode 这个参数基本上很冗余(27446/81162 例既有 CRRT mode 的记录也有其他,而只有个别人只有 CRRT mode 记录而没有其他),并且也不能表示 CRRT 正在进行中,而且数据也不完全兼容(53778/81162 例接受 CRRT 治疗的病人其实并没有 CRRT mode 的记录。不知道后面这句话指的具体是什么,但是我注意到在上面的表格里 81162 != 27446 + 1 + 53778),最终我们决定从 item_id 里把它排除了。

CHARTEVENTS wrap up

稍稍总结下,最后 CHARTEVENTS 里剩下的表示 CRRT 的 started/ongoing 的 itemid 是这些:

224149, -- Access Pressure
224144, -- Blood Flow (ml/min)
228004, -- Citrate (ACD-A)
225183, -- Current Goal
225977, -- Dialysate Fluid
224154, -- Dialysate Rate
224151, -- Effluent Pressure
224150, -- Filter Pressure
225958, -- Heparin Concentration (units/mL)
224145, -- Heparin Dose (per hour)
224191, -- Hourly Patient Fluid Removal
228005, -- PBP (Prefilter) Replacement Rate
228006, -- Post Filter Replacement Rate
225976, -- Replacement Fluid
224153, -- Replacement Rate
224152, -- Return Pressure
226457 -- Ultrafiltrate Output

还有下面这些表示 CRRT 的 started/stopped/ongoing 但是还需要特别处理的:

224146, -- System Integrity
225956 -- Reason for CRRT Filter Change

table 2 of 3: INPUTEVENTS_MV

INPUTEVENT_MV 里的 item_id 有:

227525,-- Calcium Gluconate (CRRT)
227536 -- KCl (CRRT)

根据专业人士的意见,这些项目肯定是做 CRRT 时才会有的,不需要再特别去看,我们直接把它们标记为 CRRT active/started。

table 3 of 3: PROCEDUREEVENTS_MV

PROCEDUREEVENTS_MV 里的 item_id 有:

itemid label
225436 CRRT Filter Change
225802 Dialysis - CRRT
225803 Dialysis - CVVHD
225809 Dialysis - CVVHDF
225955 Dialysis - SCUF

唯一有点争议的 item_id225436(CRRT Filter Change)。这个 item_id 代表 CRRT 中断,并且更换完成后 CRRT 再开始。原则上这可以作为结束时间,但是这一记录没有 100% 完整(不知道什么意思),专业人士的意见是相比把这个记录作为 CRRT 结束时间,可能直接忽略掉更好。

因此最终纳入的是:

225802, -- Dialysis - CRRT
225803, -- Dialysis - CVVHD
225809, -- Dialysis - CVVHDF
225955 -- Dialysis - SCUF
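
在进入下一步之前,先放一个最小的 R 草图,示意这些 itemid 大概会怎么用:按 icustay_id 取上面 CHARTEVENTS 那组设置类 itemid 的最早、最晚记录时间,作为 CRRT 起止时间的一个很粗糙的近似(真正的定义在下一篇里还要结合开始/停止事件来细化):

query("
  SELECT icustay_id
       , MIN(charttime) AS crrt_start_approx
       , MAX(charttime) AS crrt_end_approx
  FROM chartevents
  WHERE itemid IN (224149, 224144, 228004, 225183, 225977, 224154,
                   224151, 224150, 225958, 224145, 224191, 228005,
                   228006, 225976, 224153, 224152, 226457)
  GROUP BY icustay_id
  ORDER BY icustay_id;")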

到这里,最繁琐的第 3 步就做完了:人工查看每个 item_id,依据专业知识决定是否纳入,以及把纳入的元素归到哪一类。下面就是利用我们选好的 item_id 来定义 CRRT 的时间了。

下一篇继续。Cheers.
