1. 不平衡的数据集

像金融欺诈等数据集为典型的不平衡的数据集

1.1. 评价指标陷阱

正确率、准确率等对于不平衡数据集存在陷阱

from xgboost import XGBClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
# Remove 'id' and 'target' columns
labels = df_train.columns[2:]
X = df_train[labels]
y = df_train['target']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)
model = XGBClassifier()
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy: %.2f%%" % (accuracy * 100.0))
>>>
Accuracy: 96.36%

model = XGBClassifier()
model.fit(X_train[['ps_calc_01']], y_train)
y_pred = model.predict(X_test[['ps_calc_01']])
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy: %.2f%%" % (accuracy * 100.0))
>>>
Accuracy: 96.36%

1.2. 改善的评估指标

使用混合矩阵评估结果
使用归一化的基尼系数评估
> 0 for random guessing,
> to approximately 0.5 for a perfect score.

1.2.1. Confusion matrix

from sklearn.metrics import confusion_matrix
from matplotlib import pyplot as plt
conf_mat = confusion_matrix(y_true=y_test, y_pred=y_pred)
print('Confusion matrix:\n', conf_mat)
labels = ['Class 0', 'Class 1']
fig = plt.figure()
ax = fig.add_subplot(111)
cax = ax.matshow(conf_mat, cmap=plt.cm.Blues)
fig.colorbar(cax)
ax.set_xticklabels([''] + labels)
ax.set_yticklabels([''] + labels)
plt.xlabel('Predicted')
plt.ylabel('Expected')
plt.show()
Confusion matrix:
 [[114709      0]
 [  4334      0]]

1.2.2. Normalized Gini Coefficient

2. 重采样

通过对数据集采样降低时间复杂度
过采样/欠采样解决数分布不均的问题class-imbalance

2.1. 欠采样

2.1.1. 随机欠采样

# 2. Class count
count_class_0, count_class_1 = df_train.target.value_counts()
# 3. Divide by class
df_class_0 = df_train[df_train['target'] == 0]
df_class_1 = df_train[df_train['target'] == 1]
Random under-sampling
In [6]:
df_class_0_under = df_class_0.sample(count_class_1)
df_test_under = pd.concat([df_class_0_under, df_class_1], axis=0)
print('Random under-sampling:')
print(df_test_under.target.value_counts())
df_test_under.target.value_counts().plot(kind='bar', title='Count (target)

>>>
Random under-sampling:
1    21694
0    21694
Name: target, dtype: int64

from imblearn.under_sampling import RandomUnderSampler
rus = RandomUnderSampler(return_indices=True)
X_rus, y_rus, id_rus = rus.fit_sample(X, y)
print('Removed indexes:', id_rus)
plot_2d_space(X_rus, y_rus, 'Random under-sampling')
>>>
Removed indexes: [51 94 73  3 48 81 20 10 29 85  4  8  9 14 16 40 67 70 71 74]

2.1.2. EasyEnsemble （周志华）

https://imbalanced-learn.org/en/stable/generated/imblearn.ensemble.EasyEnsemble.html

2.1.3. Tomek links

from imblearn.under_sampling import TomekLinks
tl = TomekLinks(return_indices=True, ratio='majority')
X_tl, y_tl, id_tl = tl.fit_sample(X, y)
print('Removed indexes:', id_tl)
>>>
Removed indexes: [ 0  1  2  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25
 26 27 28 29 30 31 32 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51
 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76
 77 78 79 80 81 82 83 84 85 86 87 88 90 91 92 93 94 95 97 98 99]

2.1.4. ClusterCentroids

from imblearn.under_sampling import ClusterCentroids
cc = ClusterCentroids(ratio={0: 10})
X_cc, y_cc = cc.fit_sample(X, y)
plot_2d_space(X_cc, y_cc, 'Cluster Centroids under-sampling')

2.2. 过采样

2.2.1. 随机过采样

df_class_1_over = df_class_1.sample(count_class_0, replace=True)
df_test_over = pd.concat([df_class_0, df_class_1_over], axis=0)
print('Random over-sampling:')
print(df_test_over.target.value_counts())
df_test_over.target.value_counts().plot(kind='bar', title='Count (target)');
>>>
Random over-sampling:
1    573518
0    573518
Name: target, dtype: int64

from imblearn.over_sampling import RandomOverSampler
ros = RandomOverSampler()
X_ros, y_ros = ros.fit_sample(X, y)
print(X_ros.shape[0] - X.shape[0], 'new random picked points')

2.2.2. SMOTE 算法

参考:https://arxiv.org/abs/1106.1813
https://www.jianshu.com/p/13fc0f7f5565

JAIR'2002的文章《SMOTE: Synthetic Minority Over-sampling Technique》提出了一种过采样算法SMOTE

本算法基于“插值”来为少数类合成新的样本。下面介绍如何合成新的样本。

设训练集的一个少数类的样本数为 T ，那么SMOTE算法将为这个少数类合成 NT 个新样本。这里要求 N 必须是正整数，如果给定的 N<1 那么算法将“认为”少数类的样本数 $T=NT$ ，并将强制 $N=1$ 。

考虑该少数类的一个样本 i ，其特征向量为 $xi,i∈{1,...,T}$ ：

首先从该少数类的全部 T 个样本中找到样本 xi 的 k 个近邻（例如用欧氏距离），记为 xi(near),near∈{1,...,k} ；
然后从这 k 个近邻中随机选择一个样本 xi(nn) ，再生成一个 0 到 1 之间的随机数 ζ1 ，从而合成一个新样本 xi1 ：

$$xi1=xi+ζ1⋅(xi(nn)−xi)$$

将步骤2重复进行 N 次，从而可以合成 N 个新样本：xinew,new∈1,...,N。

那么，对全部的 T 个少数类样本进行上述操作，便可为该少数类合成 NT 个新样本。

如果样本的特征维数是 2 维，那么每个样本都可以用二维平面上的一个点来表示。SMOTE算法所合成出的一个新样本 xi1 相当于是表示样本 xi 的点和表示样本 xi(nn) 的点之间所连线段上的一个点。所以说该算法是基于“插值”来合成新样本。

SMOTE算法的缺陷
1. 是在近邻选择时,存在一定的盲目性。从上面的算法流程可以看出,在算法执行过程中,需要确定K值,即选择多少个近邻样本,这需要用户自行解决。从K值的定义可以看出,K值的下限是M值(M值为从K个近邻中随机挑选出的近邻样本的个数,且有M< K),M的大小可以根据负类样本数量、正类样本数量和数据集最后需要达到的平衡率决定。但K值的上限没有办法确定,只能根据具体的数据集去反复测试。因此如何确定K值,才能使算法达到最优这是未知的。
2. 另外,该算法无法克服非平衡数据集的数据分布问题,容易产生分布边缘化问题。由于负类样本的分布决定了其可选择的近邻,如果一个负类样本处在负类样本集的分布边缘,则由此负类样本和相邻样本产生的“人造”样本也会处在这个边缘,且会越来越边缘化,从而模糊了正类样本和负类样本的边界,而且使边界变得越来越模糊。这种边界模糊性,虽然使数据集的平衡性得到了改善,但加大了分类算法进行分类的难度．

2.2.2.1. 改进

针对SMOTE算法的进一步改进针对SMOTE算法存在的边缘化和盲目性等问题,很多人纷纷提出了新的改进办法,在一定程度上改进了算法的性能,但还存在许多需要解决的问题。

Han等人Borderline-SMOTE: A New Over-Sampling Method in Imbalanced Data Sets Learning在SMOTE算法基础上进行了改进,提出了Borderhne.SMOTE算法,解决了生成样本重叠(Overlapping)的问题该算法在运行的过程中,查找一个适当的区域,该区域可以较好地反应数据集的性质,然后在该区域内进行插值,以使新增加的“人造”样本更有效。这个适当的区域一般由经验给定,因此算法在执行的过程中有一定的局限性。

Over-sampling followed by under-sampling
Now, we will do a combination of over-sampling and under-sampling, using the SMOTE and Tomek links techniques:
In [17]:

from imblearn.combine import SMOTETomek
smt = SMOTETomek(ratio='auto')
X_smt, y_smt = smt.fit_sample(X, y)
plot_2d_space(X_smt, y_smt, 'SMOTE + Tomek links')

2.2.2.2. 实践

from imblearn.over_sampling import SMOTE

smote = SMOTE(ratio='minority')

X_sm, y_sm = smote.fit_sample(X, y)

plot_2d_space(X_sm, y_sm, 'SMOTE over-sampling')

3. 参考资料

The imbalanced-learn documentation:
http://contrib.scikit-learn.org/imbalanced-learn/stable/index.html
The imbalanced-learn GitHub:
https://github.com/scikit-learn-contrib/imbalanced-learn
Comparison of the combination of over- and under-sampling algorithms:
http://contrib.scikit-learn.org/imbalanced-learn/stable/auto_examples/combine/plot_comparison_combine.html
Chawla, Nitesh V., et al. "SMOTE: synthetic minority over-sampling technique." Journal of artificial intelligence research 16 (2002):
https://www.jair.org/media/953/live-953-2037-jair.pdf
Haibo He, Edwardo A. Garcia的Learning from Imbalanced Data[重点推荐]