# Basic Principles of Machine Learning Algorithms, with Python and R Code


1. Supervised learning

2. Unsupervised learning

3. Reinforcement learning

The algorithms covered below:

• Linear regression
• Logistic regression
• Decision trees
• SVM
• Naive Bayes
• KNN
• K-Means clustering
• Random forest
• Dimensionality reduction algorithms
• Gradient boosting algorithms: GBM (gradient boosting machine), XGBoost, LightGBM, and CatBoost

## 1. Linear Regression

Linear regression fits a straight line, Y = aX + b, where:

Y: dependent variable

a: slope

X: independent variable

b: intercept
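As a quick sanity check of the equation, the slope a and intercept b can be recovered from data by ordinary least squares. The sketch below uses NumPy and a hypothetical noise-free data set generated from y = 2x + 1:

```python
import numpy as np

# Hypothetical 1-D data drawn from Y = 2X + 1 with no noise
X = np.array([0.0, 1.0, 2.0, 3.0])
Y = 2.0 * X + 1.0

# Least squares: design matrix [X, 1] solves for the pair (a, b)
A = np.column_stack([X, np.ones_like(X)])
(a, b), *_ = np.linalg.lstsq(A, Y, rcond=None)

print(round(a, 6), round(b, 6))  # slope a = 2.0, intercept b = 1.0
```

With noisy data the recovered a and b would only approximate the generating values.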

#### Python code

```python
# Import Library
# Import other necessary libraries like pandas, numpy...
from sklearn import linear_model
# Identify feature and response variable(s); values must be numeric numpy arrays
x_train = input_variables_values_training_datasets
y_train = target_variables_values_training_datasets
x_test = input_variables_values_test_datasets
# Create linear regression object
linear = linear_model.LinearRegression()
# Train the model using the training sets and check score
linear.fit(x_train, y_train)
linear.score(x_train, y_train)
# Equation coefficient and intercept
print('Coefficient: \n', linear.coef_)
print('Intercept: \n', linear.intercept_)
# Predict output
predicted = linear.predict(x_test)
```


#### R code

```r
# Load train and test datasets
# Identify feature and response variable(s); values must be numeric
x_train <- input_variables_values_training_datasets
y_train <- target_variables_values_training_datasets
x_test <- input_variables_values_test_datasets
x <- cbind(x_train, y_train)
# Train the model using the training sets and check score
linear <- lm(y_train ~ ., data = x)
summary(linear)
# Predict output
predicted <- predict(linear, x_test)


## 2. Logistic Regression

The odds of an event are:

odds = p / (1 - p) = probability of the event occurring / probability of the event not occurring

Taking the natural log gives the logit, which is modeled as a linear function of the predictors:

logit(p) = ln(p / (1 - p)) = b0 + b1*X1 + b2*X2 + b3*X3 + ... + bk*Xk
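The logit and its inverse (the sigmoid) are simple enough to check numerically; a minimal sketch with a hypothetical probability of 0.8:

```python
import math

def logit(p):
    """Log-odds: ln(p / (1 - p))."""
    return math.log(p / (1 - p))

def sigmoid(z):
    """Inverse of the logit; maps log-odds back to a probability."""
    return 1 / (1 + math.exp(-z))

p = 0.8
z = logit(p)                 # log-odds of an 80% probability
print(round(z, 4))           # 1.3863, i.e. ln(4)
print(round(sigmoid(z), 4))  # 0.8, recovering the original probability
```

Logistic regression fits the coefficients b0..bk so that the sigmoid of the linear combination matches the observed class probabilities.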


#### Python code

```python
# Import Library
from sklearn.linear_model import LogisticRegression
# Assumed you have X (predictor) and y (target) for the training set
# and x_test (predictor) for the test set
# Create logistic regression object
model = LogisticRegression()
# Train the model using the training sets and check score
model.fit(X, y)
model.score(X, y)
# Equation coefficient and intercept
print('Coefficient: \n', model.coef_)
print('Intercept: \n', model.intercept_)
# Predict output
predicted = model.predict(x_test)
```


#### R code

```r
x <- cbind(x_train, y_train)
# Train the model using the training sets and check score
logistic <- glm(y_train ~ ., data = x, family = 'binomial')
summary(logistic)
# Predict output (type = 'response' returns probabilities rather than log-odds)
predicted <- predict(logistic, x_test, type = 'response')
```


Some ways to improve a logistic regression model:

• Include interaction terms
• Remove features
• Regularization techniques
• Use a non-linear model

## 3. Decision Trees

#### Python code

```python
# Import Library
# Import other necessary libraries like pandas, numpy...
from sklearn import tree
# Assumed you have X (predictor) and y (target) for the training set
# and x_test (predictor) for the test set
# Create tree object; the split criterion can be 'gini' (default)
# or 'entropy' (information gain)
model = tree.DecisionTreeClassifier(criterion='gini')
# model = tree.DecisionTreeRegressor() for regression
# Train the model using the training sets and check score
model.fit(X, y)
model.score(X, y)
# Predict output
predicted = model.predict(x_test)
```


#### R code

```r
library(rpart)
x <- cbind(x_train, y_train)
# Grow the tree
fit <- rpart(y_train ~ ., data = x, method = "class")
summary(fit)
# Predict output
predicted <- predict(fit, x_test)
```


## 4. Support Vector Machine (SVM)

Think of SVM as a game of separating colored balls with a line or plane:

• You can draw the line/plane at any angle (not just horizontally or vertically as in the classic game)
• The goal of the game is to separate balls of different colors into different rooms
• The balls do not move

#### Python code

```python
# Import Library
from sklearn import svm
# Assumed you have X (predictor) and y (target) for the training set
# and x_test (predictor) for the test set
# Create SVM classification object; SVC has many options,
# this is the simple classification case
model = svm.SVC()
# Train the model using the training sets and check score
model.fit(X, y)
model.score(X, y)
# Predict output
predicted = model.predict(x_test)
```


#### R code

```r
library(e1071)
x <- cbind(x_train, y_train)
# Fitting model
fit <- svm(y_train ~ ., data = x)
summary(fit)
# Predict output
predicted <- predict(fit, x_test)
```


## 5. Naive Bayes

Bayes' theorem provides a way to compute the posterior probability P(c|x) from P(c), P(x), and P(x|c):

P(c|x) = P(x|c) * P(c) / P(x)

• P(c|x) is the posterior probability of the class (target) given the predictor (attribute).
• P(c) is the prior probability of the class.
• P(x|c) is the likelihood: the probability of the predictor given the class.
• P(x) is the prior probability of the predictor.
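A toy numeric check of Bayes' theorem; the spam-filter priors and likelihoods below are made up for illustration:

```python
# Hypothetical numbers: classifying an email as spam given it contains a word
p_spam = 0.3             # P(c): prior probability of the class
p_word_given_spam = 0.6  # P(x|c): likelihood
p_word_given_ham = 0.1

# P(x): total probability of seeing the word, summed over both classes
p_word = p_word_given_spam * p_spam + p_word_given_ham * (1 - p_spam)

# Bayes' theorem: P(c|x) = P(x|c) * P(c) / P(x)
p_spam_given_word = p_word_given_spam * p_spam / p_word
print(round(p_spam_given_word, 2))  # 0.72
```

The "naive" part of Naive Bayes is treating multiple predictors as conditionally independent, so their likelihoods simply multiply.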

#### Python code

```python
# Import Library
from sklearn.naive_bayes import GaussianNB
# Assumed you have X (predictor) and y (target) for the training set
# and x_test (predictor) for the test set
# Create a Naive Bayes object; there are other variants for other
# distributions, such as MultinomialNB and BernoulliNB
model = GaussianNB()
# Train the model using the training sets and check score
model.fit(X, y)
# Predict output
predicted = model.predict(x_test)
```


#### R code

```r
library(e1071)
x <- cbind(x_train, y_train)
# Fitting model
fit <- naiveBayes(y_train ~ ., data = x)
summary(fit)
# Predict output
predicted <- predict(fit, x_test)
```


## 6. KNN (k-Nearest Neighbors)

KNN maps easily onto real life. If you want to learn about a person you have no information on, you might look at their close friends and the circles they move in, and infer their information from that!

Things to consider before selecting KNN:

• KNN is computationally expensive
• Variables should be normalized, otherwise features with larger ranges will bias the distance calculation
• Do the preprocessing work (outlier removal, noise reduction) before running KNN
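The normalization point matters because KNN's Euclidean distance is dominated by whichever feature has the largest range. A minimal min-max scaling sketch (the feature values are hypothetical):

```python
import numpy as np

# Hypothetical feature matrix: column 0 spans [0.2, 0.9],
# column 1 spans [1000, 9000] and would dominate raw distances
X = np.array([[0.2, 3000.0],
              [0.9, 9000.0],
              [0.5, 1000.0]])

# Min-max scaling maps every column onto [0, 1] so that no single
# feature dominates the Euclidean distance used by KNN
X_min = X.min(axis=0)
X_max = X.max(axis=0)
X_scaled = (X - X_min) / (X_max - X_min)

print(X_scaled.min(), X_scaled.max())  # both columns now span [0, 1]
```

In practice the same transformation (with the training set's min and max) must also be applied to the test set.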

#### Python code

```python
# Import Library
from sklearn.neighbors import KNeighborsClassifier
# Assumed you have X (predictor) and y (target) for the training set
# and x_test (predictor) for the test set
# Create KNeighbors classifier object (default value for n_neighbors is 5)
model = KNeighborsClassifier(n_neighbors=6)
# Train the model using the training sets and check score
model.fit(X, y)
# Predict output
predicted = model.predict(x_test)
```


#### R code

```r
# kNN lives in the 'class' package; it takes the train/test matrices
# and the training labels directly rather than a formula
library(class)
# Fitting the model and predicting happen in one step
predicted <- knn(train = x_train, test = x_test, cl = y_train, k = 5)
summary(predicted)
```


## 7. K-Means

How k-means forms clusters:

1. k-means picks k points for each cluster, known as centroids.
2. Each data point forms a cluster with the closest centroid, giving k clusters.
3. The centroid of each cluster is recomputed from its current members, giving new centroids.
4. With the new centroids, steps 2 and 3 are repeated: each data point is reassigned to its nearest new centroid. This repeats until convergence, i.e. the centroids stop changing.
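The four steps above can be sketched directly in NumPy (Lloyd's algorithm). The data and the fixed iteration count are hypothetical; production code would stop once the centroids stop moving:

```python
import numpy as np

def kmeans(X, k, n_iter=10, seed=0):
    """Minimal sketch of the steps above (Lloyd's algorithm)."""
    rng = np.random.default_rng(seed)
    # Step 1: pick k data points as the initial centroids
    centroids = X[rng.choice(len(X), k, replace=False)]
    for _ in range(n_iter):
        # Step 2: assign each point to its closest centroid
        dists = np.linalg.norm(X[:, None] - centroids[None, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 3: recompute each centroid from its current members
        centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
    return labels, centroids

# Two well-separated hypothetical blobs
X = np.array([[0.0, 0.0], [0.1, 0.2], [5.0, 5.0], [5.1, 4.9]])
labels, centroids = kmeans(X, 2)
print(labels)  # points 0 and 1 share one cluster; points 2 and 3 the other
```

This sketch ignores the empty-cluster edge case; library implementations such as sklearn's KMeans handle it and also rerun with several initializations.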

#### Python code

```python
# Import Library
from sklearn.cluster import KMeans
# Assumed you have X (attributes) for the training set and x_test (attributes) for the test set
# Create KMeans object
k_means = KMeans(n_clusters=3, random_state=0)
# Train the model using the training set
k_means.fit(X)
# Predict output
predicted = k_means.predict(x_test)
```


#### R code

```r
library(cluster)
fit <- kmeans(X, 3)  # 3-cluster solution
```


## 8.随机森林

1. 如果训练集中的病例数为N，则随机抽取N例样本，但进行替换。这个样本将是培养树木的训练工具。
2. 如果存在M个输入变量，则指定一个数字m＞M，使得在每个节点处，随机地从M中选择m个变量，并使用这些m上的最佳分割来分割节点。在森林生长过程中，M的值保持不变。
3. 每棵树的生长尽可能最大，没有修剪。
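Steps 1 and 2 can be sketched with NumPy's sampling routines; the choice m = sqrt(M) below is a common heuristic for classification, not something the steps themselves prescribe:

```python
import numpy as np

rng = np.random.default_rng(0)

# Step 1: from N training cases, draw N samples *with replacement*;
# this bootstrap sample is the training set for one tree
N = 10
indices = rng.choice(N, size=N, replace=True)

# Step 2: at each node only m of the M features are considered;
# m = sqrt(M) is a common (hypothetical here) default
M = 16
m = int(np.sqrt(M))
feature_subset = rng.choice(M, size=m, replace=False)

print(len(indices), m)  # N samples per tree, m candidate features per split
```

Sampling with replacement means some cases appear several times in a tree's training set while others are left out; the left-out ("out-of-bag") cases give a free validation estimate.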

#### Python code

```python
# Import Library
from sklearn.ensemble import RandomForestClassifier
# Assumed you have X (predictor) and y (target) for the training set
# and x_test (predictor) for the test set
# Create Random Forest object
model = RandomForestClassifier()
# Train the model using the training sets and check score
model.fit(X, y)
# Predict output
predicted = model.predict(x_test)
```


#### R code

```r
library(randomForest)
x <- cbind(x_train, y_train)
# Fitting model
fit <- randomForest(y_train ~ ., data = x, ntree = 500)
summary(fit)
# Predict output
predicted <- predict(fit, x_test)
```


## 9. Dimensionality Reduction Algorithms

#### Python code

```python
# Import Library
from sklearn import decomposition
# Assumed you have training and test data sets as train and test
# Create PCA object (if n_components is not set, it defaults to min(n_samples, n_features))
pca = decomposition.PCA(n_components=k)
# For factor analysis:
# fa = decomposition.FactorAnalysis()
# Reduce the dimension of the training set using PCA
train_reduced = pca.fit_transform(train)
# Reduce the dimension of the test set
test_reduced = pca.transform(test)
```


#### R code

```r
library(stats)
pca <- princomp(train, cor = TRUE)
train_reduced <- predict(pca, train)
test_reduced <- predict(pca, test)
```


## 10. Gradient Boosting Algorithms

### 10.1 GBM

GBM (gradient boosting machine) is a boosting algorithm used when we deal with a lot of data and want a prediction with high predictive power. Boosting is an ensemble technique that combines the predictions of several base estimators in order to improve robustness over a single estimator: it combines multiple weak or average predictors into one strong predictor. These boosting algorithms consistently perform well in data science competitions such as Kaggle, AV Hackathon, and CrowdAnalytix.

#### Python code

```python
# Import Library
from sklearn.ensemble import GradientBoostingClassifier
# Assumed you have X (predictor) and y (target) for the training set
# and x_test (predictor) for the test set
# Create Gradient Boosting Classifier object (typical starting hyper-parameters)
model = GradientBoostingClassifier(n_estimators=100, learning_rate=1.0, max_depth=1, random_state=0)
# Train the model using the training sets and check score
model.fit(X, y)
# Predict output
predicted = model.predict(x_test)
```


#### R code

```r
library(caret)
x <- cbind(x_train, y_train)
# Fitting model
fitControl <- trainControl(method = "repeatedcv", number = 4, repeats = 4)
fit <- train(y_train ~ ., data = x, method = "gbm", trControl = fitControl, verbose = FALSE)
# Predict output (probability of the second class)
predicted <- predict(fit, x_test, type = "prob")[, 2]
```


### 10.2 XGBoost

XGBoost has very high predictive power, which makes it a top choice when accuracy matters: it provides both linear model and tree learning algorithms, and runs almost 10x faster than earlier gradient boosting implementations.

One of the most interesting things about XGBoost is that it is also known as a regularized boosting technique. This helps reduce overfitting, and it has broad language support, including Scala, Java, R, Python, Julia, and C++.

#### Python code

```python
from xgboost import XGBClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
# Assumed 'dataset' is a numpy array with features in the first 10 columns
# and the target in the last column
X = dataset[:, 0:10]
Y = dataset[:, 10:]
seed = 1

X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.33, random_state=seed)

model = XGBClassifier()
model.fit(X_train, y_train)

# Make predictions for test data
y_pred = model.predict(X_test)
```


#### R code

```r
require(caret)
x <- cbind(x_train, y_train)
# Fitting model
TrainControl <- trainControl(method = "repeatedcv", number = 10, repeats = 4)
model <- train(y_train ~ ., data = x, method = "xgbLinear", trControl = TrainControl, verbose = FALSE)
# OR
model <- train(y_train ~ ., data = x, method = "xgbTree", trControl = TrainControl, verbose = FALSE)
predicted <- predict(model, x_test)
```


### 10.3 LightGBM

LightGBM is a gradient boosting framework that uses tree-based learning algorithms. It is designed to be distributed and efficient, with the following advantages:

1. Faster training speed and higher efficiency
2. Lower memory usage
3. Better accuracy
4. Support for parallel and GPU learning
5. Capable of handling large-scale data

#### Python code

```python
import numpy as np
import lightgbm as lgb

data = np.random.rand(500, 10)  # 500 entities, each contains 10 features
label = np.random.randint(2, size=500)  # binary target

train_data = lgb.Dataset(data, label=label)
test_data = train_data.create_valid('test.svm')

param = {'num_leaves': 31, 'num_trees': 100, 'objective': 'binary'}
param['metric'] = 'auc'

num_round = 10
bst = lgb.train(param, train_data, num_round, valid_sets=[test_data])

bst.save_model('model.txt')

# 7 entities, each contains 10 features
data = np.random.rand(7, 10)
ypred = bst.predict(data)
```



#### R code

```r
library(RLightGBM)
data(example.binary)
# Parameters
num_iterations <- 100
config <- list(objective = "binary", metric = "binary_logloss,auc", learning_rate = 0.1, num_leaves = 63, tree_learner = "serial", feature_fraction = 0.8, bagging_freq = 5, bagging_fraction = 0.8, min_data_in_leaf = 50, min_sum_hessian_in_leaf = 5.0)

# Create data handle and booster
handle.data <- lgbm.data.create(x)
lgbm.data.setField(handle.data, "label", y)
handle.booster <- lgbm.booster.create(handle.data, lapply(config, as.character))

# Train for num_iterations iterations and eval every 5 steps
lgbm.booster.train(handle.booster, num_iterations, 5)

# Predict
pred <- lgbm.booster.predict(handle.booster, x.test)

# Test accuracy
sum(y.test == (pred > 0.5)) / length(y.test)

lgbm.booster.save(handle.booster, filename = "/tmp/model.txt")
```


To use LightGBM through caret:

```r
require(caret)
require(RLightGBM)
data(iris)

model <- caretModel.LGBM()
fit <- train(Species ~ ., data = iris, method = model, verbosity = 0)
print(fit)
y.pred <- predict(fit, iris[, 1:4])

library(Matrix)
model.sparse <- caretModel.LGBM.sparse()

# Generate a sparse matrix
mat <- Matrix(as.matrix(iris[, 1:4]), sparse = TRUE)
fit <- train(data.frame(idx = 1:nrow(iris)), iris$Species, method = model.sparse, matrix = mat, verbosity = 0)
print(fit)
```


### 10.4 CatBoost

CatBoost is an open-source machine learning algorithm from Yandex, Russia's largest search engine company. It integrates easily with deep learning frameworks such as Google's TensorFlow and Apple's Core ML.

CatBoost can handle categorical variables automatically without throwing type conversion errors, which lets you focus on tuning your model rather than sorting out trivial errors.

#### Python code

```python
import pandas as pd
import numpy as np
from catboost import CatBoostRegressor
from sklearn.model_selection import train_test_split

# Assumed 'train' and 'test' are pandas DataFrames already loaded
# Impute missing values for both train and test
train.fillna(-999, inplace=True)
test.fillna(-999, inplace=True)

# Create a training set for modeling and a validation set to check model performance
X = train.drop(['Item_Outlet_Sales'], axis=1)
y = train.Item_Outlet_Sales

X_train, X_validation, y_train, y_validation = train_test_split(X, y, train_size=0.7, random_state=1234)
categorical_features_indices = np.where(X.dtypes != float)[0]

# Build the model
model = CatBoostRegressor(iterations=50, depth=3, learning_rate=0.1, loss_function='RMSE')
model.fit(X_train, y_train, cat_features=categorical_features_indices, eval_set=(X_validation, y_validation), plot=True)

submission = pd.DataFrame()
submission['Item_Identifier'] = test['Item_Identifier']
submission['Outlet_Identifier'] = test['Outlet_Identifier']
submission['Item_Outlet_Sales'] = model.predict(test)
```


#### R code

```r
set.seed(1)
require(titanic)
require(caret)
require(catboost)

tt <- titanic::titanic_train[complete.cases(titanic::titanic_train), ]
data <- as.data.frame(as.matrix(tt), stringsAsFactors = TRUE)
drop_columns <- c("PassengerId", "Survived", "Name", "Ticket", "Cabin")
x <- data[, !(names(data) %in% drop_columns)]
y <- data[, c("Survived")]

fit_control <- trainControl(method = "cv", number = 4, classProbs = TRUE)
grid <- expand.grid(depth = c(4, 6, 8), learning_rate = 0.1, iterations = 100, l2_leaf_reg = 1e-3, rsm = 0.95, border_count = 64)
report <- train(x, as.factor(make.names(y)), method = catboost.caret, verbose = TRUE, preProc = NULL, tuneGrid = grid, trControl = fit_control)
print(report)

importance <- varImp(report, scale = FALSE)
print(importance)
```

