Step 1: Answering the question

  • Did you specify the type of data analytic question (e.g. exploration, association, causality) before touching the data?
  • Did you define the metric for success before beginning?
  • Did you understand the context for the question and the scientific or business application?
  • Did you record the experimental design?
  • Did you consider whether the question could be answered with the available data?

Thinking about and documenting the problem we're working on is an important step in performing effective data analysis, and it often goes overlooked. Don't skip it.


 

Step 2: Checking the data

  • Is there anything wrong with the data?
  • Are there any quirks with the data?
  • Do I need to fix or remove any of the data?

Let's start by reading the data into a pandas DataFrame.

import pandas as pd

iris_data = pd.read_csv('iris-data.csv')
iris_data.head()
# Re-read the data, telling pandas to treat 'NA' entries as missing values
iris_data = pd.read_csv('iris-data.csv', na_values=['NA'])
# iris_data.fillna(1, inplace=True)  # if filling instead, inplace=True is needed to modify the DataFrame
iris_data.describe()
# This line tells the notebook to show plots inside of the notebook
%matplotlib inline

import matplotlib.pyplot as plt
import seaborn as sb
sb.pairplot(iris_data.dropna(), hue='class')  # hue colors each point by the given column, effectively labeling the classes

From the scatterplot matrix, we can already see some issues with the data set:

  1. There are five classes when there should only be three, meaning there were some coding errors.

  2. There are some clear outliers in the measurements that may be erroneous: one sepal_width_cm entry for Iris-setosa falls well outside its normal range, and several sepal_length_cm entries for Iris-versicolor are near-zero for some reason.

  3. Some rows have missing values, which we had to drop before plotting.

In all of these cases, we need to figure out what to do with the erroneous data. Which takes us to the next step...


 

Step 3: Tidying the data

Now that we've identified several errors in the data set, we need to fix them before we proceed with the analysis.

Let's walk through the issues one-by-one.

There are five classes when there should only be three, meaning there were some coding errors.

After talking with the field researchers, it sounds like one of them forgot to add Iris- before their Iris-versicolor entries. The other extraneous class, Iris-setossa, was simply a typo that they forgot to fix.

Let's use the DataFrame to fix these errors.

 

iris_data.loc[iris_data['class'] == 'versicolor', 'class'] = 'Iris-versicolor'
iris_data.loc[iris_data['class'] == 'Iris-setossa', 'class'] = 'Iris-setosa'

iris_data['class'].unique()

sb.pairplot(iris_data.dropna(), hue='class')  # re-plot the scatterplot matrix, now with the corrected class labels

[Figure: scatterplot matrix after fixing the class labels]


# This line drops any 'Iris-setosa' rows with a sepal width less than 2.5 cm
iris_data = iris_data.loc[(iris_data['class'] != 'Iris-setosa') | (iris_data['sepal_width_cm'] >= 2.5)]
iris_data.loc[iris_data['class'] == 'Iris-setosa', 'sepal_width_cm'].hist()
;

[Figure: histogram of sepal_width_cm for the remaining Iris-setosa rows]

The next data issue to address is the several near-zero sepal lengths for the Iris-versicolor rows. Let's take a look at those rows.
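One way to pull up those rows is to filter on the same condition we use for the fix below, an Iris-versicolor entry with a sepal_length_cm below 1.0:

iris_data.loc[(iris_data['class'] == 'Iris-versicolor') &
(iris_data['sepal_length_cm'] < 1.0)]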

 


[Output: the Iris-versicolor rows with near-zero sepal_length_cm]

How about that? All of these near-zero sepal_length_cm entries seem to be off by two orders of magnitude, as if they had been recorded in meters instead of centimeters.

After some brief correspondence with the field researchers, we find that one of them forgot to convert those measurements to centimeters. Let's do that for them.

iris_data.loc[(iris_data['class'] == 'Iris-versicolor') &
(iris_data['sepal_length_cm'] < 1.0),
'sepal_length_cm'] *= 100.0

iris_data.loc[iris_data['class'] == 'Iris-versicolor', 'sepal_length_cm'].hist()


Let's take a look at the rows with missing values:

iris_data.loc[(iris_data['sepal_length_cm'].isnull()) |
(iris_data['sepal_width_cm'].isnull()) |
(iris_data['petal_length_cm'].isnull()) |
(iris_data['petal_width_cm'].isnull())]

[Output: the rows with missing values, all Iris-setosa rows missing petal_width_cm]

It's not ideal that we had to drop those rows, especially considering they're all `Iris-setosa` entries. Since it seems like the missing data is systematic — all of the missing values are in the same column for the same *Iris* type — this error could potentially bias our analysis.

One way to deal with missing data is **mean imputation**: If we know that the values for a measurement fall in a certain range, we can fill in empty values with the average of that measurement.

Let's see if we can do that here.

iris_data.loc[iris_data['class'] == 'Iris-setosa', 'petal_width_cm'].hist()

[Figure: histogram of petal_width_cm for Iris-setosa]

 

Most of the petal widths for Iris-setosa fall within the 0.2-0.3 range, so let's fill in these entries with the average measured petal width.

average_petal_width = iris_data.loc[iris_data['class'] == 'Iris-setosa', 'petal_width_cm'].mean()

iris_data.loc[(iris_data['class'] == 'Iris-setosa') &
(iris_data['petal_width_cm'].isnull()), 'petal_width_cm'] = average_petal_width

iris_data.loc[(iris_data['class'] == 'Iris-setosa') &
(iris_data['petal_width_cm'] == average_petal_width)]


iris_data.loc[(iris_data['sepal_length_cm'].isnull()) |
(iris_data['sepal_width_cm'].isnull()) |
(iris_data['petal_length_cm'].isnull()) |
(iris_data['petal_width_cm'].isnull())]

Great! Now we've recovered those rows and no longer have missing data in our data set.

Note: If you don't feel comfortable imputing your data, you can drop all rows with missing data with the dropna() call:

iris_data.dropna(inplace=True)

After all this hard work, we don't want to repeat this process every time we work with the data set. Let's save the tidied data file as a separate file and work directly with that data file from now on.

iris_data.to_csv('iris-data-clean.csv', index=False)

iris_data_clean = pd.read_csv('iris-data-clean.csv')


Of course, I purposely inserted numerous errors into this data set to demonstrate some of the many possible scenarios you may face while tidying your data.

The general takeaways here should be:

  • Make sure your data is encoded properly

  • Make sure your data falls within the expected range, and use domain knowledge whenever possible to define that expected range

  • Deal with missing data in one way or another: replace it if you can or drop it

  • Never tidy your data manually because that is not easily reproducible

  • Use code as a record of how you tidied your data

  • Plot everything you can about the data at this stage of the analysis so you can visually confirm everything looks correct




 

Step 4: Exploratory analysis


sb.pairplot(iris_data_clean)
;

sb.pairplot(iris_data_clean, hue='class')  # color the points by class to distinguish the species

;

[Figure: scatterplot matrices of the cleaned data, with and without class coloring]

plt.figure(figsize=(10, 10))

for column_index, column in enumerate(iris_data_clean.columns):
  if column == 'class':
    continue
  plt.subplot(2, 2, column_index + 1)
  sb.violinplot(x='class', y=column, data=iris_data_clean)

[Figure: violin plots of each measurement, split by class]


 

Step 5: Classification

Wow, all this work and we still haven't modeled the data!

As tiresome as it can be, tidying and exploring our data is a vital component of any data analysis. If we had jumped straight to the modeling step, we would have created a faulty classification model.

Remember: Bad data leads to bad models. Always check your data first.


Assured that our data is now as clean as we can make it — and armed with some cursory knowledge of the distributions and relationships in our data set — it's time to make the next big step in our analysis: Splitting the data into training and testing sets.

The training set is a random subset of the data that we use to train our models.

The testing set is a random subset of the data (mutually exclusive from the training set) that we use to validate our models on unforeseen data.

Especially in sparse data sets like ours, it's easy for models to overfit the data: The model will learn the training set so well that it won't be able to handle most of the cases it's never seen before. This is why it's important for us to build the model with the training set, but score it with the testing set.

Note that once we split the data into a training and testing set, we should treat the testing set like it no longer exists: We cannot use any information from the testing set to build our model or else we're cheating.

Let's set up our data first.

 

iris_data_clean = pd.read_csv('iris-data-clean.csv')

# We're using all four measurements as inputs
# Note that scikit-learn expects each entry to be a list of values, e.g.,
# [ [val1, val2, val3],
# [val1, val2, val3],
# ... ]
# such that our input data set is represented as a list of lists

# We can extract the data in this format from pandas like this:
all_inputs = iris_data_clean[['sepal_length_cm', 'sepal_width_cm',
'petal_length_cm', 'petal_width_cm']].values

# Similarly, we can extract the class labels
all_labels = iris_data_clean['class'].values

# Make sure that you don't mix up the order of the entries
# all_inputs[5] inputs should correspond to the class in all_labels[5]

# Here's what a subset of our inputs looks like:
all_inputs[:5]

 

# Split the data into training and testing sets

from sklearn.model_selection import train_test_split

(training_inputs,
testing_inputs,
training_classes,
testing_classes) = train_test_split(all_inputs, all_labels, test_size=0.25, random_state=1)  # returns four arrays: the training and testing inputs, then the training and testing class labels

A decision tree classifier differentiates the records by asking a series of Yes/No questions about the data, for example whether a certain feature is <= 1.75. This is the essence of every decision tree.
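To see those questions concretely, here's a minimal sketch (assuming scikit-learn >= 0.21, which provides export_text) that fits a small example tree on the training split and prints the rules it learned:

from sklearn.tree import DecisionTreeClassifier, export_text

# Fit a deliberately shallow tree so the printed rules stay readable
example_tree = DecisionTreeClassifier(max_depth=2, random_state=1)
example_tree.fit(training_inputs, training_classes)

print(export_text(example_tree,
                  feature_names=['sepal_length_cm', 'sepal_width_cm',
                                 'petal_length_cm', 'petal_width_cm']))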

The nice part about decision tree classifiers is that they are scale-invariant, i.e., the scale of the features does not affect their performance, unlike many Machine Learning models. In other words, it doesn't matter if our features range from 0 to 1 or 0 to 1,000; decision tree classifiers will work with them just the same.
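As a quick sanity check of that claim (an illustrative sketch, not part of the original analysis), rescaling every feature by a constant factor should leave the tree's accuracy essentially unchanged:

from sklearn.tree import DecisionTreeClassifier

tree_raw = DecisionTreeClassifier(random_state=1)
tree_raw.fit(training_inputs, training_classes)

tree_scaled = DecisionTreeClassifier(random_state=1)
tree_scaled.fit(training_inputs * 1000.0, training_classes)

# The two accuracies should match: only the split thresholds change, not the tree structure
print(tree_raw.score(testing_inputs, testing_classes),
      tree_scaled.score(testing_inputs * 1000.0, testing_classes))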

There are several parameters that we can tune for decision tree classifiers, but for now let's use a basic decision tree classifier.

 

from sklearn.tree import DecisionTreeClassifier

# Create the classifier
decision_tree_classifier = DecisionTreeClassifier()

# Train the classifier on the training set
decision_tree_classifier.fit(training_inputs, training_classes)

# Validate the classifier on the testing set using classification accuracy
decision_tree_classifier.score(testing_inputs, testing_classes)

0.9736842105263158
 

Heck yeah! Our model achieves 97% classification accuracy without much effort.

However, there's a catch: Depending on how our training and testing set was sampled, our model can achieve anywhere from 80% to 100% accuracy:

 

model_accuracies = []

for repetition in range(1000):
    (training_inputs,
     testing_inputs,
     training_classes,
     testing_classes) = train_test_split(all_inputs, all_labels, test_size=0.25)

    decision_tree_classifier = DecisionTreeClassifier()
    decision_tree_classifier.fit(training_inputs, training_classes)
    classifier_accuracy = decision_tree_classifier.score(testing_inputs, testing_classes)
    model_accuracies.append(classifier_accuracy)

plt.hist(model_accuracies)
;

[Figure: histogram of test accuracies over 1000 random train/test splits]

Cross-validation


This problem is the main reason that most data scientists perform k-fold cross-validation on their models: Split the original data set into k subsets, use one of the subsets as the testing set, and the rest of the subsets are used as the training set. This process is then repeated k times such that each subset is used as the testing set exactly once.

10-fold cross-validation is the most common choice, so let's use that here. Performing 10-fold cross-validation on our data set looks something like this:

(each square is an entry in our data set)

 

import numpy as np
from sklearn.model_selection import StratifiedKFold

def plot_cv(cv, features, labels):
    masks = []
    for train, test in cv.split(features, labels):
        mask = np.zeros(len(labels), dtype=bool)
        mask[test] = 1
        masks.append(mask)

    plt.figure(figsize=(15, 15))
    plt.imshow(masks, interpolation='none', cmap='gray_r')
    plt.ylabel('Fold')
    plt.xlabel('Row #')

plot_cv(StratifiedKFold(n_splits=10), all_inputs, all_labels)


from sklearn.model_selection import cross_val_score

decision_tree_classifier = DecisionTreeClassifier()

# cross_val_score returns a list of the scores, which we can visualize
# to get a reasonable estimate of our classifier's performance
cv_scores = cross_val_score(decision_tree_classifier, all_inputs, all_labels, cv=10)
plt.hist(cv_scores)
plt.title('Average score: {}'.format(np.mean(cv_scores)))
;


Parameter tuning


Every Machine Learning model comes with a variety of parameters to tune, and these parameters can be vitally important to the performance of our classifier. For example, if we severely limit the depth of our decision tree classifier:


decision_tree_classifier = DecisionTreeClassifier(max_depth=1)

cv_scores = cross_val_score(decision_tree_classifier, all_inputs, all_labels, cv=10)
plt.hist(cv_scores)
plt.title('Average score: {}'.format(np.mean(cv_scores)))
[Figure: histogram of cross-validation scores with max_depth=1]

the classification accuracy falls tremendously.

Therefore, we need to find a systematic method to discover the best parameters for our model and data set.

The most common method for model parameter tuning is Grid Search. The idea behind Grid Search is simple: explore a range of parameters and find the best-performing parameter combination. Focus your search on the best range of parameters, then repeat this process several times until the best parameters are discovered.

Let's tune our decision tree classifier. We'll stick to only two parameters for now, but it's possible to simultaneously explore dozens of parameters if we want.


from sklearn.model_selection import GridSearchCV

decision_tree_classifier = DecisionTreeClassifier()

parameter_grid = {'max_depth': [1, 2, 3, 4, 5],
'max_features': [1, 2, 3, 4]}

cross_validation = StratifiedKFold(n_splits=10)

grid_search = GridSearchCV(decision_tree_classifier,
param_grid=parameter_grid,
cv=cross_validation)

grid_search.fit(all_inputs, all_labels)
print('Best score: {}'.format(grid_search.best_score_))
print('Best parameters: {}'.format(grid_search.best_params_))


Best score: 0.9664429530201343
Best parameters: {'max_depth': 4, 'max_features': 3}
Now let's visualize the grid search to see how the parameters interact.


grid_visualization = grid_search.cv_results_['mean_test_score']
grid_visualization.shape = (5, 4)
sb.heatmap(grid_visualization, cmap='Blues', annot=True)
plt.xticks(np.arange(4) + 0.5, grid_search.param_grid['max_features'])
plt.yticks(np.arange(5) + 0.5, grid_search.param_grid['max_depth'])
plt.xlabel('max_features')
plt.ylabel('max_depth')
;

[Figure: heatmap of mean cross-validation score for each max_depth and max_features combination]

Now we have a better sense of the parameter space: We know that we need a max_depth of at least 2 to allow the decision tree to make more than a one-off decision.

max_features doesn't really seem to make a big difference here as long as we have 2 of them, which makes sense since our data set has only 4 features and is relatively easy to classify. (Remember, one of our data set's classes was easily separable from the rest based on a single feature.)

Let's go ahead and use a broad grid search to find the best settings for a handful of parameters.


decision_tree_classifier = DecisionTreeClassifier()

parameter_grid = {'criterion': ['gini', 'entropy'],
'splitter': ['best', 'random'],
'max_depth': [1, 2, 3, 4, 5],
'max_features': [1, 2, 3, 4]}

cross_validation = StratifiedKFold(n_splits=10)

grid_search = GridSearchCV(decision_tree_classifier,
param_grid=parameter_grid,
cv=cross_validation)

grid_search.fit(all_inputs, all_labels)
print('Best score: {}'.format(grid_search.best_score_))
print('Best parameters: {}'.format(grid_search.best_params_))

Best score: 0.9664429530201343
Best parameters: {'criterion': 'entropy', 'max_depth': 5, 'max_features': 4, 'splitter': 'random'}
Now we can take the best classifier from the Grid Search and use that:


decision_tree_classifier = grid_search.best_estimator_
decision_tree_classifier

DecisionTreeClassifier(class_weight=None, criterion='entropy', max_depth=5,
            max_features=4, max_leaf_nodes=None, min_impurity_decrease=0.0,
            min_impurity_split=None, min_samples_leaf=1,
            min_samples_split=2, min_weight_fraction_leaf=0.0,
            presort=False, random_state=None, splitter='random')

We can even visualize the decision tree with GraphViz to see how it's making the classifications:

import sklearn.tree as tree

with open('iris_dtc.dot', 'w') as out_file:
  out_file = tree.export_graphviz(decision_tree_classifier, out_file=out_file)
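If GraphViz is installed, the exported file can be rendered to an image from the command line, e.g. dot -Tpng iris_dtc.dot -o iris_dtc.png (this assumes the dot executable is on your PATH).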

[Figure: the decision tree rendered from iris_dtc.dot]

 

dt_scores = cross_val_score(decision_tree_classifier, all_inputs, all_labels, cv=10)

sb.boxplot(dt_scores)
sb.stripplot(dt_scores, jitter=True, color='black')
;

[Figure: box plot and strip plot of the decision tree's cross-validation scores]

 

Hmmm... that's a little boring by itself though. How about we compare another classifier to see how they perform?

We already know from previous projects that Random Forest classifiers usually work better than individual decision trees. A common problem that decision trees face is that they're prone to overfitting: They grow so complex that they classify the training set near-perfectly, but fail to generalize to data they have not seen before.

Random Forest classifiers work around that limitation by creating a whole bunch of decision trees (hence "forest"), each trained on random subsets of the training samples (drawn with replacement) and of the features (drawn without replacement), and letting the trees vote together to make a more accurate classification.

Let that be a lesson for us: Even in Machine Learning, we get better results when we work together!

Let's see if a Random Forest classifier works better here.

The great part about scikit-learn is that the training, testing, parameter tuning, etc. process is the same for all models, so we only need to plug in the new classifier.


from sklearn.ensemble import RandomForestClassifier

random_forest_classifier = RandomForestClassifier()

parameter_grid = {'n_estimators': [10, 25, 50, 100],
'criterion': ['gini', 'entropy'],
'max_features': [1, 2, 3, 4]}

cross_validation = StratifiedKFold(n_splits=10)

grid_search = GridSearchCV(random_forest_classifier,
param_grid=parameter_grid,
cv=cross_validation)

grid_search.fit(all_inputs, all_labels)
print('Best score: {}'.format(grid_search.best_score_))
print('Best parameters: {}'.format(grid_search.best_params_))

grid_search.best_estimator_


Best score: 0.9664429530201343
Best parameters: {'criterion': 'gini', 'max_features': 2, 'n_estimators': 25}
RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features=2, max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators=25, n_jobs=None,
            oob_score=False, random_state=None, verbose=0,
            warm_start=False)
Now we can compare their performance:

random_forest_classifier = grid_search.best_estimator_

rf_df = pd.DataFrame({'accuracy': cross_val_score(random_forest_classifier, all_inputs, all_labels, cv=10),
'classifier': ['Random Forest'] * 10})
dt_df = pd.DataFrame({'accuracy': cross_val_score(decision_tree_classifier, all_inputs, all_labels, cv=10),
'classifier': ['Decision Tree'] * 10})
both_df = pd.concat([rf_df, dt_df])  # DataFrame.append was removed in pandas 2.0

sb.boxplot(x='classifier', y='accuracy', data=both_df)
sb.stripplot(x='classifier', y='accuracy', data=both_df, jitter=True, color='black')
;

[Figure: box plots comparing Random Forest and Decision Tree cross-validation accuracy]

How about that? They both seem to perform about the same on this data set. This is probably because of the limitations of our data set: We have only 4 features to make the classification, and Random Forest classifiers excel when there are hundreds of possible features to look at. In other words, there wasn't much room for improvement with this data set.


Step 6: Reproducibility


Ensuring that our work is reproducible is the last and — arguably — most important step in any analysis. As a rule, we shouldn't place much weight on a discovery that can't be reproduced. As such, if our analysis isn't reproducible, we might as well not have done it.

Notebooks like this one go a long way toward making our work reproducible. Since we documented every step as we moved along, we have a written record of what we did and why we did it — both in text and code.

Beyond recording what we did, we should also document what software and hardware we used to perform our analysis. This typically goes at the top of our notebooks so our readers know what tools to use.

Sebastian Raschka created a handy notebook tool for this: the watermark extension.
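Here's a minimal sketch of how it might be used (assuming the watermark package is installed, e.g. via pip install watermark):

%load_ext watermark

# Print the date, Python and machine info, and the versions of the key packages
%watermark -d -v -m -p numpy,pandas,sklearn,matplotlib,seaborn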


Finally, let's extract the core of our work from Steps 1-5 and turn it into a single pipeline.

%matplotlib inline
import pandas as pd
import seaborn as sb
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split, cross_val_score

# We can jump directly to working with the clean data because we saved our cleaned data set
iris_data_clean = pd.read_csv('iris-data-clean.csv')

# Testing our data: Our analysis will stop here if any of these assertions are wrong

# We know that we should only have three classes
assert len(iris_data_clean['class'].unique()) == 3

# We know that sepal lengths for 'Iris-versicolor' should never be below 2.5 cm
assert iris_data_clean.loc[iris_data_clean['class'] == 'Iris-versicolor', 'sepal_length_cm'].min() >= 2.5

# We know that our data set should have no missing measurements
assert len(iris_data_clean.loc[(iris_data_clean['sepal_length_cm'].isnull()) |
(iris_data_clean['sepal_width_cm'].isnull()) |
(iris_data_clean['petal_length_cm'].isnull()) |
(iris_data_clean['petal_width_cm'].isnull())]) == 0

all_inputs = iris_data_clean[['sepal_length_cm', 'sepal_width_cm',
'petal_length_cm', 'petal_width_cm']].values

all_labels = iris_data_clean['class'].values

# This is the classifier that came out of Grid Search
random_forest_classifier = RandomForestClassifier(criterion='gini', max_features=3, n_estimators=50)

# All that's left to do now is plot the cross-validation scores
rf_classifier_scores = cross_val_score(random_forest_classifier, all_inputs, all_labels, cv=10)
sb.boxplot(rf_classifier_scores)
sb.stripplot(rf_classifier_scores, jitter=True, color='black')

# ...and show some of the predictions from the classifier
(training_inputs,
testing_inputs,
training_classes,
testing_classes) = train_test_split(all_inputs, all_labels, test_size=0.25)

random_forest_classifier.fit(training_inputs, training_classes)

for input_features, prediction, actual in zip(testing_inputs[:10],
                                              random_forest_classifier.predict(testing_inputs[:10]),
                                              testing_classes[:10]):
    print('{}\t-->\t{}\t(Actual: {})'.format(input_features, prediction, actual))

[Output: box plot of the pipeline's cross-validation scores, followed by ten sample predictions]

There we have it: We have a complete and reproducible Machine Learning pipeline to demo to our company's Head of Data. We've met the success criteria that we set from the beginning (>90% accuracy), and our pipeline is flexible enough to handle new inputs or flowers when that data set is ready. Not bad for our first week on the job!

 

 


 

Further reading


This notebook covers a broad variety of topics but skips over many of the specifics. If you're looking to dive deeper into a particular topic, here's some recommended reading.

Data Science: William Chen compiled a list of free books for newcomers to Data Science, ranging from the basics of R & Python to Machine Learning to interviews and advice from prominent data scientists.

Machine Learning: /r/MachineLearning has a useful Wiki page containing links to online courses, books, data sets, etc. for Machine Learning. There's also a curated list of Machine Learning frameworks, libraries, and software sorted by language.

Unit testing: Dive Into Python 3 has a great walkthrough of unit testing in Python, how it works, and how it should be used.

pandas has several tutorials covering its myriad features.

scikit-learn has a bunch of tutorials for those looking to learn Machine Learning in Python. Andreas Mueller's scikit-learn workshop materials are top-notch and freely available.

matplotlib has many books, videos, and tutorials to teach plotting in Python.

Seaborn has a basic tutorial covering most of the statistical plotting features.