Handwritten Digits Recognition using Machine Learning

 

Recognizing Handwritten Digits with scikit-learn

In this project we will try to recognize handwritten digits using machine learning models such as a support vector machine and a random forest classifier, with the help of the scikit-learn library.

Handwriting Recognition

Handwritten text recognition is a challenging task, and the scikit-learn library is a good way to get a better understanding of it. People often associate this problem with OCR (Optical Character Recognition) software that can read text, PDFs or other electronic documents, but a statistical approach can also be an effective solution. You can read about the scikit-learn library and its uses for applications such as regression, classification and clustering in the scikit-learn 1.1.1 documentation (scikit-learn: machine learning in Python).
Install the required machine learning libraries as follows:

For Windows users:
First open the command prompt and type the command below for each library:

NumPy: pip install numpy
Pandas: pip install pandas
seaborn: pip install seaborn
matplotlib: pip install matplotlib
scikit-learn: pip install scikit-learn

For other users, please check each library's official website for installation instructions.

Creating Project

To create the project, open Jupyter Notebook, create a new notebook and import all the necessary libraries as follows:
import numpy as np
import pandas as pd
from sklearn import svm, metrics
from sklearn.metrics import accuracy_score, confusion_matrix
import seaborn as sns
print("Libraries imported successfully!")

A useful estimator in this case is sklearn.svm.SVC, which uses the technique of support vector classification. We can create the estimator like this:
svc = svm.SVC(gamma=0.001, C=100.)
We have created an estimator of type SVC and chosen its initial settings, assigning generic values to C and gamma.
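
The values gamma=0.001 and C=100 are generic starting points. As a minimal sketch, assuming you want to search for better values, scikit-learn's GridSearchCV can try a few candidates with cross-validation. The candidate grid below is an illustrative assumption and not part of the original project.

```python
from sklearn import datasets, svm
from sklearn.model_selection import GridSearchCV

digits = datasets.load_digits()

# Hypothetical candidate values, chosen only for illustration.
param_grid = {"gamma": [0.0001, 0.001], "C": [1.0, 100.0]}

# Try every (gamma, C) pair with 3-fold cross-validation.
search = GridSearchCV(svm.SVC(), param_grid, cv=3)
search.fit(digits.data, digits.target)

print("Best parameters:", search.best_params_)
print(f"Best cross-validated accuracy: {search.best_score_:.3f}")
```

The best pair depends on the grid and the number of folds, so treat the result as a hint rather than a final setting.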

Loading Dataset

Load the dataset from sklearn and print all the information about it using the DESCR (description) attribute.
from sklearn import datasets
digits = datasets.load_digits()
print(digits.DESCR)    

The code above loads the digits dataset from sklearn, which provides numerous datasets that are useful for testing many data analysis problems. Here we use the image dataset called digits. The DESCR attribute provides information about the dataset, such as how many attributes and instances it has.
Our dataset contains 1797 grayscale images of 8x8 pixels.
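
To confirm these numbers, here is a short sketch that inspects the array shapes. It also shows that digits.data, which the estimators are trained on later, holds the same images flattened to 64 values each.

```python
import numpy as np
from sklearn import datasets

digits = datasets.load_digits()

# 1797 images, each stored as an 8x8 array of pixel intensities.
print(digits.images.shape)
# The same images flattened to 64 features per sample.
print(digits.data.shape)

# Each row of `data` is the corresponding image unrolled row by row.
flattened = digits.images.reshape(len(digits.images), -1)
print(np.array_equal(flattened, digits.data))
```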

digits.images contains the images as arrays of 8x8 pixels, and we can inspect them visually using the matplotlib library:
import matplotlib.pyplot as plt
%matplotlib inline
plt.imshow(digits.images[0], cmap=plt.cm.gray_r, interpolation='nearest')

The output shows the first digit image in our dataset.


The numerical value that each image represents can be found in the target attribute:
digits.target
array([0, 1, 2, ..., 8, 9, 8])
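
As a small sketch of how the labels are distributed, np.bincount counts how many images there are for each digit. This check is an addition for illustration and is not part of the original notebook.

```python
import numpy as np
from sklearn import datasets

digits = datasets.load_digits()

# Count how many images there are for each digit 0-9.
counts = np.bincount(digits.target)
for digit, count in enumerate(counts):
    print(f"digit {digit}: {count} images")
```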

Learning and Predicting 

This dataset contains images of handwritten digits, so you can use the first 1791 images for training and the last six as a validation set. You can view those six digits using the matplotlib library:

plt.subplot(321)
plt.imshow(digits.images[1791], cmap=plt.cm.gray_r, 
           interpolation='nearest')
plt.subplot(322)
plt.imshow(digits.images[1792], cmap=plt.cm.gray_r, 
           interpolation='nearest')
plt.subplot(323)
plt.imshow(digits.images[1793], cmap=plt.cm.gray_r, 
            interpolation='nearest')
plt.subplot(324)
plt.imshow(digits.images[1794], cmap=plt.cm.gray_r, 
            interpolation='nearest')
plt.subplot(325)
plt.imshow(digits.images[1795], cmap=plt.cm.gray_r, 
            interpolation='nearest')
plt.subplot(326)
plt.imshow(digits.images[1796], cmap=plt.cm.gray_r,
           interpolation='nearest')

output: a 3x2 grid showing the last six digit images.
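
The six plotting calls can also be written as a loop. This is a minimal sketch that should draw the same 3x2 grid; the panel titles showing the sample index are an addition for illustration.

```python
import matplotlib.pyplot as plt
from sklearn import datasets

digits = datasets.load_digits()

# One 3x2 grid; axes.ravel() walks the panels row by row,
# in the same order as subplot(321) ... subplot(326).
fig, axes = plt.subplots(nrows=3, ncols=2)
for ax, index in zip(axes.ravel(), range(1791, 1797)):
    ax.imshow(digits.images[index], cmap=plt.cm.gray_r, interpolation="nearest")
    ax.set_title(f"index {index}")
```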
Now you can train the svc estimator by fitting it to the first 1791 samples:
svc.fit(digits.data[:1791], digits.target[:1791])

Now test the estimator by predicting the last six digits, which it has not seen during training:
svc.predict(digits.data[1791:1797])
array([4, 9, 0, 8, 9, 8])
The estimator predicts 4, 9, 0, 8, 9, 8 for the last six digit images of our dataset.
Comparing these with the actual labels shows that all six predictions are correct, that is, 100% accuracy on this small validation set.
digits.target[1791: 1797]
output: array([4, 9, 0, 8, 9, 8])

In the same way, we can predict and check other ranges of the dataset.
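
As a sketch of what checking another range could look like, the helper below trains on every sample outside a chosen slice and scores the estimator on that slice. The name score_holdout and the range 100 to 110 are hypothetical choices for illustration.

```python
import numpy as np
from sklearn import datasets, svm
from sklearn.metrics import accuracy_score

digits = datasets.load_digits()


def score_holdout(start, stop):
    """Train on every sample outside [start, stop) and score on that slice."""
    # Boolean mask that is True only for the held-out samples.
    held_out = np.zeros(len(digits.target), dtype=bool)
    held_out[start:stop] = True

    model = svm.SVC(gamma=0.001, C=100.)
    model.fit(digits.data[~held_out], digits.target[~held_out])
    predictions = model.predict(digits.data[held_out])
    return accuracy_score(digits.target[held_out], predictions)


# Hold out samples 100..109 instead of the last six.
print(score_holdout(100, 110))
```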

Another way is to train the support vector machine on data split with train_test_split:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(
    digits.data, digits.target, test_size=0.5, shuffle=False
)
svc.fit(X_train, y_train)

output: SVC(C=100.0, gamma=0.001)
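
With test_size=0.5 and shuffle=False, the split is a simple cut through the dataset: the first half becomes the training set and the second half the test set. The sketch below checks the resulting sizes and confirms that the order is preserved.

```python
import numpy as np
from sklearn import datasets
from sklearn.model_selection import train_test_split

digits = datasets.load_digits()
X_train, X_test, y_train, y_test = train_test_split(
    digits.data, digits.target, test_size=0.5, shuffle=False
)

# With shuffle=False the split keeps the original order.
print(X_train.shape, X_test.shape)
print(np.array_equal(X_train, digits.data[: len(X_train)]))
print(np.array_equal(X_test, digits.data[len(X_train):]))
```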
Now predict the labels of the test set:
predicted = svc.predict(X_test)
predicted

output: 
array([8, 8, 4, 9, 0, 8, 9, 8, 1, 2, 3, 4, 5, 6, 7, 8, 9, 0, 1, 2, 3, 4,
       5, 6, 7, 8, 9, 0, 1, 2, 3, 4, 9, 6, 7, 8, 9, 0, 9, 5, 5, 6, 5, 0,
       9, 8, 9, 8, 4, 1, 7, 7, 3, 9, 1, 2, 7, 8, 2, 0, 1, 2, 6, 3, 3, 7,
       3, 3, 4, 6, 6, 6, 4, 9, 1, 5, 0, 9, 5, 2, 8, 2, 0, 0, 1, 7, 6, 3,
       2, 1, 4, 6, 3, 1, 3, 9, 1, 7, 6, 8, 4, 3, 1, 4, 0, 5, 3, 6, 9, 6,
       1, 7, 5, 4, 4, 7, 2, 8, 2, 2, 9, 7, 9, 5, 4, 4, 9, 0, 8, 9, 8, 0,
       1, 2, 3, 4, 5, 6, 7, 8, 9, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 0, 1, 2,
       3, 4, 5, 6, 7, 8, 9, 0, 9, 5, 5, 6, 5, 0, 9, 8, 9, 8, 4, 1, 7, 7,
       3, 5, 1, 0, 0, 7, 8, 2, 0, 1, 2, 6, 3, 3, 7, 3, 3, 4, 6, 6, 6, 9,
       9, 1, 5, 0, 9, 5, 2, 8, 2, 0, 0, 1, 7, 6, 3, 2, 1, 5, 4, 6, 3, 1,
       7, 9, 1, 7, 6, 8, 4, 3, 1, 4, 0, 5, 3, 6, 9, 6, 1, 7, 5, 4, 4, 7,
       2, 8, 2, 2, 5, 7, 9, 5, 4, 8, 8, 4, 9, 0, 8, 9, 8, 0, 1, 2, 3, 4,
       5, 6, 7, 8, 9, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 0, 8, 2, 3, 4, 5, 6,
       7, 8, 9, 0, 9, 5, 5, 6, 5, 0, 9, 8, 9, 8, 4, 1, 7, 7, 3, 5, 1, 0,
       0, 2, 2, 7, 8, 2, 0, 1, 2, 6, 3, 3, 7, 3, 3, 4, 6, 6, 6, 4, 9, 1,
       5, 0, 9, 5, 2, 8, 2, 0, 0, 1, 7, 6, 3, 2, 2, 7, 4, 6, 3, 1, 3, 9,
       1, 7, 6, 8, 4, 3, 1, 4, 0, 5, 3, 6, 9, 6, 8, 7, 5, 4, 4, 7, 2, 8,
       2, 2, 5, 7, 9, 5, 4, 8, 8, 4, 9, 0, 8, 9, 8, 0, 9, 2, 3, 4, 5, 6,
       7, 8, 9, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 0, 1, 2, 3, 4, 5, 6, 7, 8,
       9, 0, 9, 5, 5, 6, 5, 0, 9, 8, 9, 8, 4, 1, 7, 7, 3, 5, 1, 0, 0, 2,
       2, 7, 8, 2, 0, 1, 2, 6, 3, 3, 7, 3, 3, 4, 6, 6, 6, 4, 9, 1, 5, 0,
       9, 6, 2, 8, 3, 0, 0, 1, 7, 6, 3, 2, 1, 7, 4, 6, 3, 1, 3, 9, 1, 7,
       6, 8, 4, 3, 1, 4, 0, 5, 3, 6, 9, 6, 1, 7, 5, 4, 4, 7, 2, 8, 2, 2,
       5, 7, 9, 5, 4, 8, 8, 4, 9, 0, 8, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 0,
       1, 2, 3, 4, 5, 6, 7, 8, 9, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 0, 9, 5,
       5, 6, 5, 0, 9, 8, 9, 8, 4, 1, 7, 7, 3, 5, 1, 0, 0, 2, 2, 7, 8, 2,
       0, 1, 2, 6, 3, 3, 7, 3, 3, 4, 6, 6, 6, 4, 9, 1, 5, 0, 9, 5, 2, 8,
       2, 0, 0, 1, 7, 6, 3, 2, 1, 7, 4, 6, 3, 1, 3, 9, 1, 7, 6, 8, 4, 3,
       1, 4, 0, 5, 3, 6, 9, 6, 1, 7, 5, 4, 4, 7, 2, 8, 2, 2, 5, 7, 9, 5,
       4, 8, 8, 4, 9, 0, 8, 9, 8, 0, 1, 2, 3, 4, 5, 1, 7, 8, 9, 0, 1, 2,
       3, 4, 5, 6, 9, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 4, 9, 5, 5, 6, 5, 0,
       9, 8, 9, 8, 4, 1, 7, 7, 3, 5, 1, 0, 0, 2, 2, 7, 8, 2, 0, 1, 2, 6,
       8, 3, 7, 7, 3, 4, 6, 6, 6, 9, 9, 1, 5, 0, 9, 5, 2, 8, 0, 1, 7, 6,
       3, 2, 1, 7, 9, 6, 3, 1, 3, 9, 1, 7, 6, 8, 4, 3, 1, 4, 0, 5, 3, 6,
       9, 6, 1, 7, 5, 4, 4, 7, 2, 2, 5, 7, 3, 5, 9, 4, 5, 0, 8, 9, 8, 0,
       1, 2, 3, 4, 5, 6, 7, 8, 9, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 0, 1, 2,
       5, 4, 5, 6, 7, 8, 9, 0, 9, 5, 5, 6, 5, 0, 9, 8, 9, 8, 4, 1, 7, 7,
       3, 5, 1, 0, 0, 2, 2, 7, 8, 2, 0, 1, 2, 6, 8, 8, 7, 5, 8, 4, 6, 6,
       6, 4, 9, 1, 5, 0, 9, 5, 2, 8, 2, 0, 0, 1, 7, 6, 3, 2, 1, 7, 4, 6,
       3, 1, 3, 9, 1, 7, 6, 8, 4, 5, 1, 4, 0, 5, 3, 6, 9, 6, 1, 7, 5, 4,
       4, 7, 2, 8, 2, 2, 5, 7, 9, 5, 4, 8, 8, 4, 9, 0, 8, 9, 8])

This shows the predicted digit for every image in the test set.
We can display the first 8 test images together with their predictions as follows:

_, axes = plt.subplots(nrows=1, ncols=8, figsize=(20, 7))
for ax, image, prediction in zip(axes, X_test, predicted):
    ax.set_axis_off()
    image = image.reshape(8, 8)
    ax.imshow(image, cmap=plt.cm.gray_r, interpolation="nearest")
    ax.set_title(f"Prediction: {prediction}")

output: a row of the first 8 test images, each titled with its predicted digit.

We can print a classification report for our data, which includes precision, recall, f1-score and support:
print(
    f"Classification report for classifier {svc}:\n"
    f"{metrics.classification_report(y_test, predicted)}\n"
)
output:
Classification report for classifier SVC(C=100.0, gamma=0.001):
              precision    recall  f1-score   support

           0       1.00      0.99      0.99        88
           1       0.99      0.96      0.97        91
           2       0.99      0.99      0.99        86
           3       0.98      0.90      0.94        91
           4       0.99      0.96      0.97        92
           5       0.95      0.96      0.95        91
           6       0.99      0.99      0.99        91
           7       0.98      0.99      0.98        89
           8       0.94      1.00      0.97        88
           9       0.92      0.98      0.95        92

    accuracy                           0.97       899
   macro avg       0.97      0.97      0.97       899
weighted avg       0.97      0.97      0.97       899

We can create the confusion matrix as follows:
disp = metrics.ConfusionMatrixDisplay.from_predictions(y_test, predicted)
disp.figure_.suptitle("Confusion Matrix")
print(f"Confusion matrix:\n{disp.confusion_matrix}")

plt.show()

Confusion matrix:
[[87  0  0  0  1  0  0  0  0  0]
 [ 0 87  1  0  0  0  0  0  2  1]
 [ 0  0 85  1  0  0  0  0  0  0]
 [ 0  0  0 82  0  3  0  2  4  0]
 [ 0  0  0  0 88  0  0  0  0  4]
 [ 0  0  0  0  0 87  1  0  0  3]
 [ 0  1  0  0  0  0 90  0  0  0]
 [ 0  0  0  0  0  1  0 88  0  0]
 [ 0  0  0  0  0  0  0  0 88  0]
 [ 0  0  0  1  0  1  0  0  0 90]]
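
The per-class figures in the classification report can be recovered from this matrix. As a sketch, with the matrix values copied by hand from the output above: recall is each diagonal entry divided by its row sum, precision is each diagonal entry divided by its column sum, and accuracy is the diagonal total divided by the number of test samples.

```python
import numpy as np

# Confusion matrix copied from the output above
# (rows: true digit, columns: predicted digit).
cm = np.array([
    [87, 0, 0, 0, 1, 0, 0, 0, 0, 0],
    [0, 87, 1, 0, 0, 0, 0, 0, 2, 1],
    [0, 0, 85, 1, 0, 0, 0, 0, 0, 0],
    [0, 0, 0, 82, 0, 3, 0, 2, 4, 0],
    [0, 0, 0, 0, 88, 0, 0, 0, 0, 4],
    [0, 0, 0, 0, 0, 87, 1, 0, 0, 3],
    [0, 1, 0, 0, 0, 0, 90, 0, 0, 0],
    [0, 0, 0, 0, 0, 1, 0, 88, 0, 0],
    [0, 0, 0, 0, 0, 0, 0, 0, 88, 0],
    [0, 0, 0, 1, 0, 1, 0, 0, 0, 90],
])

correct = np.diag(cm)
recall = correct / cm.sum(axis=1)      # correct / all samples of that true digit
precision = correct / cm.sum(axis=0)   # correct / all predictions of that digit
accuracy = correct.sum() / cm.sum()

for digit in range(10):
    print(f"{digit}: precision={precision[digit]:.2f} recall={recall[digit]:.2f}")
print(f"accuracy={accuracy:.2f}")
```

For example, 82 of the 91 true 3s are classified correctly, which gives the recall of 0.90 shown in the report.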

acc_score = accuracy_score(y_test, predicted)
print(f"Support Vector Machine has accuracy of {round(acc_score*100, 2)} %.")
Support Vector Machine has accuracy of 97.0 %.
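This approach gives an accuracy of 97%, which is lower than the previous result. Note that the previous 100% was measured on only six images, so the 97% measured on 899 test images is the more reliable estimate.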
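
A single split gives only one accuracy estimate. As a sketch of a more robust check, assuming 5 folds is an acceptable choice, cross_val_score evaluates the same estimator on several different splits. This step is an addition and not part of the original notebook.

```python
from sklearn import datasets, svm
from sklearn.model_selection import cross_val_score

digits = datasets.load_digits()
svc = svm.SVC(gamma=0.001, C=100.)

# 5-fold cross-validation: each fold is used once as the test set.
scores = cross_val_score(svc, digits.data, digits.target, cv=5)
print("Accuracy per fold:", scores.round(3))
print(f"Mean accuracy: {scores.mean():.3f}")
```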

Random Forest Classifier

Now we will use another classifier, random forest.
Import RandomForestClassifier:
from sklearn.ensemble import RandomForestClassifier

Fit the random forest classifier:
rf = RandomForestClassifier(n_estimators=100)
rf.fit(X_train, y_train)
output: RandomForestClassifier()
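
A random forest is built with randomness, so its predictions and accuracy can vary slightly from run to run. As a sketch, passing a fixed random_state makes the result reproducible. The seed value 0 is an arbitrary assumption and was not used in the original code.

```python
import numpy as np
from sklearn import datasets
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

digits = datasets.load_digits()
X_train, X_test, y_train, y_test = train_test_split(
    digits.data, digits.target, test_size=0.5, shuffle=False
)

# Two forests built with the same seed make identical predictions.
rf_a = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_train, y_train)
rf_b = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_train, y_train)
print(np.array_equal(rf_a.predict(X_test), rf_b.predict(X_test)))
```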

predict_rf = rf.predict(X_test)
predict_rf

output:
array([8, 3, 4, 9, 0, 8, 9, 1, 1, 2, 3, 4, 5, 6, 7, 8, 9, 0, 1, 2, 3, 4,
       5, 6, 7, 8, 9, 0, 1, 2, 3, 4, 9, 6, 7, 8, 9, 0, 9, 5, 5, 6, 5, 0,
       9, 8, 9, 8, 4, 1, 7, 7, 3, 9, 1, 2, 7, 8, 2, 0, 1, 2, 6, 3, 3, 7,
       3, 3, 4, 6, 6, 6, 4, 9, 1, 5, 0, 9, 5, 2, 8, 2, 0, 0, 1, 7, 6, 0,
       2, 1, 4, 6, 3, 1, 3, 9, 1, 7, 6, 8, 1, 3, 1, 4, 0, 5, 3, 6, 9, 6,
       1, 7, 5, 4, 4, 7, 2, 8, 2, 2, 9, 7, 9, 5, 4, 4, 9, 0, 8, 9, 8, 0,
       1, 2, 3, 4, 5, 6, 7, 8, 3, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 0, 1, 2,
       3, 4, 5, 6, 7, 8, 3, 0, 9, 5, 5, 6, 5, 0, 9, 8, 9, 8, 9, 1, 7, 7,
       3, 5, 1, 0, 0, 7, 8, 2, 0, 1, 2, 6, 3, 3, 8, 3, 3, 4, 6, 6, 6, 9,
       9, 1, 5, 0, 8, 5, 2, 8, 2, 0, 0, 1, 7, 6, 3, 2, 1, 7, 4, 6, 3, 1,
       7, 9, 1, 7, 6, 8, 4, 3, 1, 4, 0, 5, 3, 6, 9, 6, 1, 7, 5, 4, 4, 7,
       2, 8, 2, 2, 5, 7, 9, 5, 4, 8, 8, 4, 9, 0, 8, 9, 8, 0, 1, 2, 3, 4,
       5, 6, 7, 8, 9, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 0, 1, 2, 3, 4, 5, 6,
       7, 8, 9, 0, 9, 5, 5, 6, 5, 0, 9, 8, 9, 5, 4, 1, 7, 7, 8, 5, 1, 0,
       0, 2, 2, 7, 7, 2, 0, 1, 2, 6, 3, 3, 7, 3, 3, 4, 6, 6, 6, 4, 9, 1,
       5, 0, 9, 5, 2, 5, 2, 0, 0, 1, 7, 6, 3, 2, 1, 7, 4, 6, 3, 1, 3, 9,
       1, 7, 6, 8, 4, 3, 1, 4, 0, 5, 3, 6, 9, 6, 4, 7, 5, 4, 4, 7, 2, 8,
       2, 2, 5, 7, 9, 5, 4, 8, 8, 4, 9, 0, 8, 9, 8, 0, 9, 2, 3, 4, 5, 6,
       7, 8, 9, 0, 1, 3, 3, 0, 5, 6, 7, 8, 9, 0, 1, 3, 3, 4, 5, 6, 7, 8,
       9, 0, 9, 5, 5, 6, 5, 0, 9, 8, 9, 8, 4, 1, 7, 7, 3, 5, 1, 0, 0, 3,
       9, 7, 8, 3, 0, 1, 3, 6, 3, 3, 7, 3, 3, 4, 6, 6, 6, 4, 9, 1, 5, 0,
       9, 6, 2, 8, 3, 0, 0, 1, 7, 6, 3, 2, 1, 7, 4, 6, 3, 1, 3, 9, 1, 7,
       6, 8, 4, 3, 1, 4, 0, 5, 3, 6, 9, 6, 1, 7, 5, 4, 4, 7, 2, 8, 2, 2,
       5, 7, 9, 5, 4, 8, 8, 4, 9, 0, 8, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 0,
       1, 2, 3, 4, 5, 6, 7, 8, 9, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 0, 9, 5,
       5, 6, 5, 0, 9, 8, 9, 8, 4, 1, 7, 7, 3, 5, 1, 0, 0, 2, 2, 7, 9, 2,
       0, 9, 2, 6, 3, 3, 7, 3, 3, 4, 6, 6, 6, 4, 9, 9, 5, 0, 9, 5, 2, 8,
       2, 0, 0, 9, 7, 6, 3, 2, 3, 7, 4, 6, 3, 1, 3, 9, 9, 7, 6, 8, 4, 3,
       9, 4, 0, 5, 3, 6, 9, 6, 9, 7, 5, 4, 4, 7, 2, 8, 2, 2, 5, 7, 9, 5,
       4, 8, 8, 4, 9, 0, 8, 9, 8, 0, 1, 2, 3, 4, 5, 1, 8, 1, 9, 0, 1, 2,
       3, 4, 5, 6, 7, 0, 1, 2, 3, 4, 5, 6, 7, 5, 9, 4, 9, 5, 5, 6, 5, 0,
       9, 4, 5, 8, 4, 1, 7, 7, 3, 5, 1, 0, 0, 0, 2, 7, 8, 2, 0, 1, 2, 6,
       5, 3, 7, 7, 8, 4, 6, 6, 6, 7, 9, 1, 5, 0, 9, 5, 2, 8, 0, 1, 7, 6,
       3, 2, 1, 7, 7, 6, 3, 1, 3, 9, 1, 7, 6, 8, 4, 3, 1, 4, 0, 5, 3, 6,
       9, 6, 1, 7, 5, 4, 4, 7, 2, 2, 5, 7, 3, 5, 5, 4, 5, 0, 8, 9, 7, 0,
       1, 2, 3, 4, 5, 6, 7, 8, 9, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 0, 1, 2,
       5, 4, 5, 6, 7, 8, 9, 0, 9, 5, 5, 6, 5, 0, 9, 8, 9, 8, 4, 1, 7, 7,
       8, 5, 1, 0, 0, 2, 2, 7, 8, 2, 0, 1, 2, 6, 8, 8, 7, 3, 8, 4, 6, 6,
       6, 4, 9, 1, 5, 0, 9, 5, 2, 8, 2, 0, 0, 1, 7, 6, 3, 2, 1, 7, 4, 6,
       3, 1, 3, 9, 1, 7, 6, 8, 4, 5, 1, 4, 0, 5, 3, 6, 9, 6, 1, 7, 5, 4,
       4, 7, 2, 8, 2, 2, 5, 7, 9, 5, 4, 8, 1, 4, 9, 0, 8, 9, 8])

We can get the confusion matrix of the random forest predictions like this:
conf_rf = confusion_matrix(y_test, predict_rf)
conf_rf
sns.heatmap(conf_rf, annot=True,fmt='d',cmap="YlGnBu")
output: array([[87,  0,  0,  0,  1,  0,  0,  0,  0,  0],
       [ 0, 82,  0,  1,  1,  0,  0,  0,  0,  7],
       [ 1,  0, 78,  6,  0,  0,  0,  0,  0,  1],
       [ 1,  0,  0, 79,  0,  3,  0,  2,  6,  0],
       [ 1,  1,  0,  0, 85,  1,  0,  2,  0,  2],
       [ 0,  0,  0,  0,  0, 87,  1,  0,  0,  3],
       [ 0,  1,  0,  0,  0,  0, 90,  0,  0,  0],
       [ 0,  0,  0,  0,  0,  0,  0, 87,  2,  0],
       [ 0,  3,  0,  1,  1,  3,  0,  2, 77,  1],
       [ 0,  0,  0,  3,  0,  2,  0,  1,  1, 85]], dtype=int64)
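
To read the matrix more easily, it can help to find the most frequent mistake. The sketch below uses the values copied by hand from the output above, zeroes the diagonal so only misclassifications remain, and locates the largest remaining count.

```python
import numpy as np

# Random forest confusion matrix copied from the output above
# (rows: true digit, columns: predicted digit).
conf_rf = np.array([
    [87, 0, 0, 0, 1, 0, 0, 0, 0, 0],
    [0, 82, 0, 1, 1, 0, 0, 0, 0, 7],
    [1, 0, 78, 6, 0, 0, 0, 0, 0, 1],
    [1, 0, 0, 79, 0, 3, 0, 2, 6, 0],
    [1, 1, 0, 0, 85, 1, 0, 2, 0, 2],
    [0, 0, 0, 0, 0, 87, 1, 0, 0, 3],
    [0, 1, 0, 0, 0, 0, 90, 0, 0, 0],
    [0, 0, 0, 0, 0, 0, 0, 87, 2, 0],
    [0, 3, 0, 1, 1, 3, 0, 2, 77, 1],
    [0, 0, 0, 3, 0, 2, 0, 1, 1, 85],
])

# Zero the diagonal so that only misclassifications remain.
errors = conf_rf.copy()
np.fill_diagonal(errors, 0)

# Locate the largest off-diagonal count.
true_digit, predicted_digit = np.unravel_index(errors.argmax(), errors.shape)
print(
    f"Most frequent error: true {true_digit} predicted as {predicted_digit} "
    f"({errors[true_digit, predicted_digit]} times)"
)
```

In this matrix the most frequent error is a true 1 predicted as 9, which happens 7 times.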


Let's check the accuracy of the random forest classifier:
acc_rf = accuracy_score(y_test, predict_rf)
print(f"Random Forest Classifier accuracy is {round(acc_rf*100, 2)} %.")
output: Random Forest Classifier accuracy is 93.1 %.

As we can see, random forest gives 93.1% accuracy, which is about four percentage points lower than the support vector machine classifier.
Both models have learned to recognize handwritten digits with good accuracy, with the support vector machine performing better on this split.

=> The full project code and other projects are available here: superohit (Rohit Kumar) (github.com)
