Mr. Confusion Matrix

Niharika Dhanik
6 min read · Jun 6, 2021

A confusion matrix, hmmm…..sounds interesting. Is this matrix really what its name suggests? Why is it named confusion matrix? What does this matrix do? Who made this matrix? Where is this matrix used? Matrix, matrix, matrix, aaaaaahhhhh…! Maybe it’s time to reveal this matrix once and for all, so be it!

This blog is wholly and solely dedicated to Mr. Confusion Matrix. Let’s dig into the past and learn all that we can about it.

The Birth Of Confusion Matrix —

The confusion matrix was invented in 1904 by Karl Pearson, who used the term contingency table. It appeared in Karl Pearson, F.R.S. (1904), Mathematical contributions to the theory of evolution, Dulau and Co.

A contingency table (also known as a cross tabulation or crosstab) is a type of table in a matrix format that displays the (multivariate) frequency distribution of the variables. They are heavily used in survey research, business intelligence, engineering, and scientific research. They provide a basic picture of the interrelation between two variables and can help find interactions between them. The term contingency table was first used by Karl Pearson in “On the Theory of Contingency and Its Relation to Association and Normal Correlation”.

During World War II, detection theory was developed to investigate the relation between stimuli and responses, and the confusion matrix was used there. Through detection theory, the term entered psychology, and from there it reached machine learning. So although the concept was invented in statistics, a field closely related to machine learning, it reached machine learning only after a detour lasting about 100 years.

What is Confusion Matrix —

A confusion matrix is a table that is often used to describe the performance of a classification model based on a set of test data for which the true values are known.

It is a table with rows and columns that reports the number of false positives, false negatives, true positives, and true negatives. This allows a more detailed analysis than the mere proportion of correct predictions (accuracy). It is a very popular tool for evaluating classification models, and it can be applied to binary as well as multiclass classification problems. The cells of a confusion matrix are counts of how often each predicted value occurred together with each actual value.

  • True positives (TP): These are cases in which prediction was positive and the actual situation was also positive.
  • True negatives (TN): These are cases in which prediction was negative and the actual situation was also negative.
  • False positives (FP): These are cases in which prediction was positive but the actual situation was negative aka “Type I error.”
  • False negatives (FN): These are cases in which prediction was negative but the actual situation was positive aka “Type II error.”
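
The four cells above can be counted directly from a list of actual labels and a list of predicted labels. The following is a minimal sketch in plain Python; the function name `confusion_counts` and the toy labels are made up for illustration and are not from any particular library.

```python
def confusion_counts(actual, predicted, positive=1):
    """Count TP, TN, FP and FN for a binary classification problem."""
    tp = tn = fp = fn = 0
    for a, p in zip(actual, predicted):
        if p == positive and a == positive:
            tp += 1   # predicted positive, actually positive
        elif p != positive and a != positive:
            tn += 1   # predicted negative, actually negative
        elif p == positive and a != positive:
            fp += 1   # predicted positive, actually negative (Type I error)
        else:
            fn += 1   # predicted negative, actually positive (Type II error)
    return {"TP": tp, "TN": tn, "FP": fp, "FN": fn}


# Toy data (hypothetical): 1 = positive class, 0 = negative class
actual    = [1, 0, 1, 1, 0, 0, 1, 0]
predicted = [1, 0, 0, 1, 1, 0, 1, 0]

counts = confusion_counts(actual, predicted)
print(counts)
```

The four counts always add up to the number of test examples, which is a quick sanity check when building the table by hand.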

The Accuracy Factor —

One of the most commonly used metrics in classification is accuracy. The accuracy of a model is calculated from its confusion matrix using the formula below.

Accuracy = (TP + TN) / (TP + TN + FP + FN)

where,

  • “TN” : stands for True Negative
  • “TP” : stands for True Positive
  • “FP” : stands for False Positive (actual negative examples that were predicted as positive)
  • “FN” : stands for False Negative (actual positive examples that were predicted as negative)

Accuracy can be misleading if used with imbalanced datasets, and therefore there are other metrics based on confusion matrix which can be useful for evaluating performance.
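
To see why accuracy can mislead, consider a made-up imbalanced test set with 95 negative and 5 positive examples, and a lazy model that always predicts "negative". The sketch below uses the standard definitions accuracy = (TP + TN) / total, precision = TP / (TP + FP) and recall = TP / (TP + FN); the helper names and the numbers are assumptions chosen purely for illustration.

```python
def accuracy(tp, tn, fp, fn):
    """Share of all predictions that were correct."""
    return (tp + tn) / (tp + tn + fp + fn)


def precision(tp, fp):
    """Share of predicted positives that were really positive."""
    return tp / (tp + fp) if (tp + fp) else 0.0


def recall(tp, fn):
    """Share of actual positives that the model found."""
    return tp / (tp + fn) if (tp + fn) else 0.0


# Hypothetical model that predicts "negative" for every one of 100 examples,
# of which 95 are actually negative and 5 are actually positive.
tp, tn, fp, fn = 0, 95, 0, 5

acc = accuracy(tp, tn, fp, fn)
prec = precision(tp, fp)
rec = recall(tp, fn)

print("accuracy :", acc)   # high, although the model is useless
print("precision:", prec)
print("recall   :", rec)   # reveals that no positive case was found
```

The model scores 95% accuracy while finding none of the positive cases, which is exactly the situation where precision and recall are more informative than accuracy alone.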

What is Cyber Crime —

Cybercrime is criminal activity that either targets or uses a computer, a computer network or a networked device. Most cybercrime is committed by cybercriminals or hackers who want to make money, and it is carried out by individuals as well as organizations. Some cybercriminals are organized, use advanced techniques and are highly technically skilled. Others are novice hackers. Rarely, cybercrime aims to damage computers for reasons other than profit. These reasons could be political or personal.

Here are some examples of the different types of cybercrime:

  • Email and internet fraud.
  • Identity fraud (where personal information is stolen and used).
  • Theft of financial or card payment data.
  • Theft and sale of corporate data.
  • Cyberextortion (demanding money to prevent a threatened attack).
  • Ransomware attacks (a type of cyberextortion).
  • Cryptojacking (where hackers mine cryptocurrency using resources they do not own).
  • Cyberespionage (where hackers access government or company data).

Most cybercrime falls under two main categories:

  • Criminal activity that targets computers (this often involves infecting them with viruses and other types of malware).
  • Criminal activity that uses computers to commit other crimes.

What is Cyber Security —

Cyber security is the application of technologies, processes and controls to protect systems, networks, programs, devices and data from cyber attacks. It aims to reduce the risk of cyber attacks and protect against the unauthorised exploitation of systems, networks and technologies.

Elements of cyber security encompass the following:

  • Network security: The process of protecting the network from unwanted users, attacks and intrusions.
  • Application security: Apps require constant updates and testing to ensure these programs are secure from attacks.
  • Endpoint security: It is the process of protecting remote access to a company’s network.
  • Data Security: Protecting company and customer information is a separate layer of security.
  • Identity management: It is a process of understanding the access every individual has in an organization.
  • Database and infrastructure security: Protecting databases and the physical equipment they run on is equally important.
  • Cloud security: Protecting cloud data in an online environment.
  • Mobile security: Providing security for mobile devices such as phones and tablets.
  • Disaster recovery/business continuity planning: Planning how data is protected and how work continues after an incident, and teaching users good habits (password changes, 2-factor authentication, etc.).

Role Of Confusion Matrix in Cyber Detection —

From Rechtspraak.nl, an archive containing all court cases from 1913 until 2018 was downloaded. The data was grouped per year, and each year was divided into 12 folders in which court cases were grouped per month. In total, 7 classes remained, including the ‘other’ class. The classes and number of files are shown below. The classes are imbalanced: they do not all contain the same number of files.

The top ten features for every label were extracted per class. The results are presented below. The features make sense, as these words are often associated with these sorts of court cases.

The confusion matrix that was obtained from the classifier is depicted below. It is shown in normalized form because the classes are imbalanced. The darker the blue, the better the classifier is at predicting files for that class. The per-class accuracies can also be read from the diagonal of the confusion matrix. It appears that ‘child pornography’ cases can be identified with high accuracy.
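
As a rough illustration of what "normalized form" means, the sketch below builds a small multiclass confusion matrix and divides each row by its total. It assumes that rows are the actual classes and that normalization is done per row, which the study above does not state explicitly; the class names and predictions are invented and are not the court-case data.

```python
def confusion_matrix(actual, predicted, labels):
    """Rows are actual classes, columns are predicted classes."""
    index = {label: i for i, label in enumerate(labels)}
    matrix = [[0] * len(labels) for _ in labels]
    for a, p in zip(actual, predicted):
        matrix[index[a]][index[p]] += 1
    return matrix


def normalize_rows(matrix):
    """Divide every row by its sum so each actual class adds up to 1."""
    normalized = []
    for row in matrix:
        total = sum(row)
        normalized.append([c / total if total else 0.0 for c in row])
    return normalized


# Hypothetical, imbalanced toy data: 2 "fraud", 4 "theft" and 10 "other" files
labels = ["fraud", "theft", "other"]
actual = ["fraud"] * 2 + ["theft"] * 4 + ["other"] * 10
predicted = (
    ["fraud", "theft"]                      # predictions for the fraud files
    + ["theft", "theft", "theft", "other"]  # predictions for the theft files
    + ["other"] * 8 + ["fraud", "theft"]    # predictions for the other files
)

matrix = confusion_matrix(actual, predicted, labels)
normalized = normalize_rows(matrix)

# The diagonal of the row-normalized matrix is the share of each actual
# class that was predicted correctly.
diagonal = [normalized[i][i] for i in range(len(labels))]
for label, value in zip(labels, diagonal):
    print(label, value)
```

With raw counts, the large ‘other’ class would dominate the colour scale of a plot. After row normalization every class is on the same 0 to 1 scale, so small classes can be compared fairly with large ones.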

Conclusion —

The existence of distinctive features determines how a criminal court case is classified. These features usually consist of words associated with the crime, which makes sense. The top 10 features per class were extracted from the classifier. The above diagram helped us explain the use of the confusion matrix and the accuracy obtained. The confusion matrix also shows where the predictions went wrong. Viewed from a larger perspective, and with improved model efficiency, machine learning (or more specifically, statistical classification) in combination with real cases can be used to close the existing gaps in accuracy, which in turn will prove beneficial in solving cases.

So that is all about the prime suspect “CONFUSION MATRIX” a.k.a. ‘The Error Matrix’.

Thank You!
