F1 Score in Machine Learning

Mohit Uniyal

Machine Learning

In machine learning, evaluation metrics are essential to assess the effectiveness of models. Among these metrics, the F1 Score plays a crucial role, especially in classification tasks. It provides a balanced measure by considering both Precision and Recall, offering insights into a model’s overall accuracy in predicting the positive class.

The F1 Score is particularly useful when dealing with imbalanced datasets, where relying on accuracy alone may lead to misleading results. Understanding the F1 Score and its calculation helps practitioners evaluate model performance more effectively, ensuring robust predictions.

What is an F1 Score?

The F1 Score is a metric that provides a balance between Precision and Recall, making it especially useful for evaluating the performance of classification models. It is calculated as the harmonic mean of Precision and Recall, ensuring that both metrics are given equal importance.

F1 Score is particularly relevant in scenarios involving imbalanced datasets, where the ratio of positive to negative examples is skewed. In such cases, accuracy alone can be misleading. For example, in fraud detection, predicting “no fraud” most of the time might yield high accuracy but a low F1 Score, indicating poor detection of actual fraud cases.

The formula for F1 Score is:

$$F1 = 2 \times \left( \frac{Precision \times Recall}{Precision + Recall} \right)$$

1. Precision

Precision measures the proportion of true positive predictions out of all positive predictions made by the model. It evaluates the accuracy of the positive class predictions, helping avoid false positives.

The formula for Precision is:

$$Precision = \frac{TP}{TP + FP}$$

Where:

  • TP = True Positives
  • FP = False Positives

Precision is crucial in scenarios where false positives are costly, such as spam detection or medical diagnostics, where incorrectly flagging a legitimate email as spam or a healthy individual as ill can have serious consequences.

2. Recall

Recall, also known as Sensitivity or True Positive Rate, measures how well the model identifies all relevant positive instances in the dataset. It indicates the ability of the model to avoid false negatives.

The formula for Recall is:

$$Recall = \frac{TP}{TP + FN}$$

Where:

  • TP = True Positives
  • FN = False Negatives

Recall is essential in situations like disease detection, where identifying every positive case is critical. Missing positive cases (false negatives) could lead to significant consequences, such as undetected illnesses.

3. Why Use Harmonic Mean Instead of Simple Average?

The harmonic mean is used to calculate the F1 Score because it gives a more balanced measure when Precision and Recall are uneven. If one metric is much smaller than the other, the harmonic mean ensures that the F1 Score reflects the lower value, avoiding overestimation.

Using a simple average would overstate performance whenever one of the two metrics is high while the other is low. By applying the harmonic mean, the F1 Score provides a more realistic evaluation of model performance, ensuring that both metrics are adequately represented, as the example below illustrates.
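
For instance, suppose a model reaches Precision = 1.0 but Recall = 0.1 (illustrative numbers):

$$\text{Simple average} = \frac{1.0 + 0.1}{2} = 0.55, \qquad F1 = 2 \times \frac{1.0 \times 0.1}{1.0 + 0.1} \approx 0.18$$

The harmonic mean correctly penalizes the poor Recall, whereas the simple average suggests a deceptively respectable score.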

How to Calculate F1 Score?

The F1 Score provides a comprehensive performance metric by considering both Precision and Recall. Below, we explain how to calculate the F1 Score in binary and multiclass classification scenarios.

General Formula for F1 Score

$$F1 = 2 \times \left( \frac{Precision \times Recall}{Precision + Recall} \right)$$

The calculation starts with identifying true positives (TP), false positives (FP), and false negatives (FN) from the model predictions.

1. F1 Score in Binary Classification

In binary classification, F1 Score measures how well the model distinguishes between two classes, typically positive and negative. Let’s walk through the process with two detailed examples.

Example 1: Binary Classification F1 Calculation

Imagine a spam detection model that predicts whether an email is spam or not. The confusion matrix is as follows:

  • TP (True Positives): 50 (Correctly predicted spam emails)
  • FP (False Positives): 10 (Non-spam emails predicted as spam)
  • FN (False Negatives): 5 (Spam emails predicted as non-spam)

Step 1: Calculate Precision

$$Precision = \frac{TP}{TP + FP} = \frac{50}{50 + 10} = 0.83$$

Step 2: Calculate Recall

$$Recall = \frac{TP}{TP + FN} = \frac{50}{50 + 5} = 0.91$$

Step 3: Calculate F1 Score

$$F1 = 2 \times \frac{0.83 \times 0.91}{0.83 + 0.91} = 0.87$$

This F1 Score of 0.87 indicates that the model has achieved a good balance between identifying spam and avoiding false positives.
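
The same arithmetic can be checked with a few lines of plain Python. This is a minimal sketch; the helper name f1_from_counts is purely illustrative and not part of any library.

def f1_from_counts(tp, fp, fn):
    # Compute Precision, Recall, and F1 directly from raw counts
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Counts from the spam detection example above
precision, recall, f1 = f1_from_counts(tp=50, fp=10, fn=5)
print(f"Precision: {precision:.2f}, Recall: {recall:.2f}, F1: {f1:.2f}")
# Prints: Precision: 0.83, Recall: 0.91, F1: 0.87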

Example 2: Binary Classification F1 Calculation

Let’s consider a disease detection model that identifies whether a patient has a specific illness. The confusion matrix for predictions is:

  • TP: 80 (Correctly predicted positive cases)
  • FP: 20 (Healthy individuals identified as ill)
  • FN: 10 (Ill individuals not identified)

Step 1: Calculate Precision

$$Precision = \frac{TP}{TP + FP} = \frac{80}{80 + 20} = 0.80$$

Step 2: Calculate Recall

$$Recall = \frac{TP}{TP + FN} = \frac{80}{80 + 10} = 0.89$$

Step 3: Calculate F1 Score

$$F1 = 2 \times \frac{0.80 \times 0.89}{0.80 + 0.89} = 0.84$$

This example demonstrates how the F1 Score accounts for both false positives and false negatives, offering a more nuanced evaluation of the model’s performance.

2. F1 Score in Multiclass Classification

In multiclass classification, the F1 Score measures performance across multiple classes. Since each class may have different Precision and Recall values, three averaging techniques are commonly used:

  • Macro Average: Calculates the F1 Score for each class separately and then takes their unweighted mean.
  • Micro Average: Aggregates all true positives, false positives, and false negatives across classes and then calculates a global F1 Score.
  • Weighted Average: Averages the F1 Scores of each class, weighted by the number of true instances for each class.

Example of Multiclass F1 Score Calculation

Consider a classification model that predicts the category of items (e.g., fruits: apples, oranges, and bananas). The confusion matrix for predictions is:

Class      TP    FP    FN
Apples     40    10    5
Oranges    30    15    10
Bananas    50    5     8

Step 1: Calculate Precision and Recall for Each Class

  • Apples Precision: $\frac{40}{40 + 10} = 0.80$
  • Apples Recall: $\frac{40}{40 + 5} = 0.89$
  • Oranges Precision: $\frac{30}{30 + 15} = 0.67$
  • Oranges Recall: $\frac{30}{30 + 10} = 0.75$
  • Bananas Precision: $\frac{50}{50 + 5} = 0.91$
  • Bananas Recall: $\frac{50}{50 + 8} = 0.86$

Step 2: Calculate F1 Score for Each Class

  • Apples F1 Score: $2 \times \frac{0.80 \times 0.89}{0.80 + 0.89} = 0.84$
  • Oranges F1 Score: $2 \times \frac{0.67 \times 0.75}{0.67 + 0.75} = 0.71$
  • Bananas F1 Score: $2 \times \frac{0.91 \times 0.86}{0.91 + 0.86} = 0.88$

Step 3: Calculate Macro, Micro, and Weighted Averages

  • Macro Average: $(0.84 + 0.71 + 0.88) / 3 = 0.81$
  • Micro Average: pooling the counts gives $TP = 120$, $FP = 30$, $FN = 23$, so micro Precision $= 120/150 = 0.80$, micro Recall $= 120/143 = 0.84$, and micro F1 $= 2 \times \frac{0.80 \times 0.84}{0.80 + 0.84} = 0.82$
  • Weighted Average: using each class's support ($TP + FN$: 45, 40, and 58, totalling 143): $0.84 \times 45/143 + 0.71 \times 40/143 + 0.88 \times 58/143 = 0.82$

These averages help assess model performance across multiple classes, ensuring balanced evaluations.
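
These averages can also be verified with a short, plain-Python sketch. The counts dictionary and the f1 helper below are illustrative names chosen for this example, not library functions.

# Per-class counts from the fruit example above
counts = {
    "Apples":  {"tp": 40, "fp": 10, "fn": 5},
    "Oranges": {"tp": 30, "fp": 15, "fn": 10},
    "Bananas": {"tp": 50, "fp": 5,  "fn": 8},
}

def f1(tp, fp, fn):
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Macro: unweighted mean of the per-class F1 Scores
per_class_f1 = {c: f1(**v) for c, v in counts.items()}
macro = sum(per_class_f1.values()) / len(per_class_f1)

# Micro: pool all counts, then compute a single F1
tp = sum(v["tp"] for v in counts.values())
fp = sum(v["fp"] for v in counts.values())
fn = sum(v["fn"] for v in counts.values())
micro = f1(tp, fp, fn)

# Weighted: per-class F1 weighted by support (tp + fn)
support = {c: v["tp"] + v["fn"] for c, v in counts.items()}
total = sum(support.values())
weighted = sum(per_class_f1[c] * support[c] / total for c in counts)

print(f"Macro: {macro:.2f}, Micro: {micro:.2f}, Weighted: {weighted:.2f}")
# Prints: Macro: 0.81, Micro: 0.82, Weighted: 0.82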

Calculating F1 Score in Python

Python provides libraries such as scikit-learn and TensorFlow that simplify the calculation of F1 Score. Below are implementations for binary and multiclass classification F1 Scores using scikit-learn.

Python Implementation for Binary Classification F1 Score

Below is a step-by-step example of calculating the F1 Score for a binary classification task. The example uses scikit-learn with a dataset for spam email detection.

# Step 1: Import necessary libraries
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import make_classification

# Step 2: Generate a sample binary classification dataset
X, y = make_classification(n_samples=1000, n_features=20, n_classes=2, random_state=42)

# Step 3: Split the data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Step 4: Train a Logistic Regression model
model = LogisticRegression()
model.fit(X_train, y_train)

# Step 5: Make predictions on the test set
y_pred = model.predict(X_test)

# Step 6: Calculate the F1 Score
f1 = f1_score(y_test, y_pred)
print(f"F1 Score (Binary Classification): {f1:.2f}")

Explanation:

  • We import the necessary libraries and generate a sample binary dataset using make_classification().
  • The dataset is split into training and test sets with a 70-30 ratio.
  • A Logistic Regression model is trained on the training data, and predictions are made on the test set.
  • Finally, the F1 Score is calculated using f1_score() from scikit-learn.

This code demonstrates a typical binary classification use case with F1 Score evaluation. The result helps determine how well the model balances Precision and Recall.
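
To see the raw counts behind that score, you can extend the same script with scikit-learn's confusion_matrix. The snippet below assumes the y_test and y_pred variables from the code above.

from sklearn.metrics import confusion_matrix

# For binary labels, ravel() returns the counts in the order TN, FP, FN, TP
tn, fp, fn, tp = confusion_matrix(y_test, y_pred).ravel()
print(f"TP={tp}, FP={fp}, FN={fn}, TN={tn}")

# The F1 Score can be recomputed by hand from these counts
manual_f1 = 2 * tp / (2 * tp + fp + fn)
print(f"Manual F1 from counts: {manual_f1:.2f}")

The equivalent form 2TP / (2TP + FP + FN) follows from substituting the Precision and Recall formulas into the F1 definition, and it should match the value returned by f1_score().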

Python Implementation for Multiclass Classification F1 Score

Below is an example of calculating the F1 Score for a multiclass classification problem using the Iris dataset.

# Step 1: Import necessary libraries
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import load_iris

# Step 2: Load the Iris dataset
data = load_iris()
X, y = data.data, data.target

# Step 3: Split the data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Step 4: Train a Random Forest Classifier
model = RandomForestClassifier()
model.fit(X_train, y_train)

# Step 5: Make predictions on the test set
y_pred = model.predict(X_test)

# Step 6: Calculate the F1 Score (macro, micro, and weighted)
f1_macro = f1_score(y_test, y_pred, average='macro')
f1_micro = f1_score(y_test, y_pred, average='micro')
f1_weighted = f1_score(y_test, y_pred, average='weighted')

print(f"Macro F1 Score: {f1_macro:.2f}")
print(f"Micro F1 Score: {f1_micro:.2f}")
print(f"Weighted F1 Score: {f1_weighted:.2f}")

Explanation:

  • The Iris dataset is loaded, which contains three classes (Setosa, Versicolor, Virginica).
  • The data is split into training and test sets, and a Random Forest Classifier is trained on the training data.
  • Predictions are made on the test data, and the F1 Score is calculated using three different averaging methods:
    • Macro Average: Takes the average F1 Score for each class without weighting.
    • Micro Average: Aggregates all TP, FP, and FN values before calculating F1.
    • Weighted Average: Weighs the F1 Score of each class based on the number of true instances.
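
For a per-class breakdown to complement these averages, scikit-learn's classification_report prints Precision, Recall, F1, and support for every class in one call. The snippet assumes the y_test, y_pred, and data variables from the multiclass code above.

from sklearn.metrics import classification_report

# Per-class Precision, Recall, F1, and support, plus macro and weighted averages
print(classification_report(y_test, y_pred, target_names=data.target_names))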

Advantages and Limitations of Using F1 Score

Advantages of F1 Score and When It is Ideal to Use

The F1 Score is highly valuable in scenarios where Precision and Recall need to be balanced. It is particularly useful in imbalanced datasets, where relying on accuracy can be misleading. Some key use cases include:

  1. Spam Detection: F1 Score ensures spam emails are correctly identified while minimizing false positives.
  2. Medical Diagnostics: In healthcare, it balances identifying true positives (diseased individuals) with minimizing false negatives.
  3. Fraud Detection: Ensures fraud cases are not missed while limiting unnecessary investigations caused by false positives.

The F1 Score is ideal when both false positives and false negatives carry significant consequences. Unlike metrics such as accuracy, which can overlook class imbalance, the F1 Score offers a comprehensive view of a model’s classification performance.

Limitations of F1 Score and When It May Not Be Ideal

Despite its benefits, the F1 Score has some limitations:

  1. Interpretability: It can be less intuitive for non-technical stakeholders to understand compared to accuracy.
  2. Equal Weighting Issue: The F1 Score gives equal importance to Precision and Recall, which might not align with business priorities (e.g., focusing more on Recall in healthcare).
  3. Doesn’t Reflect True Negative Rates: The F1 Score does not account for true negatives, making it less suitable when TN is important, such as in credit risk modeling.

In cases where one metric is more critical (e.g., Recall in life-saving applications), or when the goal is minimizing false positives (Precision-focused tasks), alternative metrics like ROC-AUC or Precision-Recall curves may be better suited.
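
As a brief sketch of those alternatives, the snippet below computes ROC-AUC and average precision (a single-number summary of the Precision-Recall curve) with scikit-learn. It assumes the binary Logistic Regression model and test split from the earlier binary classification example, since both metrics require predicted scores rather than hard labels.

from sklearn.metrics import roc_auc_score, average_precision_score

# Probability assigned to the positive class, not hard 0/1 predictions
y_scores = model.predict_proba(X_test)[:, 1]
print(f"ROC-AUC: {roc_auc_score(y_test, y_scores):.2f}")

# Average precision summarizes the Precision-Recall curve
print(f"Average Precision: {average_precision_score(y_test, y_scores):.2f}")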

Conclusion

The F1 Score plays a crucial role in evaluating classification models, particularly when Precision and Recall need to be balanced. It ensures that imbalanced datasets are fairly assessed and offers a more nuanced alternative to traditional metrics like accuracy.

In this article, we explored the calculation methods for F1 Score in binary and multiclass classification, along with practical code examples. While the F1 Score is a powerful tool, understanding its limitations is essential for applying it effectively. By choosing the right metric based on problem context, practitioners can make better decisions and improve model performance.
