
Data Encryption & Decryption in Data Engineering

Learn Data Encryption and Decryption in Data Engineering with real-world Azure, PySpark, ETL, and enterprise security examples.

2026-05-15

Part of Series

Data Engineering Fundamentals

In modern Data Engineering projects, handling sensitive data securely is one of the most important responsibilities.

Organizations process:

  • Customer Names
  • Phone Numbers
  • Email IDs
  • Credit Card Information
  • Aadhaar Numbers
  • PAN Numbers
  • Banking Information
  • Healthcare Records
  • Employee Salary Data

If this data gets exposed, companies can face:

  • Security breaches
  • Financial losses
  • Legal penalties
  • Compliance violations
  • Loss of customer trust

That is why Data Encryption and Decryption become extremely important in real-world Data Engineering systems.

In this blog, we will understand:

  • What encryption and decryption are
  • Why they are important
  • Real-world implementation approaches
  • Symmetric vs Asymmetric encryption
  • Encryption in ETL pipelines
  • Azure implementation examples
  • Python implementation examples
  • Spark implementation examples
  • Best practices followed in production systems

This blog is beginner-friendly and focused on real-world Data Engineering scenarios.


What is Encryption?

Encryption is the process of converting readable data (plaintext) into an unreadable format (ciphertext) using a key.

Readable data:

CustomerName = Soumya

Encrypted data:

gH72JkL92kdP0xY

The encrypted value becomes meaningless unless someone has the correct key.


What is Decryption?

Decryption is the reverse process.

It converts encrypted unreadable data back into original readable data.

Encrypted:

gH72JkL92kdP0xY

Decrypted:

Soumya

Why Encryption is Important in Data Engineering

In real-world Data Engineering projects, data moves across multiple systems.

Example:

Source System → ADF Pipeline → Data Lake → Databricks → Data Warehouse → BI Dashboard

Sensitive information may travel through:

  • APIs
  • Files
  • Databases
  • Pipelines
  • Cloud Storage
  • Message Queues

Without encryption:

  • Anyone accessing storage can read data
  • Insider attacks become dangerous
  • Data leaks become easier
  • Compliance audits fail

That is why enterprise systems treat encryption of sensitive data as a baseline requirement rather than an optional extra.


Real-world Sensitive Data Examples

Personally Identifiable Information (PII)

Examples:

  • Name
  • Mobile Number
  • Email
  • Address
  • Aadhaar Number

Financial Data

Examples:

  • Credit Card Numbers
  • Bank Account Details
  • UPI IDs

Healthcare Data

Examples:

  • Medical Records
  • Patient History
  • Insurance Data

Enterprise Confidential Data

Examples:

  • Employee Salaries
  • Internal Reports
  • Client Information

Types of Encryption

There are mainly two types of encryption used in enterprise systems.


Symmetric Encryption

In symmetric encryption, the same key is used for both:

  • Encryption
  • Decryption

Example:

Plain Text → Encrypt using Key123
Encrypted Text → Decrypt using Key123

Advantages

  • Faster
  • Efficient for large datasets
  • Common in ETL pipelines

Disadvantages

  • Key sharing is risky: anyone who obtains the key can decrypt all the data

Asymmetric Encryption

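In asymmetric encryption, two different keys are used:

  • Public Key
  • Private Key

Flow

Public Key → Encrypt
Private Key → Decrypt

Advantages

  • No shared secret to exchange: the public key can be distributed openly, and only the private key must stay protected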
  • Better for secure communication

Disadvantages

  • Slower than symmetric encryption
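In practice the two approaches are often combined: the bulk data is encrypted with a fast symmetric key, and only that small key is encrypted with the receiver's public key. The sketch below illustrates this hybrid pattern with the cryptography package. The sample payload, the 2048-bit key size, and all variable names are illustrative assumptions, not part of any specific pipeline.

```python
from cryptography.fernet import Fernet
from cryptography.hazmat.primitives import hashes
from cryptography.hazmat.primitives.asymmetric import padding, rsa

# Receiver generates an RSA key pair; only the public key is shared.
private_key = rsa.generate_private_key(public_exponent=65537, key_size=2048)
public_key = private_key.public_key()

# OAEP padding settings, used for both wrapping and unwrapping the key.
oaep = padding.OAEP(
    mgf=padding.MGF1(algorithm=hashes.SHA256()),
    algorithm=hashes.SHA256(),
    label=None,
)

# Sender: encrypt the data with a fresh symmetric key (fast),
# then wrap that small key with the receiver's public key (slow).
data_key = Fernet.generate_key()
ciphertext = Fernet(data_key).encrypt(b"card=1234-5678-9999")
wrapped_key = public_key.encrypt(data_key, oaep)

# Receiver: unwrap the symmetric key with the private key, then decrypt.
unwrapped_key = private_key.decrypt(wrapped_key, oaep)
plaintext = Fernet(unwrapped_key).decrypt(ciphertext)
```

Only wrapped_key and ciphertext need to travel between systems; the private key never leaves the receiver.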

Most Common Encryption Algorithms

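Algorithm | Type | Usage
AES | Symmetric | Most common in enterprises
RSA | Asymmetric | Secure communication and key exchange
DES | Symmetric | Legacy systems only; considered insecure and superseded by AES
SHA-256 | Hashing (one-way, not encryption) | Data integrity and checksums; for passwords, use a salted, deliberately slow hash such as bcrypt or Argon2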

Encryption vs Hashing

Many beginners confuse these concepts.

Encryption

Encrypted data can be decrypted back to the original value by anyone who holds the key.

Example:

Original → Encrypt → Decrypt → Original

Used for:

  • Sensitive business data
  • API security
  • Secure storage

Hashing

Hashing cannot be reversed.

Example:

password123 → kjh78shdjh2

Used for:

  • Password storage
  • Data integrity
  • Checksums
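To make the difference concrete, here is a minimal sketch using only Python's standard library. sha256_hex shows that a hash is deterministic and has no decrypt step, while hash_password and verify_password illustrate salted password hashing with PBKDF2. The function names and the iteration count are assumptions chosen for illustration.

```python
import hashlib
import hmac
import os


def sha256_hex(value: str) -> str:
    """Plain SHA-256: same input always gives the same 64-character hex digest."""
    return hashlib.sha256(value.encode("utf-8")).hexdigest()


def hash_password(password: str, salt=None, iterations: int = 200_000):
    """Salted, deliberately slow hash for passwords (PBKDF2-HMAC-SHA256)."""
    if salt is None:
        salt = os.urandom(16)  # a fresh random salt per password
    digest = hashlib.pbkdf2_hmac(
        "sha256", password.encode("utf-8"), salt, iterations
    )
    return salt, digest


def verify_password(password: str, salt: bytes, expected: bytes,
                    iterations: int = 200_000) -> bool:
    """Recompute the hash with the stored salt and compare in constant time."""
    _, digest = hash_password(password, salt, iterations)
    return hmac.compare_digest(digest, expected)


salt, stored = hash_password("password123")
print(verify_password("password123", salt, stored))  # True
print(verify_password("wrong-guess", salt, stored))  # False
```

Verification works by hashing the candidate again and comparing, never by reversing the stored value.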

Real-world Enterprise Tip

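Most enterprises do NOT encrypt entire datasets at the application level (storage-level encryption at rest is usually applied to everything by the platform). They usually add application-level encryption only to sensitive columns such as:

  • email
  • salary
  • phone number
  • bank details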

This helps balance:

  • security
  • performance
  • analytics efficiency

Real-world Data Engineering Example

Suppose an e-commerce company receives customer data.

Raw Data

Customer | Credit Card
John | 1234-5678-9999
Mike | 8888-4444-2222

Storing this directly is dangerous.

Encrypted Data

Customer | Credit Card
John | XHG72KSLP
Mike | PLS82KDKS

Now even if someone accesses storage, the data is unreadable.


Where Encryption is Used in Data Engineering

Data in Transit

Data moving between systems.

Examples:

  • API communication
  • Database connections
  • ETL pipelines

Usually protected using:

  • HTTPS
  • TLS (the modern successor to SSL; the older SSL protocol versions are deprecated)
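As a small illustration of protecting data in transit from client code, the sketch below builds a TLS context with Python's standard ssl module. The helper name make_tls_context is hypothetical, and the choice of TLS 1.2 as the minimum version is an assumption; your organization's policy may require a higher floor.

```python
import ssl


def make_tls_context() -> ssl.SSLContext:
    """Build a client-side TLS context that verifies the server."""
    # create_default_context() enables certificate and hostname checks.
    context = ssl.create_default_context()
    # Refuse legacy protocol versions below TLS 1.2.
    context.minimum_version = ssl.TLSVersion.TLSv1_2
    return context


context = make_tls_context()
```

A context like this can be passed to standard clients, for example urllib.request.urlopen(url, context=context), so that API calls made by a pipeline reject unverified or outdated connections.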

Data at Rest

Data stored in:

  • Data Lakes
  • Blob Storage
  • Databases
  • Warehouses

Usually protected using:

  • AES encryption
  • Storage encryption
  • Managed keys

Column-level Encryption

Only sensitive columns are encrypted.

Example:

Name | Salary | Email
Visible | Encrypted | Encrypted

Very common in enterprise systems.
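A minimal sketch of column-level encryption on plain Python rows, using Fernet from the cryptography package. The column names, the SENSITIVE_COLUMNS set, and the helper functions are hypothetical; the key is generated inline only to keep the example self-contained and would come from a secret store in a real pipeline.

```python
from cryptography.fernet import Fernet

# Hypothetical list of columns that must never be stored in clear text.
SENSITIVE_COLUMNS = {"salary", "email"}


def encrypt_row(row: dict, cipher: Fernet) -> dict:
    """Encrypt only the sensitive columns; leave the rest readable."""
    out = {}
    for column, value in row.items():
        if column in SENSITIVE_COLUMNS and value is not None:
            token = cipher.encrypt(str(value).encode("utf-8"))
            out[column] = token.decode("ascii")  # Fernet tokens are ASCII-safe
        else:
            out[column] = value
    return out


def decrypt_row(row: dict, cipher: Fernet) -> dict:
    """Reverse encrypt_row; decrypted values come back as strings."""
    out = {}
    for column, value in row.items():
        if column in SENSITIVE_COLUMNS and value is not None:
            out[column] = cipher.decrypt(value.encode("ascii")).decode("utf-8")
        else:
            out[column] = value
    return out


cipher = Fernet(Fernet.generate_key())  # demo key only
rows = [{"name": "John", "salary": 50000, "email": "john@example.com"}]
encrypted = [encrypt_row(r, cipher) for r in rows]
decrypted = [decrypt_row(r, cipher) for r in encrypted]
```

Note that the salary comes back as the string "50000", so a real pipeline would need to cast decrypted values to their original types.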


Real-world Azure Example

Suppose we are using:

  • Azure Data Factory
  • Azure Key Vault
  • Azure Blob Storage
  • Azure Databricks

Typical Enterprise Flow

Source System
↓
ADF Pipeline
↓
Key Vault retrieves encryption key
↓
Encrypt sensitive columns
↓
Store encrypted data in Data Lake
↓
Authorized systems decrypt when needed

Common Beginner Mistake

Many beginners directly store secrets inside:

  • notebooks
  • GitHub repositories
  • pipeline code
  • config files

This is a major enterprise security issue.

Production systems should keep secrets in a dedicated secret manager such as:

  • Azure Key Vault
  • AWS Secrets Manager
  • GCP Secret Manager

Why Azure Key Vault is Important

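One huge mistake beginners make is hardcoding the key inside code:

key = "my-secret-key"

This is extremely dangerous: anyone who can read the code, the notebook, or the repository history can also read the key.

In production, keys should NEVER be hardcoded.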
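One simple way to keep the key out of source code is to read it at runtime from the environment, where the platform has injected it from a secret store. The sketch below assumes a hypothetical environment variable named DATA_ENCRYPTION_KEY and a hypothetical helper load_encryption_key. On Azure Databricks, the usual route is a Key Vault-backed secret scope read through dbutils.secrets.get rather than an environment variable.

```python
import os


class MissingSecretError(RuntimeError):
    """Raised when the expected secret is not available at runtime."""


def load_encryption_key(env_var: str = "DATA_ENCRYPTION_KEY") -> bytes:
    """Read the key from the environment instead of hardcoding it."""
    value = os.environ.get(env_var)
    if not value:
        # Report only the variable name, never any secret value.
        raise MissingSecretError(f"Environment variable {env_var} is not set")
    return value.encode("ascii")
```

Failing fast when the secret is missing is a deliberate choice here: it prevents the pipeline from silently falling back to a default or empty key.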

Python Encryption Example

Install Required Library

pip install cryptography

Generate Encryption Key

from cryptography.fernet import Fernet

# Generate key
key = Fernet.generate_key()
print(key)

Encrypt Data

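from cryptography.fernet import Fernet

# Demo only: in production, load the key from a secret manager
key = Fernet.generate_key()
cipher = Fernet(key)

text = "Soumya"

# encrypt() expects bytes, so encode the string first
encrypted_text = cipher.encrypt(text.encode())
print(encrypted_text)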

Decrypt Data

# Reuses cipher and encrypted_text from the Encrypt Data step
decrypted_text = cipher.decrypt(encrypted_text)
print(decrypted_text.decode())

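Spark Encryption Example

Note that sha2() is a one-way hash, not encryption: the original email cannot be recovered from email_hash. This is useful when a column only needs to be matched or joined on. For reversible column encryption, Spark 3.3 and later provide the aes_encrypt and aes_decrypt SQL functions.

from pyspark.sql.functions import sha2, col

# df is an existing DataFrame with an "email" column
# Adds a SHA-256 hash of the email as a hex string
hashed_df = df.withColumn(
    "email_hash",
    sha2(col("email"), 256)
)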
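One caveat with hashing identifiers such as emails: the set of possible inputs is guessable, so an unkeyed hash can be matched against a list of known addresses. A keyed hash (HMAC) with a secret key reduces that risk while still producing stable values for joins. The sketch below uses only the standard library; the function name pseudonymize and the demo key are assumptions, and in Spark the same logic could be applied through a UDF.

```python
import hashlib
import hmac


def pseudonymize(value: str, secret_key: bytes) -> str:
    """Keyed SHA-256 (HMAC) of a normalized identifier, as a hex string."""
    # Normalize so "John@Example.com " and "john@example.com" match.
    normalized = value.strip().lower()
    return hmac.new(
        secret_key, normalized.encode("utf-8"), hashlib.sha256
    ).hexdigest()


demo_key = b"demo-key-load-from-a-secret-store"  # placeholder, not a real key
token_a = pseudonymize("John@Example.com ", demo_key)
token_b = pseudonymize("john@example.com", demo_key)
print(token_a == token_b)  # True
```

Like sha2(), this is still one-way: it supports matching and joining, not recovering the original email.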

Production Best Practices

Never Hardcode Keys

Wrong:

key = "abcd1234"

Correct: load the key at runtime from a secret manager such as:

  • Azure Key Vault
  • AWS Secrets Manager
  • GCP Secret Manager

Encrypt Only Sensitive Columns

Usually encrypt:

  • PAN
  • Aadhaar
  • Salary
  • Email
  • Phone

Use Role-based Access

Not everyone should decrypt data.

Use:

  • IAM
  • RBAC
  • AD Groups
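The idea can be sketched as a role check placed in front of the decrypt call. Everything here is hypothetical: the role names, the read_sensitive helper, and fake_decrypt, which only reverses a string so the example runs without a key. In a real system the check should also be enforced by the platform (for example through access policies on the key itself), not only in application code.

```python
from typing import Callable, Iterable

# Hypothetical role names; real systems map these to IAM roles or AD groups.
DECRYPT_ROLES = frozenset({"data_steward", "compliance_auditor"})


def read_sensitive(token: str, user_roles: Iterable[str],
                   decrypt_fn: Callable[[str], str],
                   masked: str = "****") -> str:
    """Return the decrypted value for authorized roles, else a mask."""
    if DECRYPT_ROLES.intersection(user_roles):
        return decrypt_fn(token)
    return masked


def fake_decrypt(token: str) -> str:
    """Stand-in for a real decrypt call; it just reverses the string."""
    return token[::-1]


print(read_sensitive("moc.elpmaxe@nhoj", {"data_steward"}, fake_decrypt))
print(read_sensitive("moc.elpmaxe@nhoj", {"analyst"}, fake_decrypt))
```

The first call prints the readable value and the second prints the mask, because the analyst role is not in the allowed set.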

Rotate Encryption Keys

Production systems rotate keys periodically, which limits how much data is exposed if a single key is ever compromised.
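With the cryptography package, rotation can be sketched using MultiFernet, which encrypts with the first key in its list and can decrypt with any of them. The keys and the sample value below are generated inline purely for illustration; in practice both keys would live in a secret store.

```python
from cryptography.fernet import Fernet, MultiFernet

# Data that was encrypted some time ago with the old key.
old_key = Fernet(Fernet.generate_key())
token = old_key.encrypt(b"PAN=ABCDE1234F")

# Introduce a new key. The first key in the list is used for encryption;
# every key in the list can still be used for decryption.
new_key = Fernet(Fernet.generate_key())
keyring = MultiFernet([new_key, old_key])

# Existing data stays readable during the transition...
still_readable = keyring.decrypt(token)

# ...and rotate() re-encrypts a token under the new primary key.
rotated = keyring.rotate(token)
```

Once every stored token has been rotated, the old key can be removed from the list and retired.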

Mask Data in Dashboards

Example:

XXXX-XXXX-1234

instead of full values.
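A masking rule like this can be sketched as a small helper that keeps separators and only the last few digits. The function name mask_card_number and its parameters are hypothetical; reporting tools usually offer their own masking features, so treat this as an illustration of the rule rather than a recommended implementation.

```python
def mask_card_number(card_number: str, visible: int = 4,
                     mask_char: str = "X") -> str:
    """Mask all digits except the last `visible`, keeping separators."""
    total_digits = sum(ch.isdigit() for ch in card_number)
    # If the value is too short to keep a visible tail, mask every digit.
    to_mask = total_digits - visible if total_digits > visible else total_digits
    out = []
    for ch in card_number:
        if ch.isdigit() and to_mask > 0:
            out.append(mask_char)
            to_mask -= 1
        else:
            out.append(ch)
    return "".join(out)


print(mask_card_number("1234-5678-9999"))  # XXXX-XXXX-9999
```

Unlike encryption, masking is applied at display time and is not reversible from the masked value.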


Common Beginner Mistakes

Mistake | Problem
Hardcoding keys | Huge security risk
Encrypting everything | Performance issues
No access control | Data leaks
Logging raw sensitive data | Security violation
Sharing keys over email | Dangerous
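For the logging mistake in particular, one possible safeguard is a logging filter that redacts sensitive patterns before a message is written. The sketch below uses Python's standard logging module; the two regular expressions are simplified assumptions that will not catch every email or card format, and the logger name "etl" is hypothetical.

```python
import logging
import re

# Simplified patterns for illustration; real data needs broader coverage.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
CARD_RE = re.compile(r"\b\d{4}[- ]?\d{4}[- ]?\d{4}(?:[- ]?\d{4})?\b")


def redact(text: str) -> str:
    """Replace emails and card-like numbers with placeholders."""
    text = EMAIL_RE.sub("[REDACTED_EMAIL]", text)
    return CARD_RE.sub("[REDACTED_CARD]", text)


class RedactingFilter(logging.Filter):
    """Redact the fully formatted message before it reaches a handler."""

    def filter(self, record: logging.LogRecord) -> bool:
        record.msg = redact(record.getMessage())
        record.args = None  # the message is already formatted
        return True


logger = logging.getLogger("etl")
handler = logging.StreamHandler()
handler.addFilter(RedactingFilter())
logger.addHandler(handler)
logger.error("Load failed for %s", "john@example.com")
```

Attaching the filter to the handler rather than to a single logger means records from child loggers that reach this handler are redacted as well.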

Scenario-based Interview Questions

  1. You are building an ETL pipeline with customer name, email, phone number, PAN number, city, and order amount. Which columns would you encrypt and why?

  2. What is the difference between encryption, hashing, and masking in real-world data engineering?

  3. When would you use encryption instead of hashing?

  4. When would you use hashing instead of encryption?

  5. You need to store customer email IDs and later decrypt them for authorized users. Would you use hashing or encryption?

  6. You need to store user passwords. Would you encrypt them or hash them? Why?

  7. What is the difference between data encryption at rest and data encryption in transit?

  8. In an Azure data pipeline, how would you use Azure Key Vault with ADF or Databricks?

  9. Why should encryption keys never be hardcoded inside Python scripts, notebooks, or GitHub repositories?

  10. A developer stored an encryption key inside a Databricks notebook. What risks does this create, and how would you fix it?

  11. Why do companies usually encrypt only sensitive columns instead of encrypting the entire dataset?

  12. Your Power BI dashboard shows customer phone numbers. How would you mask them for business users?

  13. What is role-based access control, and why is it important for decryption access?

  14. How would you make sure only authorized users can decrypt sensitive customer data?

  15. What is key rotation, and why is it important in production systems?

  16. If an encryption key is compromised, what steps should the data engineering team take?

  17. In PySpark, how would you protect sensitive columns before writing data to a data lake?

  18. If you apply sha2() on an email column in Spark, can you get the original email back? Why or why not?

  19. Your ETL pipeline fails and stores error logs. How would you make sure sensitive data is not exposed in logs?

  20. Explain an end-to-end secure data engineering pipeline using ADF, Databricks, ADLS Gen2, and Azure Key Vault.


Visual Enterprise Architecture Flow

┌───────────────────────┐
│ Source Systems        │
│ API / DB / Files      │
└─────────┬─────────────┘
┌───────────────────────┐
│ Azure Data Factory    │
│ ETL Pipeline          │
└─────────┬─────────────┘
┌───────────────────────┐
│ Azure Key Vault       │
│ Encryption Key Access │
└─────────┬─────────────┘
┌───────────────────────┐
│ Azure Databricks      │
│ Encrypt Sensitive Data│
└─────────┬─────────────┘
┌───────────────────────┐
│ ADLS / Delta Lake     │
│ Encrypted Storage     │
└─────────┬─────────────┘
┌───────────────────────┐
│ Authorized Consumers  │
│ BI / Analytics Apps   │
└───────────────────────┘

Real-world Project Implementation Idea

Project Title

Secure Customer Data Pipeline Using Azure + PySpark

Project Objective

Build an ETL pipeline that:

  • Ingests customer data
  • Encrypts sensitive columns
  • Stores encrypted data in Data Lake
  • Decrypts data only for authorized users
  • Masks sensitive fields in reporting layer

Technologies You Can Use

Component | Technology
Storage | Azure Data Lake Gen2
ETL | Azure Data Factory
Processing | Azure Databricks
Secret Management | Azure Key Vault
Language | Python / PySpark
Reporting | Power BI

End-to-end Architecture

CSV/API Source
↓
Azure Data Factory
↓
Azure Key Vault retrieves encryption key
↓
Azure Databricks encrypts sensitive columns
↓
Encrypted data stored in ADLS Gen2
↓
Authorized users decrypt data
↓
Power BI displays masked information

Resume-ready Project Description

Built a secure Azure-based ETL pipeline using ADF, Databricks, ADLS Gen2, and Azure Key Vault to encrypt sensitive customer data and implement enterprise-grade security practices.

Conclusion

Encryption and decryption are extremely important in modern Data Engineering.

As a Data Engineer, your responsibility is not only moving data.

You must also ensure:

  • data security
  • privacy
  • compliance
  • secure access
  • enterprise governance

The most important takeaway:

Never treat security as optional.

Even beginner projects should follow:

  • secure credential handling
  • encrypted storage
  • proper access control
  • secret management

because these are real-world enterprise practices.

Written By

Soumya Ranjan Bisoyi

Data Engineer • Mentor • Educator

Helping aspiring Data Engineers learn SQL, Spark, Snowflake, Azure, and real-world Data Engineering concepts through practical, beginner-friendly content.