Data Encryption & Decryption in Data Engineering
In modern Data Engineering projects, handling sensitive data securely is one of the most important responsibilities.
Organizations process:
- Customer Names
- Phone Numbers
- Email IDs
- Credit Card Information
- Aadhaar Numbers
- PAN Numbers
- Banking Information
- Healthcare Records
- Employee Salary Data
If this data gets exposed, companies can face:
- Security breaches
- Financial losses
- Legal penalties
- Compliance violations
- Loss of customer trust
That is why Data Encryption and Decryption become extremely important in real-world Data Engineering systems.
In this blog, we will understand:
- What encryption and decryption are
- Why they are important
- Real-world implementation approaches
- Symmetric vs Asymmetric encryption
- Encryption in ETL pipelines
- Azure implementation examples
- Python implementation examples
- Spark implementation examples
- Best practices followed in production systems
This blog is beginner-friendly and focused on real-world Data Engineering scenarios.
What is Encryption?
Encryption is the process of converting readable data into unreadable format.
Readable data:
CustomerName = Soumya
Encrypted data:
gH72JkL92kdP0xY
The encrypted value becomes meaningless unless someone has the correct key.
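To make the "correct key" point concrete, here is a minimal sketch using the `cryptography` package's Fernet recipe. The variable names are illustrative assumptions; the point is only that a token encrypted with one key cannot be read with another.

```python
from cryptography.fernet import Fernet, InvalidToken

# Two unrelated keys (generated here only for the demo)
right_key = Fernet.generate_key()
wrong_key = Fernet.generate_key()

# Encrypt with the first key
token = Fernet(right_key).encrypt(b"Soumya")

# The matching key recovers the original bytes
recovered = Fernet(right_key).decrypt(token)

# A different key cannot decrypt the token; Fernet raises InvalidToken
try:
    Fernet(wrong_key).decrypt(token)
    wrong_key_worked = True
except InvalidToken:
    wrong_key_worked = False
```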
What is Decryption?
Decryption is the reverse process.
It converts encrypted unreadable data back into original readable data.
Encrypted:
gH72JkL92kdP0xY
Decrypted:
Soumya
Why Encryption is Important in Data Engineering
In real-world Data Engineering projects, data moves across multiple systems.
Example:
Source System
↓
ADF Pipeline
↓
Data Lake
↓
Databricks
↓
Data Warehouse
↓
BI Dashboard
Sensitive information may travel through:
- APIs
- Files
- Databases
- Pipelines
- Cloud Storage
- Message Queues
Without encryption:
- Anyone accessing storage can read data
- Insider attacks become dangerous
- Data leaks become easier
- Compliance audits fail
That is why encryption is mandatory in enterprise systems.
Real-world Sensitive Data Examples
Personally Identifiable Information (PII)
Examples:
- Name
- Mobile Number
- Address
- Aadhaar Number
Financial Data
Examples:
- Credit Card Numbers
- Bank Account Details
- UPI IDs
Healthcare Data
Examples:
- Medical Records
- Patient History
- Insurance Data
Enterprise Confidential Data
Examples:
- Employee Salaries
- Internal Reports
- Client Information
Types of Encryption
There are mainly two types of encryption used in enterprise systems.
Symmetric Encryption
In symmetric encryption:
The same key is used for:
- Encryption
- Decryption
Example:
Plain Text → Encrypt using Key123
Encrypted Text → Decrypt using Key123
Advantages
- Faster
- Efficient for large datasets
- Common in ETL pipelines
Disadvantages
- Key sharing becomes risky
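As a rough sketch of symmetric encryption, the example below uses AES-256 in GCM mode from the `cryptography` package. The helper names `encrypt_value` and `decrypt_value`, and the choice to prepend the nonce to the ciphertext, are assumptions made for illustration rather than a prescribed layout.

```python
import os
from cryptography.hazmat.primitives.ciphers.aead import AESGCM

NONCE_SIZE = 12  # bytes; a fresh random nonce is needed for every encryption

def encrypt_value(key: bytes, plaintext: str) -> bytes:
    """Encrypt a string and return nonce + ciphertext (ciphertext includes the auth tag)."""
    nonce = os.urandom(NONCE_SIZE)
    ciphertext = AESGCM(key).encrypt(nonce, plaintext.encode("utf-8"), None)
    return nonce + ciphertext

def decrypt_value(key: bytes, blob: bytes) -> str:
    """Split off the nonce and decrypt with the same key that encrypted the value."""
    nonce, ciphertext = blob[:NONCE_SIZE], blob[NONCE_SIZE:]
    return AESGCM(key).decrypt(nonce, ciphertext, None).decode("utf-8")

# One key does both jobs, which is what makes the scheme symmetric
key = AESGCM.generate_key(bit_length=256)
blob = encrypt_value(key, "Plain Text")
roundtrip = decrypt_value(key, blob)
```

Because both sides need the same `key`, how that key is distributed and stored matters as much as the algorithm itself.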
Asymmetric Encryption
In asymmetric encryption:
Two different keys are used:
- Public Key
- Private Key
Flow
Public Key → Encrypt
Private Key → Decrypt
Advantages
- No shared secret has to be exchanged; the public key can be distributed openly
- Better for secure communication
Disadvantages
- Slower than symmetric encryption
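The public/private flow can be sketched with RSA and OAEP padding from the `cryptography` package. The key size and the sample payload are illustrative assumptions. RSA only handles small payloads, so in practice it is often used to protect a symmetric key rather than the data itself.

```python
from cryptography.hazmat.primitives import hashes
from cryptography.hazmat.primitives.asymmetric import padding, rsa

# The private key stays with the receiver; the public key can be shared
private_key = rsa.generate_private_key(public_exponent=65537, key_size=2048)
public_key = private_key.public_key()

# OAEP is the padding scheme generally recommended for RSA encryption
oaep = padding.OAEP(
    mgf=padding.MGF1(algorithm=hashes.SHA256()),
    algorithm=hashes.SHA256(),
    label=None,
)

# Anyone holding the public key can encrypt
ciphertext = public_key.encrypt(b"small-secret", oaep)

# Only the private key holder can decrypt
plaintext = private_key.decrypt(ciphertext, oaep)
```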
Most Common Encryption Algorithms
| Algorithm | Type | Usage |
|---|---|---|
| AES | Symmetric | Most common in enterprises |
| RSA | Asymmetric | Secure communication |
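| DES | Symmetric | Legacy systems only (considered insecure today) |
| SHA-2 (e.g. SHA-256) | Hashing | Integrity checks; password storage only with a salted, slow scheme such as PBKDF2 or bcrypt |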
Encryption vs Hashing
Many beginners confuse these concepts.
Encryption
Encryption can be decrypted back.
Example:
Original → Encrypt → Decrypt → Original
Used for:
- Sensitive business data
- API security
- Secure storage
Hashing
Hashing cannot be reversed.
Example:
password123 → kjh78shdjh2
Used for:
- Password storage
- Data integrity
- Checksums
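A small standard-library sketch of the hashing side is shown below. The function names and the PBKDF2 iteration count are illustrative assumptions. It shows a plain SHA-256 fingerprint for integrity checks and a salted, deliberately slow hash for passwords, neither of which can be turned back into the input.

```python
import hashlib
import hmac
import os
from typing import Optional, Tuple

def fingerprint(value: str) -> str:
    """Deterministic SHA-256 digest, suitable for checksums and integrity checks."""
    return hashlib.sha256(value.encode("utf-8")).hexdigest()

def hash_password(password: str, salt: Optional[bytes] = None) -> Tuple[bytes, bytes]:
    """Salted PBKDF2 hash; the iteration count here is an assumption for the demo."""
    salt = salt or os.urandom(16)
    digest = hashlib.pbkdf2_hmac("sha256", password.encode("utf-8"), salt, 200_000)
    return salt, digest

def verify_password(password: str, salt: bytes, expected: bytes) -> bool:
    """Re-hash the candidate with the stored salt and compare in constant time."""
    _, digest = hash_password(password, salt)
    return hmac.compare_digest(digest, expected)

salt, stored_digest = hash_password("password123")
```

Verification works by hashing the candidate again and comparing digests; there is no decrypt step.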
Real-world Enterprise Tip
Most enterprises do NOT apply application-level encryption to entire datasets.
Storage-level encryption usually covers everything at rest; on top of that, they encrypt only sensitive columns like:
- email
- salary
- phone number
- bank details
This helps balance:
- security
- performance
- analytics efficiency
Real-world Data Engineering Example
Suppose an e-commerce company receives customer data.
Raw Data
| Customer | Credit Card |
|---|---|
| John | 1234-5678-9999 |
| Mike | 8888-4444-2222 |
Storing this directly is dangerous.
Encrypted Data
| Customer | Credit Card |
|---|---|
| John | XHG72KSLP |
| Mike | PLS82KDKS |
Now even if someone accesses storage, the data is unreadable.
Where Encryption is Used in Data Engineering
Data in Transit
Data moving between systems.
Examples:
- API communication
- Database connections
- ETL pipelines
Usually protected using:
- HTTPS
- TLS (the successor to SSL, which is now deprecated)
Data at Rest
Data stored in:
- Data Lakes
- Blob Storage
- Databases
- Warehouses
Usually protected using:
- AES encryption
- Storage encryption
- Managed keys
Column-level Encryption
Only sensitive columns are encrypted.
Example:
| Name | Salary | Email |
|---|---|---|
| Visible | Encrypted | Encrypted |
Very common in enterprise systems.
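A minimal sketch of column-level encryption on plain Python rows follows, using Fernet from the `cryptography` package. The column names in `SENSITIVE_COLUMNS` and the helper names are assumptions for illustration; the key is generated inline only to keep the example self-contained.

```python
from cryptography.fernet import Fernet

# Hypothetical set of columns treated as sensitive
SENSITIVE_COLUMNS = {"salary", "email"}

def encrypt_row(row: dict, cipher: Fernet) -> dict:
    """Encrypt only the sensitive columns; leave the rest readable."""
    out = {}
    for column, value in row.items():
        if column in SENSITIVE_COLUMNS:
            out[column] = cipher.encrypt(str(value).encode("utf-8")).decode("ascii")
        else:
            out[column] = value
    return out

def decrypt_row(row: dict, cipher: Fernet) -> dict:
    """Reverse encrypt_row; decrypted values come back as strings."""
    out = {}
    for column, value in row.items():
        if column in SENSITIVE_COLUMNS:
            out[column] = cipher.decrypt(value.encode("ascii")).decode("utf-8")
        else:
            out[column] = value
    return out

cipher = Fernet(Fernet.generate_key())  # demo key only
row = {"name": "John", "salary": 50000, "email": "john@example.com"}
stored = encrypt_row(row, cipher)
restored = decrypt_row(stored, cipher)
```

Non-sensitive columns stay usable for joins and aggregations, which is the main reason for encrypting selectively.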
Real-world Azure Example
Suppose we are using:
- Azure Data Factory
- Azure Key Vault
- Azure Blob Storage
- Azure Databricks
Typical Enterprise Flow
Source System
↓
ADF Pipeline
↓
Key Vault retrieves encryption key
↓
Encrypt sensitive columns
↓
Store encrypted data in Data Lake
↓
Authorized systems decrypt when needed
Common Beginner Mistake
Many beginners directly store secrets inside:
- notebooks
- GitHub repositories
- pipeline code
- config files
This is a major enterprise security issue.
Production systems keep secrets in a managed secret store such as:
- Azure Key Vault
- AWS Secrets Manager
- GCP Secret Manager
Why Azure Key Vault is Important
One huge mistake beginners make:
key = "my-secret-key"
inside code.
This is extremely dangerous.
In production:
Keys should NEVER be hardcoded.
Python Encryption Example
Install Required Library
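```
pip install cryptography
```
Generate Encryption Key
```python
from cryptography.fernet import Fernet

# Generate a new random key (URL-safe base64-encoded bytes)
key = Fernet.generate_key()
print(key)
```
Encrypt Data
```python
from cryptography.fernet import Fernet

# Generated inline only to keep the example self-contained;
# in a real pipeline the key is loaded from a secret manager
key = Fernet.generate_key()
cipher = Fernet(key)
text = "Soumya"
# Fernet works on bytes, so encode the string first
encrypted_text = cipher.encrypt(text.encode())
print(encrypted_text)
```
Decrypt Data
```python
# Reuses `cipher` and `encrypted_text` from the previous step
decrypted_text = cipher.decrypt(encrypted_text)
print(decrypted_text.decode())
```
Spark Encryption Example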
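Note that Spark's built-in sha2 function is a one-way hash, not encryption. It is useful for pseudonymizing a column, but the original value cannot be recovered from the hash.
```python
from pyspark.sql.functions import sha2, col

# Assumes `df` is an existing DataFrame with an "email" column
hashed_df = df.withColumn(
    "email_hash",
    sha2(col("email"), 256)  # SHA-256 digest as a hex string
)
```
Production Best Practices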
Never Hardcode Keys
Wrong:
key = "abcd1234"
Correct:
Use:
- Azure Key Vault
- AWS Secrets Manager
- GCP Secret Manager
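One way this could look in Python is sketched below. The environment variable name `DATA_ENCRYPTION_KEY` and both function names are assumptions. The Key Vault variant relies on the `azure-identity` and `azure-keyvault-secrets` packages and is imported lazily so the rest of the module works without them.

```python
import os

def load_key_from_env(var_name: str = "DATA_ENCRYPTION_KEY") -> bytes:
    """Read the key from the environment, where a deployment tool injected it."""
    value = os.environ.get(var_name)
    if not value:
        # Fail loudly instead of falling back to a key embedded in code
        raise RuntimeError(f"{var_name} is not set")
    return value.encode("ascii")

def load_key_from_key_vault(vault_url: str, secret_name: str) -> bytes:
    """Fetch the key from Azure Key Vault using the caller's Azure identity."""
    from azure.identity import DefaultAzureCredential
    from azure.keyvault.secrets import SecretClient

    client = SecretClient(vault_url=vault_url, credential=DefaultAzureCredential())
    return client.get_secret(secret_name).value.encode("ascii")
```

Either way, the key value never appears in the repository, the notebook, or the pipeline definition.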
Encrypt Only Sensitive Columns
Usually encrypt:
- PAN
- Aadhaar
- Salary
- Phone
Use Role-based Access
Not everyone should decrypt data.
Use:
- IAM
- RBAC
- AD Groups
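As an illustration only, the check below gates decryption on the caller's roles. The role names and the `decrypt_for_user` helper are hypothetical. In a real platform this control is usually enforced by limiting who can read the key in the secret store, not only by application code.

```python
from cryptography.fernet import Fernet

# Hypothetical role names; in practice these map to IAM roles or AD groups
DECRYPT_ROLES = {"data-steward", "compliance-auditor"}

def decrypt_for_user(token: bytes, user_roles: set, cipher: Fernet) -> str:
    """Decrypt only when the caller holds at least one permitted role."""
    if DECRYPT_ROLES.isdisjoint(user_roles):
        raise PermissionError("caller is not allowed to decrypt this field")
    return cipher.decrypt(token).decode("utf-8")

cipher = Fernet(Fernet.generate_key())  # demo key only
token = cipher.encrypt(b"50000")

# A permitted role gets the plaintext
allowed = decrypt_for_user(token, {"data-steward"}, cipher)

# Any other role is refused before decryption is attempted
try:
    decrypt_for_user(token, {"bi-viewer"}, cipher)
    denied = False
except PermissionError:
    denied = True
```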
Rotate Encryption Keys
Production systems rotate keys periodically.
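Rotation can be sketched with `MultiFernet` from the `cryptography` package, which decrypts with any key in its list and re-encrypts with the first one. The variable names and sample value are illustrative assumptions.

```python
from cryptography.fernet import Fernet, MultiFernet

old_key = Fernet(Fernet.generate_key())
new_key = Fernet(Fernet.generate_key())

# Data that was encrypted before the rotation
token = old_key.encrypt(b"ABCDE1234F")

# The first key in the list is used for encryption; all keys are tried for decryption
keyring = MultiFernet([new_key, old_key])

# Re-encrypt the existing token under the new key
rotated = keyring.rotate(token)

# After rotation the new key alone is enough to read the value
value = new_key.decrypt(rotated)
```

Once every stored token has been rotated, the old key can be retired from the keyring.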
Mask Data in Dashboards
Example:
XXXX-XXXX-1234
instead of full values.
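A masking helper along these lines could produce that format. The function name and its defaults are assumptions. It hides every letter and digit except the last few and keeps separators such as dashes.

```python
def mask_value(value: str, visible: int = 4, mask_char: str = "X") -> str:
    """Mask all letters/digits except the last `visible`; separators are kept."""
    total = sum(ch.isalnum() for ch in value)
    # Values too short to leave anything visible are masked completely
    to_mask = total - visible if total > visible else total
    out = []
    for ch in value:
        if ch.isalnum() and to_mask > 0:
            out.append(mask_char)
            to_mask -= 1
        else:
            out.append(ch)
    return "".join(out)

masked_card = mask_value("1234-5678-1234")
masked_phone = mask_value("9876543210")
```

Masking is a display-layer control: the stored value is unchanged, so it complements encryption but does not replace it.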
Common Beginner Mistakes
| Mistake | Problem |
|---|---|
| Hardcoding keys | Huge security risk |
| Encrypting everything | Performance issues |
| No access control | Data leaks |
| Logging raw sensitive data | Security violation |
| Sharing keys over email | Dangerous |
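To reduce the risk of raw sensitive values reaching log files, one option is a logging filter that redacts them before they are written. The patterns and the `RedactingFilter` class below are simplified assumptions that would need tuning for real data.

```python
import logging
import re

# Simplified patterns: email addresses and long digit runs (phones, card numbers)
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
DIGITS_RE = re.compile(r"\b\d{10,16}\b")

class RedactingFilter(logging.Filter):
    """Rewrite each log record so matching values are replaced by placeholders."""

    def filter(self, record: logging.LogRecord) -> bool:
        message = record.getMessage()
        message = EMAIL_RE.sub("[REDACTED_EMAIL]", message)
        message = DIGITS_RE.sub("[REDACTED_NUMBER]", message)
        record.msg = message
        record.args = ()  # arguments are already merged into the message
        return True

logger = logging.getLogger("etl")
handler = logging.StreamHandler()
handler.addFilter(RedactingFilter())
logger.addHandler(handler)

# The handler writes the redacted form of this message
logger.error("Failed row for %s with phone %s", "john@example.com", "9876543210")
```

A filter like this is a safety net; the more reliable fix is to avoid passing raw sensitive fields to the logger at all.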
Scenario based Interview Questions
- You are building an ETL pipeline with customer name, email, phone number, PAN number, city, and order amount. Which columns would you encrypt and why?
- What is the difference between encryption, hashing, and masking in real-world data engineering?
- When would you use encryption instead of hashing?
- When would you use hashing instead of encryption?
- You need to store customer email IDs and later decrypt them for authorized users. Would you use hashing or encryption?
- You need to store user passwords. Would you encrypt them or hash them? Why?
- What is the difference between data encryption at rest and data encryption in transit?
- In an Azure data pipeline, how would you use Azure Key Vault with ADF or Databricks?
- Why should encryption keys never be hardcoded inside Python scripts, notebooks, or GitHub repositories?
- A developer stored an encryption key inside a Databricks notebook. What risks does this create, and how would you fix it?
- Why do companies usually encrypt only sensitive columns instead of encrypting the entire dataset?
- Your Power BI dashboard shows customer phone numbers. How would you mask them for business users?
- What is role-based access control, and why is it important for decryption access?
- How would you make sure only authorized users can decrypt sensitive customer data?
- What is key rotation, and why is it important in production systems?
- If an encryption key is compromised, what steps should the data engineering team take?
- In PySpark, how would you protect sensitive columns before writing data to a data lake?
- If you apply sha2() on an email column in Spark, can you get the original email back? Why or why not?
- Your ETL pipeline fails and stores error logs. How would you make sure sensitive data is not exposed in logs?
- Explain an end-to-end secure data engineering pipeline using ADF, Databricks, ADLS Gen2, and Azure Key Vault.
Visual Enterprise Architecture Flow
┌───────────────────┐
│ Source Systems │
│ API / DB / Files │
└─────────┬─────────┘
↓
┌───────────────────┐
│ Azure Data Factory│
│ ETL Pipeline │
└─────────┬─────────┘
↓
┌───────────────────────┐
│ Azure Key Vault │
│ Encryption Key Access │
└─────────┬─────────────┘
↓
┌───────────────────────┐
│ Azure Databricks │
│ Encrypt Sensitive Data│
└─────────┬─────────────┘
↓
┌───────────────────────┐
│ ADLS / Delta Lake │
│ Encrypted Storage │
└─────────┬─────────────┘
↓
┌───────────────────────┐
│ Authorized Consumers │
│ BI / Analytics Apps │
└───────────────────────┘
Real-world Project Implementation Idea
Project Title
Secure Customer Data Pipeline Using Azure + PySpark
Project Objective
Build an ETL pipeline that:
- Ingests customer data
- Encrypts sensitive columns
- Stores encrypted data in Data Lake
- Decrypts data only for authorized users
- Masks sensitive fields in reporting layer
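The objectives above can be prototyped locally before any Azure services are involved. The sketch below is a toy stand-in under stated assumptions: the sample CSV, the column names, the `authorized` flag, and all function names are hypothetical, and an in-memory list plays the role of the Data Lake.

```python
import csv
import io
from cryptography.fernet import Fernet

SENSITIVE = ("email", "phone")  # hypothetical sensitive columns
RAW_CSV = "customer,email,phone,city\nJohn,john@example.com,9876543210,Pune\n"

def ingest(text: str) -> list:
    """Parse CSV text into a list of row dictionaries."""
    return [dict(row) for row in csv.DictReader(io.StringIO(text))]

def encrypt_columns(rows: list, cipher: Fernet) -> list:
    """Encrypt sensitive columns before the rows are stored."""
    return [
        {k: cipher.encrypt(v.encode()).decode() if k in SENSITIVE else v
         for k, v in row.items()}
        for row in rows
    ]

def decrypt_columns(rows: list, cipher: Fernet, authorized: bool) -> list:
    """Return plaintext rows, but only for authorized callers."""
    if not authorized:
        raise PermissionError("decryption requires an authorized caller")
    return [
        {k: cipher.decrypt(v.encode()).decode() if k in SENSITIVE else v
         for k, v in row.items()}
        for row in rows
    ]

def mask_for_report(rows: list) -> list:
    """Keep only the last four characters of sensitive values for reporting."""
    return [
        {k: "X" * max(len(v) - 4, 0) + v[-4:] if k in SENSITIVE else v
         for k, v in row.items()}
        for row in rows
    ]

cipher = Fernet(Fernet.generate_key())  # demo key; load from a secret manager in practice
stored = encrypt_columns(ingest(RAW_CSV), cipher)          # what lands in storage
plain = decrypt_columns(stored, cipher, authorized=True)   # authorized read
report = mask_for_report(plain)                            # reporting layer view
```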
Technologies You Can Use
| Component | Technology |
|---|---|
| Storage | Azure Data Lake Gen2 |
| ETL | Azure Data Factory |
| Processing | Azure Databricks |
| Secret Management | Azure Key Vault |
| Language | Python / PySpark |
| Reporting | Power BI |
End-to-end Architecture
CSV/API Source
↓
Azure Data Factory
↓
Azure Key Vault retrieves encryption key
↓
Azure Databricks encrypts sensitive columns
↓
Encrypted data stored in ADLS Gen2
↓
Authorized users decrypt data
↓
Power BI displays masked information
Resume-ready Project Description
Built a secure Azure-based ETL pipeline using ADF, Databricks, ADLS Gen2, and Azure Key Vault to encrypt sensitive customer data and implement enterprise-grade security practices.
Conclusion
Encryption and decryption are extremely important in modern Data Engineering.
As a Data Engineer, your responsibility is not only moving data.
You must also ensure:
- data security
- privacy
- compliance
- secure access
- enterprise governance
The most important takeaway:
Never treat security as optional.
Even beginner projects should follow:
- secure credential handling
- encrypted storage
- proper access control
- secret management
because these are real-world enterprise practices.


