Creating Realistic Test Data: A QA Engineer’s Approach

Archana Sivankutty

As a QA engineer, ensuring the system has realistic data that reflects real-world scenarios is a critical challenge. But what happens when the data platform you rely on doesn’t have a development environment? That was exactly the situation I faced while testing a financial institution’s new tiering system.

Our goal was to generate 1 million records to validate the system’s performance and functionality by simulating real customer behavior. The catch: because the source platform had no development environment, we couldn’t pull the test data we needed directly.

Faced with this roadblock, we had to rethink our approach. Instead of relying on existing data, we delved into data generation and built a solution that worked on our terms.

The Real Challenge: Generating Test Data from Scratch

The challenge wasn’t just about generating data—it had to be realistic, interconnected, and scalable. This wasn’t a one-off test. To validate the system under realistic load, our data had to mimic complex production relationships.

Without accurate, interconnected test data, we couldn’t assess the system’s performance under pressure. And without a dev environment to pull data from, we had only one option: create it from scratch.

Once we understood the challenge, I collaborated with our data engineering experts to generate the required data using Python. The solution we came up with wasn’t just a generic data dump—it was a carefully crafted set of interconnected tables (Customers, Accounts, and Transactions) designed to reflect real-world scenarios. 

Building Relationships: The Key to Realistic Test Data

The foundation of the system we were testing was built on relationships—customers have accounts, and accounts have transactions. The customer ID had to link to its associated accounts, and transactions needed to be tied to the right accounts. I knew that maintaining these relationships was critical for realistic testing.

Step 1: Generating Realistic Customer Data with Faker

I used the Faker library to generate random but plausible customer information. Faker produces diverse, realistic values such as names, addresses, emails, and phone numbers, which makes it a perfect fit for this kind of test data.

Here’s how I used it to generate customer data:

```python
from faker import Faker

# Initialize Faker
fake = Faker()

# Generate a single customer's data
def generate_customer():
    return {
        'customer_id': fake.uuid4(),
        'first_name': fake.first_name(),
        'last_name': fake.last_name(),
        'email': fake.email(),
        'phone_number': fake.phone_number(),
        'address': fake.address(),
        'account_balance': fake.random_number(digits=5)
    }

# Example of generating one customer
customer_data = generate_customer()
print(customer_data)
```

This function creates a realistic customer with a unique customer ID, name, email, phone number, and even an address. The account_balance simulates the amount of money the customer may have in their account.

Step 2: Creating Accounts and Linking Them to Customers

Next, I needed to simulate the customer’s accounts. Each customer could have between 1 and 3 accounts. I used the customer’s customer_id to maintain the relationship between the customers and their accounts.

```python
import random

# Generate account data for a customer
def generate_account(customer_id):
    num_accounts = random.randint(1, 3)  # Between 1 and 3 accounts per customer
    accounts = []
    for _ in range(num_accounts):
        account_data = {
            'account_id': fake.uuid4(),
            'customer_id': customer_id,
            'account_type': random.choice(['Savings', 'Checking', 'Business']),
            'balance': fake.random_number(digits=4)
        }
        accounts.append(account_data)
    return accounts

# Generate accounts for a customer
accounts_data = generate_account(customer_data['customer_id'])
print(accounts_data)
```

Here, each account is tied to a unique account_id, and the customer_id ensures that the accounts are linked back to the customer. The account type and balance are randomly chosen to simulate different account scenarios.
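Because the whole point of this dataset is its relationships, it pays to sanity-check referential integrity after generation. The snippet below is my own addition (not part of the original workflow) and uses hard-coded sample rows so it stands alone; in practice you would pass in the generated lists:

```python
# Hypothetical sample rows standing in for the generated tables.
customers = [{'customer_id': 'c1'}, {'customer_id': 'c2'}]
accounts = [
    {'account_id': 'a1', 'customer_id': 'c1'},
    {'account_id': 'a2', 'customer_id': 'c2'},
]

# Every account must point at an existing customer; collect any orphans.
customer_ids = {c['customer_id'] for c in customers}
orphans = [a for a in accounts if a['customer_id'] not in customer_ids]
print(f"orphaned accounts: {orphans}")  # an empty list means the links are intact
```

The same pattern applies one level down: every transaction’s account_id should appear in the accounts table.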

Step 3: Creating Transactions for Each Account

The last piece of the puzzle was generating transactions for each account. Each account needed between 5 and 10 transactions, each with a unique transaction ID and date. I spread the transaction dates across the decade to simulate a realistic customer history.

```python
# Generate transaction data for an account
def generate_transactions(account_id):
    num_transactions = random.randint(5, 10)  # Between 5 and 10 transactions per account
    transactions = []
    for _ in range(num_transactions):
        transaction_data = {
            'transaction_id': fake.uuid4(),
            'account_id': account_id,
            'amount': fake.random_number(digits=3),  # Random amount for each transaction
            'date': fake.date_this_decade(),  # Random date within the current decade
            'transaction_type': random.choice(['Deposit', 'Withdrawal', 'Transfer'])
        }
        transactions.append(transaction_data)
    return transactions

# Generate transactions for an account
transactions_data = generate_transactions(accounts_data[0]['account_id'])
print(transactions_data)
```

Here, each transaction is tied to the account ID, and I used fake.date_this_decade() to ensure that transactions are spread across a realistic timeline.
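One refinement worth considering (my suggestion, not from the original script): random.choice picks each transaction type with equal probability, whereas real accounts usually see far more deposits and withdrawals than transfers. The stdlib random.choices accepts weights; the weights below are illustrative assumptions, not figures from real data:

```python
import random

random.seed(0)  # seeded only so the sketch is reproducible

types = ['Deposit', 'Withdrawal', 'Transfer']
weights = [0.5, 0.35, 0.15]  # illustrative assumptions, not real-world figures

# Draw 1,000 weighted transaction types instead of uniform choices.
sample = random.choices(types, weights=weights, k=1000)
print({t: sample.count(t) for t in types})
```

Skewing the mix this way makes load tests exercise the hot paths (deposits, withdrawals) in roughly the proportions production would.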

Step 4: Scaling and Automating the Data Generation

To generate 1 million records, I automated the process and scaled it using Databricks, a cloud-based platform that helped us process the data efficiently.

```python
import csv

# Function to save data to CSV files
def save_to_csv(customers, accounts, transactions):
    with open('customers.csv', 'w', newline='') as f:
        writer = csv.DictWriter(f, fieldnames=customers[0].keys())
        writer.writeheader()
        writer.writerows(customers)
    with open('accounts.csv', 'w', newline='') as f:
        writer = csv.DictWriter(f, fieldnames=accounts[0].keys())
        writer.writeheader()
        writer.writerows(accounts)
    with open('transactions.csv', 'w', newline='') as f:
        writer = csv.DictWriter(f, fieldnames=transactions[0].keys())
        writer.writeheader()
        writer.writerows(transactions)

# Example data to save (a small subset for illustration)
save_to_csv([customer_data], accounts_data, transactions_data)
```

This function exports the generated customer, account, and transaction data into CSV files, allowing easy import into the test environment.
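To show how the three generators chain together at scale, here is a simplified end-to-end sketch. It swaps Faker for the stdlib uuid module and trims the field sets so the example is self-contained; the structure (1 to 3 accounts per customer, 5 to 10 transactions per account) matches the steps above:

```python
import random
import uuid

random.seed(1)  # reproducible sketch

def build_dataset(n_customers):
    """Generate linked customer, account, and transaction tables."""
    customers, accounts, transactions = [], [], []
    for _ in range(n_customers):
        cid = str(uuid.uuid4())
        customers.append({'customer_id': cid})
        for _ in range(random.randint(1, 3)):        # 1 to 3 accounts per customer
            aid = str(uuid.uuid4())
            accounts.append({'account_id': aid, 'customer_id': cid})
            for _ in range(random.randint(5, 10)):   # 5 to 10 transactions per account
                transactions.append({'transaction_id': str(uuid.uuid4()),
                                     'account_id': aid})
    return customers, accounts, transactions

customers, accounts, transactions = build_dataset(100)
print(len(customers), len(accounts), len(transactions))
```

From here, reaching 1 million records is a matter of batching calls to build_dataset and writing each batch out with save_to_csv (or its distributed equivalent).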

The Results

By the end of the process, we had succeeded in generating:

  • Realistic, interconnected test data that mirrored actual customer behavior and relationships.
  • A scalable solution capable of handling 1 million records and beyond.
  • A streamlined, automated process that saved time and allowed the QA team to focus on optimizing the system’s performance.

With the data generated and loaded, we were able to test the system’s performance under realistic conditions. This gave us confidence that we were validating the system with diverse, well-structured data that accurately mimicked production.

Conclusion

While we didn’t use dbldatagen, a Databricks library for generating large-scale synthetic data, it is certainly worth considering for big datasets. In our case, however, we needed fine-grained control over the relationships between data entities (customers, accounts, and transactions), and dbldatagen didn’t offer the level of customization we required. A tailored, Python-based solution let us define custom data relationships and simulate real-world behavior more effectively.

 
