
Distributed Data Analysis and Benchmarking of Highway Revenue Using Hadoop, MongoDB, and YCSB

Project Overview

This project was developed as part of the Data Storage Solutions module at CCT College Dublin.

The objective of the assignment was to select a real dataset, store it using distributed and NoSQL storage technologies, process it using Hadoop MapReduce, execute database queries, and benchmark database performance.

For this project, the dataset “Revenues Used by States for Highways” was selected. The dataset contains structured information about how U.S. states and Washington D.C. funded their highways, covering 15,912 records from 2000 to 2023 with key fields such as Year, State, Revenue Type, and Receipts.

The project was implemented using a practical storage pipeline that combined HDFS, Hadoop MapReduce, MongoDB, and YCSB. This allowed the dataset to be stored, processed, queried and benchmarked using different technologies within the same solution.


Project Objectives

The main goal of this project was to demonstrate how modern data storage technologies can be used together in a complete workflow.

The assignment required the following:

  • Selecting and preparing a suitable real-world dataset
  • Storing the dataset in HDFS
  • Processing the data using Hadoop MapReduce
  • Importing the dataset into a NoSQL database
  • Executing queries and aggregations in MongoDB
  • Benchmarking NoSQL database performance using YCSB
  • Presenting the final results in a poster and supporting materials

This project was designed to show both distributed data processing and database performance evaluation in a practical environment.


Dataset

The dataset used in this project is:

Revenues Used by States for Highways (2000–2023)

It was obtained from Data.gov and contains financial information about highway-related revenues across U.S. states and D.C.

Main attributes include:

  • Year
  • State
  • Revenue Type
  • Receipts

The dataset was first cleaned and prepared before being used in the storage and analysis workflow.


Data Preparation

Before storing and processing the data, the CSV dataset was cleaned using Python so that it could be ingested correctly by Hadoop and MongoDB.

The cleaned dataset was then saved as a new CSV file and used in all later stages of the project.

This preparation step helped ensure that the data structure was suitable for:

  • HDFS ingestion
  • MapReduce processing
  • MongoDB import
  • Query execution
  • Benchmark testing
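As a rough illustration, the cleaning stage could look like the sketch below. The column names match the dataset attributes listed above, but the exact cleaning rules and filenames used in the project are assumptions.

```python
# clean_dataset.py - a minimal cleaning sketch; the exact rules and
# filenames used in the project are assumptions.
import pandas as pd

def clean(df: pd.DataFrame) -> pd.DataFrame:
    """Drop incomplete rows, enforce types, and tidy text fields."""
    df = df.dropna(subset=["Year", "State", "Receipts"])          # remove incomplete rows
    df["Receipts"] = pd.to_numeric(df["Receipts"], errors="coerce")
    df = df.dropna(subset=["Receipts"])                           # drop non-numeric receipts
    df["Year"] = df["Year"].astype(int)                           # enforce integer years
    df["State"] = df["State"].str.strip()                         # normalise state names
    return df

# Typical usage (filenames are hypothetical):
#   raw = pd.read_csv("highway_revenues.csv")
#   clean(raw).to_csv("highway_revenues_clean.csv", index=False)
```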

Storage Framework

The project used two main storage approaches:

HDFS

The cleaned CSV dataset was uploaded into HDFS (Hadoop Distributed File System).

HDFS was used to support distributed storage and make the dataset available for Hadoop processing.

This stage demonstrates how large structured datasets can be stored in a distributed environment for scalable processing tasks.

MongoDB

The same dataset was also imported into MongoDB, using:

  • Database: highwaysDB
  • Collection: revenues

MongoDB was used to provide flexible NoSQL querying and aggregation capabilities.
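The import step could be sketched in Python as below (the project may equally have used the `mongoimport` CLI). The document field names, filenames, and connection details here are assumptions, since the README does not list them.

```python
# import_to_mongo.py - sketch of loading the cleaned CSV into the
# highwaysDB.revenues collection; field names and connection details
# are assumptions.
import csv

def rows_to_documents(rows):
    """Turn CSV dict rows into typed MongoDB documents."""
    return [
        {
            "year": int(row["Year"]),
            "state": row["State"],
            "revenue_type": row["Revenue Type"],
            "receipts": float(row["Receipts"]),
        }
        for row in rows
    ]

# With a running MongoDB instance, the documents would be inserted like this:
#   from pymongo import MongoClient
#   col = MongoClient("mongodb://localhost:27017")["highwaysDB"]["revenues"]
#   with open("highway_revenues_clean.csv", newline="") as f:
#       col.insert_many(rows_to_documents(csv.DictReader(f)))
```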

Using both HDFS and MongoDB allowed the project to combine:

  • distributed storage and processing
  • fast querying and aggregation

Hadoop MapReduce Implementation

A Hadoop MapReduce job was developed to process the dataset and calculate total revenue per state.

The implementation used two Python scripts:

Mapper

The mapper script reads each line of the CSV input, extracts the state and receipt amount, and emits them as key-value pairs.

The mapper output follows the structure:

  • key = state
  • value = receipt amount
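A Hadoop Streaming mapper along these lines could look like the sketch below. The CSV column order (Year, State, Revenue Type, Receipts) and the absence of quoted commas in the cleaned file are assumptions.

```python
# mapper.py - Hadoop Streaming mapper sketch; assumes the cleaned CSV has
# columns Year, State, Revenue Type, Receipts with no quoted commas.
import sys

def map_line(line):
    """Return a (state, receipts) pair for one CSV line, or None for bad rows."""
    fields = line.strip().split(",")
    try:
        return fields[1], float(fields[3])   # fields[1]=State, fields[3]=Receipts
    except (IndexError, ValueError):
        return None                          # header row or malformed line

def run(stream=sys.stdin):
    for line in stream:
        pair = map_line(line)
        if pair is not None:
            print(f"{pair[0]}\t{pair[1]}")   # tab-separated key-value output

if __name__ == "__main__":
    run()   # Hadoop Streaming pipes each CSV split through stdin
```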

Reducer

The reducer script receives the mapper output, groups values by state, and calculates the total sum of receipts for each state.

This produces the final aggregated output showing total highway revenue per state.

The MapReduce job completed successfully and generated output that summarised revenue totals across states.
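The matching reducer could be sketched as follows. It relies on Hadoop Streaming's guarantee that mapper output arrives sorted by key, so all lines for one state are contiguous.

```python
# reducer.py - Hadoop Streaming reducer sketch; input is key-sorted
# "state<TAB>amount" lines, as Hadoop Streaming guarantees between phases.
import sys
from itertools import groupby

def reduce_pairs(pairs):
    """Sum amounts per state from key-sorted (state, amount) pairs."""
    return {
        state: sum(amount for _, amount in group)
        for state, group in groupby(pairs, key=lambda p: p[0])
    }

def run(stream=sys.stdin):
    pairs = (line.rstrip("\n").split("\t") for line in stream)
    for state, total in reduce_pairs((s, float(a)) for s, a in pairs).items():
        print(f"{state}\t{total}")           # final revenue total per state

if __name__ == "__main__":
    run()
```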


MongoDB Query Execution

After importing the cleaned dataset into MongoDB, several queries were executed to explore and analyse the data.

These included:

Basic Filtering

Using find() to retrieve documents for a specific state such as Alabama.

Sorting

Using sort() to identify states with the highest receipts.

Aggregation

Using aggregate() to:

  • group by state and sum total receipts
  • group by state and sum receipts for a specific year such as 2023
  • group by revenue type and sum receipts

These queries demonstrated that MongoDB can efficiently handle filtering, sorting and grouped analysis on structured financial data.
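The queries above could be expressed in pymongo style as below. The document field names (`state`, `year`, `receipts`, `revenue_type`) are assumptions, since the README does not show the collection schema; the helpers only build the filter and pipeline documents, with the live calls shown in comments.

```python
# queries.py - pymongo-style sketches of the queries described above;
# document field names are assumptions.

def state_filter(state):
    """find() filter for all documents of one state."""
    return {"state": state}

def totals_per_state(year=None):
    """aggregate() pipeline grouping receipts by state, optionally for one year."""
    pipeline = []
    if year is not None:
        pipeline.append({"$match": {"year": year}})   # restrict to a single year
    pipeline.append({"$group": {"_id": "$state", "total": {"$sum": "$receipts"}}})
    pipeline.append({"$sort": {"total": -1}})         # highest totals first
    return pipeline

def totals_per_type():
    """aggregate() pipeline grouping receipts by revenue type."""
    return [{"$group": {"_id": "$revenue_type", "total": {"$sum": "$receipts"}}}]

# With a live connection these would run as, e.g.:
#   from pymongo import MongoClient
#   col = MongoClient()["highwaysDB"]["revenues"]
#   col.find(state_filter("Alabama"))
#   col.find().sort("receipts", -1)       # states with the highest receipts
#   col.aggregate(totals_per_state(2023))
```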


Performance Benchmarking with YCSB

To evaluate NoSQL database performance, YCSB (Yahoo! Cloud Serving Benchmark) was used with 20,000 operations.

Three workloads were tested:

  • Workload A (50% reads, 50% updates)
  • Workload B (95% reads, 5% updates)
  • Workload C (100% reads)

The benchmark compared throughput and latency for read and update operations.

According to the project results, Workload C achieved the best throughput, with 9633.91 operations per second, and the lowest average latency of 80 microseconds. This happened because Workload C focuses only on read operations, with no updates, making it faster than the other workloads.

Overall, the benchmark showed that MongoDB handled the workloads well and remained stable under load.


System Architecture

The project followed a clear data flow:

  1. Download the dataset
  2. Clean the CSV file
  3. Store the cleaned file in HDFS
  4. Process the data with Hadoop MapReduce
  5. Import the data into MongoDB
  6. Run NoSQL queries and aggregations
  7. Benchmark MongoDB using YCSB

This architecture allowed the project to demonstrate the complete path from raw dataset to storage, analysis and performance testing.


Technologies Used

The project was implemented using the following technologies:

Python

Used for:

  • dataset cleaning
  • Hadoop mapper script
  • Hadoop reducer script

Hadoop

Used for:

  • distributed storage with HDFS
  • batch processing with MapReduce

MongoDB

Used for:

  • NoSQL storage
  • data import
  • querying
  • aggregation

YCSB

Used for:

  • benchmarking MongoDB performance
  • comparing workload throughput and latency

CSV

Used as the source data format for ingestion and processing.


Results and Analysis

The project demonstrated that combining HDFS and MongoDB is effective for handling structured datasets in different ways.

Main conclusions include:

  • HDFS provided a suitable environment for distributed storage and MapReduce processing
  • MongoDB made it easy to run flexible queries and aggregations
  • The MapReduce job successfully calculated total revenue per state
  • MongoDB handled different types of reads and aggregations efficiently
  • YCSB benchmarking showed strong performance, especially for read-heavy operations
  • Workload C produced the best benchmark results because it involved only reads

Overall, the system stored, processed, queried and benchmarked the data successfully, showing that it worked well for both analytical and database performance tasks.


Learning Outcomes

This project demonstrates several important concepts in modern data storage and processing:

  • Distributed file storage using HDFS
  • Batch processing using Hadoop MapReduce
  • NoSQL document storage using MongoDB
  • Aggregation and query execution in MongoDB
  • Benchmarking with YCSB
  • Building a complete data storage workflow from ingestion to performance analysis

These skills are important in big data, data engineering, and modern storage architecture projects.


Author

This project was developed by Thiago Goncalves da Costa as part of the Bachelor of Science in Computing and Information Technology at CCT College Dublin.

During my studies, I used the institutional GitHub account associated with my student email:

2022161@student.cct.ie

Since institutional accounts and student emails may be deactivated after graduation, this repository was migrated to my personal GitHub account:

https://github.com/ThiagoGoncos

This ensures long-term preservation of the project, commit history, and academic work completed during the degree program.

About

Distributed data processing and NoSQL benchmarking project developed with Hadoop HDFS, MapReduce, MongoDB, and YCSB, featuring dataset ingestion, distributed storage, revenue aggregation by state, NoSQL querying, and database performance benchmarking for the Data Storage Solutions CA1 module at CCT College.
