
Distributed Data Analysis and Benchmarking of Highway Revenue Using Hadoop, MongoDB, and YCSB

Project Overview

This project was developed as part of the Data Storage Solutions module at CCT College Dublin.

The objective of the assignment was to select a real dataset, store it using distributed and NoSQL storage technologies, process it using Hadoop MapReduce, execute database queries, and benchmark database performance.

For this project, the dataset “Revenues Used by States for Highways” was selected. The dataset contains structured information about how U.S. states and Washington D.C. funded their highways, covering 15,912 records from 2000 to 2023 with key fields such as Year, State, Revenue Type, and Receipts.

The project was implemented using a practical storage pipeline that combined HDFS, Hadoop MapReduce, MongoDB, and YCSB. This allowed the dataset to be stored, processed, queried and benchmarked using different technologies within the same solution.


Project Objectives

The main goal of this project was to demonstrate how modern data storage technologies can be used together in a complete workflow.

The assignment required the following:

  • Selecting and preparing a suitable real-world dataset
  • Storing the dataset in HDFS
  • Processing the data using Hadoop MapReduce
  • Importing the dataset into a NoSQL database
  • Executing queries and aggregations in MongoDB
  • Benchmarking NoSQL database performance using YCSB
  • Presenting the final results in a poster and supporting materials

This project was designed to show both distributed data processing and database performance evaluation in a practical environment.


Dataset

The dataset used in this project is:

Revenues Used by States for Highways (2000–2023)

It was obtained from Data.gov and contains financial information about highway-related revenues across U.S. states and D.C.

Main attributes include:

  • Year
  • State
  • Revenue Type
  • Receipts

The dataset was first cleaned and prepared before being used in the storage and analysis workflow.


Data Preparation

Before storing and processing the data, the CSV dataset was cleaned using Python so that it could be ingested correctly by Hadoop and MongoDB.

The cleaned dataset was then saved as a new CSV file and used in all later stages of the project.

This preparation step helped ensure that the data structure was suitable for:

  • HDFS ingestion
  • MapReduce processing
  • MongoDB import
  • Query execution
  • Benchmark testing
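As a rough illustration, the cleaning stage could look like the sketch below. The column names match the dataset attributes listed above, but the exact cleaning rules and filenames used in the project are assumptions.

```python
# clean_dataset.py - a minimal cleaning sketch; the exact rules and
# filenames used in the project are assumptions.
import pandas as pd

def clean(df: pd.DataFrame) -> pd.DataFrame:
    """Drop incomplete rows, enforce types, and tidy text fields."""
    df = df.dropna(subset=["Year", "State", "Receipts"])          # remove incomplete rows
    df["Receipts"] = pd.to_numeric(df["Receipts"], errors="coerce")
    df = df.dropna(subset=["Receipts"])                           # drop non-numeric receipts
    df["Year"] = df["Year"].astype(int)                           # enforce integer years
    df["State"] = df["State"].str.strip()                         # normalise state names
    return df

# Typical usage (filenames are hypothetical):
#   raw = pd.read_csv("highway_revenues.csv")
#   clean(raw).to_csv("highway_revenues_clean.csv", index=False)
```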

Storage Framework

The project used two main storage approaches:

HDFS

The cleaned CSV dataset was uploaded into HDFS (Hadoop Distributed File System).

HDFS was used to support distributed storage and make the dataset available for Hadoop processing.

This stage demonstrates how large structured datasets can be stored in a distributed environment for scalable processing tasks.

MongoDB

The same dataset was also imported into MongoDB, using:

  • Database: highwaysDB
  • Collection: revenues

MongoDB was used to provide flexible NoSQL querying and aggregation capabilities.
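The import step could be sketched in Python as below (the project may equally have used the `mongoimport` CLI). The document field names, filenames, and connection details here are assumptions, since the README does not list them.

```python
# import_to_mongo.py - sketch of loading the cleaned CSV into the
# highwaysDB.revenues collection; field names and connection details
# are assumptions.
import csv

def rows_to_documents(rows):
    """Turn CSV dict rows into typed MongoDB documents."""
    return [
        {
            "year": int(row["Year"]),
            "state": row["State"],
            "revenue_type": row["Revenue Type"],
            "receipts": float(row["Receipts"]),
        }
        for row in rows
    ]

# With a running MongoDB instance, the documents would be inserted like this:
#   from pymongo import MongoClient
#   col = MongoClient("mongodb://localhost:27017")["highwaysDB"]["revenues"]
#   with open("highway_revenues_clean.csv", newline="") as f:
#       col.insert_many(rows_to_documents(csv.DictReader(f)))
```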

Using both HDFS and MongoDB allowed the project to combine:

  • distributed storage and processing
  • fast querying and aggregation

Hadoop MapReduce Implementation

A Hadoop MapReduce job was developed to process the dataset and calculate total revenue per state.

The implementation used two Python scripts:

Mapper

The mapper script reads each line of the CSV input, extracts the state and receipt amount, and emits them as key-value pairs.

The mapper output follows the structure:

  • key = state
  • value = receipt amount
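A Hadoop Streaming mapper along these lines could look like the sketch below. The CSV column order (Year, State, Revenue Type, Receipts) and the absence of quoted commas in the cleaned file are assumptions.

```python
# mapper.py - Hadoop Streaming mapper sketch; assumes the cleaned CSV has
# columns Year, State, Revenue Type, Receipts with no quoted commas.
import sys

def map_line(line):
    """Return a (state, receipts) pair for one CSV line, or None for bad rows."""
    fields = line.strip().split(",")
    try:
        return fields[1], float(fields[3])   # fields[1]=State, fields[3]=Receipts
    except (IndexError, ValueError):
        return None                          # header row or malformed line

def run(stream=sys.stdin):
    for line in stream:
        pair = map_line(line)
        if pair is not None:
            print(f"{pair[0]}\t{pair[1]}")   # tab-separated key-value output

if __name__ == "__main__":
    run()   # Hadoop Streaming pipes each CSV split through stdin
```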

Reducer

The reducer script receives the mapper output, groups values by state, and calculates the total sum of receipts for each state.

This produces the final aggregated output showing total highway revenue per state.

The MapReduce job completed successfully and generated output that summarised revenue totals across states.
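The matching reducer could be sketched as follows. It relies on Hadoop Streaming's guarantee that mapper output arrives sorted by key, so all lines for one state are contiguous.

```python
# reducer.py - Hadoop Streaming reducer sketch; input is key-sorted
# "state<TAB>amount" lines, as Hadoop Streaming guarantees between phases.
import sys
from itertools import groupby

def reduce_pairs(pairs):
    """Sum amounts per state from key-sorted (state, amount) pairs."""
    return {
        state: sum(amount for _, amount in group)
        for state, group in groupby(pairs, key=lambda p: p[0])
    }

def run(stream=sys.stdin):
    pairs = (line.rstrip("\n").split("\t") for line in stream)
    for state, total in reduce_pairs((s, float(a)) for s, a in pairs).items():
        print(f"{state}\t{total}")           # final revenue total per state

if __name__ == "__main__":
    run()
```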


MongoDB Query Execution

After importing the cleaned dataset into MongoDB, several queries were executed to explore and analyse the data.

These included:

Basic Filtering

Using find() to retrieve documents for a specific state such as Alabama.

Sorting

Using sort() to identify states with the highest receipts.

Aggregation

Using aggregate() to:

  • group by state and sum total receipts
  • group by state and sum receipts for a specific year such as 2023
  • group by revenue type and sum receipts

These queries demonstrated that MongoDB can efficiently handle filtering, sorting and grouped analysis on structured financial data.
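The queries above could be expressed in pymongo style as below. The document field names (`state`, `year`, `receipts`, `revenue_type`) are assumptions, since the README does not show the collection schema; the helpers only build the filter and pipeline documents, with the live calls shown in comments.

```python
# queries.py - pymongo-style sketches of the queries described above;
# document field names are assumptions.

def state_filter(state):
    """find() filter for all documents of one state."""
    return {"state": state}

def totals_per_state(year=None):
    """aggregate() pipeline grouping receipts by state, optionally for one year."""
    pipeline = []
    if year is not None:
        pipeline.append({"$match": {"year": year}})   # restrict to a single year
    pipeline.append({"$group": {"_id": "$state", "total": {"$sum": "$receipts"}}})
    pipeline.append({"$sort": {"total": -1}})         # highest totals first
    return pipeline

def totals_per_type():
    """aggregate() pipeline grouping receipts by revenue type."""
    return [{"$group": {"_id": "$revenue_type", "total": {"$sum": "$receipts"}}}]

# With a live connection these would run as, e.g.:
#   from pymongo import MongoClient
#   col = MongoClient()["highwaysDB"]["revenues"]
#   col.find(state_filter("Alabama"))
#   col.find().sort("receipts", -1)       # states with the highest receipts
#   col.aggregate(totals_per_state(2023))
```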


Performance Benchmarking with YCSB

To evaluate NoSQL database performance, YCSB (Yahoo! Cloud Serving Benchmark) was used with 20,000 operations.

Three workloads were tested:

  • Workload A (50% reads, 50% updates)
  • Workload B (95% reads, 5% updates)
  • Workload C (100% reads)

The benchmark compared throughput and latency for read and update operations.

According to the project results, Workload C achieved the best throughput, with 9633.91 operations per second, and the lowest average latency of 80 microseconds. This happened because Workload C focuses only on read operations, with no updates, making it faster than the other workloads.

Overall, the benchmark showed that MongoDB handled the workloads well and remained stable under load.


System Architecture

The project followed a clear data flow:

  1. Download the dataset
  2. Clean the CSV file
  3. Store the cleaned file in HDFS
  4. Process the data with Hadoop MapReduce
  5. Import the data into MongoDB
  6. Run NoSQL queries and aggregations
  7. Benchmark MongoDB using YCSB

This architecture allowed the project to demonstrate the complete path from raw dataset to storage, analysis and performance testing.


Technologies Used

The project was implemented using the following technologies:

Python

Used for:

  • dataset cleaning
  • Hadoop mapper script
  • Hadoop reducer script

Hadoop

Used for:

  • distributed storage with HDFS
  • batch processing with MapReduce

MongoDB

Used for:

  • NoSQL storage
  • data import
  • querying
  • aggregation

YCSB

Used for:

  • benchmarking MongoDB performance
  • comparing workload throughput and latency

CSV

Used as the source data format for ingestion and processing.


Results and Analysis

The project demonstrated that combining HDFS and MongoDB is effective for handling structured datasets in different ways.

Main conclusions include:

  • HDFS provided a suitable environment for distributed storage and MapReduce processing
  • MongoDB made it easy to run flexible queries and aggregations
  • The MapReduce job successfully calculated total revenue per state
  • MongoDB handled different types of reads and aggregations efficiently
  • YCSB benchmarking showed strong performance, especially for read-heavy operations
  • Workload C produced the best benchmark results because it involved only reads

Overall, the system stored, processed, queried and benchmarked the data successfully, showing that it worked well for both analytical and database performance tasks.


Learning Outcomes

This project demonstrates several important concepts in modern data storage and processing:

  • Distributed file storage using HDFS
  • Batch processing using Hadoop MapReduce
  • NoSQL document storage using MongoDB
  • Aggregation and query execution in MongoDB
  • Benchmarking with YCSB
  • Building a complete data storage workflow from ingestion to performance analysis

These skills are important in big data, data engineering, and modern storage architecture projects.


Author

This project was developed by Thiago Goncalves da Costa as part of the Bachelor of Science in Computing and Information Technology at CCT College Dublin.

During my studies, I used the institutional GitHub account associated with my student email:

2022161@student.cct.ie

Since institutional accounts and student emails may be deactivated after graduation, this repository was migrated to my personal GitHub account:

https://github.com/ThiagoGoncos

This ensures long-term preservation of the project, commit history, and academic work completed during the degree program.

About

Distributed data processing and NoSQL benchmarking project developed with Hadoop HDFS, MapReduce, MongoDB, and YCSB, featuring dataset ingestion, distributed storage, revenue aggregation by state, NoSQL querying, and database performance benchmarking for the Data Storage Solutions CA1 module at CCT College.
