Cloud-ETL

“Alexa, Can You Handle Big Data?”

Created by: Estela Perez

Objective

The purpose of this assignment was to apply cloud ETL skills to big data from two of Amazon’s public product-review datasets. The goal was to perform the ETL process entirely in the cloud and upload a DataFrame to an RDS instance.

This project required the use of Amazon Web Services (AWS), Relational Database Service (RDS), pgAdmin, Google Colab, and PySpark.

AWS-RDS

We started by creating an AWS-RDS instance to connect to from pgAdmin.

pgAdmin

We then registered our AWS-RDS server in pgAdmin, which displays the databases for both datasets.

ETL Process: Pet Products Dataset

Extract
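
A minimal sketch of what the extract step might look like in a Colab notebook with PySpark. The original cell is not reproduced in the text, so the function names `file_name_from_url` and `extract_reviews`, the tab-separated format with a header row, and any URL passed in are assumptions rather than the notebook's actual code.

```python
from urllib.parse import urlparse


def file_name_from_url(url):
    """Return the last path component of a URL, e.g. the .tsv.gz file name."""
    return urlparse(url).path.rsplit("/", 1)[-1]


def extract_reviews(spark, url):
    """Download a tab-separated review file and read it into a Spark DataFrame.

    `spark` is an existing SparkSession; `url` is a hypothetical location of
    the dataset (assumed to be a TSV file with a header row).
    """
    # Imported lazily so the helper above works without PySpark installed.
    from pyspark import SparkFiles

    # Register the remote file with Spark, then read the downloaded copy.
    spark.sparkContext.addFile(url)
    return spark.read.csv(
        SparkFiles.get(file_name_from_url(url)),
        sep="\t",
        header=True,
        inferSchema=True,
    )
```

In a notebook this would be called as `pet_df = extract_reviews(spark, url)`, typically followed by `pet_df.show(5)` to inspect the first rows.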

Transform
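
The transform step can be sketched roughly as follows: drop incomplete rows, then split the review data into one DataFrame per target table. The table names and the columns assigned to each table are assumptions made for illustration; the column names themselves follow the published Amazon review file format.

```python
# Hypothetical target tables and the review columns each one keeps.
TABLE_COLUMNS = {
    "review_id_table": [
        "review_id", "customer_id", "product_id", "product_parent", "review_date",
    ],
    "products": ["product_id", "product_title"],
    "vine_table": [
        "review_id", "star_rating", "helpful_votes", "total_votes", "vine",
    ],
}


def transform_reviews(df):
    """Drop incomplete rows and split the reviews into one DataFrame per table."""
    clean = df.dropna()
    tables = {name: clean.select(*cols) for name, cols in TABLE_COLUMNS.items()}
    # Keep one row per product rather than one per review.
    tables["products"] = tables["products"].dropDuplicates(["product_id"])
    # Count how many reviews each customer has written.
    tables["customers"] = (
        clean.groupBy("customer_id")
        .count()
        .withColumnRenamed("count", "customer_count")
    )
    return tables
```

Returning a dictionary keyed by table name keeps the later load step simple, since each DataFrame can be written to the table that shares its key.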

Load

Before loading, we created the schema for our target tables in pgAdmin.
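
The schema itself was written as SQL in pgAdmin. Purely as an illustration, the snippet below generates `CREATE TABLE` statements from a Python dictionary. The table names, column names, and column types are assumptions, not the schema used in the project.

```python
# Hypothetical schema: table name -> list of (column, PostgreSQL type).
SCHEMA = {
    "products": [("product_id", "TEXT PRIMARY KEY"), ("product_title", "TEXT")],
    "customers": [("customer_id", "INT PRIMARY KEY"), ("customer_count", "INT")],
}


def create_table_sql(table, columns):
    """Build a CREATE TABLE statement for one table."""
    body = ",\n".join(f"  {name} {sql_type}" for name, sql_type in columns)
    return f"CREATE TABLE {table} (\n{body}\n);"


if __name__ == "__main__":
    # Print the statements so they can be pasted into the pgAdmin query tool.
    for table, columns in SCHEMA.items():
        print(create_table_sql(table, columns))
```

Whatever types are chosen, they need to agree with the column types of the DataFrames being loaded, otherwise the write is likely to fail.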

Once the schema was created, we moved on to the loading process.
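
Loading from PySpark into an RDS PostgreSQL database is usually done over JDBC. The sketch below assumes the PostgreSQL JDBC driver is available to the Spark session and that the tables already exist; the helper names, host, database name, and credentials are placeholders, not values from the project.

```python
def jdbc_settings(host, database, user, password, port=5432):
    """Return the JDBC URL and connection properties for a PostgreSQL database."""
    url = f"jdbc:postgresql://{host}:{port}/{database}"
    properties = {
        "user": user,
        "password": password,
        "driver": "org.postgresql.Driver",
    }
    return url, properties


def load_tables(tables, url, properties, mode="append"):
    """Write each Spark DataFrame in `tables` to the table of the same name."""
    for name, df in tables.items():
        df.write.jdbc(url=url, table=name, mode=mode, properties=properties)
```

Using `mode="append"` adds rows to the tables created beforehand instead of replacing them, so running the load twice would duplicate the data.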

We then checked in pgAdmin that the load was successful (one example).
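
The check in the project was done with a query in pgAdmin. A comparable check from the notebook side, sketched here as an assumption rather than something the project did, is to read the table back over JDBC and compare row counts. The function name `row_counts_match` is hypothetical.

```python
def row_counts_match(spark, df, table, url, properties):
    """Compare the row count of `df` with the row count of `table` in the database."""
    # Read the loaded table back from the database through the same JDBC settings.
    loaded = spark.read.jdbc(url=url, table=table, properties=properties)
    return df.count() == loaded.count()
```

This only holds when the table was empty before the load; with appended data from earlier runs the counts would differ.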

ETL Process: Digital Video Games Dataset

Extract

Transform

Load

Before loading, we created the schema for our target tables in pgAdmin.

Once the schema was created, we moved on to the loading process.

We then checked in pgAdmin that the load was successful (one example).
