Date of Award
Master of Science in Information Technology Leadership (MS ITL)
The Oracle database currently used to mine data at PEGGY is approaching end-of-life and a new infrastructure overhaul is required. It has also been identified that a critical business requirement is the need to load and store very large historical data sets. These data sets contain raw electronic consumer events and interactions from a website such as page views, clicks, downloads, return visits, length of time spent on pages, and how they got to the site / originated.
This project will be focused on finding a tool to analyze and measure sessionized data, which is a unit of measurement in web analytics that captures either a user's actions within a particular time period, or the process of segmenting user activity of each user into sessions, each representing a single visit to the site. This sessionized data can be used as the input for a variety of data mining tasks such as clustering, association rule mining, sequence mining etc (Ansari. 2011) This sessionized data must be delivered in a reorganized and readable format timely enough to make informed go-to-market decisions as it relates to the current and existing industry trends. It is also pertinent to understand any development work required and the burden on the resources.
Legacy on-premise data warehouse solutions are becoming more expensive, less efficient, less dynamic, and unscalable when compared to current Cloud Infrastructure as a Service (IaaS) that offer real time, on-demand, pay-as-you-go solutions . Therefore, this study will examine the total cost of ownership (TCO) by considering, researching, and analyzing the following factors against a system wide upgrade of the current on-premise Oracle Real Application Cluster (RAC) System:
- High performance: real-time (or as close to as possible) query speed against sessionized data
- SQL compliance
- Cloud based or, at least a hybrid (read: on-premise paired with cloud)
- Security: encryption preferred
- Cost structure: cost-effective pay-as-you-go pricing model and resources required for the migration and operations.
These technologies analyzed against the current Oracle database are:
- Amazon Redshift
- Google Bigquery
- Hadoop + Hive
The cost of building an on-premise data warehouse is substantial. The project will determine the performance capabilities and affordability of Amazon Redshift, when compared to other emerging highly ranked solutions, for running e-commerce standard analytics queries on terabytes of sessionized data. Rather than redesigning, upgrading, or over purchasing infrastructure at a high cost for an on-premise data warehouse, this project considers data warehousing solutions through cloud based infrastructure as a service (IaaS) solutions. The proposed objective of this project is to determine the most cost-effective high performer between Amazon Redshift, Apache Hadoop, and Google BigQuery when running e-commerce standard analytics queries on terabytes of sessionized data.
McGinley, Robert and Etter, Jason, "Storage and Analysis of Big Data Tools for Sessionized Data" (2015). Mathematics and Computer Science Capstones. Paper 24.