High Performance Spatial Query Processing for Large Scale Spatial Data Warehousing Öffentlichkeit

Aji, Ablimit (2014)

Permanent URL: https://etd.library.emory.edu/concern/etds/g445cd830?locale=de
Published

Abstract

Support of high performance queries on large volumes of spatial data have become important in many domains, including geo-spatial problems in numerous fields and emerging scientific applications that are increasingly data- and compute-intensive. There are two major challenges for managing and querying massive spatial data: the explosion of spatial data, and the high computational complexity of spatial queries due to the multi-dimensional nature of spatial analytics.

MapReduce provides a highly effective, efficient and reliable tool for large scale data analysis. While MapReduce model is amenable to a wide range of real-world tasks, spatial queries and analytics are intrinsically complex to fit into the MapReduce model. Meanwhile, hybrid systems combining CPUs and GPUs are becoming widely available, but the computing capacity of such systems is often underutilized. Providing new spatial querying methods on such architecture requires us to answer several fundamental research questions that have practical implications.

The goal of this dissertation is to create a framework with new systematic methods to support high performance spatial queries for massive spatial data on MapReduce and CPU-GPU hybrid platforms, driven by real-world use cases. Specifically, our research includes:

  • create new spatial data processing methods and pipelines in MapReduce, and multi-level indexing methods to accelerate spatial data processing
  • develop two critical components to enable query parallelization: effective and scalable spatial partitioning and query normalization methods for partition effect
  • integrate GPU-based spatial operations into MapReduce query pipelines
  • investigate optimization methods for data skew mitigation, and CPU/GPU resource coordination in MapReduce
  • support declarative spatial queries for workload composition, and automated translation into MapReduce programs

Consequently, we have developed Hadoop-GIS -- a MapReduce based high performance spatial querying system for spatial data warehousing. The system supports multiple types of spatial queries on MapReduce through spatial partitioning, implicit parallel spatial query execution on MapReduce, and effective methods for amending query results through handling boundary objects. Hadoop-GIS utilizes global partition indexing and customizable on demand local spatial indexing to achieve efficient query processing. We have integrated Hadoop-GIS into Hive to support declarative spatial queries, and released the system and developed approaches as an open source software package.

Table of Contents

Contents

1 Introduction 1

1.1 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.2 Data Intensive Spatial Applications . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.2.1 Analysis of Derived Scientific Spatial Data . . . . . . . . . . . . . . . . . . 3
1.2.2 GIS and Social Media Applications . . . . . . . . . . . . . . . . . . . . . . . 5
1.3 Spatial Queries . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
1.3.1 Characteristics of Modern Spatial Analytics Applications . . . . . . . . 7
1.4 Dissertation Goals . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
1.4.1 Exploring the Principles of Spatial Query Processing on MapReduce .9
1.4.2 Exploring the System Architecture . . . . . . . . . . . . . . . . . . . . . . . 10
1.4.3 Improving the Query Processing Performance . . . . . . . . . . . . . . . 11
1.5 Dissertation Contributions and Outline . . . . . . . . . . . . . . . . . . . . . . . 14
1.5.1 Contributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
1.5.2 Outline . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16

2 Approaches to Large Scale Spatial Data Analytics . . . . . . . . . . . . . . . 17

2.1 Spatially Extended Relational Database Systems . . . . . . . . . . . . . . . . 17

2.2 GIS systems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21

2.3 MapReduce based Systems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22

2.4 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23

3 Hadoop-GIS: A Framework for Spatial Query Processing with MapReduce 24
3.1 Overview of the Framework . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24

3.2 Data Partitioning and Storage . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26

3.3 MapReduce Based Parallel Query Execution . . . . . . . . . . . . . . . . . . . . 28

3.4 Boundary Object Processing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29

3.5 Multi-Level Spatial Indexing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30

3.6 Query Processing Cost Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31

3.7 Related Approaches . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33

3.8 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .34

4 MapReduce based Spatial Query Processing . . . . . . . . . . . . . . . . . . 36

4.1 Real-time Spatial Query Engine . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36

4.1.1 Index Supported Spatial Queries . . . . . . . . . . . . . . . . . . . . . . . . 37

4.1.2 Spatial Query Workflows in RESQUE . . . . . . . . . . . . . . . . . . . . . 38

4.1.3 Query Engine Performance . . . . . . . . . . . . . . . . . . . . . . . . . . . 40

4.2 MapReduce Based Spatial Query Processing . . . . . . . . . . . . . . . . . . . 42

4.2.1 Spatial Join . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42

4.2.2 Multiway Spatial Join . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46

4.2.3 Nearest Neighbor Query . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47

4.2.4 Spatial Selection and Aggregation . . . . . . . . . . . . . . . . . . . . . . . 50

4.3 Boundary Handling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50

4.4 Query Optimization Approaches . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53

4.5 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55

4.5.1 Experimental Setup . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56

4.5.2 Performance of Hadoop-GIS . . . . . . . . . . . . . . . . . . . . . . . . . . . 58

4.5.3 Scalability of Hadoop-GIS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62

4.5.4 Boundary Handling Overhead . . . . . . . . . . . . . . . . . . . . . . . . . . 64

4.5.5 Effects of Query Optimization Approaches . . . . . . . . . . . . . . . . . 66

4.6 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 68

5 Effective Spatial Data Partitioning for Scalable Query Processing . . . 69

5.1 Introduction and Related Approaches . . . . . . . . . . . . . . . . . . . . . . . . 69

5.1.1 Challenges in Spatial Partitioning . . . . . . . . . . . . . . . . . . . . . . . 70

5.1.2 Related Approaches . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 72

5.2 Classification of Spatial Partition Algorithms . . . . . . . . . . . . . . . . . . . 74

5.2.1 Partition Boundary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 74

5.2.2 Search Strategy . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 75

5.2.3 Partition Criterion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 76

5.3 Spatial Partition Algorithms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 77

5.3.1 Preliminaries . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 77

5.3.2 Methods and Details . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 78

5.4 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 82

5.4.1 Parameters and Metrics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 82

5.4.2 Comparison of Partition Quality . . . . . . . . . . . . . . . . . . . . . . . . . 84

5.4.3 Effects of Partitioning on Query Performance . . . . . . . . . . . . . . . 87

5.5 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 89

6 Efficiency Improvements for Spatial Data Partitioning . . . . . . . . . . . 90

6.1 Runtime Cost of Spatial Partitioning Algorithms . . . . . . . . . . . . . . . . . 90

6.2 Two Approaches for Improving Efficiency . . . . . . . . . . . . . . . . . . . . . 93

6.2.1 Parallel Partitioning with MapReduce . . . . . . . . . . . . . . . . . . . . . 93

6.2.2 Partitioning on Sampled Data . . . . . . . . . . . . . . . . . . . . . . . . . . 94

6.3 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 95

6.3.1 MapReduce based Approach . . . . . . . . . . . . . . . . . . . . . . . . . . . 95

6.3.2 Sampling based Approach . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 97
7 Haggis: Hardware Acceleration of Hadoop-GIS . . . . . . . . . . . . . . . . 100

7.1 Introduction and Related Approaches . . . . . . . . . . . . . . . . . . . . . . . 100

7.2 GPU Accelerated Spatial Query Processing . . . . . . . . . . . . . . . . . . . . 102

7.2.1 Spatial Queries on CPU . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 102

7.2.2 Spatial Queries on GPU . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 104

7.3 Implementation Details of Haggis . . . . . . . . . . . . . . . . . . . . . . . . . . 105

7.3.1 Architectural Details . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 106

7.3.2 Task Assignment . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 107

7.3.3 Effects of Task Granularity . . . . . . . . . . . . . . . . . . . . . . . . . . . . 107

7.4 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 108

7.4.1 Effects of CPU for co-processing . . . . . . . . . . . . . . . . . . . . . . . . 108

7.4.2 Effects of MR Parallelization . . . . . . . . . . . . . . . . . . . . . . . . . . . 109

7.5 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 110

8 HiveSP -- An Implementation of Hadoop-GIS . . . . . . . . . . . . . . . . . . 111

8.1 System Architecture . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 111

8.2 Query Language . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 113

8.3 Storage Layer . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 114

8.4 Query Processing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 115

8.5 Software Setup . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 116

Bibliography . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 119

About this Dissertation

Rights statement
  • Permission granted by the author to include this thesis or dissertation in this repository. All rights reserved by the author. Please contact the author for information regarding the reproduction and use of this thesis or dissertation.
School
Department
Degree
Submission
Language
  • English
Research Field
Stichwort
Committee Chair / Thesis Advisor
Committee Members
Zuletzt geändert

Primary PDF

Supplemental Files