AWS Glue

AWS Cloud Software aws AWS lambda aws-glue cloud etl serverless

AWS Glue

AWS Glue is a fully managed extract, transform, and load (ETL) service that makes it easy for customers to prepare and load their data for analytics. You can create and run an ETL job with a few clicks in the AWS Management Console. You simply point AWS Glue to your data stored on AWS, and AWS Glue discovers your data and stores the associated metadata (e.g. table definition and schema) in the AWS Glue Data Catalog. Once cataloged, your data is immediately searchable, queryable, and available for ETL. So what is ETL..
ETL stands for “extract, transform, and load.” The process of ETL plays a key role in data integration strategies. ETL allows businesses to gather data from multiple sources and consolidate it into a single, centralized location. ETL also makes it possible for different types of data to work together.

AWS Glue Components

AWS Glue provides a console and API operations to set up and manage your extract, transform, and load (ETL) workload. You can use API operations through several language-specific SDKs and the AWS Command Line Interface (AWS CLI). AWS Glue uses the AWS Glue Data Catalog to store metadata about data sources, transforms, and targets. The Data Catalog is a drop-in replacement for the Apache Hive Metastore. The AWS Glue Jobs system provides a managed infrastructure for defining, scheduling, and running ETL operations on your data.

AWS Glue Console

You use the AWS Glue console to define and orchestrate your ETL workflow. The console calls several API operations in the AWS Glue Data Catalog and AWS Glue Jobs system to perform the following tasks:

Define AWS Glue objects such as jobs, tables, crawlers, and connections.
Schedule when crawlers run.
Define events or schedules for job triggers.
Search and filter lists of AWS Glue objects.
Edit transformation scripts.

AWS Glue Data Catalog

The AWS Glue Data Catalog is your persistent metadata store. It is a managed service that lets you store, annotate, and share metadata in the AWS Cloud in the same way you would in an Apache Hive metastore.

AWS Glue Crawlers and Classifiers

AWS Glue also lets you set up crawlers that can scan data in all kinds of repositories, classify it, extract schema information from it, and store the metadata automatically in the AWS Glue Data Catalog. From there it can be used to guide ETL operations.

Which Data Stores Can I Crawl?

Crawlers can crawl both file-based and table-based data stores. Crawlers can crawl the following data stores through their respective native interfaces:

Amazon Simple Storage Service (Amazon S3)
Amazon DynamoDB

Crawlers can crawl the following data stores through a JDBC connection:

Amazon Redshift
Amazon Relational Database Service (Amazon RDS)
- Amazon Aurora
- MariaDB
- Microsoft SQL Server
- MySQL
- Oracle
- PostgreSQL
Publicly accessible databases
- Aurora
- MariaDB
- SQL Server
- MySQL
- Oracle
- PostgreSQL

What Happens When a Crawler Runs?

When a crawler runs, it takes the following actions to interrogate a data store:

Classifies data to determine the format, schema, and associated properties of the raw data – You can configure the results of classification by creating a custom classifier.
Groups data into tables or partitions – Data is grouped based on crawler heuristics.
Writes metadata to the Data Catalog – You can configure how the crawler adds, updates, and deletes tables and partitions.

AWS Glue ETL Operations

Using the metadata in the Data Catalog, AWS Glue can autogenerate Scala or PySpark (the Python API for Apache Spark) scripts with AWS Glue extensions that you can use and modify to perform various ETL operations. For example, you can extract, clean, and transform raw data, and then store the result in a different repository, where it can be queried and analyzed. Such a script might convert a CSV file into a relational form and save it in Amazon Redshift.

The AWS Glue Jobs System

The AWS Glue Jobs system provides managed infrastructure to orchestrate your ETL workflow. You can create jobs in AWS Glue that automate the scripts you use to extract, transform, and transfer data to different locations. Jobs can be scheduled and chained, or they can be triggered by events such as the arrival of new data.

AWS Glue’s Features

Easy – AWS Glue automates much of the effort in building, maintaining, and running ETL jobs. AWS Glue crawls your data sources, identifies data formats, and suggests schemas and transformations. AWS Glue automatically generates the code to execute your data transformations and loading processes.
Integrated – AWS Glue is integrated across a wide range of AWS services.
Serverless – AWS Glue is serverless. There is no infrastructure to provision or manage. AWS Glue handles provisioning, configuration, and scaling of the resources required to run your ETL jobs on a fully managed, scale-out Apache Spark environment. You pay only for the resources used while your jobs are running.
Developer Friendly – AWS Glue generates ETL code that is customizable, reusable, and portable, using familiar technology – Scala, Python, and Apache Spark. You can also import custom readers, writers and transformations into your Glue ETL code. Since the code AWS Glue generates is based on open frameworks, there is no lock-in. You can use it anywhere.

References

0 Comments

CloudNesil

What Happens When a Crawler Runs?

AWS Glue ETL Operations

The AWS Glue Jobs System

Post a Comment cancel reply

Would you like to take a closer look?

We’ll Be Glad To Assist You

Contact Info

Contact Info

CloudNesil

AWS Glue

AWS Glue Components

AWS Glue Console

AWS Glue Data Catalog

AWS Glue Crawlers and Classifiers

Which Data Stores Can I Crawl?

What Happens When a Crawler Runs?

AWS Glue ETL Operations

The AWS Glue Jobs System

AWS Glue’s Features

Related Posts

CloudNesil Awarded Top B2B Company in Turkey by Clutch

Top IT Outsourcing Companies in Turkey 2019

Serverless Framework

Post a Comment cancel reply