Glue

The Glue API in LocalStack Pro allows you to run ETL (Extract-Transform-Load) jobs locally, maintain table metadata in the local Glue data catalog, and use the Spark ecosystem (PySpark/Scala) to run data processing workflows.

Note: In order to run Glue jobs, some additional dependencies have to be fetched from the network, including a Docker image of approximately 1.5GB which includes Spark, Presto, Hive, and other tools. These dependencies are fetched automatically when you start up the service, so please make sure you are on a decent internet connection when pulling them for the first time.

Creating Databases and Table Metadata

The commands below illustrate the creation of some very basic entries (databases, tables) in the Glue data catalog:

  $ awslocal glue create-database --database-input '{"Name":"db1"}'
  $ awslocal glue create-table --database-name db1 --table-input '{"Name":"table1"}'
  $ awslocal glue get-tables --database-name db1
  {
      "TableList": [
          {
              "Name": "table1",
              "DatabaseName": "db1"
          }
      ]
  }
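The same entries can also be created programmatically. Below is a minimal boto3 sketch; the endpoint URL and dummy credentials are assumptions for a default LocalStack setup on port 4566:

  import boto3

  # Point the Glue client at the local endpoint (assumed default for LocalStack)
  glue = boto3.client(
      "glue",
      endpoint_url="http://localhost:4566",
      region_name="us-east-1",
      aws_access_key_id="test",
      aws_secret_access_key="test",
  )

  glue.create_database(DatabaseInput={"Name": "db1"})
  glue.create_table(DatabaseName="db1", TableInput={"Name": "table1"})

  # List the tables we just created
  for table in glue.get_tables(DatabaseName="db1")["TableList"]:
      print(table["Name"], table["DatabaseName"])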

Running Scripts with Scala and PySpark

Assume we would like to deploy a simple PySpark script job.py from a local folder.
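If you do not have a script at hand, a trivial job.py could look like the following sketch. This is a hypothetical example, not an official sample; since the job definition below uses the pythonshell command type, a plain Python script is sufficient (a PySpark glueetl job would import the Spark libraries instead):

  # job.py - a trivial script for the pythonshell job type
  print("Hello from a local Glue job!")

  # Toy "transform": aggregate a few numbers and print the result
  data = [1, 2, 3, 4]
  print("sum:", sum(data))

We can first copy the script to an S3 bucket: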

  $ awslocal s3 mb s3://glue-test
  $ awslocal s3 cp job.py s3://glue-test/job.py

Next, we can create a job definition:

  $ awslocal glue create-job --name job1 --role r1 \
      --command '{"Name": "pythonshell", "ScriptLocation": "s3://glue-test/job.py"}'

… and finally start the job:

  $ awslocal glue start-job-run --job-name job1
  {
      "JobRunId": "733b76d0"
  }

The returned JobRunId can be used to query the status of the job execution, until it becomes SUCCEEDED:

  $ awslocal glue get-job-run --job-name job1 --run-id 733b76d0
  {
      "JobRun": {
          "Id": "733b76d0",
          "Attempt": 1,
          "JobRunState": "SUCCEEDED"
      }
  }
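The polling can also be scripted; here is a small boto3 sketch (the endpoint URL and credentials are assumptions for a default LocalStack setup):

  import time
  import boto3

  glue = boto3.client("glue", endpoint_url="http://localhost:4566",
                      region_name="us-east-1",
                      aws_access_key_id="test", aws_secret_access_key="test")

  run_id = glue.start_job_run(JobName="job1")["JobRunId"]

  # Poll until the job run reaches a terminal state
  while True:
      state = glue.get_job_run(JobName="job1", RunId=run_id)["JobRun"]["JobRunState"]
      if state in ("SUCCEEDED", "FAILED", "STOPPED", "TIMEOUT"):
          break
      time.sleep(2)
  print("Final state:", state)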

For a more detailed example illustrating how to run a local Glue PySpark job, please refer to this sample repository.

Importing Athena Tables into Glue Data Catalog

The Glue data catalog is integrated with Athena, and the database/table definitions can be imported via the import-catalog-to-glue API.

Assume you are running the following Athena queries to create databases and table definitions:

  CREATE DATABASE db2
  CREATE EXTERNAL TABLE db2.table1 (a1 Date, a2 STRING, a3 INT) LOCATION 's3://test/table1'
  CREATE EXTERNAL TABLE db2.table2 (a1 Date, a2 STRING, a3 INT) LOCATION 's3://test/table2'

Then this command will import these DB/table definitions into the Glue data catalog:

  $ awslocal glue import-catalog-to-glue

… and finally they will be available in Glue:

  $ awslocal glue get-databases
  {
      "DatabaseList": [
          ...
          {
              "Name": "db2",
              "Description": "Database db2 imported from Athena",
              "TargetDatabase": {
                  "CatalogId": "000000000000",
                  "DatabaseName": "db2"
              }
          }
      ]
  }
  $ awslocal glue get-tables --database-name db2
  {
      "TableList": [
          {
              "Name": "table1",
              "DatabaseName": "db2",
              "Description": "Table db2.table1 imported from Athena",
              "CreateTime": ...
          },
          {
              "Name": "table2",
              "DatabaseName": "db2",
              "Description": "Table db2.table2 imported from Athena",
              "CreateTime": ...
          }
      ]
  }
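The same flow can be scripted end-to-end; here is a rough boto3 sketch (the Athena query polling is omitted for brevity, and endpoint/credentials are assumptions for a default LocalStack setup):

  import boto3

  kwargs = dict(endpoint_url="http://localhost:4566", region_name="us-east-1",
                aws_access_key_id="test", aws_secret_access_key="test")
  athena = boto3.client("athena", **kwargs)
  glue = boto3.client("glue", **kwargs)

  # Create the database/table definitions in Athena
  for query in (
      "CREATE DATABASE db2",
      "CREATE EXTERNAL TABLE db2.table1 (a1 Date, a2 STRING, a3 INT) LOCATION 's3://test/table1'",
  ):
      athena.start_query_execution(QueryString=query)

  # ... wait for the queries to finish, then import the definitions into Glue
  glue.import_catalog_to_glue()
  print([db["Name"] for db in glue.get_databases()["DatabaseList"]])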

Crawlers

Glue crawlers allow you to extract metadata from structured data sources. The example below illustrates crawling table and partition metadata from S3 buckets.

First, we create an S3 bucket with a couple of items:

  $ awslocal s3 mb s3://test
  $ printf "1, 2, 3, 4\n5, 6, 7, 8" > /tmp/file.csv
  $ awslocal s3 cp /tmp/file.csv s3://test/table1/year=2021/month=Jan/day=1/file.csv
  $ awslocal s3 cp /tmp/file.csv s3://test/table1/year=2021/month=Jan/day=2/file.csv
  $ awslocal s3 cp /tmp/file.csv s3://test/table1/year=2021/month=Feb/day=1/file.csv
  $ awslocal s3 cp /tmp/file.csv s3://test/table1/year=2021/month=Feb/day=2/file.csv

Then we can create and trigger the crawler:

  $ awslocal glue create-database --database-input '{"Name":"db1"}'
  $ awslocal glue create-crawler --name c1 --database-name db1 --role r1 --targets '{"S3Targets": [{"Path": "s3://test/table1"}]}'
  $ awslocal glue start-crawler --name c1
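Crawler runs are asynchronous. To wait until the crawl has finished, you can poll the crawler state, for example with a boto3 sketch like this (endpoint and credentials are assumptions for a default LocalStack setup):

  import time
  import boto3

  glue = boto3.client("glue", endpoint_url="http://localhost:4566",
                      region_name="us-east-1",
                      aws_access_key_id="test", aws_secret_access_key="test")

  # The crawler returns to the READY state once the crawl is done
  while glue.get_crawler(Name="c1")["Crawler"]["State"] != "READY":
      time.sleep(1)
  print("Crawler c1 finished")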

Finally, we can query the table and partition metadata that has been created by the crawler:

  $ awslocal glue get-tables --database-name db1
  {
      "TableList": [{
          "Name": "table1",
          "DatabaseName": "db1",
          "PartitionKeys": [ ... ]
      ...
  $ awslocal glue get-partitions --database-name db1 --table-name table1
  {
      "Partitions": [{
          "Values": ["2021", "Jan", "1"],
          "DatabaseName": "db1",
          "TableName": "table1",
      ...

Schema Registry

The Glue Schema Registry allows you to centrally discover, control, and evolve data stream schemas. With the Schema Registry, you can manage and enforce schemas and schema compatibilities in your streaming applications. It integrates nicely with Managed Streaming for Kafka (MSK).

Note: Currently, LocalStack supports the AVRO data format for the Glue Schema Registry. Support for other data formats will be added in the future.

  $ awslocal glue create-registry --registry-name demo-registry
  {
      "RegistryArn": "arn:aws:glue:us-east-1:000000000000:file-registry/demo-registry",
      "RegistryName": "demo-registry"
  }
  $ awslocal glue create-schema --schema-name demo-schema --registry-id RegistryName=demo-registry --data-format AVRO --compatibility FORWARD \
      --schema-definition '{"type":"record","namespace":"Demo","name":"Person","fields":[{"name":"Name","type":"string"}]}'
  {
      "RegistryName": "demo-registry",
      "RegistryArn": "arn:aws:glue:us-east-1:000000000000:file-registry/demo-registry",
      "SchemaName": "demo-schema",
      "SchemaArn": "arn:aws:glue:us-east-1:000000000000:schema/demo-registry/demo-schema",
      "DataFormat": "AVRO",
      "Compatibility": "FORWARD",
      "SchemaCheckpoint": 1,
      "LatestSchemaVersion": 1,
      "NextSchemaVersion": 2,
      "SchemaStatus": "AVAILABLE",
      "SchemaVersionId": "546d3220-6ab8-452c-bb28-0f1f075f90dd",
      "SchemaVersionStatus": "AVAILABLE"
  }
  $ awslocal glue register-schema-version --schema-id SchemaName=demo-schema,RegistryName=demo-registry \
      --schema-definition '{"type":"record","namespace":"Demo","name":"Person","fields":[{"name":"Name","type":"string"}, {"name":"Address","type":"string"}]}'
  {
      "SchemaVersionId": "ee38732b-b299-430d-a88b-4c429d9e1208",
      "VersionNumber": 2,
      "Status": "AVAILABLE"
  }
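Registered schemas can also be retrieved programmatically, e.g. from a Kafka producer or consumer. A boto3 sketch could look like this (endpoint and credentials are assumptions for a default LocalStack setup):

  import json
  import boto3

  glue = boto3.client("glue", endpoint_url="http://localhost:4566",
                      region_name="us-east-1",
                      aws_access_key_id="test", aws_secret_access_key="test")

  # Fetch the latest version of the schema registered above
  version = glue.get_schema_version(
      SchemaId={"SchemaName": "demo-schema", "RegistryName": "demo-registry"},
      SchemaVersionNumber={"LatestVersion": True},
  )
  schema = json.loads(version["SchemaDefinition"])
  print(version["VersionNumber"], [f["name"] for f in schema["fields"]])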

You can find a more advanced sample in our localstack-pro-samples repository on GitHub, which showcases the integration with AWS MSK and automatic schema registrations (including schema rejections based on the compatibilities).

Further Reading

AWS Glue is a fairly comprehensive service - more details can be found in the official AWS Glue Developer Guide.

Current Limitations

Support for triggers is currently limited - the basic API endpoints are implemented, but the trigger functionality is still under development (more details coming soon).
