Codementor Events

How to truncate/delete a database table from the AWS Glue job?

Published Oct 10, 2022

AWS Glue is an ETL tool developed & managed by Amazon, which uses PySpark underneath. An ETL job can have many types of data sources & destinations. It is pretty easy working with files as source/destination, just provide credentials, and a location to read & write.

Recently I was working on converting a SQL stored procedure to Glue job. My requirement was very simple. The interesting part was truncating the target table before inserting new data. Glue data frame doesn't provide overwrite mode while loading into the table. I research a bit but didn't find any resources handling this & neither getting any direction. So how can we do this?

Something came to my mind, yes I can do this. Let us use driver's capability to do this. For this, I used PyMySQL Python package.

import pymysql
import pymysql.cursors

# Connect to the database using db details or fetch these from Glue connections
connection = pymysql.connect(host='localhost',
                             user='user',
                             password='passwd',
                             database='db',
                             cursorclass=pymysql.cursors.DictCursor)

with connection:
    with connection.cursor() as cursor:
        # Create a new record
        table_to_truncate = "test_table"
        sql = f"truncate table {table_to_truncate}"
        cursor.execute(sql)
        
    # connection is not autocommit by default. So you must commit to save
    # your changes.
    connection.commit()

If you are working with Glue Studio to create a job in drag & drop way by selecting notes. Write a custom transformation & write this snippet inside this transformation. In other case, you can put it anywhere in the code file.

Still struggling to achieve this or stuck with something complex in this big data domain. Feel free to reach me.

Discover and read more posts from Rakesh Kumar
get started
post commentsBe the first to share your opinion
Sumant Kumar
a year ago

Hi Rakesh,

We have a requirement werein we have to backup data from RDS(MYSQL and Postgres) to S3 and to delete the already copied data from RDS.

please suggest me how can i acheive this thru AWS GLUE.

Rakesh Kumar
a year ago

Sure Sumant, DM me or schedule a session.

CRISTHIAM PAUL GOMEZ FLOREZ
a year ago

hi, i’m trying this but i got the next error:

IndentationError: expected an indented block (test_bookmark_2.py, line 25)

my line 25 is:
with connection.cursor() as cursor:

Rakesh Kumar
a year ago

You have problem in your code, missing indentation. Indentation is base requirement for python.

Show more replies