
Send dataframe to S3

Tags: #aws #cloud #storage #S3bucket #operations #snippet #dataframe
Author: Maxime Jublou
Reference: AWS Data Wrangler

Input

Import libraries

# Install awswrangler on the fly if it is not already available
try:
    import awswrangler as wr
except ModuleNotFoundError:
    !pip install awswrangler --user
    import awswrangler as wr
import pandas as pd
from datetime import date

Setup AWS

# Credentials
AWS_ACCESS_KEY_ID = "YOUR_AWS_ACCESS_KEY_ID"
AWS_SECRET_ACCESS_KEY = "YOUR_AWS_SECRET_ACCESS_KEY"
AWS_DEFAULT_REGION = "YOUR_AWS_DEFAULT_REGION"

# Bucket
BUCKET_PATH = "s3://naas-data-lake/dataset/"

Setup Env

%env AWS_ACCESS_KEY_ID=$AWS_ACCESS_KEY_ID
%env AWS_SECRET_ACCESS_KEY=$AWS_SECRET_ACCESS_KEY
%env AWS_DEFAULT_REGION=$AWS_DEFAULT_REGION
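
If you prefer not to export credentials as environment variables, Wrangler calls also accept an explicit boto3 session through their boto3_session argument. A minimal sketch, assuming the placeholder credentials above have been filled in:

import boto3

# Build an explicit session from the credentials defined above
session = boto3.Session(
    aws_access_key_id=AWS_ACCESS_KEY_ID,
    aws_secret_access_key=AWS_SECRET_ACCESS_KEY,
    region_name=AWS_DEFAULT_REGION,
)
# Pass it to any Wrangler call, e.g.
# wr.s3.to_parquet(df=df, path=BUCKET_PATH, dataset=True, boto3_session=session)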

Model

Get dataframe

df = pd.DataFrame({
    "id": [1, 2],
    "value": ["foo", "boo"],
    "date": [date(2020, 1, 1), date(2020, 1, 2)]
})

# Display dataframe
df

Output

Send dataset to S3

Wrangler supports three write modes for storing Parquet datasets on Amazon S3:
  • append (default): only adds new files, without deleting anything.
  • overwrite: deletes everything in the target path, then writes the new files.
  • overwrite_partitions (partition upsert): deletes only the partitions being updated, then writes their new files (see the sketch after the code below).
wr.s3.to_parquet(
    df=df,
    path=BUCKET_PATH,
    dataset=True,
    mode="overwrite"
)
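
To illustrate the overwrite_partitions mode described above, here is a minimal sketch that partitions the dataset by the "date" column and upserts only the partitions present in the dataframe; the partition column choice and the read-back check are assumptions for illustration, not part of the original snippet.

# Partition the dataset by "date" and upsert only the partitions present in df
wr.s3.to_parquet(
    df=df,
    path=BUCKET_PATH,
    dataset=True,
    mode="overwrite_partitions",
    partition_cols=["date"]
)

# Read the dataset back to check what landed in the bucket
df_check = wr.s3.read_parquet(path=BUCKET_PATH, dataset=True)
df_check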