Overview
Amazon Simple Storage Service (S3) provides an out-of-the-box, standard way to store and access data "at rest".
The sections below describe our service for interacting with S3. AWS S3 is sometimes referred to as the "data lake".
We have various needs when interacting with S3. This service wraps S3 and provides utilities for working with it. Its features, design, and configuration are outlined below.
Business Case
We need to store data on s3 where it can be shared and/or commonly accessed.
- Access may be outside our normal RESTful APIs and infrastructure.
- Use of this service must be "lightweight" and not require modifying the underlying microservice every time we have a new integration or use of s3.
- Given the wide usage of S3, we must provide a central place for security and for properly controlling access to S3 resources.
API Gateway
Currently, the Instructure Canvas data and Student Placements data integrations use the s3 service for reading/writing data to S3. (Note: future uses of S3 may be added without updates to this list of integrations.)
The Canvas sync from Instructure to S3 runs as a scheduled job each day: we sync data from Instructure to S3, into a folder per mis code.
Student Placements are generated when students submit a student application in CCC Apply. The design around this is described more fully in StudentPlacement. In particular, the Placement Adaptor accesses the data by way of the API Gateway.
High Level Design
A Conductor Worker Task receives data from Conductor (via standard task polling).
The task inputs (below) are transformed for the microservice.
The microservice accepts a list of files and stores them into a set of buckets and folders. An example input:
PUT {{S3Sync_URL}}/sync/folder/ben?locationKey=develop
{
"locationKey": "develop",
"rootFolder": "420",
"fileList": [
{
"filename": "account_dim-00000-2284f8da.gz",
"folder": "account_dim",
"url": "..."
},
{
"filename": "requests-00009-807ac34d.gz",
"folder": "requests",
"url": "..."
}
]
}
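The payload above can be assembled programmatically before the PUT. A minimal Python sketch, using only the field names shown in the example; the helper name and the required-key validation are our own illustrative assumptions, not part of the service contract:

```python
import json

def build_sync_payload(location_key, root_folder, file_list):
    """Assemble the JSON body for the PUT /sync/folder/{folder} request.

    Each entry in file_list is expected to carry the keys shown in the
    example above: filename, folder, and url.
    """
    required = {"filename", "folder", "url"}
    for entry in file_list:
        missing = required - entry.keys()
        if missing:
            raise ValueError(f"file entry missing keys: {sorted(missing)}")
    return json.dumps({
        "locationKey": location_key,
        "rootFolder": root_folder,
        "fileList": file_list,
    })

# Example mirroring the request body above (placeholder URL).
body = build_sync_payload("develop", "420", [
    {"filename": "account_dim-00000-2284f8da.gz",
     "folder": "account_dim",
     "url": "https://example.invalid/account_dim-00000-2284f8da.gz"},
])
```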
When sending a single file, we'll pass request parameters to the microservice, as in:
PUT {{S3Sync_URL}}/sync/folder/ben?locationKey=develop&filename=test.csv
The body of the PUT could be:
23232,"Test data", "A"
23041,"more data", "B"
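The request line above can be composed with standard URL encoding. A small sketch assuming the base URL placeholder and the query parameter names shown in the example (the helper name is ours):

```python
from urllib.parse import urlencode

def build_single_file_url(base_url, folder, location_key, filename):
    """Mirror the example request:
    PUT {base_url}/sync/folder/{folder}?locationKey=...&filename=...
    """
    query = urlencode({"locationKey": location_key, "filename": filename})
    return f"{base_url}/sync/folder/{folder}?{query}"

# Placeholder host standing in for {{S3Sync_URL}}.
url = build_single_file_url("https://s3sync.example.invalid", "ben",
                            "develop", "test.csv")
```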
Additionally, the endpoint below is exposed to retrieve placement data for school mis code 001:
GET {{ROUTER}}/student-placements/001
In the above, the client credentials are checked to ensure the client is authorized for the PLACEMENTS and 001 roles.
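The role check described above might look like the following sketch. Only the role names (PLACEMENTS plus the mis code role) come from the text; the function shape is an illustrative assumption, not the gateway's actual implementation:

```python
def authorized_for_placements(client_roles, mis_code):
    """Return True only if the client holds both the PLACEMENTS role
    and the role matching the requested mis code (e.g. "001")."""
    return "PLACEMENTS" in client_roles and mis_code in client_roles

ok = authorized_for_placements({"PLACEMENTS", "001"}, "001")
denied = authorized_for_placements({"PLACEMENTS"}, "001")
```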
Worker Task
A Conductor Worker running in the service-workers deployment will be activated with the Spring profile "shared".
Task Inputs
| Input | Description |
|---|---|
| headerMap | Map of headers from the original inbound message |
| locationKey | Lookup key for the s3 bucket name; this serves as the root folder for the files |
| folder | Subfolder under the above bucket (e.g., /processing, /inbox, ...) |
| fileName | Name of the file on s3 |
| url | HTTP/S URL from which to retrieve the file contents |
| fileContents | When sending a single file, the file content itself |
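A worker's transformation of these task inputs into a microservice call might be sketched as follows. The routing rule (treat a present fileContents as the single-file case, otherwise defer to a fileList-style sync) is our assumption based on the inputs above, not documented service behavior:

```python
from urllib.parse import urlencode

def to_request(task_input):
    """Map Conductor task inputs to the s3-sync request path and body.

    If fileContents is present we treat it as the single-file case and
    send the contents as the PUT body; otherwise the caller is expected
    to build a fileList-style payload from the url/fileName inputs.
    """
    folder = task_input["folder"].strip("/")
    params = {"locationKey": task_input["locationKey"]}
    if "fileContents" in task_input:
        params["filename"] = task_input["fileName"]
        body = task_input["fileContents"]
    else:
        body = None
    return f"/sync/folder/{folder}?{urlencode(params)}", body

path, body = to_request({
    "locationKey": "develop",
    "folder": "/inbox",
    "fileName": "test.csv",
    "fileContents": '23232,"Test data","A"',
})
```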
Conductor HTTP Task
We sometimes use dedicated workers, as above, but at other times we define HTTP tasks in Conductor and call the microservice directly from Conductor, as in:
{
"description": "Gets student placement data and resets to start tracking new data",
"taskReferenceName": "s3",
"name": "http-generic",
"type": "HTTP",
"inputParameters": {
"http_request": {
"method": "GET",
"headers": {
"Accept": "text/plain",
},
"uri": "${S3-SYNC-URL}/readFolderContents?locationKey=placements&folder=/${workflow.input.misCode}"
}
}
}
Design
The API Gateway component(s) are responsible for the following actions:
- call s3-sync microservice, transform input/output as appropriate