
ETL: Transform Data with Node.js

Create the second phase for an ETL pipeline in Node.

By Mario Kandut

Posted April 04, 2021

Updated March 03, 2022

4 min read

node express javascript

This article is based on Node v16.14.0.

ETL is the process of extracting, transforming, and loading data from one or multiple sources into a destination. Have a look at the article ETL pipeline explained for a general overview of ETL pipelines.


This is the second of three articles in a series, and it explains the transform phase in an ETL pipeline.

Extract data
Transform (this article)
Load

Transform Data in ETL pipeline

The second phase in an ETL pipeline is to Transform the extracted data. The data can be completely reshaped in this phase, for example by renaming fields, adding new fields, or filtering out data. The transform phase is responsible for bringing the data into the format its destination expects. In this step you can clean data, standardize values and fields, and aggregate values.
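
The operations listed above can be sketched with plain array methods. This is only a minimal illustration: the rawOrders records and the cleanOrder() helper are made up for this sketch and are not part of the photo example used in the rest of the article.

```javascript
// Hypothetical raw records, standing in for data from an extract phase.
const rawOrders = [
  { ID: 1, customer_name: ' alice ', total: '20.50', status: 'PAID' },
  { ID: 2, customer_name: 'Bob', total: '5.00', status: 'cancelled' },
  { ID: 3, customer_name: 'alice', total: '10.25', status: 'paid' },
];

// Rename fields, clean whitespace, and standardize casing and types.
function cleanOrder(order) {
  return {
    id: order.ID,
    customer: order.customer_name.trim().toLowerCase(),
    total: Number(order.total),
    status: order.status.toLowerCase(),
  };
}

// Clean every record, then filter out everything that is not paid.
const paidOrders = rawOrders
  .map(cleanOrder)
  .filter(order => order.status === 'paid');

// Aggregate: sum the totals per customer.
const totalsByCustomer = paidOrders.reduce((totals, order) => {
  totals[order.customer] = (totals[order.customer] || 0) + order.total;
  return totals;
}, {});

console.log(paidOrders);
console.log(totalsByCustomer);
```

Keeping each operation in its own small function makes it easy to test the steps in isolation before wiring them into a pipeline.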

We are going to continue with the example used in the article ETL: Extract Data with Node.js.

1. Determine the new structure of the data

The first step in the Transform phase is to determine what the new data structure should be. In the example, we are extracting photo albums, each of which is an array of photo objects. For the transformation, the unneeded property thumbnailUrl should be removed, and a new property name with the value Mario (or whatever string value you like) should be added to each photo object. In addition, a timestamp with the current time should be added to each photo album.

Old photo object interface:

interface Photo {
  albumId: number;
  id: number;
  title: string;
  url: string;
  thumbnailUrl: string;
}

Transformed photo object interface:

interface Photo {
  albumId: number;
  id: number;
  name: string;
  title: string;
  url: string;
}

A photo album is currently a plain array of photo objects:

Array<Photo>

New interface for photoAlbums:

interface PhotoAlbums {
  timeStamp: Date;
  data: Array<Photo>;
}
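
Since the project files are plain JavaScript, the interfaces above are not enforced at runtime. As a rough sketch, the target structure could be checked with two small guard functions. The names isTransformedPhoto() and isPhotoAlbums() are hypothetical, and the check uses the timeStamp spelling returned by the transform function later in this article.

```javascript
// Returns true when `photo` has the shape of the transformed Photo interface.
function isTransformedPhoto(photo) {
  return (
    photo !== null &&
    typeof photo === 'object' &&
    typeof photo.albumId === 'number' &&
    typeof photo.id === 'number' &&
    typeof photo.name === 'string' &&
    typeof photo.title === 'string' &&
    typeof photo.url === 'string' &&
    !('thumbnailUrl' in photo) // the old property must be gone
  );
}

// Returns true when `value` has the shape of the PhotoAlbums interface.
function isPhotoAlbums(value) {
  return (
    value !== null &&
    typeof value === 'object' &&
    value.timeStamp instanceof Date &&
    Array.isArray(value.data) &&
    value.data.every(isTransformedPhoto)
  );
}

// Example with made-up values.
const example = {
  timeStamp: new Date(),
  data: [{ albumId: 1, id: 1, name: 'Mario', title: 'a title', url: 'https://example.com/1.png' }],
};
console.log(isPhotoAlbums(example));
```

A guard like this can run at the end of the transform phase, so malformed data is caught before it reaches the load phase.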

2. Create a transform function

Create another file transform.js in the project folder, which is going to contain the transform functions.

touch transform.js

Create a transform function for transforming the photo object. It takes a photo object as an input, returns the needed properties and adds the name property with a string value.

function transformPhoto(photo) {
  return {
    albumId: photo.albumId,
    id: photo.id,
    name: 'Mario',
    title: photo.title,
    url: photo.url,
  };
}

module.exports = { transformPhoto };

A second function is needed for transforming the photoAlbum: it adds a timestamp with the current time and moves the array of photos into the new property data.

function addTimeStamp(photoAlbum) {
  return {
    data: photoAlbum,
    timeStamp: new Date(),
  };
}

module.exports = { transformPhoto, addTimeStamp };

3. Add transform phase in ETL orchestrate function

We are going to use the example with multiple requests for getting photos, since one request is boring. 😀 First, we require both transform functions in index.js. Then, in orchestrateEtlPipeline(), after the requests are done, we map over each photo object in each photoAlbum and apply the transformation with the transformPhoto() function. Finally, we output the result.

const { getPhotos } = require('./extract');
const { addTimeStamp, transformPhoto } = require('./transform');

const orchestrateEtlPipeline = async () => {
  try {
    // EXTRACT
    const allPhotoAlbums = Promise.all([
      getPhotos(1),
      getPhotos(2),
      getPhotos(3),
    ]);
    const [
      photoAlbum1,
      photoAlbum2,
      photoAlbum3,
    ] = await allPhotoAlbums;

    // TRANSFORM
    let transformedPhotoAlbum1 = photoAlbum1.map(photo =>
      transformPhoto(photo),
    );
    let transformedPhotoAlbum2 = photoAlbum2.map(photo =>
      transformPhoto(photo),
    );
    let transformedPhotoAlbum3 = photoAlbum3.map(photo =>
      transformPhoto(photo),
    );

    console.log(
      transformedPhotoAlbum1[0],
      transformedPhotoAlbum2[0],
      transformedPhotoAlbum3[0],
    ); // log first photo object of each transformed photoAlbum

    // TODO - LOAD
  } catch (error) {
    console.error(error);
  }
};

orchestrateEtlPipeline();

The transformation of the photo objects is complete. The output should contain only the five properties albumId, id, name, title and url; the thumbnailUrl property is gone. Now we have to transform each photoAlbum and add the timeStamp. We also output the timestamps.

const { getPhotos } = require('./extract');
const { addTimeStamp, transformPhoto } = require('./transform');

const orchestrateEtlPipeline = async () => {
  try {
    // EXTRACT
    const allPhotoAlbums = Promise.all([
      getPhotos(1),
      getPhotos(2),
      getPhotos(3),
    ]);
    const [
      photoAlbum1,
      photoAlbum2,
      photoAlbum3,
    ] = await allPhotoAlbums;

    // TRANSFORM
    let transformedPhotoAlbum1 = photoAlbum1.map(photo =>
      transformPhoto(photo),
    );
    let transformedPhotoAlbum2 = photoAlbum2.map(photo =>
      transformPhoto(photo),
    );
    let transformedPhotoAlbum3 = photoAlbum3.map(photo =>
      transformPhoto(photo),
    );

    console.log(
      transformedPhotoAlbum1[0],
      transformedPhotoAlbum2[0],
      transformedPhotoAlbum3[0],
    ); // log first photo object of each transformed photoAlbum

    transformedPhotoAlbum1 = addTimeStamp(transformedPhotoAlbum1);
    transformedPhotoAlbum2 = addTimeStamp(transformedPhotoAlbum2);
    transformedPhotoAlbum3 = addTimeStamp(transformedPhotoAlbum3);

    console.log(
      transformedPhotoAlbum1.timeStamp,
      transformedPhotoAlbum2.timeStamp,
      transformedPhotoAlbum3.timeStamp,
    ); // log timestamp
    console.log(transformedPhotoAlbum1);

    // TODO - LOAD
  } catch (error) {
    console.error(error);
  }
};

orchestrateEtlPipeline();
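
The three albums above are transformed with the same repeated code. One possible way to shorten this, sketched under some assumptions, is to combine both steps in a helper and map over an array of albums. The names transformPhotoAlbum(), orchestrateTransform() and getPhotosStub() are hypothetical; the stub only stands in for getPhotos() from extract.js so that the snippet runs without a network request.

```javascript
// Copied from transform.js so this snippet is self-contained.
function transformPhoto(photo) {
  return {
    albumId: photo.albumId,
    id: photo.id,
    name: 'Mario',
    title: photo.title,
    url: photo.url,
  };
}

function addTimeStamp(photoAlbum) {
  return { data: photoAlbum, timeStamp: new Date() };
}

// Hypothetical helper: applies both transform steps to one album.
function transformPhotoAlbum(photoAlbum) {
  return addTimeStamp(photoAlbum.map(transformPhoto));
}

// Stub standing in for getPhotos() from extract.js, returning made-up data.
async function getPhotosStub(albumId) {
  return [
    {
      albumId,
      id: albumId * 100,
      title: `photo of album ${albumId}`,
      url: 'https://example.com/photo.png',
      thumbnailUrl: 'https://example.com/thumb.png',
    },
  ];
}

// Extracts and transforms any number of albums.
async function orchestrateTransform(albumIds) {
  const photoAlbums = await Promise.all(albumIds.map(id => getPhotosStub(id)));
  return photoAlbums.map(transformPhotoAlbum);
}

orchestrateTransform([1, 2, 3])
  .then(albums => {
    console.log(albums.length); // number of transformed albums
    console.log(albums[0].data[0]); // first photo of the first album
  })
  .catch(console.error);
```

With this shape, adding a fourth album means adding one ID to the array instead of another block of map() and addTimeStamp() calls.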

After the last step is finished, we are ready for the next phase of the ETL pipeline, Load, which handles loading the transformed data into its destination.

TL;DR

The second phase in an ETL pipeline is to transform the data.
The first step in the transform phase is to determine what the new data structure should be.
The second step is to transform the data into the desired format.

Thanks for reading and if you have any questions, use the comment function or send me a message @mariokandut.

If you want to know more about Node, have a look at these Node Tutorials.
