Continuing our 4-part series of posts about the AWS Machine Learning Certification, we will now look into two of the four knowledge areas covered in the exam. We will start with Data Engineering, the basis for gathering, processing, and storing data, and then move on to Exploratory Data Analysis.
Data Engineering is a knowledge area that concerns designing and managing the data infrastructure of an organization. AWS has many services that help professionals build a complete data pipeline able to meet multiple business needs. Here we will quickly summarize the most useful and important Data Engineering tools made available by AWS.
Amazon Kinesis is composed of four different services that are built to help your organization work with streaming data. All of them are serverless, which means there is no need to provision or manage the infrastructure underlying any of the Kinesis services.
Firstly, Kinesis Video Streams helps you gather and store video streams from multiple sources and makes them available to different consumer applications, such as Amazon Rekognition or SageMaker models. Next is Kinesis Data Streams, a service designed to ingest streaming data into AWS and provide it to consumer applications in real time. Then there is Kinesis Data Firehose, whose purpose is to prepare and deliver data in near real time to storage services such as S3, Redshift, or Elasticsearch. Lastly, Kinesis Data Analytics lets you query and analyze streaming data, making it possible for you to use real-time data to create alarms or visualize current trends.
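To make this more concrete, here is a minimal sketch of a producer pushing records into a Kinesis Data Stream with boto3; the stream name and record fields are hypothetical placeholders:

```python
import json

import boto3

# Credentials and region are resolved from your environment / AWS config.
kinesis = boto3.client("kinesis", region_name="us-east-1")

# Each record carries a partition key, which decides the shard it lands on.
response = kinesis.put_record(
    StreamName="clickstream-events",  # hypothetical stream name
    Data=json.dumps({"user_id": 42, "action": "page_view"}),
    PartitionKey="42",
)

print(response["ShardId"], response["SequenceNumber"])
```

Consumer applications (or Kinesis Data Firehose and Kinesis Data Analytics) can then read those records from the other end of the stream.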
Another tool for processing and ingesting your data in AWS is Glue. Glue is a serverless service that makes it easy for you to quickly develop Extract, Transform, and Load (ETL) jobs in Python or Scala on top of an underlying distributed infrastructure. It can catalog the data from multiple sources using its own Glue Data Catalog and then load the data into a series of AWS storage services such as S3, RDS, or Redshift.
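As a rough illustration, a PySpark Glue job script typically follows the extract-transform-load pattern below; the database, table, and bucket names are hypothetical, and the sketch assumes the source table was already cataloged (for example, by a Glue crawler):

```python
import sys

from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext

# Standard Glue job initialization; JOB_NAME is passed in by the Glue runtime.
args = getResolvedOptions(sys.argv, ["JOB_NAME"])
glue_context = GlueContext(SparkContext.getOrCreate())
job = Job(glue_context)
job.init(args["JOB_NAME"], args)

# Extract: read a table previously registered in the Glue Data Catalog.
orders = glue_context.create_dynamic_frame.from_catalog(
    database="sales_db", table_name="raw_orders"  # hypothetical names
)

# Transform: keep only the columns needed downstream.
trimmed = orders.select_fields(["order_id", "customer_id", "total"])

# Load: write the result to S3 as Parquet.
glue_context.write_dynamic_frame.from_options(
    frame=trimmed,
    connection_type="s3",
    connection_options={"path": "s3://my-analytics-bucket/orders/"},
    format="parquet",
)

job.commit()
```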
Database Migration Service (or DMS) is a tool that enables you to perform multiple types of operations between databases, such as a one-time database migration or continuous replication. Its jobs can be run with almost no downtime on the source database, and it works for both on-premises and cloud-based databases.
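As a hedged sketch, a full-load-plus-ongoing-replication task could be created with boto3 along these lines; the ARNs are placeholders for endpoints and a replication instance you would have set up beforehand:

```python
import boto3

dms = boto3.client("dms", region_name="us-east-1")

# "full-load-and-cdc" copies the existing rows first, then keeps
# replicating ongoing changes from the source database.
response = dms.create_replication_task(
    ReplicationTaskIdentifier="orders-replication",
    SourceEndpointArn="arn:aws:dms:us-east-1:123456789012:endpoint:SOURCE",   # placeholder
    TargetEndpointArn="arn:aws:dms:us-east-1:123456789012:endpoint:TARGET",   # placeholder
    ReplicationInstanceArn="arn:aws:dms:us-east-1:123456789012:rep:INSTANCE", # placeholder
    MigrationType="full-load-and-cdc",
    # Select every table in every schema; table mappings are a JSON string.
    TableMappings=(
        '{"rules": [{"rule-type": "selection", "rule-id": "1", '
        '"rule-name": "include-all", "object-locator": '
        '{"schema-name": "%", "table-name": "%"}, "rule-action": "include"}]}'
    ),
)

print(response["ReplicationTask"]["Status"])
```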
After you have created a data pipeline, you will need to be able to sanitize and explore the data before feeding it into your ML models. This process is called Exploratory Data Analysis.
Amazon lets you explore data using services such as Athena, a serverless service that uses Glue's Data Catalog to understand the structured data stored in S3 and then query it with SQL statements. This makes it easy to read the data stored in S3 on the fly, and you are charged only for the data each query scans. Partitioning your data and storing it in an optimized format, such as Parquet, can make Athena a very cost-effective solution. Besides that, Redshift also has a feature called Redshift Spectrum that works very similarly to Athena, making it possible for you to query data stored in S3 from Redshift without loading it into the cluster.
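Because Athena runs queries asynchronously, the usual flow is to start a query, poll until it finishes, and then fetch the results. Below is a minimal sketch using boto3; the database, table, and results bucket are hypothetical:

```python
import time

import boto3

athena = boto3.client("athena", region_name="us-east-1")

# Start a query against a table registered in the Glue Data Catalog;
# Athena writes its result files to the S3 location given below.
start = athena.start_query_execution(
    QueryString="SELECT action, COUNT(*) AS events FROM clickstream GROUP BY action",
    QueryExecutionContext={"Database": "sales_db"},  # hypothetical database
    ResultConfiguration={"OutputLocation": "s3://my-athena-results/"},
)
query_id = start["QueryExecutionId"]

# Poll until the query reaches a terminal state.
while True:
    execution = athena.get_query_execution(QueryExecutionId=query_id)
    state = execution["QueryExecution"]["Status"]["State"]
    if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
        break
    time.sleep(1)

if state == "SUCCEEDED":
    results = athena.get_query_results(QueryExecutionId=query_id)
    for row in results["ResultSet"]["Rows"]:
        print([col.get("VarCharValue") for col in row["Data"]])
```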
For a visual approach to data exploration, Amazon also has a BI tool called QuickSight, very similar to Tableau or Power BI, that lets you create dashboards and graphs to describe your data. QuickSight can connect to multiple data sources such as Athena, Redshift, RDS databases, or even on-premises databases.
I hope you enjoyed our post about Data Engineering and Exploratory Data Analysis, two areas that together make up approximately 45% of the AWS Machine Learning Certification exam. In our next post, Luiz will explore the third area of the test: Modeling.
Featured Image by Christopher Gower on Unsplash.