This Big Data course introduces the fundamental concepts and architectures behind distributed Big Data environments, along with the tools used to manage large-scale data.
It covers the Hadoop ecosystem, including HDFS and the MapReduce programming model, so that students can simulate how Hadoop handles distributed storage and parallel data processing.
The course also explores Apache Spark for efficient in-memory computation, with a focus on micro-batch processing using Spark Streaming.
Additionally, students apply machine learning techniques, particularly Decision Tree models, using Spark ML to analyze and process large datasets in distributed environments.
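
As a rough illustration of the MapReduce programming model named above, the following single-process Python sketch mimics the map, shuffle, and reduce phases with a word count. It is not Hadoop code and not taken from the course material; the function names (`map_phase`, `shuffle_phase`, `reduce_phase`, `run_job`) and the sample input are assumptions chosen for illustration only.

```python
from collections import defaultdict
from itertools import chain


def map_phase(line):
    """Mapper (hypothetical name): emit a (word, 1) pair for each word in one line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle_phase(pairs):
    """Shuffle: group emitted values by key, as a framework would between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reducer (hypothetical name): sum the counts collected for one word."""
    return key, sum(values)


def run_job(lines):
    """Run the three phases in sequence on an in-memory list of lines."""
    mapped = chain.from_iterable(map_phase(line) for line in lines)
    grouped = shuffle_phase(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    # Made-up sample input, standing in for lines read from a distributed file.
    sample = ["big data big ideas", "data moves fast"]
    print(run_job(sample))
```

In a real cluster the mappers and reducers would run in parallel on different nodes and the shuffle would move data across the network; here everything runs in one process so that the data flow between the phases is easy to follow.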