This week’s assignment involves working with the actor_films dataset. Your task is to construct a series of SQL queries and table definitions that will allow us to model the actor_films dataset in a way that facilitates efficient analysis. This involves creating new tables, defining data types, and writing queries to populate these tables with data from the actor_films dataset
The actor_films dataset contains the following fields:
actor: The name of the actor.actorid: A unique identifier for each actor.film: The name of the film.year: The year the film was released.votes: The number of votes the film received.rating: The rating of the film.filmid: A unique identifier for each film.The primary key for this dataset is (actor_id, film_id).
actors table: Create a DDL for an actors table with the following fields:
films: An array of struct with the following fields:
quality_class: This field represents an actor’s performance quality, determined by the average rating of movies of their most recent year. It’s categorized as follows:
star: Average rating > 8.good: Average rating > 7 and ≤ 8.average: Average rating > 6 and ≤ 7.bad: Average rating ≤ 6.is_active: A BOOLEAN field that indicates whether an actor is currently active in the film industry (i.e., making films this year).Cumulative table generation query: Write a query that populates the actors table one year at a time.
actors_history_scd table: Create a DDL for an actors_history_scd table with the following features:
start_date and end_date fields).quality_class and is_active status for each actor in the actors table.Backfill query for actors_history_scd: Write a “backfill” query that can populate the entire actors_history_scd table in a single query.
actors_history_scd: Write an “incremental” query that combines the previous year’s SCD data with new incoming data from the actors table.