An update on DataScience Hacks

I appreciate that you stop by to view my blog posts and my code. Some of you even send me a thank you note, which is lovely to see. This post is about my absence from blogging over the past few months.

It takes a few days of full-time work to publish a decent blog post (I am no expert like so many others). Given my current work commitments, among other things, I have not been able to find the time to publish posts recently (not to mention, my laptop crashed).

I miss talking to you, readers; this blog is my social media. I hope to touch base soon, but I might be inclined to take a different direction.

In my humble opinion, technology is growing faster than we can catch up with it. There is so much new research on machine learning and mathematical modeling happening every day; however, the ability to deploy these models as a service remains somewhat of a challenge. Project design, requirements definition, enterprise architecture and so on may not all click, and that can make the mathematical models look bad and unreliable.

I do not wish to digress; things could be argued both ways, based on one's professional experience. But the ability to ideate, conceptualize, reason, code, experiment and publish a blog post is an intellectually rewarding exercise. I thank you for reading, commenting and even merely viewing, from the bottom of my heart.

See you soon. Live long and prosper.


Python Driver, PyMongo and other MongoDB Operations

MongoDB has a mongo shell and drivers for connecting from Java, Python and other languages.

To start using MongoDB,

sudo service mongod start

For importing and exporting data, use the mongoimport and mongoexport commands at the terminal. For instance, to import a file into a collection (a collection is the equivalent of a table):

mongoimport \
--db <name of database> \
--collection <name of collection> \
--file <input file>

To create and/or use a database, simply

use <name of database>

A collection does not need a separate use command; it is created automatically the first time you insert a document into it, or explicitly with

db.createCollection("<name of collection>")

once the database is selected.

To list down databases present,

show dbs

To list down the collections under the current database,

show collections

A basic query would be like:

db.<collection_name>.find()

db.<collection_name>.find().pretty() # to display results in an indented, readable format

More querying with specific requests:

db.twitter1.find({"retweeted":false}).pretty()

db.twitter1.findOne() # to return just one document (find_one() is the PyMongo equivalent)

To extract just one field from each JSON document,

db.twitter1.find({},{text:1})

db.twitter1.find({},{text:1,_id:0}) # to remove id

db.twitter1.find({},{text:1,_id:0}).count() # to get count

To access subfields of JSON documents in our query,

db.twitter1.find({},{"user.location":1,text:1,_id:0})

The Python driver I worked with is pretty cool. Just do a pip install for pymongo, the Python MongoDB driver, and you can either work with Python interactively or write a script.
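
Here is a minimal sketch of the same kind of queries from Python with pymongo, assuming a local MongoDB instance and the twitter1 collection used above (the database name test is my own assumption):

from pymongo import MongoClient

# connect to a local MongoDB instance (adjust host/port for your setup)
client = MongoClient("mongodb://localhost:27017/")
db = client["test"]            # database name is an assumption
collection = db["twitter1"]    # the collection queried in the shell examples above

# equivalent of db.twitter1.find({}, {text: 1, _id: 0}) in the mongo shell
for doc in collection.find({}, {"text": 1, "_id": 0}):
    print(doc)

# equivalent of db.twitter1.findOne() in the mongo shell
print(collection.find_one())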

 

Some thoughts on MongoDB

Let's talk about MongoDB. It is a NoSQL database, which is a controversial topic these days, so I should start with a disclaimer.

NoSQL databases cannot and should not be compared with SQL databases; it is like comparing apples and oranges. Imagine you have a social media website with data about users (members): profile descriptions, messaging history, pictures, videos and user generated content (status updates and so on).

In a pure SQL environment, you may use different databases and tables to store the various types of data and perform inner or outer joins depending on what you need: separate tables for user information and status updates, and a different method of storing pictures and videos. In the NoSQL world, you can scale the database horizontally and store everything under one roof. Imagine that each row (each document) represents the record for one member, with fields for user information, status updates, images and so on, so the developer or data scientist no longer has to do joins. Given how much user and machine generated data is produced, this can be a good option, since the schema is dynamic and more fields can be added to a record at any time.
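
To make that concrete, here is a hedged sketch of what a single denormalized member document might look like, written as a Python dictionary; all field names and values are made up for illustration:

member = {
    "_id": 1001,
    "name": "Jane Doe",
    "profile": {"location": "Toronto", "bio": "Coffee and code"},
    "status_updates": ["Hello world", "Back to blogging soon"],
    "pictures": ["profile.jpg", "vacation.png"],
}
# everything about the member lives in one document, so no joins are needed;
# a new field can be added to any document later without changing a schema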

MongoDB is a document-store NoSQL database that keeps records as key:value pairs. In a MongoDB database, a table is called a collection, a row is called a document and a column is called a field. I am using MongoDB to store a bunch of JSON documents.

You may follow these instructions to set it up; they are easy to clean up too.

 

Some notes on hadoop cluster

One-way passwordless SSH from the master node to the worker nodes:

1. Generate Key: ssh-keygen -t rsa
2. Create folder in worker node: ssh [user]@255.255.255.255 mkdir -p .ssh
3. Copy key to worker node: ssh-copy-id -i ~/.ssh/id_rsa.pub [user]@255.255.255.255
4. Enable permissions to worker: ssh [user]@255.255.255.255 "chmod 700 .ssh; chmod 640 .ssh/authorized_keys"

700: user can read, write and execute; group and others have no permissions.
640: user can read and write; group can read; others have no permissions.

Configuration Files: (For minimal configuration)

The core-site.xml configuration should be the same on the master and all worker nodes; the namenode URL (fs.default.name) should point to the master node only.
mapred-site.xml should be edited only on the master node.
yarn-site.xml should have the same configuration on the master and worker nodes.
The "slaves" file should be updated only on the master node.
hdfs-site.xml holds HDFS-specific settings such as the replication factor and the namenode/datanode storage directories.

Disable IPV6:

IPv4 (Internet Protocol version 4) addresses follow a dotted-decimal pattern such as 192.168.0.1.

IPv6 (Internet Protocol version 6) addresses follow a colon-separated hexadecimal pattern such as 2001:0db8:85a3:0000:0000:8a2e:0370:7334.

Hadoop communicates over IPv4 within its cluster; it does not support IPv6 at the moment.

developing text processing data products: part II

I want to talk a little bit about the challenges, something I wanted to share. The challenge when working with text is that text data is usually a complex bunch of words and sentences, whose meaning and interpretation can differ based on geographic location. Not to mention the formatting and special characters that we have to deal with.

Although text can be extracted from the web in XML and JSON formats, several stages of cleaning and processing still need to be done before we even begin to analyze it. Skipping this step, we end up with "garbage in, garbage out".
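
As a small illustration of that cleaning stage, here is a hedged sketch in Python; the example string and the cleaning rules are made up, and the right rules always depend on the task at hand:

import re

raw = "Check out http://example.com!!! #awesome :)"
clean = re.sub(r"[^a-z0-9\s]", " ", raw.lower())    # strip special characters
clean = re.sub(r"\s+", " ", clean).strip()          # collapse repeated whitespace
print(clean)    # "check out http example com awesome"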

After data cleansing, one type of analysis we can perform is part-of-speech tagging. Essentially, part-of-speech tagging breaks a sentence down into nouns, verbs, adverbs, conjunctions and so on. Once the PoS analysis is done, we can use the output for various applications, such as word counts to find the most frequently occurring noun, or investigating the sentiment of a text.
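
As an example of what PoS tagging output looks like, here is a minimal sketch using NLTK, one of several packages that can do this; the downloads are only needed once:

import nltk

# one-time downloads for the tokenizer and the PoS tagger models
nltk.download("punkt")
nltk.download("averaged_perceptron_tagger")

sentence = "The quick brown fox jumps over the lazy dog"
tokens = nltk.word_tokenize(sentence)
print(nltk.pos_tag(tokens))   # a list of (word, tag) pairs, e.g. ('The', 'DT') for a determiner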

Extracting the information is one challenge. Finding the right software or package is another. I am not trying to market any package or API here, but it appears that every package has its own style of interpretation and provides different output. Let's look at some of these variations soon. More later.

developing text processing data products: part I

Folks, this is going to be a series of information pieces (more than one blog post about the same topic) about text processing. In this series, I intend to discuss some of my experiences and also take this moment to organize the discussion on my blog. In the past, I touched upon some text processing recipes purely from an application point of view; however, I have spent more of my career working with text content in automation and analytics, and I owe it to myself to write more on text processing. Without further ado, let's jump into the series.

Text processing is extracting information that can be used for decision making from a text, say a book, paper or article, WITHOUT HAVING TO READ IT. I used to read the newspaper to find out the weather forecast (five years ago), but these days if we need information about something we type it into a search engine to get results (and ads). But imagine someone who needs information on a day-to-day basis for decision making, where he or she has to read many articles, news stories or books in a short period of time. Information is power, but timing is important too. Hence the need for text processing data products.

Some basic things that we can do with text data:

Crawling: extracting data/content from a website using a crawler.

Tokenization: the process of splitting a string into smaller pieces (tokens) stored in an array, typically based on the spaces between words.

Stemming: reducing a word to its root (stem) so that the root can be used to match all of its variations.

Stop word removal: straightforward; removing frequently occurring words like a, an and the.

Parsing: the process of breaking a sentence down into a set of phrases and further breaking each phrase into nouns, verbs, adjectives and so on. Also called part-of-speech tagging. Cool topic. A short sketch of some of these steps follows this list.
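
Here is a hedged sketch of the first few steps with NLTK; other packages such as spaCy offer similar functionality, and the example text is made up:

import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer

# one-time downloads for the tokenizer and the stop word list
nltk.download("punkt")
nltk.download("stopwords")

text = "The runners were running quickly through the park"
tokens = nltk.word_tokenize(text)                         # tokenization
stemmer = PorterStemmer()
stems = [stemmer.stem(t) for t in tokens]                 # stemming: "running" -> "run"
stop = set(stopwords.words("english"))
filtered = [t for t in tokens if t.lower() not in stop]   # stop word removal
print(stems)
print(filtered)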

These are just my experiences and my knowledge, and as always I try to write in a way that anyone can read and understand. More later.

Crawling Twitter content using R

Over the course of time, mining tweets has been made simple and easy, and it now requires fewer dependency packages.

Tweet data can be used in many ways, ranging from extracting consumer sentiment about an entity (a brand) to finding your next job. Although there is a set limit of 3000 tweets, there are workaround strategies available (dynamically changing the client auth information once the current client has exceeded the limit).

I chose R because it is easy to do post-extraction analysis, relatively easy to set up a database so extracted tweets can be stored in a table, and easy to use machine learning packages for any analysis. Python is also an excellent choice.

To extract tweets, you need to log in to https://dev.twitter.com/ and then click on “Create New App” (under managing your apps).

Follow the instructions and fill in the information to get yourself an API key, API secret and access token. Execute the following code and you are all set to extract tweets!

install.packages(c("devtools", "rjson", "bit64", "httr", "twitteR"))

library(devtools)

library(twitteR)

APIkey <- "xxxxxxxxxxxxxxxxxxxxxxx"
APIsecret <- "xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx"
accesstoken <- "23423432-xxxxxxxxxxxxxxxxxxxxxxxxxx"
accesstokensecret <- "xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx"

setup_twitter_oauth(APIkey, APIsecret, accesstoken, accesstokensecret)


Read the twitteR manual to learn about various methods like searchTwitter() and play around a bit. Here is what I did:

[screenshot of an example twitteR session]

 

tmux, a time saving linux tool

Tmux is a Linux-based tool called a terminal multiplexer that enables many terminals to be accessed and controlled from a single terminal. Say you wanted to install a program on all the Linux machines in your network; using tmux, you could access the terminals of all your machines and install the package simultaneously.

This tool has been very beneficial to me, and I thought I would share it with my readers.

You can install it with

sudo apt-get install tmux

and when you enter

tmux

from the command line, you will get a screen like this:

[screenshot of a new tmux session]

Caveat: you need to have openssh-server installed on every machine; however, you do not need to install tmux on every machine.

Pictures speak more than words, and the following pictures show how I installed Java on multiple machines.

tmux new -s 'new session' 'ssh user@host_or_ipaddress' \; split-window -h 'ssh user@host_or_ipaddress' \; split-window -h 'ssh user@host_or_ipaddress' \; select-layout even-vertical

Once you are in these panes, press ctrl-b and then ":" (use the shift key) to open the tmux command prompt, and enter

setw synchronize-panes on

[screenshot of the tmux panes with synchronized input]

and there you go. The following picture shows how I installed Java 8 on 6 machines in one go:

[screenshot of Java 8 being installed on six machines simultaneously]

More information can be found here.

full stack development

A full stack developer is someone who has the knowledge and ability to develop and work on all layers of web application development. When a user clicks on a web page to perform some action, the web page communicates with a server, which accesses a database and retrieves information to be sent back to the user. Easier said than done.

Web application development consists of many layers. You have the server that resides on an operating system, a database where the data gets stored, a front end which the user sees when he or she opens a website, and a back-end layer that communicates between the front end and the server. There is also the work of gathering requirements, meeting with clients, creating documentation and project management.

For instance, consider a web page(s) where:

The front end is developed using HTML, CSS and JavaScript; the back end is developed using JavaScript, PHP or Python; the data is stored in MySQL or MongoDB; and it all runs on an Apache server.
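
As a hedged sketch of how those layers talk to each other, here is a tiny back end in Python using Flask and PyMongo; the framework choice, route and field names are all my own assumptions for illustration, not a prescription:

from flask import Flask, jsonify
from pymongo import MongoClient

app = Flask(__name__)
db = MongoClient("mongodb://localhost:27017/")["myapp"]   # database name is illustrative

@app.route("/users/<name>")
def get_user(name):
    # the front end calls this URL; the back end fetches a document
    # from the database and returns it to the browser as JSON
    user = db.users.find_one({"name": name}, {"_id": 0})
    return jsonify(user or {})   # empty object if the user is not found

if __name__ == "__main__":
    app.run()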

Keep watching this space for full stack development posts.