Tasks are the most basic unit of work in Prefect. A task performs a certain function or operation, and the way to define a task in Prefect is with the @task decorator. A flow is a combination of several tasks in which one can specify a certain order of execution. A flow will contain one or more tasks… Continue reading Prefect flows and tasks
Category: Cloud Architecture
Prefect Installation: Part I
Prefect has two editions — open source and Prefect Cloud. To get going with Prefect open source, one can install Prefect with pip. It is advisable to create a separate conda environment or virtual environment first. To create a virtual environment, run virtualenv {{name-of-the-virtual-environment}}; to activate it, run source {{name-of-the-virtual-environment}}/bin/activate. One may choose to install… Continue reading Prefect Installation: Part I
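The steps above might look like this in a terminal (the environment name is illustrative):

```shell
# create and activate a virtual environment
virtualenv prefect-env
source prefect-env/bin/activate

# install open-source Prefect from PyPI
pip install prefect

# confirm the installation
prefect version
```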
dbt documentation
dbt enables publishing technical documentation as a website. One can add detailed descriptions of dbt models, any relevant tests that have been included, information about sources, table columns, data types, etc. Once this information is entered, dbt generates documentation for the given dbt project. The details are entered in the YAML file of dbt… Continue reading dbt documentation
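As an illustration, the descriptions live in a model's YAML file (the model and column names here are hypothetical); running `dbt docs generate` then builds the documentation site from them:

```yaml
version: 2

models:
  - name: stg_orders        # hypothetical model name
    description: "One row per order, cleaned from the raw source."
    columns:
      - name: order_id
        description: "Primary key of the order."
        data_type: integer
        tests:
          - unique
          - not_null
```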
dbt tests
dbt enables testing the models. There are two ways of defining tests: singular and generic. A singular test is a simple SQL query saved as a SQL file under the tests directory; it returns the failing rows. A generic test is a parameterized SQL query that accepts arguments within the SQL… Continue reading dbt tests
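For instance, a singular test could be a file like this (the file name, model, and column are hypothetical); the test fails if the query returns any rows:

```sql
-- tests/assert_no_negative_totals.sql (hypothetical singular test)
-- dbt flags this test as failed if any rows come back
select order_id, total
from {{ ref('stg_orders') }}
where total < 0
```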
dbt seeds
dbt seeds are flat files that can be added to a dbt project, inside the seeds folder. Seeds can be version controlled. A common use case is to add crosswalk or validation tables within the seeds folder. For instance, S.No, ShortForm, LongForm 1, NH, New Hampshire 2, OH,… Continue reading dbt seeds
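A seed is just a CSV file checked into the seeds folder; a hypothetical crosswalk file such as seeds/us_states.csv, loaded with `dbt seed`, might look like:

```csv
S.No,ShortForm,LongForm
1,NH,New Hampshire
2,OH,Ohio
3,TX,Texas
```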
sourcing in data build tool
dbt sources are where the raw data sources are declared. Sources are declared in a YAML file, and a project can have many sources. One can also write tests against the data being sourced. The YAML file enables one to provide detailed information about each source so that it appears in the generated documentation. For instance, the source YAML… Continue reading sourcing in data build tool
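For example, a sources YAML (the source, schema, and table names here are hypothetical) declares the raw tables, documents them, and can attach tests; models then refer to them with the source() function:

```yaml
version: 2

sources:
  - name: shop              # hypothetical source name
    database: raw
    schema: shop_raw
    description: "Raw tables replicated from the shop database."
    tables:
      - name: orders
        description: "One row per order as loaded by replication."
        columns:
          - name: order_id
            tests:
              - not_null
```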
dbt models: Part II
We looked at SQL models. Similarly, a Python model is basically a Python script that loads the data into a data frame, performs various transformations using specialist packages like pandas or PySpark, and produces a resulting data frame. Example of a Python model: import pandas as pd def function1(param1, param2): df = … … … resultant_df = …… Continue reading dbt models: Part II
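A runnable sketch of the shape of a dbt Python model, assuming a hypothetical upstream model stg_customers with first_name and last_name columns: dbt calls the model(dbt, session) function and materializes the data frame it returns.

```python
import pandas as pd

def add_full_name(df: pd.DataFrame) -> pd.DataFrame:
    # hypothetical transformation: derive a full_name column
    out = df.copy()
    out["full_name"] = out["first_name"] + " " + out["last_name"]
    return out

def model(dbt, session):
    # dbt hands upstream models to Python models as data frames via dbt.ref()
    customers = dbt.ref("stg_customers")  # hypothetical model name
    return add_full_name(customers)
```

Keeping the transformation in a plain function like add_full_name makes it easy to test outside of dbt.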
dbt models: Part I
A dbt model is, at its core, a SQL query. There are two types of dbt models: SQL models and Python models. "dbt model" is not a reference to relational or dimensional data models; dbt models are code. The SQL model is basically a SQL file with exactly one SQL statement. It… Continue reading dbt models: Part I
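For instance, a SQL model is one file containing one select statement (the file and source names here are hypothetical); dbt turns it into a table or view when the project runs:

```sql
-- models/stg_orders.sql (hypothetical model)
select
    order_id,
    customer_id,
    order_date
from {{ source('shop', 'orders') }}
```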
Docker Networking
By default, Docker uses a type of network called a bridge network. Within a given Docker host, all the containers can easily communicate with each other. Their IP addresses usually start with 172.17.xxx.xxx. Port mapping is essential when a web application is listening on a port, because the bridge network provides an internal… Continue reading Docker Networking
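As a sketch, publishing a port with -p maps a host port onto the container's internal bridge-network port (the image and port numbers here are illustrative):

```shell
# map host port 8080 to the container's port 80
docker run -d --name web -p 8080:80 nginx

# inspect the container's bridge-network IP (typically 172.17.x.x)
docker inspect --format '{{ .NetworkSettings.IPAddress }}' web
```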
data build tool materialization: Part II
We looked at tables and views. The other two materializations are incremental and ephemeral. Incremental models implement a very elegant solution: during the first run, the table is fully populated in the data store; on any subsequent run, only the new rows are inserted, and existing rows that need any changes are changed… Continue reading data build tool materialization: Part II
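An incremental model might be sketched like this (the model and column names are hypothetical): on the first run the where clause is skipped and the whole table is built; on later runs is_incremental() is true and only newer rows are processed:

```sql
{{ config(materialized='incremental', unique_key='order_id') }}

select *
from {{ ref('stg_orders') }}

{% if is_incremental() %}
  -- on incremental runs, only pick up rows newer than the target table
  where updated_at > (select max(updated_at) from {{ this }})
{% endif %}
```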