by Team AHT | Jan 29, 2025 | How To
This post is a collection of handy tricks and snippets. Passing Parameters in Automation of Scripts using Python: Python provides several ways to pass parameters when automating scripts, mimicking SAS macro variables, macro modules, and macro scripting. Here are some...
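As a minimal sketch of one such way, the standard-library `argparse` module can read named parameters from the command line, much like setting macro variables before a run. The parameter names `--run-date` and `--env` and the helper `parse_params` are hypothetical, chosen only for illustration.

```python
import argparse


def parse_params(argv=None):
    """Parse script parameters; argv=None means "read from sys.argv"."""
    parser = argparse.ArgumentParser(description="Hypothetical parameterized script")
    # Hypothetical parameters, playing the role of macro variables
    parser.add_argument("--run-date", required=True, help="processing date, e.g. 2025-01-29")
    parser.add_argument("--env", default="dev", choices=["dev", "prod"])
    return parser.parse_args(argv)


# Passing a list here simulates: python script.py --run-date 2025-01-29
params = parse_params(["--run-date", "2025-01-29"])
print(params.run_date, params.env)
```

Passing an explicit list to `parse_params` keeps the sketch testable without touching the real command line.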
by Team AHT | Jan 7, 2025 | Pyspark, Python
#1. Create a sample dataframe. data = [ ("Sam", "Sales", 50000), ("Ram", "Sales", 60000), ("Dan", "Sales", 70000), ("Gam", "Marketing", 40000),...
by Team AHT | Jan 4, 2025 | SQL
What is Indexing? Indexing is a data structure technique that allows the database to quickly locate and access specific data. It’s similar to the index at the back of a book, which helps you find specific pages quickly. How Indexing Works Index Creation: The...
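To see the effect of an index in practice, here is a small sketch using Python's built-in `sqlite3` module. SQLite is only a stand-in for whichever database you use, and the table `employees`, the index `idx_employees_name`, and the helper `plan` are made-up names for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employees (id INTEGER PRIMARY KEY, name TEXT, dept TEXT)")
conn.executemany(
    "INSERT INTO employees (name, dept) VALUES (?, ?)",
    [(f"emp{i}", "Sales" if i % 2 else "Marketing") for i in range(1000)],
)


def plan(sql):
    """Return SQLite's query plan description for a statement as one string."""
    # The last column of each EXPLAIN QUERY PLAN row holds the plan text
    return " ".join(row[-1] for row in conn.execute("EXPLAIN QUERY PLAN " + sql))


query = "SELECT id FROM employees WHERE name = 'emp42'"

before = plan(query)  # no index on name yet: the table is scanned
conn.execute("CREATE INDEX idx_employees_name ON employees (name)")
after = plan(query)   # the planner can now search the index instead

print(before)
print(after)
```

The exact plan wording differs between SQLite versions, so the sketch prints the plans rather than promising a specific string.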
by Team AHT | Dec 28, 2024 | SQL
Spark SQL Operators Cheatsheet
1. Arithmetic Operators
Operator | Syntax | Description | Example
+ | a + b | Adds two values | SELECT 5 + 3;
- | a - b | Subtracts one value from another | SELECT 5 - 3;
* | a * b | Multiplies two values | SELECT 5 * 3;
/ | a / b | Divides one value by another | SELECT 6 /...
by Team AHT | Dec 28, 2024 | How To
Syntax Rules for Pseudocode. Natural Language: Use simple and clear natural language to describe steps. Keywords: Use standard control flow keywords such as IF, ELSE, ENDIF; FOR, WHILE, ENDWHILE; FUNCTION, CALL; INPUT, OUTPUT. Indentation: Indent blocks within loops or...
by Team AHT | Dec 8, 2024 | Pyspark
A quick reference for date manipulation in PySpark:
Function | Description | Works On | Example (Spark SQL) | Example (DataFrame API)
to_date | Converts string to date. | String | TO_DATE('2024-01-15', 'yyyy-MM-dd') | to_date(col("date_str"),...
by Team AHT | Dec 5, 2024 | Pyspark
Window functions in PySpark allow you to perform operations on a subset of your data using a “window” that defines a range of rows. These functions are similar to SQL window functions and are useful for tasks like ranking, cumulative sums, and moving...
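To make the idea of a "window" concrete, here is a plain-Python sketch (not the PySpark API) of what a window partitioned by department and ordered by salary computes: a row number and a running total per partition. The sample rows and the helper `windowed` are illustrative assumptions.

```python
from itertools import groupby
from operator import itemgetter

# Illustrative rows: (name, department, salary)
rows = [
    ("Sam", "Sales", 50000),
    ("Ram", "Sales", 60000),
    ("Dan", "Sales", 70000),
    ("Gam", "Marketing", 40000),
]


def windowed(rows):
    """Add a per-department row number and running salary total to each row."""
    out = []
    # "PARTITION BY department": groupby needs its input sorted by the key
    by_dept = sorted(rows, key=itemgetter(1))
    for dept, group in groupby(by_dept, key=itemgetter(1)):
        # "ORDER BY salary DESC" inside each partition
        ordered = sorted(group, key=itemgetter(2), reverse=True)
        running = 0
        for row_number, (name, _, salary) in enumerate(ordered, start=1):
            running += salary  # cumulative sum over the rows seen so far
            out.append((name, dept, salary, row_number, running))
    return out


for row in windowed(rows):
    print(row)
```

Every input row is kept in the output, which is the key difference from a plain group-by aggregation.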
by Team AHT | Nov 25, 2024 | SQL
For a better understanding of Spark SQL window functions and their best use cases, refer to our post "Window functions in Oracle PL/SQL and Hive explained and compared with examples". Window functions in Spark SQL are powerful tools that allow you to perform calculations across a...
by Team AHT | Nov 16, 2024 | Pyspark
PySpark Architecture Cheat Sheet
1. Core Components of PySpark
Component | Description | Key Features
Spark Core | The foundational Spark component for scheduling, memory management, and fault tolerance. | Task scheduling, data partitioning, RDD APIs.
Spark SQL | Enables interaction...
by Team AHT | Nov 7, 2024 | News
Yup! Scientists find a ‘Unique’ Black Hole that is hungrier than ever in the Universe! Scientists have observed a fascinating phenomenon involving a supermassive black hole, AT2022dsb, which appears to be devouring a star in a “tidal disruption event”...
by Team AHT | Nov 7, 2024 | SQL
Here’s an enhanced Spark SQL cheatsheet with additional details, covering join types, union types, and set operations like EXCEPT and INTERSECT, along with options for table management (DML operations like UPDATE, INSERT, DELETE, etc.). This comprehensive sheet...
by Team AHT | Nov 7, 2024 | SQL
Here’s a categorized Spark SQL function reference, which organizes common Spark SQL functions by functionality. This can help with selecting the right function based on the operation you want to perform.
1. Aggregate Functions
Function | Description | Example
avg() | Calculates...
by Team AHT | Nov 6, 2024 | SQL
CRUD stands for Create, Read, Update, and Delete: the four basic operations that any database or other persistent storage application needs to perform in order to manage data....
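The four operations map directly onto SQL statements. Below is a minimal sketch using Python's built-in `sqlite3` module with an in-memory database; the `users` table and its columns are hypothetical names used only for this example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")

# Create: insert a new row and remember its generated id
cur = conn.execute("INSERT INTO users (name) VALUES (?)", ("Sam",))
user_id = cur.lastrowid

# Read: fetch the row back
created = conn.execute("SELECT name FROM users WHERE id = ?", (user_id,)).fetchone()

# Update: change the stored value
conn.execute("UPDATE users SET name = ? WHERE id = ?", ("Ram", user_id))
updated = conn.execute("SELECT name FROM users WHERE id = ?", (user_id,)).fetchone()

# Delete: remove the row
conn.execute("DELETE FROM users WHERE id = ?", (user_id,))
remaining = conn.execute("SELECT COUNT(*) FROM users").fetchone()[0]

print(created, updated, remaining)
```

The `?` placeholders let the driver bind values safely instead of building SQL strings by hand.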
by Team AHT | Nov 2, 2024 | Tutorials
Comparative overview of partitions, bucketing, segmentation, and broadcasting in PySpark, Spark SQL, and Hive QL in tabular form, along with examples. Here’s a comparative overview of partitions, bucketing, segmentation, and broadcasting in PySpark, Spark SQL,...
by Team AHT | Nov 1, 2024 | SQL
Data Definition Language (DDL) – to define and modify the structure of a database. Data Manipulation Language (DML) – to access, manipulate, and modify data in a database. Data Control Language (DCL) – to control user access to the data in the database...
by Team AHT | Oct 27, 2024 | Python
The pandas Series is a one-dimensional array-like data structure that can store data of any type, including integers, floats, strings, or even Python objects. Each element in a Series is associated with a unique index label, making it easy to perform data retrieval...
by Team AHT | Oct 24, 2024 | Python
This tutorial covers a wide range of pandas operations and advanced concepts with examples that are practical and useful in real-world scenarios. The key topics include: Creating DataFrames and Series from various sources. Checking and changing data types. Looping...
by Team AHT | Oct 22, 2024 | Pyspark
I have divided a big PySpark script into many steps by using steps1=''' some code''' through steps7. I want to execute all these steps one after another, and if needed, some steps can be skipped. If any step fails, then the next...
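One possible way to approach this, sketched in plain Python with no Spark dependency: keep the step strings in a dict, run them in order with `exec`, and record a status for each. The helper `run_steps`, its `skip` and `stop_on_failure` parameters, and the toy step bodies are all assumptions for illustration; since the question's failure policy is cut off above, the sketch leaves it configurable.

```python
def run_steps(steps, skip=(), stop_on_failure=True):
    """Run code strings in order; return a dict of step name -> status."""
    status = {}
    namespace = {}  # shared, so later steps see variables set by earlier ones
    for name, code in steps.items():
        if name in skip:
            status[name] = "skipped"
            continue
        try:
            exec(code, namespace)
            status[name] = "ok"
        except Exception as exc:
            status[name] = f"failed: {exc}"
            if stop_on_failure:
                break  # assumed policy: do not run the remaining steps
    return status


# Toy stand-ins for the real step strings
steps = {
    "step1": "x = 1",
    "step2": "print('optional step')",
    "step3": "y = x * 10",
    "step4": "raise ValueError('boom')",
    "step5": "z = y + 1",
}

result = run_steps(steps, skip={"step2"})
print(result)
```

Only run strings you trust through `exec`; for larger jobs, plain functions are usually easier to test than code strings.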
by Team AHT | Oct 22, 2024 | Pyspark
How do you code a complete ETL job in PySpark using only the PySpark SQL API, not the DataFrame-specific API? Here’s an example of a complete ETL (Extract, Transform, Load) job using PySpark SQL API: from pyspark.sql import SparkSession # Create SparkSession spark =...
by Team AHT | Oct 21, 2024 | Pyspark
PySpark supports various control statements to manage the flow of your Spark applications. PySpark supports using Python’s if-elif-else statements, but with limitations. Supported Usage Conditional statements within PySpark scripts. Controlling flow of Spark...
by Team AHT | Oct 20, 2024 | Pyspark
When working with PySpark, there are several common issues that developers face. These issues can arise from different aspects such as memory management, performance bottlenecks, data skewness, configurations, and resource contention. Here’s a guide on troubleshooting...
by Team AHT | Oct 11, 2024 | Pyspark
Q1. We are working with large datasets in PySpark, such as joining a 30 GB table with a 1 TB table or running various transformations on 30 GB of data. We have a limit of 100 cores per user. What is the best configuration and optimization strategy to use in PySpark? Will...
by Team AHT | Oct 11, 2024 | Pyspark
To determine the optimal number of CPU cores, executors, and executor memory for a PySpark job, several factors need to be considered, including the size and complexity of the job, the resources available in the cluster, and the nature of the data being processed....
by Team AHT | Oct 2, 2024 | SQL
Partitioning in SQL, HiveQL, and Spark SQL is a technique used to divide large tables into smaller, more manageable pieces or partitions. These partitions are based on a column (or multiple columns) and help improve query performance, especially when dealing with...
by Team AHT | Oct 2, 2024 | SAS, SQL
PIVOT Clause in Spark SQL, MySQL, Oracle PL/SQL, or HiveQL. The PIVOT clause is a powerful tool in SQL that allows you to rotate rows into columns, making it easier to analyze and report data. Here’s how to use the PIVOT clause in Spark SQL, MySQL, Oracle...
by Team AHT | Sep 6, 2024 | SQL
A SQL query flows through the Oracle engine in the following steps: Step 1: Parsing The SQL query is parsed to check syntax and semantics. The parser breaks the query into smaller components, such as keywords, identifiers, and literals. Step 2: Optimization The parsed...
by Team AHT | Aug 29, 2024 | Pyspark
PySpark is a powerful Python API for Apache Spark, a distributed computing framework that enables large-scale data processing. Spark History Spark was initially started by Matei Zaharia at UC Berkeley’s AMPLab in 2009, and open sourced in 2010 under a BSD...
by Team AHT | Aug 26, 2024 | Pyspark
Deploying a PySpark job can be done in various ways depending on your infrastructure, use case, and scheduling needs. Below are the different deployment methods available, including details on how to use them: 1. Running PySpark Jobs via PySpark Shell How it Works:...
by Team AHT | Aug 26, 2024 | Tutorials
Hive: a Data Warehouse Infrastructure. Hive is an open-source data warehouse infrastructure built on top of Hadoop for providing data summarization, query, and analysis. It allows users to query and manage large datasets residing in distributed storage using a SQL-like language...
by Team AHT | Aug 24, 2024 | Pyspark
In PySpark, jobs, stages, and tasks are fundamental concepts that define how Spark executes distributed data processing tasks across a cluster. Understanding these concepts will help you optimize your Spark jobs and debug issues more effectively. First, let us go...