MySQL or PySpark SQL Query: The Placement of Subqueries

Let’s list all the places where a subquery can be used in a MySQL, HiveQL, or PySpark SQL query: 1. In the SELECT clause. Subqueries can compute a value for each row. SELECT employee_id, (SELECT COUNT(*) FROM project_assignments pa WHERE pa.employee_id =...
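As a rough illustration of a scalar subquery in the SELECT clause, the sketch below runs one against an in-memory SQLite database from Python's standard library. SQLite is a stand-in here, not one of the engines named in the post. The `employees` table, the sample rows, and the correlation condition are assumptions made for this example, not the post's full query.

```python
import sqlite3

# In-memory SQLite database used as a stand-in engine (assumption).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE employees (employee_id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE project_assignments (employee_id INTEGER, project TEXT);
    -- Hypothetical sample data for illustration only.
    INSERT INTO employees VALUES (1, 'Asha'), (2, 'Ben');
    INSERT INTO project_assignments VALUES (1, 'alpha'), (1, 'beta');
""")

# The inner SELECT is evaluated once per outer row and returns one value.
rows = conn.execute("""
    SELECT e.employee_id,
           (SELECT COUNT(*)
              FROM project_assignments pa
             WHERE pa.employee_id = e.employee_id) AS project_count
      FROM employees e
     ORDER BY e.employee_id
""").fetchall()
conn.close()

print(rows)
```

An employee with no assignments still appears in the result, with a count of 0, because the subquery runs for every outer row.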

Lesson 3: Data Preprocessing

Data preprocessing is a crucial step in machine learning. It involves cleaning and transforming raw data into a format suitable for modeling. Data Cleaning: Data cleaning involves identifying and correcting errors, inconsistencies, and inaccuracies in the data such as...

Lesson 2: Python for Machine Learning

In this lesson, we’ll cover essential Python libraries for machine learning: NumPy, Pandas, Matplotlib, and Scikit-Learn. NumPy: NumPy is a library for numerical computations in Python. It provides support for arrays, matrices, and many mathematical functions....

Lesson 1: Introduction to AI and ML

What is AI? Artificial Intelligence (AI) is the simulation of human intelligence in machines that are programmed to think and learn like humans. AI systems can perform tasks such as visual perception, speech recognition, decision-making, and language translation. What...

I am Learning AI & ML

The posts in this series will cover the following topics: Introduction to AI and ML What is AI? What is Machine Learning? Types of Machine Learning Supervised Learning Unsupervised Learning Reinforcement Learning Key Terminologies Python for Machine Learning Introduction...

Data Structures in Python: Linked Lists

Linked lists are a fundamental linear data structure where elements (nodes) are not stored contiguously in memory. Each node contains data and a reference (pointer) to the next node in the list, forming a chain-like structure. This dynamic allocation offers advantages...
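A minimal sketch of the node-and-pointer structure described above, assuming a singly linked list. The class and method names (`Node`, `LinkedList`, `append`, `to_list`) are chosen for this illustration and are not taken from the post.

```python
class Node:
    """One element of the chain: a value plus a reference to the next node."""

    def __init__(self, data, next_node=None):
        self.data = data
        self.next = next_node


class LinkedList:
    """Singly linked list that only tracks its head node."""

    def __init__(self):
        self.head = None

    def append(self, data):
        """Walk to the last node and link a new node after it."""
        new_node = Node(data)
        if self.head is None:
            self.head = new_node
            return
        current = self.head
        while current.next is not None:
            current = current.next
        current.next = new_node

    def to_list(self):
        """Follow the next references and collect the values in order."""
        values = []
        current = self.head
        while current is not None:
            values.append(current.data)
            current = current.next
        return values


chain = LinkedList()
for value in (10, 20, 30):
    chain.append(value)

print(chain.to_list())
```

Appending here walks the whole chain, so it takes time proportional to the list length. Keeping a tail reference would make it constant time, at the cost of one more field to maintain.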

String Manipulation on PySpark DataFrames

String manipulation is a common task in data processing. PySpark provides a variety of built-in functions for manipulating string columns in DataFrames. Below, we explore some of the most useful string manipulation functions and demonstrate how to use them with...

Python Project Alert: Creating a Dynamic List of Variables

Let us go through the project requirement: 1. Create one or more dynamic lists of variables and save them in a dictionary, array, or other data structure for repeated use in Python. Variable names are dynamic, for example Month_202401 to...
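One possible way to approach this requirement is to keep the dynamically named lists as keys of a dictionary rather than creating real variables. The sketch below assumes a `Month_YYYYMM` naming pattern. The helper `month_keys` and the three-month range are hypothetical and chosen only for illustration.

```python
def month_keys(start_year, start_month, count):
    """Build `count` consecutive names of the form Month_YYYYMM."""
    keys = []
    year, month = start_year, start_month
    for _ in range(count):
        keys.append(f"Month_{year}{month:02d}")
        month += 1
        if month > 12:  # roll over into the next year
            year, month = year + 1, 1
    return keys


# Each dynamic name maps to its own independent list.
monthly_lists = {key: [] for key in month_keys(2024, 1, 3)}

# The lists can then be looked up and reused by name.
monthly_lists["Month_202402"].append("example value")

print(list(monthly_lists))
```

A dictionary keeps the names as ordinary data, so they can be generated, looped over, and passed around without resorting to `globals()` or `exec`.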

Spark SQL Join Types: Syntax Examples and Comparison

Spark SQL supports several types of joins, each suited to different use cases. Below is a detailed explanation of each join type, including syntax examples and comparisons. Types of Joins in Spark SQL: Inner Join, Left (Outer) Join, Right (Outer) Join, Full (Outer) Join...

Temporary Functions in PL/SQL vs Spark SQL

Temporary functions allow users to define functions that are session-specific and used to encapsulate reusable logic within a database session. While both PL/SQL and Spark SQL support the concept of user-defined functions, their implementation and usage differ...

Are DataFrames in PySpark Lazily Evaluated?

Yes, DataFrames in PySpark are lazily evaluated, similar to RDDs. Lazy evaluation is a key feature of Spark’s processing model, which helps optimize the execution of transformations and actions on large datasets. What is Lazy Evaluation? Lazy evaluation means...

BDL Ecosystem-HDFS and Hive Tables

Big Data Lake: Data Storage. HDFS is a scalable storage solution designed to handle massive datasets across clusters of machines. Hive tables provide a structured approach for querying and analyzing data stored in HDFS. Understanding how these components work together...