Quantcast
Channel: Mike – Mike Decuir
Viewing all articles
Browse latest Browse all 9

A Join in Many Languages

$
0
0

For this month’s T-SQL Tuesday, hosted by Sebastian Meine, we are asked to talk about joins.

Joins in T-SQL are really easy and efficient compared to most other languages.  My goal here is to take a simple example and show how such a join could be represented in SQL and several other languages to help appreciate how easy it is to get a highly optimized join query using SQL.

Example case:

We will be building a query to return the name of each product and the name of its associated Subcategory from the Product and ProductSubcategory tables in the AdventureWorks sample database.

For the purposes of this example, we will treat these tables either as tables or as csv files (with no header)as appropriate for the language.

SQL

1
2
3
4
SELECT P.Name AS ProductName, PSC.Name AS SubcategoryName
FROM Product AS P
LEFT JOIN ProductSubcategory AS PS
ON P.ProductSubcategoryID = PS.ProductSubcategoryID

Python

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
import csv
product = []
with open("Product.csv") as fh:
	reader = csv.DictReader(fh,delimiter="\t")
	for line in reader:
		product.append(line)
productsubcategory = []
with open("ProductSubcategory.csv") as fh:
	reader = csv.DictReader(fh,delimiter="\t")
	for line in reader:
		productsubcategory.append(line)
results = []
for prod in product:
	prodname = prod.get("Name")
	subcategoryid = prod.get("ProductSubcategoryID")
	subcatname = None
	if subcategoryID != '':
		for subcat in productsubcategory:
			if subcategoryid == subcat.get("ProductSubCategory"):
				subcatname = subcat.get("Name")
				break
	results.append({"ProductName":prodname,"SubcategoryName":subcatname})
print results

R

1
2
3
4
5
6
product<-read.csv('Product.csv',sep="\t",header=TRUE)
productsubcategory<-read.csv('ProductSubcategory.csv',sep="\t",header=TRUE)
results.full<-merge(product,productsubcategory,by.x="ProductSubcategoryID",by.y="ProductSubcategoryID",all.x=TRUE)
results.full.subset<-subset(results.full,select=c("Name.x","Name.y"))
names(results.full.subset)<-c("ProductName","SubcategoryName")
print(results.full.subset)

Powershell

1
2
3
4
5
6
7
8
9
10
11
12
$product = import-csv "Product.csv" -Delimiter "`t"
$productsubcategory = import-csv "ProductSubcategory.csv" -Delimiter "`t"
$product | foreach-object {
    $prod = $_
    if ($prod.ProductSubcategoryID -ne '') {
        $productsubcategory | 
            where-object {$_.ProductSubcategoryID -eq $prod.ProductSubcategoryID} |
            select-object @{Name='ProductName';Expression={$prod.name}},@{Name='SubcategoryName';Expression={$_.name}}
    } else {
        $prod | select-object @{Name='ProductName';Expression={$_.name}},@{Name='SubcategoryName';Expression={$null}}
    }
}

Pig

1
2
3
4
5
6
7
8
9
10
--For this example, we will treat the file as if the headers are removed and the newline character is converted to \n rather than \r\n
product = LOAD 'Product.csv' USING PigStorage('\t');
subcategory = LOAD 'ProductSubCategory.csv' USING PigStorage('\t');
product_project = FOREACH product GENERATE $1 AS ProductName, $18 AS ProductSubcategoryID;
subcategory_project = FOREACH subcategory GENERATE $0 AS ProductSubcategoryID,$2 AS SubcategoryName;
joined = JOIN product_project BY ProductSubcategoryID LEFT, subcategory_project BY ProductSubcategoryID;
joined_project = FOREACH joined GENERATE ProductName,SubcategoryName;
--Executing the below line will explain the logical execution plan to get to the final results
--EXPLAIN joined_project;
DUMP joined_project;

Conclusion:

Of the demonstrated languages, SQL is the only language that is fully optimized to handle relational algebra, such as joins.  In this space, SQL offers three major advantages over the other languages:

  • SQL requires the least code
  • SQL can take advantage of indexes and other metadata
  • SQL does not require you to care about the execution plan

Of the remaining languages, both R and Pig offer capabilities that abstract away parts of the execution plan, but not the entire plan.  Python and Powershell require you to manually define the entire execution plan.  Fortunately, with the exception of Pig, all of these languages have libraries available that allow you to interact with a database using SQL.


Viewing all articles
Browse latest Browse all 9

Latest Images

Trending Articles





Latest Images