Data Science Asked by Gerardsson on August 2, 2020
Coming from a DWH-background I am used to putting subqueries almost everywhere in my queries. On a Hadoop project (with Hive version 1.1.0 on Cloudera), I noticed we can forego subqueries in some cases.
It made me wonder if there are similar SQL-dialect specific differences between what is used in Hadoop SQL and what you would use in a DWH-setting. So I would like to extend this question so that people can mention what they noticed as differences between Hadoop and DWH in when structuring their queries. I noticed there was very little reference to this topic for Hadoop.
There are some normal principles that apply to Hadoop like mentioned here: https://streever.atlassian.net/wiki/spaces/HADOOP/pages/3211279/Hive+SQL+Best+Practices
It would be nice to get a few of your best practices for working with Hadoop. E.g. You write your queries as neutral as possible so that it works in both Hive and Impala avoiding using language-specific functions such as left (Impala only)
The example I came across was that a group by
-query worked in two different ways. In classic RDBMS I know that only one of both would work.
Here is an example (columns are displayed in different order then in group by
):
Classic SQL (Oracle, SQL Server, etc)
select t2.b, t2.a
from (select t1.a,t1.b from table1 as t1
group by t1.a, t1.b) t2
Hive SQL
select t1.b, t1.a
from table1 as t1
group by t1.a, t1.b
Notice that everything is combined in 1 single query Hadoop. The classic SQL snippet first does the group by
in a subquery before the data is displayed in the right order.
I would be curious to know if you came across other subtleties like this one.
I know, based on my experience, that it is good practice to respect the following guidelines while coding:
Keep your code clean (hence avoid sub-queries if you can do without)
Consider what impact changes to settings might have. E.g. yarn.nodemanager.resource.memory-mb=24576
is great to use, but what happens if you are not allowed to change the node memory size. In an automated job it is not necessarily good practice. Look at the general Hadoop settings, e.g. file size.
Avoid using functions that are specific to a program. E.g. select left("Hello world", 3)
is useful in Impala but in Hive it has to be rewritten to select left("Hello world", 1,4)
. This can lead to problems when later on down the line it will run in a different program.
The biggest help to me was learning about how MapReduce works on a distributed environment. I don't have exact rules of thumb for you, but I might be able to help.
SELECT T1.*, T2.col1, T2.col2 FROM T1 LEFT OUTER JOIN (SELECT key, col1, col2 FROM T2) T2 ON T1.key = T2.key;
This is all due to how the execution engine works under-the-covers, so learning about that and the differences of the execution engines will really help.
Temp tables are your friend - prior to hadoop, I avoided staging data in temporary tables a lot and wrote more complex SQL to avoid it. But with Hadoop and big data, I've actually found it much faster to create tables and then subsequently joining to them helps in some cases with massive data.
Learn how the table's underlying data is stored. STORED AS TEXTFILE is a lot different than data stored as parquet
Impala vs Hive also has performance implications. I've found to find Hive more stable, but slower.
How the data is stored (skewed or not, bucketed or not, etc) makes a huge different. Keep an eye out for queries where you have 100 reducers and 99 of them finish super fast and 1 of them takes forever.
Like I mentioned, spend time learning how the execution engines do the work in a distributed environment. Once you understand how the execution happens behind the scenes, you can start to pick up on other subtleties like why this:
select count(distinct user) from table;
can be a low slower than this:
select count(*) from (select user from table group by user) q;
Answered by Josh on August 2, 2020
Get help from others!
Recent Questions
Recent Answers
© 2024 TransWikia.com. All rights reserved. Sites we Love: PCI Database, UKBizDB, Menu Kuliner, Sharing RPP