Stack Overflow Asked by anu john on November 10, 2021
I have an Oracle query which is fetching 25 million records, there is no pk or no columns which is distributed properly to make as a split by column. So I have thought of making a sequence number using ROW_number() over () as RANGEGROUP
. But when I use this pseudo column its giving me an error saying
at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:164)
Caused by: java.sql.SQLSyntaxErrorException: ORA-00904: "P"."RANGEGROUP": invalid identifier at oracle.jdbc.driver.SQLStateMapping.newSQLException(SQLStateMapping.java:91).
I am properly giving the alias, even I tried with out alias to the pseudo column, its still giving the same error.
Can we use derived columns in Sqoop split by, or the column should be physically present in table?
Use subquery to wrap row_number calculation, then use derived column in the split-by.
--query "select col1, ... colN, RANGEGROUP
from (select t.*, row_number() OVER (order by t.item_id ) AS RANGEGROUP
from table t ) s
where 1=1 and $CONDITIONS"
row_number
should be deterministic, it means when executed multiple times, it should assign exactly the same number to all rows. What can happen if ORDER BY in the OVER contains not unique column or combination: row_number
can return different numbers for the same rows. And if you are using it in the split-by, you will get duplication because the same row can be in split range 1, say 1-100, in mapper2 sqoop will execute same query with filter for range 2, say (101-200) the same row can appear also in that range. Sqoop runs the same query in different containers(mappers) with different condition to get split ranges in parallel.
If Id is int (and much better if it is evenly distributed), use that ID. Why you may need row_number is when it is STRING column. read this: https://stackoverflow.com/a/37389134/2700344, split-column is not necessarily a PK
Answered by leftjoin on November 10, 2021
Get help from others!
Recent Answers
Recent Questions
© 2024 TransWikia.com. All rights reserved. Sites we Love: PCI Database, UKBizDB, Menu Kuliner, Sharing RPP