Apache CarbonData Dev Mailing List archive

Re: Query regarding behaviour of sort column

Posted by Pallavi Singh on
URL: http://apache-carbondata-dev-mailing-list-archive.168.s1.nabble.com/Query-regarding-behaviour-of-sort-column-tp12712p12769.html

Hi Community,

While working with the above problem I found two discussions regarding the
sort column

1. https://github.com/apache/carbondata/pull/635 which states : If the
table need be sorted by a measure, we should use dictionary_include to add
it to dimension list.

2. https://github.com/apache/carbondata/pull/757 : if a column of
SORT_COLUMNS is a measure before, now this column will be created as a
dimension. And this dimension is a no-dicitonary column(Better to use other
direct-dictionary).

Now if the columns in my sort_column be measures then I have to add the
same columns in the dictionary_include other wise in case of null value in
case of sort_column column the loading fails after the first null encounter
itself.

for example like this :

CREATE TABLE test_sort_col
| (id INT,
| name STRING,
| age INT
| )STORED BY 'org.apache.carbondata.format'
| TBLPROPERTIES('SORT_COLUMNS'='id,age','DICTIONARY_INCLUDE'='id,age')

and the csv has following data :

1,Pallavi,25
2,Geetika,24
3,Prabhat,twenty six
7,Neha,25
2,Geetika,22
3,Sangeeta,26

and the load gets successful like shown below :

+---+--------+----+
| id| name| age|
+---+--------+----+
| 1| Pallavi| 25|
| 2| Geetika| 22|
| 2| Geetika| 24|
| 3| Prabhat|null|
| 3|Sangeeta| 26|
| 7| Neha| 25|
+---+--------+----+

now if i remove the measures of the sort_column from the dictionary_include
in the query I get an error and partial data gets loaded, snapshot is
provided below

Data load request has been received for table default.test_sort_col
17/05/17 16:46:51 ERROR UnsafeBatchParallelReadMergeSorterImpl:
pool-20-thread-1
java.lang.ClassCastException: java.lang.String cannot be cast to [B
at
org.apache.carbondata.processing.newflow.sort.unsafe.UnsafeCarbonRowPage.addRow(UnsafeCarbonRowPage.java:89)
at
org.apache.carbondata.processing.newflow.sort.unsafe.UnsafeCarbonRowPage.addRow(UnsafeCarbonRowPage.java:74)
at
org.apache.carbondata.processing.newflow.sort.unsafe.UnsafeSortDataRows.addRowBatch(UnsafeSortDataRows.java:170)
at
org.apache.carbondata.processing.newflow.sort.impl.UnsafeBatchParallelReadMergeSorterImpl$SortIteratorThread.call(UnsafeBatchParallelReadMergeSorterImpl.java:150)
at
org.apache.carbondata.processing.newflow.sort.impl.UnsafeBatchParallelReadMergeSorterImpl$SortIteratorThread.call(UnsafeBatchParallelReadMergeSorterImpl.java:117)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
17/05/17 16:46:51 AUDIT UnsafeInmemoryHolder:
[pallavi][pallavi][Thread-83]Processing unsafe inmemory rows page with size
: 2
17/05/17 16:46:52 ERROR DataLoadExecutor: [Executor task launch
worker-0][partitionID:default_test_sort_col_296196d0-a469-4273-a382-c41531c32591]
Data Load is partially success for table test_sort_col
17/05/17 16:46:52 AUDIT CarbonDataRDDFactory$:
[pallavi][pallavi][Thread-1]Data load is partially successful for
default.test_sort_col
+---+-------+---+
| id| name|age|
+---+-------+---+
| 1|Pallavi| 25|
| 2|Geetika| 24|
+---+-------+---+

What is the correct behavior, should we add measures in sort_column to
dictionary_include or should we modify the load flow to handle null values?

On Tue, May 16, 2017 at 11:38 AM, Pallavi Singh <[hidden email]>
wrote:

> Hi Community,
>
> While working with the sort columns, I saw that if the column listed in
> the sort_column happens to have a null value in the data while sorting ,
> then the row corresponding to that null value is eliminated from the result
> set. Is this correct behavior? Ideally null values should be sorted and
> listed on the top in the result set.
>

--
Regards | Pallavi Singh
Software Consultant