Apache CarbonData Dev Mailing List archive

Query regarding behaviour of sort column

Classic

List

Threaded

2 messages Options

Pallavi Singh

Query regarding behaviour of sort column

This post was updated on .

Hi Community,

While working with the sort columns, I saw that if the column listed in the
sort_column happens to have a null value in the data while sorting , then
the row corresponding to that null value is eliminated from the result set.
Is this correct behavior? Ideally null values should be sorted and listed
on the top in the result set.

--
Regards | Pallavi Singh
Software Consultant

Pallavi Singh

Re: Query regarding behaviour of sort column

Hi Community,

While working with the above problem I found two discussions regarding the
sort column

1. https://github.com/apache/carbondata/pull/635 which states : If the
table need be sorted by a measure, we should use dictionary_include to add
it to dimension list.

2. https://github.com/apache/carbondata/pull/757 : if a column of
SORT_COLUMNS is a measure before, now this column will be created as a
dimension. And this dimension is a no-dicitonary column(Better to use other
direct-dictionary).

Now if the columns in my sort_column be measures then I have to add the
same columns in the dictionary_include other wise in case of null value in
case of sort_column column the loading fails after the first null encounter
itself.

for example like this :

CREATE TABLE test_sort_col
| (id INT,
| name STRING,
| age INT
| )STORED BY 'org.apache.carbondata.format'
| TBLPROPERTIES('SORT_COLUMNS'='id,age','DICTIONARY_INCLUDE'='id,age')

and the csv has following data :

1,Pallavi,25
2,Geetika,24
3,Prabhat,twenty six
7,Neha,25
2,Geetika,22
3,Sangeeta,26

and the load gets successful like shown below :

+---+--------+----+
| id| name| age|
+---+--------+----+
| 1| Pallavi| 25|
| 2| Geetika| 22|
| 2| Geetika| 24|
| 3| Prabhat|null|
| 3|Sangeeta| 26|
| 7| Neha| 25|
+---+--------+----+

now if i remove the measures of the sort_column from the dictionary_include
in the query I get an error and partial data gets loaded, snapshot is
provided below

Data load request has been received for table default.test_sort_col
17/05/17 16:46:51 ERROR UnsafeBatchParallelReadMergeSorterImpl:
pool-20-thread-1
java.lang.ClassCastException: java.lang.String cannot be cast to [B
at
org.apache.carbondata.processing.newflow.sort.unsafe.UnsafeCarbonRowPage.addRow(UnsafeCarbonRowPage.java:89)
at
org.apache.carbondata.processing.newflow.sort.unsafe.UnsafeCarbonRowPage.addRow(UnsafeCarbonRowPage.java:74)
at
org.apache.carbondata.processing.newflow.sort.unsafe.UnsafeSortDataRows.addRowBatch(UnsafeSortDataRows.java:170)
at
org.apache.carbondata.processing.newflow.sort.impl.UnsafeBatchParallelReadMergeSorterImpl$SortIteratorThread.call(UnsafeBatchParallelReadMergeSorterImpl.java:150)
at
org.apache.carbondata.processing.newflow.sort.impl.UnsafeBatchParallelReadMergeSorterImpl$SortIteratorThread.call(UnsafeBatchParallelReadMergeSorterImpl.java:117)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
17/05/17 16:46:51 AUDIT UnsafeInmemoryHolder:
[pallavi][pallavi][Thread-83]Processing unsafe inmemory rows page with size
: 2
17/05/17 16:46:52 ERROR DataLoadExecutor: [Executor task launch
worker-0][partitionID:default_test_sort_col_296196d0-a469-4273-a382-c41531c32591]
Data Load is partially success for table test_sort_col
17/05/17 16:46:52 AUDIT CarbonDataRDDFactory$:
[pallavi][pallavi][Thread-1]Data load is partially successful for
default.test_sort_col
+---+-------+---+
| id| name|age|
+---+-------+---+
| 1|Pallavi| 25|
| 2|Geetika| 24|
+---+-------+---+

What is the correct behavior, should we add measures in sort_column to
dictionary_include or should we modify the load flow to handle null values?

On Tue, May 16, 2017 at 11:38 AM, Pallavi Singh <[hidden email]>
wrote:

> Hi Community,
>
> While working with the sort columns, I saw that if the column listed in
> the sort_column happens to have a null value in the data while sorting ,
> then the row corresponding to that null value is eliminated from the result
> set. Is this correct behavior? Ideally null values should be sorted and
> listed on the top in the result set.
>

--
Regards | Pallavi Singh
Software Consultant