This post was updated on .
Hi Community,
While working with the sort columns, I saw that if the column listed in the sort_column happens to have a null value in the data while sorting , then the row corresponding to that null value is eliminated from the result set. Is this correct behavior? Ideally null values should be sorted and listed on the top in the result set. -- Regards | Pallavi Singh Software Consultant |
Hi Community,
While working with the above problem I found two discussions regarding the sort column 1. https://github.com/apache/carbondata/pull/635 which states : If the table need be sorted by a measure, we should use dictionary_include to add it to dimension list. 2. https://github.com/apache/carbondata/pull/757 : if a column of SORT_COLUMNS is a measure before, now this column will be created as a dimension. And this dimension is a no-dicitonary column(Better to use other direct-dictionary). Now if the columns in my sort_column be measures then I have to add the same columns in the dictionary_include other wise in case of null value in case of sort_column column the loading fails after the first null encounter itself. for example like this : CREATE TABLE test_sort_col | (id INT, | name STRING, | age INT | )STORED BY 'org.apache.carbondata.format' | TBLPROPERTIES('SORT_COLUMNS'='id,age','DICTIONARY_INCLUDE'='id,age') and the csv has following data : 1,Pallavi,25 2,Geetika,24 3,Prabhat,twenty six 7,Neha,25 2,Geetika,22 3,Sangeeta,26 and the load gets successful like shown below : +---+--------+----+ | id| name| age| +---+--------+----+ | 1| Pallavi| 25| | 2| Geetika| 22| | 2| Geetika| 24| | 3| Prabhat|null| | 3|Sangeeta| 26| | 7| Neha| 25| +---+--------+----+ now if i remove the measures of the sort_column from the dictionary_include in the query I get an error and partial data gets loaded, snapshot is provided below Data load request has been received for table default.test_sort_col 17/05/17 16:46:51 ERROR UnsafeBatchParallelReadMergeSorterImpl: pool-20-thread-1 java.lang.ClassCastException: java.lang.String cannot be cast to [B at org.apache.carbondata.processing.newflow.sort.unsafe.UnsafeCarbonRowPage.addRow(UnsafeCarbonRowPage.java:89) at org.apache.carbondata.processing.newflow.sort.unsafe.UnsafeCarbonRowPage.addRow(UnsafeCarbonRowPage.java:74) at org.apache.carbondata.processing.newflow.sort.unsafe.UnsafeSortDataRows.addRowBatch(UnsafeSortDataRows.java:170) at org.apache.carbondata.processing.newflow.sort.impl.UnsafeBatchParallelReadMergeSorterImpl$SortIteratorThread.call(UnsafeBatchParallelReadMergeSorterImpl.java:150) at org.apache.carbondata.processing.newflow.sort.impl.UnsafeBatchParallelReadMergeSorterImpl$SortIteratorThread.call(UnsafeBatchParallelReadMergeSorterImpl.java:117) at java.util.concurrent.FutureTask.run(FutureTask.java:266) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) at java.lang.Thread.run(Thread.java:745) 17/05/17 16:46:51 AUDIT UnsafeInmemoryHolder: [pallavi][pallavi][Thread-83]Processing unsafe inmemory rows page with size : 2 17/05/17 16:46:52 ERROR DataLoadExecutor: [Executor task launch worker-0][partitionID:default_test_sort_col_296196d0-a469-4273-a382-c41531c32591] Data Load is partially success for table test_sort_col 17/05/17 16:46:52 AUDIT CarbonDataRDDFactory$: [pallavi][pallavi][Thread-1]Data load is partially successful for default.test_sort_col +---+-------+---+ | id| name|age| +---+-------+---+ | 1|Pallavi| 25| | 2|Geetika| 24| +---+-------+---+ What is the correct behavior, should we add measures in sort_column to dictionary_include or should we modify the load flow to handle null values? On Tue, May 16, 2017 at 11:38 AM, Pallavi Singh <[hidden email]> wrote: > Hi Community, > > While working with the sort columns, I saw that if the column listed in > the sort_column happens to have a null value in the data while sorting , > then the row corresponding to that null value is eliminated from the result > set. Is this correct behavior? Ideally null values should be sorted and > listed on the top in the result set. > -- Regards | Pallavi Singh Software Consultant |
Free forum by Nabble | Edit this page |