[jira] [Updated] (CARBONDATA-3593) total_blocklets in query statistic always the same with valid_blocklets


Akash R Nilugal (Jira)

     [ https://issues.apache.org/jira/browse/CARBONDATA-3593?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hong Shen updated CARBONDATA-3593:
----------------------------------
    Description:
When I run SQL on a CarbonData table with "enable.query.statistics=true", total_blocklets in the query statistics is always the same as valid_blocklets. Below is an example.

The tables test_table_hdfs_sort_city and test_table_hdfs_no_sort contain the same data. The only difference is that test_table_hdfs_sort_city has SORT_COLUMN='city_name', while test_table_hdfs_no_sort has no sort column.

{code}
carbon.sql("select * from test_table_hdfs_sort_city where city_name='city1'")
{code}

|scan_blocks_num|total_blocklets|valid_blocklets|total_pages|scanned_pages|valid_pages|
|1|1|1|193|4|4|

{code}
carbon.sql("select * from test_table_hdfs_no_sort where city_name='city1'")
{code}

|scan_blocks_num|total_blocklets|valid_blocklets|total_pages|scanned_pages|valid_pages|
|1|3|3|193|193|193|

After reading the code, I found that both TOTAL_BLOCKLET_NUM and VALID_SCAN_BLOCKLET_NUM are incremented by 1 in BlockletFilterScanner.executeFilter(), BlockletFilterScanner.executeFilterForPages(), and BlockletFullScanner.scanBlocklet().

I think total_blocklets should be the total number of blocklets, and valid_blocklets should be the number of blocklets that remain after filtering. If this needs to be changed, I can provide a patch, since I have already modified it locally.
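
To illustrate the distinction proposed above, here is a minimal sketch in plain Java. It is not the CarbonData code and not the patch: the class BlockletCounterSketch, its Stats holder, and the methods countCoupled and countSeparately are hypothetical names, and each blocklet is reduced to a boolean that says whether it passes the filter. The coupled method is only a simplified model of the reported behavior, assuming both counters are incremented at the same point.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch, not CarbonData code: each blocklet is reduced to a
// boolean that says whether it passes the filter.
class BlockletCounterSketch {

    // Holder for the two statistics discussed in this issue.
    static final class Stats {
        long totalBlocklets;
        long validBlocklets;
    }

    // Simplified model of the reported behavior: both counters are
    // incremented at the same point, so they can never differ.
    static Stats countCoupled(List<Boolean> passesFilter) {
        Stats stats = new Stats();
        for (int i = 0; i < passesFilter.size(); i++) {
            stats.totalBlocklets++;
            stats.validBlocklets++;
        }
        return stats;
    }

    // Proposed semantics: total counts every blocklet considered,
    // valid counts only the blocklets that pass the filter.
    static Stats countSeparately(List<Boolean> passesFilter) {
        Stats stats = new Stats();
        for (boolean passes : passesFilter) {
            stats.totalBlocklets++;
            if (passes) {
                stats.validBlocklets++;
            }
        }
        return stats;
    }

    public static void main(String[] args) {
        // Three blocklets, only the first one passes the filter.
        List<Boolean> passesFilter = Arrays.asList(true, false, false);
        Stats coupled = countCoupled(passesFilter);
        Stats separate = countSeparately(passesFilter);
        System.out.println("coupled:  total=" + coupled.totalBlocklets
                + " valid=" + coupled.validBlocklets);
        System.out.println("separate: total=" + separate.totalBlocklets
                + " valid=" + separate.validBlocklets);
    }
}
```

With three blocklets of which one passes the filter, the coupled version reports total=3 and valid=3, while the separate version reports total=3 and valid=1. Only the second pair shows how much the filter actually pruned.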

  was:
When I run SQL on a CarbonData table with "enable.query.statistics=true", total_blocklets in the query statistics is always the same as valid_blocklets.

{code}
The tables test_table_hdfs_sort_city and test_table_hdfs_no_sort contain the same data. The only difference is that test_table_hdfs_sort_city has SORT_COLUMN='city_name', while test_table_hdfs_no_sort has no sort column.

carbon.sql("select * from test_table_hdfs_sort_city where city_name='city1'")

|scan_blocks_num|total_blocklets|valid_blocklets|total_pages|scanned_pages|valid_pages|
|1|1|1|193|4|4|

carbon.sql("select * from test_table_hdfs_no_sort where city_name='city1'")
|scan_blocks_num|total_blocklets|valid_blocklets|total_pages|scanned_pages|valid_pages|
|1|3|3|193|193|193|
{code}

After reading the code, I found that both TOTAL_BLOCKLET_NUM and VALID_SCAN_BLOCKLET_NUM are incremented by 1 in BlockletFilterScanner.executeFilter(), BlockletFilterScanner.executeFilterForPages(), and BlockletFullScanner.scanBlocklet().

I think total_blocklets should be the total number of blocklets, and valid_blocklets should be the number of blocklets that remain after filtering. If this needs to be changed, I can provide a patch, since I have already modified it locally.


> total_blocklets in query statistic always the same with valid_blocklets
> -----------------------------------------------------------------------
>
>                 Key: CARBONDATA-3593
>                 URL: https://issues.apache.org/jira/browse/CARBONDATA-3593
>             Project: CarbonData
>          Issue Type: Improvement
>          Components: core
>            Reporter: Hong Shen
>            Priority: Major



--
This message was sent by Atlassian Jira
(v8.3.4#803005)