[GitHub] carbondata pull request #2604: [CARBONDATA-2815][Doc] Add documentation for ...

GitHub user xuchuanyin opened a pull request:

    https://github.com/apache/carbondata/pull/2604

    [CARBONDATA-2815][Doc] Add documentation for spilling memory and datamap rebuild

    Add documentation for
    1. spilling unsafe memory for data loading
    2. datamap rebuild for index datamap
   
    Be sure to complete all of the following checklist items to help us incorporate
    your contribution quickly and easily:
   
     - [ ] Any interfaces changed?
     
     - [ ] Any backward compatibility impacted?
     
     - [ ] Document update required?
   
     - [ ] Testing done
            Please provide details on
            - Whether new unit test cases have been added or why no new tests are required?
            - How it is tested? Please attach test report.
            - Is it a performance related change? Please attach the performance test report.
            - Any additional information to help reviewers in testing this change.
           
     - [ ] For large changes, please consider breaking it into sub-tasks under an umbrella JIRA.
   


You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/xuchuanyin/carbondata issue2815_doc_rebuild_spill

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/carbondata/pull/2604.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #2604
   

---

[GitHub] carbondata pull request #2604: [CARBONDATA-2815][Doc] Add documentation for ...

Github user chetandb commented on a diff in the pull request:

    https://github.com/apache/carbondata/pull/2604#discussion_r207257539
 
    --- Diff: docs/configuration-parameters.md ---
    @@ -69,7 +69,8 @@ This section provides the details of all the configurations required for CarbonD
     | carbon.options.bad.record.path |  | Specifies the HDFS path where bad records are stored. By default the value is Null. This path must to be configured by the user if bad record logger is enabled or bad record action redirect. | |
     | carbon.enable.vector.reader | true | This parameter increases the performance of select queries as it fetch columnar batch of size 4*1024 rows instead of fetching data row by row. | |
     | carbon.blockletgroup.size.in.mb | 64 MB | The data are read as a group of blocklets which are called blocklet groups. This parameter specifies the size of the blocklet group. Higher value results in better sequential IO access.The minimum value is 16MB, any value lesser than 16MB will reset to the default value (64MB). |  |
    -| carbon.task.distribution | block | **block**: Setting this value will launch one task per block. This setting is suggested in case of concurrent queries and queries having big shuffling scenarios. **custom**: Setting this value will group the blocks and distribute it uniformly to the available resources in the cluster. This enhances the query performance but not suggested in case of concurrent queries and queries having big shuffling scenarios. **blocklet**: Setting this value will launch one task per blocklet. This setting is suggested in case of concurrent queries and queries having big shuffling scenarios. **merge_small_files**: Setting this value will merge all the small partitions to a size of (128 MB is the default value of "spark.sql.files.maxPartitionBytes",it is configurable) during querying. The small partitions are combined to a map task to reduce the number of read task. This enhances the performance. | |
    +| carbon.task.distribution | block | **block**: Setting this value will launch one task per block. This setting is suggested in case of concurrent queries and queries having big shuffling scenarios. **custom**: Setting this value will group the blocks and distribute it uniformly to the available resources in the cluster. This enhances the query performance but not suggested in case of concurrent queries and queries having big shuffling scenarios. **blocklet**: Setting this value will launch one task per blocklet. This setting is suggested in case of concurrent queries and queries having big shuffling scenarios. **merge_small_files**: Setting this value will merge all the small partitions to a size of (128 MB is the default value of "spark.sql.files.maxPartitionBytes",it is configurable) during querying. The small partitions are combined to a map task to reduce the number of read task. This enhances the performance. | |
    +| carbon.load.sortmemory.spill.percentage | 0 | If we use unsafe memory during data loading, this configuration will be used to control the behavior of spilling inmemory pages to disk. Internally in Carbondata, during sorting carbondata will sort data in pages and add them in unsafe memory. If the memory insufficient, carbondata will spill the pages to disk and generate sort temp file. This configuration controls how many pages in memory will be spilled to disk based size. The size can be calculated by multiply this configuration value with 'carbon.sort.storage.inmemory.size.inmb'. For example, default value 0 means that no pages in unsafe memory will be spilled and all the newly sorted data will be spilled to disk; Value 50 means that if the unsafe memory is insufficient, about half of pages in the unsafe memory will be spilled to disk while value 100 means that almost all pages in unsafe memory will be spilled. **Note**: This configuration only works for 'LOCAL_SORT' and 'BATCH_SORT' and the actual spilling behavior may slightly be different in each data loading. | Integer values between 0 and 100 |
    --- End diff --
   
    Change "memory insufficient" to "memory is insufficient".
    Change "multiply" to "multiplying"


---

[GitHub] carbondata pull request #2604: [CARBONDATA-2815][Doc] Add documentation for ...

Github user chetandb commented on a diff in the pull request:

    https://github.com/apache/carbondata/pull/2604#discussion_r207259551
 
    --- Diff: docs/datamap/datamap-management.md ---
    @@ -22,13 +22,13 @@ Currently, there are 5 DataMap implementation in CarbonData.
     | timeseries       | time dimension rollup table.             | event_time, xx_granularity, please refer to [Timeseries DataMap](https://github.com/apache/carbondata/blob/master/docs/datamap/timeseries-datamap-guide.md) | Automatic        |
     | mv               | multi-table pre-aggregate table,         | No DMPROPERTY is required                | Manual           |
     | lucene           | lucene indexing for text column          | index_columns to specifying the index columns | Manual/Automatic |
    -| bloom            | bloom filter for high cardinality column, geospatial column | index_columns to specifying the index columns | Manual/Automatic |
    +| bloomfilter      | bloom filter for high cardinality column, geospatial column | index_columns to specifying the index columns | Manual/Automatic |
     
     ## DataMap Management
     
     There are two kinds of management semantic for DataMap.
     
    -1. Autmatic Refresh: Create datamap without `WITH DEFERED REBUILD` in the statement
    --- End diff --
   
    Change Autmatic to Automatic
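
    As a companion to the table above, a minimal Scala sketch of creating the
    bloomfilter datamap it lists (the session setup, table name `sales`, and
    column `user_id` are hypothetical; the DDL shape follows this guide):

        import org.apache.spark.sql.SparkSession

        object BloomDataMapExample {
          // Assumes `spark` is a CarbonData-enabled session (e.g. CarbonSession).
          // Omitting WITH DEFERRED REBUILD gives the automatic-refresh semantic
          // mentioned in the diff above.
          def createBloomDataMap(spark: SparkSession): Unit = {
            spark.sql(
              """CREATE DATAMAP dm_bloom ON TABLE sales
                |USING 'bloomfilter'
                |DMPROPERTIES ('index_columns' = 'user_id')""".stripMargin)
          }
        }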


---

[GitHub] carbondata pull request #2604: [CARBONDATA-2815][Doc] Add documentation for ...

Github user chetandb commented on a diff in the pull request:

    https://github.com/apache/carbondata/pull/2604#discussion_r207260295
 
    --- Diff: docs/datamap/datamap-management.md ---
    @@ -51,15 +51,23 @@ If user do want to perform above operations on the main table, user can first dr
     
     If user drop the main table, the datamap will be dropped immediately too.
     
    +We do recommend you to use this management for index datamap.
    +
     ### Manual Refresh
     
     When user creates a datamap specifying maunal refresh semantic, the datamap is created with status *disabled* and query will NOT use this datamap until user can issue REBUILD DATAMAP command to build the datamap. For every REBUILD DATAMAP command, system will trigger a full rebuild of the datamap. After rebuild is done, system will change datamap status to *enabled*, so that it can be used in query rewrite.
     
    -For every new data loading, data update, delete, the related datamap will be made *disabled*.
    +For every new data loading, data update, delete, the related datamap will be made *disabled*,
    +which means that the following queries will not benefit from the datamap before it becomes *enable* again.
    --- End diff --
   
    Change enable to enabled
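
    A minimal Scala sketch of the manual-refresh flow described in this diff
    (assumes a CarbonData-enabled session; the datamap, table, and column names
    are hypothetical, and the exact REBUILD clause may vary by version):

        import org.apache.spark.sql.SparkSession

        object ManualRefreshExample {
          def manualRefreshFlow(spark: SparkSession): Unit = {
            // Created WITH DEFERRED REBUILD, the datamap starts *disabled*.
            spark.sql(
              """CREATE DATAMAP dm_lucene ON TABLE docs
                |USING 'lucene'
                |WITH DEFERRED REBUILD
                |DMPROPERTIES ('index_columns' = 'body')""".stripMargin)
            // After a load/update/delete the datamap is *disabled* again;
            // a full rebuild re-enables it for query rewrite.
            spark.sql("REBUILD DATAMAP dm_lucene ON TABLE docs")
          }
        }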


---

[GitHub] carbondata pull request #2604: [CARBONDATA-2815][Doc] Add documentation for ...

Github user chetandb commented on a diff in the pull request:

    https://github.com/apache/carbondata/pull/2604#discussion_r207260702
 
    --- Diff: docs/datamap/datamap-management.md ---
    @@ -51,15 +51,23 @@ If user do want to perform above operations on the main table, user can first dr
     
     If user drop the main table, the datamap will be dropped immediately too.
     
    +We do recommend you to use this management for index datamap.
    +
     ### Manual Refresh
     
     When user creates a datamap specifying maunal refresh semantic, the datamap is created with status *disabled* and query will NOT use this datamap until user can issue REBUILD DATAMAP command to build the datamap. For every REBUILD DATAMAP command, system will trigger a full rebuild of the datamap. After rebuild is done, system will change datamap status to *enabled*, so that it can be used in query rewrite.
     
    -For every new data loading, data update, delete, the related datamap will be made *disabled*.
    +For every new data loading, data update, delete, the related datamap will be made *disabled*,
    +which means that the following queries will not benefit from the datamap before it becomes *enable* again.
     
     If the main table is dropped by user, the related datamap will be dropped immediately.
     
    -*Note: If you are creating a datamap on external table, you need to do manual managment of the datamap.*
    +**Note**:
    ++ If you are creating a datamap on external table, you need to do manual managment of the datamap.
    --- End diff --
   
    Change managment  to management


---

[GitHub] carbondata pull request #2604: [CARBONDATA-2815][Doc] Add documentation for ...

Github user xuchuanyin commented on a diff in the pull request:

    https://github.com/apache/carbondata/pull/2604#discussion_r207265094
 
    --- Diff: docs/configuration-parameters.md ---
    @@ -69,7 +69,8 @@ This section provides the details of all the configurations required for CarbonD
     | carbon.options.bad.record.path |  | Specifies the HDFS path where bad records are stored. By default the value is Null. This path must to be configured by the user if bad record logger is enabled or bad record action redirect. | |
     | carbon.enable.vector.reader | true | This parameter increases the performance of select queries as it fetch columnar batch of size 4*1024 rows instead of fetching data row by row. | |
     | carbon.blockletgroup.size.in.mb | 64 MB | The data are read as a group of blocklets which are called blocklet groups. This parameter specifies the size of the blocklet group. Higher value results in better sequential IO access.The minimum value is 16MB, any value lesser than 16MB will reset to the default value (64MB). |  |
    -| carbon.task.distribution | block | **block**: Setting this value will launch one task per block. This setting is suggested in case of concurrent queries and queries having big shuffling scenarios. **custom**: Setting this value will group the blocks and distribute it uniformly to the available resources in the cluster. This enhances the query performance but not suggested in case of concurrent queries and queries having big shuffling scenarios. **blocklet**: Setting this value will launch one task per blocklet. This setting is suggested in case of concurrent queries and queries having big shuffling scenarios. **merge_small_files**: Setting this value will merge all the small partitions to a size of (128 MB is the default value of "spark.sql.files.maxPartitionBytes",it is configurable) during querying. The small partitions are combined to a map task to reduce the number of read task. This enhances the performance. | |
    +| carbon.task.distribution | block | **block**: Setting this value will launch one task per block. This setting is suggested in case of concurrent queries and queries having big shuffling scenarios. **custom**: Setting this value will group the blocks and distribute it uniformly to the available resources in the cluster. This enhances the query performance but not suggested in case of concurrent queries and queries having big shuffling scenarios. **blocklet**: Setting this value will launch one task per blocklet. This setting is suggested in case of concurrent queries and queries having big shuffling scenarios. **merge_small_files**: Setting this value will merge all the small partitions to a size of (128 MB is the default value of "spark.sql.files.maxPartitionBytes",it is configurable) during querying. The small partitions are combined to a map task to reduce the number of read task. This enhances the performance. | |
    +| carbon.load.sortmemory.spill.percentage | 0 | If we use unsafe memory during data loading, this configuration will be used to control the behavior of spilling inmemory pages to disk. Internally in Carbondata, during sorting carbondata will sort data in pages and add them in unsafe memory. If the memory insufficient, carbondata will spill the pages to disk and generate sort temp file. This configuration controls how many pages in memory will be spilled to disk based size. The size can be calculated by multiply this configuration value with 'carbon.sort.storage.inmemory.size.inmb'. For example, default value 0 means that no pages in unsafe memory will be spilled and all the newly sorted data will be spilled to disk; Value 50 means that if the unsafe memory is insufficient, about half of pages in the unsafe memory will be spilled to disk while value 100 means that almost all pages in unsafe memory will be spilled. **Note**: This configuration only works for 'LOCAL_SORT' and 'BATCH_SORT' and the actual spilling behavior may slightly be different in each data loading. | Integer values between 0 and 100 |
    --- End diff --
   
    fixed


---

[GitHub] carbondata pull request #2604: [CARBONDATA-2815][Doc] Add documentation for ...

Github user xuchuanyin commented on a diff in the pull request:

    https://github.com/apache/carbondata/pull/2604#discussion_r207265338
 
    --- Diff: docs/datamap/datamap-management.md ---
    @@ -22,13 +22,13 @@ Currently, there are 5 DataMap implementation in CarbonData.
     | timeseries       | time dimension rollup table.             | event_time, xx_granularity, please refer to [Timeseries DataMap](https://github.com/apache/carbondata/blob/master/docs/datamap/timeseries-datamap-guide.md) | Automatic        |
     | mv               | multi-table pre-aggregate table,         | No DMPROPERTY is required                | Manual           |
     | lucene           | lucene indexing for text column          | index_columns to specifying the index columns | Manual/Automatic |
    -| bloom            | bloom filter for high cardinality column, geospatial column | index_columns to specifying the index columns | Manual/Automatic |
    +| bloomfilter      | bloom filter for high cardinality column, geospatial column | index_columns to specifying the index columns | Manual/Automatic |
     
     ## DataMap Management
     
     There are two kinds of management semantic for DataMap.
     
    -1. Autmatic Refresh: Create datamap without `WITH DEFERED REBUILD` in the statement
    --- End diff --
   
    fixed


---

[GitHub] carbondata pull request #2604: [CARBONDATA-2815][Doc] Add documentation for ...

Github user xuchuanyin commented on a diff in the pull request:

    https://github.com/apache/carbondata/pull/2604#discussion_r207265502
 
    --- Diff: docs/datamap/datamap-management.md ---
    @@ -51,15 +51,23 @@ If user do want to perform above operations on the main table, user can first dr
     
     If user drop the main table, the datamap will be dropped immediately too.
     
    +We do recommend you to use this management for index datamap.
    +
     ### Manual Refresh
     
     When user creates a datamap specifying maunal refresh semantic, the datamap is created with status *disabled* and query will NOT use this datamap until user can issue REBUILD DATAMAP command to build the datamap. For every REBUILD DATAMAP command, system will trigger a full rebuild of the datamap. After rebuild is done, system will change datamap status to *enabled*, so that it can be used in query rewrite.
     
    -For every new data loading, data update, delete, the related datamap will be made *disabled*.
    +For every new data loading, data update, delete, the related datamap will be made *disabled*,
    +which means that the following queries will not benefit from the datamap before it becomes *enable* again.
    --- End diff --
   
    fixed


---

[GitHub] carbondata pull request #2604: [CARBONDATA-2815][Doc] Add documentation for ...

Github user xuchuanyin commented on a diff in the pull request:

    https://github.com/apache/carbondata/pull/2604#discussion_r207265816
 
    --- Diff: docs/datamap/datamap-management.md ---
    @@ -51,15 +51,23 @@ If user do want to perform above operations on the main table, user can first dr
     
     If user drop the main table, the datamap will be dropped immediately too.
     
    +We do recommend you to use this management for index datamap.
    +
     ### Manual Refresh
     
     When user creates a datamap specifying maunal refresh semantic, the datamap is created with status *disabled* and query will NOT use this datamap until user can issue REBUILD DATAMAP command to build the datamap. For every REBUILD DATAMAP command, system will trigger a full rebuild of the datamap. After rebuild is done, system will change datamap status to *enabled*, so that it can be used in query rewrite.
     
    -For every new data loading, data update, delete, the related datamap will be made *disabled*.
    +For every new data loading, data update, delete, the related datamap will be made *disabled*,
    +which means that the following queries will not benefit from the datamap before it becomes *enable* again.
     
     If the main table is dropped by user, the related datamap will be dropped immediately.
     
    -*Note: If you are creating a datamap on external table, you need to do manual managment of the datamap.*
    +**Note**:
    ++ If you are creating a datamap on external table, you need to do manual managment of the datamap.
    --- End diff --
   
    fixed


---

[GitHub] carbondata issue #2604: [CARBONDATA-2815][Doc] Add documentation for spillin...

Github user CarbonDataQA commented on the issue:

    https://github.com/apache/carbondata/pull/2604
 
    Build Success with Spark 2.1.0, Please check CI http://136.243.101.176:8080/job/ApacheCarbonPRBuilder1/7745/



---

[GitHub] carbondata issue #2604: [CARBONDATA-2815][Doc] Add documentation for spillin...

Github user CarbonDataQA commented on the issue:

    https://github.com/apache/carbondata/pull/2604
 
    Build Success with Spark 2.2.1, Please check CI http://88.99.58.216:8080/job/ApacheCarbonPRBuilder/6470/



---

[GitHub] carbondata issue #2604: [CARBONDATA-2815][Doc] Add documentation for spillin...

Github user ravipesala commented on the issue:

    https://github.com/apache/carbondata/pull/2604
 
    SDV Build Fail , Please check CI http://144.76.159.231:8080/job/ApacheSDVTests/6133/



---

[GitHub] carbondata issue #2604: [CARBONDATA-2815][Doc] Add documentation for spillin...

Github user QiangCai commented on the issue:

    https://github.com/apache/carbondata/pull/2604
 
    LGTM


---

[GitHub] carbondata issue #2604: [CARBONDATA-2815][Doc] Add documentation for spillin...

Github user ravipesala commented on the issue:

    https://github.com/apache/carbondata/pull/2604
 
    SDV Build Fail , Please check CI http://144.76.159.231:8080/job/ApacheSDVTests/6135/



---

[GitHub] carbondata pull request #2604: [CARBONDATA-2815][Doc] Add documentation for ...

Github user asfgit closed the pull request at:

    https://github.com/apache/carbondata/pull/2604


---