[GitHub] carbondata pull request #2864: [CARBONDATA-3041] Optimize load minimum size ...

classic Classic list List threaded Threaded
41 messages Options
123
Reply | Threaded
Open this post in threaded view
|

[GitHub] carbondata pull request #2864: [CARBONDATA-3041] Optimize load minimum size ...

qiuchenjian-2
Github user ndwangsen commented on a diff in the pull request:

    https://github.com/apache/carbondata/pull/2864#discussion_r228708277
 
    --- Diff: docs/ddl-of-carbondata.md ---
    @@ -474,7 +475,22 @@ CarbonData DDL statements are documented here,which includes:
          be later viewed in table description for reference.
     
          ```
    -       TBLPROPERTIES('BAD_RECORD_PATH'='/opt/badrecords'')
    +       TBLPROPERTIES('BAD_RECORD_PATH'='/opt/badrecords')
    +     ```
    +    
    +   - ##### Load minimum data size
    +     This property determines whether to enable node minumun input data size allocation strategy
    +     for data loading.It will make sure that the node load the minimum amount of data there by
    +     reducing number of carbondata files. This property is useful if the size of the input data
    +     files are very small, like 1MB to 256MB. And This property can also be specified
    +     in the load option, the property value only int value is supported.
    +
    +     ```
    +       TBLPROPERTIES('LOAD_MIN_SIZE_INMB'='256 MB')
    --- End diff --
   
    Has been modified based on the review


---
Reply | Threaded
Open this post in threaded view
|

[GitHub] carbondata pull request #2864: [CARBONDATA-3041] Optimize load minimum size ...

qiuchenjian-2
In reply to this post by qiuchenjian-2
Github user ndwangsen commented on a diff in the pull request:

    https://github.com/apache/carbondata/pull/2864#discussion_r228708281
 
    --- Diff: integration/spark2/src/main/scala/org/apache/carbondata/spark/rdd/CarbonDataRDDFactory.scala ---
    @@ -1171,12 +1171,27 @@ object CarbonDataRDDFactory {
           .ensureExecutorsAndGetNodeList(blockList, sqlContext.sparkContext)
         val skewedDataOptimization = CarbonProperties.getInstance()
           .isLoadSkewedDataOptimizationEnabled()
    -    val loadMinSizeOptimization = CarbonProperties.getInstance()
    -      .isLoadMinSizeOptimizationEnabled()
         // get user ddl input the node loads the smallest amount of data
    -    val expectedMinSizePerNode = carbonLoadModel.getLoadMinSize()
    +    val carbonTable = carbonLoadModel.getCarbonDataLoadSchema.getCarbonTable
    +    val loadMinSize = carbonTable.getTableInfo.getFactTable.getTableProperties.asScala
    --- End diff --
   
    Has been modified based on the review


---
Reply | Threaded
Open this post in threaded view
|

[GitHub] carbondata issue #2864: [CARBONDATA-3041] Optimize load minimum size strateg...

qiuchenjian-2
In reply to this post by qiuchenjian-2
Github user CarbonDataQA commented on the issue:

    https://github.com/apache/carbondata/pull/2864
 
    Build Success with Spark 2.1.0, Please check CI http://136.243.101.176:8080/job/ApacheCarbonPRBuilder2.1/1089/



---
Reply | Threaded
Open this post in threaded view
|

[GitHub] carbondata issue #2864: [CARBONDATA-3041] Optimize load minimum size strateg...

qiuchenjian-2
In reply to this post by qiuchenjian-2
Github user CarbonDataQA commented on the issue:

    https://github.com/apache/carbondata/pull/2864
 
    Build Success with Spark 2.2.1, Please check CI http://95.216.28.178:8080/job/ApacheCarbonPRBuilder1/1301/



---
Reply | Threaded
Open this post in threaded view
|

[GitHub] carbondata issue #2864: [CARBONDATA-3041] Optimize load minimum size strateg...

qiuchenjian-2
In reply to this post by qiuchenjian-2
Github user CarbonDataQA commented on the issue:

    https://github.com/apache/carbondata/pull/2864
 
    Build Success with Spark 2.3.1, Please check CI http://136.243.101.176:8080/job/carbondataprbuilder2.3/9353/



---
Reply | Threaded
Open this post in threaded view
|

[GitHub] carbondata pull request #2864: [CARBONDATA-3041] Optimize load minimum size ...

qiuchenjian-2
In reply to this post by qiuchenjian-2
Github user xuchuanyin commented on a diff in the pull request:

    https://github.com/apache/carbondata/pull/2864#discussion_r228777044
 
    --- Diff: integration/spark2/src/main/scala/org/apache/carbondata/spark/rdd/CarbonDataRDDFactory.scala ---
    @@ -1171,21 +1171,25 @@ object CarbonDataRDDFactory {
           .ensureExecutorsAndGetNodeList(blockList, sqlContext.sparkContext)
         val skewedDataOptimization = CarbonProperties.getInstance()
           .isLoadSkewedDataOptimizationEnabled()
    -    val loadMinSizeOptimization = CarbonProperties.getInstance()
    -      .isLoadMinSizeOptimizationEnabled()
         // get user ddl input the node loads the smallest amount of data
    -    val expectedMinSizePerNode = carbonLoadModel.getLoadMinSize()
    -    val blockAssignStrategy = if (skewedDataOptimization) {
    -      CarbonLoaderUtil.BlockAssignmentStrategy.BLOCK_SIZE_FIRST
    -    } else if (loadMinSizeOptimization) {
    +    val carbonTable = carbonLoadModel.getCarbonDataLoadSchema.getCarbonTable
    +    var loadMinSize = carbonLoadModel.getLoadMinSize()
    +    if (loadMinSize == "0" ) {
    --- End diff --
   
    for comparison for string, use String.equals instead of ==


---
Reply | Threaded
Open this post in threaded view
|

[GitHub] carbondata pull request #2864: [CARBONDATA-3041] Optimize load minimum size ...

qiuchenjian-2
In reply to this post by qiuchenjian-2
Github user xuchuanyin commented on a diff in the pull request:

    https://github.com/apache/carbondata/pull/2864#discussion_r228776762
 
    --- Diff: docs/ddl-of-carbondata.md ---
    @@ -474,7 +475,19 @@ CarbonData DDL statements are documented here,which includes:
          be later viewed in table description for reference.
     
          ```
    -       TBLPROPERTIES('BAD_RECORD_PATH'='/opt/badrecords'')
    +       TBLPROPERTIES('BAD_RECORD_PATH'='/opt/badrecords')
    +     ```
    +    
    +   - ##### Load minimum data size
    --- End diff --
   
    You can check it by this
   
    ![image](https://user-images.githubusercontent.com/10445758/47624828-6357b780-db5b-11e8-8807-0bed1e220961.png)



---
Reply | Threaded
Open this post in threaded view
|

[GitHub] carbondata pull request #2864: [CARBONDATA-3041] Optimize load minimum size ...

qiuchenjian-2
In reply to this post by qiuchenjian-2
Github user xuchuanyin commented on a diff in the pull request:

    https://github.com/apache/carbondata/pull/2864#discussion_r228776926
 
    --- Diff: integration/spark-common/src/main/scala/org/apache/carbondata/spark/util/CommonUtil.scala ---
    @@ -833,4 +833,32 @@ object CommonUtil {
           })
         }
       }
    +
    +  /**
    +   * This method will validate single node minimum load data volume of table specified by the user
    +   *
    +   * @param tableProperties table property specified by user
    +   * @param propertyName property name
    +   */
    +  def validateLoadMinSize(tableProperties: Map[String, String], propertyName: String): Unit = {
    +    var size: Integer = 0
    +    if (tableProperties.get(propertyName).isDefined) {
    +      val loadSizeStr: String =
    +        parsePropertyValueStringInMB(tableProperties(propertyName))
    +      try {
    +        size = Integer.parseInt(loadSizeStr)
    +      } catch {
    +        case e: NumberFormatException =>
    +          throw new MalformedCarbonCommandException(s"Invalid $propertyName value found: " +
    +                                                    s"$loadSizeStr, only int value greater " +
    +                                                    s"than 0 is supported.")
    +      }
    +      // if the value is negative, set the value is 0
    +      if(size > 0) {
    +        tableProperties.put(propertyName, loadSizeStr)
    +      } else {
    +        tableProperties.put(propertyName, "0")
    --- End diff --
   
    for the default value '0', please use a static variable instead of magic number.


---
Reply | Threaded
Open this post in threaded view
|

[GitHub] carbondata pull request #2864: [CARBONDATA-3041] Optimize load minimum size ...

qiuchenjian-2
In reply to this post by qiuchenjian-2
Github user xuchuanyin commented on a diff in the pull request:

    https://github.com/apache/carbondata/pull/2864#discussion_r228776708
 
    --- Diff: docs/ddl-of-carbondata.md ---
    @@ -474,7 +475,19 @@ CarbonData DDL statements are documented here,which includes:
          be later viewed in table description for reference.
     
          ```
    -       TBLPROPERTIES('BAD_RECORD_PATH'='/opt/badrecords'')
    +       TBLPROPERTIES('BAD_RECORD_PATH'='/opt/badrecords')
    +     ```
    +    
    +   - ##### Load minimum data size
    --- End diff --
   
    please note that current TOC is incorrect.
   
    ![image](https://user-images.githubusercontent.com/10445758/47624825-4cb16080-db5b-11e8-8074-d87d7c127ebc.png)



---
Reply | Threaded
Open this post in threaded view
|

[GitHub] carbondata pull request #2864: [CARBONDATA-3041] Optimize load minimum size ...

qiuchenjian-2
In reply to this post by qiuchenjian-2
Github user ndwangsen commented on a diff in the pull request:

    https://github.com/apache/carbondata/pull/2864#discussion_r228791841
 
    --- Diff: docs/ddl-of-carbondata.md ---
    @@ -474,7 +475,19 @@ CarbonData DDL statements are documented here,which includes:
          be later viewed in table description for reference.
     
          ```
    -       TBLPROPERTIES('BAD_RECORD_PATH'='/opt/badrecords'')
    +       TBLPROPERTIES('BAD_RECORD_PATH'='/opt/badrecords')
    +     ```
    +    
    +   - ##### Load minimum data size
    --- End diff --
   
    ok, i modify it


---
Reply | Threaded
Open this post in threaded view
|

[GitHub] carbondata pull request #2864: [CARBONDATA-3041] Optimize load minimum size ...

qiuchenjian-2
In reply to this post by qiuchenjian-2
Github user ndwangsen commented on a diff in the pull request:

    https://github.com/apache/carbondata/pull/2864#discussion_r228791974
 
    --- Diff: integration/spark-common/src/main/scala/org/apache/carbondata/spark/util/CommonUtil.scala ---
    @@ -833,4 +833,32 @@ object CommonUtil {
           })
         }
       }
    +
    +  /**
    +   * This method will validate single node minimum load data volume of table specified by the user
    +   *
    +   * @param tableProperties table property specified by user
    +   * @param propertyName property name
    +   */
    +  def validateLoadMinSize(tableProperties: Map[String, String], propertyName: String): Unit = {
    +    var size: Integer = 0
    +    if (tableProperties.get(propertyName).isDefined) {
    +      val loadSizeStr: String =
    +        parsePropertyValueStringInMB(tableProperties(propertyName))
    +      try {
    +        size = Integer.parseInt(loadSizeStr)
    +      } catch {
    +        case e: NumberFormatException =>
    +          throw new MalformedCarbonCommandException(s"Invalid $propertyName value found: " +
    +                                                    s"$loadSizeStr, only int value greater " +
    +                                                    s"than 0 is supported.")
    +      }
    +      // if the value is negative, set the value is 0
    +      if(size > 0) {
    +        tableProperties.put(propertyName, loadSizeStr)
    +      } else {
    +        tableProperties.put(propertyName, "0")
    --- End diff --
   
    ok, i modify it


---
Reply | Threaded
Open this post in threaded view
|

[GitHub] carbondata pull request #2864: [CARBONDATA-3041] Optimize load minimum size ...

qiuchenjian-2
In reply to this post by qiuchenjian-2
Github user ndwangsen commented on a diff in the pull request:

    https://github.com/apache/carbondata/pull/2864#discussion_r228792020
 
    --- Diff: integration/spark2/src/main/scala/org/apache/carbondata/spark/rdd/CarbonDataRDDFactory.scala ---
    @@ -1171,21 +1171,25 @@ object CarbonDataRDDFactory {
           .ensureExecutorsAndGetNodeList(blockList, sqlContext.sparkContext)
         val skewedDataOptimization = CarbonProperties.getInstance()
           .isLoadSkewedDataOptimizationEnabled()
    -    val loadMinSizeOptimization = CarbonProperties.getInstance()
    -      .isLoadMinSizeOptimizationEnabled()
         // get user ddl input the node loads the smallest amount of data
    -    val expectedMinSizePerNode = carbonLoadModel.getLoadMinSize()
    -    val blockAssignStrategy = if (skewedDataOptimization) {
    -      CarbonLoaderUtil.BlockAssignmentStrategy.BLOCK_SIZE_FIRST
    -    } else if (loadMinSizeOptimization) {
    +    val carbonTable = carbonLoadModel.getCarbonDataLoadSchema.getCarbonTable
    +    var loadMinSize = carbonLoadModel.getLoadMinSize()
    +    if (loadMinSize == "0" ) {
    --- End diff --
   
    ok, i modify it


---
Reply | Threaded
Open this post in threaded view
|

[GitHub] carbondata issue #2864: [CARBONDATA-3041] Optimize load minimum size strateg...

qiuchenjian-2
In reply to this post by qiuchenjian-2
Github user CarbonDataQA commented on the issue:

    https://github.com/apache/carbondata/pull/2864
 
    Build Success with Spark 2.1.0, Please check CI http://136.243.101.176:8080/job/ApacheCarbonPRBuilder2.1/1099/



---
Reply | Threaded
Open this post in threaded view
|

[GitHub] carbondata issue #2864: [CARBONDATA-3041] Optimize load minimum size strateg...

qiuchenjian-2
In reply to this post by qiuchenjian-2
Github user CarbonDataQA commented on the issue:

    https://github.com/apache/carbondata/pull/2864
 
    Build Failed with Spark 2.2.1, Please check CI http://95.216.28.178:8080/job/ApacheCarbonPRBuilder1/1310/



---
Reply | Threaded
Open this post in threaded view
|

[GitHub] carbondata issue #2864: [CARBONDATA-3041] Optimize load minimum size strateg...

qiuchenjian-2
In reply to this post by qiuchenjian-2
Github user CarbonDataQA commented on the issue:

    https://github.com/apache/carbondata/pull/2864
 
    Build Success with Spark 2.3.1, Please check CI http://136.243.101.176:8080/job/carbondataprbuilder2.3/9362/



---
Reply | Threaded
Open this post in threaded view
|

[GitHub] carbondata issue #2864: [CARBONDATA-3041] Optimize load minimum size strateg...

qiuchenjian-2
In reply to this post by qiuchenjian-2
Github user ndwangsen commented on the issue:

    https://github.com/apache/carbondata/pull/2864
 
    retest this please



---
Reply | Threaded
Open this post in threaded view
|

[GitHub] carbondata issue #2864: [CARBONDATA-3041] Optimize load minimum size strateg...

qiuchenjian-2
In reply to this post by qiuchenjian-2
Github user CarbonDataQA commented on the issue:

    https://github.com/apache/carbondata/pull/2864
 
    Build Success with Spark 2.1.0, Please check CI http://136.243.101.176:8080/job/ApacheCarbonPRBuilder2.1/1103/



---
Reply | Threaded
Open this post in threaded view
|

[GitHub] carbondata issue #2864: [CARBONDATA-3041] Optimize load minimum size strateg...

qiuchenjian-2
In reply to this post by qiuchenjian-2
Github user CarbonDataQA commented on the issue:

    https://github.com/apache/carbondata/pull/2864
 
    Build Success with Spark 2.2.1, Please check CI http://95.216.28.178:8080/job/ApacheCarbonPRBuilder1/1314/



---
Reply | Threaded
Open this post in threaded view
|

[GitHub] carbondata issue #2864: [CARBONDATA-3041] Optimize load minimum size strateg...

qiuchenjian-2
In reply to this post by qiuchenjian-2
Github user CarbonDataQA commented on the issue:

    https://github.com/apache/carbondata/pull/2864
 
    Build Success with Spark 2.3.1, Please check CI http://136.243.101.176:8080/job/carbondataprbuilder2.3/9366/



---
Reply | Threaded
Open this post in threaded view
|

[GitHub] carbondata issue #2864: [CARBONDATA-3041] Optimize load minimum size strateg...

qiuchenjian-2
In reply to this post by qiuchenjian-2
Github user xuchuanyin commented on the issue:

    https://github.com/apache/carbondata/pull/2864
 
    LGTM


---
123