[GitHub] [carbondata] jackylk opened a new pull request #3606: [WIP] add compressor to file name and change default compressor to zstd

classic Classic list List threaded Threaded
66 messages Options
1234
Reply | Threaded
Open this post in threaded view
|

[GitHub] [carbondata] CarbonDataQA1 commented on issue #3606: [CARBONDATA-3681] Change default compressor to zstd

GitBox
CarbonDataQA1 commented on issue #3606: [CARBONDATA-3681] Change default compressor to zstd
URL: https://github.com/apache/carbondata/pull/3606#issuecomment-583844201
 
 
   Build Failed  with Spark 2.3.4, Please check CI http://121.244.95.60:12545/job/ApacheCarbonPRBuilder2.3/1895/
   

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
[hidden email]


With regards,
Apache Git Services
Reply | Threaded
Open this post in threaded view
|

[GitHub] [carbondata] CarbonDataQA1 commented on issue #3606: [CARBONDATA-3681] Change default compressor to zstd

GitBox
In reply to this post by GitBox
CarbonDataQA1 commented on issue #3606: [CARBONDATA-3681] Change default compressor to zstd
URL: https://github.com/apache/carbondata/pull/3606#issuecomment-583860628
 
 
   Build Success with Spark 2.4.4, Please check CI http://121.244.95.60:12545/job/ApacheCarbon_PR_Builder_2.4.4/200/
   

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
[hidden email]


With regards,
Apache Git Services
Reply | Threaded
Open this post in threaded view
|

[GitHub] [carbondata] CarbonDataQA1 commented on issue #3606: [CARBONDATA-3681] Change default compressor to zstd

GitBox
In reply to this post by GitBox
CarbonDataQA1 commented on issue #3606: [CARBONDATA-3681] Change default compressor to zstd
URL: https://github.com/apache/carbondata/pull/3606#issuecomment-583867782
 
 
   Build Failed  with Spark 2.3.4, Please check CI http://121.244.95.60:12545/job/ApacheCarbonPRBuilder2.3/1902/
   

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
[hidden email]


With regards,
Apache Git Services
Reply | Threaded
Open this post in threaded view
|

[GitHub] [carbondata] CarbonDataQA1 commented on issue #3606: [CARBONDATA-3681] Change default compressor to zstd

GitBox
In reply to this post by GitBox
CarbonDataQA1 commented on issue #3606: [CARBONDATA-3681] Change default compressor to zstd
URL: https://github.com/apache/carbondata/pull/3606#issuecomment-583925032
 
 
   Build Success with Spark 2.4.4, Please check CI http://121.244.95.60:12545/job/ApacheCarbon_PR_Builder_2.4.4/203/
   

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
[hidden email]


With regards,
Apache Git Services
Reply | Threaded
Open this post in threaded view
|

[GitHub] [carbondata] QiangCai commented on issue #3606: [CARBONDATA-3681] Change default compressor to zstd

GitBox
In reply to this post by GitBox
QiangCai commented on issue #3606: [CARBONDATA-3681] Change default compressor to zstd
URL: https://github.com/apache/carbondata/pull/3606#issuecomment-583934608
 
 
   better to do further testing on a cluster, not only a local machine.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
[hidden email]


With regards,
Apache Git Services
Reply | Threaded
Open this post in threaded view
|

[GitHub] [carbondata] CarbonDataQA1 commented on issue #3606: [CARBONDATA-3681] Change default compressor to zstd

GitBox
In reply to this post by GitBox
CarbonDataQA1 commented on issue #3606: [CARBONDATA-3681] Change default compressor to zstd
URL: https://github.com/apache/carbondata/pull/3606#issuecomment-583935130
 
 
   Build Success with Spark 2.3.4, Please check CI http://121.244.95.60:12545/job/ApacheCarbonPRBuilder2.3/1905/
   

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
[hidden email]


With regards,
Apache Git Services
Reply | Threaded
Open this post in threaded view
|

[GitHub] [carbondata] jackylk commented on issue #3606: [CARBONDATA-3681] Change default compressor to zstd

GitBox
In reply to this post by GitBox
jackylk commented on issue #3606: [CARBONDATA-3681] Change default compressor to zstd
URL: https://github.com/apache/carbondata/pull/3606#issuecomment-583959274
 
 
   @QiangCai ok, I will verify on cluster

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
[hidden email]


With regards,
Apache Git Services
Reply | Threaded
Open this post in threaded view
|

[GitHub] [carbondata] jackylk edited a comment on issue #3606: [CARBONDATA-3681] Change default compressor to zstd

GitBox
In reply to this post by GitBox
jackylk edited a comment on issue #3606: [CARBONDATA-3681] Change default compressor to zstd
URL: https://github.com/apache/carbondata/pull/3606#issuecomment-583959274
 
 
   @QiangCai ok, I will verify on cluster, and with old table

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
[hidden email]


With regards,
Apache Git Services
Reply | Threaded
Open this post in threaded view
|

[GitHub] [carbondata] jackylk edited a comment on issue #3606: [CARBONDATA-3681] Change default compressor to zstd

GitBox
In reply to this post by GitBox
jackylk edited a comment on issue #3606: [CARBONDATA-3681] Change default compressor to zstd
URL: https://github.com/apache/carbondata/pull/3606#issuecomment-583959274
 
 
   @QiangCai ok, I will verify on cluster, and with old table created before this PR

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
[hidden email]


With regards,
Apache Git Services
Reply | Threaded
Open this post in threaded view
|

[GitHub] [carbondata] ajantha-bhat commented on a change in pull request #3606: [CARBONDATA-3681] Change default compressor to zstd

GitBox
In reply to this post by GitBox
ajantha-bhat commented on a change in pull request #3606: [CARBONDATA-3681] Change default compressor to zstd
URL: https://github.com/apache/carbondata/pull/3606#discussion_r376948840
 
 

 ##########
 File path: core/src/main/java/org/apache/carbondata/core/util/path/CarbonTablePath.java
 ##########
 @@ -285,17 +286,38 @@ public static String getSegmentPath(String tablePath, String segmentId) {
   }
 
   /**
-   * Gets data file name only with out path
-   *
-   * @param filePartNo          data file part number
-   * @param taskNo              task identifier
-   * @param factUpdateTimeStamp unique identifier to identify an update
-   * @return gets data file name only with out path
+   * Gets data file name only, without parent path
    */
   public static String getCarbonDataFileName(Integer filePartNo, String taskNo, int bucketNumber,
-      int batchNo, String factUpdateTimeStamp, String segmentNo) {
-    return DATA_PART_PREFIX + filePartNo + "-" + taskNo + BATCH_PREFIX + batchNo + "-"
-        + bucketNumber + "-" + segmentNo + "-" + factUpdateTimeStamp + CARBON_DATA_EXT;
+      int batchNo, String factUpdateTimeStamp, String segmentNo, String compressor) {
+    Objects.requireNonNull(filePartNo);
+    Objects.requireNonNull(taskNo);
+    Objects.requireNonNull(factUpdateTimeStamp);
+    Objects.requireNonNull(compressor);
+
+    // Start from CarbonData 2.0, the data file name patten is:
+    // partNo-taskNo-batchNo-bucketNo-segmentNo-timestamp.compressor.carbondata
+    // For example:
+    // part-0-0_batchno0-0-0-1580982686749.zstd.carbondata
+    //
+    // If the compressor name is missing, the file is compressed by snappy, which is
+    // the default compressor in CarbonData 1.x
+
+    return new StringBuilder()
+        .append(DATA_PART_PREFIX)
+        .append(filePartNo)
+        .append("-")
+        .append(taskNo)
+        .append(BATCH_PREFIX)
+        .append(batchNo)
+        .append("-")
+        .append(bucketNumber)
+        .append("-")
+        .append(segmentNo)
+        .append("-")
+        .append(factUpdateTimeStamp)
 
 Review comment:
   supposed to append compressor name here ? I see that argument `compressor` is not used.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
[hidden email]


With regards,
Apache Git Services
Reply | Threaded
Open this post in threaded view
|

[GitHub] [carbondata] ajantha-bhat commented on a change in pull request #3606: [CARBONDATA-3681] Change default compressor to zstd

GitBox
In reply to this post by GitBox
ajantha-bhat commented on a change in pull request #3606: [CARBONDATA-3681] Change default compressor to zstd
URL: https://github.com/apache/carbondata/pull/3606#discussion_r376988998
 
 

 ##########
 File path: integration/spark-common-test/src/test/scala/org/apache/carbondata/spark/testsuite/datacompaction/TableLevelCompactionOptionTest.scala
 ##########
 @@ -139,6 +139,8 @@ class TableLevelCompactionOptionTest extends QueryTest
       sql(s"LOAD DATA LOCAL INPATH '$tempFilePath' INTO TABLE carbon_table")
     }
 
+    sql("SHOW SEGMENTS FOR TABLE carbon_table").show
 
 Review comment:
   revert this or add validation

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
[hidden email]


With regards,
Apache Git Services
Reply | Threaded
Open this post in threaded view
|

[GitHub] [carbondata] ajantha-bhat commented on a change in pull request #3606: [CARBONDATA-3681] Change default compressor to zstd

GitBox
In reply to this post by GitBox
ajantha-bhat commented on a change in pull request #3606: [CARBONDATA-3681] Change default compressor to zstd
URL: https://github.com/apache/carbondata/pull/3606#discussion_r376989476
 
 

 ##########
 File path: integration/spark2/src/main/scala/org/apache/spark/sql/execution/command/table/CarbonDescribeFormattedCommand.scala
 ##########
 @@ -141,11 +141,6 @@ private[sql] case class CarbonDescribeFormattedCommand(
         Strings.formatSize(
           tblProps.getOrElse(CarbonCommonConstants.CARBON_LOAD_MIN_SIZE_INMB,
             CarbonCommonConstants.CARBON_LOAD_MIN_SIZE_INMB_DEFAULT).toFloat), ""),
-      ("Data File Compressor ", tblProps
 
 Review comment:
   why removed this ?

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
[hidden email]


With regards,
Apache Git Services
Reply | Threaded
Open this post in threaded view
|

[GitHub] [carbondata] ajantha-bhat commented on a change in pull request #3606: [CARBONDATA-3681] Change default compressor to zstd

GitBox
In reply to this post by GitBox
ajantha-bhat commented on a change in pull request #3606: [CARBONDATA-3681] Change default compressor to zstd
URL: https://github.com/apache/carbondata/pull/3606#discussion_r376990515
 
 

 ##########
 File path: streaming/src/main/java/org/apache/carbondata/streaming/CarbonStreamRecordWriter.java
 ##########
 @@ -188,7 +190,7 @@ private void initializeAtFirstRow() throws IOException {
       compressorName = carbonTable.getTableInfo().getFactTable().getTableProperties().get(
           CarbonCommonConstants.COMPRESSOR);
       if (null == compressorName) {
-        compressorName = CompressorFactory.getInstance().getCompressor().getName();
+        compressorName = CompressorFactory.NativeSupportedCompressor.SNAPPY.getName();
 
 Review comment:
   why always snappy for streaming segment ? before I think it was supporting configured compressor

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
[hidden email]


With regards,
Apache Git Services
Reply | Threaded
Open this post in threaded view
|

[GitHub] [carbondata] ajantha-bhat commented on a change in pull request #3606: [CARBONDATA-3681] Change default compressor to zstd

GitBox
In reply to this post by GitBox
ajantha-bhat commented on a change in pull request #3606: [CARBONDATA-3681] Change default compressor to zstd
URL: https://github.com/apache/carbondata/pull/3606#discussion_r376990677
 
 

 ##########
 File path: streaming/src/main/java/org/apache/carbondata/streaming/CarbonStreamRecordWriter.java
 ##########
 @@ -140,7 +140,9 @@ private void initialize(TaskAttemptContext job) throws IOException {
 
     segmentDir = CarbonTablePath.getSegmentPath(
         carbonTable.getAbsoluteTableIdentifier().getTablePath(), segmentId);
-    fileName = CarbonTablePath.getCarbonDataFileName(0, taskNo + "", 0, 0, "0", segmentId);
+    fileName = CarbonTablePath.getCarbonDataFileName(
 
 Review comment:
   why always snappy for streaming segment ? before I think it was supporting configured compressor

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
[hidden email]


With regards,
Apache Git Services
Reply | Threaded
Open this post in threaded view
|

[GitHub] [carbondata] jackylk commented on a change in pull request #3606: [CARBONDATA-3681] Change default compressor to zstd

GitBox
In reply to this post by GitBox
jackylk commented on a change in pull request #3606: [CARBONDATA-3681] Change default compressor to zstd
URL: https://github.com/apache/carbondata/pull/3606#discussion_r377100880
 
 

 ##########
 File path: core/src/main/java/org/apache/carbondata/core/util/path/CarbonTablePath.java
 ##########
 @@ -285,17 +286,38 @@ public static String getSegmentPath(String tablePath, String segmentId) {
   }
 
   /**
-   * Gets data file name only with out path
-   *
-   * @param filePartNo          data file part number
-   * @param taskNo              task identifier
-   * @param factUpdateTimeStamp unique identifier to identify an update
-   * @return gets data file name only with out path
+   * Gets data file name only, without parent path
    */
   public static String getCarbonDataFileName(Integer filePartNo, String taskNo, int bucketNumber,
-      int batchNo, String factUpdateTimeStamp, String segmentNo) {
-    return DATA_PART_PREFIX + filePartNo + "-" + taskNo + BATCH_PREFIX + batchNo + "-"
-        + bucketNumber + "-" + segmentNo + "-" + factUpdateTimeStamp + CARBON_DATA_EXT;
+      int batchNo, String factUpdateTimeStamp, String segmentNo, String compressor) {
+    Objects.requireNonNull(filePartNo);
+    Objects.requireNonNull(taskNo);
+    Objects.requireNonNull(factUpdateTimeStamp);
+    Objects.requireNonNull(compressor);
+
+    // Start from CarbonData 2.0, the data file name patten is:
+    // partNo-taskNo-batchNo-bucketNo-segmentNo-timestamp.compressor.carbondata
+    // For example:
+    // part-0-0_batchno0-0-0-1580982686749.zstd.carbondata
+    //
+    // If the compressor name is missing, the file is compressed by snappy, which is
+    // the default compressor in CarbonData 1.x
+
+    return new StringBuilder()
+        .append(DATA_PART_PREFIX)
+        .append(filePartNo)
+        .append("-")
+        .append(taskNo)
+        .append(BATCH_PREFIX)
+        .append(batchNo)
+        .append("-")
+        .append(bucketNumber)
+        .append("-")
+        .append(segmentNo)
+        .append("-")
+        .append(factUpdateTimeStamp)
 
 Review comment:
   yes, I wanted to test once without changing the file name, in order to simulate using new jar to test on old files.
   As the CI is passed, so that the compatibility is ok.
   I will add the compressor name in file name now

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
[hidden email]


With regards,
Apache Git Services
Reply | Threaded
Open this post in threaded view
|

[GitHub] [carbondata] jackylk commented on a change in pull request #3606: [CARBONDATA-3681] Change default compressor to zstd

GitBox
In reply to this post by GitBox
jackylk commented on a change in pull request #3606: [CARBONDATA-3681] Change default compressor to zstd
URL: https://github.com/apache/carbondata/pull/3606#discussion_r377101401
 
 

 ##########
 File path: integration/spark-common-test/src/test/scala/org/apache/carbondata/spark/testsuite/datacompaction/TableLevelCompactionOptionTest.scala
 ##########
 @@ -139,6 +139,8 @@ class TableLevelCompactionOptionTest extends QueryTest
       sql(s"LOAD DATA LOCAL INPATH '$tempFilePath' INTO TABLE carbon_table")
     }
 
+    sql("SHOW SEGMENTS FOR TABLE carbon_table").show
 
 Review comment:
   fixed

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
[hidden email]


With regards,
Apache Git Services
Reply | Threaded
Open this post in threaded view
|

[GitHub] [carbondata] jackylk commented on a change in pull request #3606: [CARBONDATA-3681] Change default compressor to zstd

GitBox
In reply to this post by GitBox
jackylk commented on a change in pull request #3606: [CARBONDATA-3681] Change default compressor to zstd
URL: https://github.com/apache/carbondata/pull/3606#discussion_r377102667
 
 

 ##########
 File path: integration/spark2/src/main/scala/org/apache/spark/sql/execution/command/table/CarbonDescribeFormattedCommand.scala
 ##########
 @@ -141,11 +141,6 @@ private[sql] case class CarbonDescribeFormattedCommand(
         Strings.formatSize(
           tblProps.getOrElse(CarbonCommonConstants.CARBON_LOAD_MIN_SIZE_INMB,
             CarbonCommonConstants.CARBON_LOAD_MIN_SIZE_INMB_DEFAULT).toFloat), ""),
-      ("Data File Compressor ", tblProps
 
 Review comment:
   Because the default compressor changed, so this print is not correct now. And there is no way to know the compressor unless you check the file name. I think this is better since very loading can use different compressor, so we should not show in DESC which is table level

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
[hidden email]


With regards,
Apache Git Services
Reply | Threaded
Open this post in threaded view
|

[GitHub] [carbondata] jackylk commented on a change in pull request #3606: [CARBONDATA-3681] Change default compressor to zstd

GitBox
In reply to this post by GitBox
jackylk commented on a change in pull request #3606: [CARBONDATA-3681] Change default compressor to zstd
URL: https://github.com/apache/carbondata/pull/3606#discussion_r377103094
 
 

 ##########
 File path: streaming/src/main/java/org/apache/carbondata/streaming/CarbonStreamRecordWriter.java
 ##########
 @@ -188,7 +190,7 @@ private void initializeAtFirstRow() throws IOException {
       compressorName = carbonTable.getTableInfo().getFactTable().getTableProperties().get(
           CarbonCommonConstants.COMPRESSOR);
       if (null == compressorName) {
-        compressorName = CompressorFactory.getInstance().getCompressor().getName();
+        compressorName = CompressorFactory.NativeSupportedCompressor.SNAPPY.getName();
 
 Review comment:
   There is testcase failing if set to ZSTD. Need another PR for it

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
[hidden email]


With regards,
Apache Git Services
Reply | Threaded
Open this post in threaded view
|

[GitHub] [carbondata] jackylk commented on a change in pull request #3606: [CARBONDATA-3681] Change default compressor to zstd

GitBox
In reply to this post by GitBox
jackylk commented on a change in pull request #3606: [CARBONDATA-3681] Change default compressor to zstd
URL: https://github.com/apache/carbondata/pull/3606#discussion_r377103325
 
 

 ##########
 File path: streaming/src/main/java/org/apache/carbondata/streaming/CarbonStreamRecordWriter.java
 ##########
 @@ -140,7 +140,9 @@ private void initialize(TaskAttemptContext job) throws IOException {
 
     segmentDir = CarbonTablePath.getSegmentPath(
         carbonTable.getAbsoluteTableIdentifier().getTablePath(), segmentId);
-    fileName = CarbonTablePath.getCarbonDataFileName(0, taskNo + "", 0, 0, "0", segmentId);
+    fileName = CarbonTablePath.getCarbonDataFileName(
 
 Review comment:
   There is testcase failing if set to ZSTD. Need another PR for it

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
[hidden email]


With regards,
Apache Git Services
Reply | Threaded
Open this post in threaded view
|

[GitHub] [carbondata] CarbonDataQA1 commented on issue #3606: [CARBONDATA-3681] Change default compressor to zstd

GitBox
In reply to this post by GitBox
CarbonDataQA1 commented on issue #3606: [CARBONDATA-3681] Change default compressor to zstd
URL: https://github.com/apache/carbondata/pull/3606#issuecomment-584166330
 
 
   Build Success with Spark 2.4.4, Please check CI http://121.244.95.60:12545/job/ApacheCarbon_PR_Builder_2.4.4/216/
   

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
[hidden email]


With regards,
Apache Git Services
1234