[GitHub] carbondata pull request #1632: [CARBONDATA-1839] [DataLoad]Fix bugs in compr...

qiuchenjian-2

[GitHub] carbondata issue #1632: [CARBONDATA-1839] [DataLoad]Fix bugs in compressing ...

Github user CarbonDataQA commented on the issue:

https://github.com/apache/carbondata/pull/1632

Build Success with Spark 2.2.0, Please check CI http://88.99.58.216:8080/job/ApacheCarbonPRBuilder/686/

---

qiuchenjian-2

[GitHub] carbondata issue #1632: [CARBONDATA-1839] [DataLoad]Fix bugs in compressing ...

In reply to this post by qiuchenjian-2

Github user ravipesala commented on the issue:

https://github.com/apache/carbondata/pull/1632

SDV Build Success , Please check CI http://144.76.159.231:8080/job/ApacheSDVTests/2248/

---

qiuchenjian-2

[GitHub] carbondata issue #1632: [CARBONDATA-1839] [DataLoad]Fix bugs in compressing ...

In reply to this post by qiuchenjian-2

Github user xuchuanyin commented on the issue:

https://github.com/apache/carbondata/pull/1632

retest this please

---

qiuchenjian-2

[GitHub] carbondata issue #1632: [CARBONDATA-1839] [DataLoad]Fix bugs in compressing ...

In reply to this post by qiuchenjian-2

Github user CarbonDataQA commented on the issue:

https://github.com/apache/carbondata/pull/1632

Build Failed with Spark 2.2.0, Please check CI http://88.99.58.216:8080/job/ApacheCarbonPRBuilder/694/

---

qiuchenjian-2

[GitHub] carbondata issue #1632: [CARBONDATA-1839] [DataLoad]Fix bugs in compressing ...

In reply to this post by qiuchenjian-2

Github user CarbonDataQA commented on the issue:

https://github.com/apache/carbondata/pull/1632

Build Failed with Spark 2.1.0, Please check CI http://136.243.101.176:8080/job/ApacheCarbonPRBuilder1/1923/

---

qiuchenjian-2

[GitHub] carbondata issue #1632: [CARBONDATA-1839] [DataLoad]Fix bugs in compressing ...

In reply to this post by qiuchenjian-2

Github user xuchuanyin commented on the issue:

https://github.com/apache/carbondata/pull/1632

retest this please

---

qiuchenjian-2

[GitHub] carbondata issue #1632: [CARBONDATA-1839] [DataLoad]Fix bugs in compressing ...

In reply to this post by qiuchenjian-2

Github user xuchuanyin commented on the issue:

https://github.com/apache/carbondata/pull/1632

retest this please

---

qiuchenjian-2

[GitHub] carbondata issue #1632: [CARBONDATA-1839] [DataLoad]Fix bugs in compressing ...

In reply to this post by qiuchenjian-2

Github user CarbonDataQA commented on the issue:

https://github.com/apache/carbondata/pull/1632

Build Success with Spark 2.1.0, Please check CI http://136.243.101.176:8080/job/ApacheCarbonPRBuilder1/1949/

---

qiuchenjian-2

[GitHub] carbondata pull request #1632: [CARBONDATA-1839] [DataLoad]Fix bugs in compr...

In reply to this post by qiuchenjian-2

Github user manishgupta88 commented on a diff in the pull request:

https://github.com/apache/carbondata/pull/1632#discussion_r156919460

--- Diff: core/src/main/java/org/apache/carbondata/core/util/NonDictionaryUtil.java ---
@@ -108,60 +105,21 @@ public static Object getMeasure(int index, Object[] row) {
return measures[index];
}

- public static byte[] getByteArrayForNoDictionaryCols(Object[] row) {
-
- return (byte[]) row[WriteStepRowUtil.NO_DICTIONARY_AND_COMPLEX];
+ /**
+ * Method to get the required non-dictionary & complex from 3-parted row
+ * @param index
+ * @param row
+ * @return
+ */
+ public static byte[] getNonDictOrComplex(int index, Object[] row) {
--- End diff --

Rename the method to getNoDictOrComplex
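
For context, here is a minimal sketch of what an index-based accessor with the suggested name could look like. It assumes the 3-part row keeps all no-dictionary and complex columns as a `byte[][]` in a single slot. The class name `ThreePartRowSketch`, the slot constants and their values, and the row layout are assumptions for illustration only; the diff above does not show the method body.

```java
import java.nio.charset.StandardCharsets;

/** Illustrative only: a toy 3-part row accessor, not the CarbonData implementation. */
final class ThreePartRowSketch {

  // Assumed slot positions inside a 3-part row (hypothetical values).
  static final int DICTIONARY_DIMENSION = 0;
  static final int NO_DICTIONARY_AND_COMPLEX = 1;
  static final int MEASURE = 2;

  /** Returns the no-dictionary or complex column at the given index of a 3-part row. */
  static byte[] getNoDictOrComplex(int index, Object[] row) {
    byte[][] noDictAndComplex = (byte[][]) row[NO_DICTIONARY_AND_COMPLEX];
    return noDictAndComplex[index];
  }

  public static void main(String[] args) {
    // Build a toy 3-part row: dictionary keys, no-dictionary/complex bytes, measures.
    Object[] row = new Object[3];
    row[DICTIONARY_DIMENSION] = new int[] {7};
    row[NO_DICTIONARY_AND_COMPLEX] = new byte[][] {
        "alpha".getBytes(StandardCharsets.UTF_8),
        "beta".getBytes(StandardCharsets.UTF_8)
    };
    row[MEASURE] = new Object[] {1.5};

    byte[] second = getNoDictOrComplex(1, row);
    System.out.println(new String(second, StandardCharsets.UTF_8));
  }
}
```

Compared with the removed `getByteArrayForNoDictionaryCols(Object[] row)`, which returned the whole slot as one `byte[]`, an accessor of this shape selects a single column by index.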

---

qiuchenjian-2

[GitHub] carbondata pull request #1632: [CARBONDATA-1839] [DataLoad]Fix bugs in compr...

In reply to this post by qiuchenjian-2

Github user manishgupta88 commented on a diff in the pull request:

https://github.com/apache/carbondata/pull/1632#discussion_r156954293

--- Diff: integration/spark-common/src/main/scala/org/apache/carbondata/spark/load/DataLoadProcessBuilderOnSpark.scala ---
@@ -121,17 +121,18 @@ object DataLoadProcessBuilderOnSpark {
CarbonProperties.getInstance().getGlobalSortRddStorageLevel()))
}

+ val sortStepRowConverter: SortStepRowHandler = new SortStepRowHandler(sortParameters)
import scala.reflect.classTag
+
+ // 3. sort
val sortRDD = convertRDD
- .sortBy(_.getData, numPartitions = numPartitions)(RowOrdering, classTag[Array[AnyRef]])
- .mapPartitionsWithIndex { case (index, rows) =>
- DataLoadProcessorStepOnSpark.convertTo3Parts(rows, index, modelBroadcast,
- sortStepRowCounter)
- }
+ .map(r => DataLoadProcessorStepOnSpark.convertTo3Parts(r, TaskContext.getPartitionId(),
+ modelBroadcast, sortStepRowConverter, sortStepRowCounter))
+ .sortBy(r => r.getData, numPartitions = numPartitions)(RowOrdering, classTag[Array[AnyRef]])
--- End diff --

@xuchuanyin ...
This PR is for compressing sort temp files, but this code modification is in the data load flow that uses global sort, which does not create sort temp files. Can you please clarify?

---

qiuchenjian-2

[GitHub] carbondata pull request #1632: [CARBONDATA-1839] [DataLoad]Fix bugs in compr...

In reply to this post by qiuchenjian-2

Github user xuchuanyin commented on a diff in the pull request:

https://github.com/apache/carbondata/pull/1632#discussion_r157109850

--- Diff: core/src/main/java/org/apache/carbondata/core/util/NonDictionaryUtil.java ---
@@ -108,60 +105,21 @@ public static Object getMeasure(int index, Object[] row) {
return measures[index];
}

- public static byte[] getByteArrayForNoDictionaryCols(Object[] row) {
-
- return (byte[]) row[WriteStepRowUtil.NO_DICTIONARY_AND_COMPLEX];
+ /**
+ * Method to get the required non-dictionary & complex from 3-parted row
+ * @param index
+ * @param row
+ * @return
+ */
+ public static byte[] getNonDictOrComplex(int index, Object[] row) {
--- End diff --

OK~

---

qiuchenjian-2

[GitHub] carbondata pull request #1632: [CARBONDATA-1839] [DataLoad]Fix bugs in compr...

In reply to this post by qiuchenjian-2

Github user xuchuanyin commented on a diff in the pull request:

https://github.com/apache/carbondata/pull/1632#discussion_r157112148

--- Diff: integration/spark-common/src/main/scala/org/apache/carbondata/spark/load/DataLoadProcessBuilderOnSpark.scala ---
@@ -121,17 +121,18 @@ object DataLoadProcessBuilderOnSpark {
CarbonProperties.getInstance().getGlobalSortRddStorageLevel()))
}

+ val sortStepRowConverter: SortStepRowHandler = new SortStepRowHandler(sortParameters)
import scala.reflect.classTag
+
+ // 3. sort
val sortRDD = convertRDD
- .sortBy(_.getData, numPartitions = numPartitions)(RowOrdering, classTag[Array[AnyRef]])
- .mapPartitionsWithIndex { case (index, rows) =>
- DataLoadProcessorStepOnSpark.convertTo3Parts(rows, index, modelBroadcast,
- sortStepRowCounter)
- }
+ .map(r => DataLoadProcessorStepOnSpark.convertTo3Parts(r, TaskContext.getPartitionId(),
+ modelBroadcast, sortStepRowConverter, sortStepRowCounter))
+ .sortBy(r => r.getData, numPartitions = numPartitions)(RowOrdering, classTag[Array[AnyRef]])
--- End diff --

This code change does not involve sort temp files. I changed it because the interface and the internal load procedure have changed.

After `convertRDD`, each row is still a raw row. In the sort phase, rows are converted to 3-part rows; in the write phase, rows are encoded and written.

In the previous implementation, CarbonData sorted the raw rows and then converted them to 3-part rows in batches.

In the current implementation, CarbonData first converts each row to a 3-part row and then sorts the converted rows.

The interface for converting a raw row to a 3-part row (`DataLoadProcessorStepOnSpark.convertTo3Parts`) has changed accordingly: it previously handled a batch of rows and now handles one row at a time.
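
The two orderings described above can be illustrated with a small, Spark-free sketch. Everything in it is hypothetical: the class `SortOrderSketch`, the toy raw-row layout (one dictionary key, one no-dictionary string, one measure), and the local `convertTo3Parts` helper stand in for the real load steps and are not CarbonData code. It only shows that sorting raw rows and then converting them yields the same order as converting each row first and then sorting the 3-part rows.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

/** Toy model of the two orderings; names and row layout are assumptions. */
final class SortOrderSketch {

  /** Hypothetical raw row: {dictionary key, no-dictionary value, measure}. */
  static Object[] rawRow(int dictKey, String noDict, double measure) {
    return new Object[] {dictKey, noDict, measure};
  }

  /** Hypothetical per-row conversion to {int[] dict, byte[][] noDict, Object[] measures}. */
  static Object[] convertTo3Parts(Object[] raw) {
    return new Object[] {
        new int[] {(Integer) raw[0]},
        new byte[][] {((String) raw[1]).getBytes(StandardCharsets.UTF_8)},
        new Object[] {raw[2]}
    };
  }

  /** Previous flow: sort the raw rows, then convert the sorted rows as a batch. */
  static List<Object[]> sortThenConvert(List<Object[]> rows) {
    List<Object[]> sorted = new ArrayList<>(rows);
    sorted.sort(Comparator.comparingInt((Object[] r) -> (Integer) r[0]));
    List<Object[]> out = new ArrayList<>();
    for (Object[] r : sorted) {
      out.add(convertTo3Parts(r));
    }
    return out;
  }

  /** Current flow: convert each row on its own, then sort the 3-part rows. */
  static List<Object[]> convertThenSort(List<Object[]> rows) {
    List<Object[]> converted = new ArrayList<>();
    for (Object[] r : rows) {
      converted.add(convertTo3Parts(r));
    }
    converted.sort(Comparator.comparingInt((Object[] r) -> ((int[]) r[0])[0]));
    return converted;
  }

  /** Extracts the single dictionary key of every 3-part row, in list order. */
  static int[] dictKeys(List<Object[]> threePartRows) {
    int[] keys = new int[threePartRows.size()];
    for (int i = 0; i < keys.length; i++) {
      keys[i] = ((int[]) threePartRows.get(i)[0])[0];
    }
    return keys;
  }

  public static void main(String[] args) {
    List<Object[]> rows = Arrays.asList(
        rawRow(3, "c", 1.0), rawRow(1, "a", 2.0), rawRow(2, "b", 3.0));
    System.out.println(Arrays.toString(dictKeys(sortThenConvert(rows))));
    System.out.println(Arrays.toString(dictKeys(convertThenSort(rows))));
  }
}
```

In this toy model the only difference between the flows is which row shape the comparator sees, which is why the conversion helper takes a single row in the second flow.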

---

qiuchenjian-2

[GitHub] carbondata issue #1632: [CARBONDATA-1839] [DataLoad]Fix bugs in compressing ...

In reply to this post by qiuchenjian-2

Github user xuchuanyin commented on the issue:

https://github.com/apache/carbondata/pull/1632

@manishgupta88 review comments are resolved

---

qiuchenjian-2

[GitHub] carbondata issue #1632: [CARBONDATA-1839] [DataLoad]Fix bugs in compressing ...

In reply to this post by qiuchenjian-2

Github user ravipesala commented on the issue:

https://github.com/apache/carbondata/pull/1632

SDV Build Success , Please check CI http://144.76.159.231:8080/job/ApacheSDVTests/2304/

---

qiuchenjian-2

[GitHub] carbondata issue #1632: [CARBONDATA-1839] [DataLoad]Fix bugs in compressing ...

In reply to this post by qiuchenjian-2

Github user xuchuanyin commented on the issue:

https://github.com/apache/carbondata/pull/1632

@manishgupta88 @jackylk Hi, what do you think about this PR? I raised a discussion about it and would prefer another method.

Please refer to this: http://apache-carbondata-dev-mailing-list-archive.1130556.n5.nabble.com/Discussion-Compression-for-sort-temp-files-in-Carbomdata-td31747.html

OR refer to this: https://issues.apache.org/jira/browse/CARBONDATA-1839

---
