Github user CarbonDataQA commented on the issue:
https://github.com/apache/carbondata/pull/2715 Build Success with Spark 2.1.0, Please check CI http://136.243.101.176:8080/job/ApacheCarbonPRBuilder2.1/531/ --- |
In reply to this post by qiuchenjian-2
Github user xuchuanyin commented on the issue:
https://github.com/apache/carbondata/pull/2715 @ravipesala please review this PR --- |
In reply to this post by qiuchenjian-2
Github user ravipesala commented on a diff in the pull request:
https://github.com/apache/carbondata/pull/2715#discussion_r226274512

--- Diff: integration/spark-common-test/src/test/scala/org/apache/carbondata/integration/spark/testsuite/dataload/TestLoadDataWithCompression.scala ---
@@ -42,6 +44,112 @@ case class Rcd(booleanField: Boolean, shortField: Short, intField: Int, bigintFi
    dateField: String, charField: String, floatField: Float, stringDictField: String, stringSortField: String, stringLocalDictField: String, longStringField: String)
+/**
+ * This compressor actually will not compress or decompress anything.
+ * It is used for test case of specifying customized compressor.
+ */
+class CustomizeCompressor extends Compressor {
+  override def getName: String = "org.apache.carbondata.integration.spark.testsuite.dataload.CustomizeCompressor"
--- End diff --

We do not need to maintain the mapping between the short name and the class name ourselves. The user declares the short name inside the implementation class and keeps the jar on the classpath. CarbonData loads all classes that implement the compressor interface and reads the short name from each of them. So the user only has to give the short name in the table properties, and we store only the short name in thrift as well. This is just like the Spark datasource design; please go through it once. Providing a short name would be simpler for the user. --- |
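For illustration only, here is a minimal sketch of the kind of registry this proposal implies: an implementation exposes its own short name, and a lookup keyed by that short name resolves it back to an instance, so table properties and thrift need to store only the short name. The trait and registry below are hypothetical placeholders, not CarbonData's actual Compressor API.

import scala.collection.concurrent.TrieMap

// Hypothetical sketch only: a short-name registry for compressors, not CarbonData code.
trait ShortNamedCompressor {
  def shortName: String                          // e.g. "mycomp", declared by the implementation
  def compress(data: Array[Byte]): Array[Byte]
  def decompress(data: Array[Byte]): Array[Byte]
}

object CompressorRegistry {
  private val byShortName = TrieMap.empty[String, ShortNamedCompressor]

  // Called once per discovered implementation (however discovery is done).
  def register(c: ShortNamedCompressor): Unit =
    byShortName(c.shortName.toLowerCase) = c

  // Table properties / thrift would carry only the short name; readers resolve it here.
  def lookup(shortName: String): ShortNamedCompressor =
    byShortName.getOrElse(shortName.toLowerCase,
      throw new IllegalArgumentException(s"No compressor registered for '$shortName'"))
}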
In reply to this post by qiuchenjian-2
Github user xuchuanyin commented on a diff in the pull request:
https://github.com/apache/carbondata/pull/2715#discussion_r226913303

--- Diff: integration/spark-common-test/src/test/scala/org/apache/carbondata/integration/spark/testsuite/dataload/TestLoadDataWithCompression.scala ---
@@ -42,6 +44,112 @@ case class Rcd(booleanField: Boolean, shortField: Short, intField: Int, bigintFi
    dateField: String, charField: String, floatField: Float, stringDictField: String, stringSortField: String, stringLocalDictField: String, longStringField: String)
+/**
+ * This compressor actually will not compress or decompress anything.
+ * It is used for test case of specifying customized compressor.
+ */
+class CustomizeCompressor extends Compressor {
+  override def getName: String = "org.apache.carbondata.integration.spark.testsuite.dataload.CustomizeCompressor"
--- End diff --

> "Carbondata loads all classes which are implementing compressor class"

How can we figure this out? Spark's implementation still requires the user to specify the whole class name of their custom CompressionCodec. My problem is that we do not know the full class name of the codec when all we can get from the file meta is the short name. --- |
In reply to this post by qiuchenjian-2
Github user ravipesala commented on a diff in the pull request:
https://github.com/apache/carbondata/pull/2715#discussion_r227235560

--- Diff: integration/spark-common-test/src/test/scala/org/apache/carbondata/integration/spark/testsuite/dataload/TestLoadDataWithCompression.scala ---
@@ -42,6 +44,112 @@ case class Rcd(booleanField: Boolean, shortField: Short, intField: Int, bigintFi
    dateField: String, charField: String, floatField: Float, stringDictField: String, stringSortField: String, stringLocalDictField: String, longStringField: String)
+/**
+ * This compressor actually will not compress or decompress anything.
+ * It is used for test case of specifying customized compressor.
+ */
+class CustomizeCompressor extends Compressor {
+  override def getName: String = "org.apache.carbondata.integration.spark.testsuite.dataload.CustomizeCompressor"
--- End diff --

I am not sure what you are referring to, but I am referring to the Spark datasource (FileFormat) design. For example, while creating a carbon table we just mention `using carbon` and never mention the whole implementation class name. Spark also saves only the short name `carbon` in its metadata, not the whole class name. It can get the whole class name from the implementation classes of `FileFormat`. --- |
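As a reference point for this analogy, the following is a simplified sketch (written from memory, not Spark's exact code) of how a short name like `carbon` is resolved for datasources: implementations of Spark's `DataSourceRegister` are listed in META-INF/services files, discovered via `java.util.ServiceLoader`, and matched by their `shortName()`, which yields the full implementation class name.

import java.util.ServiceLoader
import scala.collection.JavaConverters._

import org.apache.spark.sql.sources.DataSourceRegister

// Simplified sketch of Spark's short-name lookup for datasources (not verbatim Spark code).
// ServiceLoader reads META-INF/services/org.apache.spark.sql.sources.DataSourceRegister
// from every jar on the classpath, instantiates the listed providers, and the provider
// whose shortName() matches (e.g. "carbon") gives back the full class name.
object DataSourceShortNames {
  def lookupDataSourceClass(shortName: String): Option[String] = {
    val loader = Thread.currentThread().getContextClassLoader
    ServiceLoader.load(classOf[DataSourceRegister], loader).asScala
      .find(_.shortName().equalsIgnoreCase(shortName))
      .map(_.getClass.getName)
  }
}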
In reply to this post by qiuchenjian-2
Github user xuchuanyin commented on a diff in the pull request:
https://github.com/apache/carbondata/pull/2715#discussion_r227241517

--- Diff: integration/spark-common-test/src/test/scala/org/apache/carbondata/integration/spark/testsuite/dataload/TestLoadDataWithCompression.scala ---
@@ -42,6 +44,112 @@ case class Rcd(booleanField: Boolean, shortField: Short, intField: Int, bigintFi
    dateField: String, charField: String, floatField: Float, stringDictField: String, stringSortField: String, stringLocalDictField: String, longStringField: String)
+/**
+ * This compressor actually will not compress or decompress anything.
+ * It is used for test case of specifying customized compressor.
+ */
+class CustomizeCompressor extends Compressor {
+  override def getName: String = "org.apache.carbondata.integration.spark.testsuite.dataload.CustomizeCompressor"
--- End diff --

I'm referring to the implementation of `CompressionCodec` in Spark. Due to yesterday's GitHub problem, some content of my previous comment was missing. The corresponding code in Spark is here: https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/io/CompressionCodec.scala#L57-L82 For a custom CompressionCodec, line #75 uses the given name directly as the class name. I think that in your example the Spark file format can handle the short name 'carbon' because we are using CarbonSession, and in that session we have registered the short name 'carbon'. --- |
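For readers following the link above, the pattern in that Spark code is roughly the following (a condensed paraphrase, not a verbatim copy): only the built-in codecs get short names via a fixed map, and any other name is treated as a fully qualified class name to be instantiated reflectively, which is why a custom codec must be referred to by its full class name.

import java.util.Locale

// Condensed paraphrase of the short-name handling in Spark's CompressionCodec.scala
// (not a verbatim copy): built-in codecs are keyed by short name; anything else is
// assumed to already be a fully qualified class name.
object CodecNameResolver {
  private val shortCompressionCodecNames = Map(
    "lz4"    -> "org.apache.spark.io.LZ4CompressionCodec",
    "lzf"    -> "org.apache.spark.io.LZFCompressionCodec",
    "snappy" -> "org.apache.spark.io.SnappyCompressionCodec",
    "zstd"   -> "org.apache.spark.io.ZStdCompressionCodec")

  // A custom codec never appears in the map, so its short name cannot be resolved --
  // the caller has to supply the full class name instead.
  def resolveClassName(codecName: String): String =
    shortCompressionCodecNames.getOrElse(codecName.toLowerCase(Locale.ROOT), codecName)
}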
In reply to this post by qiuchenjian-2
Github user xuchuanyin commented on a diff in the pull request:
https://github.com/apache/carbondata/pull/2715#discussion_r228054678

--- Diff: integration/spark-common-test/src/test/scala/org/apache/carbondata/integration/spark/testsuite/dataload/TestLoadDataWithCompression.scala ---
@@ -42,6 +44,112 @@ case class Rcd(booleanField: Boolean, shortField: Short, intField: Int, bigintFi
    dateField: String, charField: String, floatField: Float, stringDictField: String, stringSortField: String, stringLocalDictField: String, longStringField: String)
+/**
+ * This compressor actually will not compress or decompress anything.
+ * It is used for test case of specifying customized compressor.
+ */
+class CustomizeCompressor extends Compressor {
+  override def getName: String = "org.apache.carbondata.integration.spark.testsuite.dataload.CustomizeCompressor"
--- End diff --

@ravipesala Hi, I've communicated with @KanakaKumar and finally understood the background of your proposal. For the carbon Spark datasource (FileFormat), we can write `using carbon` when creating a table instead of specifying the whole class name of SparkCarbonDataSource. So you want the compression codec to be implemented the same way -- using the short name instead of the whole class name. But there are underlying issues that prevent us from doing that -- my concern is that we cannot recover the real class name from the short name alone.

1. In the carbon spark-datasource module, we have a file named 'org.apache.spark.sql.sources.DataSourceRegister' with the content 'org.apache.spark.sql.carbondata.execution.datasources.SparkCarbonFileFormat' under 'resources/META-INF/services'. Spark discovers this file and loads the listed class, and from it gets both the short name and the whole class name of the carbon datasource. If our compression codec were implemented like this, we would have to provide extra provider-configuration files, and carbon would register the codecs after loading and reading them (a sketch follows this message).

2. There is another way: Java provides a way to find all implementations of a given interface. We could use this to find all compression codecs and get their short names as well as their whole class names. But this procedure may take too long, especially since Spark now has so many jars on the classpath.

In conclusion, I think it's OK to accept the current implementation, which uses the whole class name of the compression codec. Spark actually implements it this way too. --- |
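To make option 1 above concrete, here is a hypothetical sketch of the extra plumbing that comment describes; nothing like this ships in CarbonData today. It assumes the Compressor interface lives at org.apache.carbondata.core.datastore.compression.Compressor (as used by the CustomizeCompressor test in the diff) and that users would add a provider-configuration file named after that interface, listing their implementation classes, so codecs could be discovered once and indexed by the name they report.

import java.util.ServiceLoader
import scala.collection.JavaConverters._

import org.apache.carbondata.core.datastore.compression.Compressor

// Hypothetical sketch of option 1: discover Compressor implementations declared in
//   META-INF/services/org.apache.carbondata.core.datastore.compression.Compressor
// and index them by the name returned from getName(). This service file and registry
// do not exist in CarbonData today; they only illustrate the extra plumbing required.
object DiscoveredCompressors {
  lazy val byName: Map[String, Compressor] =
    ServiceLoader.load(classOf[Compressor], Thread.currentThread().getContextClassLoader)
      .asScala
      .map(c => c.getName.toLowerCase -> c)
      .toMap
}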
In reply to this post by qiuchenjian-2
Github user xuchuanyin commented on the issue:
https://github.com/apache/carbondata/pull/2715 retest this please --- |
In reply to this post by qiuchenjian-2
Github user CarbonDataQA commented on the issue:
https://github.com/apache/carbondata/pull/2715 Build Failed with Spark 2.2.1, Please check CI http://95.216.28.178:8080/job/ApacheCarbonPRBuilder1/1402/ --- |
In reply to this post by qiuchenjian-2
Github user CarbonDataQA commented on the issue:
https://github.com/apache/carbondata/pull/2715 Build Failed with Spark 2.1.0, Please check CI http://136.243.101.176:8080/job/ApacheCarbonPRBuilder2.1/1188/ --- |
In reply to this post by qiuchenjian-2
Github user xuchuanyin commented on the issue:
https://github.com/apache/carbondata/pull/2715 rebased with current master --- |
In reply to this post by qiuchenjian-2
Github user CarbonDataQA commented on the issue:
https://github.com/apache/carbondata/pull/2715 Build Success with Spark 2.3.1, Please check CI http://136.243.101.176:8080/job/carbondataprbuilder2.3/9454/ --- |
In reply to this post by qiuchenjian-2
Github user CarbonDataQA commented on the issue:
https://github.com/apache/carbondata/pull/2715 Build Success with Spark 2.2.1, Please check CI http://95.216.28.178:8080/job/ApacheCarbonPRBuilder1/1410/ --- |
In reply to this post by qiuchenjian-2
Github user CarbonDataQA commented on the issue:
https://github.com/apache/carbondata/pull/2715 Build Success with Spark 2.1.0, Please check CI http://136.243.101.176:8080/job/ApacheCarbonPRBuilder2.1/1196/ --- |
In reply to this post by qiuchenjian-2
Github user CarbonDataQA commented on the issue:
https://github.com/apache/carbondata/pull/2715 Build Success with Spark 2.3.1, Please check CI http://136.243.101.176:8080/job/carbondataprbuilder2.3/9460/ --- |
In reply to this post by qiuchenjian-2
Github user ravipesala commented on the issue:
https://github.com/apache/carbondata/pull/2715 LGTM --- |
In reply to this post by qiuchenjian-2
Github user xuchuanyin commented on the issue:
https://github.com/apache/carbondata/pull/2715 retest this please --- |
In reply to this post by qiuchenjian-2
Github user CarbonDataQA commented on the issue:
https://github.com/apache/carbondata/pull/2715 Build Success with Spark 2.1.0, Please check CI http://136.243.101.176:8080/job/ApacheCarbonPRBuilder2.1/1335/ --- |
In reply to this post by qiuchenjian-2
Github user CarbonDataQA commented on the issue:
https://github.com/apache/carbondata/pull/2715 Build Success with Spark 2.3.1, Please check CI http://136.243.101.176:8080/job/carbondataprbuilder2.3/9594/ --- |