[GitHub] carbondata pull request #1352: [CARBONDATA-1174] Streaming Ingestion - schem...

[GitHub] carbondata pull request #1352: [CARBONDATA-1174] Streaming Ingestion - schem...

GitHub user aniketadnaik opened a pull request:

    https://github.com/apache/carbondata/pull/1352

    [CARBONDATA-1174] Streaming Ingestion - schema validation and streaming examples

    - Description:
    This change is mainly targeted at the "streaming_ingest" development branch. The following changes are added on top of the previous framework changes (PR-1064):
    1. Schema validation of input data from a file source when a schema is specified. We validate the source schema against the existing table schema (see the sketch after this list). For a socket source, no schema validation is required since there is no schema attached to it.
    2. Added streaming examples for file-stream and socket-stream sources:
    CarbonStreamingIngestFileSourceExample.scala, CarbonStreamingIngestSocketSourceExample.scala
     These examples are added to facilitate development, to help understand and analyze the code flow. The examples will run in their entirety once carbondata is able to write into the carbondata file format.
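
    As a rough illustration of the validation in item 1, here is a minimal, self-contained sketch. The real logic lives in CarbonSource.prepareWrite and reads the table schema from the table's Metadata/schema file; the column names and expected type list below are hypothetical.

        import org.apache.spark.sql.types._

        object SchemaValidationSketch {

          // Map Spark SQL data types to carbon type names, mirroring the mapping in this PR
          // (note that FloatType maps to DOUBLE there).
          def toCarbonType(dt: DataType): String = dt match {
            case IntegerType   => "INT"
            case StringType    => "STRING"
            case DoubleType    => "DOUBLE"
            case FloatType     => "DOUBLE"
            case LongType      => "LONG"
            case ShortType     => "SHORT"
            case DateType      => "DATE"
            case TimestampType => "TIMESTAMP"
            case other         => other.simpleString.toUpperCase
          }

          def main(args: Array[String]): Unit = {
            // Schema attached to the incoming stream (hypothetical columns)
            val streamSchema = new StructType()
              .add("id", IntegerType)
              .add("name", StringType)
              .add("city", StringType)
              .add("salary", FloatType)

            // Column types of the existing carbon table, in schema order
            // (hard-coded here; the real code reads them from the schema file)
            val tableTypes = Seq("INT", "STRING", "STRING", "DOUBLE")

            val streamTypes = streamSchema.fields.map(f => toCarbonType(f.dataType)).toSeq
            val isValid = streamTypes == tableTypes
            println(s"schema valid: $isValid")  // the real path throws InvalidSchemaException instead
          }
        }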
   
    - Whether new unit test cases have been added or why no new tests are required?
      Yes, a new unit test for schema validation has been added.
   
    - What manual testing you have done?
    $> mvn clean -Pspark-2.1 -Dspark.version=2.1.0  verify
    [INFO] ------------------------------------------------------------------------
    [INFO] Reactor Summary:
    [INFO]
    [INFO] Apache CarbonData :: Parent ........................ SUCCESS [  1.320 s]
    [INFO] Apache CarbonData :: Common ........................ SUCCESS [  1.509 s]
    [INFO] Apache CarbonData :: Core .......................... SUCCESS [ 26.109 s]
    [INFO] Apache CarbonData :: Processing .................... SUCCESS [  4.892 s]
    [INFO] Apache CarbonData :: Hadoop ........................ SUCCESS [  8.910 s]
    [INFO] Apache CarbonData :: Spark Common .................. SUCCESS [ 13.876 s]
    [INFO] Apache CarbonData :: Spark2 ........................ SUCCESS [02:29 min]
    [INFO] Apache CarbonData :: Spark Common Test ............. SUCCESS [07:06 min]
    [INFO] Apache CarbonData :: Assembly ...................... SUCCESS [  1.724 s]
    [INFO] Apache CarbonData :: Flink Examples ................ SUCCESS [  2.480 s]
    [INFO] Apache CarbonData :: Hive .......................... SUCCESS [  4.776 s]
    [INFO] Apache CarbonData :: presto ........................ SUCCESS [  5.786 s]
    [INFO] Apache CarbonData :: Spark2 Examples ............... SUCCESS [  4.957 s]
    [INFO] ------------------------------------------------------------------------
    [INFO] BUILD SUCCESS
    [INFO] ------------------------------------------------------------------------
    [INFO] Total time: 10:52 min
    [INFO] Finished at: 2017-09-12T10:50:40-07:00
    [INFO] Final Memory: 119M/1223M
    [INFO] ------------------------------------------------------------------------
   
      $> mvn clean verify
      [INFO] ------------------------------------------------------------------------
      [INFO] Reactor Summary:
      [INFO]
      [INFO] Apache CarbonData :: Parent ........................ SUCCESS [  6.925 s]
      [INFO] Apache CarbonData :: Common ........................ SUCCESS [ 10.383 s]
      [INFO] Apache CarbonData :: Core .......................... SUCCESS [02:07 min]
      [INFO] Apache CarbonData :: Processing .................... SUCCESS [ 21.376 s]
      [INFO] Apache CarbonData :: Hadoop ........................ SUCCESS [ 18.568 s]
      [INFO] Apache CarbonData :: Spark Common .................. SUCCESS [01:03 min]
      [INFO] Apache CarbonData :: Spark ......................... SUCCESS [04:34 min]
      [INFO] Apache CarbonData :: Spark Common Test ............. SUCCESS [24:33 min]
      [INFO] Apache CarbonData :: Assembly ...................... SUCCESS [  8.661 s]
      [INFO] Apache CarbonData :: Spark Examples ................ SUCCESS [ 22.520 s]
      [INFO] Apache CarbonData :: Flink Examples ................ SUCCESS [  6.592 s]
      [INFO] ------------------------------------------------------------------------
      [INFO] BUILD SUCCESS
      [INFO] ------------------------------------------------------------------------
      [INFO] Total time: 33:55 min
      [INFO] Finished at: 2017-09-12T08:12:30-07:00
      [INFO] Final Memory: 62M/298M
      [INFO] ------------------------------------------------------------------------
        * Made sure write-path class invocation and schema validation happen correctly with Spark Structured Streaming (2.1) and a Parquet file source
        * Made sure the write-path execution workflow works with Structured Streaming (2.1) for both socket and file sources
   
    - Any additional information to help reviewers in testing this change.
    For an invalid schema, carbondata throws an exception and no record writer is instantiated. This is a first level of validation of the input streaming data at the CarbonSource entry point; another level of input data validation happens in the carbon load path anyway.
    Some file sources allow the schema to be inferred if "spark.sql.streaming.schemaInference" is set to true and no explicit schema is specified. In such a case we validate against the inferred schema. Carbondata also provides inferSchema functionality when a table path is provided. The inferSchema() functionality is used in the read path (readStream) and will become applicable when the read-path functionality is implemented.
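
    For context, here is a minimal sketch (runnable in spark-shell, with illustrative paths) of how a file-source stream can rely on that inference instead of an explicit schema:

        import org.apache.spark.sql.SparkSession

        // With spark.sql.streaming.schemaInference enabled, a CSV file source can be
        // read as a stream without an explicit .schema(...) call; the path is illustrative.
        val spark = SparkSession.builder()
          .master("local")
          .appName("SchemaInferenceSketch")
          .config("spark.sql.streaming.schemaInference", "true")
          .getOrCreate()

        val inferredDF = spark.readStream
          .format("csv")
          .option("header", "true")
          .load("/tmp/csvDataDir")

        inferredDF.printSchema()  // schema inferred from files already present in the directory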

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/aniketadnaik/carbondataStreamIngest streamIngest-1174

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/carbondata/pull/1352.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #1352
   
----
commit ac4f9c2a3bd0c7e7569bde9ce797abcb424222a4
Author: Aniket Adnaik <[hidden email]>
Date:   2017-09-08T00:28:00Z

    [CARBONDATA-1174] Streaming Ingestion - Schema validation and Examples

commit 8e710b8b5265cc1b3db52deecfae2086cb46993b
Author: Aniket Adnaik <[hidden email]>
Date:   2017-09-09T00:32:02Z

    [CARBONDATA-1174] Streaming Ingestion - schema validation and streaming examples

commit 991d12aa0ec8ec58a5763f28ef6260c668b1f1c4
Author: Aniket Adnaik <[hidden email]>
Date:   2017-09-09T00:32:39Z

    [CARBONDATA-1174] Streaming Ingestion - schema validation and streaming examples

commit 61d283ef63faabdd97e90d0c5f6d862f073c5b2b
Author: Aniket Adnaik <[hidden email]>
Date:   2017-09-10T00:54:03Z

    [CARBONDATA-1174] Streaming Ingestion - schema validation and streaming examples

commit 6e24d4fa1af90bd61a4c1bb5bf80321135761973
Author: Aniket Adnaik <[hidden email]>
Date:   2017-09-12T01:59:48Z

    [CARBONDATA-1174] Streaming Ingestion - schema validation and streaming examples

commit 84fb1b76ce319841721db0ed8ef719b16d6c9acf
Author: Aniket Adnaik <[hidden email]>
Date:   2017-09-12T08:53:07Z

    [CARBONDATA-1174] Streaming Ingestion - Schema validation and Examples

commit 97646ae45defa1d09bcefa04ddd0497e9238e8fa
Author: Aniket Adnaik <[hidden email]>
Date:   2017-09-12T14:36:54Z

     [CARBONDATA-1174] Streaming Ingestion - Schema validation and Examples

----


---

[GitHub] carbondata issue #1352: [CARBONDATA-1174] Streaming Ingestion - schema valid...

Github user CarbonDataQA commented on the issue:

    https://github.com/apache/carbondata/pull/1352
 
    Can one of the admins verify this patch?


---

[GitHub] carbondata issue #1352: [CARBONDATA-1174] Streaming Ingestion - schema valid...

Github user ravipesala commented on the issue:

    https://github.com/apache/carbondata/pull/1352
 
    Can one of the admins verify this patch?


---

[GitHub] carbondata issue #1352: [CARBONDATA-1174] Streaming Ingestion - schema valid...

Github user ravipesala commented on the issue:

    https://github.com/apache/carbondata/pull/1352
 
    Can one of the admins verify this patch?


---

[GitHub] carbondata issue #1352: [CARBONDATA-1174] Streaming Ingestion - schema valid...

Github user ravipesala commented on the issue:

    https://github.com/apache/carbondata/pull/1352
 
    Can one of the admins verify this patch?


---

[GitHub] carbondata pull request #1352: [CARBONDATA-1174] Streaming Ingestion - schem...

Github user jackylk commented on a diff in the pull request:

    https://github.com/apache/carbondata/pull/1352#discussion_r138548603
 
    --- Diff: examples/spark2/src/main/scala/org/apache/carbondata/examples/streaming/CarbonStreamingIngestFileSourceExample.scala ---
    @@ -0,0 +1,132 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.carbondata.examples
    +
    +import java.io.File
    +
    +import org.apache.commons.lang.RandomStringUtils
    +import org.apache.spark.sql.{SaveMode, SparkSession}
    +
    +import org.apache.carbondata.core.constants.CarbonCommonConstants
    +import org.apache.carbondata.core.util.CarbonProperties
    +import org.apache.carbondata.examples.utils.StreamingCleanupUtil
    +
    +object CarbonStreamingIngestFileSourceExample {
    --- End diff --
   
     Can you add a description to this example that briefly explains what it does?


---

[GitHub] carbondata pull request #1352: [CARBONDATA-1174] Streaming Ingestion - schem...

Github user jackylk commented on a diff in the pull request:

    https://github.com/apache/carbondata/pull/1352#discussion_r138549147
 
    --- Diff: examples/spark2/src/main/scala/org/apache/carbondata/examples/streaming/CarbonStreamingIngestFileSourceExample.scala ---
    @@ -0,0 +1,132 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.carbondata.examples
    +
    +import java.io.File
    +
    +import org.apache.commons.lang.RandomStringUtils
    +import org.apache.spark.sql.{SaveMode, SparkSession}
    +
    +import org.apache.carbondata.core.constants.CarbonCommonConstants
    +import org.apache.carbondata.core.util.CarbonProperties
    +import org.apache.carbondata.examples.utils.StreamingCleanupUtil
    +
    +object CarbonStreamingIngestFileSourceExample {
    +
    +  def main(args: Array[String]) {
    +
    +    val rootPath = new File(this.getClass.getResource("/").getPath
    +      + "../../../..").getCanonicalPath
    +    val storeLocation = s"$rootPath/examples/spark2/target/store"
    +    val warehouse = s"$rootPath/examples/spark2/target/warehouse"
    +    val metastoredb = s"$rootPath/examples/spark2/target"
    +    val csvDataDir = s"$rootPath/examples/spark2/resources/csvDataDir"
    +    // val csvDataFile = s"$csvDataDir/sampleData.csv"
    +    // val csvDataFile = s"$csvDataDir/sample.csv"
    +    val streamTableName = s"_carbon_file_stream_table_"
    +    val stremTablePath = s"$storeLocation/default/$streamTableName"
    +    val ckptLocation = s"$rootPath/examples/spark2/resources/ckptDir"
    +
    +    CarbonProperties.getInstance()
    +      .addProperty(CarbonCommonConstants.CARBON_TIMESTAMP_FORMAT, "yyyy/MM/dd")
    +
    +    // cleanup any residual files
    +    StreamingCleanupUtil.main(Array(csvDataDir, ckptLocation))
    +
    +    import org.apache.spark.sql.CarbonSession._
    +    val spark = SparkSession
    +      .builder()
    +      .master("local")
    +      .appName("CarbonFileStreamingExample")
    +      .config("spark.sql.warehouse.dir", warehouse)
    +      .getOrCreateCarbonSession(storeLocation, metastoredb)
    +
    +    spark.sparkContext.setLogLevel("ERROR")
    +
    +    // Writes Dataframe to CarbonData file:
    +    import spark.implicits._
    +    import org.apache.spark.sql.types._
    +
    +    // Generate random data
    +    val dataDF = spark.sparkContext.parallelize(1 to 10)
    +      .map(id => (id, "name_ABC", "city_XYZ", 10000.00*id)).
    +      toDF("id", "name", "city", "salary")
    +
    +    // drop table if exists previously
    +    spark.sql(s"DROP TABLE IF EXISTS ${streamTableName}")
    +
    +    // Create Carbon Table
    +    // Saves dataframe to carbondata file
    +    dataDF.write
    +      .format("carbondata")
    +      .option("tableName", streamTableName)
    +      .option("compress", "true")
    +      .option("tempCSV", "false")
    +      .mode(SaveMode.Overwrite)
    +      .save()
    +
    +    spark.sql(s""" SELECT * FROM ${streamTableName} """).show()
    +
    +    // Create csv data frame file
    +    val csvDataDF = spark.sparkContext.parallelize(11 to 30)
    +      .map(id => (id,
    +        s"name_${RandomStringUtils.randomAlphabetic(4).toUpperCase}",
    +        s"city_${RandomStringUtils.randomAlphabetic(2).toUpperCase}",
    +        10000.00*id)).toDF("id", "name", "city", "salary")
    +
    +    // write data into csv file ( It will be used as a stream source)
    +    csvDataDF.write.
    +      format("com.databricks.spark.csv").
    --- End diff --
   
     You are using spark2, right? In spark2 the CSV data source is built into Spark, so there is no need to use the `com.databricks.spark.csv` package. Just use `df.write.csv()`.
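
     A short sketch of that suggestion (runnable in spark-shell; the output path is illustrative):

         import org.apache.spark.sql.SparkSession

         // Spark 2.x ships CSV support built in, so no external package is needed.
         val spark = SparkSession.builder().master("local").appName("CsvWriteSketch").getOrCreate()
         import spark.implicits._

         val csvDataDF = Seq((11, "name_ABC", "city_XYZ", 110000.0)).toDF("id", "name", "city", "salary")
         csvDataDF.write
           .option("header", "true")
           .mode("overwrite")
           .csv("/tmp/csvDataDir")  // illustrative output directory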


---

[GitHub] carbondata pull request #1352: [CARBONDATA-1174] Streaming Ingestion - schem...

Github user jackylk commented on a diff in the pull request:

    https://github.com/apache/carbondata/pull/1352#discussion_r138549504
 
    --- Diff: examples/spark2/src/main/scala/org/apache/carbondata/examples/utils/StreamingCleanupUtil.scala ---
    @@ -0,0 +1,45 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.carbondata.examples.utils
    +
    +import java.io.IOException
    +
    +import scala.tools.nsc.io.Path
    +
    +// scalastyle:off println
    +object StreamingCleanupUtil {
    --- End diff --
   
     Is this used somewhere in the example, or does it need to be run manually?


---

[GitHub] carbondata pull request #1352: [CARBONDATA-1174] Streaming Ingestion - schem...

Github user jackylk commented on a diff in the pull request:

    https://github.com/apache/carbondata/pull/1352#discussion_r138550083
 
    --- Diff: examples/spark2/src/main/scala/org/apache/carbondata/examples/utils/StreamingCleanupUtil.scala ---
    @@ -0,0 +1,45 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.carbondata.examples.utils
    +
    +import java.io.IOException
    +
    +import scala.tools.nsc.io.Path
    +
    +// scalastyle:off println
    +object StreamingCleanupUtil {
    --- End diff --
   
     I think it is better to move the CSV data generation to this utility object and rename it to `StreamingExampleUtil`.


---

[GitHub] carbondata pull request #1352: [CARBONDATA-1174] Streaming Ingestion - schem...

Github user jackylk commented on a diff in the pull request:

    https://github.com/apache/carbondata/pull/1352#discussion_r138550496
 
    --- Diff: examples/spark2/src/main/scala/org/apache/carbondata/examples/streaming/CarbonStreamingIngestFileSourceExample.scala ---
    @@ -0,0 +1,132 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.carbondata.examples
    +
    +import java.io.File
    +
    +import org.apache.commons.lang.RandomStringUtils
    +import org.apache.spark.sql.{SaveMode, SparkSession}
    +
    +import org.apache.carbondata.core.constants.CarbonCommonConstants
    +import org.apache.carbondata.core.util.CarbonProperties
    +import org.apache.carbondata.examples.utils.StreamingCleanupUtil
    +
    +object CarbonStreamingIngestFileSourceExample {
    +
    +  def main(args: Array[String]) {
    +
    +    val rootPath = new File(this.getClass.getResource("/").getPath
    +      + "../../../..").getCanonicalPath
    +    val storeLocation = s"$rootPath/examples/spark2/target/store"
    +    val warehouse = s"$rootPath/examples/spark2/target/warehouse"
    +    val metastoredb = s"$rootPath/examples/spark2/target"
    +    val csvDataDir = s"$rootPath/examples/spark2/resources/csvDataDir"
    +    // val csvDataFile = s"$csvDataDir/sampleData.csv"
    +    // val csvDataFile = s"$csvDataDir/sample.csv"
    +    val streamTableName = s"_carbon_file_stream_table_"
    +    val stremTablePath = s"$storeLocation/default/$streamTableName"
    +    val ckptLocation = s"$rootPath/examples/spark2/resources/ckptDir"
    +
    +    CarbonProperties.getInstance()
    +      .addProperty(CarbonCommonConstants.CARBON_TIMESTAMP_FORMAT, "yyyy/MM/dd")
    +
    +    // cleanup any residual files
    +    StreamingCleanupUtil.main(Array(csvDataDir, ckptLocation))
    +
    +    import org.apache.spark.sql.CarbonSession._
    +    val spark = SparkSession
    +      .builder()
    +      .master("local")
    +      .appName("CarbonFileStreamingExample")
    +      .config("spark.sql.warehouse.dir", warehouse)
    +      .getOrCreateCarbonSession(storeLocation, metastoredb)
    +
    +    spark.sparkContext.setLogLevel("ERROR")
    +
    +    // Writes Dataframe to CarbonData file:
    +    import spark.implicits._
    +    import org.apache.spark.sql.types._
    +
    +    // Generate random data
    +    val dataDF = spark.sparkContext.parallelize(1 to 10)
    +      .map(id => (id, "name_ABC", "city_XYZ", 10000.00*id)).
    +      toDF("id", "name", "city", "salary")
    +
    +    // drop table if exists previously
    +    spark.sql(s"DROP TABLE IF EXISTS ${streamTableName}")
    +
    +    // Create Carbon Table
    +    // Saves dataframe to carbondata file
    +    dataDF.write
    --- End diff --
   
     Is this meant to mimic historical data? If yes, please mention it in a comment.


---

[GitHub] carbondata pull request #1352: [CARBONDATA-1174] Streaming Ingestion - schem...

Github user jackylk commented on a diff in the pull request:

    https://github.com/apache/carbondata/pull/1352#discussion_r138551126
 
    --- Diff: examples/spark2/src/main/scala/org/apache/carbondata/examples/streaming/CarbonStreamingIngestFileSourceExample.scala ---
    @@ -0,0 +1,132 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.carbondata.examples
    +
    +import java.io.File
    +
    +import org.apache.commons.lang.RandomStringUtils
    +import org.apache.spark.sql.{SaveMode, SparkSession}
    +
    +import org.apache.carbondata.core.constants.CarbonCommonConstants
    +import org.apache.carbondata.core.util.CarbonProperties
    +import org.apache.carbondata.examples.utils.StreamingCleanupUtil
    +
    +object CarbonStreamingIngestFileSourceExample {
    +
    +  def main(args: Array[String]) {
    +
    +    val rootPath = new File(this.getClass.getResource("/").getPath
    +      + "../../../..").getCanonicalPath
    +    val storeLocation = s"$rootPath/examples/spark2/target/store"
    +    val warehouse = s"$rootPath/examples/spark2/target/warehouse"
    +    val metastoredb = s"$rootPath/examples/spark2/target"
    +    val csvDataDir = s"$rootPath/examples/spark2/resources/csvDataDir"
    +    // val csvDataFile = s"$csvDataDir/sampleData.csv"
    +    // val csvDataFile = s"$csvDataDir/sample.csv"
    +    val streamTableName = s"_carbon_file_stream_table_"
    +    val stremTablePath = s"$storeLocation/default/$streamTableName"
    +    val ckptLocation = s"$rootPath/examples/spark2/resources/ckptDir"
    +
    +    CarbonProperties.getInstance()
    +      .addProperty(CarbonCommonConstants.CARBON_TIMESTAMP_FORMAT, "yyyy/MM/dd")
    +
    +    // cleanup any residual files
    +    StreamingCleanupUtil.main(Array(csvDataDir, ckptLocation))
    +
    +    import org.apache.spark.sql.CarbonSession._
    +    val spark = SparkSession
    +      .builder()
    +      .master("local")
    +      .appName("CarbonFileStreamingExample")
    +      .config("spark.sql.warehouse.dir", warehouse)
    +      .getOrCreateCarbonSession(storeLocation, metastoredb)
    +
    +    spark.sparkContext.setLogLevel("ERROR")
    +
    +    // Writes Dataframe to CarbonData file:
    +    import spark.implicits._
    +    import org.apache.spark.sql.types._
    +
    +    // Generate random data
    +    val dataDF = spark.sparkContext.parallelize(1 to 10)
    +      .map(id => (id, "name_ABC", "city_XYZ", 10000.00*id)).
    +      toDF("id", "name", "city", "salary")
    +
    +    // drop table if exists previously
    +    spark.sql(s"DROP TABLE IF EXISTS ${streamTableName}")
    +
    +    // Create Carbon Table
    +    // Saves dataframe to carbondata file
    +    dataDF.write
    +      .format("carbondata")
    +      .option("tableName", streamTableName)
    +      .option("compress", "true")
    +      .option("tempCSV", "false")
    +      .mode(SaveMode.Overwrite)
    +      .save()
    +
    +    spark.sql(s""" SELECT * FROM ${streamTableName} """).show()
    +
    +    // Create csv data frame file
    +    val csvDataDF = spark.sparkContext.parallelize(11 to 30)
    +      .map(id => (id,
    +        s"name_${RandomStringUtils.randomAlphabetic(4).toUpperCase}",
    +        s"city_${RandomStringUtils.randomAlphabetic(2).toUpperCase}",
    +        10000.00*id)).toDF("id", "name", "city", "salary")
    +
    +    // write data into csv file ( It will be used as a stream source)
    +    csvDataDF.write.
    +      format("com.databricks.spark.csv").
    +      option("header", "true").
    +      save(csvDataDir)
    +
    +    // define custom schema
    +    val inputSchema = new StructType().
    +      add("id", "integer").
    +      add("name", "string").
    +      add("city", "string").
    +      add("salary", "float")
    +
    +    // Read csv data file as a streaming source
    +    val csvReadDF = spark.readStream.
    --- End diff --
   
     Can we make this example run in a loop to mimic a real streaming case? For example, you could create 3 threads:
     Thread 1: keeps writing data into that CSV every 1 second
     Thread 2: reads the CSV file and writes to the carbondata table (writeStream)
     Thread 3: queries the carbondata table every 2 seconds
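
     A rough sketch of the producer side of that suggestion (runnable in spark-shell; the path, batch count and interval are illustrative, and the writeStream and periodic-query threads would follow the same pattern):

         import org.apache.spark.sql.SparkSession

         val spark = SparkSession.builder().master("local").appName("CsvProducerSketch").getOrCreate()
         import spark.implicits._

         val csvDataDir = "/tmp/csvDataDir"  // illustrative path, shared with the streaming query

         // Thread 1 from the suggestion: keep appending small CSV batches for the stream to pick up
         val producer = new Thread {
           override def run(): Unit = {
             for (batch <- 1 to 10) {  // bounded here so the sketch terminates
               val df = (1 to 5).map(i => (batch * 5 + i, s"name_$i", s"city_$i", 10000.0 * i))
                 .toDF("id", "name", "city", "salary")
               df.write.mode("append").option("header", "true").csv(csvDataDir)
               Thread.sleep(1000)  // new batch every second
             }
           }
         }
         producer.start()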



---

[GitHub] carbondata pull request #1352: [CARBONDATA-1174] Streaming Ingestion - schem...

Github user jackylk commented on a diff in the pull request:

    https://github.com/apache/carbondata/pull/1352#discussion_r138557669
 
    --- Diff: integration/spark2/src/test/scala/org/apache/spark/carbondata/streaming/CarbonSourceSchemaValidationTest.scala ---
    @@ -0,0 +1,43 @@
    +package org.apache.spark.carbondata.streaming
    +
    +import org.apache.hadoop.mapreduce.Job
    +
    +import org.apache.spark.sql.common.util.QueryTest
    +import org.apache.spark.sql.{CarbonSource, SparkSession}
    +import org.apache.spark.sql.streaming.CarbonStreamingOutputWriterFactory
    +import org.apache.spark.sql.test.TestQueryExecutor
    +import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}
    +
    +import org.scalatest.{BeforeAndAfterAll, FunSuite}
    +
    +
    +class CarbonSourceSchemaValidationTest extends QueryTest with BeforeAndAfterAll {
    +
    +  override def beforeAll() {
    +    sql("DROP TABLE IF EXISTS _carbon_stream_table_")
    +  }
    +
    +  test("Testing validate schema method with correct values ") {
    +
    +    val spark = SparkSession.builder
    +      .appName("StreamIngestSchemaValidation")
    +      .master("local")
    +      .getOrCreate()
    +
    +    val carbonSource = new CarbonSource
    +    val job = new Job()
    +    val storeLocation = TestQueryExecutor.storeLocation
    +
    +    println(s"Resource Path: $resourcesPath")
    --- End diff --
   
    remove println


---

[GitHub] carbondata pull request #1352: [CARBONDATA-1174] Streaming Ingestion - schem...

Github user jackylk commented on a diff in the pull request:

    https://github.com/apache/carbondata/pull/1352#discussion_r138558167
 
    --- Diff: integration/spark2/src/main/scala/org/apache/spark/sql/CarbonSource.scala ---
    @@ -205,19 +220,188 @@ class CarbonSource extends CreatableRelationProvider with RelationProvider
      * by setting the output committer class in the conf of spark.sql.sources.outputCommitterClass.
      */
       def prepareWrite(
    -    sparkSession: SparkSession,
    -    job: Job,
    -    options: Map[String, String],
    -    dataSchema: StructType): OutputWriterFactory = new CarbonStreamingOutputWriterFactory()
    +      sparkSession: SparkSession,
    +      job: Job,
    +      options: Map[String, String],
    +      dataSchema: StructType): OutputWriterFactory = {
     
    -/**
    - * When possible, this method should return the schema of the given `files`.  When the format
    - * does not support inference, or no valid files are given should return None.  In these cases
    - * Spark will require that user specify the schema manually.
    - */
    +    // Check if table with given path exists
    +    validateTable(options.get("path").get)
    +
    +    // Check id streaming data schema matches with carbon table schema
    +    // Data from socket source does not have schema attached to it,
    +    // Following check is to ignore schema validation for socket source.
    +    if (!(dataSchema.size.equals(1) &&
    --- End diff --
   
    why not equal to 1?


---

[GitHub] carbondata pull request #1352: [CARBONDATA-1174] Streaming Ingestion - schem...

Github user jackylk commented on a diff in the pull request:

    https://github.com/apache/carbondata/pull/1352#discussion_r138558272
 
    --- Diff: integration/spark2/src/main/scala/org/apache/spark/sql/CarbonSource.scala ---
    @@ -205,19 +220,188 @@ class CarbonSource extends CreatableRelationProvider with RelationProvider
      * by setting the output committer class in the conf of spark.sql.sources.outputCommitterClass.
      */
       def prepareWrite(
    -    sparkSession: SparkSession,
    -    job: Job,
    -    options: Map[String, String],
    -    dataSchema: StructType): OutputWriterFactory = new CarbonStreamingOutputWriterFactory()
    +      sparkSession: SparkSession,
    +      job: Job,
    +      options: Map[String, String],
    +      dataSchema: StructType): OutputWriterFactory = {
     
    -/**
    - * When possible, this method should return the schema of the given `files`.  When the format
    - * does not support inference, or no valid files are given should return None.  In these cases
    - * Spark will require that user specify the schema manually.
    - */
    +    // Check if table with given path exists
    +    validateTable(options.get("path").get)
    +
    +    // Check id streaming data schema matches with carbon table schema
    +    // Data from socket source does not have schema attached to it,
    +    // Following check is to ignore schema validation for socket source.
    +    if (!(dataSchema.size.equals(1) &&
    +      dataSchema.fields(0).dataType.equals(StringType))) {
    --- End diff --
   
    Why check StringType only?


---

[GitHub] carbondata pull request #1352: [CARBONDATA-1174] Streaming Ingestion - schem...

Github user jackylk commented on a diff in the pull request:

    https://github.com/apache/carbondata/pull/1352#discussion_r138558394
 
    --- Diff: integration/spark2/src/main/scala/org/apache/spark/sql/CarbonSource.scala ---
    @@ -205,19 +220,188 @@ class CarbonSource extends CreatableRelationProvider with RelationProvider
      * by setting the output committer class in the conf of spark.sql.sources.outputCommitterClass.
      */
       def prepareWrite(
    -    sparkSession: SparkSession,
    -    job: Job,
    -    options: Map[String, String],
    -    dataSchema: StructType): OutputWriterFactory = new CarbonStreamingOutputWriterFactory()
    +      sparkSession: SparkSession,
    +      job: Job,
    +      options: Map[String, String],
    +      dataSchema: StructType): OutputWriterFactory = {
     
    -/**
    - * When possible, this method should return the schema of the given `files`.  When the format
    - * does not support inference, or no valid files are given should return None.  In these cases
    - * Spark will require that user specify the schema manually.
    - */
    +    // Check if table with given path exists
    +    validateTable(options.get("path").get)
    +
    +    // Check id streaming data schema matches with carbon table schema
    +    // Data from socket source does not have schema attached to it,
    +    // Following check is to ignore schema validation for socket source.
    +    if (!(dataSchema.size.equals(1) &&
    +      dataSchema.fields(0).dataType.equals(StringType))) {
    +      val tablePath = options.get("path")
    +      val path: String = tablePath match {
    +        case Some(value) => value
    +        case None => ""
    +      }
    +      val meta: CarbonMetastore = new CarbonMetastore(sparkSession.conf, path)
    +      val schemaPath = path + "/Metadata/schema"
    +      val schema: TableInfo = meta.readSchemaFile(schemaPath)
    +      val isSchemaValid = validateSchema(schema, dataSchema)
    +
    +      if(!isSchemaValid) {
    +        LOGGER.error("Schema Validation Failed: streaming data schema"
    +          + "does not match with carbon table schema")
    +        throw new InvalidSchemaException("Schema Validation Failed : " +
    +          "streaming data schema does not match with carbon table schema")
    +      }
    +    }
    +    new CarbonStreamingOutputWriterFactory()
    +  }
    +
    +  /**
    +   * Read schema from existing carbon table
    +   * @param sparkSession
    +   * @param tablePath carbon table path
    +   * @return true if schema validation is successful else false
    +   */
    +  private def getTableSchema(sparkSession: SparkSession, tablePath: String): TableInfo = {
    +    val meta: CarbonMetastore = new CarbonMetastore(sparkSession.conf, tablePath)
    +    val schemaPath = tablePath + "/Metadata/schema"
    +    val schema: TableInfo = meta.readSchemaFile(schemaPath)
    +    schema
    +  }
    +
    +  /**
    +   * Validates streamed schema against existing table schema
    +   * @param schema existing carbon table schema
    +   * @param dataSchema streamed data schema
    +   * @return true if schema validation is successful else false
    +   */
    +  private def validateSchema(schema: TableInfo, dataSchema: StructType): Boolean = {
    +    val factTable: TableSchema = schema.getFact_table
    +
    +    import scala.collection.mutable.ListBuffer
    +    import scala.collection.JavaConverters._
    +    var columnnSchemaValues = factTable.getTable_columns.asScala.sortBy(_.schemaOrdinal)
    +
    +    var columnDataTypes = new ListBuffer[String]()
    +    for(columnDataType <- columnnSchemaValues) {
    +      columnDataTypes.append(columnDataType.data_type.toString)
    +    }
    +    val tableColumnDataTypeList = columnDataTypes.toList
    +
    +    var streamSchemaDataTypes = new ListBuffer[String]()
    +    for(i <- 0 until dataSchema.size) {
    +      streamSchemaDataTypes
    +        .append(
    +          mapStreamingDataTypeToString(dataSchema.fields(i).dataType.toString))
    +    }
    +    val streamedDataTypeList = streamSchemaDataTypes.toList
    +
    +    val isValid = tableColumnDataTypeList == streamedDataTypeList
    +    isValid
    +  }
    +
    +  /**
    +   * Parses streamed datatype according to carbon datatype
    +   * @param dataType
    +   * @return String
    +   */
    +  def mapStreamingDataTypeToString(dataType: String): String = {
    +    dataType match {
    +      case "IntegerType" => DataType.INT.toString
    +      case "StringType" => DataType.STRING.toString
    +      case "DateType" => DataType.DATE.toString
    +      case "DoubleType" => DataType.DOUBLE.toString
    +      case "FloatType" => DataType.DOUBLE.toString
    +      case "LongType" => DataType.LONG.toString
    +      case "ShortType" => DataType.SHORT.toString
    +      case "TimestampType" => DataType.TIMESTAMP.toString
    +    }
    +  }
    +
    +  /**
    +   * Validates if given table exists or throws exception
    +   * @param String existing carbon table path
    +   * @return None
    +   */
    +  private def validateTable(tablePath: String): Unit = {
    +
    +    val formattedTablePath = tablePath.replace('\\', '/')
    +    val names = formattedTablePath.split("/")
    +    if (names.length < 3) {
    +      throw new IllegalArgumentException("invalid table path: " + tablePath)
    +    }
    +    val tableName : String = names(names.length - 1)
    +    val dbName : String = names(names.length - 2)
    +    val storePath = formattedTablePath.substring(0,
    +      formattedTablePath.lastIndexOf
    +      (((dbName.concat(CarbonCommonConstants.FILE_SEPARATOR).toString)
    +        .concat(tableName)).toString) - 1)
    +    val absoluteTableIdentifier: AbsoluteTableIdentifier =
    +      new AbsoluteTableIdentifier(storePath,
    +        new CarbonTableIdentifier(dbName, tableName,
    +          UUID.randomUUID().toString))
    +
    +    if (!checkIfTableExists(absoluteTableIdentifier)) {
    +      throw new NoSuchTableException(dbName, tableName)
    +    }
    +  }
    +
    +  /**
    +   * Checks if table exists by checking its schema file
    +   * @param absoluteTableIdentifier
    +   * @return Boolean
    +   */
    +  private def checkIfTableExists(absoluteTableIdentifier: AbsoluteTableIdentifier): Boolean = {
    +    val carbonTablePath: CarbonTablePath = CarbonStorePath
    +      .getCarbonTablePath(absoluteTableIdentifier)
    +    val schemaFilePath: String = carbonTablePath.getSchemaFilePath
    +    FileFactory.isFileExist(schemaFilePath, FileFactory.FileType.LOCAL) ||
    +      FileFactory.isFileExist(schemaFilePath, FileFactory.FileType.HDFS) ||
    +      FileFactory.isFileExist(schemaFilePath, FileFactory.FileType.VIEWFS)
    +  }
    +
    +  /**
    +   * If use wants to stream data from carbondata table source
    +   * and if following conditions are true:
    +   *    1. No schema provided by the user in readStream()
    +   *    2. spark.sql.streaming.schemaInference is set to true
    +   * carbondata can infer a table schema from a valid table path
    +   * The schema inference is not mandatory, but good have.
    +   * When possible, this method should return the schema of the given `files`.  When the format
    +   * does not support inference, or no valid files are given should return None.  In these cases
    +   * Spark will require that user specify the schema manually.
    +   */
       def inferSchema(
    --- End diff --
   
    Where is this function used?


---

[GitHub] carbondata issue #1352: [CARBONDATA-1174] Streaming Ingestion - schema valid...

Github user CarbonDataQA commented on the issue:

    https://github.com/apache/carbondata/pull/1352
 
    Can one of the admins verify this patch?


---

[GitHub] carbondata pull request #1352: [CARBONDATA-1174] Streaming Ingestion - schem...

Github user aniketadnaik commented on a diff in the pull request:

    https://github.com/apache/carbondata/pull/1352#discussion_r138812980
 
    --- Diff: integration/spark2/src/main/scala/org/apache/spark/sql/CarbonSource.scala ---
    @@ -205,19 +220,188 @@ class CarbonSource extends CreatableRelationProvider with RelationProvider
      * by setting the output committer class in the conf of spark.sql.sources.outputCommitterClass.
      */
       def prepareWrite(
    -    sparkSession: SparkSession,
    -    job: Job,
    -    options: Map[String, String],
    -    dataSchema: StructType): OutputWriterFactory = new CarbonStreamingOutputWriterFactory()
    +      sparkSession: SparkSession,
    +      job: Job,
    +      options: Map[String, String],
    +      dataSchema: StructType): OutputWriterFactory = {
     
    -/**
    - * When possible, this method should return the schema of the given `files`.  When the format
    - * does not support inference, or no valid files are given should return None.  In these cases
    - * Spark will require that user specify the schema manually.
    - */
    +    // Check if table with given path exists
    +    validateTable(options.get("path").get)
    +
    +    // Check id streaming data schema matches with carbon table schema
    +    // Data from socket source does not have schema attached to it,
    +    // Following check is to ignore schema validation for socket source.
    +    if (!(dataSchema.size.equals(1) &&
    +      dataSchema.fields(0).dataType.equals(StringType))) {
    +      val tablePath = options.get("path")
    +      val path: String = tablePath match {
    +        case Some(value) => value
    +        case None => ""
    +      }
    +      val meta: CarbonMetastore = new CarbonMetastore(sparkSession.conf, path)
    +      val schemaPath = path + "/Metadata/schema"
    +      val schema: TableInfo = meta.readSchemaFile(schemaPath)
    +      val isSchemaValid = validateSchema(schema, dataSchema)
    +
    +      if(!isSchemaValid) {
    +        LOGGER.error("Schema Validation Failed: streaming data schema"
    +          + "does not match with carbon table schema")
    +        throw new InvalidSchemaException("Schema Validation Failed : " +
    +          "streaming data schema does not match with carbon table schema")
    +      }
    +    }
    +    new CarbonStreamingOutputWriterFactory()
    +  }
    +
    +  /**
    +   * Read schema from existing carbon table
    +   * @param sparkSession
    +   * @param tablePath carbon table path
    +   * @return true if schema validation is successful else false
    +   */
    +  private def getTableSchema(sparkSession: SparkSession, tablePath: String): TableInfo = {
    +    val meta: CarbonMetastore = new CarbonMetastore(sparkSession.conf, tablePath)
    +    val schemaPath = tablePath + "/Metadata/schema"
    +    val schema: TableInfo = meta.readSchemaFile(schemaPath)
    +    schema
    +  }
    +
    +  /**
    +   * Validates streamed schema against existing table schema
    +   * @param schema existing carbon table schema
    +   * @param dataSchema streamed data schema
    +   * @return true if schema validation is successful else false
    +   */
    +  private def validateSchema(schema: TableInfo, dataSchema: StructType): Boolean = {
    +    val factTable: TableSchema = schema.getFact_table
    +
    +    import scala.collection.mutable.ListBuffer
    +    import scala.collection.JavaConverters._
    +    var columnnSchemaValues = factTable.getTable_columns.asScala.sortBy(_.schemaOrdinal)
    +
    +    var columnDataTypes = new ListBuffer[String]()
    +    for(columnDataType <- columnnSchemaValues) {
    +      columnDataTypes.append(columnDataType.data_type.toString)
    +    }
    +    val tableColumnDataTypeList = columnDataTypes.toList
    +
    +    var streamSchemaDataTypes = new ListBuffer[String]()
    +    for(i <- 0 until dataSchema.size) {
    +      streamSchemaDataTypes
    +        .append(
    +          mapStreamingDataTypeToString(dataSchema.fields(i).dataType.toString))
    +    }
    +    val streamedDataTypeList = streamSchemaDataTypes.toList
    +
    +    val isValid = tableColumnDataTypeList == streamedDataTypeList
    +    isValid
    +  }
    +
    +  /**
    +   * Parses streamed datatype according to carbon datatype
    +   * @param dataType
    +   * @return String
    +   */
    +  def mapStreamingDataTypeToString(dataType: String): String = {
    +    dataType match {
    +      case "IntegerType" => DataType.INT.toString
    +      case "StringType" => DataType.STRING.toString
    +      case "DateType" => DataType.DATE.toString
    +      case "DoubleType" => DataType.DOUBLE.toString
    +      case "FloatType" => DataType.DOUBLE.toString
    +      case "LongType" => DataType.LONG.toString
    +      case "ShortType" => DataType.SHORT.toString
    +      case "TimestampType" => DataType.TIMESTAMP.toString
    +    }
    +  }
    +
    +  /**
    +   * Validates if given table exists or throws exception
    +   * @param String existing carbon table path
    +   * @return None
    +   */
    +  private def validateTable(tablePath: String): Unit = {
    +
    +    val formattedTablePath = tablePath.replace('\\', '/')
    +    val names = formattedTablePath.split("/")
    +    if (names.length < 3) {
    +      throw new IllegalArgumentException("invalid table path: " + tablePath)
    +    }
    +    val tableName : String = names(names.length - 1)
    +    val dbName : String = names(names.length - 2)
    +    val storePath = formattedTablePath.substring(0,
    +      formattedTablePath.lastIndexOf
    +      (((dbName.concat(CarbonCommonConstants.FILE_SEPARATOR).toString)
    +        .concat(tableName)).toString) - 1)
    +    val absoluteTableIdentifier: AbsoluteTableIdentifier =
    +      new AbsoluteTableIdentifier(storePath,
    +        new CarbonTableIdentifier(dbName, tableName,
    +          UUID.randomUUID().toString))
    +
    +    if (!checkIfTableExists(absoluteTableIdentifier)) {
    +      throw new NoSuchTableException(dbName, tableName)
    +    }
    +  }
    +
    +  /**
    +   * Checks if table exists by checking its schema file
    +   * @param absoluteTableIdentifier
    +   * @return Boolean
    +   */
    +  private def checkIfTableExists(absoluteTableIdentifier: AbsoluteTableIdentifier): Boolean = {
    +    val carbonTablePath: CarbonTablePath = CarbonStorePath
    +      .getCarbonTablePath(absoluteTableIdentifier)
    +    val schemaFilePath: String = carbonTablePath.getSchemaFilePath
    +    FileFactory.isFileExist(schemaFilePath, FileFactory.FileType.LOCAL) ||
    +      FileFactory.isFileExist(schemaFilePath, FileFactory.FileType.HDFS) ||
    +      FileFactory.isFileExist(schemaFilePath, FileFactory.FileType.VIEWFS)
    +  }
    +
    +  /**
    +   * If use wants to stream data from carbondata table source
    +   * and if following conditions are true:
    +   *    1. No schema provided by the user in readStream()
    +   *    2. spark.sql.streaming.schemaInference is set to true
    +   * carbondata can infer a table schema from a valid table path
    +   * The schema inference is not mandatory, but good have.
    +   * When possible, this method should return the schema of the given `files`.  When the format
    +   * does not support inference, or no valid files are given should return None.  In these cases
    +   * Spark will require that user specify the schema manually.
    +   */
       def inferSchema(
    --- End diff --
   
     This is used in the read path. It is called from DataSource -> sourceSchema() -> getOrInferFileFormatSchema() -> format.inferSchema(). We don't have read-path support ready yet. But if a user wants to stream data from "carbondata" as an input source (readStream.format("carbondata")), carbondata may have to provide inferSchema() when spark.sql.streaming.schemaInference is set to true and no external schema is provided. Again, this is not mandatory functionality, but it is good to have.
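
     For reference, a sketch of the read-path call that would eventually exercise inferSchema once readStream support lands (the table path is illustrative; this does not run today since the read path is not implemented yet):

         import org.apache.spark.sql.SparkSession

         // With spark.sql.streaming.schemaInference set to true and no explicit .schema(...),
         // Spark asks the format to infer the schema from the table path.
         val spark = SparkSession.builder()
           .master("local")
           .appName("CarbonReadStreamSketch")
           .config("spark.sql.streaming.schemaInference", "true")
           .getOrCreate()

         val streamDF = spark.readStream
           .format("carbondata")
           .load("/tmp/store/default/_carbon_file_stream_table_")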


---

[GitHub] carbondata pull request #1352: [CARBONDATA-1174] Streaming Ingestion - schem...

Github user aniketadnaik commented on a diff in the pull request:

    https://github.com/apache/carbondata/pull/1352#discussion_r138813012
 
    --- Diff: integration/spark2/src/main/scala/org/apache/spark/sql/CarbonSource.scala ---
    @@ -205,19 +220,188 @@ class CarbonSource extends CreatableRelationProvider with RelationProvider
      * by setting the output committer class in the conf of spark.sql.sources.outputCommitterClass.
      */
       def prepareWrite(
    -    sparkSession: SparkSession,
    -    job: Job,
    -    options: Map[String, String],
    -    dataSchema: StructType): OutputWriterFactory = new CarbonStreamingOutputWriterFactory()
    +      sparkSession: SparkSession,
    +      job: Job,
    +      options: Map[String, String],
    +      dataSchema: StructType): OutputWriterFactory = {
     
    -/**
    - * When possible, this method should return the schema of the given `files`.  When the format
    - * does not support inference, or no valid files are given should return None.  In these cases
    - * Spark will require that user specify the schema manually.
    - */
    +    // Check if table with given path exists
    +    validateTable(options.get("path").get)
    +
    +    // Check id streaming data schema matches with carbon table schema
    +    // Data from socket source does not have schema attached to it,
    +    // Following check is to ignore schema validation for socket source.
    +    if (!(dataSchema.size.equals(1) &&
    +      dataSchema.fields(0).dataType.equals(StringType))) {
    --- End diff --
   
     Here is a bit of background: schema validation is not mandatory, but it provides early validation if the input data schema doesn't match the target table schema. With Spark's structured streaming, not all input sources provide a schema; only file sources have a schema attached to them. With file sources the user can either provide a schema via readStream.schema(), or the file source has to infer the schema internally if spark.sql.streaming.schemaInference is set to true. However, a socket streaming source has no real schema attached to it; everything arrives as a byte stream, and the schema for a socket source is always of size 1 and of type StringType. We need to bypass schema validation for the socket source. The schema validation happens on the executor side, where we don't have any information about the input source (whether it is a file source or a socket source); the executor only gets the schema and the data, hence this check is needed to skip schema validation for the socket source. This may not be a clean approach, but I could not find any better way to handle it. If you have any info, please let me know.


---

[GitHub] carbondata pull request #1352: [CARBONDATA-1174] Streaming Ingestion - schem...

Github user aniketadnaik commented on a diff in the pull request:

    https://github.com/apache/carbondata/pull/1352#discussion_r138813057
 
    --- Diff: integration/spark2/src/test/scala/org/apache/spark/carbondata/streaming/CarbonSourceSchemaValidationTest.scala ---
    @@ -0,0 +1,43 @@
    +package org.apache.spark.carbondata.streaming
    +
    +import org.apache.hadoop.mapreduce.Job
    +
    +import org.apache.spark.sql.common.util.QueryTest
    +import org.apache.spark.sql.{CarbonSource, SparkSession}
    +import org.apache.spark.sql.streaming.CarbonStreamingOutputWriterFactory
    +import org.apache.spark.sql.test.TestQueryExecutor
    +import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}
    +
    +import org.scalatest.{BeforeAndAfterAll, FunSuite}
    +
    +
    +class CarbonSourceSchemaValidationTest extends QueryTest with BeforeAndAfterAll {
    +
    +  override def beforeAll() {
    +    sql("DROP TABLE IF EXISTS _carbon_stream_table_")
    +  }
    +
    +  test("Testing validate schema method with correct values ") {
    +
    +    val spark = SparkSession.builder
    +      .appName("StreamIngestSchemaValidation")
    +      .master("local")
    +      .getOrCreate()
    +
    +    val carbonSource = new CarbonSource
    +    val job = new Job()
    +    val storeLocation = TestQueryExecutor.storeLocation
    +
    +    println(s"Resource Path: $resourcesPath")
    --- End diff --
   
    yes.


---

[GitHub] carbondata pull request #1352: [CARBONDATA-1174] Streaming Ingestion - schem...

Github user aniketadnaik commented on a diff in the pull request:

    https://github.com/apache/carbondata/pull/1352#discussion_r138813289
 
    --- Diff: examples/spark2/src/main/scala/org/apache/carbondata/examples/streaming/CarbonStreamingIngestFileSourceExample.scala ---
    @@ -0,0 +1,132 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.carbondata.examples
    +
    +import java.io.File
    +
    +import org.apache.commons.lang.RandomStringUtils
    +import org.apache.spark.sql.{SaveMode, SparkSession}
    +
    +import org.apache.carbondata.core.constants.CarbonCommonConstants
    +import org.apache.carbondata.core.util.CarbonProperties
    +import org.apache.carbondata.examples.utils.StreamingCleanupUtil
    +
    +object CarbonStreamingIngestFileSourceExample {
    +
    +  def main(args: Array[String]) {
    +
    +    val rootPath = new File(this.getClass.getResource("/").getPath
    +      + "../../../..").getCanonicalPath
    +    val storeLocation = s"$rootPath/examples/spark2/target/store"
    +    val warehouse = s"$rootPath/examples/spark2/target/warehouse"
    +    val metastoredb = s"$rootPath/examples/spark2/target"
    +    val csvDataDir = s"$rootPath/examples/spark2/resources/csvDataDir"
    +    // val csvDataFile = s"$csvDataDir/sampleData.csv"
    +    // val csvDataFile = s"$csvDataDir/sample.csv"
    +    val streamTableName = s"_carbon_file_stream_table_"
    +    val stremTablePath = s"$storeLocation/default/$streamTableName"
    +    val ckptLocation = s"$rootPath/examples/spark2/resources/ckptDir"
    +
    +    CarbonProperties.getInstance()
    +      .addProperty(CarbonCommonConstants.CARBON_TIMESTAMP_FORMAT, "yyyy/MM/dd")
    +
    +    // cleanup any residual files
    +    StreamingCleanupUtil.main(Array(csvDataDir, ckptLocation))
    +
    +    import org.apache.spark.sql.CarbonSession._
    +    val spark = SparkSession
    +      .builder()
    +      .master("local")
    +      .appName("CarbonFileStreamingExample")
    +      .config("spark.sql.warehouse.dir", warehouse)
    +      .getOrCreateCarbonSession(storeLocation, metastoredb)
    +
    +    spark.sparkContext.setLogLevel("ERROR")
    +
    +    // Writes Dataframe to CarbonData file:
    +    import spark.implicits._
    +    import org.apache.spark.sql.types._
    +
    +    // Generate random data
    +    val dataDF = spark.sparkContext.parallelize(1 to 10)
    +      .map(id => (id, "name_ABC", "city_XYZ", 10000.00*id)).
    +      toDF("id", "name", "city", "salary")
    +
    +    // drop table if exists previously
    +    spark.sql(s"DROP TABLE IF EXISTS ${streamTableName}")
    +
    +    // Create Carbon Table
    +    // Saves dataframe to carbondata file
    +    dataDF.write
    +      .format("carbondata")
    +      .option("tableName", streamTableName)
    +      .option("compress", "true")
    +      .option("tempCSV", "false")
    +      .mode(SaveMode.Overwrite)
    +      .save()
    +
    +    spark.sql(s""" SELECT * FROM ${streamTableName} """).show()
    +
    +    // Create csv data frame file
    +    val csvDataDF = spark.sparkContext.parallelize(11 to 30)
    +      .map(id => (id,
    +        s"name_${RandomStringUtils.randomAlphabetic(4).toUpperCase}",
    +        s"city_${RandomStringUtils.randomAlphabetic(2).toUpperCase}",
    +        10000.00*id)).toDF("id", "name", "city", "salary")
    +
    +    // write data into csv file ( It will be used as a stream source)
    +    csvDataDF.write.
    +      format("com.databricks.spark.csv").
    +      option("header", "true").
    +      save(csvDataDir)
    +
    +    // define custom schema
    +    val inputSchema = new StructType().
    +      add("id", "integer").
    +      add("name", "string").
    +      add("city", "string").
    +      add("salary", "float")
    +
    +    // Read csv data file as a streaming source
    +    val csvReadDF = spark.readStream.
    --- End diff --
   
     I referred to the streaming examples from Spark and wanted to keep this similar and simple. However, it should be doable. At this point we don't have read support yet; maybe I will change it to have two threads, one writing to the CSV every few seconds and the other writing to the carbondata table. This example can be enhanced later to add read verification when read-path support is added.


---