GitHub user sounakr opened a pull request:
https://github.com/apache/carbondata/pull/2032 [CARBONDATA-2224] External file level reader support

The file level reader reads any CarbonData file placed in an external file path. Reading can be done through three methods:

a) Reading as a datasource from Spark. CarbonFileLevelFormat.scala is used in this case to read the file. To create a Spark datasource external table: "CREATE TABLE sdkOutputTable **USING CarbonDataFileFormat** LOCATION '$writerOutputFilePath1'". For more details please refer to the test file org/apache/carbondata/spark/testsuite/createTable/TestCreateTableUsingCarbonFileLevelFormat.scala.

b) Reading from Spark SQL as an external table. CarbonFileinputFormat.java is used for reading the files. The create table syntax for this is: "CREATE EXTERNAL TABLE sdkOutputTable **STORED BY 'carbondatafileformat'** LOCATION '$writerOutputFilePath6'". For more details see org/apache/carbondata/spark/testsuite/createTable/TestCarbonFileInputFormatWithExternalCarbonTable.scala.

c) Reading through a Hadoop MapReduce job. Please refer to org/apache/carbondata/mapred/TestMapReduceCarbonFileInputFormat.java for more details.

- [ ] Any interfaces changed?
- [ ] Any backward compatibility impacted?
- [ ] Document update required?
- [ ] Testing done
      Please provide details on
      - Whether new unit test cases have been added or why no new tests are required?
      - How it is tested? Please attach test report.
      - Is it a performance related change? Please attach the performance test report.
      - Any additional information to help reviewers in testing this change.
- [ ] For large changes, please consider breaking it into sub-tasks under an umbrella JIRA.
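For quick reference, the two SQL variants described in (a) and (b) can be written out in full; the table name and the `$writerOutputFilePath...` LOCATION values are the placeholders used in the PR's test files:

```sql
-- (a) Spark datasource table backed by the file-level format
CREATE TABLE sdkOutputTable
USING CarbonDataFileFormat
LOCATION '$writerOutputFilePath1';

-- (b) External table read through CarbonFileInputFormat
CREATE EXTERNAL TABLE sdkOutputTable
STORED BY 'carbondatafileformat'
LOCATION '$writerOutputFilePath6';
```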
You can merge this pull request into a Git repository by running:

$ git pull https://github.com/sounakr/incubator-carbondata file_level_reader

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/carbondata/pull/2032.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

This closes #2032

----

commit 65ce23b1f6e35c3c6722c7f0c14c19b7c8536d23
Author: Jacky Li <jacky.likun@...>
Date: 2018-01-06T12:28:44Z

[CARBONDATA-1992] Remove partitionId in CarbonTablePath
In CarbonTablePath there is a deprecated partition id which is always 0; it should be removed to avoid confusion.
This closes #1765

commit c9ceaaae66574c98a13cc65bc3b91ab8346a456b
Author: Jacky Li <jacky.likun@...>
Date: 2018-01-30T13:24:04Z

[CARBONDATA-2099] Refactor query scan process to improve readability
Unified concepts in the scan process flow:
1. QueryModel contains all parameters for a scan; it is created by an API in CarbonTable. (In future, CarbonTable will be the entry point for various table operations.)
2. Use the term ColumnChunk to represent one column in one blocklet, and use ChunkIndex in the reader to read a specified column chunk.
3. Use the term ColumnPage to represent one page in one ColumnChunk.
4. QueryColumn => ProjectionColumn, indicating it is for projection.
This closes #1874

commit 01fcd539af815956975eb4ea480f14e4bb1a2062
Author: ravipesala <ravi.pesala@...>
Date: 2017-11-15T14:18:40Z

[CARBONDATA-1544][Datamap] Datamap FineGrain implementation
Implemented interfaces for the FG datamap and integrated them into the filter scanner to use the pruned bitset from the FG datamap. The FG query flow is as follows:
1. The user can add an FG datamap to any table and implement its interfaces.
2. Any filter query which hits a table with a datamap will call the prune method of the FG datamap.
3. The prune method of the FG datamap returns a list of FineGrainBlocklet; these blocklets contain block, blocklet, page and rowid information as well.
4. The pruned blocklets are internally written to a file, and only the block, blocklet and file path information is returned as part of the splits.
5. Based on the splits, ScanRDD schedules the tasks.
6. In the filter scanner we check the datamap writer path from the split, read the bitset if it exists, and pass this bitset as input.
This closes #1471

commit da82cdbda4f45fa741f56594e23c61a575c2fd2c
Author: Jacky Li <jacky.likun@...>
Date: 2018-02-27T00:51:25Z

[REBASE] resolve conflict after rebasing to master

commit 072c95a6770a2b847e111f3349df271bade62675
Author: Jacky Li <jacky.likun@...>
Date: 2018-02-10T02:34:59Z

Revert "[CARBONDATA-2023][DataLoad] Add size base block allocation in data loading"
This reverts commit 6dd8b038fc898dbf48ad30adfc870c19eb38e3d0.

commit 50af4d91ca2415d12e559b6070f72bfe5a881641
Author: Jacky Li <jacky.likun@...>
Date: 2018-02-11T13:37:04Z

[CARBONDATA-2159] Remove carbon-spark dependency in store-sdk module
To make it possible to assemble a JAR of the store-sdk module, it should not depend on the carbon-spark module.
This closes #1970

commit e77fcac978a87d9d526ea7012954fc8e48e9e34c
Author: xuchuanyin <xuchuanyin@...>
Date: 2018-02-08T06:42:39Z

[CARBONDATA-2023][DataLoad] Add size base block allocation in data loading
CarbonData assigns blocks to nodes at the beginning of data loading. The previous block allocation strategy was block-number based and suffers from the skewed-data problem when the sizes of the input files differ a lot. We introduced a size-based block allocation strategy to optimize data loading performance in skewed-data scenarios.
This closes #1808

commit 00e5208a6da5cc13aabd3ed6c437d2d1c5fa06ff
Author: sounakr <sounakr@...>
Date: 2017-09-28T10:51:05Z

[CARBONDATA-1480] Min Max Index Example for DataMap
Datamap example: implementation of a Min Max index through a datamap, and using the index while pruning.
This closes #1359

commit 3212c0c025191c754c454ad88de3adbec26dc58b
Author: ravipesala <ravi.pesala@...>
Date: 2017-11-15T14:18:40Z

[CARBONDATA-1544][Datamap] Datamap FineGrain implementation
Implemented interfaces for the FG datamap and integrated them into the filter scanner to use the pruned bitset from the FG datamap. The FG query flow is as follows:
1. The user can add an FG datamap to any table and implement its interfaces.
2. Any filter query which hits a table with a datamap will call the prune method of the FG datamap.
3. The prune method of the FG datamap returns a list of FineGrainBlocklet; these blocklets contain block, blocklet, page and rowid information as well.
4. The pruned blocklets are internally written to a file, and only the block, blocklet and file path information is returned as part of the splits.
5. Based on the splits, ScanRDD schedules the tasks.
6. In the filter scanner we check the datamap writer path from the split, read the bitset if it exists, and pass this bitset as input.
This closes #1471

commit aa3f2ff731fa6e0004dea827417c0d932d4a6291
Author: Jacky Li <jacky.likun@...>
Date: 2018-01-06T12:28:44Z

[CARBONDATA-1992] Remove partitionId in CarbonTablePath
In CarbonTablePath there is a deprecated partition id which is always 0; it should be removed to avoid confusion.
This closes #1765

commit 3ba31a162dc66bc5ee9023c7ff466c7de4c31c50
Author: Jacky Li <jacky.likun@...>
Date: 2018-01-30T13:24:04Z

[CARBONDATA-2099] Refactor query scan process to improve readability
Unified concepts in the scan process flow:
1. QueryModel contains all parameters for a scan; it is created by an API in CarbonTable. (In future, CarbonTable will be the entry point for various table operations.)
2. Use the term ColumnChunk to represent one column in one blocklet, and use ChunkIndex in the reader to read a specified column chunk.
3. Use the term ColumnPage to represent one page in one ColumnChunk.
4. QueryColumn => ProjectionColumn, indicating it is for projection.
This closes #1874

commit 810f093c28dc9e8a70a04bef1bc701569ec4261e
Author: Jacky Li <jacky.likun@...>
Date: 2018-01-31T08:14:27Z

[CARBONDATA-2025] Unify all path construction through CarbonTablePath static method
Refactor CarbonTablePath:
1. Remove CarbonStorePath and use CarbonTablePath only.
2. Make CarbonTablePath a utility without object creation; this avoids creating an object before using it, so the code is cleaner and there is less GC.
This closes #1768

commit 5a91a4cf49e3554f95f88637d93b51c80bf5329f
Author: xuchuanyin <xuchuanyin@...>
Date: 2018-02-08T06:42:39Z

[CARBONDATA-2023][DataLoad] Add size base block allocation in data loading
CarbonData assigns blocks to nodes at the beginning of data loading. The previous block allocation strategy was block-number based and suffers from the skewed-data problem when the sizes of the input files differ a lot. We introduced a size-based block allocation strategy to optimize data loading performance in skewed-data scenarios.
This closes #1808

commit 667303e7dfa515cda7cd3e34c736b74b5e246c29
Author: xuchuanyin <xuchuanyin@...>
Date: 2018-02-08T07:39:45Z

[HotFix][CheckStyle] Fix import related checkstyle
This closes #1952

commit 442350f6cbc908ea02ec6ef5f8d5b748b63d73d9
Author: Jacky Li <jacky.likun@...>
Date: 2018-02-27T03:26:30Z

[REBASE] Solve conflict after merging master

commit ea51dbf0d0d03d5cf9a946594cec61e4d9a2a46d
Author: Jacky Li <jacky.likun@...>
Date: 2018-02-10T02:34:59Z

Revert "[CARBONDATA-2023][DataLoad] Add size base block allocation in data loading"
This reverts commit 6dd8b038fc898dbf48ad30adfc870c19eb38e3d0.
commit d13f01bfb7bf84fd8a231300219cbc4818eabe5b
Author: sounakr <sounakr@...>
Date: 2018-02-24T02:25:14Z

File Format Reader

commit 06b0c74edbc6097ada28382f27c54905a1b07159
Author: sounakr <sounakr@...>
Date: 2018-02-26T11:58:47Z

File Format Phase 2

commit 372b380470600c03a2f723b53a106a5ce0087ae9
Author: Ajantha-Bhat <ajanthabhat@...>
Date: 2018-02-27T06:06:56Z

* File Format Phase 2 (cleanup code)

commit 8eb20a5dd9543029239a051bd978e855a69d805c
Author: Ajantha-Bhat <ajanthabhat@...>
Date: 2018-02-27T06:36:28Z

* File Format Phase 2 (cleanup code)

commit 462fd28cbc1268bbb529f947ee2e93c068e0d682
Author: Ajantha-Bhat <ajanthabhat@...>
Date: 2018-02-27T09:54:43Z

* File Format Phase 2 (cleanup code and adding testCase)

commit 952688b8cf1b17954b85af6143abcab77d081da8
Author: Ajantha-Bhat <ajanthabhat@...>
Date: 2018-02-27T11:58:37Z

* File Format Phase 2 (filter issue fix)

commit 87c84943122c8523291cc25751829ac143161469
Author: Ajantha-Bhat <ajanthabhat@...>
Date: 2018-02-27T12:20:46Z

* File Format Phase 2 (filter issue fix return value)

commit 3a0c3b9448c3cca0742db0f557518ffa12d0dabb
Author: sounakr <sounakr@...>
Date: 2018-02-27T13:55:16Z

Clear DataMap Cache

commit 1943cf6dcd266cd78483f137e0499083d95e4332
Author: Ajantha-Bhat <ajanthabhat@...>
Date: 2018-02-27T14:02:35Z

* File Format Phase 2 (test cases)

commit 4f97c7e35fade5fe0abb58b0c781a6b7f5b744e9
Author: sounakr <sounakr@...>
Date: 2018-02-28T03:18:45Z

Refactor CarbonFileInputFormat

commit 7df78cf50b658cc6fb79e28b0ad76f74dc8a680a
Author: Ajantha-Bhat <ajanthabhat@...>
Date: 2018-02-28T10:02:08Z

* File Format Phase 2
a. test cases addition
b. Exception handling when the files are not present
c. Setting the filter expression in carbonTableInputFormat

commit 4825fcc8d023c2b1a031ee0417addf5b6f2d5763
Author: Ajantha-Bhat <ajanthabhat@...>
Date: 2018-02-28T10:02:08Z

* File Format Phase 2
a. test cases addition
b. Exception handling when the files are not present
c. Setting the filter expression in carbonTableInputFormat

commit 5e5adbe21b8b786c13fda13e7e052bc5e46f22b4
Author: Ajantha-Bhat <ajanthabhat@...>
Date: 2018-02-28T10:02:08Z

* File Format Phase 2
a. test cases addition
b. Exception handling when the files are not present
c. Setting the filter expression in carbonTableInputFormat

commit b510faa9e033fb2ca0ae64125aee10709201e69f
Author: sounakr <sounakr@...>
Date: 2018-03-01T11:23:39Z

Map Reduce Test Case for CarbonInputFileFormat

----
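Several commits in the log above add, revert, and re-add the size-based block allocation strategy from [CARBONDATA-2023]. As a rough illustration only (a toy sketch, not CarbonData's actual code), the idea of assigning each block to the currently least-loaded node by total bytes, rather than balancing by block count, can be shown as:

```python
from typing import Dict, List

def assign_by_size(blocks: Dict[str, int], nodes: List[str]) -> Dict[str, List[str]]:
    """Greedily assign each block (name -> size in bytes) to the node
    with the smallest total assigned size so far. With skewed input file
    sizes this balances load far better than counting blocks per node."""
    load = {n: 0 for n in nodes}
    assignment: Dict[str, List[str]] = {n: [] for n in nodes}
    # Place the largest blocks first so the greedy choice stays balanced.
    for name, size in sorted(blocks.items(), key=lambda kv: -kv[1]):
        target = min(nodes, key=lambda n: load[n])
        assignment[target].append(name)
        load[target] += size
    return assignment

if __name__ == "__main__":
    # Two huge blocks and two tiny ones: count-based assignment could put
    # both huge blocks on one node; size-based assignment splits them.
    blocks = {"b1": 1000, "b2": 10, "b3": 10, "b4": 990}
    print(assign_by_size(blocks, ["node1", "node2"]))
```

The block names, sizes, and node labels here are hypothetical; the real strategy lives in CarbonData's data-loading block distribution code.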
Github user ravipesala commented on the issue:
https://github.com/apache/carbondata/pull/2032 SDV Build Fail, Please check CI http://144.76.159.231:8080/job/ApacheSDVTests/3775/ ---
Github user CarbonDataQA commented on the issue:
https://github.com/apache/carbondata/pull/2032 Build Failed with Spark 2.1.0, Please check CI http://136.243.101.176:8080/job/ApacheCarbonPRBuilder1/4077/ ---
Github user ravipesala commented on the issue:
https://github.com/apache/carbondata/pull/2032 SDV Build Fail, Please check CI http://144.76.159.231:8080/job/ApacheSDVTests/3776/ ---
Github user CarbonDataQA commented on the issue:
https://github.com/apache/carbondata/pull/2032 Build Failed with Spark 2.1.0, Please check CI http://136.243.101.176:8080/job/ApacheCarbonPRBuilder1/4078/ ---
Github user CarbonDataQA commented on the issue:
https://github.com/apache/carbondata/pull/2032 Build Failed with Spark 2.2.1, Please check CI http://88.99.58.216:8080/job/ApacheCarbonPRBuilder/2832/ ---
Github user CarbonDataQA commented on the issue:
https://github.com/apache/carbondata/pull/2032 Build Failed with Spark 2.2.1, Please check CI http://88.99.58.216:8080/job/ApacheCarbonPRBuilder/2833/ ---
Github user ravipesala commented on the issue:
https://github.com/apache/carbondata/pull/2032 SDV Build Fail, Please check CI http://144.76.159.231:8080/job/ApacheSDVTests/3777/ ---
Github user CarbonDataQA commented on the issue:
https://github.com/apache/carbondata/pull/2032 Build Failed with Spark 2.1.0, Please check CI http://136.243.101.176:8080/job/ApacheCarbonPRBuilder1/4079/ ---
Github user sounakr commented on the issue:
https://github.com/apache/carbondata/pull/2032 Retest this please. ---
Github user CarbonDataQA commented on the issue:
https://github.com/apache/carbondata/pull/2032 Build Failed with Spark 2.2.1, Please check CI http://88.99.58.216:8080/job/ApacheCarbonPRBuilder/2838/ ---
Github user CarbonDataQA commented on the issue:
https://github.com/apache/carbondata/pull/2032 Build Failed with Spark 2.1.0, Please check CI http://136.243.101.176:8080/job/ApacheCarbonPRBuilder1/4083/ ---
Github user CarbonDataQA commented on the issue:
https://github.com/apache/carbondata/pull/2032 Build Failed with Spark 2.1.0, Please check CI http://136.243.101.176:8080/job/ApacheCarbonPRBuilder1/4086/ ---
Github user CarbonDataQA commented on the issue:
https://github.com/apache/carbondata/pull/2032 Build Failed with Spark 2.2.1, Please check CI http://88.99.58.216:8080/job/ApacheCarbonPRBuilder/2841/ ---
Github user ajantha-bhat commented on the issue:
https://github.com/apache/carbondata/pull/2032 Retest this please ---
Github user CarbonDataQA commented on the issue:
https://github.com/apache/carbondata/pull/2032 Build Failed with Spark 2.2.1, Please check CI http://88.99.58.216:8080/job/ApacheCarbonPRBuilder/2847/ ---
Github user CarbonDataQA commented on the issue:
https://github.com/apache/carbondata/pull/2032 Build Failed with Spark 2.1.0, Please check CI http://136.243.101.176:8080/job/ApacheCarbonPRBuilder1/4092/ ---
Github user ajantha-bhat commented on the issue:
https://github.com/apache/carbondata/pull/2032 retest this please ---
Github user CarbonDataQA commented on the issue:
https://github.com/apache/carbondata/pull/2032 Build Failed with Spark 2.2.1, Please check CI http://88.99.58.216:8080/job/ApacheCarbonPRBuilder/2849/ ---
Github user CarbonDataQA commented on the issue:
https://github.com/apache/carbondata/pull/2032 Build Failed with Spark 2.1.0, Please check CI http://136.243.101.176:8080/job/ApacheCarbonPRBuilder1/4094/ ---