GitHub user manishnalla1994 opened a pull request:
https://github.com/apache/carbondata/pull/3047

[CARBONDATA-3223] Fixed Wrong Datasize and Indexsize calculation for old store using Show Segments

Problem: A table created and loaded on an older version (1.1) showed a data-size and index-size of 0B when refreshed on the new version. This happened because when the stored data-size came back as "null" we did not compute it, and instead assigned it a value of 0 directly.

Solution: Compute the correct data-size and index-size using the CarbonTable.

Be sure to do all of the following checklist to help us incorporate your contribution quickly and easily:

- [ ] Any interfaces changed?
- [ ] Any backward compatibility impacted?
- [ ] Document update required?
- [x] Testing done
      Please provide details on
      - Whether new unit test cases have been added or why no new tests are required?
      - How it is tested? Please attach test report.
      - Is it a performance related change? Please attach the performance test report.
      - Any additional information to help reviewers in testing this change.
- [ ] For large changes, please consider breaking it into sub-tasks under an umbrella JIRA.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/manishnalla1994/carbondata Datasize0Issue

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/carbondata/pull/3047.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #3047

----

commit 6bf65d7a0b42e8d9a822fd234a510550bd8d2f17
Author: manishnalla1994 <manish.nalla1994@...>
Date: 2019-01-02T12:30:36Z

    Fixed Wrong Datasize and Indexsize calculation for old store

----

---
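In essence, the change replaces the "null means 0" fallback in CarbonStore.showSegments with a real computation. A minimal sketch of the idea, using only the APIs visible in the review diffs below; the null check and the persist flag are both refined during review, so treat this as illustrative rather than the merged code:

```scala
// Batch-segment branch of CarbonStore.showSegments (sketch).
// `load` is one segment entry read from the table status file.
val (dataSize, indexSize) =
  if (null == load.getDataSize || null == load.getIndexSize) {
    // Segments written by the old store (1.1) carry null sizes in the
    // table status file; compute the real sizes from the table on disk
    // instead of reporting 0B. The second argument controls whether the
    // computed sizes are persisted back to the table status file.
    val dataIndexSize = CarbonUtil.calculateDataIndexSize(carbonTable, true)
    (dataIndexSize.get(CarbonCommonConstants.CARBON_TOTAL_DATA_SIZE).toLong,
      dataIndexSize.get(CarbonCommonConstants.CARBON_TOTAL_INDEX_SIZE).toLong)
  } else {
    // Sizes are already recorded, use them directly.
    (load.getDataSize.toLong, load.getIndexSize.toLong)
  }
```

---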
Github user CarbonDataQA commented on the issue:
https://github.com/apache/carbondata/pull/3047

Build Success with Spark 2.1.0, Please check CI http://136.243.101.176:8080/job/ApacheCarbonPRBuilder2.1/2124/

---
Github user CarbonDataQA commented on the issue:
https://github.com/apache/carbondata/pull/3047

Build Failed with Spark 2.2.1, Please check CI http://95.216.28.178:8080/job/ApacheCarbonPRBuilder1/2330/

---
Github user CarbonDataQA commented on the issue:
https://github.com/apache/carbondata/pull/3047

Build Failed with Spark 2.3.2, Please check CI http://136.243.101.176:8080/job/carbondataprbuilder2.3/10378/

---
Github user qiuchenjian commented on a diff in the pull request:
https://github.com/apache/carbondata/pull/3047#discussion_r244895354

--- Diff: integration/spark-common/src/main/scala/org/apache/carbondata/api/CarbonStore.scala ---
@@ -101,14 +102,21 @@ object CarbonStore {
     val (dataSize, indexSize) = if (load.getFileFormat == FileFormat.ROW_V1) {
       // for streaming segment, we should get the actual size from the index file
       // since it is continuously inserting data
-      val segmentDir = CarbonTablePath.getSegmentPath(tablePath, load.getLoadName)
+      val segmentDir = CarbonTablePath
+        .getSegmentPath(carbonTable.getTablePath, load.getLoadName)
       val indexPath = CarbonTablePath.getCarbonStreamIndexFilePath(segmentDir)
       val indices = StreamSegment.readIndexFile(indexPath, FileFactory.getFileType(indexPath))
       (indices.asScala.map(_.getFile_size).sum, FileFactory.getCarbonFile(indexPath).getSize)
     } else {
       // for batch segment, we can get the data size from table status file directly
-      (if (load.getDataSize == null) 0L else load.getDataSize.toLong,
-        if (load.getIndexSize == null) 0L else load.getIndexSize.toLong)
+      if (null == load.getDataSize && null == load.getIndexSize) {
+        val dataIndexSize = CarbonUtil.calculateDataIndexSize(carbonTable, false)
+        (dataIndexSize.get(CarbonCommonConstants.CARBON_TOTAL_DATA_SIZE).toLong,
+          dataIndexSize.get(CarbonCommonConstants.CARBON_TOTAL_INDEX_SIZE).toLong)
+      } else {
+        (load.getDataSize.toLong,
--- End diff --

If only one of load.getDataSize and load.getIndexSize is null, this will throw an exception; I think that case should be handled.

---
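The concern here: with the `&&` check, a table status entry in which only one of the two sizes is null falls into the else branch, and calling `.toLong` on the null field throws. A hedged sketch of the guard the reviewer is asking for, matching the version that shows up in the later diffs:

```scala
// Recompute if either size is missing, so a half-populated table status
// entry can never reach the .toLong calls on a null field.
if (null == load.getDataSize || null == load.getIndexSize) {
  val dataIndexSize = CarbonUtil.calculateDataIndexSize(carbonTable, false)
  (dataIndexSize.get(CarbonCommonConstants.CARBON_TOTAL_DATA_SIZE).toLong,
    dataIndexSize.get(CarbonCommonConstants.CARBON_TOTAL_INDEX_SIZE).toLong)
} else {
  (load.getDataSize.toLong, load.getIndexSize.toLong)
}
```

---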
Github user manishnalla1994 commented on a diff in the pull request:
https://github.com/apache/carbondata/pull/3047#discussion_r244911752

--- Diff: integration/spark-common/src/main/scala/org/apache/carbondata/api/CarbonStore.scala ---
@@ -101,14 +102,21 @@ object CarbonStore {
     val (dataSize, indexSize) = if (load.getFileFormat == FileFormat.ROW_V1) {
       // for streaming segment, we should get the actual size from the index file
       // since it is continuously inserting data
-      val segmentDir = CarbonTablePath.getSegmentPath(tablePath, load.getLoadName)
+      val segmentDir = CarbonTablePath
+        .getSegmentPath(carbonTable.getTablePath, load.getLoadName)
       val indexPath = CarbonTablePath.getCarbonStreamIndexFilePath(segmentDir)
       val indices = StreamSegment.readIndexFile(indexPath, FileFactory.getFileType(indexPath))
       (indices.asScala.map(_.getFile_size).sum, FileFactory.getCarbonFile(indexPath).getSize)
     } else {
       // for batch segment, we can get the data size from table status file directly
-      (if (load.getDataSize == null) 0L else load.getDataSize.toLong,
-        if (load.getIndexSize == null) 0L else load.getIndexSize.toLong)
+      if (null == load.getDataSize && null == load.getIndexSize) {
+        val dataIndexSize = CarbonUtil.calculateDataIndexSize(carbonTable, false)
+        (dataIndexSize.get(CarbonCommonConstants.CARBON_TOTAL_DATA_SIZE).toLong,
+          dataIndexSize.get(CarbonCommonConstants.CARBON_TOTAL_INDEX_SIZE).toLong)
+      } else {
+        (load.getDataSize.toLong,
--- End diff --

Yes, fixed it now.

---
Github user CarbonDataQA commented on the issue:
https://github.com/apache/carbondata/pull/3047

Build Success with Spark 2.1.0, Please check CI http://136.243.101.176:8080/job/ApacheCarbonPRBuilder2.1/2135/

---
Github user CarbonDataQA commented on the issue:
https://github.com/apache/carbondata/pull/3047

Build Success with Spark 2.3.2, Please check CI http://136.243.101.176:8080/job/carbondataprbuilder2.3/10389/

---
Github user CarbonDataQA commented on the issue:
https://github.com/apache/carbondata/pull/3047

Build Success with Spark 2.2.1, Please check CI http://95.216.28.178:8080/job/ApacheCarbonPRBuilder1/2341/

---
Github user manishgupta88 commented on a diff in the pull request:
https://github.com/apache/carbondata/pull/3047#discussion_r244922117

--- Diff: integration/spark-common/src/main/scala/org/apache/carbondata/api/CarbonStore.scala ---
@@ -101,14 +102,23 @@ object CarbonStore {
     val (dataSize, indexSize) = if (load.getFileFormat == FileFormat.ROW_V1) {
       // for streaming segment, we should get the actual size from the index file
       // since it is continuously inserting data
-      val segmentDir = CarbonTablePath.getSegmentPath(tablePath, load.getLoadName)
+      val segmentDir = CarbonTablePath
+        .getSegmentPath(carbonTable.getTablePath, load.getLoadName)
       val indexPath = CarbonTablePath.getCarbonStreamIndexFilePath(segmentDir)
       val indices = StreamSegment.readIndexFile(indexPath, FileFactory.getFileType(indexPath))
       (indices.asScala.map(_.getFile_size).sum, FileFactory.getCarbonFile(indexPath).getSize)
     } else {
       // for batch segment, we can get the data size from table status file directly
-      (if (load.getDataSize == null) 0L else load.getDataSize.toLong,
-        if (load.getIndexSize == null) 0L else load.getIndexSize.toLong)
+      if (null == load.getDataSize || null == load.getIndexSize) {
+        // If either of datasize or indexsize comes to be null then we calculate
+        // the correct size and assign it
+        val dataIndexSize = CarbonUtil.calculateDataIndexSize(carbonTable, false)
--- End diff --

The boolean flag in this method call controls whether the computed data and index sizes are written back to the table status file. Pass the flag as true so that it computes the sizes and updates the table status file; this avoids recalculating them on every Show Segments call.

---
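For reference, a sketch of the suggested call, assuming the flag semantics described in the comment above (the second argument persists the computed sizes):

```scala
// true => also write the computed sizes back to the table status file, so
// subsequent SHOW SEGMENTS calls can read them directly instead of recomputing.
val dataIndexSize = CarbonUtil.calculateDataIndexSize(carbonTable, true)
```

---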
Github user manishgupta88 commented on a diff in the pull request:
https://github.com/apache/carbondata/pull/3047#discussion_r244920921

--- Diff: integration/spark-common/src/main/scala/org/apache/carbondata/api/CarbonStore.scala ---
@@ -46,9 +47,9 @@ object CarbonStore {

   def showSegments(
       limit: Option[String],
-      tablePath: String,
+      carbonTable: CarbonTable,
--- End diff --

Move `carbonTable` to be the first argument of the method.

---
Github user CarbonDataQA commented on the issue:
https://github.com/apache/carbondata/pull/3047

Build Success with Spark 2.1.0, Please check CI http://136.243.101.176:8080/job/ApacheCarbonPRBuilder2.1/2143/

---
Github user manishgupta88 commented on the issue:
https://github.com/apache/carbondata/pull/3047

LGTM...can be merged once build passes

---
Github user CarbonDataQA commented on the issue:
https://github.com/apache/carbondata/pull/3047

Build Failed with Spark 2.3.2, Please check CI http://136.243.101.176:8080/job/carbondataprbuilder2.3/10397/

---
Github user manishnalla1994 commented on a diff in the pull request:
https://github.com/apache/carbondata/pull/3047#discussion_r244957693

--- Diff: integration/spark-common/src/main/scala/org/apache/carbondata/api/CarbonStore.scala ---
@@ -46,9 +47,9 @@ object CarbonStore {

   def showSegments(
       limit: Option[String],
-      tablePath: String,
+      carbonTable: CarbonTable,
--- End diff --

Done.

---
Github user manishnalla1994 commented on a diff in the pull request:
https://github.com/apache/carbondata/pull/3047#discussion_r244957746

--- Diff: integration/spark-common/src/main/scala/org/apache/carbondata/api/CarbonStore.scala ---
@@ -101,14 +102,23 @@ object CarbonStore {
     val (dataSize, indexSize) = if (load.getFileFormat == FileFormat.ROW_V1) {
       // for streaming segment, we should get the actual size from the index file
       // since it is continuously inserting data
-      val segmentDir = CarbonTablePath.getSegmentPath(tablePath, load.getLoadName)
+      val segmentDir = CarbonTablePath
+        .getSegmentPath(carbonTable.getTablePath, load.getLoadName)
       val indexPath = CarbonTablePath.getCarbonStreamIndexFilePath(segmentDir)
       val indices = StreamSegment.readIndexFile(indexPath, FileFactory.getFileType(indexPath))
       (indices.asScala.map(_.getFile_size).sum, FileFactory.getCarbonFile(indexPath).getSize)
     } else {
       // for batch segment, we can get the data size from table status file directly
-      (if (load.getDataSize == null) 0L else load.getDataSize.toLong,
-        if (load.getIndexSize == null) 0L else load.getIndexSize.toLong)
+      if (null == load.getDataSize || null == load.getIndexSize) {
+        // If either of datasize or indexsize comes to be null then we calculate
+        // the correct size and assign it
+        val dataIndexSize = CarbonUtil.calculateDataIndexSize(carbonTable, false)
--- End diff --

Fixed.

---
Github user CarbonDataQA commented on the issue:
https://github.com/apache/carbondata/pull/3047

Build Success with Spark 2.2.1, Please check CI http://95.216.28.178:8080/job/ApacheCarbonPRBuilder1/2349/

---
Github user manishnalla1994 commented on the issue:
https://github.com/apache/carbondata/pull/3047

retest this please

---
Github user CarbonDataQA commented on the issue:
https://github.com/apache/carbondata/pull/3047

Build Failed with Spark 2.2.1, Please check CI http://95.216.28.178:8080/job/ApacheCarbonPRBuilder1/2361/

---
Github user KanakaKumar commented on a diff in the pull request:
https://github.com/apache/carbondata/pull/3047#discussion_r244980360

--- Diff: integration/spark-common/src/main/scala/org/apache/carbondata/api/CarbonStore.scala ---
@@ -101,14 +102,23 @@ object CarbonStore {
     val (dataSize, indexSize) = if (load.getFileFormat == FileFormat.ROW_V1) {
       // for streaming segment, we should get the actual size from the index file
       // since it is continuously inserting data
-      val segmentDir = CarbonTablePath.getSegmentPath(tablePath, load.getLoadName)
+      val segmentDir = CarbonTablePath
+        .getSegmentPath(carbonTable.getTablePath, load.getLoadName)
       val indexPath = CarbonTablePath.getCarbonStreamIndexFilePath(segmentDir)
       val indices = StreamSegment.readIndexFile(indexPath, FileFactory.getFileType(indexPath))
       (indices.asScala.map(_.getFile_size).sum, FileFactory.getCarbonFile(indexPath).getSize)
     } else {
       // for batch segment, we can get the data size from table status file directly
-      (if (load.getDataSize == null) 0L else load.getDataSize.toLong,
-        if (load.getIndexSize == null) 0L else load.getIndexSize.toLong)
+      if (null == load.getDataSize || null == load.getIndexSize) {
+        // If either of datasize or indexsize comes to be null then we calculate
+        // the correct size and assign it
+        val dataIndexSize = CarbonUtil.calculateDataIndexSize(carbonTable, true)
--- End diff --

Show Segments is a read-only query; I think we should not perform a write operation inside a query. So I feel it is better either to calculate the sizes every time and show them, or simply to display them as not available.

---
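A sketch of the read-only alternative suggested here, assuming the same flag semantics described earlier in the thread: pass false so the query writes nothing, at the cost of recomputing the sizes on every call:

```scala
// false => compute only; SHOW SEGMENTS stays a pure read and never touches
// the table status file (sizes are recomputed on each call instead).
val dataIndexSize = CarbonUtil.calculateDataIndexSize(carbonTable, false)
```

---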