Apache CarbonData Dev Mailing List archive › Apache CarbonData JIRA issues

[GitHub] [carbondata] marchpure opened a new pull request #3880: [CARBONDATA-3879] Filtering Segmets Optimazation

Classic

List

10 messages Options

Options

GitBox

[GitHub] [carbondata] marchpure opened a new pull request #3880: [CARBONDATA-3879] Filtering Segmets Optimazation

marchpure opened a new pull request #3880:
URL: https://github.com/apache/carbondata/pull/3880

### Why is this PR needed?
During filter segments flow, there are a lot of LIST.CONTAINS, which has heavy time overhead when there are tens of thousands segments.

For example, if there are 50000 segments. it will trigger LIST.CONTAINS for each segment, the LIST also has about 50000 elements. so the time complexity will be O（50000 * 50000 ）

### What changes were proposed in this PR?
Change List.CONTAINS to MAP.containsKEY

### Does this PR introduce any user interface change?
- No

### Is any new testcase added?
- No

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[hidden email]

GitBox

[GitHub] [carbondata] CarbonDataQA1 commented on pull request #3880: [CARBONDATA-3879] Filtering Segmets Optimazation

CarbonDataQA1 commented on pull request #3880:
URL: https://github.com/apache/carbondata/pull/3880#issuecomment-669675655

Build Success with Spark 2.4.5, Please check CI http://121.244.95.60:12545/job/ApacheCarbon_PR_Builder_2.4.5/1884/

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[hidden email]

GitBox

[GitHub] [carbondata] CarbonDataQA1 commented on pull request #3880: [CARBONDATA-3879] Filtering Segmets Optimazation

In reply to this post by GitBox

CarbonDataQA1 commented on pull request #3880:
URL: https://github.com/apache/carbondata/pull/3880#issuecomment-669677665

Build Success with Spark 2.3.4, Please check CI http://121.244.95.60:12545/job/ApacheCarbonPRBuilder2.3/3623/

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[hidden email]

GitBox

[GitHub] [carbondata] ajantha-bhat commented on a change in pull request #3880: [CARBONDATA-3879] Filtering Segmets Optimazation

In reply to this post by GitBox

ajantha-bhat commented on a change in pull request #3880:
URL: https://github.com/apache/carbondata/pull/3880#discussion_r466243016

##########
File path: hadoop/src/main/java/org/apache/carbondata/hadoop/api/CarbonTableInputFormat.java
##########
@@ -389,14 +389,16 @@ protected FileSplit makeSplit(String segmentId, String filePath, long start, lon

public void updateLoadMetaDataDetailsToSegments(List<Segment> validSegments,
List<org.apache.carbondata.hadoop.CarbonInputSplit> prunedSplits) {
+ Map<String, Segment> validSegmentsMap = validSegments.stream()

Review comment:
creating a map everytime for query is also overhead when the valid segments is in thousands, I suggest we can use `SET` for `validSegments` when it is formed originally instead of 'LIST'

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[hidden email]

GitBox

[GitHub] [carbondata] marchpure commented on a change in pull request #3880: [CARBONDATA-3879] Filtering Segmets Optimazation

In reply to this post by GitBox

marchpure commented on a change in pull request #3880:
URL: https://github.com/apache/carbondata/pull/3880#discussion_r466265360

##########
File path: hadoop/src/main/java/org/apache/carbondata/hadoop/api/CarbonTableInputFormat.java
##########
@@ -389,14 +389,16 @@ protected FileSplit makeSplit(String segmentId, String filePath, long start, lon

public void updateLoadMetaDataDetailsToSegments(List<Segment> validSegments,
List<org.apache.carbondata.hadoop.CarbonInputSplit> prunedSplits) {
+ Map<String, Segment> validSegmentsMap = validSegments.stream()

Review comment:
Appreciate for that good suggestion. But here we can only use Map instead of Set.

Reason: The The pseudo-code for this function is shown as below.

// **0. segments.hashcode is segmentno, to when we compare 2 segments, only segmentno will be compared**
if (validSegments.contains(segmentInSplit)) {
**1. fetch the segment from validSegments**
Segment segmentInValidSegment <- fetch the segment from validSegments
2.
segmentInSplit.setLoadMetadataDetails(segmentInValidSegment.getLoadMetadataDetails);
}

For SET. it's hard to fetch the segment from validSegments.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[hidden email]

GitBox

[GitHub] [carbondata] ajantha-bhat commented on a change in pull request #3880: [CARBONDATA-3879] Filtering Segmets Optimazation

In reply to this post by GitBox

ajantha-bhat commented on a change in pull request #3880:
URL: https://github.com/apache/carbondata/pull/3880#discussion_r466270144

##########
File path: hadoop/src/main/java/org/apache/carbondata/hadoop/api/CarbonTableInputFormat.java
##########
@@ -389,14 +389,16 @@ protected FileSplit makeSplit(String segmentId, String filePath, long start, lon

public void updateLoadMetaDataDetailsToSegments(List<Segment> validSegments,
List<org.apache.carbondata.hadoop.CarbonInputSplit> prunedSplits) {
+ Map<String, Segment> validSegmentsMap = validSegments.stream()

Review comment:
if you see the `equals` implementation of `Segment.java`, it is based on only segment number comparison. so, I think you can still use `SET`

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[hidden email]

GitBox

[GitHub] [carbondata] marchpure commented on a change in pull request #3880: [CARBONDATA-3879] Filtering Segmets Optimazation

In reply to this post by GitBox

marchpure commented on a change in pull request #3880:
URL: https://github.com/apache/carbondata/pull/3880#discussion_r466275337

##########
File path: hadoop/src/main/java/org/apache/carbondata/hadoop/api/CarbonTableInputFormat.java
##########
@@ -389,14 +389,16 @@ protected FileSplit makeSplit(String segmentId, String filePath, long start, lon

public void updateLoadMetaDataDetailsToSegments(List<Segment> validSegments,
List<org.apache.carbondata.hadoop.CarbonInputSplit> prunedSplits) {
+ Map<String, Segment> validSegmentsMap = validSegments.stream()

Review comment:
Yeah. But if use SET. it's hard to read a element with specified segmentno.
In this code. we need to read a segment with specified segmentno from validSegments.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[hidden email]

GitBox

[GitHub] [carbondata] ajantha-bhat commented on a change in pull request #3880: [CARBONDATA-3879] Filtering Segmets Optimazation

In reply to this post by GitBox

ajantha-bhat commented on a change in pull request #3880:
URL: https://github.com/apache/carbondata/pull/3880#discussion_r466277616

##########
File path: hadoop/src/main/java/org/apache/carbondata/hadoop/api/CarbonTableInputFormat.java
##########
@@ -389,14 +389,16 @@ protected FileSplit makeSplit(String segmentId, String filePath, long start, lon

public void updateLoadMetaDataDetailsToSegments(List<Segment> validSegments,
List<org.apache.carbondata.hadoop.CarbonInputSplit> prunedSplits) {
+ Map<String, Segment> validSegmentsMap = validSegments.stream()

Review comment:
oh, I got what you mean, we still need to get the element from valid segment to read its LoadMetadataDetails. So, SET can help only for contains, but not for getting the valid segment.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[hidden email]

GitBox

[GitHub] [carbondata] ajantha-bhat commented on pull request #3880: [CARBONDATA-3879] Filtering Segmets Optimazation

In reply to this post by GitBox

ajantha-bhat commented on pull request #3880:
URL: https://github.com/apache/carbondata/pull/3880#issuecomment-669819275

LGTM

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[hidden email]

GitBox

[GitHub] [carbondata] asfgit closed pull request #3880: [CARBONDATA-3879] Filtering Segmets Optimazation

In reply to this post by GitBox

asfgit closed pull request #3880:
URL: https://github.com/apache/carbondata/pull/3880

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[hidden email]