[GitHub] [carbondata] jackylk opened a new pull request #3449: [CARBONDATA-3578] make table status file smaller


GitBox
jackylk opened a new pull request #3449: [CARBONDATA-3578] make table status file smaller
URL: https://github.com/apache/carbondata/pull/3449
 
 
   Currently, each segment entry in the table status file occupies 347 bytes, so a table with 10,000 segments has a table status file of 3.47 MB.
   Since CarbonData relies heavily on this file, it is better to reduce its size to improve IO, especially in data lake scenarios.
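
As a quick check of the arithmetic, the sketch below multiplies a per-entry size by a segment count. The function name `estimated_size_mb` is hypothetical, and the sizes are the per-entry figures quoted in this description (decimal megabytes, 1 MB = 1,000,000 bytes).

```python
def estimated_size_mb(entry_bytes, segments):
    """Estimate the table status file size in decimal megabytes."""
    return entry_bytes * segments / 1_000_000


# Per-entry sizes quoted in this PR description, for 10,000 segments.
for entry_bytes in (347, 118):
    size = estimated_size_mb(entry_bytes, 10_000)
    print(f"{entry_bytes} B/entry -> {size:.2f} MB")
# prints:
# 347 B/entry -> 3.47 MB
# 118 B/entry -> 1.18 MB
```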
   
   Each entry in the table status file is one LoadMetadataDetails object.
   This PR makes the following changes to LoadMetadataDetails to reduce its size:
   1. Do not write fields that have their default value, such as "visibility" and "fileFormat".
   2. Use shorter keys; for example, "loadStatus" is changed to "ls".
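
The two changes can be illustrated with a minimal Python sketch. This is not the PR's Java code: `compact_entry`, `SHORT_KEYS`, `DEFAULTS` and `SHORT_STATUS` are hypothetical names, the key mapping is read off the before/after examples in this description, and the default values are assumed to be whatever the "before" example carries for fields that are absent from the "after" example.

```python
import json

# Short keys as they appear in the "after" example of this description.
SHORT_KEYS = {
    "timestamp": "ts",
    "loadStatus": "ls",
    "loadName": "ln",
    "dataSize": "ds",
    "indexSize": "is",
    "loadStartTime": "lt",
    "segmentFile": "sf",
}

# Assumed defaults: values of the fields that the "after" example omits.
DEFAULTS = {
    "partitionCount": "0",
    "isDeleted": "FALSE",
    "updateDeltaEndTimestamp": "",
    "updateDeltaStartTimestamp": "",
    "updateStatusFileName": "",
    "visibility": "true",
    "fileFormat": "columnar_v3",
}

# Only the "Success" -> "S" abbreviation is visible in the description.
SHORT_STATUS = {"Success": "S"}


def compact_entry(entry):
    """Drop fields holding their default value and shorten the remaining keys."""
    compact = {}
    for key, value in entry.items():
        if key in DEFAULTS and DEFAULTS[key] == value:
            continue  # change 1: do not write default values
        if key == "loadStatus":
            value = SHORT_STATUS.get(value, value)
        # change 2: use the short key when one is known, else keep the key as-is
        compact[SHORT_KEYS.get(key, key)] = value
    return compact


legacy_entry = {
    "timestamp": "1573635015982",
    "loadStatus": "Success",
    "loadName": "0",
    "visibility": "true",
    "fileFormat": "columnar_v3",
}
print(json.dumps(compact_entry(legacy_entry)))
```

The short names of the omitted fields are not shown in this description, so the sketch keeps the long key whenever such a field holds a non-default value.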
   
   With this PR, the table status file size is reduced to about 1/3.
   Before change: 347 bytes
   ```json
        {
            "timestamp": "1573635015982",
            "loadStatus": "Success",
            "loadName": "0",
            "partitionCount": "0",
            "isDeleted": "FALSE",
            "dataSize": "2977",
            "indexSize": "1469",
            "updateDeltaEndTimestamp": "",
            "updateDeltaStartTimestamp": "",
            "updateStatusFileName": "",
            "loadStartTime": "1573635014638",
            "visibility": "true",
            "fileFormat": "columnar_v3",
            "segmentFile": "0_1573635014638.segment"
        }
   ```
   
   After change: 118 bytes (about 1/3 of the original size)
   ```json
        {
            "ts": "1573635284677",
            "ls": "S",
            "ln": "0",
            "ds": "2977",
            "is": "1469",
            "lt": "1573635284045",
            "sf": "0_1573635284045.segment"
        }
   ```
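
The ratio can be reproduced approximately by serializing both example entries and comparing their lengths. This is a rough sketch only: it uses Python's `json` module with compact separators, so the absolute byte counts depend on whitespace and need not match the 347 and 118 bytes quoted above; `serialized_size` is a hypothetical helper.

```python
import json

before = {
    "timestamp": "1573635015982",
    "loadStatus": "Success",
    "loadName": "0",
    "partitionCount": "0",
    "isDeleted": "FALSE",
    "dataSize": "2977",
    "indexSize": "1469",
    "updateDeltaEndTimestamp": "",
    "updateDeltaStartTimestamp": "",
    "updateStatusFileName": "",
    "loadStartTime": "1573635014638",
    "visibility": "true",
    "fileFormat": "columnar_v3",
    "segmentFile": "0_1573635014638.segment",
}

after = {
    "ts": "1573635284677",
    "ls": "S",
    "ln": "0",
    "ds": "2977",
    "is": "1469",
    "lt": "1573635284045",
    "sf": "0_1573635284045.segment",
}


def serialized_size(entry):
    """Size in bytes of the entry as compact UTF-8 JSON (no extra whitespace)."""
    return len(json.dumps(entry, separators=(",", ":")).encode("utf-8"))


before_bytes = serialized_size(before)
after_bytes = serialized_size(after)
print(f"before={before_bytes} B, after={after_bytes} B, "
      f"ratio={after_bytes / before_bytes:.2f}")
```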
   
   Backward compatibility is not broken: this PR can still read old table status files, because the old key names are accepted through GSON's @SerializedName(alternate).
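
In Gson, `@SerializedName` takes a primary name plus an `alternate` list of names that are also accepted when deserializing. The Python sketch below imitates only that reading behavior; it is an illustration, not the PR's Java code. `ALTERNATES` and `read_entry` are hypothetical names, and the key pairs are taken from the before/after examples in this description.

```python
import json

# Primary (short) name -> legacy names that are also accepted when reading.
ALTERNATES = {
    "ts": ("timestamp",),
    "ls": ("loadStatus",),
    "ln": ("loadName",),
    "ds": ("dataSize",),
    "is": ("indexSize",),
    "lt": ("loadStartTime",),
    "sf": ("segmentFile",),
}

# Reverse lookup used while parsing: legacy name -> primary name.
LEGACY_TO_PRIMARY = {
    legacy: primary
    for primary, legacy_names in ALTERNATES.items()
    for legacy in legacy_names
}


def read_entry(text):
    """Parse one entry, accepting either the short or the legacy key names."""
    raw = json.loads(text)
    return {LEGACY_TO_PRIMARY.get(key, key): value for key, value in raw.items()}


old_format = '{"loadName": "0", "dataSize": "2977"}'
new_format = '{"ln": "0", "ds": "2977"}'
print(read_entry(old_format) == read_entry(new_format))  # prints: True
```

Only key names are normalized here; changed values, such as "Success" versus "S", would need their own handling, which this sketch does not cover.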
   
    - [X] Any interfaces changed?
    No
    - [X] Any backward compatibility impacted?
    No
    - [X] Document update required?
    No
    - [X] Testing done
           Please provide details on
           - Whether new unit test cases have been added or why no new tests are required?
           - How it is tested? Please attach test report.
           - Is it a performance related change? Please attach the performance test report.
           - Any additional information to help reviewers in testing this change.
    Ran the existing test cases.
    - [X] For large changes, please consider breaking it into sub-tasks under an umbrella JIRA.
    No
   

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
[hidden email]


With regards,
Apache Git Services