*Background*: This feature will allow the CarbonData SDK to load data from Parquet, ORC, CSV, Avro, and JSON files. Details of the solution and implementation are in the document attached to the JIRA issue: https://issues.apache.org/jira/browse/CARBONDATA-3855

NOTE: The current design covers loading data from Parquet, ORC, CSV, and Avro files. JSON files will be handled later.

Thanks,
Nihal Ojha

--
Sent from: http://apache-carbondata-dev-mailing-list-archive.1130556.n5.nabble.com/
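To make the intended flow concrete, here is a minimal, self-contained Java sketch of how loading a Parquet file through the SDK might look. The builder/writer classes and the `withParquetPath` method are hypothetical stand-ins modeled on this thread's discussion, not the final CarbonData API; the mock simply buffers rows instead of converting them to the carbon columnar format.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for the planned builder; the real CarbonWriter
// would read the source file and convert rows to the carbon format.
class MockCarbonWriterBuilder {
    private String inputPath;

    MockCarbonWriterBuilder withParquetPath(String path) {
        this.inputPath = path;  // remember the source-file location
        return this;
    }

    MockCarbonWriter build() {
        return new MockCarbonWriter(inputPath);
    }
}

class MockCarbonWriter {
    private final String inputPath;
    final List<String> rows = new ArrayList<>();

    MockCarbonWriter(String inputPath) {
        this.inputPath = inputPath;
    }

    // In the real SDK, write() would iterate the source file's rows;
    // here we simulate two rows being read and buffered.
    void write() {
        rows.add("row1-from-" + inputPath);
        rows.add("row2-from-" + inputPath);
    }

    void close() {
        // real SDK: flush buffered rows to disk as carbon files
    }
}

public class LoadSketch {
    public static void main(String[] args) {
        MockCarbonWriter w = new MockCarbonWriterBuilder()
                .withParquetPath("/data/input.parquet")  // hypothetical path
                .build();
        w.write();  // read source rows and buffer them
        w.close();  // persist in carbon format (no-op in this mock)
        System.out.println(w.rows.size());  // prints 2
    }
}
```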
+1.

If the CarbonData SDK can load data from Parquet, ORC, CSV, Avro, and JSON files, it will be more convenient for users: no one has to write their own code to parse each file format and convert it to the carbon format. The SDK can refer to the spark-sql implementation, but it should not depend on Spark, because the SDK is usually used in environments without Spark. Instead, CarbonData can integrate the Parquet, JSON, and CSV parser libraries directly.

For the API design:

1. For loading Parquet, Avro, JSON, and CSV files:
(1) It is better to support loading data many times after creating the CarbonWriter.
(2) It is better to support a schema map. For example, carbon and Parquet may use different column names: a column may be named col1 in carbon but col2 in Parquet. The SDK should let users map Parquet col2 to carbon col1.

2. API naming:
public CarbonWriterBuilder withParquetFile(String filePath)
Suggestion: a Parquet dataset may consist of many files in the same folder, and withParquetFile is easily misunderstood as accepting only a single Parquet file, so we can use withParquet(String filePath) or withParquetPath instead. The same applies to withOrcFile and withCsvFile. It is also better to support loading a list of files, for example letting users select three of the five files in a folder.

3. public void write() throws IOException;
We should support loading data after creating the CarbonWriter: load the data into memory and convert it to the carbon format when the load method is called.

4. Currently data is written to disk/OBS/HDFS only when close() is called; it is better to also support flush().
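The schema-map idea in point 1(2) above can be sketched as a simple name-mapping step applied while reading source columns. This is an illustration only, assuming a plain `Map<String, String>` from source (Parquet) column names to carbon column names; `mapColumn` is a hypothetical helper, not an SDK method.

```java
import java.util.HashMap;
import java.util.Map;

public class SchemaMapSketch {

    // Resolve the carbon column name for a source column; columns without
    // an explicit mapping keep their original name.
    static String mapColumn(Map<String, String> schemaMap, String sourceColumn) {
        return schemaMap.getOrDefault(sourceColumn, sourceColumn);
    }

    public static void main(String[] args) {
        Map<String, String> parquetToCarbon = new HashMap<>();
        parquetToCarbon.put("col2", "col1");  // parquet col2 -> carbon col1

        System.out.println(mapColumn(parquetToCarbon, "col2"));  // prints col1
        System.out.println(mapColumn(parquetToCarbon, "id"));    // prints id (unmapped)
    }
}
```

Keeping unmapped columns unchanged means users only need to declare the names that actually differ between the two schemas.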
Updated the design document in JIRA after addressing the latest comments.

1. Changed the API names as suggested.
2. The user can now load a single file, all files in a given directory, or selected files under the directory.
3. First we create the carbon writer and then call the write() method, which reads the data from the file iteratively and writes it in the carbon format.
4. We keep the current interface for the close() method; flush() will be supported later.

Thanks and Regards,
Nihal Ojha
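Point 3 above (write() reading the source file iteratively rather than loading it whole) can be sketched as a loop over a row iterator. The classes and `openSource` helper here are hypothetical stand-ins; a real implementation would wrap a Parquet/ORC/CSV/Avro reader behind the iterator.

```java
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

public class IterativeWriteSketch {

    // Stand-in for a row iterator over a Parquet/ORC/CSV/Avro reader;
    // a real reader would stream rows from the file instead.
    static Iterator<Object[]> openSource() {
        List<Object[]> rows = Arrays.asList(
                new Object[]{"a", 1},
                new Object[]{"b", 2});
        return rows.iterator();
    }

    // Pull rows one at a time and "write" each, so memory use stays
    // bounded by a single row (or batch) rather than the whole file.
    static int write(Iterator<Object[]> source) {
        int written = 0;
        while (source.hasNext()) {
            Object[] row = source.next();
            // real SDK: convert `row` and append it in the carbon format
            written++;
        }
        return written;
    }

    public static void main(String[] args) {
        System.out.println(write(openSource()));  // prints 2
    }
}
```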