What is the fastest way to load a big dataset?

12 Sep 2020


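When starting a data analysis, the most common data formats are probably xlsx and csv.

When I was new to Python, simply converting an xlsx file to csv was enough to cut file load time dramatically.

But as I gained analysis experience and my datasets kept growing, loading csv datasets stopped being satisfactory too, so I started looking for a better file format.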

Stack Overflow

What is the fastest way to upload a big csv file in notebook to work with python pandas?

https://stackoverflow.com/questions/37010212/what-is-the-fastest-way-to-upload-a-big-csv-file-in-notebook-to-work-with-python

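It feels like all the coding knowledge in the world lives on Stack Overflow. The answer above is where I first learned about the hdf format, and switching to hdf gave a much bigger improvement than going from xlsx to csv had.

Still, human greed knows no bounds: I became curious about other formats and felt a strong urge to test for myself which format is the most efficient in which situation.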

pandas data format list

First, I ran the experiment on the file formats that pandas can write, listed below with their main options.

'''
DataFrame.to_csv      compression : {'infer', 'gzip', 'bz2', 'zip', 'xz', None}, default 'infer'
DataFrame.to_json     orient : {'split', 'records', 'index', 'columns', 'values', 'table'}, default 'columns'
                      compression : {'infer', 'gzip', 'bz2', 'zip', 'xz', None}, default 'infer'
DataFrame.to_excel    None
DataFrame.to_hdf      complib : {'zlib', 'lzo', 'bzip2', 'blosc'}, default 'zlib'
                      format : {'fixed', 'table', None}, default 'fixed'
DataFrame.to_feather  None
DataFrame.to_parquet  compression : {'snappy', 'gzip', 'brotli', None}, default 'snappy'
DataFrame.to_stata    None
DataFrame.to_pickle   compression : {'infer', 'gzip', 'bz2', 'zip', 'xz', None}, default 'infer'
'''
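
To make the comparison concrete, here is a minimal sketch of how one write and one read per format could be timed. It is not the exact script behind this post: the helper name `benchmark_formats`, the file names, and the small synthetic DataFrame are assumptions made for illustration, and only csv, gzip-compressed csv, and pickle are included because they need no optional dependency (hdf needs PyTables; feather and parquet need pyarrow or a similar engine).

```python
import os
import tempfile
import time

import numpy as np
import pandas as pd


def benchmark_formats(df, directory):
    """Time one write and one read of `df` for a few pandas formats.

    Hypothetical helper for illustration; returns one row per format.
    """
    # name -> (file name, writer, reader)
    formats = {
        "csv": (
            "data.csv",
            lambda p: df.to_csv(p, index=False),
            lambda p: pd.read_csv(p),
        ),
        "csv.gz": (
            "data.csv.gz",
            lambda p: df.to_csv(p, index=False, compression="gzip"),
            lambda p: pd.read_csv(p, compression="gzip"),
        ),
        "pickle": (
            "data.pkl",
            lambda p: df.to_pickle(p),
            lambda p: pd.read_pickle(p),
        ),
    }

    rows = []
    for name, (file_name, write, read) in formats.items():
        path = os.path.join(directory, file_name)

        start = time.perf_counter()
        write(path)
        written = time.perf_counter()
        loaded = read(path)
        finished = time.perf_counter()

        rows.append(
            {
                "format": name,
                "write_s": written - start,
                "read_s": finished - written,
                "size_mb": os.path.getsize(path) / 1e6,
                "rows_read": len(loaded),
            }
        )
    return pd.DataFrame(rows)


# Small synthetic stand-in for the real dataset (assumption).
rng = np.random.default_rng(0)
sample = pd.DataFrame(rng.normal(size=(10_000, 5)), columns=list("abcde"))

with tempfile.TemporaryDirectory() as tmp:
    results = benchmark_formats(sample, tmp)

print(results.sort_values("read_s"))
```

A single timed run like this is noisy, so for a real comparison it is probably worth repeating each read several times and looking at file size alongside the timings.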

I used the file train/02_M01_DC_train.csv from the phm_data_challenge_2018 dataset.

File size: 1065 MB

The ranking is quite subjective, though I tried to be objective, and it focuses mainly on the following considerations.