浅析Pandas中append方法的效率问题

　　效率在大数据量下是非常重要的。

定位问题：for循环下使用df.append添加数据（数据量几十万级）
效率：非常低下，macbook pro跑8h跑不完。
解决方法：先将df转为list，再使用.append()。
原因：Pandas的.append方法不改变原有对象，而是创建一个新的对象。而列表则不会如此。
具体官方代码：

def append(self, other, ignore_index=False, verify_integrity=False, sort=None):
    """
    Append rows of `other` to the end of caller, returning a new object.

    Columns in `other` that are not in the caller are added as new columns.

    Parameters
    ----------
    other : DataFrame or Series/dict-like object, or list of these
        The data to append.
    ignore_index : boolean, default False
        If True, do not use the index labels.
    verify_integrity : boolean, default False
        If True, raise ValueError on creating index with duplicates.
    sort : boolean, default None
        Sort columns if the columns of `self` and `other` are not aligned.
        The default sorting is deprecated and will change to not-sorting
        in a future version of pandas. Explicitly pass ``sort=True`` to
        silence the warning and sort. Explicitly pass ``sort=False`` to
        silence the warning and not sort.

        .. versionadded:: 0.23.0

    Returns
    -------
    DataFrame

    See Also
    --------
    concat : General function to concatenate DataFrame or Series objects.

    Notes
    -----
    If a list of dict/series is passed and the keys are all contained in
    the DataFrame's index, the order of the columns in the resulting
    DataFrame will be unchanged.

    Iteratively appending rows to a DataFrame can be more computationally
    intensive than a single concatenate. A better solution is to append
    those rows to a list and then concatenate the list with the original
    DataFrame all at once.

    Examples
    --------

    >>> df = pd.DataFrame([[1, 2], [3, 4]], columns=list('AB'))
    >>> df
       A  B
    0  1  2
    1  3  4
    >>> df2 = pd.DataFrame([[5, 6], [7, 8]], columns=list('AB'))
    >>> df.append(df2)
       A  B
    0  1  2
    1  3  4
    0  5  6
    1  7  8

    With `ignore_index` set to True:

    >>> df.append(df2, ignore_index=True)
       A  B
    0  1  2
    1  3  4
    2  5  6
    3  7  8

    The following, while not recommended methods for generating DataFrames,
    show two ways to generate a DataFrame from multiple data sources.

    Less efficient:

    >>> df = pd.DataFrame(columns=['A'])
    >>> for i in range(5):
    ...     df = df.append({'A': i}, ignore_index=True)
    >>> df
       A
    0  0
    1  1
    2  2
    3  3
    4  4

    More efficient:

    >>> pd.concat([pd.DataFrame([i], columns=['A']) for i in range(5)],
    ...           ignore_index=True)
       A
    0  0
    1  1
    2  2
    3  3
    4  4
    """
    if isinstance(other, (Series, dict)):
        if isinstance(other, dict):
            other = Series(other)
        if other.name is None and not ignore_index:
            raise TypeError(
                "Can only append a Series if ignore_index=True"
                " or if the Series has a name"
            )

        if other.name is None:
            index = None
        else:
            # other must have the same index name as self, otherwise
            # index name will be reset
            index = Index([other.name], name=self.index.name)

        idx_diff = other.index.difference(self.columns)
        try:
            combined_columns = self.columns.append(idx_diff)
        except TypeError:
            combined_columns = self.columns.astype(object).append(idx_diff)
        other = other.reindex(combined_columns, copy=False)
        other = DataFrame(
            other.values.reshape((1, len(other))),
            index=index,
            columns=combined_columns,
        )
        other = other._convert(datetime=True, timedelta=True)
        if not self.columns.equals(combined_columns):
            self = self.reindex(columns=combined_columns)
    elif isinstance(other, list) and not isinstance(other[0], DataFrame):
        other = DataFrame(other)
        if (self.columns.get_indexer(other.columns) >= 0).all():
            other = other.reindex(columns=self.columns)

    from pandas.core.reshape.concat import concat

    if isinstance(other, (list, tuple)):
        to_concat = [self] + other
    else:
        to_concat = [self, other]
    return concat(
        to_concat,
        ignore_index=ignore_index,
        verify_integrity=verify_integrity,
        sort=sort,
    )

该函数下的官方注释也特地强调了效率问题：

Iteratively appending rows to a DataFrame can be more computationally intensive than a single concatenate. A better solution is to append those rows to a list and then concatenate the list with the original DataFrame all at once.

将行迭代地附加到df比单次连接计算量更大。更好的解决方案是将这些行追加到一个列表中，然后一次性将该列表与原始df连接起来。

这样是低效的：

1
2
3

df = pd.DataFrame(columns=['A'])
for i in range(5):
    df = df.append({'A': i}, ignore_index=True)

下面这样比较高效：

1 2	pd.concat([pd.DataFrame([i], columns=['A']) for i in range(5)], ... ignore_index=True)