Introduction
Python has gradually become one of the most popular languages worldwide, and its widely used open-source library Pandas is many people's first choice for managing and manipulating data. Pandas is also one of the most versatile libraries in the sense that, for almost any operation you want to perform on a dataset, there are often several built-in functions that can achieve it, depending on the nature of the data. Dropping duplicates is a good example: there may be more ways than you think to drop duplicates from a DataFrame, and their performance can vary, especially with bigger datasets.
The Sample Dataset
Suppose we have a dataset that tells us the construction cost of underground tube stations in different cities.
The `line` column contains the underground line number for each city, and the values in the `stop` column are the stop numbers for each line. Cost is in millions. The original construction cost for a stop could be updated more than once by inserting a row into the existing dataset, and the DataFrame comes with `city` as the index. Imagine this dataset contains data for all the underground stations in the world: what is the best way to filter down to the most up-to-date construction cost?
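The article's actual table is not shown here, so the following is a minimal sketch of what such a dataset might look like; the cities and cost figures are made up for illustration.

```python
import pandas as pd

# Hypothetical sample data: the last row is a later update of the
# cost for London, line 1, stop 1 (cost is in millions)
df = pd.DataFrame(
    {
        "city": ["London", "London", "Paris", "Paris", "London"],
        "line": [1, 1, 1, 2, 1],
        "stop": [1, 2, 1, 1, 1],
        "cost": [100, 120, 90, 110, 105],
    }
).set_index("city")

print(df)
```

Note that (London, line 1, stop 1) appears twice; the goal is to keep only the most recently inserted row for each (city, line, stop) combination.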
The first thing that comes to mind is the `df.drop_duplicates()` function. We might think to do `df.drop_duplicates(subset=['city', 'line', 'stop'])`, but this will throw an error because `city` is the index, not a column.
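To see the failure concretely, here is a small sketch (with made-up data): in recent Pandas versions, passing an index name in `subset` raises a `KeyError`, because `drop_duplicates` only looks at columns.

```python
import pandas as pd

df = pd.DataFrame(
    {"city": ["London", "London"], "line": [1, 1], "stop": [1, 1], "cost": [100, 105]}
).set_index("city")

# `city` is the index, not a column, so drop_duplicates cannot find it
try:
    df.drop_duplicates(subset=["city", "line", "stop"])
except KeyError as exc:
    print("KeyError:", exc)
```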
Dropping Duplicates with Indices Only
If we want to drop duplicates on index columns only, we could use the `pandas.Index.duplicated` method. As we are looking to get unique rows of `city`, `line` and `stop` values, if those were all set as the index of our dataset, we could get it to work:
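A minimal sketch of this approach, assuming the sample data above with `city`, `line` and `stop` moved into a MultiIndex:

```python
import pandas as pd

df = pd.DataFrame(
    {
        "city": ["London", "London", "Paris"],
        "line": [1, 1, 1],
        "stop": [1, 1, 1],
        "cost": [100, 105, 90],  # illustrative values; second row updates the first
    }
).set_index(["city", "line", "stop"])

# index.duplicated flags repeated index entries; keep="last" keeps
# the most recently inserted (most up-to-date) row per combination
deduped = df[~df.index.duplicated(keep="last")]
print(deduped)
```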
Dropping Duplicates with Indices and Columns
However, the problem we now face is that we want to drop duplicates using a combination of index and columns. There are two ways we could do this.
Does it matter which one we use? It does if performance is one of the most important factors you have in mind when coding up the solution.
The costly operation in `reset_index()` is the copying of the original DataFrame (`self.copy()` is called inside `reset_index()`).
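If you want to check the difference on your own machine, here is a rough timing sketch on a larger synthetic dataset (the data and sizes are made up; actual numbers will depend on your Pandas version and hardware):

```python
import timeit

import numpy as np
import pandas as pd

# Build a larger hypothetical dataset so the difference is measurable
n = 100_000
rng = np.random.default_rng(0)
df = pd.DataFrame(
    {
        "city": rng.choice(["London", "Paris", "Berlin", "Tokyo"], n),
        "line": rng.integers(1, 10, n),
        "stop": rng.integers(1, 50, n),
        "cost": rng.integers(50, 200, n),
    }
).set_index("city")

via_reset = lambda: (
    df.reset_index()
    .drop_duplicates(subset=["city", "line", "stop"], keep="last")
    .set_index("city")
)
via_groupby = lambda: df.groupby(["city", "line", "stop"]).last()

print("reset_index + drop_duplicates:", timeit.timeit(via_reset, number=10))
print("groupby().last():", timeit.timeit(via_groupby, number=10))
```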
Of course, you might wonder why we don't just use `reset_index(inplace=True)` with `drop_duplicates()`. Using `inplace` is generally considered bad practice in Pandas, as it can give unpredictable behaviour and lead to the dreaded `SettingWithCopyWarning` when called on a DataFrame column. (Those interested in digging deeper can read this section of the official Pandas documentation.)
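As a small sketch of the preferred style, reassigning the result of a method chain keeps the operations explicit and avoids `inplace` altogether (sample data made up):

```python
import pandas as pd

df = pd.DataFrame(
    {"city": ["London", "London"], "line": [1, 1], "stop": [1, 1], "cost": [100, 105]}
).set_index("city")

# Reassign rather than mutate in place: each step returns a new DataFrame
df = (
    df.reset_index()
    .drop_duplicates(subset=["city", "line", "stop"], keep="last")
    .set_index("city")
)
print(df)
```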
Conclusions
In summary, when facing the problem of removing duplicates in Pandas, `drop_duplicates()` is our best friend when we only want to consider certain columns, or only the indices, but not a combination of both. If we need to drop duplicates across both indices and columns, we could use `groupby()`, which has better performance than `reset_index()` followed by `drop_duplicates()`.
References & Credits
This article was inspired by these Stack Overflow posts: https://stackoverflow.com/questions/13035764/remove-pandas-rows-with-duplicate-indices
Kumar, A. (2019). Mastering Pandas: A Complete Guide to Pandas, from Installation to Advanced Data Analysis Techniques, 2nd Edition. United Kingdom: Packt Publishing.
McKinney, W. (2017). Python for Data Analysis: Data Wrangling with Pandas, NumPy, and IPython. United States: O’Reilly Media.
Dataset source: the author conceived this article while working on removing duplicates from financial asset identifier mappings, which are proprietary data, so she made up the tube construction cost dataset for this article.