Dropping Duplicates in Pandas

Helen Ren
3 min read · Jan 29, 2022

Introduction

Python has gradually become one of the most popular programming languages worldwide, and its widely used open-source library Pandas is many people's first choice for managing and manipulating data. Pandas is also one of the most versatile libraries, in the sense that for almost any operation you want to perform on a dataset, there are usually several built-in functions that can achieve it, depending on the nature of the data. Dropping duplicates is a good example of this: there are more ways than you might think to drop duplicates from a DataFrame, and their performance can vary, especially with bigger datasets.

The Sample Dataset

Suppose we have a dataset that tells us the construction cost of underground tube stations in different cities.
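A minimal made-up sample in the same shape makes the examples below concrete (all city names, lines, stops and costs here are illustrative):

```python
import pandas as pd

# Made-up construction costs; cost is in millions.
# Repeated (city, line, stop) rows represent cost updates appended over time.
df = pd.DataFrame(
    {
        "city": ["London", "London", "London", "Paris", "Paris"],
        "line": [1, 1, 1, 1, 2],
        "stop": [1, 1, 1, 2, 1],
        "cost": [10.0, 12.0, 15.0, 9.0, 11.0],
    }
).set_index("city")
```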

The line column contains the underground line number for each city, and the values in the stop column are the stop numbers for each line. Cost is in millions. The original construction cost for a stop could be updated more than once by inserting a new row into the existing dataset, and the DataFrame comes with city as the index. Imagine this dataset contained data for all the underground stations in the world: what is the best way to filter it down to the most up-to-date construction cost for each stop?

The first thing that comes to mind is the df.drop_duplicates() function. We might try df.drop_duplicates(subset=['city', 'line', 'stop']), but this will throw a KeyError because city is the index, not a column.
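With the sample frame defined above, the failure looks like this:

```python
try:
    df.drop_duplicates(subset=["city", "line", "stop"])
except KeyError as err:
    # drop_duplicates only looks for subset labels among the columns,
    # so the index name 'city' is reported as missing.
    print("KeyError:", err)
```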

Dropping Duplicates with Indices Only

If we want to drop duplicates using index levels only, we can use the pandas.Index.duplicated() method. We are looking for unique combinations of city, line and stop, so if those were all levels of our index, we could get it to work:
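A sketch of that approach, moving line and stop into the index so that (city, line, stop) forms a MultiIndex (keep="last" assumes the latest inserted row is the most recent cost update):

```python
# Append line and stop to the existing city index.
df_multi = df.set_index(["line", "stop"], append=True)

# Index.duplicated() flags rows whose full index value has been seen before;
# keep="last" retains the most recent update for each (city, line, stop).
latest = df_multi[~df_multi.index.duplicated(keep="last")]
print(latest)
```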

Dropping Duplicates with Indices and Columns

However, the problem we now have is that we want to drop duplicates on a dataset using a combination of index levels and columns. There are two ways we could do this, sketched below.
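Both of the sketches below keep the latest cost per (city, line, stop); the keep="last" and .last() choices are assumptions based on the "most recent row wins" setup above:

```python
# Option 1: turn the index into a column, drop duplicates, restore the index.
via_reset = (
    df.reset_index()
      .drop_duplicates(subset=["city", "line", "stop"], keep="last")
      .set_index("city")
)

# Option 2: group by a mix of the index level and the columns, then take
# the last row in each group. groupby() accepts index level names directly.
via_groupby = (
    df.groupby(["city", "line", "stop"])
      .last()
      .reset_index(level=["line", "stop"])
)
```

Note that groupby() sorts the group keys by default (sort=True), so the two results can come back in different row orders even though they contain the same data.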

Does it matter which one we use? It does if performance is one of the most important factors you have in mind when coding up the solution.

The costly operation in reset_index() is the copying of the original DataFrame (self.copy() is called inside reset_index()).
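A rough way to see this for yourself is to time both approaches on a larger, randomly generated frame in IPython. The sizes, column names and %timeit usage here are illustrative, and the absolute numbers will depend on your machine and pandas version:

```python
import numpy as np
import pandas as pd

# A larger synthetic frame with city as the index, mirroring the sample above.
n = 1_000_000
rng = np.random.default_rng(0)
big = pd.DataFrame(
    {
        "line": rng.integers(1, 10, n),
        "stop": rng.integers(1, 100, n),
        "cost": rng.random(n) * 20,
    },
    index=pd.Index(rng.choice(["London", "Paris", "Berlin"], n), name="city"),
)

# In IPython / Jupyter:
# %timeit big.reset_index().drop_duplicates(subset=["city", "line", "stop"], keep="last")
# %timeit big.groupby(["city", "line", "stop"]).last()
```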

Of course, you might wonder: why don't we just use reset_index(inplace=True) with drop_duplicates()? Using inplace is generally considered bad practice in Pandas, as it can give unpredictable behaviour and lead to the dreaded SettingWithCopyWarning when called on a DataFrame column. (Those interested in digging deeper can read this section of the official Pandas documentation.)

Conclusions

In summary, when facing the problem of removing duplicates in Pandas, drop_duplicates() is our best friend when we only want to consider certain columns, and Index.duplicated() covers the case of indices only, but neither handles a combination of both. If we need to drop duplicates across both indices and columns, we can use groupby(), which has better performance than reset_index() followed by drop_duplicates().

References & Credits

This article was inspired by these Stack Overflow posts: https://stackoverflow.com/questions/13035764/remove-pandas-rows-with-duplicate-indices

https://stackoverflow.com/questions/50249979/why-use-reset-indexdrop-true-when-setting-the-index-is-much-faster

Kumar, A. (2019). Mastering Pandas: A Complete Guide to Pandas, from Installation to Advanced Data Analysis Techniques, 2nd Edition. United Kingdom: Packt Publishing.

McKinney, W. (2017). Python for Data Analysis: Data Wrangling with Pandas, NumPy, and IPython. United States: O’Reilly Media.

Dataset source: the author conceptualised this article while working on removing duplicates from financial asset identifier mappings, which are proprietary data, so she made up the tube construction cost dataset for this article.
