How to: Find duplicate documents across multiple SharePoint farms

One of important aspects of a well-governed SharePoint deployment is visibility into what data is stored in SharePoint and how it is being used. This is equally important to the business and and to the IT. One of the tools that can help improving SharePoint transparency is Quest Site Administrator for SharePoint. Here's one of the examples how.


A question recently came up in a customer conversation, whether it is possible to identify duplicate files across the three SharePoint farms they have. To make it more complicated, two of these are running SharePoint 2007 and one is SharePoint Server 2010. Can this be done? Yes, and here's a quick walkthrough:


Step 1 - run the report.

In Site Administrator Information Portal, run the Enterprise Overview. This is a default report that opens when an administrator opens Information Portal, and it combines data from across entire environment - including multiple farms and different SharePoint versions. In the report, scroll down to the Top 5 largest documents widget and click View more.

Step 2 - export the data.

The report you get will look something like this. Note the report does not show all documents, so go ahead to create and download the CSV. Note that depending on the size of the environment, it can take considerable time to create the CSV file.

Step 3 - analyze the data.

The report you get will include information on all documents across all farms, their name and URL, created and modified dates, last access time, as well as the size and version infromation. Again, this can be a lot of data, so for larger environments you might want to use something like Microsoft Access or other database tools to analyze it. Note that you can also run the same report on a smaller scope, such as a farm, web applicated or particular site collection or site.


In a simpler lab enviornment that I have, I just import the raw data into Excel and create a simple pivot table. My assumption is that the file is probably a duplicate if the file name, last version size and modification dates are the same. So I grouped files by their name, then current version size and finally - by modification date. Now I can see how many copies of each file exist and get the file URLs to understand where these duplicates are located.

This is it. You can now see documents that are likely to be duplicates, as well as the location of each copy across several SharePoint farms. You can also filter the report to show documents larger than certain site or that have more than certain number of versions, to narrow down the scope of analysis. There're other interesting things you can do when analyzing the data, for example combining this pivot table with the usage information on each copy. All of this depends on what is the scope and the ultimate goal of your research, and Site Administrator gives you the tool to get the data you are looking for.



Product Manager @QuestSharePoint