Cleaning up a .bib file

Question

I have a large .bib file which is in a mess. In particular:

Entries are not in any particular order.
Some entries may be duplicates (albeit with different keys).
Some entries may not be used in the relevant book.

(1
 is an issue because I'm working through the .bib file fixing up items one by one and examining the corresponding changes in the book; at the moment I'm having to dart back and forth a lot.

3 is an issue because I need to check the correctness of each entry manually, which can take a while; thus unused entries mean a lot of wasted time.)

Is there a way of taking a .bib file and cleaning it up?  In particular, I'd like to

a) create a .bib file which is sorted by author, and

b) delete or, better, comment out unused entries.

Boris · Accepted Answer

I would recommend bibexport script. It creates a new .bib file that includes only the references you cite in the .tex file, and cleans them up. I use it for submissions to journals, when I want to send only the relevant references rather than my databases.

Also JabRef has some duplicate search and resolve capabilities which can be accessed via the menu items shown (for v2.8) below:

enter image description here

Also JabRef has some duplicate search and resolve capabilities which can be accessed via the menu items shown (for v2.8) below:

Tengis · Answer

This is a python code that automatically deal with some of these issues. Please feel free to give feedback or suggestions to enhance or extend the features of this code. Hope it can help.

# perform some tests on ped.bib related to the pdf-directory (pdfdir)
# activate one or more tests by setting those control variables to 1

# 1. is_check_file: check objects with no or empty file entry
# 2. is_check_double_file: check if two or more different objects have the same pdf entry
# 3. is_check_unused_files: check <pdfdir> for unsused files

import os, sys, glob, copy
from os import path, access, R_OK  # W_OK for write permission.
from operator import itemgetter

#-------- control parameter ---------
is_check_file = 0
is_check_double_file = 0
is_check_unused_files = 1
#------------------------------------   

ROOT = os.getenv("HOME") # Home directory
#ROOT = path.expanduser("~") # works on all platforms  

pdfdir = ROOT + '/lit/pdf/'
print 'pdfdir', pdfdir
debug = 0 # shutdown debug/info messages

bib_data = open('ped.bib')

words = ['author', 'title', 'journal', 'year', 'volume', 'comment', 'issue', 'owner', 'file', 'timestamp', 'booktitle', 'editor', 'publisher', 'number', 'part', 'keywords', 'doi', 'month', 'organization', 'url']
#------------------------------------------------------------------------------

def check_missing_pdf(element):
    """
    following testes:
    1. element has no file entry
    2. element has empty file entry
    3. element has file entry with inexistant pdf-file
    """
    #check missing files
    if not element.has_key('file'): # element has no file
        print >>sys.stderr, '==== %s has no file entry'%element.get('key')
    elif not element.get('file'):  # .. or file is empty
            print >>sys.stderr, '**** %s has empty file entry'%element.get('key')
    else:
        # we have a file entry -----> check its existance in the pdf dir
        pdffile = pdfdir + element.get('file')
        if not( path.exists(pdffile) and path.isfile(pdffile) and access(pdffile, R_OK)):
            print >>sys.stderr, '#### %s with file entry [%s]: Either file is missing or is not readable'%(element.get('key'), element.get('file'))

#------------------------------------------------------------------------------

def check_doubles(elements):
    """
    check if two or more different objects have the same pdf-file 
    """
    doubles = {}
    for element in elements:
        pdffile = element.get("file")
        key = element.get("key")
        if not doubles.has_key(pdffile):
            doubles[pdffile] = [key]
        else:
            doubles[pdffile].append(key)
    for f, k in doubles.iteritems():
        if f and len(k) > 1: # if f excludes case f == None
            print >>sys.stderr, "Keys:", k, "have the same file <%s>"%f
#------------------------------------------------------------------------------

def check_unused_files(elements):
    """
    check dir pdfdir ("lit/pdf") for pdf-files that are not used
    in the ped.bib
    """
    pdf_files = glob.glob( pdfdir + "*.pdf") # pdfs in lit/pdf/
    dummy_files = copy.copy(pdf_files) # list of unused files
    for pdf in pdf_files:
        for element in elements:
            element_pdf =  element.get("file")
            if element_pdf is None:
                continue
            element_pdf = path.basename(element_pdf)

            if path.basename(pdf) == element_pdf:
                dummy_files.remove(pdf)
                break # check next files

    if dummy_files:
        print >>sys.stderr, "%d files are not used:"%len(dummy_files)
        for f in dummy_files:
            print >>sys.stderr, "---->",path.basename(f)
#------------------------------------------------------------------------------

def putWord(string, dic, line):
    """
    extract from <line> the value of the key <string> and put it in  <dic>
    """
    tmp = line[1].strip(' { } , .').split(':')
    # some files are like this :llll:aaaa. So tmp[0] is here == ''
    if not tmp[0]: 
        dic[string] = tmp[1]
    else:
        dic[string] = tmp[0]
#------------------------------------------------------------------------------

def getElement(f):
    """
    get ONE element from file f.
    return dict
    """
    dic = {}
    for line in f:
        line = line.strip(' nr')
        if not line:
            continue
        #get <key> and <type>
        if line[0] == '@':
            sline = line.split('{')
            typ = sline[0][1:]
            if typ == 'comment': # ignore jabref-meta
                continue
            dic['type'] = typ.strip(',') 
            key = sline[1].strip(',')
            if debug:
                print >> sys.stderr, '--------> type: <%s>'%typ
                print >> sys.stderr, '--------> key: <%s>'%key
            dic['key'] = key

        line = line.split('=')

        for word in words:
            if line[0].strip(' ') == word:
                putWord(word, dic, line)
                if debug:
                    print  >> sys.stderr, '--------> %s: <%s>'%(word, line[1].strip(' { },.') ) 

        # check for last line of element
        if line[0] == '}':
            if debug:
                print >> sys.stderr, '---------------------------------'
            return dic
#------------------------------------------------------------------------------


#----------------------- get content of file in elements ------------------------------
elements = []
while True:
    dic = getElement(bib_data) 
    if not dic:
        sorted(elements, key=itemgetter('key')) 
        break
    elements.append( dic )
#------------------------------------------------------------------------------

if is_check_file:
    print "check missing files ..."
    for element in elements:
        check_missing_pdf(element)

if is_check_double_file:
    print "check double files ..."
    check_doubles(elements)

if is_check_unused_files:
    print "is_check_unused_files ..."
    check_unused_files(elements)

Answered by Tengis on April 7, 2021

fbeck · Answer

The python script BibTeX Check helps cleaning up a bib file, though doing it not automatically but only providing a list of problems. It points to duplicate entries, missing fields, and inconsistent entries. When extending the script, it may reorder the list of entries or filter only the necessary ones.

Answered by fbeck on April 7, 2021

Adam H · Answer

A totally hackable python script:

https://bitbucket.org/adamhammouda3/bibtools-py

Right now it only sorts items with respect to the bibtex entry key.  My preferred format is
First-Author-Lastname. So it would sort things w.r.t author last name, assuming only that your keys are setup as such.

user38849 · Answer

I have checked out several other scripts that are designed to clean the bib file. In addition to other tools mentioned here, bibtexformat (try this github version if the link doesn't work)

Generates and adds the labels (cite keys) to a BibTeX library file, abbreviates journal
  titles, filters out unneeded items, performs string replacements and checks author names
  for format errors that may lead to incorrect citations.

And BibTool by Gerd Neugebauer can extract bibliography easily from .aux file.

Hasnein Tareque · Answer

Again duplicates can be very easily handled using JabRef's (I am using 2.10) File --> Import into Current Database. It checks for duplicates and all duplicates can be deselected easily by using "Deselect all Duplicates" button.

JabRef Import window

All duplicate entries would be maked as "D". Also answered here

Tshilidzi Mudau · Answer

You can use checkcites to get a list of all unused references via your .aux file. To do this simply open your terminal and type:

checkcites --unused document.aux

tom4everitt · Answer

For biber users, the bibexport script suggested in @Boris answer does not work.

The good news is that the built-in biber command

biber --output_format=bibtex --output_resolve <filename>.bcf

does the same thing.

MF.OX · Answer

I only needed to get rid of unused items in .bib file, but many of the packages referenced in this thread were too old and did not work for me or required extensive setup. So I wrote a short command-line PHP script that:

generates a new .bib file keeping only references that are actually used in your  .tex file
optionally removes some unnecessary fields from your reference file (e.g., URL, URLDATE etc)

To make it easy to use, it runs with PHP (available and easy to install on any computer). If interested, get the script from github.

jankos · Answer

A very good way to import the bib file into Mendeley desktop (https://www.mendeley.com/download-desktop/)
Then select the imported references (prior to this you can import them in a separate folder) and click "Update details". Mendeley will correct the bib fields with regards to the online database. Then you can go to "Tools" and remove duplicates. Finally, export the records back to bib.

FlamingTempura · Answer

Here is an online BibTeX tidy tool I wrote to sort entries, merge duplicates, and enforce consistent indentation: flamingtempura.github.io/bibtex-tidy

Also available as a node.js script: github.com/FlamingTempura/bibtex-tidy

SaMeji · Answer

Junxiao Shi has also created a number of scripts to deduplicate BibTeX entries: https://yoursunny.com/t/2017/bibtex-deduplicate/

Cleaning up a .bib file

12 Answers

Add your own answers!

Ask a Question