TeX - LaTeX Asked on April 7, 2021
I have a large .bib
file which is in a mess. In particular:
(1
is an issue because I’m working through the .bib
file fixing up items one by one and examining the corresponding changes in the book; at the moment I’m having to dart back and forth a lot.
3 is an issue because I need to check the correctness of each entry manually, which can take a while; thus unused entries mean a lot of wasted time.)
Is there a way of taking a .bib
file and cleaning it up? In particular, I’d like to
a) create a .bib
file which is sorted by author, and
b) delete or, better, comment out unused entries.
I would recommend bibexport script. It creates a new .bib
file that includes only the references you cite in the .tex
file, and cleans them up. I use it for submissions to journals, when I want to send only the relevant references rather than my databases.
Correct answer by Boris on April 7, 2021
This is a python code that automatically deal with some of these issues. Please feel free to give feedback or suggestions to enhance or extend the features of this code. Hope it can help.
# perform some tests on ped.bib related to the pdf-directory (pdfdir)
# activate one or more tests by setting those control variables to 1
# 1. is_check_file: check objects with no or empty file entry
# 2. is_check_double_file: check if two or more different objects have the same pdf entry
# 3. is_check_unused_files: check <pdfdir> for unsused files
import os, sys, glob, copy
from os import path, access, R_OK # W_OK for write permission.
from operator import itemgetter
#-------- control parameter ---------
is_check_file = 0
is_check_double_file = 0
is_check_unused_files = 1
#------------------------------------
ROOT = os.getenv("HOME") # Home directory
#ROOT = path.expanduser("~") # works on all platforms
pdfdir = ROOT + '/lit/pdf/'
print 'pdfdir', pdfdir
debug = 0 # shutdown debug/info messages
bib_data = open('ped.bib')
words = ['author', 'title', 'journal', 'year', 'volume', 'comment', 'issue', 'owner', 'file', 'timestamp', 'booktitle', 'editor', 'publisher', 'number', 'part', 'keywords', 'doi', 'month', 'organization', 'url']
#------------------------------------------------------------------------------
def check_missing_pdf(element):
"""
following testes:
1. element has no file entry
2. element has empty file entry
3. element has file entry with inexistant pdf-file
"""
#check missing files
if not element.has_key('file'): # element has no file
print >>sys.stderr, '==== %s has no file entry'%element.get('key')
elif not element.get('file'): # .. or file is empty
print >>sys.stderr, '**** %s has empty file entry'%element.get('key')
else:
# we have a file entry -----> check its existance in the pdf dir
pdffile = pdfdir + element.get('file')
if not( path.exists(pdffile) and path.isfile(pdffile) and access(pdffile, R_OK)):
print >>sys.stderr, '#### %s with file entry [%s]: Either file is missing or is not readable'%(element.get('key'), element.get('file'))
#------------------------------------------------------------------------------
def check_doubles(elements):
"""
check if two or more different objects have the same pdf-file
"""
doubles = {}
for element in elements:
pdffile = element.get("file")
key = element.get("key")
if not doubles.has_key(pdffile):
doubles[pdffile] = [key]
else:
doubles[pdffile].append(key)
for f, k in doubles.iteritems():
if f and len(k) > 1: # if f excludes case f == None
print >>sys.stderr, "Keys:", k, "have the same file <%s>"%f
#------------------------------------------------------------------------------
def check_unused_files(elements):
"""
check dir pdfdir ("lit/pdf") for pdf-files that are not used
in the ped.bib
"""
pdf_files = glob.glob( pdfdir + "*.pdf") # pdfs in lit/pdf/
dummy_files = copy.copy(pdf_files) # list of unused files
for pdf in pdf_files:
for element in elements:
element_pdf = element.get("file")
if element_pdf is None:
continue
element_pdf = path.basename(element_pdf)
if path.basename(pdf) == element_pdf:
dummy_files.remove(pdf)
break # check next files
if dummy_files:
print >>sys.stderr, "%d files are not used:"%len(dummy_files)
for f in dummy_files:
print >>sys.stderr, "---->",path.basename(f)
#------------------------------------------------------------------------------
def putWord(string, dic, line):
"""
extract from <line> the value of the key <string> and put it in <dic>
"""
tmp = line[1].strip(' { } , .').split(':')
# some files are like this :llll:aaaa. So tmp[0] is here == ''
if not tmp[0]:
dic[string] = tmp[1]
else:
dic[string] = tmp[0]
#------------------------------------------------------------------------------
def getElement(f):
"""
get ONE element from file f.
return dict
"""
dic = {}
for line in f:
line = line.strip(' nr')
if not line:
continue
#get <key> and <type>
if line[0] == '@':
sline = line.split('{')
typ = sline[0][1:]
if typ == 'comment': # ignore jabref-meta
continue
dic['type'] = typ.strip(',')
key = sline[1].strip(',')
if debug:
print >> sys.stderr, '--------> type: <%s>'%typ
print >> sys.stderr, '--------> key: <%s>'%key
dic['key'] = key
line = line.split('=')
for word in words:
if line[0].strip(' ') == word:
putWord(word, dic, line)
if debug:
print >> sys.stderr, '--------> %s: <%s>'%(word, line[1].strip(' { },.') )
# check for last line of element
if line[0] == '}':
if debug:
print >> sys.stderr, '---------------------------------'
return dic
#------------------------------------------------------------------------------
#----------------------- get content of file in elements ------------------------------
elements = []
while True:
dic = getElement(bib_data)
if not dic:
sorted(elements, key=itemgetter('key'))
break
elements.append( dic )
#------------------------------------------------------------------------------
if is_check_file:
print "check missing files ..."
for element in elements:
check_missing_pdf(element)
if is_check_double_file:
print "check double files ..."
check_doubles(elements)
if is_check_unused_files:
print "is_check_unused_files ..."
check_unused_files(elements)
Answered by Tengis on April 7, 2021
The python script BibTeX Check helps cleaning up a bib file, though doing it not automatically but only providing a list of problems. It points to duplicate entries, missing fields, and inconsistent entries. When extending the script, it may reorder the list of entries or filter only the necessary ones.
Answered by fbeck on April 7, 2021
A totally hackable python script:
https://bitbucket.org/adamhammouda3/bibtools-py
Right now it only sorts items with respect to the bibtex entry key. My preferred format is First-Author-Lastname. So it would sort things w.r.t author last name, assuming only that your keys are setup as such.
Answered by Adam H on April 7, 2021
I have checked out several other scripts that are designed to clean the bib file. In addition to other tools mentioned here, bibtexformat (try this github version if the link doesn't work)
Generates and adds the labels (cite keys) to a BibTeX library file, abbreviates journal titles, filters out unneeded items, performs string replacements and checks author names for format errors that may lead to incorrect citations.
And BibTool by Gerd Neugebauer can extract bibliography easily from .aux file.
Answered by user38849 on April 7, 2021
Again duplicates can be very easily handled using JabRef's (I am using 2.10) File --> Import into Current Database. It checks for duplicates and all duplicates can be deselected easily by using "Deselect all Duplicates" button.
All duplicate entries would be maked as "D". Also answered here
Answered by Hasnein Tareque on April 7, 2021
You can use checkcites
to get a list of all unused references via your .aux file. To do this simply open your terminal and type:
checkcites --unused document.aux
Answered by Tshilidzi Mudau on April 7, 2021
For biber
users, the bibexport script suggested in @Boris answer does not work.
The good news is that the built-in biber
command
biber --output_format=bibtex --output_resolve <filename>.bcf
does the same thing.
Answered by tom4everitt on April 7, 2021
I only needed to get rid of unused items in .bib file, but many of the packages referenced in this thread were too old and did not work for me or required extensive setup. So I wrote a short command-line PHP script that:
To make it easy to use, it runs with PHP (available and easy to install on any computer). If interested, get the script from github.
Answered by MF.OX on April 7, 2021
A very good way to import the bib file into Mendeley desktop (https://www.mendeley.com/download-desktop/) Then select the imported references (prior to this you can import them in a separate folder) and click "Update details". Mendeley will correct the bib fields with regards to the online database. Then you can go to "Tools" and remove duplicates. Finally, export the records back to bib.
Answered by jankos on April 7, 2021
Here is an online BibTeX tidy tool I wrote to sort entries, merge duplicates, and enforce consistent indentation: flamingtempura.github.io/bibtex-tidy
Also available as a node.js script: github.com/FlamingTempura/bibtex-tidy
Answered by FlamingTempura on April 7, 2021
Junxiao Shi has also created a number of scripts to deduplicate BibTeX entries: https://yoursunny.com/t/2017/bibtex-deduplicate/
Answered by SaMeji on April 7, 2021
Get help from others!
Recent Answers
Recent Questions
© 2024 TransWikia.com. All rights reserved. Sites we Love: PCI Database, UKBizDB, Menu Kuliner, Sharing RPP