Database Administrators Asked by jkozma on December 22, 2020
I created a single table SQLite db from a csv file using python. I was able to open, browse and query the database using DB Browser for SQLite (browser) immediately after creating it. However, when I try to open the same database a week or two later, having done nothing with it in the intervening time, it causes the browser to hang. In particular, it does appear to open correctly and show the table structure, but hangs when I try to browse or run a query. I suspect the problem is that the files are highly fragmented when they are first created. Later they are defragmented by the os, and that’s how they break. I’ve used sysinternals contig.exe to see how many fragments each file has. The old “corrupt” version has 2 fragments, but the newly created version has 18.
Could defragmentation be the problem?
Acting on Mark Stewart’s answer, I tried connecting to both the recently created db and the one created several weeks ago from the command line. Both exhibited the same behavior as when I opened them in the browser. Here is the session from the command line:
D:CoursespmMachLearnsqlAlchemy>sqlite3 csv2db_loop.db SQLite version 3.24.0 2018-06-04 19:24:41 Enter ".help" for usage hints. sqlite> .tables complaints sqlite> select count(*) from complaints; 1036460 sqlite> .exitD:CoursespmMachLearnsqlAlchemy>sqlite3 csv2db_loop_BAD2.db SQLite version 3.24.0 2018-06-04 19:24:41 Enter ".help" for usage hints. sqlite> .tables complaints sqlite> select count(*) from complaints;
At this point, sqlite3 hung, and I waited several minutes before finally exiting with Ctrl-C. I then used fc to compare the two db files, and tried connecting again to the old file. The continuation of the session above, after Ctrl-C:
D:CoursespmMachLearnsqlAlchemy>fc csv2db_loop.db csv2db_loop_BAD2.db Comparing files csv2db_loop.db and CSV2DB_LOOP_BAD2.DB FC: no differences encountered D:CoursespmMachLearnsqlAlchemy>sqlite3 csv2db_loop_BAD2.db SQLite version 3.24.0 2018-06-04 19:24:41 Enter ".help" for usage hints. sqlite> .tables complaints sqlite> select count(*) from complaints; 1036460 sqlite>
Whatever was causing sqlite3 to hang was apparently cured by performing the file comparison. The old file also works correctly now in the browser.
Before trying the file comparison, I had made a copy of the working file, as an experiment to see if a less fragmented copy would exhibit the same behavior as the old versions. It did have fewer fragments– just 1 versus 18 for the file I copied from. However, it did not hang like the older versions. That does seem to indicate that defragmenting is not the cause of the problem. I thought it might have something to do with the file not being in memory. However, I also tried copying an old non-working file, and neither the old file nor the copy worked. After I compared them with each other, though, they both worked.
To summarize the results of my experiments:
Regarding LC’s answer, You were right, at least about the query just not being finished. I had let it run for several minutes before aborting with ctrl-C, but based on your assessment, I connected to another “stale” file and just let it run while I went and took a nap. I opened task manager while it was running and saw that it was using the cpu and it was doing a lot of disk transfers. I don’t know how long it ran, but it had completed when I came back from a nap and watering some plants– less than 2 hours, but more than 10 minutes. I did notice that it was in sorted order by the primary key when it opened in the browser, rather than in the order in which the records were added. I guess the delay may be due to the necessity of sorting 1,000,000+ records, but that still seems a little suspicious given the machine’s 2+ GHz clock rate and assuming something close to n log(n) sorting complexity. However, while the query was running, its cpu usage was only about 1% max, while disk usage was a pretty constant 0.7 MB/s. I also don’t yet understand how the delay manifests in a file copied from a “stale” file, but not in a file copied from one recently created.
I’m pretty sure there’s no fragmentation within the databases themselves. Each was created with a single table, and no records were ever updated or deleted. My whole point in creating the db in the first place was to compare two different methods in Python for putting a csv data file into a database table. As such, it’s somewhat disconcerting to find that a single query with an aggregate function can take orders of magnitude more time to execute depending on the memory cache. I’m doing another experiment, to see if creating an index will prevent the excessive delay for a stale file.
None of these files is corrupt. The query appears to hang because it just has not yet finished.
The problem might be not only that the database file is fragmented, but also that the pages in the database are fragmented (this typically happens after many deletions/insertions). When SQLite tries to read all rows in order, it reads the pages in a random order, which prevents the OS from prefetching data from disk.
To defragment the database, run VACUUM.
Anyway, the most likely reason is that after two weeks, the database file is no longer in the file cache. After running fc
, the query runs faster because the entire file already is in the OS file cache.
Answered by CL. on December 22, 2020
Get help from others!
Recent Answers
Recent Questions
© 2024 TransWikia.com. All rights reserved. Sites we Love: PCI Database, UKBizDB, Menu Kuliner, Sharing RPP