Stack Overflow Asked by Tyler P on November 15, 2021
I’ve tried to find the correct code and have written multiple attempts at this myself but am hoping that someone here can help.
I have a small file and a large file:
Small: 90,000 lines
Large: 1,200,000 lines
I’m trying to print any line from the large file to a result file if that line from the large file contains any line from the small file. I should also convert to lower case, since case should not matter.
Example:
Large file line: '{"data": "SN@StackOverflow"}'
Small file line (ANY): '@stackoVerfloW'
-> Prints large line.
This should yield a print of the large line to the result file. Note that the small file's line would be a subset of the large file's line. That is the goal: I am not looking for exactly matching lines, but rather for large-file lines that contain data from the small file, and I want to save those large-file lines as results.
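To make that concrete, here is a minimal sketch of the single-pair check I have in mind (variable names are just for illustration; it is the case-insensitive substring test from the example above):

large_line = '{"data": "SN@StackOverflow"}'
small_line = '@stackoVerfloW'
# case-insensitive "contains" check: True here, so large_line should be written to results
print(small_line.lower() in large_line.lower())  # True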
I’ve tried to do this with a nested for loop, however it is not printing any of the results. I also loaded the files in as lists, hoping to reduce time since the lines will be stored in memory. It is also not possible for me to sort these lines, as the lines are not uniform.
results = open("results.txt", "w")

small = []
large = []

for line in open('small.txt'):
    small.append(line)

for line in open('large.txt'):
    large.append(line)

for j in large:
    for i in small:
        if i in j:
            results.write(j + '\n')
Please let me know if there is anything I can do to clarify my issue. Sorry, this is my first post, and I hope to write better questions in the future.
The problem is that, as you put it, the naive approach of matching each of the 90,000 lines against each of the 1,200,000 lines in the other file, performing a 'contains' (Python in) check on every large-file line, leads to a prohibitive processing cost. You are basically talking about M x N x O operations, with M the number of lines in the large file, N the number of lines in the small file, and O the average length of the lines in the larger file (minus the average length of the lines in the small file). That works out to roughly a trillion operations to start with (1,200,000 x 90,000 is about 10^11 substring checks, each scanning tens of characters). With computers operating in the GHz range it could be feasible in a few hours, provided the small file fits in memory.
A smarter alternative uses location-independent hashes for each line in the large file. Location-independent hashes can be complicated, but a handful of hashes covering all possible word subsets of the wider line can be matched against a dictionary containing all 90,000 lines of the smaller file in constant time, O(1). Doing that once for each of the 1,200,000 lines runs in linear time, reducing this search from hours or days to a few seconds, using just string normalization and Python dictionaries.
In the end, this should be all the needed code:
import re

def normalize(text):
    text = text.strip().lower()
    # strip punctuation, keeping only word characters, digits and spaces
    text = re.sub(r'[^\w \d]', ' ', text)
    words = tuple(text.split())
    return words

def normalized_subsets(text):
    words = normalize(text)
    combinations = set()
    for i in range(len(words)):
        for j in range(i + 1, len(words) + 1):
            combinations.add(words[i: j])
    return combinations

def main():
    small = {normalize(line): linenum for
             linenum, line in enumerate(open("small.txt")) if line.strip()}
    with open("large.txt") as large, open("results.txt", "w") as results:
        for line_num, line in enumerate(large):
            for combination in normalized_subsets(line):
                if combination in small:
                    results.write(f"{line_num} {small[combination]} {line}")

main()
So, while you still see nested for loops in this code, the nested version only loops through the possible word subsets of each line. For a line with 30 words there are fewer than 500 such combinations (30 x 31 / 2 = 465 contiguous subsets), so we make at most about 465 dictionary lookups to match any of these word subgroups against the 90,000-entry dictionary built from the smaller file, as opposed to 90,000 comparisons.
So it is still a quadratic algorithm in the end, but it should be sharply faster. For the sample line in the question, after removing the punctuation, it will try a match for every element in:
{('data',),
('data', 'sn'),
('data', 'sn', 'stackoverflow'),
('sn',),
('sn', 'stackoverflow'),
('stackoverflow',)}
which is just 6 comparisons (down from a linear search through 90,000 lines).
(For extra value, this code also records the line numbers from the large file and from the smaller file at the beginning of each matched line in the results file.)
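As a quick sanity check, here is a minimal sketch (assuming the normalize and normalized_subsets functions defined above are already in scope) that reproduces the six-combination example for the sample line:

subsets = normalized_subsets('{"data": "SN@StackOverflow"}')
print(len(subsets))                         # 6
print(('sn', 'stackoverflow') in subsets)   # True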
Answered by jsbueno on November 15, 2021
You can also simplify the "file to list-of-lines" step by doing:
small = open('small.txt', 'r').readlines()
large = open('large.txt', 'r').readlines()
then iterating as follows:
with open("results.txt", "w") as results:
for j in large:
for i in small:
if i.lower() in j.lower():
results.write(j)
Good luck
Answered by blondelg on November 15, 2021
There is no need to read all of the large file into memory at the same time. You would also almost certainly want to strip off the newline characters from the lines of the small file before doing your in test (you can use strip for this, to remove any leading and trailing whitespace).
Looking at your example strings, it also seems that you need a case-insensitive comparison, hence the use of lower() to convert both to lower case before comparing them.
You would presumably also want to write each output line only once, even if there are multiple lines in the small file that match it, hence the break. Note also that you don't need to write an additional newline if you have not stripped it from the input line from the large file.
Putting these together would give something like this.
small = []
with open('small.txt') as f:
    for line in f:
        small.append(line.strip().lower())

with open('results.txt', 'w') as fout:
    with open('large.txt') as fin:
        for line in fin:
            for i in small:
                if i in line.lower():
                    fout.write(line)
                    break  # breaks from inner loop only (for i in small)
Answered by alani on November 15, 2021
If your sample data is indicative of the actual data you're using, it could be as simple as comparing them lower-case:
# ... your I/O code
for j in large:
    for i in small:
        if i.lower() in j.lower():
            results.write(j + '\n')
Note the .lower() calls, which are the only modification I've made to your code.
If this still doesn't work, please post a few more lines from each file to help us assess.
Answered by Daniel Skovli on November 15, 2021