python - Setting up SQLite database for cluster analysis
I'm new to databases. I'd like some advice on how to set up and use a SQLite database for cluster analysis and topic modeling tasks.

I have a 2 GB file where each line is a JSON object. Here is an example JSON object from the file:
{"body": "heath ledger's joker...", "subreddit_id": "t5_2qh3s", "name": "t1_clpmhgo", "author": "l3thaln3ss", "created_utc": "1414799999", "subreddit": "movies", "parent_id": "t3_2kwdi3", "score": 1, "link_id": "t3_2kwdi3", "sub_type": "links - high"}
I have created the SQLite database like so:
    import json
    import sqlite3
    import sys

    def main(argv):
        if len(argv) != 2:
            sys.exit("provide database name.")
        dbname = argv[1]
        db = sqlite3.connect(dbname)
        db.execute('''create table if not exists comments
                      (name text primary key, author text, body text,
                       score integer, parent_id text, link_id text,
                       subreddit text, subreddit_id text, sub_type text,
                       created_utc text,
                       foreign key (parent_id) references comment(name));''')
        db.commit()
        db.close()

    if __name__ == "__main__":
        main(sys.argv)
Is this a good initial setup for the database?
I am populating the database like so:
    import json
    import sqlite3
    import sys

    def main(argv):
        if len(argv) != 2:
            sys.exit("provide comment file (of json objects) name.")
        fname = argv[1]
        db = sqlite3.connect("commentdb")
        columns = ['name', 'author', 'body', 'score', 'parent_id', 'link_id',
                   'subreddit', 'subreddit_id', 'sub_type', 'created_utc']
        query = "insert or ignore into comments values (?,?,?,?,?,?,?,?,?,?)"
        c = db.cursor()
        with open(fname, 'r') as infile:
            for comment in infile:
                decodedComment = json.loads(comment)
                keys = ()
                for col in columns:
                    keys += (decodedComment[col],)
                print str(keys)
                print c.execute(query, keys)
        c.close()
        db.commit()
        db.close()

    if __name__ == "__main__":
        main(sys.argv)
Ultimately, I'm going to be clustering subreddits based on the frequent words they share in comments, on which users comment where, and on differences in the topic models obtained by analyzing the words found in subreddit comments. Note that I have many more 2 GB files I'd like to work in, so ideally the solution should be relatively scalable. Any general advice on how to set up (and especially improve what I have written) a database for this sort of work would be appreciated.

Thanks!

EDIT: Removed the question about insert performance.
There are several minor improvements I would suggest -- e.g., your create table for comments has "references comment(name)", but I'm pretty sure comment is a misspelling and you meant comments (so the code as posted wouldn't work).
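In other words, the foreign key clause presumably should read (this is just the posted DDL with that one name fixed):

    foreign key (parent_id) references comments(name)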
Speed-wise, building the peculiarly-named keys as a tuple is somewhat wasteful -- a list would be better, i.e., replace

    keys = ()
    for col in columns:
        keys += (decodedComment[col],)

with

    keys = [decodedComment[col] for col in columns]

for better performance (it's not clearly documented, perhaps, but a cursor's execute method takes a second arg that's a list just as happily as it takes a tuple).
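Relatedly (an aside on my part, since the insert-performance question was edited out of the post): the cursor's executemany method accepts any iterable of parameter sequences, so the whole loop can collapse into one batched call -- a minimal sketch reusing the names (columns, query, c, fname) from the posted script:

    def rows(infile):
        # yield one parameter list per JSON comment line
        for comment in infile:
            decodedComment = json.loads(comment)
            yield [decodedComment[col] for col in columns]

    with open(fname, 'r') as infile:
        c.executemany(query, rows(infile))

This also drops the per-row print calls, which are themselves a noticeable cost over millions of rows.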
But overall you have a good start -- and you should be fine after ingesting a single 2 GB input file. However, SQLite, awesome as it is in many respects, doesn't really scale up to multiples of that size -- you'll need a "real" database for that. I'd recommend PostgreSQL, but MySQL (and variants such as MariaDB) would also be fine, and commercial (non-open-source) offerings are fine too.
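For what it's worth, the switch is small at this stage -- a sketch assuming the psycopg2 driver and an already-created commentdb database (neither of which is in the original post); PostgreSQL uses %s placeholders rather than ?:

    import psycopg2

    db = psycopg2.connect("dbname=commentdb")
    c = db.cursor()
    # note: SQLite's "insert or ignore" has no one-to-one equivalent here;
    # recent PostgreSQL (9.5+) offers "insert ... on conflict do nothing"
    query = "insert into comments values (%s,%s,%s,%s,%s,%s,%s,%s,%s,%s)"
    c.execute(query, keys)  # keys as built in the posted loop
    db.commit()
    db.close()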
if "many more" (2gb files) mean hundreds, or thousands, "serious" professional dbs might @ point start creaking @ seams, depending on processing, exactly, plan throw @ them; mention of "every word in comment" (implying imagine body
field needs processed -- stemming &c -- collection of words) worrisome in promising heavy processing come.
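To make that concrete (my sketch, not from the post -- a naive lowercase regex tokenizer, where real stemming would pull in something like NLTK; db is the sqlite3 connection from your script):

    import collections
    import re

    def words(body):
        # naive tokenizer: lowercase, keep runs of letters/apostrophes
        return re.findall(r"[a-z']+", body.lower())

    counts = collections.Counter()
    for (body,) in db.execute("select body from comments"):
        counts.update(words(body))

Even this trivial pass touches every byte of every body, so the per-row processing, much more than the storage, is what will dominate.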
Once that becomes a problem, "NoSQL" offerings, or stuff seriously meant to scale up, such as e.g. BigQuery, may be worth your while. However, for small-scale experimentation, you can surely start with SQLite and use it to develop the algorithms for the "clustering" you have in mind; then scale up to PostgreSQL or whatever to check how things scale on middle-scale work; only at that point, if need be, take the extra work to consider non-relational solutions, which, while very powerful, tend to require commitment to certain patterns of access (relational DBs, where you just need to add indices, are likely more suitable for such experimental-stage play!).
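For instance, the experimental-stage loop can stay this simple (again my sketch, reusing the hypothetical words helper above) and survive the SQLite-to-PostgreSQL move with only the connect call changing:

    # per-subreddit word frequencies as raw material for clustering
    freqs = {}
    for subreddit, body in db.execute("select subreddit, body from comments"):
        freqs.setdefault(subreddit, collections.Counter()).update(words(body))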