python - Setting up SQLite database for cluster analysis
I'm new to databases. I'd like some advice on how to set up and use a SQLite database for cluster analysis and topic modeling tasks.

I have a 2 GB file where each line is a JSON object. Here is an example JSON object from the file:
{"body": "heath ledger's joker...", "subreddit_id": "t5_2qh3s", "name": "t1_clpmhgo", "author": "l3thaln3ss", "created_utc": "1414799999", "subreddit": "movies", "parent_id": "t3_2kwdi3", "score": 1, "link_id": "t3_2kwdi3", "sub_type": "links - high"}
I have created the SQLite database like so:
    import json
    import sqlite3
    import sys

    def main(argv):
        if len(argv) != 2:
            sys.exit("provide database name.")
        dbname = argv[1]
        db = sqlite3.connect(dbname)
        db.execute('''create table if not exists comments
                      (name text primary key, author text, body text,
                       score integer, parent_id text, link_id text,
                       subreddit text, subreddit_id text, sub_type text,
                       created_utc text,
                       foreign key (parent_id) references comment(name));''')
        db.commit()
        db.close()

    if __name__ == "__main__":
        main(sys.argv)
Is this a good initial setup for the database?
I am populating the database like so:
    import json
    import sqlite3
    import sys

    def main(argv):
        if len(argv) != 2:
            sys.exit("provide comment file (of json objects) name.")
        fname = argv[1]
        db = sqlite3.connect("commentdb")
        columns = ['name', 'author', 'body', 'score', 'parent_id', 'link_id',
                   'subreddit', 'subreddit_id', 'sub_type', 'created_utc']
        query = "insert or ignore into comments values (?,?,?,?,?,?,?,?,?,?)"
        c = db.cursor()
        with open(fname, 'r') as infile:
            for comment in infile:
                decodedComment = json.loads(comment)
                keys = ()
                for col in columns:
                    keys += (decodedComment[col],)
                print str(keys)
                print c.execute(query, keys)
        c.close()
        db.commit()
        db.close()

    if __name__ == "__main__":
        main(sys.argv)
Ultimately, I'm going to be clustering subreddits based on the frequent words they share in comments, on which users comment where, and on differences in the topic models obtained by analyzing the words found in subreddit comments. Note that I have many more 2 GB files I'd like to work in, so ideally the solution should be relatively scalable. Any general advice on how to set up (and especially improve what I have written) a database for this sort of work would be appreciated.

Thanks!

EDIT: Removed the question about insert performance.
There are several minor improvements I would suggest -- e.g., your create table for comments has "references comment(name)", but I'm pretty sure comment is a misspelling and you meant comments (so the code as posted wouldn't work).
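In other words, the foreign key clause presumably should read (this is just the posted DDL with that one name fixed):

    foreign key (parent_id) references comments(name)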
Speed-wise, building the peculiarly-named keys as a tuple is somewhat wasteful -- a list would be better, i.e., replace

    keys = ()
    for col in columns:
        keys += (decodedComment[col],)

with

    keys = [decodedComment[col] for col in columns]

for better performance (it's not clearly documented, perhaps, but a cursor's execute method takes a second arg that's a list just as happily as it takes a tuple).
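Relatedly (an aside on my part, since the insert-performance question was edited out of the post): the cursor's executemany method accepts any iterable of parameter sequences, so the whole loop can collapse into one batched call -- a minimal sketch reusing the names (columns, query, c, fname) from the posted script:

    def rows(infile):
        # yield one parameter list per JSON comment line
        for comment in infile:
            decodedComment = json.loads(comment)
            yield [decodedComment[col] for col in columns]

    with open(fname, 'r') as infile:
        c.executemany(query, rows(infile))

This also drops the per-row print calls, which are themselves a noticeable cost over millions of rows.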
But overall you have a good start -- and you should be fine after ingesting a single 2 GB input file. However, SQLite, awesome as it is in many respects, doesn't really scale up to multiples of that size -- you'll need a "real" database for that. I'd recommend PostgreSQL, but MySQL (and variants such as MariaDB) would also be fine, and commercial (non-open-source) offerings are fine too.
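For what it's worth, the switch is small at this stage -- a sketch assuming the psycopg2 driver and an already-created commentdb database (neither of which is in the original post); PostgreSQL uses %s placeholders rather than ?:

    import psycopg2

    db = psycopg2.connect("dbname=commentdb")
    c = db.cursor()
    # note: SQLite's "insert or ignore" has no one-to-one equivalent here;
    # recent PostgreSQL (9.5+) offers "insert ... on conflict do nothing"
    query = "insert into comments values (%s,%s,%s,%s,%s,%s,%s,%s,%s,%s)"
    c.execute(query, keys)  # keys as built in the posted loop
    db.commit()
    db.close()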
if "many more" (2gb files) mean hundreds, or thousands, "serious" professional dbs might @ point start creaking @ seams, depending on processing, exactly, plan throw @ them; mention of "every word in comment" (implying imagine body
field needs processed -- stemming &c -- collection of words) worrisome in promising heavy processing come.
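To make that concrete (my sketch, not from the post -- a naive lowercase regex tokenizer, where real stemming would pull in something like NLTK; db is the sqlite3 connection from your script):

    import collections
    import re

    def words(body):
        # naive tokenizer: lowercase, keep runs of letters/apostrophes
        return re.findall(r"[a-z']+", body.lower())

    counts = collections.Counter()
    for (body,) in db.execute("select body from comments"):
        counts.update(words(body))

Even this trivial pass touches every byte of every body, so the per-row processing, much more than the storage, is what will dominate.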
Once that becomes a problem, "NoSQL" offerings, or stuff seriously meant to scale up, such as e.g. BigQuery, may be worth your while. However, for small-scale experimentation, you can surely start with SQLite and use it to develop the algorithms for the "clustering" you have in mind; then scale up to PostgreSQL or whatever to check how things scale on middle-scale work; only at that point, if need be, take the extra work to consider non-relational solutions, which, while very powerful, tend to require commitment to certain patterns of access (relational DBs, where you just need to add indices, are likely more suitable for such experimental-stage play!).
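For instance, the experimental-stage loop can stay this simple (again my sketch, reusing the hypothetical words helper above) and survive the SQLite-to-PostgreSQL move with only the connect call changing:

    # per-subreddit word frequencies as raw material for clustering
    freqs = {}
    for subreddit, body in db.execute("select subreddit, body from comments"):
        freqs.setdefault(subreddit, collections.Counter()).update(words(body))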