java - HBase Table Design for maintaining hourly visitors count per source
I am working on a project where I have to report hourly unique visitors per source, i.e. I have to calculate the unique visitors for each source for each hour. Visitors are identified by a unique id. How should I design the table so that the calculation of hourly unique visitors is efficient, considering that the data is on the order of 20k entries per 8 hours?
At present I am using sourceid+visitorid as the row key.
Let's start by saying that 2.5k entries per hour is a pretty low volume of data (not even 1/second). Unless you want to scale massively, your project is easily achievable with a single SQL server.
Anyway, you have 2 options:
1. Non-realtime
Log every visitorid+source and run a job (like MapReduce) to analyze the data every hour, or every day, depending on your needs. In this case you can avoid HBase and stick to Hadoop: log the data to a different file each hour, process it afterwards, and store the results in SQL (or in HBase if you wish). Performance-wise this would be the best approach.
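To give an idea of what such a job has to compute, here is a minimal sketch in plain Java (no Hadoop involved). The class name, the `Visit` type and the `yyyyMMddHH|sourceid` bucket format are just assumptions for illustration; a real MapReduce job would do the same grouping in its map and reduce phases.

```java
import java.time.Instant;
import java.time.ZoneOffset;
import java.time.format.DateTimeFormatter;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

// Hypothetical batch job: counts distinct visitors per (hour, source) from logged visits.
class HourlyUniquesBatch {

    // One logged entry: who visited, from which source, and when (epoch millis).
    static final class Visit {
        final String visitorId;
        final String sourceId;
        final long timestampMillis;

        Visit(String visitorId, String sourceId, long timestampMillis) {
            this.visitorId = visitorId;
            this.sourceId = sourceId;
            this.timestampMillis = timestampMillis;
        }
    }

    // Assumed bucket format: UTC date plus hour, e.g. 1970010100.
    private static final DateTimeFormatter HOUR_FORMAT =
            DateTimeFormatter.ofPattern("yyyyMMddHH").withZone(ZoneOffset.UTC);

    // Returns "yyyyMMddHH|sourceid" -> number of distinct visitor ids seen in that hour.
    static Map<String, Integer> countUniques(List<Visit> visits) {
        Map<String, Set<String>> visitorsPerBucket = new TreeMap<>();
        for (Visit v : visits) {
            String bucket = HOUR_FORMAT.format(Instant.ofEpochMilli(v.timestampMillis))
                    + "|" + v.sourceId;
            // A set per bucket drops repeated visits by the same visitor.
            visitorsPerBucket.computeIfAbsent(bucket, k -> new HashSet<>()).add(v.visitorId);
        }
        Map<String, Integer> counts = new TreeMap<>();
        visitorsPerBucket.forEach((bucket, ids) -> counts.put(bucket, ids.size()));
        return counts;
    }

    public static void main(String[] args) {
        List<Visit> visits = List.of(
                new Visit("v1", "google", 0L),
                new Visit("v1", "google", 60_000L),   // same visitor, same hour
                new Visit("v2", "google", 120_000L));
        System.out.println(countUniques(visits));
    }
}
```

Note that this counts uniques per clock hour, which is the simplest definition for a batch report.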
2. Realtime
Track the data in realtime by making use of HBase counters. In this case I'd consider using 2 tables:
Table unique_users: tracks the last time a visitorid has visited the site (rowkey would be visitorid+source or just visitorid, depending on whether a visitor id can have different sources or just one). This table can have a TTL of 3600 seconds if you want to automatically discard old data, but I would keep a few days of data.
Table date_source_stats: tracks the unique visitors per source per hour. This table can have a TTL of a few weeks or even years, depending on your retention requirements.
When a visitor enters your site, you read the unique_users table to check the last access date. If that date is older than 1 hour, consider it a new visit and increment the counter for the date+hour+sourceid combination in the date_source_stats table. Afterwards, update unique_users to set the last visit time to the current time.
That way, you can retrieve the unique visits for a particular date+hour, and their sources, with a scan. You may also consider a source_date_stats table in case you want to perform queries for a specific source, e.g. an hourly report of the last 7 days for source X (you can even store those stats in the same table by using different rowkeys).
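The read/increment/update flow described above can be sketched like this. It is only an in-memory stand-in: the two maps play the role of the unique_users and date_source_stats tables, and the class name, method names and key separators are assumptions of mine. Against a real cluster, the lookup and the update would be a Get and a Put, and the counter bump would be an atomic increment such as `Table.incrementColumnValue` in the HBase client.

```java
import java.time.Instant;
import java.time.ZoneOffset;
import java.time.format.DateTimeFormatter;
import java.util.HashMap;
import java.util.Map;

// In-memory stand-in for the two HBase tables, to illustrate the flow only.
class RealtimeUniquesSketch {
    static final long ONE_HOUR_MILLIS = 3_600_000L;

    private static final DateTimeFormatter HOUR_FORMAT =
            DateTimeFormatter.ofPattern("yyyyMMddHH").withZone(ZoneOffset.UTC);

    // unique_users: rowkey visitorid+source -> last visit time (epoch millis)
    private final Map<String, Long> uniqueUsers = new HashMap<>();
    // date_source_stats: rowkey date+hour+sourceid -> unique visits counter
    private final Map<String, Long> dateSourceStats = new HashMap<>();

    // Returns true when the visit was counted as a new unique visit.
    boolean recordVisit(String visitorId, String sourceId, long nowMillis) {
        String userKey = visitorId + "|" + sourceId;
        Long lastVisit = uniqueUsers.get(userKey);
        // New visit if never seen, or if the last access is older than 1 hour.
        boolean isNew = lastVisit == null || nowMillis - lastVisit > ONE_HOUR_MILLIS;
        if (isNew) {
            String statsKey = HOUR_FORMAT.format(Instant.ofEpochMilli(nowMillis)) + "|" + sourceId;
            dateSourceStats.merge(statsKey, 1L, Long::sum);
        }
        // Always refresh the last visit time, whether the visit was counted or not.
        uniqueUsers.put(userKey, nowMillis);
        return isNew;
    }

    // Unique visits counted for one date+hour (yyyyMMddHH) and source.
    long uniques(String dateHour, String sourceId) {
        return dateSourceStats.getOrDefault(dateHour + "|" + sourceId, 0L);
    }

    public static void main(String[] args) {
        RealtimeUniquesSketch stats = new RealtimeUniquesSketch();
        stats.recordVisit("v1", "google", 0L);
        stats.recordVisit("v1", "google", 60_000L); // within the hour, not counted again
        System.out.println(stats.uniques("1970010100", "google"));
    }
}
```

Keep in mind that, since the last visit time is refreshed on every hit, a visitor who comes back every 30 minutes is counted only once. If you need "once per clock hour" instead, compare the hour buckets rather than the elapsed time.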
Please notice a few things about this approach:
- I've not been very detailed about the schemas; let me know if you need me to be.
- I'd also store the total visits in a counter (which is incremented regardless of whether the visit is unique or not); it's a useful value.
- This proposal can be extended as much as you want to track daily, weekly, and even monthly unique visitors; you'll just need more counters and rowkeys: date+sourceid, month+sourceid... In that case you can have multiple column families with distinct TTL properties to adjust the retention policy of each set.
- This proposal could face hotspotting issues due to the rowkeys being sequential if you have thousands of requests per second; you can read more about it here.
- An alternative approach for date_source_stats could be to opt for a wide design in which you have the sourceid as the rowkey and the date_hour values as columns.
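To make the wide design from the last point a bit more concrete, here is a rough sketch, again with in-memory maps instead of HBase. The outer map key plays the role of the rowkey (sourceid) and the inner sorted map plays the role of the date_hour columns; the class and method names are made up for the example.

```java
import java.util.Map;
import java.util.SortedMap;
import java.util.TreeMap;

// In-memory illustration of the wide layout: one row per source, one column per date_hour.
class WideStatsSketch {
    // rowkey sourceid -> (column date_hour as yyyyMMddHH -> counter), columns kept sorted
    private final Map<String, TreeMap<String, Long>> rows = new TreeMap<>();

    // Bumps the counter stored in the date_hour column of the source's row.
    void increment(String sourceId, String dateHour) {
        rows.computeIfAbsent(sourceId, k -> new TreeMap<>()).merge(dateHour, 1L, Long::sum);
    }

    // Hourly counters of one source for date_hour columns in [fromInclusive, toExclusive).
    SortedMap<String, Long> hourlyReport(String sourceId, String fromInclusive, String toExclusive) {
        TreeMap<String, Long> columns = rows.get(sourceId);
        if (columns == null) {
            return new TreeMap<>();
        }
        return columns.subMap(fromInclusive, toExclusive);
    }

    public static void main(String[] args) {
        WideStatsSketch stats = new WideStatsSketch();
        stats.increment("google", "2024010100");
        stats.increment("google", "2024010101");
        System.out.println(stats.hourlyReport("google", "2024010100", "2024010200"));
    }
}
```

With this layout an hourly report for one source is a read of a single row restricted to a range of columns, while a report for one date+hour across all sources has to touch every row, so which design fits better depends on the queries you run most.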