[Building Sakai] Stats for File Storage

David Adams da1 at vt.edu
Fri Jan 31 04:03:37 PST 2014


This query is a good start. Be sure not to ignore the /attachments/ tree,
which can be a significant portion of some sites' storage, and the
/group-user/ tree, which is for dropbox.

If you're okay with using Ruby, the "sakai-info" gem can provide the data
you're looking for on a site-by-site basis. This gist shows how simple the
code is, and includes instructions for setting it up:

    https://gist.github.com/daveadams/8706838

Note that this script takes an hour and 40 minutes to run against our
142,000 sites. Some of that slowness is probably Ruby's fault, and the
Sequel library that the sakai-info gem uses also slows things down. But you
can't argue with the conciseness of the end-result code.

The gist above includes user workspaces in its report. However, if you
actually want to track down files by user, there's no way to do this
directly via the database. You must extract and decode the 'binary_entity'
field in the content_resource table. Luckily the sakai-info gem has a way
to parse that data out. But it's slower still, since reading blobs is slow
to begin with.

The sakai-info library doesn't have all the bits you'd need to come up with
a per-user report, so here's another this other gist that uses the content
resource binary_entity parser from sakai-info along with more general
Sequel code and some big memory-hogging data structures to report on file
counts and total size by user, by mime type, by file extension, and by
creation date, all fields that you can only find in the binary_entity
column:

    https://gist.github.com/daveadams/8730747

On our mostly-idle preproduction system with over 10,000,000 records in the
content_resource table, that script took just over 5 hours to run, and ate
up 16GB+ of RAM before it wrapped up. But it produces some useful reports
(in CSV format) of which users, dates, mime types, and file extensions are
associated with the most files or most disk usage. And it could be easily
tweaked to provide more information or specific information.

However, obviously five hours to scan a single DB table is outrageous. The
problem is the binary_entity field. Extracting those fields out into actual
relational DB fields would be a huge win for direct database usability,
reporting, and for the maintainers of the Resources tool, who could
probably speed things up and provide new search-and-sort features a lot
more easily.

Anyway, hopefully someone will find all this useful.

-dave

--
David Adams
Director of Server and Network Operations
Virginia Tech TLOS, Technology-enhanced Learning and Online Strategies


On Tue, Jan 28, 2014 at 4:37 PM, Sam Ottenhoff <ottenhoff at longsight.com>wrote:

> Here is the basic idea tested on MySQL (not Oracle):
>
>     SELECT SUBSTRING_INDEX(SUBSTRING_INDEX(REPLACE(cr.IN_COLLECTION,
> \"private/meleteDocs\", \"group\"), \"/\", 3), \"/\", -1) AS site_id,
> SUM(FILE_SIZE) as aggregate_size
>     FROM CONTENT_RESOURCE
>     GROUP BY site_id
>     ORDER BY aggregate_size DESC
>
>
> On Mon, Jan 27, 2014 at 9:23 AM, May, Megan Marie <mmmay at iu.edu> wrote:
>
>>  Has anyone ever run a report to get the file storage by site or by
>> user?
>>
>>
>>
>> Wondering if anyone has something we can reuse.
>>
>>
>>
>> Thanks,
>>
>> Megan
>>
>>
>>
>>
>>
>> Megan May
>>
>> Manager, Learning Management Systems / Kuali Student Development
>>
>> UITS Enterprise Student Systems
>>
>> Enterprise Software Division
>>
>> Indiana University
>>
>> Office:  317-274-4528
>>
>>
>>
>> _______________________________________________
>> sakai-dev mailing list
>> sakai-dev at collab.sakaiproject.org
>> http://collab.sakaiproject.org/mailman/listinfo/sakai-dev
>>
>> TO UNSUBSCRIBE: send email to
>> sakai-dev-unsubscribe at collab.sakaiproject.org with a subject of
>> "unsubscribe"
>>
>
>
> _______________________________________________
> sakai-dev mailing list
> sakai-dev at collab.sakaiproject.org
> http://collab.sakaiproject.org/mailman/listinfo/sakai-dev
>
> TO UNSUBSCRIBE: send email to
> sakai-dev-unsubscribe at collab.sakaiproject.org with a subject of
> "unsubscribe"
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://collab.sakaiproject.org/pipermail/sakai-dev/attachments/20140131/6a1906c4/attachment.html 


More information about the sakai-dev mailing list