[Building Sakai] Culling duplicates from SAM_ITEMGRADING_T

Tue Mar 26 12:07:49 PDT 2013

Hi Karen. Thanks for your response.

On Thu, Mar 21, 2013 at 3:05 PM, Karen Tsao <ktsao at stanford.edu> wrote:

> Hi Jushua,
>
> The reason the script only cleans up null SUBMITTEDDATE records is we can
> successfully reproduce duplicates with null SUBMITTEDDATE (SAM-758). If you
> have duplicates that have not null SUBMITTEDDATE, these must be created in
> other flow which we haven't found. Are these old data or some are lately
> created?
>

These items span from the time we started using Sakai - in 2005 - to the
present. We are using the rSmart CLE, but I'm not aware of any changes they
may have made.

>
> Are these SUBMITTEDDATE different? How about other fields, eg, score? If
> they are all the same except the SUBMITTEDDATE, I think it should be ok to
> remove the earliest duplicates. But, I am not 100 % confident. Maybe you
> can try this in your QA instance first?
>
>
Typically I see that the SUBMITTEDDATEs are the same. PUBLISHEDANSWERID is
null except in the one with the highest ITEMGRADINGID. AUTOSCORE is 1 in
the one withthe highest ITEMGRADINGID, and 0 in the others.

> Not sure if there are other schools encountered this issue. If yes, hope
> they can give you some feedback.
>
>
Additionally, even with after the multiple not-null SUBMITTEDDATE items are
removed, I still can't create the index. It turns out that by restricting
the query to those that correspond to a published item with TYPEID 1 and 4,
many duplicates are missed. This is the original query:

SELECT g.PUBLISHEDITEMID, g.AGENTID, g.ASSESSMENTGRADINGID,
g.PUBLISHEDITEMTEXTID
FROM SAM_ITEMGRADING_T g, sam_publisheditem_t i where g.publisheditemid =
i.itemid and (i.typeid='1' or i.typeid='4')
GROUP BY g.PUBLISHEDITEMID, g.AGENTID, g.ASSESSMENTGRADINGID,
g.PUBLISHEDITEMTEXTID HAVING COUNT(*) > 1

And after taking out the TYPEID requirement:

SELECT g.PUBLISHEDITEMID, g.AGENTID, g.ASSESSMENTGRADINGID,
g.PUBLISHEDITEMTEXTID
FROM SAM_ITEMGRADING_T g
GROUP BY g.PUBLISHEDITEMID, g.AGENTID, g.ASSESSMENTGRADINGID,
g.PUBLISHEDITEMTEXTID HAVING COUNT(*) > 1

First, I removed all duplicates that were identified by the initial query
(which selects types 1 and 4). Then I run this query that doesn't depend on
the type, and there are over 20,000 results. Many of these items have up to
10 duplicates.

Removing the duplicates isn't much of a problem, I'm just wondering what
the risk is.

Josh

> Thanks,
> Karen
>
> The script does include a commented-out statement for those particular
>> entries:
>>
>>  delete from sam_itemgrading_t where itemgradingid in (
>>         select MIN(g.itemgradingid) from sam_itemgrading_t g where
>>         g.agentid = r.agentid and
>>         g.publisheditemid = r.publisheditemid and
>>         g.assessmentgradingid = r.assessmentgradingid and
>>         g.publisheditemtextid = r.publisheditemtextid
>>         GROUP BY g.PUBLISHEDITEMID, g.AGENTID, g.ASSESSMENTGRADINGID,
>> g.PUBLISHEDITEMTEXTID
>>
>> This seems pretty reasonable, to delete the earliest submission.
>>
>> Josh
>>
>> _______________________________________________
>> sakai-dev mailing list
>> sakai-dev at collab.sakaiproject.org
>> http://collab.sakaiproject.org/mailman/listinfo/sakai-dev
>>
>> TO UNSUBSCRIBE: send email to
>> sakai-dev-unsubscribe at collab.sakaiproject.org with a subject of
>> "unsubscribe"
>>
>
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://collab.sakaiproject.org/pipermail/sakai-dev/attachments/20130326/6e636157/attachment.html