Eclipse Community Forums: BIRT » Data Cube Count Distinct missing

Home » Archived » BIRT » Data Cube Count Distinct missing(Using count for a summary field in a data cube yields the wrong result, why is count distinct missing?)

Data Cube Count Distinct missing [message #1031941]

Tue, 02 April 2013 07:44

Eclipse User

I have a dataset which returns rows of stories along with some data, see attached csv:
----------
storyid, date, category, views
1, 2013-01-01, news, 30
1, 2013-01-02, news, 465
..
2, 2013-01-01, sport, 65
2, 2013-01-02, sport, 423
..
3, 2013-01-01, games, 5
3, 2013-01-03, games, 54
----------

I want to analyze the data in regard to the categories and the time periods: How many views has a category per month and how many stories are published in a category in a month? So I create a data cube and add these dimensions as groups. As summary fields I pick the views (sum) and the storyid (count). See attached rptdesign.

Now everything is fine except the number of stories per category per month: Since the only option is to use COUNT as function in the summary field for storyid too many stories are counted. That happens because there are multiple rows for the same storyid containing data for different dates.

I assume a COUNTDISTINCT would work but it seems that function is not available for a cube summary field. Am i right? If yes (why oh why?) is there a way to circumevent this problem?

Attachment: cube_countdistinct.csv
(Size: 0.54KB, Downloaded 227 times)
Attachment: cube_countdistinct.rptdesign
(Size: 23.95KB, Downloaded 2843 times)

[Updated on: Tue, 02 April 2013 07:46] by Moderator

Re: Data Cube Count Distinct missing [message #1032231 is a reply to message #1031941]

Tue, 02 April 2013 14:51

Eclipse User

You could probably create a computed column in your dataSet to keep track of this for you, then, use this in your cube. In this example, would all of the counts be 1, since no category has multiple ids?

Re: Data Cube Count Distinct missing [message #1034548 is a reply to message #1032231]

Fri, 05 April 2013 11:16

Eclipse User

Thank you for the quick reply!

Michael Williams wrote on Tue, 02 April 2013 14:51

In this example, would all of the counts be 1, since no category has multiple ids?

Yes in the example above the expected result is 1 story per category.

I attached another example (cube_countdistinct_v2.csv) with more stories per category (this is dummy data of course). The expected result is as displayed in the attached cube_countdistinct_v2_result.png .

Before your answer I came up with the solution to calculate the uniqueness of every storyid in the summary function of the cube (I found pointers to this on birt-exchange). Part of my question is if this is the way to go (seems a bit hackish to me, seems to me that yes). See attached cube_countdistinct_v2.rptdesign . The code is:

var storyids = vars["storyids"];
var storyid = dataSetRow["comp_storyid"];
var date = dataSetRow["comp_date"];
var dateKey = Formatter.format(date, "YYYYMM");
var val = 0;
if (storyids[dateKey] == undefined) {
   storyids[dateKey] = {};
   storyids[dateKey][storyid] = 1;
   val = 1;
} else {
   if (storyids[dateKey][storyid] == undefined) {
      storyids[dateKey][storyid] = 1;
      val = 1;
   }
}
val

The performance impact is well observable (speed and in-memory-usage). Do you recommend something similar to use as a computed column? Will this increase performance (mabye because it is outside of the cube)? Is it advisable to use a report variable to hold the "hashmap" for the storyids?

Attachment: cube_countdistinct_v2_result.png
(Size: 3.67KB, Downloaded 408 times)
Attachment: cube_countdistinct_v2.rptdesign
(Size: 26.14KB, Downloaded 917 times)
Attachment: cube_countdistinct_v2.csv
(Size: 0.76KB, Downloaded 229 times)

Re: Data Cube Count Distinct missing [message #1034582 is a reply to message #1031941]

Fri, 05 April 2013 12:10

Eclipse User

My thoughts were very similar to that on how I'd do the computed column. Here is the computed column code I had that works the same:

monthYear = Formatter.format(row["comp_date"], "MMYY");
temp = reportContext.getPersistentGlobalVariable(row["storyid"] + row["category"] + monthYear);
if(temp == null){
	reportContext.setPersistentGlobalVariable(row["storyid"] + row["category"] + monthYear, "set");
	1;
}
else{
	0;
}

Then, you just use this in your cube as a summed measure.

[Updated on: Fri, 05 April 2013 12:12] by Moderator

Re: Data Cube Count Distinct missing [message #1036524 is a reply to message #1034582]

Mon, 08 April 2013 09:49

Eclipse User

Your code works flawlessly in my simplified example. When I applied it to my real report I noticed it stopped working: The real report uses a joint dataset as the source for the cube and that seems to be the problem. Interestingly it works when using the custom code inside the cube and not for a computed column in the joint dataset.

See attached report with a joint dataset made of the 2 attached csv datasources (the same data as before just separated). It shows 2 tables, one is not working (extra code as computed column) and the other is working (extra code inside the cube). I'll go with the working solution Smile

but asked myself: Did i do something wrong or is this a bug or an unsupported feature?

Attachment: cube_countdistinct_v3.rptdesign
(Size: 47.80KB, Downloaded 494 times)
Attachment: cube_countdistinct_v3_stories.csv
(Size: 0.07KB, Downloaded 225 times)
Attachment: cube_countdistinct_v3_views.csv
(Size: 0.56KB, Downloaded 208 times)