07:31
poikilotherm joined #dataverse
09:10
stefankasberger joined #dataverse
09:12
juancorr joined #dataverse
09:44
stefankasberger joined #dataverse
10:23
pdurbin joined #dataverse
10:24
pdurbin
Hi, all. I did a little research into upgrading http://chat.dataverse.org from Shout over the weekend.
10:31
poikilotherm
Hi pdurbin :-)
10:39
poikilotherm
pdurbin do you think there is a chance someone will take a look at my PRs today?
10:56
pdurbin
I was going to look at the Solr one.
10:57
poikilotherm
Great
10:57
pdurbin
I just assigned myself to it.
11:02
poikilotherm
:-) :-) :-)
11:06
pdurbin
Have you seen Adam Bien's pom.xml files for Java EE 7 that have only 21 lines? This: http://www.adam-bien.com/roller/abien/entry/essential_javaee_7_pom_xml
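For reference, the shape of the minimal Java EE 7 pom.xml that post describes is roughly the following. This is a sketch reconstructed from memory, not copied from the post; the `com.example` coordinates are placeholders:

```xml
<?xml version="1.0"?>
<project xmlns="http://maven.apache.org/POM/4.0.0">
    <modelVersion>4.0.0</modelVersion>
    <!-- placeholder coordinates -->
    <groupId>com.example</groupId>
    <artifactId>app</artifactId>
    <version>1.0-SNAPSHOT</version>
    <packaging>war</packaging>
    <dependencies>
        <!-- the whole platform, provided by the application server -->
        <dependency>
            <groupId>javax</groupId>
            <artifactId>javaee-api</artifactId>
            <version>7.0</version>
            <scope>provided</scope>
        </dependency>
    </dependencies>
    <properties>
        <maven.compiler.source>1.8</maven.compiler.source>
        <maven.compiler.target>1.8</maven.compiler.target>
        <failOnMissingWebXml>false</failOnMissingWebXml>
    </properties>
</project>
```

The point is that a single `provided`-scope dependency on the platform API replaces a long list of individual libraries.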
11:08
poikilotherm
Nope.
11:08
poikilotherm
But I kinda doubt he will use it
11:09
yoh joined #dataverse
11:09
poikilotherm
In his podcast from last week about the first line of Quarkus he told us differently
11:09
poikilotherm
He's now into adding everything up front, hacking away, and removing things later. But obviously stay as much with EE as possible.
11:10
pdurbin
Right, stick to the provided APIs if you can. That's what I mean.
11:10
poikilotherm
:-)
11:10
poikilotherm
You'd better not look at the Dataverse POM :-D
11:10
pdurbin
Minimize your external dependencies. Use the framework you're on.
11:11
pdurbin
I know, I know. I just left a comment on your pull request.
11:11
* poikilotherm
goes looking
11:13
poikilotherm
Re :-)
11:28
pdurbin
poikilotherm: I moved the pom.xml pull request over. You aren't blocked on the Solr one, it seems. You're mostly just suggesting a better approach.
11:28
pdurbin
Using existing APIs.
11:29
yoh joined #dataverse
11:29
poikilotherm
Yeah. I'm not blocked at all, but it would be superb to have this sooner rather than later so I can go ahead
11:29
pdurbin
Yeah.
11:30
pdurbin
I think the only thing I'm wondering about is if we should take the Harvard-specific stuff out.
11:31
pdurbin
Out of schema_dv_cmb_copies.xml and schema_dv_cmb_fields.xml, I mean. Do you think you could try taking the Harvard-specific fields out?
11:32
poikilotherm
I could. But that will break backward compatibility
11:33
poikilotherm
People might depend on the existing stuff and reuse Harvard specials
11:33
poikilotherm
If you guys tell me to go ahead, I'll do it :-)
11:33
poikilotherm
This is just about removing the fields, which is not a lot of work to do... ;-)
11:36
pdurbin
I think it would be much cleaner and leaner to remove those Harvard-specific fields. I just assigned it to you and moved it to community dev. If it's too hard or weird, please just ping me.
11:39
poikilotherm
https://github.com/IQSS/dataverse/pull/6146#issuecomment-529431257
12:13
pdurbin
Thanks, I replied with some screenshots and such. :)
12:14
poikilotherm
Yeah, I saw you did some great work :-D
12:14
poikilotherm
I'm trying to filter the TSV fields so I don't miss a field
12:18
pdurbin
Hmm.
12:19
pdurbin
I would suggest a clean installation with just the standard 6 metadata blocks. Then run your new script. That should do it, right?
12:20
pdurbin
Are you sorry you asked me to look at your pull requests? :)
12:21
pdurbin
Again, if it's too hard or weird, please let me know.
12:23
poikilotherm
It's no problem at all and I fully agree
12:23
pdurbin
phew
12:23
pdurbin
Do you think you could add that 6th block to the appendix too?
12:25
poikilotherm
cat scripts/api/data/metadatablocks/customCHIA.tsv | grep -A5000 "#datasetField" | grep -B5000 "#controlledVocabulary" | grep -E -e "^\s+" | cut -f2
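The pipeline above enters the `#datasetField` section, stops at `#controlledVocabulary`, keeps the indented data rows, and prints the second column. A single awk pass can express the same idea; this is a sketch only, with the column layout assumed from the pipeline above:

```shell
# Print the field names (column 2) from the #datasetField section of a
# metadata block TSV. Data rows begin with a tab, so field 1 is empty.
awk -F'\t' '
  /^#controlledVocabulary/ { exit }       # stop at the next section header
  in_section && $1 == ""   { print $2 }   # indented data row: print the name
  /^#datasetField/         { in_section = 1 }
' scripts/api/data/metadatablocks/customCHIA.tsv
```

One pass, no `-A5000`/`-B5000` guesswork about how long the sections are.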
12:25
poikilotherm
Getting there :-D
12:25
poikilotherm
Sure
12:25
pdurbin
Thanks!
12:25
pdurbin
Even more added value in this pull request. :)
12:27
poikilotherm
Happy to be helpful :-)
12:27
pdurbin
That 5 vs 6 mismatch has bothered me for a long time. Very untidy. :)
12:34
poikilotherm
Ah I like grepping
12:34
poikilotherm
I just generated a list of fields and used sed to kill the lines in both files
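A sketch of that approach, assuming a hypothetical `fields.txt` with one field name per line. Deleting on both `name=` and `source=` covers both the field declarations and the copyField directives:

```shell
# For each unwanted field, delete its <field name="..."> declaration and
# any <copyField source="..."> directive from both schema files.
# fields.txt is hypothetical: one field name per line.
while read -r field; do
  sed -i -e "/name=\"$field\"/d" -e "/source=\"$field\"/d" \
    schema_dv_cmb_fields.xml schema_dv_cmb_copies.xml
done < fields.txt
```

(This uses GNU sed's in-place `-i`; keep a backup or work on a copy.)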
12:35
poikilotherm
It's interesting that not all fields also have a fulltext index
12:45
poikilotherm
pdurbin where is this journal.tsv coming from?
12:45
poikilotherm
There's no obvious reference in the docs
12:50
poikilotherm
I found the Google Docs link, but I have no idea what it's based on... Every other schema has some kind of inspiration/base/...
12:50
poikilotherm
I could add this as a TODO and you could push a commit correcting it.
12:56
poikilotherm
pdurbin: just pushed two commits with your requested changes
13:06
pdurbin
poikilotherm: thanks but why is mraCollection still in there? That's from customMRA.tsv
13:07
pdurbin
hbgdkiBirthWeight is still in there too. It's from custom_hbgdki.tsv
13:07
poikilotherm
That's just because I'm stupid
13:08
pdurbin
I blame grep. :)
13:08
poikilotherm
Forgot that copyField uses source=, not name=
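That's the distinction: in a Solr schema, `<field>` declarations are keyed by `name=`, while `<copyField>` directives reference them via `source=` and `dest=`. An illustrative fragment; the field name is just the example from the conversation, and the `type` and attribute values are assumptions:

```xml
<!-- A field declaration is matched by name= ... -->
<field name="mraCollection" type="text_en" stored="true" indexed="true" multiValued="true"/>
<!-- ...but the corresponding copyField references it via source= -->
<copyField source="mraCollection" dest="_text_" maxChars="3000"/>
```

So a deletion script that only matches on `name=` leaves the copyField lines behind.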
13:08
poikilotherm
I'll amend and force push, ok?
13:09
poikilotherm
(So nobody will ever know...)
13:09
pdurbin
heh, that's fine
13:10
poikilotherm
Here you go. Shortly AFK
13:14
Jerry19 joined #dataverse
13:15
Jerry19
To whom it may concern: is this the place to ask questions about Counter Processor?
13:15
poikilotherm
pdurbin that's for you :-D
13:15
poikilotherm
Hello Jerry19
13:16
Jerry19
Hi poikilotherm
13:17
Jerry19
I'm Jerry, working at Purdue University Libraries
13:17
poikilotherm
Welcome Jerry :-)
13:18
poikilotherm
IIRC Phil (pdurbin) was the one who implemented the counter code
13:19
pdurbin
Jerry19: nice, both of my nieces are going to Purdue. One of them is majoring in animation and video game design. I'm jealous. :)
13:20
Jerry19
Nice, congratulations to them. You might also come visit if you want to :)
13:22
pdurbin
It would be neat to check out the labs or whatever. :)
13:23
Jerry19
Hi pdurbin, I ran the Counter Processor against our datasets' access logs, and the result shows that the view number for a dataset is 0; however, I think it is supposed to be a positive number. Do you want me to post the log here?
13:23
pdurbin
Anyway, I did not write the Counter Processor code but I helped get it working with Dataverse and I'm happy to answer questions.
13:25
Jerry19
"data-type": "dataset", "yop": "2014", "uri": "https://purr.purdue.edu/publications/1561/1 ", "performance": [ { "period": { "begin-date": "2018-01-01", "end-date": "2018-01-31" }, "instance": [] } ] },
13:25
Jerry19
Thank you, pdurbin
13:25
pdurbin
Hmm. I'm trying to think of the best place to post it. An issue at https://github.com/IQSS/dataverse/issues would probably be better. I'm not sure if you're worried about IP addresses being in a public place though. This channel is logged and GitHub Issues is public.
13:26
pdurbin
Yeah an empty array under "instance" is no good.
13:27
Jerry19
May I discuss the issue through email with you?
13:28
pdurbin
Sure, email is fine. The best way would be to attach your log in an email to support@dataverse.org . That'll create a private support ticket.
13:29
Jerry19
Sounds good. I will send the email in a short time. Thank you in advance.
13:37
pdurbin
Sure, no problem.
13:38
pdurbin
Jerry19: while you're thinking about all this Make Data Count and Counter Processor stuff, please feel free to leave a comment at https://github.com/IQSS/dataverse/issues/6082
13:39
pdurbin
Jerry19: also, I don't know if this helps or not but when I want to play around with Counter Processor, I enable it here: https://github.com/IQSS/dataverse-ansible/blob/be0b2aef038cb8f82e3f3b1f3b602fa8ffe1ddbc/defaults/main.yml#L33
13:50
Jerry19
Thank you, pdurbin.
13:51
pdurbin
Jerry19: sure. Actually, in practice I would probably change counter enabled to true here: https://github.com/IQSS/dataverse-sample-data/blob/407aba72cfa140ae18e9ab80d2e504e0686252a9/ec2config.yaml#L33
13:51
pdurbin
and use the ec2-create-instance.sh script mentioned in the README of that "sample data" repo
14:12
pdurbin
poikilotherm: I just left you another review :)
14:17
poikilotherm
Thx pdurbin
14:17
pdurbin
poikilotherm: is it easy for you to link up that tsv?
14:17
poikilotherm
I just copied the link...?
14:17
poikilotherm
Its a google doc
14:17
pdurbin
no sorry
14:17
pdurbin
there are two links in the other examples
14:18
pdurbin
one to a doc
14:18
pdurbin
one to a tsv
14:19
poikilotherm
Ah I didn't see those... The file is a bit messy with all those links and all in one line
14:19
poikilotherm
I'll add the link
14:19
pdurbin
yeah, it is messy
14:20
pdurbin
and while you're in there you could move it down like Julian suggested.
14:22
pdurbin
... to make it clear that the journal metadata block isn't based on a standard
14:24
pdurbin
Julian just left a comment.
14:33
poikilotherm
pdurbin: I can add a line to the setup-optional-harvard.sh - would you test and maybe debug?
14:34
pdurbin
poikilotherm: sure!
14:35
pdurbin
poikilotherm: my other thought was that you could add a comment to setup-datasetfields.sh above citation.tsv saying it's required
14:35
pdurbin
and, if you want, a comment later that says the other tsv files are optional
14:40
poikilotherm
I'm fiddling with the env var handling... Thinking about adding getopts parsing to make it easier
14:42
poikilotherm
Passing env vars to sudo is not very easy :-D
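A minimal sketch of the getopts idea, with the sudo caveat as a comment. The flag names and the SOLR_VERSION/SOLR_USER variables are hypothetical, not the installer's actual interface:

```shell
#!/bin/sh
# Hypothetical option parsing for an install script:
#   -s  Solr version    -u  service user
SOLR_VERSION=7.3.1
SOLR_USER=solr
while getopts "s:u:" opt; do
  case "$opt" in
    s) SOLR_VERSION=$OPTARG ;;
    u) SOLR_USER=$OPTARG ;;
    *) echo "usage: $0 [-s version] [-u user]" >&2; exit 1 ;;
  esac
done
# Note: plain environment variables don't survive sudo by default; pass them
# explicitly, e.g.  sudo SOLR_VERSION="$SOLR_VERSION" ./install.sh
# or preserve them with  sudo --preserve-env=SOLR_VERSION ./install.sh
echo "installing Solr $SOLR_VERSION as user $SOLR_USER"
```

Explicit flags sidestep the sudo environment problem entirely, since arguments pass through unchanged.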
14:42
pdurbin
poikilotherm: ok, but don't break anything :)
14:45
poikilotherm
Ok, gotta run pick up kids. More tomorrow.
14:45
poikilotherm
Cu
14:47
pdurbin
that pull request is really coming together
14:55
pdurbin
Jerry19: I just replied. One of the things I'm confused about is how your logs have lines from 2018. Are these Apache access logs? You have to use special logs that Dataverse 4.12 (shipped in April 2019) and above creates.
15:03
Jerry19
Hi Phil, yes, the logs that I sent you are created from the Apache access logs plus the corresponding metadata for each dataset in our database
15:04
Jerry19
Each line includes the 21 items that the Counter Processor requires
15:05
Jerry19
The first 10 items are extracted from the Apache access log
15:06
Jerry19
The other 11 items are pulled from the database. Then these two parts are combined to form a log file that Counter Processor can deal with
15:11
Jerry19
Hi Phil, please correct me if there is anything wrong. Thank you
16:36
pdurbin
Jerry19: hi sorry, standup and then meetings after. You cannot use Apache access logs. It's not supported. I had the same question myself. Here's where I asked about it: https://github.com/CDLUC3/counter-processor/issues/3
16:46
Jerry19
Thank you, Phil. I'm checking it out.
16:52
pdurbin
Jerry19: sure. To configure Dataverse to create the special logs that Counter Processor wants, you need to configure the ":MDCLogPath" database setting as described at http://guides.dataverse.org/en/4.16/admin/make-data-count.html#enable-logging-for-make-data-count
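:MDCLogPath is a regular Dataverse database setting, so (assuming the standard admin API on localhost, and a running Dataverse instance) setting it looks roughly like the fragment below; the log directory shown is an example path only:

```shell
# Example only: point Make Data Count logging at the app server's logs
# directory (adjust the path for your Glassfish/Payara domain).
curl -X PUT -d '/usr/local/glassfish4/glassfish/domains/domain1/logs' \
  http://localhost:8080/api/admin/settings/:MDCLogPath
```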
16:55
Jerry19
Hi Phil, I read your thread and I think my situation is different. The logs I shared with you are not Apache access logs. They are a combination of the Apache access log and the corresponding dataset's metadata.
16:58
Jerry19
Same as yours, our Apache access logs don't include the metadata of the dataset. I wrote another script to retrieve the metadata based on the information in each line of the Apache access log, and then combine the "Apache access log" with the "database information"
16:59
Jerry19
2018-01-06T20:10:30+0000 98.223.104.111 - - - /publications/1561/572?media=Image:thumb 10.4231/R7B8562S - 1833 Mozilla/5.0 (Windows NT 6.2; Win64; x64; rv:52.0) Gecko/20100101 Firefox/52.0 MCD: Matlab programs for computing the Macdonald function for complex orders Purdue University Research Repository (PURR) re3data::10.17616/R3V90N
16:59
Jerry19
Such as the line above
16:59
Jerry19
2018-01-06T20:10:30+0000 98.223.104.111 - - - /publications/1561/572?media=Image:thumb 10.4231/R7B8562S - 1833 Mozilla/5.0 (Windows NT 6.2; Win64; x64; rv:52.0) Gecko/20100101 Firefox/52.0
17:00
Jerry19
The part above comes from our repository server's Apache access log
17:00
Jerry19
MCD: Matlab programs for computing the Macdonald function for complex orders Purdue University Research Repository (PURR) re3data::10.17616/R3V90N Walter Gautschi 2014-04-22T16:50:00+0000 1 1561 https://purr.purdue.edu/publications/1561/1 2014
17:01
Jerry19
The other part, shown above, comes from our database, where the datasets' metadata is saved
17:01
Jerry19
The whole line above meets the requirement of the Counter Processor
17:02
Jerry19
That is, the log that Counter Processor deals with must contain 21 items, as introduced in the section "Items to log per line for processing" at https://github.com/CDLUC3/counter-processor
17:03
pdurbin
Jerry19: interesting! Are you saying you wrote a script to take old Apache logs and supplement them with enough information from Dataverse to put them into the format that Counter Processor requires?
17:03
Jerry19
So what I did was write a script to create log files that Counter Processor can deal with, then run Counter Processor on them.
17:03
Jerry19
Yes, exactly
17:04
Jerry19
Sorry that I didn't mention it at the beginning
17:15
pdurbin_m joined #dataverse
17:16
pdurbin_m
Jerry19: is your script open source?!? Please say yes. :)
17:19
Jerry31 joined #dataverse
17:19
Jerry31
I accidentally quit the channel.
17:22
Jerry31
Back to my confusion: the reality is that there are more than 20 accesses to the dataset's web page, but Counter Processor doesn't count any investigations.
17:25
pdurbin_m joined #dataverse
17:26
pdurbin_m
Jerry31: I got disconnected too. Typing with thumbs. Is your script open source?
17:38
Jerry31
Hi Phil, the script is not open source for now.
17:42
Jerry31
Hi Phil, can you please explain how you run the Counter Processor on your hub? Don't you create the log files that Counter Processor requires and deals with?
18:01
pdurbin
Jerry31: hi, sorry, I was out for a walk after lunch. Harvard Dataverse has not yet put Counter Processor into production. Here's the issue we're using to track this but it's not in our current sprint: https://github.com/orgs/IQSS/projects/2#card-23210979
18:20
Jerry31
ok, thank you, Phil.
18:25
pdurbin
Jerry31: sure. What I'm saying is that you're a pioneer. There's one other guy, Jim Myers, who is starting to create issues about Make Data Count, like this one: https://github.com/IQSS/dataverse/issues/6138
18:25
pdurbin
So you are very, very welcome to open issues too! :)
18:30
pdurbin
Jerry31: what if you run Counter Processor on the special logs produced by Dataverse? Does it work? Do you get views and downloads?
19:01
Jerry31
Hi Phil, will do. Can you please send me the logs produced by Dataverse? I lost the information when I quit last time
19:05
pdurbin
Jerry31: sorry, I don't have any logs like that handy. One thing I could do is spin up an installation of Dataverse on EC2 with Make Data Count and Counter Processor enabled and give you ssh access to it. Then we could both look at the logs. What do you think?
20:45
Jerry31
Hi Phil, probably not this time. Thank you for all the help today. Let me know if I can help on anything.
20:46
pdurbin
Jerry31: ok, no problem. I'm heading home soon anyway. :)