IRC log for #dataverse, 2019-09-09

Connect via chat.dataverse.org to discuss Dataverse (dataverse.org, an open source web application for sharing, citing, analyzing, and preserving research data) with users and developers.


All times shown according to UTC.

Time S Nick Message
07:31 poikilotherm joined #dataverse
09:10 stefankasberger joined #dataverse
09:12 juancorr joined #dataverse
09:44 stefankasberger joined #dataverse
10:23 pdurbin joined #dataverse
10:24 pdurbin Hi, all. I did a little research into upgrading http://chat.dataverse.org from Shout over the weekend.
10:31 poikilotherm Hi pdurbin :-)
10:39 poikilotherm pdurbin do you think there is a chance someone will take a look at my PRs today?
10:56 pdurbin I was going to look at the Solr one.
10:57 poikilotherm Great
10:57 pdurbin I just assigned myself to it.
11:02 poikilotherm :-) :-) :-)
11:06 pdurbin Have you seen Adam Bien's pom.xml files for Java EE 7 that have only 21 lines? This: http://www.adam-bien.com/roller/abien/entry/essential_javaee_7_pom_xml
11:08 poikilotherm Nope.
11:08 poikilotherm But I kinda doubt he will use it
11:09 yoh joined #dataverse
11:09 poikilotherm In his podcast from last week about the first line of Quarkus he told us differently
11:09 poikilotherm He's now into adding everything up front, starting to hack, and removing things later. But obviously staying as much with EE as possible.
11:10 pdurbin Right, stick to the provided APIs if you can. That's what I mean.
11:10 poikilotherm :-)
11:10 poikilotherm You'd better not look at the Dataverse POM :-D
11:10 pdurbin Minimize your external dependencies. Use the framework you're on.
11:11 pdurbin I know, I know. I just left a comment on your pull request.
11:11 * poikilotherm goes looking
11:13 poikilotherm Re :-)
11:28 pdurbin poikilotherm: I moved the pom.xml pull request over. You aren't blocked on the Solr one, it seems. You're mostly just suggesting a better approach.
11:28 pdurbin Using existing APIs.
11:29 yoh joined #dataverse
11:29 poikilotherm Yeah. I'm not at all blocked, but it would be superb to have this sooner rather than later so I can go ahead
11:29 pdurbin Yeah.
11:30 pdurbin I think the only thing I'm wondering about is if we should take the Harvard-specific stuff out.
11:31 pdurbin Out of schema_dv_cmb_copies.xml and schema_dv_cmb_fields.xml, I mean. Do you think you could try taking the Harvard specific fields out?
11:32 poikilotherm I could. But that will break backward compatibility
11:33 poikilotherm People might depend on the existing stuff and reuse Harvard specials
11:33 poikilotherm If you guys tell me go ahead, I'll do :-)
11:33 poikilotherm This is just about removing the fields, which is not a lot of work to do... ;-)
11:36 pdurbin I think it would be much cleaner and leaner to remove those Harvard-specific fields. I just assigned it to you and moved it to community dev. If it's too hard or weird, please just ping me.
11:39 poikilotherm https://github.com/IQSS/dataverse/pull/6146#issuecomment-529431257
12:13 pdurbin Thanks, I replied with some screenshots and such. :)
12:14 poikilotherm Yeah, I saw you made some great work :-D
12:14 poikilotherm I'm trying to filter the TSV fields so I don't miss a field
12:18 pdurbin Hmm.
12:19 pdurbin I would suggest a clean installation with just the standard 6 metadata blocks. Then run your new script. That should do it, right?
12:20 pdurbin Are you sorry you asked me to look at your pull requests? :)
12:21 pdurbin Again, if it's too hard or weird, please let me know.
12:23 poikilotherm It's no problem at all and I fully agree
12:23 pdurbin phew
12:23 pdurbin Do you think you could add that 6th block to the appendix too?
12:25 poikilotherm cat scripts/api/data/metadatablocks/customCHIA.tsv | grep -A5000 "#datasetField" | grep -B5000 "#controlledVocabulary" | grep -E -e "^\s+" | cut -f2
12:25 poikilotherm Getting there :-D
12:25 poikilotherm Sure
12:25 pdurbin Thanks!
12:25 pdurbin Even more added value in this pull request. :)
12:27 poikilotherm Happy to be helpful :-)
12:27 pdurbin That 5 vs 6 mismatch has bothered me for a long time. Very untidy. :)
12:34 poikilotherm Ah I like greping
12:34 poikilotherm I just generated a list of fields and used sed to kill the lines in both files
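The grep/sed workflow poikilotherm describes can be sketched roughly as follows. The file names, field names, and file contents below are made-up samples for illustration, not the actual Dataverse TSV or Solr schema files from the pull request:

```shell
# Sketch: list field names from the "#datasetField" section of a metadata
# block TSV, then delete the matching <field>/<copyField> lines from a Solr
# schema file. Sample files are created inline so the sketch is runnable.
printf '#datasetField\tname\ttitle\n\tmraCollection\tMRA Collection\n\thbgdkiBirthWeight\tBirth Weight\n#controlledVocabulary\tfield\tvalue\n' > block.tsv
printf '<field name="mraCollection" type="text_en"/>\n<field name="keepMe" type="text_en"/>\n<copyField source="hbgdkiBirthWeight" dest="_text_"/>\n' > schema_fields.xml

# Field names live between the two section headers, on lines starting with whitespace.
fields=$(sed -n '/#datasetField/,/#controlledVocabulary/p' block.tsv | grep -E '^[[:space:]]' | cut -f2)
for f in $fields; do
  # plain fields match on name=, copyField entries match on source=
  sed -i "/name=\"$f\"/d; /source=\"$f\"/d" schema_fields.xml
done
cat schema_fields.xml   # only the unrelated keepMe field remains
```

Matching on both `name=` and `source=` is what avoids the copyField mix-up mentioned below.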
12:35 poikilotherm It's interesting that not all fields also have a fulltext index
12:45 poikilotherm pdurbin where is this journal.tsv coming from?
12:45 poikilotherm There is no obvious reference in the docs
12:50 poikilotherm I found the Google Docs link, but I have no idea what this has been based on... Every other schema has some kind of inspiration/base/...
12:50 poikilotherm I could add this as a TODO and you push a commit correcting this.
12:56 poikilotherm pdurbin: just pushed two commits with your requested changes
13:06 pdurbin poikilotherm: thanks but why is mraCollection still in there? That's from customMRA.tsv
13:07 pdurbin hbgdkiBirthWeight is still in there too. It's from custom_hbgdki.tsv
13:07 poikilotherm That's just because I'm stupid
13:08 pdurbin I blame grep. :)
13:08 poikilotherm Forgot that copyField uses source=, not name=
13:08 poikilotherm I'll amend and force push, ok?
13:09 poikilotherm (So nobody will ever know...)
13:09 pdurbin heh, that's fine
13:10 poikilotherm Here you go. Shortly AFK
13:14 Jerry19 joined #dataverse
13:15 Jerry19 To whom it may concern, is this the place to ask questions about Counter Processor?
13:15 poikilotherm pdurbin that's for you :-D
13:15 poikilotherm Hello Jerry19
13:16 Jerry19 Hi poikilotherm
13:17 Jerry19 I'm Jerry, working at Purdue University Libraries
13:17 poikilotherm Welcome Jerry :-)
13:18 poikilotherm IIRC Phil (pdurbin) was the one who implemented the counter code
13:19 pdurbin Jerry19: nice, both of my nieces are going to Purdue. One of them is majoring in animation and video game design. I'm jealous. :)
13:20 Jerry19 Nice, congratulations to them. You might also come if you want to :)
13:22 pdurbin It would be neat to check out the labs or whatever. :)
13:23 Jerry19 Hi pdurbin, I ran the Counter Processor against our datasets' access logs, and the result shows that the view number for a dataset is 0; however, I think it is supposed to return a positive number. Do you want me to post the log here?
13:23 pdurbin Anyway, I did not write the Counter Processor code but I helped get it working with Dataverse and I'm happy to answer questions.
13:25 Jerry19 "data-type": "dataset", "yop": "2014", "uri": "https://purr.purdue.edu/publications/1561/1", "performance": [ { "period": { "begin-date": "2018-01-01", "end-date": "2018-01-31" }, "instance": [] } ] },
13:25 Jerry19 Thank you, pdurbin
13:25 pdurbin Hmm. I'm trying to think of the best place to post it. An issue at https://github.com/IQSS/dataverse/issues would probably be better. I'm not sure if you're worried about IP addresses being in a public place though. This channel is logged and GitHub Issues is public.
13:26 pdurbin Yeah an empty array under "instance" is no good.
13:27 Jerry19 May I discuss the issue through email with you?
13:28 pdurbin Sure, email is fine. The best way would be to attach your log in an email to support@dataverse.org . That'll create a private support ticket.
13:29 Jerry19 Sounds good. I will send the email in a short time. Thank you in advance.
13:37 pdurbin Sure, no problem.
13:38 pdurbin Jerry19: while you're thinking about all this Make Data Count and Counter Processor stuff, please feel free to leave a comment at https://github.com/IQSS/dataverse/issues/6082
13:39 pdurbin Jerry19: also, I don't know if this helps or not but when I want to play around with Counter Processor, I enable it here: https://github.com/IQSS/dataverse-ansible/blob/be0b2aef038cb8f82e3f3b1f3b602fa8ffe1ddbc/defaults/main.yml#L33
13:50 Jerry19 Thank you, pdurbin.
13:51 pdurbin Jerry19: sure. Actually, in practice I would probably change counter enabled to true here: https://github.com/IQSS/dataverse-sample-data/blob/407aba72cfa140ae18e9ab80d2e504e0686252a9/ec2config.yaml#L33
13:51 pdurbin and use the ec2-create-instance.sh script mentioned in the README of that "sample data" repo
14:12 pdurbin poikilotherm: I just left you another review :)
14:17 poikilotherm Thx pdurbin
14:17 pdurbin poikilotherm: is it easy for you to link up that tsv?
14:17 poikilotherm I just copied the link...?
14:17 poikilotherm Its a google doc
14:17 pdurbin no sorry
14:17 pdurbin there are two links in the other examples
14:18 pdurbin one to a doc
14:18 pdurbin one to a tsv
14:19 poikilotherm Ah I didn't see those... The file is a bit messy with all those links and all in one line
14:19 poikilotherm I'll add the link
14:19 pdurbin yeah, it is messy
14:20 pdurbin and while you're in there you could move it down like Julian suggested.
14:22 pdurbin ... to make it clear that the journal metadata block isn't based on a standard
14:24 pdurbin Julian just left a comment.
14:33 poikilotherm pdurbin: I can add a line to the setup-optional-harvard.sh - would you test and maybe debug?
14:34 pdurbin poikilotherm: sure!
14:35 pdurbin poikilotherm: my other thought was that you could add a comment to setup-datasetfields.sh above citation.tsv saying it's required
14:35 pdurbin and if you want a comment later that says the other tsv files are optional
14:40 poikilotherm I'm fiddling with the env var handling... Thinking about adding getopts parsing to make it easier
14:42 poikilotherm Passing env vars to sudo is not very easy :-D
14:42 pdurbin poikilotherm: ok, but don't break anything :)
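On the sudo/env-var point above: sudo's env_reset drops most of the caller's environment, so variables have to be passed explicitly on the command line (`sudo VAR=value cmd`). A small sketch using `env -i`, which also starts from an empty environment, shows the same effect without needing root:

```shell
# env -i clears the environment, mimicking sudo's env_reset:
# a shell variable is not inherited unless passed explicitly.
GREETING=hello
env -i sh -c 'echo "value: $GREETING"'                       # variable not inherited
env -i GREETING="$GREETING" sh -c 'echo "value: $GREETING"'  # passed explicitly
```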
14:45 poikilotherm Ok, gotta run pick up kids. More tomorrow.
14:45 poikilotherm Cu
14:47 pdurbin that pull request is really coming together
14:55 pdurbin Jerry19: I just replied. One of the things I'm confused about is how your logs have lines from 2018. Are these Apache access logs? You have to use the special logs that Dataverse 4.12 (shipped in April 2019) and above create.
15:03 Jerry19 Hi Phil, yes, the logs that I sent you were created from the Apache access logs plus the corresponding metadata for each dataset in our database
15:04 Jerry19 Each line includes the 21 items that the Counter Processor requires
15:05 Jerry19 The first 10 items are extracted from the Apache access log
15:06 Jerry19 The other 11 items are pulled from the database. Then these two parts are combined to form the log file that Counter Processor can deal with
15:11 Jerry19 Hi Phil, please correct me if there is anything wrong. Thank you
16:36 pdurbin Jerry19: hi sorry, standup and then meetings after. You cannot use Apache access logs. It's not supported. I had the same question myself. Here's where I asked about it: https://github.com/CDLUC3/counter-processor/issues/3
16:46 Jerry19 Thank you, Phil. I'm checking it out.
16:52 pdurbin Jerry19: sure. To configure Dataverse to create the special logs that Counter Processor wants, you need to configure the ":MDCLogPath" database setting as described at http://guides.dataverse.org/en/4.16/admin/make-data-count.html#enable-logging-for-make-data-count
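Dataverse database settings like :MDCLogPath are set through the admin API with curl; a hedged sketch (the log directory below is only an example, and the admin API is assumed to be reachable on localhost):

```shell
# Sketch: set the :MDCLogPath database setting via the Dataverse admin API.
# Use a directory the application server can actually write to.
curl -X PUT -d '/usr/local/glassfish4/glassfish/domains/domain1/logs' \
  http://localhost:8080/api/admin/settings/:MDCLogPath
```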
16:55 Jerry19 Hi Phil, I read your thread and I think my situation is different. The logs I shared with you are not raw Apache access logs. They are a combination of the Apache access log and the corresponding dataset's metadata information.
16:58 Jerry19 Like yours, our Apache access logs don't include the datasets' metadata. I wrote another script to retrieve the metadata based on the information in each line of the Apache access log, and then combine the "apache access log" part with the "database information" part
16:59 Jerry19 2018-01-06T20:10:30+0000 98.223.104.111 - - - /publications/1561/572?media=Image:thumb 10.4231/R7B8562S - 1833 Mozilla/5.0 (Windows NT 6.2; Win64; x64; rv:52.0) Gecko/20100101 Firefox/52.0 MCD: Matlab programs for computing the Macdonald function for complex orders Purdue University Research Repository (PURR) re3data::10.17616/R3V90N
16:59 Jerry19 Such as the line above
16:59 Jerry19 2018-01-06T20:10:30+0000 98.223.104.111 - - - /publications/1561/572?media=Image:thumb 10.4231/R7B8562S - 1833 Mozilla/5.0 (Windows NT 6.2; Win64; x64; rv:52.0) Gecko/20100101 Firefox/52.0
17:00 Jerry19 Above part comes from our repository's server apache access log
17:00 Jerry19 MCD: Matlab programs for computing the Macdonald function for complex orders Purdue University Research Repository (PURR) re3data::10.17616/R3V90N Walter Gautschi 2014-04-22T16:50:00+0000 1 1561 https://purr.purdue.edu/publications/1561/1 2014
17:01 Jerry19 The other part as above shows comes from our database where dataset's metadata information are saved
17:01 Jerry19 The whole line above meets the Counter Processor's requirement
17:02 Jerry19 that each log line must contain 21 items, as described in the section "Items to log per line for processing" at https://github.com/CDLUC3/counter-processor
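The combining step Jerry19 describes might be sketched like this. The field values are made up, and only the 10 + 11 split is illustrated; the authoritative 21-column layout is the one in the counter-processor README:

```shell
# Sketch: join 10 Apache-derived fields with 11 database-derived metadata
# fields into one tab-separated Counter Processor log line.
access=$(printf '2018-01-06T20:10:30+0000\t98.223.104.111\t-\t-\t-\t/publications/1561/572\t10.4231/R7B8562S\t-\t1833\tMozilla/5.0')
metadata=$(printf 'MCD: Matlab programs\tPURR\tre3data::10.17616/R3V90N\tWalter Gautschi\t2014-04-22T16:50:00+0000\t1\t1561\thttps://purr.purdue.edu/publications/1561/1\t2014\t-\t-')
printf '%s\t%s\n' "$access" "$metadata" > combined.log
awk -F'\t' '{print NF}' combined.log   # each line should carry 21 fields
```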
17:03 pdurbin Jerry19: interesting! Are you saying you wrote a script to take old Apache logs and supplement them with enough information from Dataverse to put them into the format that Counter Processor requires?
17:03 Jerry19 So what I did was write a script to create the log files that Counter Processor can deal with, then run Counter Processor on them.
17:03 Jerry19 Yes, exactly
17:04 Jerry19 Sorry that I didn't mention it at the beginning
17:15 pdurbin_m joined #dataverse
17:16 pdurbin_m Jerry19: is your script open source?!? Please say yes. :)
17:19 Jerry31 joined #dataverse
17:19 Jerry31 I accidentally quit the channel.
17:22 Jerry31 Back to my confusion: the reality is that there are more than 20 accesses to the dataset's web page, but Counter Processor doesn't report any investigation count.
17:25 pdurbin_m joined #dataverse
17:26 pdurbin_m Jerry31: I got disconnected too. Typing with thumbs. Is your script open source?
17:38 Jerry31 Hi Phil, the script is not open source for now.
17:42 Jerry31 Hi Phil, can you please explain how you run the Counter Processor on your hub? Don't you create the log files that Counter Processor requires and deals with?
18:01 pdurbin Jerry31: hi, sorry, I was out for a walk after lunch. Harvard Dataverse has not yet put Counter Processor into production. Here's the issue we're using to track this but it's not in our current sprint: https://github.com/orgs/IQSS/projects/2#card-23210979
18:20 Jerry31 ok, thank you, Phil.
18:25 pdurbin Jerry31: sure. What I'm saying is that you're a pioneer. There's one other guy, Jim Myers, who is starting to create issues about Make Data Count, like this one: https://github.com/IQSS/dataverse/issues/6138
18:25 pdurbin So you are very, very welcome to open issues too! :)
18:30 pdurbin Jerry31: what if you run Counter Processor on the special logs produced by Dataverse? Does it work? Do you get views and downloads?
19:01 Jerry31 Hi Phil, will do. Can you please send me the logs produced by Dataverse? I lost that information when I quit last time
19:05 pdurbin Jerry31: sorry, I don't have any logs like that handy. One thing I could do is spin up an installation of Dataverse on EC2 with Make Data Count and Counter Processor enabled and give you ssh access to it. Then we could both look at the logs. What do you think?
20:45 Jerry31 Hi Phil, probably not this time. Thank you for all the help today. Let me know if I can help on anything.
20:46 pdurbin Jerry31: ok, no problem. I'm heading home soon anyway. :)
