07:31
poikilotherm joined #dataverse
09:10
stefankasberger joined #dataverse
09:12
juancorr joined #dataverse
09:44
stefankasberger joined #dataverse
10:23
pdurbin joined #dataverse
10:24
pdurbin
Hi, all. I did a little research into upgrading http://chat.dataverse.org from Shout over the weekend.
10:31
poikilotherm
Hi pdurbin :-)
10:39
poikilotherm
pdurbin do you think there is a chance someone will take a look at my PRs today?
10:56
pdurbin
I was going to look at the Solr one.
10:57
poikilotherm
Great
10:57
pdurbin
I just assigned myself to it.
11:02
poikilotherm
:-) :-) :-)
11:06
pdurbin
Have you seen Adam Bien's pom.xml files for Java EE 7 that have only 21 lines? This: http://www.adam-bien.com/roller/abien/entry/essential_javaee_7_pom_xml
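For reference, the shape of the minimal Java EE 7 pom.xml that post describes is roughly the following. This is a sketch reconstructed from memory, not copied from the post; the `com.example` coordinates are placeholders:

```xml
<?xml version="1.0"?>
<project xmlns="http://maven.apache.org/POM/4.0.0">
    <modelVersion>4.0.0</modelVersion>
    <!-- placeholder coordinates -->
    <groupId>com.example</groupId>
    <artifactId>app</artifactId>
    <version>1.0-SNAPSHOT</version>
    <packaging>war</packaging>
    <dependencies>
        <!-- the whole platform, provided by the application server -->
        <dependency>
            <groupId>javax</groupId>
            <artifactId>javaee-api</artifactId>
            <version>7.0</version>
            <scope>provided</scope>
        </dependency>
    </dependencies>
    <properties>
        <maven.compiler.source>1.8</maven.compiler.source>
        <maven.compiler.target>1.8</maven.compiler.target>
        <failOnMissingWebXml>false</failOnMissingWebXml>
    </properties>
</project>
```

The point is that a single `provided`-scope dependency on the platform API replaces a long list of individual libraries.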
11:08
poikilotherm
Nope.
11:08
poikilotherm
But I kinda doubt he will use it
11:09
yoh joined #dataverse
11:09
poikilotherm
In his podcast from last week about the first line of Quarkus he told us differently
11:09
poikilotherm
He's now into adding everything up front, hacking away, and removing things later. But obviously stay as much with EE as possible.
11:10
pdurbin
Right, stick to the provided APIs if you can. That's what I mean.
11:10
poikilotherm
:-)
11:10
poikilotherm
You'd better not look at the Dataverse POM :-D
11:10
pdurbin
Minimize your external dependencies. Use the framework you're on.
11:11
pdurbin
I know, I know. I just left a comment on your pull request.
11:11
* poikilotherm
goes looking
11:13
poikilotherm
Re :-)
11:28
pdurbin
poikilotherm: I moved the pom.xml pull request over. You aren't blocked on the Solr one, it seems. You're mostly just suggesting a better approach.
11:28
pdurbin
Using existing APIs.
11:29
yoh joined #dataverse
11:29
poikilotherm
Yeah. I'm not blocked at all, but it would be superb to have this sooner rather than later so I can go ahead
11:29
pdurbin
Yeah.
11:30
pdurbin
I think the only thing I'm wondering about is if we should take the Harvard-specific stuff out.
11:31
pdurbin
Out of schema_dv_cmb_copies.xml and schema_dv_cmb_fields.xml, I mean. Do you think you could try taking the Harvard-specific fields out?
11:32
poikilotherm
I could. But that will break backward compatibility
11:33
poikilotherm
People might depend on the existing stuff and reuse Harvard specials
11:33
poikilotherm
If you guys tell me to go ahead, I'll do it :-)
11:33
poikilotherm
This is just about removing the fields, which is not a lot of work to do... ;-)
11:36
pdurbin
I think it would be much cleaner and leaner to remove those Harvard-specific fields. I just assigned it to you and moved it to community dev. If it's too hard or weird, please just ping me.
11:39
poikilotherm
https://github.com/IQSS/dataverse/pull/6146#issuecomment-529431257
12:13
pdurbin
Thanks, I replied with some screenshots and such. :)
12:14
poikilotherm
Yeah, I saw you did some great work :-D
12:14
poikilotherm
I'm trying to filter the TSV fields so I don't miss a field
12:18
pdurbin
Hmm.
12:19
pdurbin
I would suggest a clean installation with just the standard 6 metadata blocks. Then run your new script. That should do it, right?
12:20
pdurbin
Are you sorry you asked me to look at your pull requests? :)
12:21
pdurbin
Again, if it's too hard or weird, please let me know.
12:23
poikilotherm
It's no problem at all and I fully agree
12:23
pdurbin
phew
12:23
pdurbin
Do you think you could add that 6th block to the appendix too?
12:25
poikilotherm
cat scripts/api/data/metadatablocks/customCHIA.tsv | grep -A5000 "#datasetField" | grep -B5000 "#controlledVocabulary" | grep -E -e "^\s+" | cut -f2
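The pipeline above enters the `#datasetField` section, stops at `#controlledVocabulary`, keeps the indented data rows, and prints the second column. A single awk pass can express the same idea; this is a sketch only, with the column layout assumed from the pipeline above:

```shell
# Print the field names (column 2) from the #datasetField section of a
# metadata block TSV. Data rows begin with a tab, so field 1 is empty.
awk -F'\t' '
  /^#controlledVocabulary/ { exit }       # stop at the next section header
  in_section && $1 == ""   { print $2 }   # indented data row: print the name
  /^#datasetField/         { in_section = 1 }
' scripts/api/data/metadatablocks/customCHIA.tsv
```

One pass, no `-A5000`/`-B5000` guesswork about how long the sections are.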
12:25
poikilotherm
Getting there :-D
12:25
poikilotherm
Sure
12:25
pdurbin
Thanks!
12:25
pdurbin
Even more added value in this pull request. :)
12:27
poikilotherm
Happy to be helpful :-)
12:27
pdurbin
That 5 vs 6 mismatch has bothered me for a long time. Very untidy. :)
12:34
poikilotherm
Ah I like grepping
12:34
poikilotherm
I just generated a list of fields and used sed to kill the lines in both files
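A sketch of that approach, assuming a hypothetical `fields.txt` with one field name per line. Deleting on both `name=` and `source=` covers both the field declarations and the copyField directives:

```shell
# For each unwanted field, delete its <field name="..."> declaration and
# any <copyField source="..."> directive from both schema files.
# fields.txt is hypothetical: one field name per line.
while read -r field; do
  sed -i -e "/name=\"$field\"/d" -e "/source=\"$field\"/d" \
    schema_dv_cmb_fields.xml schema_dv_cmb_copies.xml
done < fields.txt
```

(This uses GNU sed's in-place `-i`; keep a backup or work on a copy.)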
12:35
poikilotherm
It's interesting that not all fields also have a fulltext index
12:45
poikilotherm
pdurbin where is this journal.tsv coming from?
12:45
poikilotherm
There's no obvious reference in the docs
12:50
poikilotherm
I found the Google Docs link, but I have no idea what it's based on... Every other schema has some kind of inspiration/base/...
12:50
poikilotherm
I could add this as a TODO and you could push a commit correcting it.
12:56
poikilotherm
pdurbin: just pushed two commits with your requested changes
13:06
pdurbin
poikilotherm: thanks but why is mraCollection still in there? That's from customMRA.tsv
13:07
pdurbin
hbgdkiBirthWeight is still in there too. It's from custom_hbgdki.tsv
13:07
poikilotherm
That's just because I'm stupid
13:08
pdurbin
I blame grep. :)
13:08
poikilotherm
Forgot that copyField uses source=, not name=
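That's the distinction: in a Solr schema, `<field>` declarations are keyed by `name=`, while `<copyField>` directives reference them via `source=` and `dest=`. An illustrative fragment; the field name is just the example from the conversation, and the `type` and attribute values are assumptions:

```xml
<!-- A field declaration is matched by name= ... -->
<field name="mraCollection" type="text_en" stored="true" indexed="true" multiValued="true"/>
<!-- ...but the corresponding copyField references it via source= -->
<copyField source="mraCollection" dest="_text_" maxChars="3000"/>
```

So a deletion script that only matches on `name=` leaves the copyField lines behind.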
13:08
poikilotherm
I'll amend and force push, ok?
13:09
poikilotherm
(So nobody will ever know...)
13:09
pdurbin
heh, that's fine
13:10
poikilotherm
Here you go. Shortly AFK
13:14
Jerry19 joined #dataverse
13:15
Jerry19
To whom it may concern: is this the place to ask questions about Counter Processor?
13:15
poikilotherm
pdurbin that's for you :-D
13:15
poikilotherm
Hello Jerry19
13:16
Jerry19
Hi poikilotherm
13:17
Jerry19
I'm Jerry, working at Purdue University Libraries
13:17
poikilotherm
Welcome Jerry :-)
13:18
poikilotherm
IIRC Phil (pdurbin) was the one who implemented the counter code
13:19
pdurbin
Jerry19: nice, both of my nieces are going to Purdue. One of them is majoring in animation and video game design. I'm jealous. :)
13:20
Jerry19
Nice, congratulations to them. You might also come visit if you want to :)
13:22
pdurbin
It would be neat to check out the labs or whatever. :)
13:23
Jerry19
Hi pdurbin, I ran the Counter Processor against our datasets' access logs, and the result shows that the view number for a dataset is 0; however, I think it is supposed to be a positive number. Do you want me to post the log here?
13:23
pdurbin
Anyway, I did not write the Counter Processor code but I helped get it working with Dataverse and I'm happy to answer questions.
13:25
Jerry19
"data-type": "dataset", "yop": "2014", "uri": "https://purr.purdue.edu/publications/1561/1 ", "performance": [ { "period": { "begin-date": "2018-01-01", "end-date": "2018-01-31" }, "instance": [] } ] },
13:25
Jerry19
Thank you, pdurbin
13:25
pdurbin
Hmm. I'm trying to think of the best place to post it. An issue at https://github.com/IQSS/dataverse/issues would probably be better. I'm not sure if you're worried about IP addresses being in a public place though. This channel is logged and GitHub Issues is public.
13:26
pdurbin
Yeah an empty array under "instance" is no good.
13:27
Jerry19
May I discuss the issue through email with you?
13:28
pdurbin
Sure, email is fine. The best way would be to attach your log in an email to support@dataverse.org . That'll create a private support ticket.
13:29
Jerry19
Sounds good. I will send the email in a short time. Thank you in advance.
13:37
pdurbin
Sure, no problem.
13:38
pdurbin
Jerry19: while you're thinking about all this Make Data Count and Counter Processor stuff, please feel free to leave a comment at https://github.com/IQSS/dataverse/issues/6082
13:39
pdurbin
Jerry19: also, I don't know if this helps or not but when I want to play around with Counter Processor, I enable it here: https://github.com/IQSS/dataverse-ansible/blob/be0b2aef038cb8f82e3f3b1f3b602fa8ffe1ddbc/defaults/main.yml#L33
13:50
Jerry19
Thank you, pdurbin.
13:51
pdurbin
Jerry19: sure. Actually, in practice I would probably change counter enabled to true here: https://github.com/IQSS/dataverse-sample-data/blob/407aba72cfa140ae18e9ab80d2e504e0686252a9/ec2config.yaml#L33
13:51
pdurbin
and use the ec2-create-instance.sh script mentioned in the README of that "sample data" repo
14:12
pdurbin
poikilotherm: I just left you another review :)
14:17
poikilotherm
Thx pdurbin
14:17
pdurbin
poikilotherm: is it easy for you to link up that tsv?
14:17
poikilotherm
I just copied the link...?
14:17
poikilotherm
Its a google doc
14:17
pdurbin
no sorry
14:17
pdurbin
there are two links in the other examples
14:18
pdurbin
one to a doc
14:18
pdurbin
one to a tsv
14:19
poikilotherm
Ah I didn't see those... The file is a bit messy with all those links and all in one line
14:19
poikilotherm
I'll add the link
14:19
pdurbin
yeah, it is messy
14:20
pdurbin
and while you're in there you could move it down like Julian suggested.
14:22
pdurbin
... to make it clear that the journal metadata block isn't based on a standard
14:24
pdurbin
Julian just left a comment.
14:33
poikilotherm
pdurbin: I can add a line to the setup-optional-harvard.sh - would you test and maybe debug?
14:34
pdurbin
poikilotherm: sure!
14:35
pdurbin
poikilotherm: my other thought was that you could add a comment to setup-datasetfields.sh above citation.tsv saying it's required
14:35
pdurbin
and, if you want, a comment later that says the other tsv files are optional
14:40
poikilotherm
I'm fiddling with the env var handling... Thinking about adding getopts parsing to make it easier
14:42
poikilotherm
Passing env vars to sudo is not very easy :-D
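A minimal sketch of the getopts idea, with the sudo caveat as a comment. The flag names and the SOLR_VERSION/SOLR_USER variables are hypothetical, not the installer's actual interface:

```shell
#!/bin/sh
# Hypothetical option parsing for an install script:
#   -s  Solr version    -u  service user
SOLR_VERSION=7.3.1
SOLR_USER=solr
while getopts "s:u:" opt; do
  case "$opt" in
    s) SOLR_VERSION=$OPTARG ;;
    u) SOLR_USER=$OPTARG ;;
    *) echo "usage: $0 [-s version] [-u user]" >&2; exit 1 ;;
  esac
done
# Note: plain environment variables don't survive sudo by default; pass them
# explicitly, e.g.  sudo SOLR_VERSION="$SOLR_VERSION" ./install.sh
# or preserve them with  sudo --preserve-env=SOLR_VERSION ./install.sh
echo "installing Solr $SOLR_VERSION as user $SOLR_USER"
```

Explicit flags sidestep the sudo environment problem entirely, since arguments pass through unchanged.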
14:42
pdurbin
poikilotherm: ok, but don't break anything :)
14:45
poikilotherm
Ok, gotta run pick up kids. More tomorrow.
14:45
poikilotherm
Cu
14:47
pdurbin
that pull request is really coming together
14:55
pdurbin
Jerry19: I just replied. One of the things I'm confused about is how your logs have lines from 2018. Are these Apache access logs? You have to use special logs that Dataverse 4.12 (shipped in April 2019) and above creates.
15:03
Jerry19
Hi Phil, yes, the logs that I sent you are created from the Apache access logs plus the corresponding metadata for each dataset in our database
15:04
Jerry19
Each line includes the 21 items that the Counter Processor requires
15:05
Jerry19
The first 10 items are extracted from the Apache access log
15:06
Jerry19
The other 11 items are pulled from the database. Then these two parts are combined to form a log file that Counter Processor can deal with
15:11
Jerry19
Hi Phil, please correct me if there is anything wrong. Thank you
16:36
pdurbin
Jerry19: hi sorry, standup and then meetings after. You cannot use Apache access logs. It's not supported. I had the same question myself. Here's where I asked about it: https://github.com/CDLUC3/counter-processor/issues/3
16:46
Jerry19
Thank you, Phil. I'm checking it out.
16:52
pdurbin
Jerry19: sure. To configure Dataverse to create the special logs that Counter Processor wants, you need to configure the ":MDCLogPath" database setting as described at http://guides.dataverse.org/en/4.16/admin/make-data-count.html#enable-logging-for-make-data-count
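:MDCLogPath is a regular Dataverse database setting, so (assuming the standard admin API on localhost, and a running Dataverse instance) setting it looks roughly like the fragment below; the log directory shown is an example path only:

```shell
# Example only: point Make Data Count logging at the app server's logs
# directory (adjust the path for your Glassfish/Payara domain).
curl -X PUT -d '/usr/local/glassfish4/glassfish/domains/domain1/logs' \
  http://localhost:8080/api/admin/settings/:MDCLogPath
```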
16:55
Jerry19
Hi Phil, I read your thread and I think my situation is different. The logs I shared with you are not Apache access logs. They are a combination of the Apache access log and the corresponding dataset's metadata.
16:58
Jerry19
Same as yours, our Apache access logs don't include the metadata of the dataset. I wrote another script to retrieve the metadata based on the information in each line of the Apache access log, and then combine the "Apache access log" with the "database information"
16:59
Jerry19
2018-01-06T20:10:30+0000 98.223.104.111 - - - /publications/1561/572?media=Image:thumb 10.4231/R7B8562S - 1833 Mozilla/5.0 (Windows NT 6.2; Win64; x64; rv:52.0) Gecko/20100101 Firefox/52.0 MCD: Matlab programs for computing the Macdonald function for complex orders Purdue University Research Repository (PURR) re3data::10.17616/R3V90N
16:59
Jerry19
Such as the line above
16:59
Jerry19
2018-01-06T20:10:30+0000 98.223.104.111 - - - /publications/1561/572?media=Image:thumb 10.4231/R7B8562S - 1833 Mozilla/5.0 (Windows NT 6.2; Win64; x64; rv:52.0) Gecko/20100101 Firefox/52.0
17:00
Jerry19
The part above comes from our repository server's Apache access log
17:00
Jerry19
MCD: Matlab programs for computing the Macdonald function for complex orders Purdue University Research Repository (PURR) re3data::10.17616/R3V90N Walter Gautschi 2014-04-22T16:50:00+0000 1 1561 https://purr.purdue.edu/publications/1561/1 2014
17:01
Jerry19
The other part, shown above, comes from our database, where the datasets' metadata is saved
17:01
Jerry19
The whole line above meets the requirement of the Counter Processor
17:02
Jerry19
That is, the log that Counter Processor deals with must contain 21 items, as introduced in the section "Items to log per line for processing" at https://github.com/CDLUC3/counter-processor
17:03
pdurbin
Jerry19: interesting! Are you saying you wrote a script to take old Apache logs and supplement them with enough information from Dataverse to put them into the format that Counter Processor requires?
17:03
Jerry19
So what I did was write a script to create log files that Counter Processor can deal with, then run Counter Processor on them.
17:03
Jerry19
Yes, exactly
17:04
Jerry19
Sorry that I didn't mention it at the beginning
17:15
pdurbin_m joined #dataverse
17:16
pdurbin_m
Jerry19: is your script open source?!? Please say yes. :)
17:19
Jerry31 joined #dataverse
17:19
Jerry31
I accidentally quit the channel.
17:22
Jerry31
Back to my confusion: the reality is that there are more than 20 accesses to the dataset's web page, but Counter Processor doesn't count any investigations.
17:25
pdurbin_m joined #dataverse
17:26
pdurbin_m
Jerry31: I got disconnected too. Typing with thumbs. Is your script open source?
17:38
Jerry31
Hi Phil, the script is not open source for now.
17:42
Jerry31
Hi Phil, can you please explain how you run the Counter Processor on your hub? Don't you create the log files that Counter Processor requires and deals with?
18:01
pdurbin
Jerry31: hi, sorry, I was out for a walk after lunch. Harvard Dataverse has not yet put Counter Processor into production. Here's the issue we're using to track this but it's not in our current sprint: https://github.com/orgs/IQSS/projects/2#card-23210979
18:20
Jerry31
ok, thank you, Phil.
18:25
pdurbin
Jerry31: sure. What I'm saying is that you're a pioneer. There's one other guy, Jim Myers, who is starting to create issues about Make Data Count, like this one: https://github.com/IQSS/dataverse/issues/6138
18:25
pdurbin
So you are very, very welcome to open issues too! :)
18:30
pdurbin
Jerry31: what if you run Counter Processor on the special logs produced by Dataverse? Does it work? Do you get views and downloads?
19:01
Jerry31
Hi Phil, will do. Can you please send me the logs produced by Dataverse? I lost the information when I quit last time
19:05
pdurbin
Jerry31: sorry, I don't have any logs like that handy. One thing I could do is spin up an installation of Dataverse on EC2 with Make Data Count and Counter Processor enabled and give you ssh access to it. Then we could both look at the logs. What do you think?
20:45
Jerry31
Hi Phil, probably not this time. Thank you for all the help today. Let me know if I can help on anything.
20:46
pdurbin
Jerry31: ok, no problem. I'm heading home soon anyway. :)