Time | Nick | Message
14:44 |  | iqlogbot joined #dvn
14:44 |  | Topic for #dvn is now http://thedata.org - The Dataverse Network Project | logs at http://irclog.iq.harvard.edu/dvn/today
14:45 | jwhitney | pdurbin: yes, although we do want to allow multiple files: https://docs.google.com/file/d/0B8Zfl4GMgyejMlhFOUU5M0p4c3M/edit
14:51 | jwhitney | pdurbin: (these are just mockups: the file description form has some fields that should describe the study, instead)
14:51 | jwhitney | pdurbin: (they're a bit out of date)
14:58 | pdurbin | jwhitney: hmm, ok
14:58 | pdurbin | you're back!
14:59 | pdurbin | jwhitney: sorry, that was for iqlogbot :) ... logging is back http://irclog.iq.harvard.edu/dvn/2013-07-29
14:59 | jwhitney | pdurbin: :)
15:00 | pdurbin | jwhitney: I think it would be great if you played with the SWORD API as it stands right now. I can point you to the curl commands. It's still very rough but it'll give you an idea of its current state
15:01 | pdurbin | https://github.com/IQSS/dvn/tree/develop/tools/scripts/data-deposit-api contains all the scripts and I'm happy to walk you through them
15:03 | jwhitney | pdurbin: yep, ok.
15:03 | pdurbin | jwhitney: the biggest thing that's on my mind is... what will the binary file you send look like? You mentioned simple zip... After I receive the zip, I should unzip it and look for files inside and ingest them one by one? And also look for a metadata file in there?
15:04 | jwhitney | pdurbin: that's one approach: study metadata in the atom entry, then include file-level metadata in the zip
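
To make the packaging idea concrete, here is a minimal sketch of what such a deposit could look like, with study-level metadata in the Atom entry and per-file metadata riding inside the zip. All names below (the zip contents, the files.xml idea) are illustrative assumptions, not an agreed format:

    # Sent first: an Atom entry carrying the study-level metadata (title, authors, ...)
    atom-entry.xml
    # Sent second, as the media resource: a zip of the actual content
    example.zip
    ├── data.dta        # the dataset itself (Stata, in this example)
    ├── readme.txt      # documentation explaining the dataset
    └── files.xml       # hypothetical per-file metadata (description, data type, ...)
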
15:05 | pdurbin | right now my implementation takes whatever binary file is sent and attempts to ingest it. So if you send an Rdata file it will be ingested as Rdata. Same for a Stata file, I assume, but I haven't tried this yet.
15:05 | pdurbin | but it sounds like I should always expect a zip instead?
15:05 | jwhitney | pdurbin: I think so, yes: even if there is only one file, there may be associated metadata
15:06 | jwhitney | pdurbin: something like DSpace's simple archive format, maybe https://wiki.duraspace.org/display/DSDOC3x/Importing+and+Exporting+Items+via+Simple+Archive+Format#ImportingandExportingItemsviaSimpleArchiveFormat-ItemImporterandExporter
15:08 | pdurbin | jwhitney: ok. Alex also seemed interested in BagIt: http://en.wikipedia.org/wiki/BagIt
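
For readers who have not met either format: a BagIt "bag" is just a payload directory plus a few tag files. A minimal sketch (the payload file names are illustrative):

    example-bag/
    ├── bagit.txt            # declares the BagIt version and tag-file encoding
    ├── bag-info.txt         # optional free-form metadata about the bag
    ├── manifest-md5.txt     # one "<md5 checksum>  <path>" line per payload file
    └── data/                # the payload itself
        ├── data.dta
        └── readme.txt
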
15:09 | pdurbin | (I've only barely heard of both of these formats.)
15:09 | jwhitney | pdurbin: bagit seems more straightforward
15:09 | pdurbin | straightforward is good :)
15:10 | pdurbin | jwhitney: do you think we should formally support BagIt? or just use it as a model for now?
15:10 | jwhitney | pdurbin: I've worked with the DSpace format, have only read about bagit.
15:13 | pdurbin | jwhitney: the way upload works in DVN now is that you can upload a single file and specify the format (Rdata vs. Stata vs. etc.)
15:13 | pdurbin | or you can upload a zip file that has a bunch of files in it
15:14 | pdurbin | which gets unzipped... and all the files get ingested
15:14 | pdurbin | so it's very simple
15:14 | jwhitney | pdurbin: Ok.
15:15 | pdurbin | that would probably be the easiest thing for me to support out of the gate
15:15 |  | posixeleni joined #dvn
15:15 | pdurbin | which is called "simple zip" in the SWORD spec
15:15 | pdurbin | posixeleni: hi!
15:16 | posixeleni | hi folks! just wanted clarification on how OJS would handle the supplementary files that you send over to DVN
15:16 | pdurbin | jwhitney: are you saying you're familiar with METSDSpaceSIP? That's also in the SWORD spec as an example
15:16 | posixeleni | so if I understand it correctly: OJS will allow authors to deposit multiple files
15:17 | posixeleni | Then when it comes time to send it to DVN it is packaged into a simple zip file and sent via API?
15:17 | pdurbin | posixeleni: right on both counts
15:17 | posixeleni | cool sorry to interrupt!
15:17 | jwhitney | posixeleni: not at all!
15:18 | pdurbin | posixeleni: sorry, iqlogbot was broken but jwhitney or I will paste the whole chat to a Google Doc when we're done
15:18 |  | posixeleni joined #dvn
15:18 | posixeleni | thanks so much!
15:19 | pdurbin | I was saying that right now I'm just ingesting whatever binary file is sent ... but it sounds like I need to switch to expecting a zip file, which I will unzip ... and then ingest the files one by one
15:19 | pdurbin | jwhitney: right?
15:20 | jwhitney | pdurbin: yes, if OJS needs to send along file-level metadata, which it seems it does
15:20 | jwhitney | pdurbin: data type, at the very least.
15:21 | pdurbin | jwhitney: well, even if metadata is not necessary... right now you would have to send files one by one
15:21 | pdurbin | which we probably don't want
15:24 | jwhitney | what's typical? I think OJS has to allow the possibility of multiple files, but if most articles will only have a single file...
15:25 | pdurbin | posixeleni: it's quite common for studies to have multiple files, right?
15:25 | posixeleni | more common than not, since they will have the dataset and then a different file for documentation explaining the dataset (readme)
15:25 | jwhitney | right, ok
15:26 | pdurbin | I feel like everything I've read and watched so far suggests that a zip gets sent across during a binary deposit in the SWORD protocol.
15:28 | pdurbin | It was easier for me to simply accept any file as-is (not zipped) and attach it to a study, but again, I think I should change this... I should advertise via SWORD that I accept "simple zip" and then accept a zip file and unzip it
15:29 | pdurbin | jwhitney: ready for a quick walk-through of the curl commands?
15:29 | jwhitney | pdurbin: ok, sure
15:29 | pdurbin | great. the starting point is https://github.com/IQSS/dvn/tree/develop/tools/scripts/data-deposit-api
15:30 | jwhitney | pdurbin: yep, I have been walking through your scripts
15:30 | pdurbin | the create-study-deposit-data script is a wrapper around a bunch of shell scripts that call curl: https://github.com/IQSS/dvn/blob/develop/tools/scripts/data-deposit-api/create-study-deposit-data
15:31 | pdurbin | to explain each curl command, the wrapper script does the following:
15:31 | pdurbin | 1. retrieve the service document using credentials for the journal dataverse in question
15:32 | pdurbin | 2. create a study based on an "atom entry" XML file
15:32 | pdurbin | 3. list studies (should be incremented by one each time)
15:32 | pdurbin | 4. add a file to the study that was just created
15:33 | pdurbin | 5. make sure error handling is working (sorry, I threw this in just for myself)
15:33 | jwhitney | pdurbin: ok, great.
15:33 | pdurbin | 6. retrieve the SWORD statement for the study
15:34 | pdurbin | I'm definitely not sure I'm implementing all this correctly, but it's a start :)
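
As a rough map of those steps, here is what the sequence might look like as bare curl calls. Every URL is a placeholder (the real IRIs come from the service document and the deposit receipts), and step 5, the error-handling check, is omitted:

    BASE="https://dvn.example.edu/api/data-deposit"   # placeholder base URL
    AUTH="$USERNAME:$PASSWORD"                        # journal dataverse credentials

    # 1. retrieve the service document (lists the collections you may deposit into)
    curl -u "$AUTH" "$BASE/service-document"

    # 2. create a study by POSTing an Atom entry to the collection IRI
    curl -u "$AUTH" --data-binary @atom-entry.xml \
         -H "Content-Type: application/atom+xml;type=entry" \
         "$BASE/collection/dataverse/JOURNAL_ALIAS"

    # 3. list the studies in the collection (an Atom feed; the count should grow by one)
    curl -u "$AUTH" "$BASE/collection/dataverse/JOURNAL_ALIAS"

    # 4. add a file (or zip) to the new study -- see the SimpleZip example earlier

    # 6. retrieve the SWORD statement for the study
    curl -u "$AUTH" "$BASE/statement/study/hdl:1902.1/12345"
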
15:34 | jwhitney | :)
15:34 | pdurbin | you'll see "fakeIRI" and such in some places
15:34 | pdurbin | so it's a bit of a moving target
15:35 | jwhitney | pdurbin: just to make sure we're on the same page: you said, "... everything I've read and watched so far suggests that a zip gets sent across during a binary deposit in the SWORD protocol"
15:35 | pdurbin | and if you think I'm doing anything wrong spec-wise, please let me know! I want to make sure I'm implementing SWORD correctly
15:35 | jwhitney | pdurbin: do you feel that adding content in a zip to an existing resource is not quite in line w/ the spec?
15:35 | jwhitney | & same here!
15:36 | jwhitney | errr. want to make sure I'm sending content in a way that makes sense...
15:36 | pdurbin | jwhitney: it does feel strange... right now, adding the file to the study is a "replace" from a SWORD perspective
15:37 | jwhitney | pdurbin: that's true.
15:37 | pdurbin | because a PUT is a replace
15:38 | pdurbin | I tried to get clarification on this from the SWORD mailing list: [sword-app-tech] POST atom entry, then PUT media resource - http://www.mail-archive.com/sword-app-tech@lists.sourceforge.net/msg00331.html
15:38 | pdurbin | but so far it's just me writing to myself :(
15:38 | pdurbin | jwhitney: something for you to chew on is... how do we replace a file that has been added to a study?
15:39 | pdurbin | jwhitney: or... how do we replace 2 out of 5 files that have been added to a study?
15:44 | pdurbin | jwhitney: do you send all 5 files over again via PUT and I do a "replace" on the DVN side?
15:44 | pdurbin | it gets interesting :)
15:44 | jwhitney | pdurbin: that's what I've been thinking through... if I do that, can I provide information for you to know what's changed?
15:45 | pdurbin | jwhitney: you mean so you can send only the 2 changed files instead of all 5?
15:47 | jwhitney | pdurbin: no, I meant I'd send all content, but with enough metadata for you to know what's changed...
15:48 | jwhitney | pdurbin: but that's awkwad
15:48 | jwhitney | 'awkward'
15:48 | pdurbin | ah. sure. well... an md5sum for each of the 5 files would help in this case, right?
15:48 | jwhitney | yes
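
For example, with standard tools OJS could ship a BagIt-style checksum manifest alongside the content, and the receiving side could diff it against the previously stored one to see which files actually changed (the file and manifest names are illustrative):

    # sending side, before zipping: one "<md5sum>  <path>" line per payload file
    md5sum data/* > manifest-md5.txt

    # receiving side: lines that differ from the stored manifest mark the changed files
    diff previous/manifest-md5.txt manifest-md5.txt
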
15:48 | pdurbin | jwhitney: in Boston we say "awkwad" ... and "hazadous" ;)
15:48 | jwhitney | jwhitney: literal lol.
15:49 | jwhitney | pdurbin: oops, self-referential comment.
15:50 | pdurbin | jwhitney: one thing I'm wondering is whether you plan to persist in your database a unique identifier for each study and file that corresponds to the dataverse for a journal
15:51 | jwhitney | pdurbin: yes, we have to know if a study's been created for an article
15:52 | pdurbin | right. and studies have persistent, unique identifiers such as hdl:1902.1/12345
15:52 | pdurbin | jwhitney: but what about files? ... the best I could do is expose the database id of each file... and then we have to think about how studies can have multiple versions...
15:54 | pdurbin | jwhitney: anyway, for now we can focus on creating new stuff... but obviously I have a lot of questions about how stuff gets updated :)
15:55 | posixeleni | jwhitney: as do I!
15:56 | pdurbin | posixeleni: :) ... well I think this is on your agenda for the meeting in only a few hours :)
15:56 | jwhitney | sorry, minor interruption...
15:57 | jwhitney | & I'm concerned about storing metadata OJS-side that is updated DV-side.
15:58 | jwhitney | so after OJS has created studies and files in DV, I think it makes sense to store IDs and refresh as requested OJS-side.
15:59 | pdurbin | jwhitney: so... with the example of updating 2 of 5 files, what would happen?
16:01 | pdurbin | jwhitney: we can talk about it during the meeting if it's too much to type :)
16:03 | jwhitney | pdurbin: a fast think-through: author requests a view of files. OJS fetches study / file metadata from DV. Author edits study / file metadata. OJS PUTs new study metadata to the Edit-IRI. OJS PUTs a new package of content to the EM-IRI.
16:04 | jwhitney | pdurbin: glaring holes?
16:04 | jwhitney | pdurbin: assuming the package includes enough information for the server to identify changed files.
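
A hedged sketch of that round trip in curl, mapped onto SWORD v2's update operations (a PUT to the Edit-IRI replaces the study metadata; a PUT to the EM-IRI replaces the media resource, i.e. the whole fileset travels again), reusing the placeholder $BASE and $AUTH from the earlier sketch:

    # replace the study metadata: PUT a new Atom entry to the Edit-IRI
    curl -u "$AUTH" -X PUT --data-binary @atom-entry.xml \
         -H "Content-Type: application/atom+xml;type=entry" \
         "$BASE/edit/study/hdl:1902.1/12345"

    # replace the fileset: PUT a new package to the EM-IRI
    curl -u "$AUTH" -X PUT --data-binary @example.zip \
         -H "Content-Type: application/zip" \
         -H "Content-Disposition: filename=example.zip" \
         -H "Packaging: http://purl.org/net/sword/package/SimpleZip" \
         "$BASE/edit-media/study/hdl:1902.1/12345"
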
16:08 | pdurbin | jwhitney: so would I give you a view of the files via a SWORD statement? ... I need to look at the spec some more
16:08 | jwhitney | pdurbin: like here? http://swordapp.github.io/SWORDv2-Profile/SWORDProfile.html#protocoloperations_retrievingcontent_feed
16:09 | pdurbin | jwhitney: ah. so not a statement. yes, that looks right
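
The linked section describes retrieving the media resource as an Atom feed with one entry per file, which would serve as the "view of files" here. A hedged sketch, with the same placeholder URL and variables as before and the content negotiation the profile describes:

    # ask for the media resource as an Atom feed describing its parts (one entry per file)
    curl -u "$AUTH" -H "Accept: application/atom+xml;type=feed" \
         "$BASE/edit-media/study/hdl:1902.1/12345"
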
16:12 | pdurbin | jwhitney: thanks :)
16:18 | jwhitney | pdurbin: you're welcome, although I meant the '?' -- a double-check on my understanding of the spec.
16:20 | pdurbin | jwhitney: sure. but I think you're right
17:39 | pdurbin | here's the entire conversation (including the earlier part not logged by iqlogbot) in a Google Doc: https://docs.google.com/document/d/1XbaVsDTML0RohCtY5WNJMpFcC6xYhCjyFKyKA-cQBVI/edit?usp=sharing
19:21 |  | jwhitney joined #dvn
19:30 | pdurbin | jwhitney: good meeting. :) ... not sure if you saw but I did create that Google Doc with the rest of the chat from this morning
19:30 | jwhitney | pdurbin: no, but thanks