
IRC log for #dvn, 2014-03-17

We've moved! Please join #dataverse instead. The new logs are at http://irclog.iq.harvard.edu/dataverse/today


All times shown according to UTC.

Time S Nick Message
01:52 axfelix joined #dvn
02:00 axfelix joined #dvn
03:18 axfelix joined #dvn
03:24 axfelix joined #dvn
03:35 axfelix joined #dvn
03:56 garnett joined #dvn
12:00 ruebot joined #dvn
12:12 pdurbin sbmarks: when you have a moment
12:13 pdurbin yoh_: welcome! I don't think I've seen you here before
13:22 sbmarks pdurbin: yo
13:24 pdurbin sbmarks: you said "you could almost provide a CVS-like model for datasets" at https://groups.google.com/d/msg/dataverse-community/Q0HOKTR80rM/US3jDWlO1C0J
13:25 sbmarks let's see....that sounds like something I would say ;)
13:25 sbmarks ah yes
13:26 sbmarks pdurbin: this of course relates to the current conversation on the list?
13:29 pdurbin sbmarks: sure does
13:30 pdurbin sbmarks: dunno if you've had a chance to check out my thought experiment
13:30 LyndsySimon joined #dvn
13:30 pdurbin thought experiment: datasets as git repos: https://docs.google.com/document/d/18WDIS8hrFJvMJBcnRuQ8NfD-VxGq32vJ9WwlEgyyWZs/edit?usp=sharing
13:31 sbmarks i haven't yet, but now is as good a time as any!
13:31 pdurbin :)
13:31 jwhitney joined #dvn
13:32 pdurbin something new (to me) is http://dataprotocols.org/data-packages/ ... Dublin Core is even mentioned
13:32 pdurbin datapackage.json, etc. I'm not sure what do make of it
13:33 pdurbin s/do/to/
13:33 pdurbin jwhitney: mornin'. meeting today, I hear
13:33 jwhitney pdurbin: yessir.
13:34 pdurbin top o' the mornin', I mean
13:44 * pdurbin checks out "Archiving Reproducible Research with R and Dataverse by Thomas J. Leeper" at http://journal.r-project.org/archive/accepted/leeper.pdf via https://twitter.com/ChrisGandrud/status/445475741158088705
13:47 sbmarks ohhhhhh my
13:47 sbmarks that dvn R package is really interesting!
13:47 sbmarks i have had folks ask about this
13:47 pdurbin sbmarks: see also rOpenSci - dvn - Sharing Reproducible Research from R - http://ropensci.org/blog/2014/02/20/dvn-dataverse-network/
13:52 yoh_ pdurbin: Hello ;)  yes -- I (Yaroslav) am new here -- took your invitation and joined the room ;)
13:53 pdurbin yoh_: oh! hi! thanks for all the links! I just added comment about datapackage.json to the Google Doc
13:53 pdurbin yoh_: I also added the logo I created over the weekend to the bottom, even though you don't like my daggers ;)
13:55 yoh_ I just like squares more ;)
13:56 pdurbin yoh_: the squares in the qr code at http://datagit.org ?
13:57 pdurbin (or https://github.com/data-git )
14:00 yoh_ yeah ;)
14:00 pdurbin bleh! ;)
14:01 pdurbin maybe there's something in that rorschach test I'm not seeing :)
14:02 pdurbin yoh_: the specification for http://dataprotocols.org/data-packages/ really goes back to "12 November 2007"?
14:08 yoh_ might well be, remember Matthew Brett talking about some of ideas a while back
14:08 pdurbin nice
14:08 pdurbin well, I'm poking through the issue tracker
14:08 pdurbin this is right up our alley: Encapsulated extensibility · Issue #103 · dataprotocols/dataprotocols - https://github.com/dataprotocols/dataprotocols/issues/103
14:08 pdurbin "A Data Package author MAY add any number of additional fields beyond those listed in the specification here."
14:10 pdurbin we have all sorts of fields in Dataverse-land. we started with social science and astrophysics (big announcement today, by the way) and now we're trying to expand into biomed, generally: http://thedata.org/blog/major-dataverse-release-coming-spring-2014
14:16 pdurbin and other fields/domains in the future. we're trying to stay flexible
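A minimal sketch of the extensibility rule quoted above, assuming a descriptor built around `name` and `resources` as the Data Package spec is understood here (worth checking against the current version of the spec). The dataset name, the file name `survey.csv`, and the extra field `dataverse_subject` are hypothetical, chosen only to show a domain-specific field riding along with the standard ones.

```python
import json

# Minimal descriptor. "name" and "resources" follow one reading of the
# Data Package spec and should be checked against the current version.
descriptor = {
    "name": "example-dataset",              # hypothetical dataset name
    "resources": [{"path": "survey.csv"}],  # hypothetical data file
}

# Hypothetical extra field, allowed by the "MAY add any number of
# additional fields" sentence quoted in the log.
descriptor["dataverse_subject"] = "social science"

datapackage_json = json.dumps(descriptor, indent=2, sort_keys=True)
print(datapackage_json)
```

A consumer that only knows the standard fields can ignore `dataverse_subject`, which is what makes this kind of extension cheap.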
14:28 skay yoh_: you were in the cc of the email conversation about data versioning?
14:28 * skay is trying to link nicks to conversations I've had
14:29 * skay is Sheila
14:30 skay pdurbin: join the mailing list for data protocols and start a conversation with them about this. I want to see if they will start discussing this more
14:30 skay I also think some of them might idle in one of the okfn irc channels, but I am not certain about that
14:30 pdurbin skay: you're not the boss of me ;)
14:30 * skay is not the boss of you!
14:31 pdurbin heh
14:31 pdurbin skay: I'll see what I can do. Can't get too mired in all this. Gotta ship.
14:31 skay yeah I know that feeling :(
14:31 skay I guess it can wait. It has waited for centuries!
14:31 pdurbin skay: it's been worked on since 2007 apparently :)
14:32 skay I was disappointed not to see more work with bibjson, or maybe it is just not surfaced
14:32 pdurbin javaeebot: lucky bibjson
14:32 javaeebot pdurbin: http://www.bibjson.org/
14:32 pdurbin "BibJSON is a convention for representing bibliographic metadata in JSON; it makes it easy to share and use bibliographic metadata online."
14:32 pdurbin sounds good
14:32 skay when I asked about it on a mailing list, they did give a timely reply
14:34 pdurbin timely is good
14:34 skay I was starting to think of creating citation information for a compendia -- and there are no exact standards for representing src or data... so on the one side I was looking through conventions with bibtex fields, the xml schema for datacite.org, conversations with Victoria...
14:34 skay and on the other side wondering how I should represent these in json
14:34 skay blargh all in my head with no other programmers to bounce ideas off of
14:35 skay that means I will make ill-considered things because it is better to have thought experiments with more than just one person
14:35 pdurbin skay: we're definitely basing our new work on datacite
14:36 skay oh, and then I had to completely drop the conversation with myself to go focus on deliverables with higher priority
14:37 skay btw squirrel: I wish there were more #openhatch friendly FLOSS java projects which is why I pinged you there
14:37 pdurbin today?
14:37 skay there are way more friendly python projects I know about than java projects, but we do have people looking for java projects from time to time
14:37 skay no, a day or two ago?
14:37 pdurbin right right
14:38 skay I didn't want to bug you too much about it. I was wondering if you would be interested at some point in having some openhatch type of labels on things.
14:39 skay it is tricky since shepherding new contributors can take time away from shipping
14:39 pdurbin amen
14:40 skay so I don't want to just drop people on you... but if you have some tasks that are independent and don't soak up time it could be cool
14:40 skay but that is just a thought experiment for the future
14:41 skay datacite -- arg everyone wants to talk xml at each other
14:41 skay I know it is so well structured but I hate it
14:42 skay oh hey, I was also looking at the schema that crossref uses for describing citations
14:42 skay which is just a few things on top of unixref something or the other
14:43 skay because I wanted to be able to talk to crossref (perhaps deposit, perhaps only query)
14:43 skay so it is another schema you could look at, but not as rich as the datacite one
14:44 skay people should take a look at all these schema to gather together some nice commonalities
14:44 skay also, it would be nice to see how things get used in the wild so that one knows how much of the work that goes in to schemas is YAGNI versus actually used
14:45 skay because you don't want to consume cycles on working with things that don't get used
14:45 skay anyway, I should also go away-from-window to focus on work work
14:47 axfelix joined #dvn
14:56 LyndsySimon joined #dvn
15:02 pdurbin Linus had a nice anti-XML rant the other day: https://plus.google.com/+LinusTorvalds/posts/X2XVf9Q7MfV
15:37 pdurbin I can't make any sense of http://www.crossref.org/schema/documentation/unixref1.0/unixref.html
15:38 pdurbin handy link to get to their home page: http://crossref.org
15:42 pdurbin hmm. "DiffKit is an application, and a framework, for comparing two tables of data, field-by-field" -- http://www.diffkit.org via https://twitter.com/thosjleeper/status/445582381488287744
15:42 pdurbin "DiffKit is like the Unix diff utility, but for tables instead of lines of text."
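To make "diff for tables instead of lines of text" concrete, here is a rough sketch of a field-by-field comparison. It is not DiffKit's implementation or API; the function name `diff_tables`, the sample data, and the choice of `id` as the row key are all assumptions made for illustration.

```python
import csv
import io


def diff_tables(old_csv, new_csv, key):
    """Return (row_key, field, old_value, new_value) for each changed field.

    Rows are matched on the `key` column. Rows present in only one table
    are ignored in this sketch.
    """
    old_rows = {row[key]: row for row in csv.DictReader(io.StringIO(old_csv))}
    new_rows = {row[key]: row for row in csv.DictReader(io.StringIO(new_csv))}
    changes = []
    for row_key in sorted(old_rows.keys() & new_rows.keys()):
        for field, old_value in old_rows[row_key].items():
            new_value = new_rows[row_key].get(field)
            if old_value != new_value:
                changes.append((row_key, field, old_value, new_value))
    return changes


old = "id,name,score\n1,ann,10\n2,bob,20\n"
new = "id,name,score\n1,ann,12\n2,bob,20\n"
print(diff_tables(old, new, key="id"))  # [('1', 'score', '10', '12')]
```

Matching on a key column rather than on line position means a reordered file produces no spurious differences, which is the main thing a line-based diff gets wrong for tabular data.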
15:43 yoh_ skay: yes -- that was me (sorry for being lousy  with my replies)
15:49 pdurbin skay: thanks for all the comments at https://docs.google.com/document/d/18WDIS8hrFJvMJBcnRuQ8NfD-VxGq32vJ9WwlEgyyWZs/edit?usp=sharing !
16:19 skay pdurbin: bd
16:21 skay pdurbin: remember when I mentioned sumatra? I don't know if you checked in to it, but I will try a quicker run through of how someone would use it. as a user I'd set up a git repo with my data analysis code, and arrange it so that there is a directory where the results get dumped, call it Data/ and perhaps data is not tracked by git
16:21 skay I would trigger a run using a sumatra command which calls out to my code, and sumatra then captures things about the run and the data
16:21 skay it creates a hash, it stores dependencies (for python and R and a few other languages)
16:21 skay and it stores the parameters that were passed for the run
16:22 skay and the system that the job runs on
16:22 skay so this is all metadata that can be important as provenance information
16:22 skay I want to use it for researchcompendia because I want to track that data
16:22 pdurbin ah
16:23 pdurbin this is helping me understand what sumatra is
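The kind of record skay describes (a hash of the output, the parameters, the dependencies, the system) can be sketched in a few lines. This is not Sumatra's API or storage format; `capture_run` and the record's field names are hypothetical, and the list of loaded top-level modules is only a crude stand-in for real dependency tracking.

```python
import hashlib
import json
import platform
import sys


def capture_run(parameters, output_bytes):
    """Build a provenance record for one run (not Sumatra's actual format)."""
    return {
        # Parameters that were passed for the run.
        "parameters": parameters,
        # Hash of the produced data, so later copies can be verified.
        "output_sha1": hashlib.sha1(output_bytes).hexdigest(),
        # The system the job ran on.
        "python_version": platform.python_version(),
        "system": platform.platform(),
        # Crude stand-in for dependency capture: top-level loaded modules.
        "loaded_modules": sorted(
            name for name in list(sys.modules) if "." not in name
        ),
    }


record = capture_run({"alpha": 0.5}, b"result,1\n")
print(json.dumps(record, indent=2))
```

Storing such a record next to each output in Data/ would let someone later answer "which code, parameters, and machine produced this file?", which is the provenance question raised above.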
16:24 * skay that time on the desktop when you forget that focus does not follow eyes and hit ctl-p to navigate to the channel and instead pull up multiple browserconfig.properties
16:25 skay also my focus follows mouse stopped being set! arg!
16:25 LyndsySimon joined #dvn
16:27 pdurbin skay: focus!
16:47 axfelix joined #dvn
17:12 yoh_ skay, pdurbin: re sumatra -- might be of interest then for you guys a recent paper: http://journal.frontiersin.org/Journal/10.3389/fninf.2013.00044/full
17:13 pdurbin yoh_: IPython! yes! please check this out: Create IPython Notebook for Dataverse APIs · Issue #6 · IQSS/dvn-client-python - https://github.com/IQSS/dvn-client-python/issues/6
17:14 yoh_ I was pointing more to lancet ;-)
17:15 skay pdurbin: oh hey! Victoria is teaching at Berkeley this semester and has given a talk, I think, to Fernando's group -- and I pitched the idea that some of us should meet to do some hacking with ipython and stuff
17:15 skay pdurbin: well I don't know if that will happen for me, but I am pitching the idea to you in case you can do it. I can enjoy it vicariously
17:16 * pdurbin reads the title again: An automated and reproducible workflow for running and analyzing neural simulations using Lancet and IPython Notebook
17:16 pdurbin "Launch jobs, organize the output, and dissect the results." -- http://ioam.github.io/lancet/
17:16 skay yoh_: do you use neural ensemble tools? this blog post is really cool http://neuralensemble.blogspot.com/2014/03/docker-image-to-run-software-neural_3.html
17:17 skay because I want to dockerize papers for the tool I am working on. this means I am definitely enjoying that people are doing similar things out in the wild
17:17 skay http://bcbio.wordpress.com/2014/03/06/improving-reproducibility-and-installation-of-genomic-analysis-pipelines-with-docker/
17:17 skay sorry, squirrel
17:18 yoh_ I am in neuroimaging, so do not use neural ensemble -- just maintain few (e.g. brian) for debian
17:18 skay oh neat! you are a debian maintainer. bd.
17:18 skay bd is two thumbs up
17:19 * pdurbin had forgotten ;)
17:19 skay arg, I see I bookmarked the lancet paper 13 days ago and have not read it yet
17:20 skay and... I wish everyone working on these toolchains could join forces and collaborate on solving the problems together
17:21 pdurbin skay: #dvn is becoming #hackingscience, which I'm still willing to log for you :)
17:22 skay pdurbin: I think you can, okay. though to be honest it is mostly a travis bot
17:22 skay I like your logs
17:23 pdurbin me too
17:23 skay handy for crossing the streams. https://lists.okfn.org/pipermail/data-protocols/2014-March/000089.html
17:23 pdurbin travis bot? for builds of what?
17:23 skay researchcompendia, which does not have nearly enough tests I am embarrassed to say
17:24 pdurbin meh. send that bot to #researchcompendia then
17:24 skay good point
17:25 pdurbin bd
17:26 skay with hackingscience the idea was that Victoria was going to set up a blog/mailinglist/something for people to have technical discussions for writing toolchains for reproducible science
17:26 skay but then we are too busy to maintain a blog, etc. so maybe I should not have bothered to create a related irc channel
17:26 skay wishful thinking
17:27 pdurbin it's a good idea. I like big tents
17:36 skay pdurbin: yoh_: Rufus from okfn replied and mentioned #hyperdata as a place where people talk about data protocols and frictionless data https://lists.okfn.org/pipermail/data-protocols/2014-March/000090.html
17:37 skay and hopefully some people will check your googledoc and send you comments now that I've mentioned it on the list
17:37 axfelix joined #dvn
17:37 skay just giving you a heads up
17:45 pdurbin skay: thanks for linking to it
17:47 jwhitney joined #dvn
18:17 LyndsySimon joined #dvn
18:52 LyndsySimon joined #dvn
18:52 pdurbin I'm realizing that a link to http://blog.okfn.org/2013/07/02/git-and-github-for-data/ was sent around in an internal list shortly after it was posted but I didn't remember it. Found it by searching my mail.
18:55 pdurbin "Line-oriented text and its tools are, of course, far from perfect solutions to data storage and versioning. They will not work for datasets of every shape and size"
18:55 pdurbin yeah
18:56 pdurbin plenty of our data is not in CSV format
18:56 pdurbin Stata files, FITS files, etc.
19:07 sbmarks yeah, but
19:07 sbmarks many of those are parseable into subsettable files
19:07 sbmarks which are then exportable as CSV right?
19:08 sbmarks (or as something a little more semantically meaningful than CSV)
19:23 pdurbin sbmarks: I'm not the expert on this, but you're right. We "ingest" certain formats and make them available as CSV (or TSV or whatever).
19:44 pdurbin sbmarks: are you suggesting we put the CSV version in regular git and the binary version in something like git-annex?
19:55 pdurbin (dunno if you're familiar with git-annex)
20:10 sbmarks pdurbin: I guess I'm saying it could be one way to allow some non-CSV formats to be stored in the same way as line-oriented stuff
20:12 pdurbin yeah
20:13 pdurbin I wonder if git is an all or nothing proposition
20:13 sbmarks I'm hoping not =) I am not yet sold on the utility of it, tbh
20:13 pdurbin or if we'd let the dataset author enable git for their dataset
20:14 pdurbin sbmarks: not sold on git for versioning? not interested in `git clone`-ing datasets?
20:14 sbmarks pdurbin: I think it sounds great for people like us who use git a lot....but a lot of researchers don't
20:15 sbmarks I'm just worried about implementing things that might clash with established research workflows is all
20:16 sbmarks i for sure like the idea of an option, but I also think versioning as done currently is pretty OK
20:17 pdurbin sbmarks: so you're not so into papers like this: Source Code for Biology and Medicine | Full text | Git can facilitate greater reproducibility and increased transparency in science - http://www.scfbm.org/content/8/1/7
20:17 pdurbin not that I'm asking you to be :)
20:18 sbmarks hahaha
20:21 sbmarks I guess I would say: I can understand the logic that git promotes reproducibility. but I don't think it's a cure all. I think there are a lot of other pieces that are far more gaping holes than being able to clone the data. We can already somewhat easily get the data. I worry about reproducibility of workflows, for example, which this doesn't necessarily address.
20:22 sbmarks (Granted, yes, it could if those are represented in a git-able manner)
20:23 pdurbin sbmarks: right. OSF is good for workflows: Open Science Framework | Home - https://osf.io
20:26 sbmarks I guess.....I'm just a bit confused as to the value-added of doing it the git way. And it's probably me failing to fully get it
20:26 pdurbin "git it" ;)
20:26 sbmarks b ^_^
20:27 sbmarks i mean, I am Joe Researcher. I can already "clone" data from a DV by downloading it, now I can do it with the API, or even through that R module
20:28 sbmarks and versioning is already supported, although maybe the point is it could possibly be more efficient
20:29 sbmarks unless the idea is that this would standardize the way these things are done, which would be a good thing
20:30 pdurbin sbmarks: sure. via the SWORD API in Dataverse. wouldn't it be cool if RStudio supported SWORD? See my post here: [sword-app-tech] Using SWORD via R - http://www.mail-archive.com/sword-app-tech@lists.sourceforge.net/msg00344.html
20:30 pdurbin Rstudio *does* support git, which is kind of what this paper is about (mentions Dataverse): GitHub: A Tool for Social Data Set Development and Verification in the Cloud by Christopher Gandrud :: SSRN - http://papers.ssrn.com/sol3/papers.cfm?abstract_id=2199367
20:32 pdurbin sbmarks: so yeah, for me anyway, this is about standards. and not re-inventing the wheel, reinventing versioning
20:32 sbmarks fair enough. I'm not trying to be a Debbie Downer or anything.
20:32 pdurbin sbmarks: heh. no worries :)
20:32 sbmarks totally
20:33 sbmarks at the end of the day, standardization and new features are good. =) I do like anything that's going to make Dataverse more of a part of the day to day research workflow
20:34 sbmarks so now that I'm reading this paper I can see some value in having a central place to check out the data set from, and to commit back to, with a way to merge versions and etc.
20:34 sbmarks so I guess: http://i3.kym-cdn.com/photos/images/original/000/138/244/funny-barack-michelle-obama-face.jpg
20:37 pdurbin heh
20:37 sbmarks thanks for bringing me along
20:38 pdurbin well don't get excited. dunno if we can actually ship any git support
20:38 pdurbin it's just me making noise right now
20:38 sbmarks oh totally
20:38 sbmarks i just hate feeling left out of the cool kids conversation
20:38 pdurbin dream a little dream
20:38 sbmarks I'm still working on my magnum opus metadata email reply
20:39 pdurbin !
20:39 pdurbin looking forward to it
20:39 sbmarks i no rite
20:40 pdurbin sbmarks: oh, you might like this... now a Google Doc is the source of truth about metadata. we export 4 tsv files. then we `curl` those tsv files into an API endpoint to populate our metadata blocks
20:40 pdurbin "blocks"... that's what we call them
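The "export TSV, then `curl` it into an API endpoint" step might look roughly like this in Python. The endpoint URL, the TSV columns, and the function name `build_block_upload` are all hypothetical; the log does not say what the real endpoint or file layout is. The sketch only prepares the request, so it runs without a server.

```python
import urllib.request


def build_block_upload(tsv_text, endpoint):
    """Prepare (but do not send) a POST of one TSV file to `endpoint`."""
    return urllib.request.Request(
        endpoint,
        data=tsv_text.encode("utf-8"),
        headers={"Content-Type": "text/tab-separated-values"},
        method="POST",
    )


# Hypothetical endpoint and TSV content, for illustration only.
request = build_block_upload(
    "name\ttitle\ncitation\tCitation Metadata\n",
    "http://localhost:8080/api/hypothetical/load-metadata-block",
)
print(request.get_method(), request.full_url)
```

Passing the prepared request to `urllib.request.urlopen(request)` would actually send it, much as `curl` would with the same body and header.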
20:40 sbmarks ha!
20:40 sbmarks I like the "blocks"
20:40 sbmarks though I would love if there were user-serviceable blocks too =)
20:41 pdurbin sbmarks: well, you could hack on the tsv files but...
20:41 pdurbin you break it you buy it :)
20:41 sbmarks riiiiiiight
20:41 sbmarks hence my desire for user serviceable =)
20:43 pdurbin :)
20:45 sbmarks in all seriousness, I think you guys are doing really great work right now, fwiw
20:45 sbmarks it's nice because it syncs up pretty well with an uptick we've seen in DV use, so yay
21:07 * pdurbin blushes
21:08 pdurbin definitely a team effort: http://www.iq.harvard.edu/people/filter_by/staff/data-science
21:16 axfelix joined #dvn
23:42 axfelix joined #dvn
