[02:08] *** nertzy has quit IRC (Quit: This computer has gone to sleep)
[04:27] *** myself has joined #warrior
[04:28] <yipdw> so, detection like that would be useful; we don't currently have the capacity to implement it
[04:28] <yipdw> manpower that is
[04:28] <yipdw> I suspect something like it would be easiest to implement in a central checker, perhaps at the upload points
[04:29] <yipdw> not only to avoid dealing with sharing knowledge amongst many nodes but also because that's where you see the most
[04:29] <myself> Deliberately retrieve the same thing from multiple clients, compare results?
[04:29] <yipdw> possibly; we have code to retry / retrieve the same item multiple times but we usually do not use it
[04:29] <yipdw> the main reason is time constraints
[04:29] <myself> Yeah, the bigger any system gets, the more important it becomes to have resiliency, in general. I feel like that has to be a goal, but manpower is a sensible reason to not do it right now.
[04:29] <aaaaaaaaa> The problem with warcs is that the same content can still create different files
[04:29] <yipdw> that too
[04:30] <yipdw> that said we don't have a good definition of "sufficiently different" or a way to detect it, except perhaps in gross cases like "oh we had 200s and now it's lots of 302s"
[04:31] <yipdw> fortunately that seems like it's what usually happens, probably because it's also the easiest thing for a sysadmin to do
[04:31] <aaaaaaaaa> that or, "wait a minute all the items are the exact same size"
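
The two "gross case" checks mentioned above (200s turning into lots of 302s, and every item coming back at exactly the same size) can be sketched roughly as follows. This is only an illustration, not code the project runs: the function names, the `(status, size)` tuple shape, and both thresholds are assumptions made up for the example.

```python
from collections import Counter
from typing import Dict, List, Sequence, Tuple

# Hypothetical record shape: (HTTP status code, body length in bytes).
Response = Tuple[int, int]


def status_shares(responses: Sequence[Response]) -> Dict[int, float]:
    """Return the share of responses in each status class (2 for 2xx, 3 for 3xx, ...)."""
    counts = Counter(status // 100 for status, _ in responses)
    total = sum(counts.values())
    return {cls: n / total for cls, n in counts.items()} if total else {}


def gross_warnings(
    baseline: Sequence[Response],
    batch: Sequence[Response],
    max_shift: float = 0.5,            # assumed threshold, not a tuned value
    max_same_size_share: float = 0.9,  # assumed threshold, not a tuned value
) -> List[str]:
    """Flag only gross changes: a status class surging, or near-uniform sizes."""
    warnings: List[str] = []
    if not batch:
        return warnings

    before = status_shares(baseline)
    after = status_shares(batch)
    for cls in sorted(set(before) | set(after)):
        shift = after.get(cls, 0.0) - before.get(cls, 0.0)
        if shift >= max_shift:
            warnings.append(f"{cls}xx share rose by {shift:.0%}")

    # "wait a minute all the items are the exact same size"
    size, n = Counter(size for _, size in batch).most_common(1)[0]
    if len(batch) >= 5 and n / len(batch) >= max_same_size_share:
        warnings.append(f"{n} of {len(batch)} responses are exactly {size} bytes")

    return warnings
```

Something like `gross_warnings(earlier_items, latest_items)` could run wherever many responses are visible at once, which fits the suggestion of a central checker at the upload points. It says nothing about subtler breakage.
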
[04:31] <myself> Compressibility is often a useful metric. Concatenate both samples and compress; if it squishes to nearly the same size as just one sample, they weren't very different! 
[04:31] <yipdw> that runs into problems when you have a lot of spam profiles
[04:32] <yipdw> etc.
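
The concatenate-and-compress idea is close to what is usually called normalized compression distance. A minimal sketch, assuming each sample is a single page body smaller than zlib's 32 KB window (larger samples would need a compressor with a bigger window, such as `lzma`); the function names are made up for the example:

```python
import zlib


def compressed_size(data: bytes) -> int:
    """Length of the zlib-compressed form of data at the highest compression level."""
    return len(zlib.compress(data, 9))


def compression_distance(a: bytes, b: bytes) -> float:
    """Normalized compression distance between two samples.

    Values near 0 mean the concatenation squishes to about the size of one
    sample (the samples are very similar); values near 1 mean the second
    sample added about as much as it costs on its own.
    """
    size_a = compressed_size(a)
    size_b = compressed_size(b)
    size_ab = compressed_size(a + b)
    return (size_ab - min(size_a, size_b)) / max(size_a, size_b)
```

As noted above, a low distance does not by itself mean something is wrong: a site full of near-identical spam profiles would also score as "not very different", so a score like this is one input to a heuristic rather than a verdict.
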
[04:32] <yipdw> anyway, we do have a gigantic corpus of responses if you are interested in looking at this
[04:32] <myself> hmm, I was thinking on an individual-page level rather than a whole archive, but yeah..
[04:32] <yipdw> like comb over the archiveteam collections
[04:33] <myself> Where would I start combing? I'm interested in noodling around to see if I can make any sense of it. Is there a gallery-of-bad-responses somewhere? Heh. 
[04:33] <yipdw> there isn't one and that's something that we do need
[04:33] <yipdw> on a per-collection basis
[04:34] <myself> Developing a response-badness heuristic would be a start at noticing all sorts of problems as they happen.
[04:34] <yipdw> so https://archive.org/search.php?query=archiveteam will give you way too much data
[04:34] <yipdw> you may want to see perhaps, uh
[04:34] <myself> Yummy firehose..
[04:34] <yipdw> maybe the Hyves and Splinder collections
[04:34] <yipdw> I name those two because we ended up breaking the sites a few times
[04:35] <yipdw> and I think they may have broken in creative ways
[04:35] <yipdw> but even that's several metric asstons of data
[04:35] <myself> what's that in Shitloads Avoirdupois? 
[04:35] <yipdw> I really don't know where you'd start unless you had a heuristic, which is chicken-and-egging it
[04:36] <yipdw> OR you just dig through the data, One Terabyte of Kilobyte Age-style
[04:36] <yipdw> and from there you notice "oh this is all sorts of busted"
[04:36] <yipdw> do note that that sort of search takes years (and indeed the OToKA bit is ongoing)
[04:37] <yipdw> that said, if you want to just load up some WARCs (always a good start), try webarchiveplayer -> https://github.com/ikreymer/webarchiveplayer
[04:37] *** aaaaaaaaa has quit IRC (Leaving)
[04:38] <yipdw> if you're comfortable with administering python webapps try pywb -> https://github.com/ikreymer/pywb
[04:38] <myself> I'm not, but the first looks like a good place to start noodling around. 
[04:38] <yipdw> you can also try IA's wayback -> https://github.com/internetarchive/wayback <- but in my experience pywb is much easier to get working and is more advanced wrt replay capabilities
[04:39] <yipdw> maybe there are some things IA's wayback handles that pywb doesn't, I haven't checked that deeply
[04:40] <yipdw> I was mostly impressed when I loaded up a YouTube grab done by ArchiveBot in pywb and saw the video playing
[04:40] <yipdw> that was a DAAAAMN moment
[04:40] <myself> I'm mostly just looking to use these to see if I can find error pages in Hyves or Splinder, right? and then look to see if I can detect those in a generic way.
[04:40] <yipdw> yeah
[04:41] <yipdw> you can't really find errors in WARCs if you can't load them up
[04:41] <myself> so the differences between players shouldn't matter much
[04:41] <yipdw> it can
[04:41] <yipdw> some players will choke on requests for resources that are actually in the WARC
[04:41] <yipdw> hyves/splinder are probably not using any features that would trigger that? but who knows
[04:41] <yipdw> I've seen the "choke" thing happen on infinite-scroll pages
[07:34] *** Atom__ has quit IRC (Read error: Connection reset by peer)
[07:36] *** Atom__ has joined #warrior
[13:17] *** Atom__ has quit IRC (Read error: Connection reset by peer)
[13:26] *** Atom__ has joined #warrior
[13:40] *** Atom__ has quit IRC (Ping timeout: 306 seconds)
[15:02] *** phuzion_ is now known as phuzion
[15:06] *** Fusl has quit IRC (hub.efnet.us irc.umich.edu)
[15:06] *** trs80 has quit IRC (hub.efnet.us irc.umich.edu)
[15:11] *** Fusl has joined #warrior
[16:00] *** trs80 has joined #warrior
[18:18] *** Start has quit IRC (Quit: Disconnected.)
[18:57] *** aaaaaaaaa has joined #warrior
[19:22] *** aaaaaaaa_ has joined #warrior
[19:23] *** aaaaaaaaa has quit IRC (Ping timeout: 600 seconds)
[19:24] *** aaaaaaaa_ is now known as aaaaaaaaa
[20:41] *** atlogbot has quit IRC (Remote host closed the connection)
[20:42] *** atlogbot has joined #warrior
[20:53] *** aaaaaaaa_ has joined #warrior
[20:53] *** aaaaaaaaa has quit IRC (Read error: Connection reset by peer)
[21:19] *** aaaaaaaa_ is now known as aaaaaaaaa
[21:48] *** aaaaaaaa_ has joined #warrior
[21:48] *** aaaaaaaaa has quit IRC (Read error: Connection reset by peer)
[21:50] *** aaaaaaaa_ is now known as aaaaaaaaa
[22:14] *** nertzy has joined #warrior
[23:02] *** nertzy has quit IRC (Quit: This computer has gone to sleep)