[02:08] *** nertzy has quit IRC (Quit: This computer has gone to sleep)
[04:27] *** myself has joined #warrior
[04:28] so, detection like that would be useful; we don't currently have the capacity to implement it
[04:28] manpower that is
[04:28] I suspect something like it would be easiest to implement in a central checker, perhaps at the upload points
[04:29] not only to avoid dealing with sharing knowledge amongst many nodes but also because that's where you see the most
[04:29] Deliberately retrieve the same thing from multiple clients, compare results?
[04:29] possibly; we have code to retry / retrieve the same item multiple times but we usually do not use it
[04:29] the main reason is time constraints
[04:29] Yeah, the bigger any system gets, the more important it becomes to have resiliency, in general. I feel like that has to be a goal, but manpower is a sensible reason to not do it right now.
[04:29] The problem with warcs is that the same content can still create different files
[04:29] that too
[04:30] that said we don't have a good definition of "sufficiently different" or a way to detect it, except perhaps in gross cases like "oh we had 200s and now it's lots of 302s"
[04:31] fortunately that seems like it's what usually happens, probably because it's also the easiest thing for a sysadmin to do
[04:31] that or, "wait a minute all the items are the exact same size"
[04:31] Compressibility is often a useful metric. Concatenate both samples and compress; if it squishes to nearly the same size as just one sample, they weren't very different!
[04:31] that runs into problems when you have a lot of spam profiles
[04:32] etc.
[04:32] anyway, we do have a gigantic corpus of responses if you are interested in looking at this
[04:32] hmm, I was thinking on an individual-page level rather than a whole archive, but yeah..
[04:32] like comb over the archiveteam collections
[04:33] Where would I start combing? I'm interested in noodling around to see if I can make any sense of it. Is there a gallery-of-bad-responses somewhere? Heh.
[04:33] there isn't one and that's something that we do need
[04:33] on a per-collection basis
[04:34] Developing a response-badness heuristic would be a start at noticing all sorts of problems as they happen.
[04:34] so https://archive.org/search.php?query=archiveteam will give you way too much data
[04:34] you may want to see perhaps, uh
[04:34] Yummy firehose..
[04:34] maybe the Hyves and Splinder collections
[04:34] I name those two because we ended up breaking the sites a few times
[04:35] and I think they may have broken in creative ways
[04:35] but even that's several metric asstons of data
[04:35] what's that in Shitloads Avoirdupois?
[04:35] I really don't know where you'd start unless you had a heuristic, which is chicken-and-egging it
[04:36] OR you just dig through the data, One Terabyte of Kilobyte Age-style
[04:36] and from there you notice "oh this is all sorts of busted"
[04:36] do note that that sort of search takes years (and indeed the OToKA bit is ongoing)
[04:37] that said, if you want to just load up some WARCs (always a good start), try webarchiveplayer -> https://github.com/ikreymer/webarchiveplayer
[04:37] *** aaaaaaaaa has quit IRC (Leaving)
[04:38] if you're comfortable with administering python webapps try pywb -> https://github.com/ikreymer/pywb
[04:38] I'm not, but the first looks like a good place to start noodling around.
[04:38] you can also try IA's wayback -> https://github.com/internetarchive/wayback <- but in my experience pywb is much easier to get working and is more advanced wrt replay capabilities
[04:39] maybe there's some things IA's wayback handles that pywb doesn't, I haven't checked that deeply
[04:40] I was mostly impressed when I loaded up a YouTube grab done by ArchiveBot in pywb and saw the video playing
[04:40] that was a DAAAAMN moment
[04:40] I'm mostly just looking to use these to see if I can find error pages in Hyves or Splinder, right? and then look to see if I can detect those in a generic way.
[04:40] yeah
[04:41] you can't really find errors in WARCs if you can't load them up
[04:41] so the differences between players shouldn't matter much
[04:41] it can
[04:41] some players will choke on requests for resources that are actually in the WARC
[04:41] hyves/splinder are probably not using any features that would trigger that? but who knows
[04:41] I've seen the "choke" thing happen on infinite-scroll pages
[07:34] *** Atom__ has quit IRC (Read error: Connection reset by peer)
[07:36] *** Atom__ has joined #warrior
[13:17] *** Atom__ has quit IRC (Read error: Connection reset by peer)
[13:26] *** Atom__ has joined #warrior
[13:40] *** Atom__ has quit IRC (Ping timeout: 306 seconds)
[15:02] *** phuzion_ is now known as phuzion
[15:06] *** Fusl has quit IRC (hub.efnet.us irc.umich.edu)
[15:06] *** trs80 has quit IRC (hub.efnet.us irc.umich.edu)
[15:11] *** Fusl has joined #warrior
[16:00] *** trs80 has joined #warrior
[18:18] *** Start has quit IRC (Quit: Disconnected.)
[18:57] *** aaaaaaaaa has joined #warrior
[19:22] *** aaaaaaaa_ has joined #warrior
[19:23] *** aaaaaaaaa has quit IRC (Ping timeout: 600 seconds)
[19:24] *** aaaaaaaa_ is now known as aaaaaaaaa
[20:41] *** atlogbot has quit IRC (Remote host closed the connection)
[20:42] *** atlogbot has joined #warrior
[20:53] *** aaaaaaaa_ has joined #warrior
[20:53] *** aaaaaaaaa has quit IRC (Read error: Connection reset by peer)
[21:19] *** aaaaaaaa_ is now known as aaaaaaaaa
[21:48] *** aaaaaaaa_ has joined #warrior
[21:48] *** aaaaaaaaa has quit IRC (Read error: Connection reset by peer)
[21:50] *** aaaaaaaa_ is now known as aaaaaaaaa
[22:14] *** nertzy has joined #warrior
[23:02] *** nertzy has quit IRC (Quit: This computer has gone to sleep)
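
Note on the compressibility idea from [04:31]: "concatenate both samples and compress" is roughly what is usually called a normalized compression distance. The following is only a minimal sketch of that idea, not anything the project is known to run. The function names and the choice of zlib are assumptions made for illustration.

```python
import zlib


def compressed_size(data: bytes) -> int:
    """Size in bytes of `data` after zlib compression at the highest level."""
    return len(zlib.compress(data, 9))


def compression_distance(a: bytes, b: bytes) -> float:
    """Rough normalized compression distance between two samples.

    Values near 0 mean the concatenation compresses to about the size of
    one sample alone (the samples are very similar). Values near 1 mean
    the second sample added almost entirely new information.
    """
    size_a = compressed_size(a)
    size_b = compressed_size(b)
    size_ab = compressed_size(a + b)
    # Hypothetical normalization choice: extra cost of the concatenation
    # over the smaller sample, relative to the larger sample.
    return (size_ab - min(size_a, size_b)) / max(size_a, size_b)
```

Two caveats apply. zlib only looks back about 32 KB, so for response bodies larger than that a compressor with a bigger window would probably be needed. And, as pointed out in the log, many near-identical pages (spam profiles, for example) will score as "not very different" even when every fetch was fine, so a low distance is a hint to look closer and not proof of a broken grab.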