#warrior 2015-08-31,Mon

Time Nickname Message
02:08 🔗 nertzy has quit IRC (Quit: This computer has gone to sleep)
04:27 🔗 myself has joined #warrior
04:28 🔗 yipdw so, detection like that would be useful; we don't currently have the capacity to implement it
04:28 🔗 yipdw manpower that is
04:28 🔗 yipdw I suspect something like it would be easiest to implement in a central checker, perhaps at the upload points
04:29 🔗 yipdw not only to avoid dealing with sharing knowledge amongst many nodes but also because that's where you see the most
04:29 🔗 myself Deliberately retrieve the same thing from multiple clients, compare results?
04:29 🔗 yipdw possibly; we have code to retry / retrieve the same item multiple times but we usually do not use it
04:29 🔗 yipdw the main reason is time constraints
04:29 🔗 myself Yeah, the bigger any system gets, the more important it becomes to have resiliency, in general. I feel like that has to be a goal, but manpower is a sensible reason to not do it right now.
04:29 🔗 aaaaaaaaa The problem with warcs is that the same content can still create different files
04:29 🔗 yipdw that too
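A rough sketch of the "retrieve from multiple clients, compare results" idea, in Python. Since two WARCs holding the same content can still differ byte for byte, this compares a hash of each response body rather than the files themselves. The `Fetch` type, both function names, and the data shape are hypothetical and are not part of any ArchiveTeam tooling.

```python
import hashlib
from collections import defaultdict
from typing import NamedTuple


class Fetch(NamedTuple):
    """One client's retrieval of a single URL (hypothetical structure)."""
    client: str
    status: int
    body: bytes


def compare_fetches(fetches):
    """Group client names by (status code, SHA-1 of body)."""
    groups = defaultdict(list)
    for f in fetches:
        key = (f.status, hashlib.sha1(f.body).hexdigest())
        groups[key].append(f.client)
    return dict(groups)


def clients_disagree(fetches):
    """True if the clients did not all see the same status and body."""
    return len(compare_fetches(fetches)) > 1
```

Exact hashing treats any per-request variation (timestamps, session tokens) as a difference, so on dynamic pages this would only be useful for the gross cases.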
04:30 🔗 yipdw that said we don't have a good definition of "sufficiently different" or a way to detect it, except perhaps in gross cases like "oh we had 200s and now it's lots of 302s"
04:31 🔗 yipdw fortunately that seems like it's what usually happens, probably because it's also the easiest thing for a sysadmin to do
04:31 🔗 aaaaaaaaa that or, "wait a minute all the items are the exact same size"
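The two gross cases mentioned here (200s turning into lots of 302s, and every item coming back the exact same size) could be checked with something like the following Python sketch. The function names and all thresholds are assumptions picked for illustration; nothing like this is confirmed to exist at the upload points.

```python
from collections import Counter


def share_3xx(statuses):
    """Fraction of responses that are redirects (3xx)."""
    if not statuses:
        return 0.0
    return sum(1 for s in statuses if 300 <= s < 400) / len(statuses)


def redirect_spike(baseline, recent, threshold=0.5):
    """True if the 3xx share rose by more than `threshold` (arbitrary default)."""
    return share_3xx(recent) - share_3xx(baseline) > threshold


def sizes_too_uniform(sizes, min_items=10, max_share=0.9):
    """True if one exact size accounts for at least `max_share` of the items."""
    if len(sizes) < min_items:
        return False  # too few items to say anything
    _, top_count = Counter(sizes).most_common(1)[0]
    return top_count / len(sizes) >= max_share
```

Here `baseline` and `recent` would be lists of HTTP status codes from earlier and later uploads for the same project, and `sizes` a list of item sizes in bytes.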
04:31 🔗 myself Compressibility is often a useful metric. Concatenate both samples and compress; if it squishes to nearly the same size as just one sample, they weren't very different!
04:31 🔗 yipdw that runs into problems when you have a lot of spam profiles
04:32 🔗 yipdw etc.
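The concatenate-and-compress idea is usually called normalized compression distance. A minimal Python sketch using `zlib` from the standard library follows; the function names are made up for this example, and the choice of compressor is an assumption.

```python
import zlib


def compressed_size(data: bytes) -> int:
    """Length in bytes of the zlib-compressed input at maximum compression."""
    return len(zlib.compress(data, 9))


def ncd(a: bytes, b: bytes) -> float:
    """Normalized compression distance.

    Close to 0 when the two samples are nearly the same, close to 1
    when compressing them together saves almost nothing.
    """
    ca, cb = compressed_size(a), compressed_size(b)
    cab = compressed_size(a + b)
    return (cab - min(ca, cb)) / max(ca, cb)
```

Two caveats. DEFLATE only looks back 32 KB, so for samples larger than that the second one cannot be matched against the first; `lzma` has a much larger dictionary and could be swapped in. And as noted above, templated pages such as spam profiles will score as nearly identical even when each was fetched correctly.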
04:32 🔗 yipdw anyway, we do have a gigantic corpus of responses if you are interested in looking at this
04:32 🔗 myself hmm, I was thinking on an individual-page level rather than a whole archive, but yeah..
04:32 🔗 yipdw like comb over the archiveteam collections
04:33 🔗 myself Where would I start combing? I'm interested in noodling around to see if I can make any sense of it. Is there a gallery-of-bad-responses somewhere? Heh.
04:33 🔗 yipdw there isn't one and that's something that we do need
04:33 🔗 yipdw on a per-collection basis
04:34 🔗 myself Developing a response-badness heuristic would be a start at noticing all sorts of problems as they happen.
04:34 🔗 yipdw so https://archive.org/search.php?query=archiveteam will give you way too much data
04:34 🔗 yipdw you may want to see perhaps, uh
04:34 🔗 myself Yummy firehose..
04:34 🔗 yipdw maybe the Hyves and Splinder collections
04:34 🔗 yipdw I name those two because we ended up breaking the sites a few times
04:35 🔗 yipdw and I think they may have broken in creative ways
04:35 🔗 yipdw but even that's several metric asstons of data
04:35 🔗 myself what's that in Shitloads Avoirdupois?
04:35 🔗 yipdw I really don't know where you'd start unless you had a heuristic, which is chicken-and-egging it
04:36 🔗 yipdw OR you just dig through the data, One Terabyte of Kilobyte Age-style
04:36 🔗 yipdw and from there you notice "oh this is all sorts of busted"
04:36 🔗 yipdw do note that that sort of search takes years (and indeed the OToKA bit is ongoing)
04:37 🔗 yipdw that said, if you want to just load up some WARCs (always a good start), try webarchiveplayer -> https://github.com/ikreymer/webarchiveplayer
04:37 🔗 aaaaaaaaa has quit IRC (Leaving)
04:38 🔗 yipdw if you're comfortable with administering python webapps try pywb -> https://github.com/ikreymer/pywb
04:38 🔗 myself I'm not, but the first looks like a good place to start noodling around.
04:38 🔗 yipdw you can also try IA's wayback -> https://github.com/internetarchive/wayback <- but in my experience pywb is much easier to get working and is more advanced wrt replay capabilities
04:39 🔗 yipdw maybe there's some things IA's wayback handles that pywb doesn't, I haven't checked that deeply
04:40 🔗 yipdw I was mostly impressed when I loaded up a YouTube grab done by ArchiveBot in pywb and saw the video playing
04:40 🔗 yipdw that was a DAAAAMN moment
04:40 🔗 myself I'm mostly just looking to use these to see if I can find error pages in Hyves or Splinder, right? and then look to see if I can detect those in a generic way.
04:40 🔗 yipdw yeah
04:41 🔗 yipdw you can't really find errors in WARCs if you can't load them up
04:41 🔗 myself so the differences between players shouldn't matter much
04:41 🔗 yipdw it can
04:41 🔗 yipdw some players will choke on requests for resources that are actually in the WARC
04:41 🔗 yipdw hyves/splinder are probably not using any features that would trigger that? but who knows
04:41 🔗 yipdw I've seen the "choke" thing happen on infinite-scroll pages
07:34 🔗 Atom__ has quit IRC (Read error: Connection reset by peer)
07:36 🔗 Atom__ has joined #warrior
13:17 🔗 Atom__ has quit IRC (Read error: Connection reset by peer)
13:26 🔗 Atom__ has joined #warrior
13:40 🔗 Atom__ has quit IRC (Ping timeout: 306 seconds)
15:02 🔗 phuzion_ is now known as phuzion
15:06 🔗 Fusl has quit IRC (hub.efnet.us irc.umich.edu)
15:06 🔗 trs80 has quit IRC (hub.efnet.us irc.umich.edu)
15:11 🔗 Fusl has joined #warrior
16:00 🔗 trs80 has joined #warrior
18:18 🔗 Start has quit IRC (Quit: Disconnected.)
18:57 🔗 aaaaaaaaa has joined #warrior
19:22 🔗 aaaaaaaa_ has joined #warrior
19:23 🔗 aaaaaaaaa has quit IRC (Ping timeout: 600 seconds)
19:24 🔗 aaaaaaaa_ is now known as aaaaaaaaa
20:41 🔗 atlogbot has quit IRC (Remote host closed the connection)
20:42 🔗 atlogbot has joined #warrior
20:53 🔗 aaaaaaaa_ has joined #warrior
20:53 🔗 aaaaaaaaa has quit IRC (Read error: Connection reset by peer)
21:19 🔗 aaaaaaaa_ is now known as aaaaaaaaa
21:48 🔗 aaaaaaaa_ has joined #warrior
21:48 🔗 aaaaaaaaa has quit IRC (Read error: Connection reset by peer)
21:50 🔗 aaaaaaaa_ is now known as aaaaaaaaa
22:14 🔗 nertzy has joined #warrior
23:02 🔗 nertzy has quit IRC (Quit: This computer has gone to sleep)
