[00:04] luckcolor: seems sensible to me
[00:07] *** dashcloud has joined #archiveteam
[00:08] http://fos.textfiles.com/pipeline.html
[00:08] As per Pipeline, we've begun moving nunij and area51 in (it says no mover script, that's not true)
[00:08] It's a function that I'm not using
[00:19] hm, on http://fos.textfiles.com/ARCHIVETEAM/ I see various items just labeled as "archiveteam", like: archiveteam_20160902230555 which (according to the idx file) seems to consist of a bunch of torrents. Those are area51, I presume?
[00:19] *** hictooth has quit IRC (Ping timeout: 268 seconds)
[00:23] *** godane1 has quit IRC (Quit: Leaving.)
[00:23] ah archiveteam_20160902230555 has a title that identifies it as part of the "Torrent Time Capsule"
[00:24] others of the unlabeled (by identifier) ones are part of BayImg. One is the grab of thomas.congress.gov: archiveteam_201607050000
[00:25] and another Friends Reunited: archiveteam_2016062210391011
[00:25] which I think has a page on the wiki; I should add a link
[00:25] Thanks, detective
[00:26] you knew I would
[00:26] (and there already is a link)
[00:27] There's what you can do and what you should do
[00:27] eh, this one *seemed* to fit both categories. If not, I'm happy to know otherwise
[00:30] *** WinterFox has joined #archiveteam
[00:35] *** pfallenop has quit IRC (Read error: Operation timed out)
[00:43] Update on tumblr and flickr projects. I have now uploaded an original and deduplicate WARC here https://archive.org/download/flickrestdeduphmsdfofjdsd
[00:43] I asked in #warrior for help to see if these are correct.
[00:43] I have also asked the wayback team at Internet Archive if they can have a look at these two WARCs.
[00:44] If they are confirmed to be good, this deduplication script will be used in the tumblr and flickr projects.
[00:44] xmc, PurpleSym ^
[00:49] *** kristian_ has joined #archiveteam
[01:08] for anyone running googlecode
[01:08] do your items also only get 503 and then abort?
[01:16] *** maelstrom has joined #archiveteam
[01:19] Please let me know as soon as possible why your items are aborting with googlecode
[01:23] Mine are just ratelimited.
[01:26] hmm
[01:26] well let me know if you do get any please
[01:26] * arkiver is afk for the night
[01:26] will do
[01:26] thanks!
[01:27] if someone can confirm the 503's, I'll send a mail
[01:33] arkiver: Ah, I got a 503!
[01:34] *** Brah has joined #archiveteam
[01:34] but I lost the logs :-(
[01:34] *** Brah has quit IRC (Client Quit)
[01:43] arkiver: got the 503's: http://paste.nerds.io/sokikubejo.js
[01:43] please send the email
[01:53] *** VADemon has quit IRC (Quit: left4dead)
[02:08] *** ndiddy has quit IRC (Ping timeout: 632 seconds)
[02:19] *** kristian_ has quit IRC (Quit: Leaving)
[02:19] arkiver: JesseW: is that not the bot block page?
[02:24] i think so, yes
[02:24] that's why arkiver is going to write an email
[02:28] *** kristian_ has joined #archiveteam
[02:48] *** kristian_ has quit IRC (Quit: Leaving)
[03:40] *** BlueMaxim has quit IRC (Read error: Operation timed out)
[03:43] *** pfallenop has joined #archiveteam
[04:04] *** Sk1d has quit IRC (Ping timeout: 194 seconds)
[04:08] *** signius has quit IRC (Read error: Operation timed out)
[04:10] *** Sk1d has joined #archiveteam
[04:24] *** signius has joined #archiveteam
[04:28] *** Aranje has quit IRC (Ping timeout: 260 seconds)
[04:48] *** maelstrom has quit IRC (Quit: Leaving)
[05:08] *** godane has joined #archiveteam
[06:08] arkiver: neat. is it better to shovel around 10x more data than we are going to wind up with, or to figure out how to not fetch so much in the first place?
[06:29] *** Honno has joined #archiveteam
[06:57] *** BlueMaxim has joined #archiveteam
[07:12] *** DFJustin has quit IRC (Remote host closed the connection)
[07:12] *** DFJustin has joined #archiveteam
[07:12] *** swebb sets mode: +o DFJustin
[07:40] arkiver: warcat says “Bad payload digest.” for revisit and warcinfo records in the deduplicated WARC.
[07:42] *** JesseW has quit IRC (Ping timeout: 370 seconds)
[07:48] *** Simpbrain has quit IRC (Read error: Operation timed out)
[07:53] *** ravetcofx has quit IRC (Ping timeout: 501 seconds)
[08:03] *** metal_cam has joined #archiveteam
[08:30] *** vOYtEC has joined #archiveteam
[08:36] *** schbirid has joined #archiveteam
[08:39] *** Simpbrain has joined #archiveteam
[09:47] *** tuankiet has joined #archiveteam
[10:05] *** atomotic has joined #archiveteam
[10:17] *** atomotic has quit IRC (Quit: Textual IRC Client: www.textualapp.com)
[10:20] *** kristian_ has joined #archiveteam
[10:59] *** REiN^ has joined #archiveteam
[11:03] *** bRick5772 has joined #archiveteam
[11:20] *** VADemon has joined #archiveteam
[11:30] *** bRick5772 has quit IRC (Quit: Leaving.)
[11:51] *** Morbus has quit IRC (http://www.disobey.com/)
[11:56] *** Morbus has joined #archiveteam
[11:56] *** kristian_ has quit IRC (Quit: Leaving)
[12:13] *** signius has quit IRC (Read error: Operation timed out)
[12:24] joepie91: yeah, but we're not crawling google code too fast, so I don't think it is caused by that
[12:24] JesseW: mail sent/
[12:24] .*
[12:27] xmc: the HTTP headers returned by flickr on the images are not the for different URLs with the same pauload (same image on c1 and c2 for example).
[12:29] So if we would generate the record of these other image URLs instead of crawling them we'd have to fake or loose some of the headers
[12:29] What do you think?
[12:32] PurpleSym: I think this is caused by warcat not keeping revisit records in mind when checking the payload digest https://github.com/chfoo/warcat/blob/master/warcat/tool.py#L295-L300 and https://github.com/chfoo/warcat/blob/master/warcat/verify.py#L56-L67
[12:32] I'm not 100% sure though
[12:32] The payload digest of the revisit records is the same as the payload digest of the record with the original data, where the revisit record is pointing too
[12:33] You’re right, that’s a bug.
[12:38] so
[12:38] anything on the bioware forums?
[12:41] *** WinterFox has quit IRC (Read error: Operation timed out)
[12:45] I see it's in archivebot, nvm
[12:49] pipeline status page borked? http://fos.textfiles.com/pipeline.html
[12:53] *** Simpbrain has quit IRC (Read error: Connection reset by peer)
[13:15] hmm, let me retype xmc: the HTTP headers returned by flickr on the images are not the for different URLs with the same pauload (same image on c1 and c2 for example).
[13:16] xmc: the HTTP headers returned by flickr for the images are not the same for different URLs with the same payload (same image on c1 and c2 for example).
[13:37] *** phuzion has quit IRC (Read error: Operation timed out)
[13:40] *** phuzion has joined #archiveteam
[14:22] *** BlueMaxim has quit IRC (Quit: Leaving)
[14:25] *** VADemon has quit IRC (Quit: left4dead)
[14:41] *** polm has quit IRC (Quit: leaving)
[14:48] *** signius has joined #archiveteam
[15:34] *** _hyperion has joined #archiveteam
[15:38] *** _hyperion is now known as arkiver2
[15:39] *** arkiver2 has quit IRC (Quit: BitchX: try our Windows Me and Windows XP flavors too!)
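The 12:32 point about revisit records can be sketched roughly as follows. This is not warcat's actual code. The function names `payload_digest` and `check_record` are hypothetical, and the sketch assumes digests are written as `sha1:` followed by the base32-encoded SHA-1 hash, and that a revisit record carries no payload body of its own. Under those assumptions, a checker that hashes the revisit record's own (empty) body will always report a mismatch, so the sketch skips revisit records instead.

```python
import base64
import hashlib


def payload_digest(payload: bytes) -> str:
    """Return a digest label: 'sha1:' + base32 of the SHA-1 hash (assumed format)."""
    raw = hashlib.sha1(payload).digest()
    return "sha1:" + base64.b32encode(raw).decode("ascii")


def check_record(warc_type: str, declared_digest: str, payload: bytes) -> str:
    """Hypothetical checker: returns 'ok', 'bad', or 'skipped'."""
    if warc_type == "revisit":
        # The declared digest describes the payload of the original record
        # the revisit points to, not the revisit record's own body,
        # so there is nothing local to compare it against.
        return "skipped"
    return "ok" if payload_digest(payload) == declared_digest else "bad"


if __name__ == "__main__":
    image = b"same image bytes served from c1 and c2"
    digest = payload_digest(image)
    print(check_record("response", digest, image))  # original capture
    print(check_record("revisit", digest, b""))     # deduplicated capture
```

A stricter variant could look up the original record and verify the digest against that payload rather than skipping the check.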
[16:02] *** MMovie2 has joined #archiveteam
[16:03] *** MMovie has quit IRC (Read error: Operation timed out)
[16:21] *** ravetcofx has joined #archiveteam
[16:41] *** JesseW has joined #archiveteam
[16:44] *** VADemon has joined #archiveteam
[17:06] *** metalcamp has joined #archiveteam
[17:10] *** metal_cam has quit IRC (Read error: Operation timed out)
[17:11] *** metal_cam has joined #archiveteam
[17:16] *** metalcamp has quit IRC (Read error: Operation timed out)
[17:25] *** Infreq has quit IRC (Read error: Operation timed out)
[17:29] *** Infreq has joined #archiveteam
[18:37] *** metalcamp has joined #archiveteam
[18:42] *** metal_cam has quit IRC (Read error: Operation timed out)
[19:23] *** RichardG has quit IRC (Read error: Connection reset by peer)
[19:24] *** RichardG has joined #archiveteam
[19:25] *** metalcamp has quit IRC (Read error: Operation timed out)
[19:43] *** ndiddy has joined #archiveteam
[19:54] *** JesseW has quit IRC (Ping timeout: 370 seconds)
[20:09] *** RichardG has quit IRC (Read error: Operation timed out)
[20:10] *** RichardG has joined #archiveteam
[20:21] *** daank has joined #archiveteam
[20:21] *** daank has quit IRC (Client Quit)
[20:34] *** metalcamp has joined #archiveteam
[20:48] *** Aranje has joined #archiveteam
[20:52] *** schbirid has quit IRC (Quit: Leaving)
[20:58] *** ravetcofx has quit IRC (Read error: Operation timed out)
[21:04] *** metalcamp has quit IRC (Read error: Operation timed out)
[21:21] *** vOYtEC has quit IRC (Ping timeout: 633 seconds)
[21:46] *** all_ has joined #archiveteam
[21:47] *** all_ has quit IRC (Client Quit)
[22:15] *** RichardG has quit IRC (Read error: Operation timed out)
[22:16] *** RichardG has joined #archiveteam
[22:19] *** WinterFox has joined #archiveteam
[22:24] *** VADemon has quit IRC (Quit: left4dead)
[22:45] *** WinterFox has quit IRC (Read error: Operation timed out)
[23:12] *** JesseW has joined #archiveteam
[23:14] *** Honno has quit IRC (Read error: Operation timed out)
[23:33] *** arkiver2_ has joined #archiveteam
[23:33] *** arkiver2_ has quit IRC (Client Quit)
[23:51] ah, huh
[23:51] ok
[23:54] *** verizon has joined #archiveteam
[23:56] hello =)
[23:56] sorry, i don't like verizon
[23:56] me neither
[23:57] but who is the less worse
[23:57] that is the question
[23:57] so... any ideas