[00:06] *** pir^2 has joined #archiveteam
[00:08] Does anyone have a complete archive of eff.org?
[00:11] *** primus104 has quit IRC (Leaving.)
[00:11] robots.txt been fixed yet?
[00:12] *** mistym_ has joined #archiveteam
[00:13] *** mistym_ has quit IRC (Client Quit)
[00:13] wyatt8750 - http://web.archive.org/web/20150407003912/http://wyatt8740.no-ip.org/git-glkterm-win32.tar.xz still gives me the robots error
[00:13] grr
[00:14] feex eeet! (I'll be patient, just checking in)
[00:14] wait, I just saw it work!
[00:14] holy crap
[00:14] wyatt:
[00:15] https://web.archive.org/web/*/http://wyatt8740.no-ip.org/git-glkterm-win32.tar.xz
[00:18] BTW Wayback (Heritrix) has a limit for file size IIRC. Maybe 5 MB? Can't remember.
[00:24] wyatt8750 - is that good now?
[00:32] For saving raw data like XML dumps (not HTML mostly), is using a tarball or WARC better?
[00:33] WARC if you want it to be browseable in the Wayback Machine or something. Tarball if it's just data you want to have archived, but isn't worthwhile to have in the Wayback Machine.
[00:35] *** erbylnt has joined #archiveteam
[00:36] Is there any reason to have a bunch of zip files, or XML, JSON, etc. in the Wayback Machine?
[00:37] If you'd normally get to them by clicking links on a webpage, maybe. If not, not really.
[00:39] *** mistym_ has joined #archiveteam
[00:40] pir^2: WARC for HTTP only makes sense if you're recording request-response
[00:41] if you're not, faking it is not a good idea
[00:43] *** pir^2 has quit IRC (Ping timeout: 370 seconds)
[00:44] *** pir^2 has joined #archiveteam
[00:45] I wasn't going to fake it... I was thinking of what settings to use for wget (or maybe wpull)
[00:46] --warc-file is usually enough
[00:46] oh wait, the other settings
[00:46] ?
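A minimal sketch of the two options weighed above, assuming the URLs to fetch are already listed in a text file. The helper names `wget_warc_command` and `make_tarball` are hypothetical and not from the channel; `--warc-file` and `--input-file` are existing wget options. The first helper only builds the command line and does not run wget.

```python
import tarfile
from pathlib import Path


def wget_warc_command(url_list: str, warc_prefix: str) -> list[str]:
    """Build (but do not run) a wget invocation that records a WARC."""
    return [
        "wget",
        f"--input-file={url_list}",    # read the URLs to fetch from this file
        f"--warc-file={warc_prefix}",  # record requests/responses; wget writes <prefix>.warc.gz by default
    ]


def make_tarball(files: list[Path], out_path: Path) -> Path:
    """Pack already-downloaded dump files into a gzip-compressed tarball."""
    with tarfile.open(out_path, "w:gz") as tar:
        for f in files:
            tar.add(f, arcname=f.name)  # store under the base name, not the full path
    return out_path
```

For example, `wget_warc_command("urls.txt", "dumps")` returns an argument list that could be handed to `subprocess.run`, while `make_tarball` covers the case where the HTTP exchange is not worth preserving and only the files matter.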
[00:47] well, whether to use --warc-file or not
[00:49] I previously used it for https://archive.org/details/dmoz-rdf-20150327 got a 27GB file and it's still not "approved" in the Archive Team collection, aand for the data I am going to save (not dmoz but similar "data dumps") preserving the HTTP requests is probably not all that important
[00:50] if you have a bunch of URLs you can throw them into a file and !ao < FILE them to archivebot
[00:50] For me, it depends on how I expect people will find the data. If they'd run into it in the Wayback Machine, then sure, I'll do WARCs. If they're more likely to find it by either a new link to it or browsing the IA website, then I wouldn't bother with WARCs.
[00:51] that solves at least two open questions
[00:52] Who can help debug a pipeline script?
[00:58] *** Wolfie has joined #archiveteam
[00:58] *** Wolfie has left
[01:05] *** dugo has joined #archiveteam
[01:07] *** dugo_ has quit IRC (Read error: Operation timed out)
[01:13] Does IA have a secret archive of YouTube videos?
[01:14] https://archive.org/details/youtubecrawl
[01:18] *** yipdw_ is now known as yipdw
[01:18] Hardly seems secret. :)
[01:18] But also quite dated.
[01:19] Secret as in darkened (or whatever they call it) - " These files are currently not publicly accessible."
[01:20] Ah, hadn't tried to download any of them.
[01:20] It's not clear what is actually in the collection. Is the "crawl" just metadata or actual, full, videos?
[01:20] * aschmitz shuffles over to -bs
[01:38] I looked into it and still wish I could figure out how this youtube archive worked
[01:38] and I wish it were public :\
[01:38] I would like to relive the better days of youtube
[01:39] ...before the dark times. before the EMPIRE
[01:39] * wyatt8750 goes to -bs
[01:54] *** erbylnt has quit IRC (Read error: Connection reset by peer)
[02:12] *** pir^2 has quit IRC (-)
[02:31] *** beardicus has quit IRC (My MacBook Pro has gone to sleep. ZZZzzz…)
[02:47] *** mistym_ has quit IRC (Remote host closed the connection)
[02:48] *** donkus has quit IRC (Read error: Connection reset by peer)
[02:48] *** donkus has joined #archiveteam
[03:12] *** mistym_ has joined #archiveteam
[03:37] *** chfoo has quit IRC (Remote host closed the connection)
[03:40] *** chfoo has joined #archiveteam
[04:03] *** Ravenloft has joined #archiveteam
[04:19] *** aaaaaaaaa has quit IRC (Leaving)
[05:11] *** philpem has joined #archiveteam
[05:17] *** BlueMaxim has quit IRC (Read error: Connection reset by peer)
[05:18] *** Control-S has joined #archiveteam
[05:19] *** BlueMaxim has joined #archiveteam
[05:24] *** Ctrl-S has quit IRC (Read error: Operation timed out)
[05:24] *** Control-S is now known as Ctrl-S
[05:39] *** SN4T14_ has quit IRC (Ping timeout: 306 seconds)
[05:45] *** Stiletto has joined #archiveteam
[05:45] *** SN4T14 has joined #archiveteam
[06:17] *** donkus has quit IRC (Ping timeout: 512 seconds)
[06:47] *** a3nm has joined #archiveteam
[06:47] WHAT FORSOOTH, PRITHEE TELL ME THE SECRET WORD
[06:48] the secret word be "yahoosucks" m'lord
[06:48] thanks
[06:50] *** a3nm has quit IRC (Client Quit)
[06:50] *** a3_nm has joined #archiveteam
[06:50] *** a3_nm is now known as a3nm
[06:51] *** mistym_ has quit IRC (Remote host closed the connection)
[06:52] make a difference!
[07:13] *** primus104 has joined #archiveteam
[07:23] *** signius has quit IRC (Read error: Operation timed out)
[07:36] *** signius has joined #archiveteam
[07:50] *** primus104 has quit IRC (Leaving.)
[07:53] *** schbirid has joined #archiveteam
[08:06] *** Ymgve has joined #archiveteam
[09:55] *** primus104 has joined #archiveteam
[10:25] *** MMovie2 has joined #archiveteam
[10:27] *** MMovie has quit IRC (Ping timeout: 306 seconds)
[10:41] *** BlueMaxim has quit IRC (Quit: Leaving)
[11:06] *** anomie has quit IRC (Read error: Operation timed out)
[11:20] *** anomie has joined #archiveteam
[11:29] *** Froggypwn has quit IRC (Read error: Connection reset by peer)
[11:30] *** Froggypwn has joined #archiveteam
[11:32] didn't you know? yahoo uses bing results now
[12:15] yes
[12:18] now? more like since years ago
[12:19] yes
[12:23] since yahoo did rm -rf on /search/, expected from yahoo
[12:25] yada yada yada
[12:41] *** atomotic has joined #archiveteam
[12:44] *** beardicus has joined #archiveteam
[12:50] *** sankin has joined #archiveteam
[13:01] *** erbylnt has joined #archiveteam
[13:26] *** Ravenloft has quit IRC (Ping timeout: 265 seconds)
[13:41] *** primus104 has quit IRC (Leaving.)
[13:55] *** K4k has joined #archiveteam
[14:54] *** mistym_ has joined #archiveteam
[14:54] *** mistym_ has quit IRC (Remote host closed the connection)
[15:05] *** atomotic has quit IRC (Quit: Textual IRC Client: www.textualapp.com)
[15:07] *** mistym_ has joined #archiveteam
[15:07] *** Froggypwn has quit IRC (Read error: Connection reset by peer)
[15:10] *** Froggypwn has joined #archiveteam
[15:11] Any idea why internet archive doesn't do TIFF images?
[15:21] johtso: in what sense?
[15:21] in the deriving other image formats sense, don't see it listed in the derivation chart
[15:22] johtso: IA derives happily from TIFF
[15:22] I only see it listed in the text items section though
[15:22] will it also do the deriving if it's just an image item?
[15:23] I doubt it cares
[15:23] Derivers just let imagemagick whatever is needed and imagemagick swallows TIFFs happily
[15:23] At worst you'll have to manually select the format
[15:24] Example TIFF book https://archive.org/details/VocabolarioDellaLinguaItaliana2
[15:30] Certainly derives from TIFF.
[15:45] *** primus104 has joined #archiveteam
[15:46] *** erbylnt has quit IRC (Ping timeout: 370 seconds)
[15:48] *** primus105 has joined #archiveteam
[15:50] *** mr-b has joined #archiveteam
[15:51] *** donkus has joined #archiveteam
[15:53] *** primus104 has quit IRC (Read error: Operation timed out)
[16:03] *** mistym_ has quit IRC (Remote host closed the connection)
[16:23] *** Start-mob has joined #archiveteam
[16:35] *** erbylnt has joined #archiveteam
[16:40] *** Start has quit IRC (Disconnected.)
[16:55] *** Start-mob has quit IRC (Remote host closed the connection)
[16:55] *** aaaaaaaaa has joined #archiveteam
[16:55] *** Start-mob has joined #archiveteam
[16:58] *** Froggypwn has quit IRC (Read error: Connection reset by peer)
[16:59] *** Froggypwn has joined #archiveteam
[17:02] *** Start-mob has quit IRC (Ping timeout: 370 seconds)
[17:28] *** scyther has joined #archiveteam
[17:49] *** nertzy has quit IRC (Remote host closed the connection)
[17:50] *** habi has joined #archiveteam
[17:56] http://archive.ex.fm gone dark?
[17:59] *** habi has left
[18:07] *** nertzy has joined #archiveteam
[18:21] *** nertzy has quit IRC (Quit: This computer has gone to sleep)
[18:32] *** nertzy has joined #archiveteam
[18:44] *** Start has joined #archiveteam
[18:45] *** Start has quit IRC (Read error: Connection reset by peer)
[18:45] *** Start has joined #archiveteam
[18:55] *** Start has quit IRC (Disconnected.)
[19:03] *** mistym has quit IRC (Quit: Leaving)
[19:07] *** Start has joined #archiveteam
[19:17] *** nertzy has quit IRC (Quit: This computer has gone to sleep)
[19:20] *** mistym has joined #archiveteam
[19:25] *** Start has quit IRC (Disconnected.)
[19:26] *** FMecha has joined #archiveteam
[19:26] just some notice about 4chan's stuff
[19:27] first, /sp/ is now archived again by totally.not4plebs.org (spin-off of 4plebs)
[19:27] second, archive.moe is no longer archiving /g/
[19:28] they are archiving /fit/ from heinessen
[19:29] *** monod has joined #archiveteam
[19:38] *** cadbury_ has joined #archiveteam
[19:43] *** FMecha has quit IRC (http://www.kiwiirc.com/ - A hand crafted IRC client)
[19:45] *** Selanda has quit IRC (Ping timeout: 255 seconds)
[19:46] *** Start-mob has joined #archiveteam
[19:49] *** SN4T14_ has joined #archiveteam
[19:51] *** mistym has quit IRC (Remote host closed the connection)
[19:53] *** SN4T14 has quit IRC (Ping timeout: 306 seconds)
[19:59] *** Start-mob has quit IRC (Remote host closed the connection)
[20:01] *** atomotic has joined #archiveteam
[20:09] *** Start-mob has joined #archiveteam
[20:12] *** mistym has joined #archiveteam
[20:18] *** dashcloud has quit IRC (Ping timeout: 260 seconds)
[20:18] *** Start-mob has quit IRC (Leaving)
[20:18] *** Start-mob has joined #archiveteam
[20:18] *** SimpBrain has quit IRC (Quit: Leaving)
[20:20] *** dashcloud has joined #archiveteam
[20:24] *** K4k has quit IRC (Quit: WeeChat 1.0.1)
[20:26] *** nertzy has joined #archiveteam
[20:32] *** Start has joined #archiveteam
[20:33] *** Start-mob has quit IRC (Leaving)
[20:34] *** Start-mob has joined #archiveteam
[20:36] *** Start-mob has quit IRC (Remote host closed the connection)
[20:44] *** nertzy has quit IRC (Quit: This computer has gone to sleep)
[20:54] *** atomotic has quit IRC (Quit: My Mac has gone to sleep. ZZZzzz…)
[20:54] *** sankin has quit IRC (Leaving.)
[20:56] *** atomotic has joined #archiveteam
[20:59] *** scyther has quit IRC (Read error: Connection reset by peer)
[20:59] *** Selanda has joined #archiveteam
[21:06] *** acridAxid has quit IRC (Quit: Quitting)
[21:09] *** acridAxid has joined #archiveteam
[21:09] *** BlueMaxim has joined #archiveteam
[21:12] *** dashcloud has quit IRC (Ping timeout: 260 seconds)
[21:13] *** atomotic has quit IRC (Quit: Textual IRC Client: www.textualapp.com)
[21:14] *** atomotic has joined #archiveteam
[21:16] *** dashcloud has joined #archiveteam
[21:21] *** Start has quit IRC (Read error: Connection reset by peer)
[21:21] *** Start has joined #archiveteam
[21:23] *** Start has quit IRC (Read error: Connection reset by peer)
[21:23] *** Start has joined #archiveteam
[21:26] *** Start-mob has joined #archiveteam
[21:38] *** atomotic has quit IRC (Quit: Textual IRC Client: www.textualapp.com)
[21:42] *** Start-mob has quit IRC (Ping timeout: 370 seconds)
[21:46] *** Start_ has joined #archiveteam
[21:46] *** Start has quit IRC (Read error: Connection reset by peer)
[21:51] *** Start_ is now known as Start
[21:53] *** nertzy has joined #archiveteam
[21:53] *** Selanda has quit IRC (Read error: Operation timed out)
[22:07] *** Selanda has joined #archiveteam
[22:08] *** monod has quit IRC (Ping timeout: 512 seconds)
[22:11] *** T31m_ has quit IRC (Read error: Connection reset by peer)
[22:12] *** T31M has joined #archiveteam
[22:20] *** Selanda has quit IRC (Quit: leaving)
[22:23] *** Start has quit IRC (Disconnected.)
[22:27] *** Start-mob has joined #archiveteam
[22:27] *** nertzy has quit IRC (Quit: This computer has gone to sleep)
[22:33] *** Selanda has joined #archiveteam
[22:34] *** Start-mob has quit IRC (Remote host closed the connection)
[22:38] ===== When running the warrior, please don't use proxies or Tor with it. It may cause junk (such as Tor exit is blocked pages) to be archived instead of actual content. ======
[22:41] *** nertzy has joined #archiveteam
[22:55] *** Start has joined #archiveteam
[23:07] *** philpem has quit IRC (Ping timeout: 260 seconds)
[23:09] *** Start-mob has joined #archiveteam
[23:12] *** wp494_ has joined #archiveteam
[23:15] *** wp494 has quit IRC (Ping timeout: 740 seconds)
[23:17] *** nertzy has quit IRC (Quit: This computer has gone to sleep)
[23:18] arkiver: i've got ops again in #helpus
[23:24] *** Ymgve has quit IRC ()
[23:35] *** nertzy has joined #archiveteam
[23:58] *** primus has quit IRC (Read error: Operation timed out)