[00:00] *** arkhive has left
[00:16] arkiver: status is Stalled, but it was linked in #revspace oday
[00:16] today*
[00:16] so idk
[00:51] still confused - why does archive.org retroactively apply a robots.txt?
[00:51] it just means that a domain parker can ruin more days than they already have
[01:01] I presume it probably means less trouble for them, less site owners yelling and causing issues
[01:09] Is there a way to get the content that was blocked retroactively? Not got any site in mind at the moment but it just came up in conversation on another IRC channel
[01:09] wyatt8740: officially, no.
[01:09] I know archive team often has something stashed away, but there's no public way, is there?
[01:09] ah, okay
[01:10] * joepie91 coughs
[01:10] * joepie91 coughs very loudly and inconspicuously in the general direction of wyatt8740
[01:12] cover your mouth joepie91
[01:12] * joepie91 spreads bacteria
[01:37] wyatt8740: because it's *really hard* to tell, in general, if the current person with access to a domain has no rights to content previously hosted there. And if they *do* have rights, then IA doesn't want to piss them off by making available material they don't want made available.
[01:38] I understand the second use case
[01:38] I tried to find the old runescape website circa 2006 once
[01:38] that was understandable though disappointing
[01:40] which use case do you not understand, then?
[01:58] the basic thing is IA has a very small staff and following robots.txt is an easy way to get people off their backs without having to spend a lot of time processing requests from people to take down stuff
[01:58] it's unfortunate but I do see the reasoning
[01:59] Could try getting domain parkers to make their robots.txt IA-friendly, but good luck with that
[02:00] the issue of parked domains comes up from time to time and it would probably be possible to have some better handling of that but we aren't them so there's not much point in asking us
[02:00] although, I think most of the major squatters only break certain, usually irrelevant subdirectories
[02:01] there are also known bugs with the way IA interprets robots.txt files -- I know of at least one example where the file doesn't disallow access, but the Wayback Machine does.
[02:01] but it's a low priority
[02:02] That doesn't sound like something that should be hard to fix, but I'm not them
[02:03] anyway it's good that third-part WARCs like ours are still available even if the Wayback Machine blocks playback
[02:03] third-party*
[02:04] https://archive.org/donate/
[02:04] yep. :-)
[02:04] i'm pretty damn impressed by what 170 people can do
[02:04] When I get some money I plan to :p
[02:05] bwn: I did, during the telethon. They sent me cookies. :-)
[02:05] 170? That's it? Neat
[02:05] And a lot of that are the manual work of book scanning.
[02:05] which is important, don't get me wrong
[02:06] for sure
[02:07] i can't imagine, i've almost been driven insane making photocopies at work ;)
[02:10] JesseW: heh. Tumblr. Yeah, being owned by Yahoo is worrisome, but it's one of their more successful assets. However, it has no business model
[02:10] at least I don't think it does. I never saw any ads
[02:11] It has quite a lot of ads in the *dashboard* now.
[02:11] hmm
[02:11] If you just read people's blogs, you won't see them, though.
[02:11] It's not a *good* business model.
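A rough sketch of the robots.txt point raised earlier in the log: if playback is gated on whatever robots.txt a domain serves today, a new owner's blanket rule also hides captures made under the previous owner. This uses Python's standard `urllib.robotparser`. The two robots.txt bodies, the `parked.example` URL, and the `ia_archiver` user-agent token are assumptions for illustration only; it does not claim to reproduce how the Wayback Machine actually evaluates these files.

```python
from urllib.robotparser import RobotFileParser


def allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if robots_txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Hypothetical robots.txt served by the original site owner.
ORIGINAL_OWNER = "User-agent: *\nDisallow: /private/\n"

# Hypothetical robots.txt served later by a domain parker.
DOMAIN_PARKER = "User-agent: *\nDisallow: /\n"

# Hypothetical page that was captured while the original owner ran the site.
URL = "http://parked.example/essays/2006/index.html"

print(allowed(ORIGINAL_OWNER, "ia_archiver", URL))
print(allowed(DOMAIN_PARKER, "ia_archiver", URL))
```

Checking old captures against the current file is what makes the block retroactive: the page itself never changed, only the answer to the robots.txt question.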
[02:11] I've "archived" some porn
[02:11] :p
[02:13] I would throw this in ArchiveBot but it looks like it's all on Wayback anyway with all the images http://positivedoodles.tumblr.com/
[02:20] I'm mostly interested in saving various really thoughtful and insightful essays that people have written, which will get obliterated once Tumblr dies.
[02:37] *** bwn has quit IRC (Ping timeout: 492 seconds)
[03:37] For a while my no-ip site was blocked by robots.txt because some other no-ip site blocked the archiver
[03:38] I did scan complete atari service manuals at work once
[03:38] still got an IBM ASCII serial terminal I have to scan the manual for, but it's not in a 3 ring binder so I need to set up a camera + tripod + light style book 'scanner'.
[03:39] (the archive bot has already crawled the atari manuals a number of times)
[03:40] It'd be nice if I could get a job at some place like the IA. But I have no degree and am in Indiana so technology jobs are scarce, especially for people without one.
[03:41] C programming and knowledge of analog electronics don't get you far without that paper
[03:41] :(
[03:41] 170 is the number at the internet archive, right?
[03:41] not here
[04:06] *** Sk1d has quit IRC (Ping timeout: 194 seconds)
[04:12] *** Sk1d has joined #archiveteam-bs
[04:27] JesseW: That's a good cause. Fortunately it does look like a lot of it is in the Wayback machine in some form.
[04:27] just from IA crawling
[04:40] *** BlueMaxim has quit IRC (Read error: Operation timed out)
[04:48] *** Stilett0 is now known as Stiletto
[04:49] *** BlueMaxim has joined #archiveteam-bs
[04:50] there is a lot in there, but there is also a lot *not* in there. Tumblr is REALLY BIG
[04:51] Tumblr is big like Wordpress.com or blogger.com are big.
:-(
[04:52] *** bwn has joined #archiveteam-bs
[06:11] *** GLaDOS has quit IRC (Read error: Connection reset by peer)
[06:15] *** GLaDOS has joined #archiveteam-bs
[06:45] *** BlueMaxim has quit IRC (Read error: Operation timed out)
[06:46] *** BlueMaxim has joined #archiveteam-bs
[06:51] *** JesseW has quit IRC (Read error: Operation timed out)
[06:58] *** tomwsmf-a has quit IRC (Ping timeout: 259 seconds)
[06:58] *** Jonimus has quit IRC (Read error: Operation timed out)
[06:59] *** Honno has joined #archiveteam-bs
[07:01] *** Honno_ has joined #archiveteam-bs
[07:03] *** Jonimus has joined #archiveteam-bs
[07:03] *** swebb sets mode: +o Jonimus
[07:06] *** metalcamp has joined #archiveteam-bs
[07:08] *** Honno has quit IRC (Read error: Operation timed out)
[07:17] *** BlueMaxim has quit IRC (Read error: Connection reset by peer)
[07:18] *** BlueMaxim has joined #archiveteam-bs
[07:46] *** bwn has quit IRC (Ping timeout: 492 seconds)
[08:37] *** bwn has joined #archiveteam-bs
[08:37] *** Jonimus has quit IRC (Read error: Operation timed out)
[08:50] *** Jonimus has joined #archiveteam-bs
[08:50] *** swebb sets mode: +o Jonimus
[09:08] https://www.newscientist.com/round-up/best-of-new-scientist-all-free-until-13-april-2016 O.o
[10:00] *** vitzli has joined #archiveteam-bs
[11:14] *** mksplg has quit IRC (WeeChat 0.4.2)
[11:19] *** vitzli has quit IRC (Leaving)
[11:26] HCross: any ideas on how to archive that?
[11:27] the "by logging in for free" bit seems like it could be slightly problematic
[11:27] Not sure how to get it, but https://www.newscientist.com/search/?s=* has a load of articles
[11:28] HCross: yeah, I know someone who's currently crawling those search results to get a list of URLs
[12:05] Looking at this site’s source code the first thing you’ll see is: . Not kidding.
[12:06] (Only when logged in.)
[12:08] *** BlueMaxim has quit IRC (Quit: Leaving)
[12:55] As for downloading the site: `curl -D - -F 'log=erkuiteiae@dontsendmespam.de' -F 'pwd=Scij@swem' -F 'rememberme=forever' 'https://www.newscientist.com/ns-login.php'`, extract newscientist-auth-cookie, wget/wpull/grab-site(?).
[13:06] thanks PurpleSym :)
[13:07] You can use that account, btw. It’s a throwaway.
[13:14] *** vitzli has joined #archiveteam-bs
[13:42] *** GLaDOS has quit IRC (Quit: Oh crap, I died.)
[13:46] *** GLaDOS has joined #archiveteam-bs
[13:53] *** Honno_ has quit IRC (Read error: Operation timed out)
[14:00] *** RichardG has quit IRC (Quit: Keyboard not found, press F1 to continue)
[14:03] *** RichardG has joined #archiveteam-bs
[14:11] *** lytv has quit IRC (Ping timeout: 633 seconds)
[14:17] *** lytv has joined #archiveteam-bs
[14:57] This is pretty awesome: it's now possible to accurately display NFOs and other items using DOS-specific fonts natively in the web browser: https://defacto2.wordpress.com/2016/04/05/ascii-nfo-art/
[15:09] *** Honno has joined #archiveteam-bs
[15:28] *** remsen has quit IRC (Read error: Operation timed out)
[15:28] *** remsen has joined #archiveteam-bs
[15:35] *** Jonimus has quit IRC (ircd.choopa.net irc.choopa.net)
[15:35] *** Coderjoe has quit IRC (ircd.choopa.net irc.choopa.net)
[15:35] *** mr-b has quit IRC (ircd.choopa.net irc.choopa.net)
[15:35] *** lbft_ has quit IRC (ircd.choopa.net irc.choopa.net)
[15:35] *** beardicus has quit IRC (ircd.choopa.net irc.choopa.net)
[15:35] *** kvieta has quit IRC (ircd.choopa.net irc.choopa.net)
[15:35] *** Mayonaise has quit IRC (ircd.choopa.net irc.choopa.net)
[15:38] *** lbft has joined #archiveteam-bs
[15:42] *** Jonimus has joined #archiveteam-bs
[15:42] *** mr-b has joined #archiveteam-bs
[15:42] *** beardicus has joined #archiveteam-bs
[15:42] *** kvieta has joined #archiveteam-bs
[15:42] *** Mayonaise has joined #archiveteam-bs
[15:42] *** irc.choopa.net sets mode: +o Jonimus
[15:42] *** swebb sets mode: +o
Jonimus
[15:44] *** balrog has quit IRC (Read error: Operation timed out)
[15:47] *** balrog has joined #archiveteam-bs
[15:47] *** swebb sets mode: +o balrog
[15:48] *** dxrt has quit IRC (Ping timeout: 370 seconds)
[15:48] *** dxrt has joined #archiveteam-bs
[15:49] *** Honno_ has joined #archiveteam-bs
[15:52] *** Start has quit IRC (Quit: Disconnected.)
[15:55] *** Start has joined #archiveteam-bs
[16:00] *** Honno has quit IRC (Read error: Operation timed out)
[16:00] *** JesseW has joined #archiveteam-bs
[16:13] *** Mayonaise has quit IRC (Read error: Operation timed out)
[16:13] *** dxrt has quit IRC (Read error: Operation timed out)
[16:14] *** mr-b has quit IRC (Read error: Operation timed out)
[16:14] *** Coderjoe has joined #archiveteam-bs
[16:14] *** mr-b_ has joined #archiveteam-bs
[16:14] *** mr-b_ is now known as mr-b
[16:14] *** dxrt has joined #archiveteam-bs
[16:15] *** kvieta has quit IRC (Read error: Operation timed out)
[16:15] *** beardicus has quit IRC (Read error: Operation timed out)
[16:17] bsmith093: added to the list
[16:18] *** Jonimus has quit IRC (Read error: Operation timed out)
[16:20] JesseW, I have a small hash 'database' with about 1 million files, crc32+md5+sha1+sha2+sha3 :P + whirlpool/aich/ed2k, where available (I changed the format). This was my interest in IA's checksums - so it could be possible to crosscheck/find already archived files
[16:21] nice!
[16:23] and it was the reason behind visualization part, but I've not touched it yet
[16:26] my main interest is mapping between md5/sha1 and sha256, everything else is future-proofing, hope to do ipfs hashes too
[16:29] IPFS hashes aren't unique, sadly.
[16:30] They are hashes of a particular *representation* of a bitstream, not of the unique bitstream itself.
[16:30] So they aren't so useful for reference.
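A toy illustration of the claim just made, that a hash computed over a chunked representation depends on how the data was split, while a plain hash of the bytes does not. This is explicitly not the real IPFS algorithm: `chunked_hash`, its two-level scheme, and the chunk sizes are hypothetical, chosen only to show the effect with Python's standard `hashlib`.

```python
import hashlib


def flat_hash(data: bytes) -> str:
    """SHA-256 of the raw bytes: depends only on the content."""
    return hashlib.sha256(data).hexdigest()


def chunked_hash(data: bytes, chunk_size: int) -> str:
    """Toy two-level hash (NOT IPFS): hash each chunk, then hash the
    concatenation of the chunk digests. The result depends on chunk_size."""
    digests = [
        hashlib.sha256(data[i:i + chunk_size]).digest()
        for i in range(0, len(data), chunk_size)
    ]
    return hashlib.sha256(b"".join(digests)).hexdigest()


data = b"archive me " * 1000

# Same bytes, same flat hash, regardless of how they are stored.
print(flat_hash(data) == flat_hash(bytes(data)))

# Same bytes, different chunking, different structural hash.
print(chunked_hash(data, 256) == chunked_hash(data, 1024))
```

Under this toy scheme two people holding identical files can end up with different identifiers, which is the reason given in the log for preferring plain md5/sha1/sha256 as the reference key.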
[16:30] yes, I hope to get around with a proper database
[16:31] Similar to BitTorrent magnet links, which are a hash of a particular torrent's info block, which can be arbitrarily modified while still referring to the same actual files.
[16:31] ipfs hashes for files are mostly ok, except they could refer to different hashing algorithms
[16:32] 'database' currently looks like: https://archive.org/download/isohunt.moonshine.2016/isohunt.moonshine.2016.rhash.txt
[16:34] no, the problem with ipfs hashes is that it's hashing the particular structure of sub-blocks, which can freely change arbitrarily at any time (for efficiency or other reasons)
[16:34] hashes are going to be in one table, filenames go into another, and ipfs hashes go into hash->ipfs mapping table
[16:34] (as I understand it)
[16:35] I thought hashes for files/large objects are stable
[16:35] until the file itself has not been changed
[16:35] not as I understand it
[16:36] but I don't understand IPFS very well at all (yet)
[16:36] I don't either :)
[16:41] *** schbirid has joined #archiveteam-bs
[16:42] *** kvieta has joined #archiveteam-bs
[16:42] *** Jonimus has joined #archiveteam-bs
[16:42] *** swebb sets mode: +o Jonimus
[17:02] *** Mayonaise has joined #archiveteam-bs
[17:18] *** beardicus has joined #archiveteam-bs
[17:19] *** vitzli has quit IRC (Leaving)
[17:37] *** Jonimus has quit IRC (Ping timeout: 633 seconds)
[17:40] *** Honno_ has quit IRC (Read error: Operation timed out)
[17:40] *** Mayonaise has quit IRC (Ping timeout: 633 seconds)
[17:41] *** Mayonaise has joined #archiveteam-bs
[17:43] *** Honno has joined #archiveteam-bs
[17:46] *** Sanky has joined #archiveteam-bs
[17:49] *** kvieta has quit IRC (Ping timeout: 633 seconds)
[17:52] TIL that IA received (in 2014) a copy of the BBC's website as of 1995 -- and it's available through the Wayback Machine: https://web.archive.org/web/19950301190227/http://www.bbcnc.org.uk/bbctv/bbctv.html
[17:53] *** kvieta has joined #archiveteam-bs
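A minimal sketch of the three-table layout described at 16:34 (hashes in one table, filenames in another, and a hash->ipfs mapping table), using Python's standard `sqlite3`, `hashlib` and `zlib`. All table names, column names and function names here are hypothetical; the actual database format used by the speaker is not shown in the log, and only a subset of the listed hash types is computed.

```python
import hashlib
import sqlite3
import zlib


def file_hashes(data: bytes) -> dict:
    """Compute a few of the checksums mentioned in the log for one file."""
    return {
        "crc32": format(zlib.crc32(data) & 0xFFFFFFFF, "08x"),
        "md5": hashlib.md5(data).hexdigest(),
        "sha1": hashlib.sha1(data).hexdigest(),
        "sha256": hashlib.sha256(data).hexdigest(),
    }


def open_db(path: str = ":memory:") -> sqlite3.Connection:
    """Create the hypothetical schema: hashes, filenames, hash->ipfs."""
    db = sqlite3.connect(path)
    db.executescript("""
        CREATE TABLE IF NOT EXISTS hashes (
            id INTEGER PRIMARY KEY,
            crc32 TEXT, md5 TEXT, sha1 TEXT,
            sha256 TEXT UNIQUE
        );
        CREATE TABLE IF NOT EXISTS filenames (
            hash_id INTEGER REFERENCES hashes(id),
            name TEXT
        );
        CREATE TABLE IF NOT EXISTS ipfs (
            hash_id INTEGER REFERENCES hashes(id),
            ipfs_hash TEXT
        );
    """)
    return db


def add_file(db: sqlite3.Connection, name: str, data: bytes) -> int:
    """Record a file; identical content shares one row in `hashes`."""
    h = file_hashes(data)
    db.execute(
        "INSERT OR IGNORE INTO hashes (crc32, md5, sha1, sha256) "
        "VALUES (:crc32, :md5, :sha1, :sha256)", h)
    (hash_id,) = db.execute(
        "SELECT id FROM hashes WHERE sha256 = ?", (h["sha256"],)).fetchone()
    db.execute("INSERT INTO filenames (hash_id, name) VALUES (?, ?)",
               (hash_id, name))
    return hash_id


def sha256_for_md5(db: sqlite3.Connection, md5: str):
    """Map an md5 to the matching sha256, or None if it is unknown."""
    row = db.execute(
        "SELECT sha256 FROM hashes WHERE md5 = ?", (md5,)).fetchone()
    return row[0] if row else None
```

Keeping the IPFS identifiers in their own mapping table lets one content row carry several of them, which fits the concern raised in the log that such identifiers may not be stable.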
[17:56] and the WARC is available: https://archive.org/details/bbcnc.org.uk-19950301
[18:00] *** Mayonaise has quit IRC (Ping timeout: 633 seconds)
[18:00] also, there's a typo in the description for some of the early crawls: https://archive.org/details/IA-001140-c -- this is not from 19**56**. :-)
[18:20] *** kvieta has quit IRC (Read error: Operation timed out)
[18:22] *** beardicus has quit IRC (Read error: Operation timed out)
[18:46] *** kvieta has joined #archiveteam-bs
[18:46] *** dashcloud has quit IRC (Read error: Connection reset by peer)
[18:49] *** dashcloud has joined #archiveteam-bs
[18:55] *** Mayonaise has joined #archiveteam-bs
[19:04] *** beardicus has joined #archiveteam-bs
[19:06] *** BlueMaxim has joined #archiveteam-bs
[19:17] *** Jonimus has joined #archiveteam-bs
[19:17] *** swebb sets mode: +o Jonimus
[19:19] *** JesseW has quit IRC (Ping timeout: 370 seconds)
[19:24] https://imguwut.com/trump.mp4
[19:25] *** schbirid has quit IRC (Quit: Leaving)
[19:25] SketchCow yipdw_ swebb DFJustin godane anybody have a myspleen archive? i've heard there were toonheads episodes in there somewhere.
[19:26] animation history is cool as hell, and it does not deserve to rot behind private trackers and on old nhs taper
[19:26] *vhs tapes
[19:33] *** Stiletto is now known as Stilett0
[19:35] what is toonheads?
[19:35] ToonHeads was an animated showcase of Metro-Goldwyn-Mayer & Warner Bros. cartoon shorts, prominently by animators and voice actors like: Mel Blanc, Tex Avery, Hugh Harman, Rudy Ising, David H. DePatie, Friz Freleng, Chuck Jones, William Hanna, Joseph Barbera, and Daws Butler uncut.
[19:36] there is an incomplete set on myspleen
[19:36] 49 episodes of it i think
[19:36] and there has been no release anywhere publicly (like dvd/blu-ray/etc) since it aired?
[19:37] *** dashcloud has quit IRC (Read error: Operation timed out)
[19:40] *** Stilett0 has quit IRC (Ping timeout: 260 seconds)
[19:43] godane: holy crap it's been years and it's still there?! grab it please?
[19:44] atrocity: none whatsoever, and there probably never will be, except for the single partial episode on the looneytunes golden collection
[19:45] hmm
[19:50] why was there only a partial episode on that? lol
[19:50] atrocity godane i have one episode i found on mediafire archive.org/details/ToonheadsArchive
[19:50] atrocity: because wb sucks!
[19:51] *** dashcloud has joined #archiveteam-bs
[19:52] lol
[19:55] godane: ping, can you get it off myspleen?
[19:59] yes
[20:02] godane: thanks! fos, please?
[20:02] fos?
[20:03] ok
[20:03] FOS is a server we use for projects
[20:04] ahh
[20:04] bsmith093: what made you think of toonheads and myspleen? lol
[20:06] atrocity: i've been looking for that show for years, on and off. i found a thread on a 4chan archive from 2011 that said a bit of it was on myspleen, but that does me no good, as myspleen's been closed to invites for a while now
[20:07] ahh, ok. just random, hah
[20:08] godane: there's actually already a myspleen folder on fos; that's what made me think of asking here.
[20:20] *** bwn has quit IRC (Ping timeout: 246 seconds)
[20:42] *** JesseW has joined #archiveteam-bs
[20:43] *** Stiletto has joined #archiveteam-bs
[20:51] *** tomwsmf-a has joined #archiveteam-bs
[20:55] *** bwn has joined #archiveteam-bs
[21:05] *** Honno has quit IRC (Read error: Operation timed out)
[21:17] Back home tonight, then a week of solid work.
[21:23] Welcome home (soon)
[21:27] SketchCow: BTW, could you give me a link to a real example of a non-indexed item on IA? I have been told they exist, but haven't seen an actual one.
[21:42] *** ErkDog_ has joined #archiveteam-bs
[21:44] *** ErkDog has quit IRC (Read error: Operation timed out)
[21:44] *** ErkDog_ is now known as ErkDog
[21:47] ugh, i hate having to resize partitions
[21:57] *** dashcloud has quit IRC (Read error: Operation timed out)
[22:01] *** dashcloud has joined #archiveteam-bs
[22:33] ditto
[22:34] Ha ha no.
[22:34] But yes, they exist.
[22:34] Not the most effective thing
[22:34] People just keep giving away the links to guys in IRC channels
[22:39] Well, I presumed we wouldn't mention the actual URL in the public channel. And I certainly wouldn't mention it anywhere in public. I am just interested in seeing one to check if they operate similarly to the various other categories of items I have already seen (specifically, fully-public items, stream-only items, items with private files, darked items, deleted items and unused identifiers).
[22:39] As part of my IA census efforts.
[22:41] SketchCow: Admittedly, I'm also curious about what the general type(s) of things are considered safe to make public, but not safe enough to make indexed by the internal search engine. But I certainly don't care about any specific ones (not knowing what they are).
[22:50] *** pikhq has quit IRC (Ping timeout: 506 seconds)
[22:50] *** i0npulse has quit IRC (Ping timeout: 506 seconds)
[22:52] *** pikhq has joined #archiveteam-bs
[22:54] *** i0npulse has joined #archiveteam-bs
[23:26] *** metalcamp has quit IRC (Ping timeout: 244 seconds)
[23:30] *** Atros has joined #archiveteam-bs
[23:31] *** atrocity has quit IRC (Ping timeout: 244 seconds)
[23:32] *** midas has quit IRC (Ping timeout: 244 seconds)
[23:34] *** RichardG has quit IRC (Ping timeout: 244 seconds)
[23:34] *** midas has joined #archiveteam-bs
[23:34] *** RichardG has joined #archiveteam-bs
[23:48] *** espes__ has quit IRC (Ping timeout: 244 seconds)
[23:54] *** espes__ has joined #archiveteam-bs
[23:56] *** SimpBrain has quit IRC (Ping timeout: 244 seconds)
[23:56] *** SimpBrain has joined #archiveteam-bs