[00:02] *** PotcFdk has joined #archiveteam
[00:10] *** Ghost_of_ has quit IRC (Read error: Connection reset by peer)
[00:13] *** w0rp has joined #archiveteam
[00:13] *** mutoso has joined #archiveteam
[00:13] *** wednesday has joined #archiveteam
[00:13] *** Atom-- has joined #archiveteam
[00:13] *** ats_ has joined #archiveteam
[00:13] *** zino has joined #archiveteam
[00:13] *** z00nx has joined #archiveteam
[00:13] *** nekomune has joined #archiveteam
[00:13] *** ersi has joined #archiveteam
[00:13] *** midas has joined #archiveteam
[00:13] *** Nemo_bis has joined #archiveteam
[00:13] *** Jon has joined #archiveteam
[00:13] *** Rye has joined #archiveteam
[00:13] *** Fletcher has joined #archiveteam
[00:13] *** filippo__ has joined #archiveteam
[00:13] *** espes___ has joined #archiveteam
[00:13] *** useretai- has joined #archiveteam
[00:13] *** will has joined #archiveteam
[00:13] *** irc.underworld.no sets mode: +oo midas Nemo_bis
[00:20] *** WinterFox has joined #archiveteam
[00:35] *** afics has quit IRC (Read error: Operation timed out)
[00:35] *** afics has joined #archiveteam
[00:36] *** kyan has joined #archiveteam
[00:54] *** mismatchm has quit IRC (Ping timeout: 360 seconds)
[00:58] *** megaminxw has joined #archiveteam
[01:28] *** megaminxw has quit IRC (Quit: Leaving.)
[01:34] *** nickname has joined #archiveteam
[01:35] Has angelfire.com ever been archived?
[01:35] *** nickname has quit IRC (Client Quit)
[02:10] *** VADemon has quit IRC (Quit: left4dead)
[02:11] *** dashcloud has quit IRC (Read error: Operation timed out)
[02:12] *** dashcloud has joined #archiveteam
[02:25] *** schbirid2 has joined #archiveteam
[02:28] *** username1 has quit IRC (Read error: Operation timed out)
[02:30] nickname -- is there any mention of angelfire on the wiki?
[02:36] JesseW, nickname: http://archiveteam.org/index.php?title=Angelfire
[02:37] cool
[04:00] *** brayden has joined #archiveteam
[04:10] a large chunk of it was run through archivebot before but the whole thing hasn't been specifically archived
[04:40] *** Coderjoe has quit IRC (Read error: Connection reset by peer)
[04:40] *** Coderjoe has joined #archiveteam
[05:07] *** SN4T14 has joined #archiveteam
[05:15] *** SN4T14 has quit IRC (Read error: Operation timed out)
[05:48] I'm thinking about producing a file of (reported) IA file hashes, to be (hopefully) widely distributed, and regularly re-generated, as a way of providing some 3rd-party accountability for changes in the contents of IA items. I'd love some comments on both the general idea, and the specifics of the format.
[05:50] The format I'm currently thinking of is a 2-column, tab-delimited file, with the first column being the identifier, a forward slash, and the file path, and the 2nd column being the md5 hash, as a 32 digit hex number.
[05:51] This isn't the *most* minimal in size, but I think it's a good compromise between minimal size and simplicity
[05:58] #internetarchive.bak is doing this to some extent
[05:58] in the sense that git-annex needs content fingerprinting to know what's changed anyway and there's multiple copies of various shards
[05:59] *** zino has quit IRC (Ping timeout: 252 seconds)
[06:01] why use md5 when there are better hashes?
[06:04] *** SN4T14 has joined #archiveteam
[06:05] xmc: Because, for whatever reason, Jake at IA only included the md5 hash in the census he did in March 2015.
[06:05] o
[06:05] ok
[06:06] So in order to make a comparison with that, I need to use the same
[06:06] MD5 is a little bit more reliable than a CRC
[06:06] IA currently reports md5, sha1 and crc32 hashes for its files.
[06:07] yipdw: I know IA.BAK is doing similar (and much more so, making actual full copies of the data) -- but it doesn't (yet) cover most of IA's public collections.
[06:08] tab is probably fine provided no identifier names contain tabs
[06:09] identifiers are normally restricted to [-_.0-9A-Za-z] iirc
[06:09] no identifier name *or* file name.
[06:09] iiuc they can contain anything url-safe but IA doesn't like you to do that
[06:09] the march 2015 census file is ... somewhat odd.
[06:10] It has 113 duplicate items
[06:12] and only 13,075,195 identifiers, although the list of retrieved identifiers has 14,921,581. I don't know what happened to the missing 1,846,386.
[06:13] ... as with all censuses, sometimes there is an undercount :P
[06:14] heh
[06:15] any thoughts about whether I should have three columns: identifier filename hash OR two columns: identifier/filename hash?
[06:16] three
[06:16] three, it's uniform
[06:16] ok, that's definitive. :-) three it is.
[07:03] I've now started converting the existing census file into the hash format, as a way of seeing how big it is.
[07:03] (the census file is 21GB)
[07:05] Would it make sense for me to run an updated census from my machine, or would it be better to bug someone with access to FOS to run it there (to minimize network delay)?
[07:06] yipdw, xmc, other people with FOS access...?
[07:06] i don't actually have fos access, sorry
[07:07] I don't think fos is a good idea
[07:07] it's often heavily loaded
[07:07] you'd probably do pretty well with a VPS on a fast backbone and 4-8 concurrent connections
[07:08] scale up as needed or until network operations at IA bans the IP
[07:08] (I don't know what that point is, but I don't think 4-8 is close)
[07:09] ha
[07:09] ok, makes sense
[07:11] are you calculating MD5 sums or fetching them?
[07:11] might want to investigate https://blog.archive.org/2013/07/04/how-to-use-the-virtual-machine-for-researchers/
[07:11] if the former how do you distinguish between a change and an error on your end
[07:12] oh neat, I didn't know they had that VM setup
[07:13] OH
[07:13] xmc: yep, I just asked for one.
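
Editor's sketch, not part of the channel traffic: the three-column, tab-delimited hash list agreed on above (identifier, filename, md5 as 32 hex digits) could look roughly like the following in Python. The function names are hypothetical and this is not the tooling actually used for the census; it only illustrates the format and how two such lists might be compared to see which reported hashes changed.

```python
import io
import re

# An md5 digest written as exactly 32 lowercase hex digits.
MD5_RE = re.compile(r"[0-9a-f]{32}")


def write_hash_list(rows, out):
    """Write (identifier, filename, md5) rows as tab-separated lines."""
    for identifier, filename, md5 in rows:
        md5 = md5.lower()
        if not MD5_RE.fullmatch(md5):
            raise ValueError(f"not a 32-digit hex md5: {md5!r}")
        # Tabs or newlines in either name would break the format.
        for field in (identifier, filename):
            if "\t" in field or "\n" in field:
                raise ValueError(f"tab or newline in field: {field!r}")
        out.write(f"{identifier}\t{filename}\t{md5}\n")


def read_hash_list(inp):
    """Read a hash list back into {(identifier, filename): md5}."""
    hashes = {}
    for line in inp:
        identifier, filename, md5 = line.rstrip("\n").split("\t")
        hashes[(identifier, filename)] = md5
    return hashes


def diff_hash_lists(old, new):
    """Return (added, removed, changed) keys between two hash lists."""
    added = sorted(k for k in new if k not in old)
    removed = sorted(k for k in old if k not in new)
    changed = sorted(k for k in new if k in old and old[k] != new[k])
    return added, removed, changed


if __name__ == "__main__":
    buf = io.StringIO()
    # Made-up example row; the digest is the md5 of the empty string.
    write_hash_list(
        [("example-item", "file.txt", "d41d8cd98f00b204e9800998ecf8427e")], buf
    )
    print(buf.getvalue(), end="")
```

Keeping identifier and filename in separate columns means a reader never has to guess where the identifier ends, which is the uniformity argument made in the channel.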
[07:13] wait that's fos
[07:13] cool
[07:13] or rather the same class as fos
[07:13] yipdw: i thought it ... yeah it kind of predates fos i guess
[07:13] big honkin machine
[07:13] well if it's not the same thing as fos, it's at least in the same subdomain
[07:14] I'm just fetching md5s.
[07:14] I'm not actually doing any verifying that any actual content *matches* the reported hashes -- merely whether what is reported has changed.
[07:15] FOS is garbage box and the one before it was garbage box too
[07:16] if that's a garbage box I'm going dumpster diving at IA
[07:16] ah, you're fetching metadata
[07:16] JesseW is the NSA
[07:19] * JesseW has been found out!
[07:19] * JesseW hands out dark glasses to everyone
[07:21] i'm up to 513k items as of today
[07:22] also kpfa has all 2007 mp3s uploaded now
[07:22] bbl
[07:30] *** BlueMaxim has quit IRC (Read error: Connection reset by peer)
[07:30] *** BlueMaxim has joined #archiveteam
[07:49] *** kyan has quit IRC (Ping timeout: 633 seconds)
[08:03] *** balrog has quit IRC (Read error: Operation timed out)
[08:31] *** antomatic has joined #archiveteam
[08:31] *** HCross2 has joined #archiveteam
[08:31] *** wutno has joined #archiveteam
[08:31] *** GLaDOS has joined #archiveteam
[08:31] *** _desu___ has joined #archiveteam
[08:31] *** Peetz0r_ has joined #archiveteam
[08:31] *** aliz has joined #archiveteam
[08:31] *** wp494 has joined #archiveteam
[08:31] *** edsu has joined #archiveteam
[08:31] *** hive-mind has joined #archiveteam
[08:31] *** Famicoma1 has joined #archiveteam
[08:31] *** ivan` has joined #archiveteam
[08:31] *** PepsiMax_ has joined #archiveteam
[08:31] *** SilSte has joined #archiveteam
[08:31] *** Kazzy has joined #archiveteam
[08:31] *** efnet.port80.se sets mode: +oooo wp494 edsu ivan` Kazzy
[08:31] *** mistym has joined #archiveteam
[08:31] *** tjg has joined #archiveteam
[08:31] *** zhongfu has joined #archiveteam
[08:31] *** pikhq has joined #archiveteam
[08:31] *** Kenshin has joined #archiveteam
[08:31] *** Rickster has joined #archiveteam
[08:31] *** Atluxity has joined #archiveteam
[08:31] *** Ctrl-S___ has joined #archiveteam
[08:31] *** VonGuard has joined #archiveteam
[08:31] *** zyphlar_ has joined #archiveteam
[08:31] *** beeper has joined #archiveteam
[08:31] *** kevin has joined #archiveteam
[08:31] *** bauruine has joined #archiveteam
[08:31] *** karissa__ has joined #archiveteam
[08:31] *** sigkell has joined #archiveteam
[08:31] *** Gfy has joined #archiveteam
[08:31] *** Fusl has joined #archiveteam
[08:31] *** hades_ has joined #archiveteam
[08:31] *** joepie91 has joined #archiveteam
[08:31] *** SadDM has joined #archiveteam
[08:31] *** efnet.port80.se sets mode: +oooo Kenshin Atluxity joepie91 SadDM
[08:31] *** redlob has joined #archiveteam
[08:31] *** JSharp___ has joined #archiveteam
[08:31] *** Vito` has joined #archiveteam
[08:31] *** deathy has joined #archiveteam
[08:31] *** abartov__ has joined #archiveteam
[08:31] *** johtso has joined #archiveteam
[08:34] *** unstable has joined #archiveteam
[08:38] *** JesseW has quit IRC (Read error: Operation timed out)
[08:44] *** megaminxw has joined #archiveteam
[08:50] *** megaminxw has quit IRC (Quit: Leaving.)
[08:51] *** megaminxw has joined #archiveteam
[08:56] *** BlueMaxim has quit IRC (Read error: Connection reset by peer)
[08:57] *** BlueMaxim has joined #archiveteam
[09:02] *** balrog has joined #archiveteam
[09:20] *** atomotic has joined #archiveteam
[10:16] *** wickedpla has joined #archiveteam
[10:17] *** wp494 has quit IRC (Ping timeout: 260 seconds)
[11:06] *** arkiver3 has joined #archiveteam
[11:12] *** atomotic has quit IRC (Ping timeout: 260 seconds)
[11:29] *** atomotic has joined #archiveteam
[11:30] *** arkiver3 has quit IRC (Ping timeout: 252 seconds)
[11:59] *** BlueMaxim has quit IRC (Quit: Leaving)
[12:10] *** MMovie has quit IRC (Read error: Operation timed out)
[12:30] *** atomotic has quit IRC (Quit: Textual IRC Client: www.textualapp.com)
[12:48] *** MMovie has joined #archiveteam
[12:53] *** eightfold has joined #archiveteam
[12:53] hi there
[12:53] any channel for general archive.org talk?
[12:56] *** Atom-- has quit IRC (Read error: Connection reset by peer)
[12:57] are you looking for search-tips, or what?
[12:57] i wonder how i can search https://archive.org/details/opensource_audio?and[]=subject%3A%22ambient%22&sort=-downloads
[12:57] but only with a certain cc-license
[12:58] (not ND or NC)
[13:00] i also wonder what rules apply when there’s no “usage” information, as with this: https://archive.org/details/pertin-nce_053
[13:00] which license applies then?
[13:02] *** VADemon has joined #archiveteam
[13:05] maybe someone knows
[13:06] hang on, and maybe you'll see
[13:06] *** Atom-- has joined #archiveteam
[13:09] eightfold: there's always #internetarchive
[13:09] phuzion: thanks!
[13:09] No problem.
[13:09] *** atomotic has joined #archiveteam
[13:42] *** RichardG has quit IRC (Quit: Keyboard not found, press F1 to continue)
[13:47] *** RichardG has joined #archiveteam
[13:55] *** mistym has quit IRC (Remote host closed the connection)
[14:03] eightfold: http://cryto.net/~joepie91/blog/2013/03/21/licensing-for-beginners/ (this answers your "what if no license is specified" question)
[14:05] joepie91: i guess you are referring to “Copyright is a whitelisting system. This means that by default, no one has the right to do anything with what you made, unless you explicitly indicate that they do.”
[14:05] but some of these are listed under “community audio”, which is also known as “open source audio”
[14:05] *** nertzy has joined #archiveteam
[14:07] eightfold: right. but anybody can upload into the community categories (which is most likely precisely why they were renamed to 'community' instead of 'open source')
[14:08] *** WinterFox has quit IRC (Remote host closed the connection)
[14:09] joepie91: ok. so technically archive.org would be fine for hosting all the copyrighted material of an online video broadcasting service?
[14:10] eightfold: go ahead - if it gets abusemail, it'll get "darked" (which means it is made inaccessible and remains that way for the time being, but still stored in the archive)
[14:11] *** nertzy has quit IRC (Quit: Leaving)
[14:12] joepie91: of course i wont go ahead, i’m not an online broadcasting service. but what you’re saying is that the internet archive is totally fine with storing other peoples copyrighted works?
[14:13] eightfold: that is not the official position, but you'll find that very few people around these regions care about copyright, and that getting stuff archived is the first priority
[14:13] you won't get in trouble for doing it anyway.
[14:14] (personally I consider copyright to be extremely harmful, but that's more of an ethical discussion and less related to the practical aspects of your question, also I'm not the Internet Archive :P)
[14:17] looks like I am completing gamefront-grab
[14:18] Need me to lend a hand?
[14:18] Ive got some spare power
[14:19] me too :)
[14:20] Kazzy: are you grabbing google-code as "Kaz"?
[14:22] *** nertzy has joined #archiveteam
[14:36] *** megaminxw has quit IRC (Quit: Leaving.)
[14:55] SketchCow: https://www.youtube.com/channel/UCxfh-2aOR5hZUjxJLQ2CIHw - bunch of videos and livestreams of conference talks, including one that's going on about JS in Svalbard right now
[15:02] *** K4k has joined #archiveteam
[15:12] Atluxity: yeah
[15:12] did i break something?
[15:21] *** Amitari has joined #archiveteam
[15:23] On the Geocities article on the wiki, it says "The content is still provided via the same patched torrent above. However, bear in mind Dragan Espenschied has completely redone the thing. Super superior. He spent a year on it.", does anyone know where I can get that improved version?
[15:23] I'm considering getting an extra hard drive just to store and seed Geocities.
[15:32] I've also had a really cool idea for experiencing all these old archived websites.
[15:34] Taking the source code from an early open-source release of Netscape, and updating it a bit to work on more modern OSes.
[15:35] *** Amitari has quit IRC (Leaving)
[15:37] have you seen http://oldweb.today/
[15:39] Kazzy: no, I just wondered what your concurrent setting on it is
[15:40] 10 concurrent
[15:40] didn't see anything relating to having to limit it so i'm not blocked by google, so just went as it
[15:41] *** atomotic has quit IRC (Quit: Textual IRC Client: www.textualapp.com)
[15:46] PSA: bash.org is gone, our latest copy is from 2015-11-17, but I don't think it's been updated recently anyway
[15:49] Kazzy: on just 1 host?
[15:49] yep
[15:49] wow...
[15:49] ?
[15:50] just thought you was so effective at getting items
[15:50] I run 1 concurrent on my 49 hosts
[15:50] joepie91: RIP bash.org
[15:51] yep
[15:51] was a matter of time I suppose
[15:51] in hindsight
[15:51] Yeah
[15:51] well, i did away with the whole increasing delay thing, so it doesn't sit idle for like 5 minutes to get a job
[15:52] aha
[15:52] thats your trick :D
[15:53] try to keep that host doing as much as possible, nice to not have it sitting idle generating noise, might as well be working
[15:56] *** antonizoo has quit IRC (Remote host closed the connection!)
[16:14] *** toad2 has joined #archiveteam
[16:20] *** toad1 has quit IRC (Read error: Operation timed out)
[16:43] *** nertzy2 has joined #archiveteam
[16:43] *** nertzy has quit IRC (Read error: Connection reset by peer)
[16:51] *** schbirid2 has quit IRC (Quit: Leaving)
[16:59] *** atomotic has joined #archiveteam
[17:07] *** Emcy has joined #archiveteam
[17:14] *** RichardG has quit IRC (Ping timeout: 258 seconds)
[17:25] *** eightfold has quit IRC (Quit: eightfold)
[17:27] *** zhongfu has quit IRC (Ping timeout: 260 seconds)
[17:32] *** atomotic has quit IRC (Quit: Textual IRC Client: www.textualapp.com)
[17:43] *** zino has joined #archiveteam
[17:45] *** K4k has quit IRC (WeeChat 1.3)
[17:54] *** K4k has joined #archiveteam
[18:02] *** VADemon has quit IRC (left4dead)
[18:14] *** schbirid has joined #archiveteam
[18:17] *** JesseW has joined #archiveteam
[18:50] *** dashcloud has quit IRC (Quit: No Ping reply in 210 seconds.)
[18:55] *** dashcloud has joined #archiveteam
[18:56] *** terburg has joined #archiveteam
[19:11] *** RichardG has joined #archiveteam
[19:15] *** K4k has quit IRC (Ping timeout: 250 seconds)
[19:18] *** K4k has joined #archiveteam
[19:19] *** atomotic has joined #archiveteam
[19:24] *** K4k_ has joined #archiveteam
[19:25] *** K4k has quit IRC (Ping timeout: 360 seconds)
[19:45] *** scyther has joined #archiveteam
[19:45] *** scyther has quit IRC (Connection closed)
[19:45] *** nertzy2 has quit IRC (Quit: This computer has gone to sleep)
[20:02] with jake of IA's advice, I'm running an updated IA census
[20:02] should be done in a couple of days. :-)
[20:03] What's that? IA census?
[20:04] *** scyther has joined #archiveteam
[20:04] ersi: http://archiveteam.org/index.php?title=Internet_Archive_Census
[20:05] Oh, right! Of course
[20:05] list of the sizes, formats and md5s of *all* the (public) files on IA
[20:05] I want to see what's changed since last March. :-)
[20:05] currently, I'm just running a recheck of the existing 14 million itemlist. Jake will get me an updated list later.
[20:07] *** terburg has quit IRC (Quit: terburg)
[20:11] *** Ghost_of_ has joined #archiveteam
[20:12] you can get access to my iron if it helps you
[20:13] It seems like its bound by IA's limits, so I'm fine.
[20:40] *** JesseW has quit IRC (Ping timeout: 246 seconds)
[20:45] *** atomotic has quit IRC (Quit: Textual IRC Client: www.textualapp.com)
[21:01] *** BlueMaxim has joined #archiveteam
[21:06] *** JetBalsa has joined #archiveteam
[21:18] *** terburg has joined #archiveteam
[21:49] *** terburg has quit IRC (terburg)
[21:49] *** scyther has quit IRC (Read error: Connection reset by peer)
[22:24] CUTE OVERLOAD
[22:30] *** mistym has joined #archiveteam
[22:34] So, I am going to start making little piles of musicbrainz
[22:34] Because that thing is past 1.1tb
[22:35] *** nertzy2 has joined #archiveteam
[23:12] *** K4k_ has quit IRC (Ping timeout: 360 seconds)
[23:13] *** nertzy2 has quit IRC (Quit: This computer has gone to sleep)