[00:02] *** www2 has quit IRC (Read error: Operation timed out)
[00:03] *** GLaDOS has quit IRC (Ping timeout: 272 seconds)
[00:04] *** GLaDOS has joined #archiveteam-bs
[00:04] *** swebb sets mode: +o GLaDOS
[00:15] *** cbb has joined #archiveteam-bs
[00:17] *** www2 has joined #archiveteam-bs
[00:26] *** cbb2 has joined #archiveteam-bs
[00:27] *** cbb has quit IRC (Ping timeout: 265 seconds)
[00:27] *** cbb2 is now known as cbb
[00:29] *** www2 has quit IRC (Read error: Operation timed out)
[00:31] *** kadercavd has joined #archiveteam-bs
[00:31] http://strawpoll.me/3100584/r vote for me and I'll have a sex chat with you, I promise
[00:32] *** chfoo sets mode: +b *!*KaderCavd@213.74.159.*
[00:32] *** kadercavd was kicked by xmc (spammer)
[00:32] TAG TEAM
[00:33] *** xmc sets mode: +o yipdw
[00:33] TAG YOU'RE IT
[00:33] wow lag
[00:33] *** yipdw sets mode: +o xmc
[00:33] wait are we playing no tagbacks
[00:34] no lagbacks
[00:35] if my students start to use that word I will be very D:
[00:36] of course I write that and use "very D:" at the same time so haha fuck my neologism hypocrisy
[00:46] *** www2 has joined #archiveteam-bs
[01:26] *** www2 has quit IRC (Read error: Operation timed out)
[01:30] *** primus104 has quit IRC (Leaving.)
[01:31] *** mistym has quit IRC (Leaving...)
[01:34] *** GLaDOS has quit IRC (Ping timeout: 272 seconds)
[01:34] *** ersi has quit IRC (Read error: Operation timed out)
[01:34] *** GLaDOS has joined #archiveteam-bs
[01:34] *** swebb sets mode: +o GLaDOS
[01:35] *** ersi has joined #archiveteam-bs
[01:35] *** swebb sets mode: +o ersi
[01:49] *** mistym has joined #archiveteam-bs
[01:54] *** Lord_Nigh has quit IRC (Read error: Operation timed out)
[01:56] *** cbb has quit IRC (Quit: cbb)
[01:57] *** Lord_Nigh has joined #archiveteam-bs
[02:09] *** LordNigh2 has joined #archiveteam-bs
[02:11] *** Lord_Nigh has quit IRC (Ping timeout: 272 seconds)
[02:11] *** LordNigh2 is now known as Lord_Nigh
[02:35] *** Nertsy has joined #archiveteam-bs
[02:37] *** chfoo has quit IRC (Remote host closed the connection)
[02:55] *** Nertsy has quit IRC (Quit: Leaving)
[02:56] *** chfoo has joined #archiveteam-bs
[03:01] *** Nertsy has joined #archiveteam-bs
[03:25] *** dx has quit IRC (Read error: Operation timed out)
[03:41] *** hashtag has joined #archiveteam-bs
[03:42] *** hashtag has left
[03:53] uploaded: https://archive.org/details/G4_Icons_S01E01
[04:11] *** RainbowCo has joined #archiveteam-bs
[04:38] *** Lord_Nigh has quit IRC (Read error: Connection reset by peer)
[04:39] *** Lord_Nigh has joined #archiveteam-bs
[04:47] *** dx has joined #archiveteam-bs
[04:58] *** aaaaaaaaa has quit IRC (Leaving)
[05:14] *** rejon has joined #archiveteam-bs
[05:58] *** Start is now known as StartAway
[06:05] *** dx has quit IRC (Ping timeout: 369 seconds)
[06:18] *** dx has joined #archiveteam-bs
[06:56] are there any projects to archive tumblr going on?
[06:57] I've got a script that sort of does it for up to a few hundred accounts
[07:01] no because tumblr is massive and nobody is willing to pay for the cost of storing it
[07:01] instead we grab individual tumblrs on a special-case basis
[07:02] okay
[07:03] http://archive.fart.website/archivebot/viewer/?q=tumblr is a partial list
[07:03] note that a significant portion of those are shallow grabs that target specific posts
[07:09] is there a standard method for scraping tumblr stuff you guys use?
[07:10] #archivebot mostly
[08:10] There's a great statement by brokep about how unhappy he is with where The Pirate Bay has gone
[08:10] (Not the fact it disappeared for the moment - he actually approves of that.)
[08:10] He thinks that it got handed down and handed down until it hit the lowest common denominator, and now it was just ads and stunts.
[08:14] what will replace the pirate bay?
[08:14] libre-piratebay?
[08:17] *** mistym has quit IRC (Remote host closed the connection)
[08:19] *** brayden has quit IRC (Read error: Operation timed out)
[08:19] Oh god who knows
[08:21] i think a mesh network that works sort of like twitter/reddit/facebook is in order
[08:21] i have been thinking of this for awhile
[08:21] my idea is a git-like mesh network
[08:22] main concern is spammers of various sorts. MPAA would need to be unable to break the network just by setting up a few K VM instances
[08:22] SketchCow: there's actually thepiratebay.ee which seems to be run by someone else (possibly as a honeypot, since https doesn't work?)
[08:22] and is still up
[08:22] It's down
[08:22] .ee is?
[08:23] http://thepiratebay.ee/recent looked up to me a few minutes ago
[08:23] where the people you follow on a twitter-like app would be mirrored to your phone/desktop
[08:24] but given that it doesn't appear to be run by the main tpb 'team' i can't vouch for the accuracy of the torrents there. then again, neither could tpb
[08:24] is it worth scraping that?
[08:25] (i'm guessing yes)
[08:26] it looks like it might have a copy of the db up until < 6 hours before the main site was taken down
[08:26] since i see nothing newer than 12/09 10:32
[08:28] if that's the case, then .ee should be scraped as quickly as we can manage it
[08:28] starting with id 10,000,000 and going up
[08:28] since we have everything from 9,999,999 and down
[08:29] from other sources
[08:29] main discussion is in #yarharfiddlededee
[08:30] it isn't really down anyway, the police grabbed a loadbalancer
[08:30] and a dns server
[08:31] so... maybe .ee is actually the 'real thing'? that doesn't make sense, since .ee has no https, and .ee has that weird $5-per-year popup when trying to get magnet links (which you can enter any 6 digit numerical password to and it will work)
[08:31] nah
[08:31] (696969 works)
[08:31] ee can be anything, i'm not sure
[08:31] the dns server runs the internal dns traffic
[08:31] ee seems to follow the same numbering tpb did
[08:34] .. and now .ee is down
[08:34] the domain name or ip or both?
[08:34] no, it's not down
[08:34] it's ... unstable
[08:34] random 502 errors
[08:34] seems other people have found it
[08:35] the ids DO exactly match the post-10000000 ids from the actual .se site
[08:35] can i ask for help setting up the warrior thing here?
[08:35] so i think scraping .ee is a VERY good idea
[08:37] All metro newspapers of brazil-brasilia of 2013 uploaded
[08:37] it's extremely sad that the .ee site, which has its origins in (and may still technically be) a mirror of the original intended for scamming and phishing, may be the last running copy of all magnet ids after 10,000,000
[08:39] the magnets are definitely valid!
[08:39] when I try to set the rate limit using the command from the wiki I get an error: "Syntax error: Invalid parameter '--name'"
[08:40] but: it is missing the comments from thepiratebay.se
[08:40] all of them
[08:40] afaict
[08:41] so it itself is a skimmed copy
[08:41] well....
can we skim the skimmed copy?
[08:43] we could also troll google cache until it comes down for the .se site and get comments etc
[08:43] that's maybe a week tops
[08:45] *** primus104 has joined #archiveteam-bs
[08:55] this is amazing stuff: http://vimeo.com/12672088
[09:41] *** ivan has joined #archiveteam-bs
[09:44] I have been rsync'ing gentoo's distfiles very frequently for two years without deleting anything that they remove, but it is of low value to me and I don't have fast upstream to dump the 500GB+ somewhere
[09:45] if you want something from it, let me know before I delete it
[09:45] ivan: wait, what are distfiles?
[09:46] all the tarballs that the ebuilds grab
[09:46] are these available elsewhere in this format?
[09:46] old stuff gets removed from distfiles
[09:46] (historically)
[09:46] presumably 99.9% of it is in git repos elsewhere
[09:47] hrm. are they github-generated tarballs?
[09:47] or actual releases?
[09:47] they're mostly official releases
[09:47] any chance you can upload them over a long period of time?
[09:47] (I probably have an rsync target for it)
[09:48] no but maybe I can mail a drive to someone in the US
[09:48] * joepie91 is not in the US
[09:49] what sort of content is this?
[09:50] Ctrl-S: basically, historical (source) releases of software
[09:50] definitely an archival value to it imo
[09:50] could you just upload the differences between versions?
[09:50] that's... actually not a bad idea
[09:50] sort of like wikis do
[09:50] ivan: are you aware of any methods for having a 'source' file and having delta'd "derived" files?
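The sequential-ID scrape of thepiratebay.ee proposed earlier (start at 10,000,000 and count up, surviving the mirror's random 502s) could be sketched as follows. This is an illustrative sketch only: the `/torrent/<id>` URL layout is assumed from the .se site's numbering, and `fetch_torrent_page` with its backoff policy is hypothetical, not the actual warrior project code.

```python
import time
import urllib.error
import urllib.request

# assumed URL layout, mirroring the .se site's numbering (unverified)
BASE = "http://thepiratebay.ee/torrent/{}"

def fetch_torrent_page(torrent_id, retries=5):
    """Fetch one torrent page, retrying with backoff on the mirror's 502s."""
    for attempt in range(retries):
        try:
            with urllib.request.urlopen(BASE.format(torrent_id), timeout=30) as resp:
                return resp.read()
        except urllib.error.HTTPError as e:
            if e.code == 404:
                return None           # ID not present on the mirror
            if e.code == 502:         # "random 502 errors" -- back off and retry
                time.sleep(2 ** attempt)
                continue
            raise
    return None

# everything below 10,000,000 is already archived from other sources:
# for tid in range(10_000_000, 11_000_000):
#     page = fetch_torrent_page(tid)
```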
[09:51] every few hundred versions you upload the full version, and between those just the differences
[09:51] that seems like it could work here - take the first release as base, then delta every release after it
[09:51] I've programmed bizarre compression schemes like this before and it's not worth it
[09:51] I know that it's technically possible, but no idea if it already exists for this particular usecase
[09:51] I'm happy to pay the 6 bucks to mail it
[09:51] ivan: not so much compression, as diff'ing :P
[09:51] should be extremely efficient for this kind of data
[09:52] otoh
[09:52] is there a tool that does it?
[09:52] you could accomplish a near-identical result by just having one tarball per piece of software and having each release (uncompressed) in its own directory
[09:52] and using a compression format with a per-archive dictionary
[09:53] ivan: problem is mostly a mailing target :P
[09:53] I mean, unless you're planning on mailing it to NL... heh
[09:54] https://ludios.org/tmp/gentoo-distfiles.txt
[09:55] 37MB beware of browser crash
[09:57] that's a lot of packages :P
[09:57] ivan: should probably ask again in a few hours, when US-ians wake up
[09:58] (or ship it to NL)
[09:58] maybe SketchCow will take it
[09:58] is this like all gentoo files ever
[09:59] * joepie91 throws `sort` at it
[09:59] * joepie91 watches it eat a core
[10:00] Void_: about 2 years, it seems
[10:03] throw it in a git repo and let it handle the delta compression?
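The base-plus-deltas scheme discussed above (store a full release every so often, and only differences in between) can be sketched with stdlib tools. This is a minimal illustration of the idea, not an existing tool: a real implementation would use a binary-delta format such as xdelta or bsdiff rather than `difflib`, and would diff releases in their uncompressed form.

```python
import difflib

def make_delta(base, target):
    """Encode target as copy/insert opcodes against base."""
    ops = []
    # autojunk=False: bytes that occur often must not be treated as junk
    sm = difflib.SequenceMatcher(None, base, target, autojunk=False)
    for tag, i1, i2, j1, j2 in sm.get_opcodes():
        if tag == "equal":
            ops.append(("copy", i1, i2))           # reuse bytes from base
        else:
            ops.append(("insert", target[j1:j2]))  # literal new bytes
    return ops

def apply_delta(base, ops):
    """Reconstruct the target from the base and a delta."""
    out = bytearray()
    for op in ops:
        if op[0] == "copy":
            _, i1, i2 = op
            out += base[i1:i2]
        else:
            out += op[1]
    return bytes(out)

# hypothetical releases: every Nth one stored in full, the rest as deltas
v1 = b"foo-1.0 source tree contents ..." * 100
v2 = v1.replace(b"1.0", b"1.1")
delta = make_delta(v1, v2)
assert apply_delta(v1, delta) == v2
```

Copy opcodes are a few integers regardless of how much base data they cover, so for mostly-unchanged release trees the delta stays tiny compared to the full file.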
[10:03] that doesn't work for compressed tarballs
[10:07] uploaded: https://archive.org/details/news.kbs.co.kr-search-news-code-1-to-10000-20141207
[10:08] that has South Korean news from 1999-01-01 to 1999-02-21
[10:10] that ups it by about 10k: https://web.archive.org/web/*/http://news.kbs.co.kr/news/NewsView.do?SEARCH_NEWS_CODE=*
[10:10] only 2872 urls from those url types
[10:10] *** brayden has joined #archiveteam-bs
[10:55] *** schbirid has joined #archiveteam-bs
[11:15] *** www2 has joined #archiveteam-bs
[11:45] *** BlueMaxim has quit IRC (Quit: Leaving)
[11:50] *** www2 has quit IRC (Read error: Operation timed out)
[11:55] *** dx has quit IRC (Ping timeout: 265 seconds)
[11:58] *** dx has joined #archiveteam-bs
[12:06] *** Kadercavd has joined #archiveteam-bs
[12:06] I swear to love Promise me here vote
[12:06] My name is Mark Bass http://strawpoll.me/3100584 Vote
[12:06] no.
[12:06] *** Kadercavd has quit IRC (Client Quit)
[12:07] *** Kadercavd has joined #archiveteam-bs
[12:08] NO.
[12:08] *** Kadercavd has quit IRC (Client Quit)
[12:12] lol.
[12:29] *** www2 has joined #archiveteam-bs
[12:32] I.. wht
[12:32] what *
[12:32] http://m.lg.com/ph/inside-lg/christmas-beat
[12:33] apparently LG is using PDFy now?
[12:35] oh god please goatse them
[12:53] isn't that a good thing that LG's using it? we'll have all of their manuals backed up to IA automatically then
[13:14] joepie91: how much bandwidth is pdfy using nowadays?
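The objection above, that git-style delta compression gets nothing out of compressed tarballs, comes down to compression destroying byte-level similarity: a tiny change to the input scrambles most of the compressed stream, so deltas must be taken on decompressed contents. A small stdlib-only demonstration, using hypothetical release payloads and counting the literal bytes a copy/insert delta would have to carry:

```python
import difflib
import gzip

def delta_literal_bytes(old, new):
    """Literal bytes a copy/insert delta of `new` against `old` must carry."""
    sm = difflib.SequenceMatcher(None, old, new, autojunk=False)
    return sum(j2 - j1 for tag, _i1, _i2, j1, j2 in sm.get_opcodes()
               if tag != "equal")

# two hypothetical releases: 1.1 just appends one line to 1.0
release_1_0 = b"source file contents\n" * 200
release_1_1 = release_1_0 + b"one new line in 1.1\n"

# delta of the decompressed contents: only the appended line
raw_delta = delta_literal_bytes(release_1_0, release_1_1)
assert raw_delta == len(b"one new line in 1.1\n")

# delta of the .gz files: the appended line changes the Huffman coding
# of the whole deflate block, so the compressed streams diverge and the
# delta carries far more literal data
gz_delta = delta_literal_bytes(gzip.compress(release_1_0, mtime=0),
                               gzip.compress(release_1_1, mtime=0))
assert gz_delta > raw_delta
```

This is why "each release uncompressed in its own directory" (or decompress-before-delta) is the workable variant of the scheme.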
[13:18] *** sirkov has quit IRC (Ping timeout: 370 seconds)
[13:20] *** sirkov has joined #archiveteam-bs
[13:48] *** sankin has joined #archiveteam-bs
[13:53] *** sankin has quit IRC (Client Quit)
[14:04] *** sankin has joined #archiveteam-bs
[14:24] *** lrkj has quit IRC (Ping timeout: 612 seconds)
[15:38] *** StartAway has quit IRC (Read error: Operation timed out)
[15:50] *** mistym has joined #archiveteam-bs
[15:51] *** mistym has quit IRC (Remote host closed the connection)
[15:57] *** BiggieJo1 has quit IRC (Read error: Connection reset by peer)
[15:57] *** Nertsy has quit IRC (Quit: Nertsy)
[16:07] *** Nertsy has joined #archiveteam-bs
[16:13] *** Start has joined #archiveteam-bs
[16:18] *** aaaaaaaaa has joined #archiveteam-bs
[16:58] *** Start has quit IRC (Read error: No route to host)
[16:58] *** dx has quit IRC (Ping timeout: 246 seconds)
[17:02] *** Start has joined #archiveteam-bs
[17:02] *** dx has joined #archiveteam-bs
[17:15] *** mistym has joined #archiveteam-bs
[17:40] *** Start has quit IRC (Read error: Connection reset by peer)
[17:49] *** GLaDOS has quit IRC (Ping timeout: 272 seconds)
[17:50] *** GLaDOS has joined #archiveteam-bs
[17:50] *** swebb sets mode: +o GLaDOS
[18:03] *** mistym has quit IRC (Remote host closed the connection)
[18:04] *** mistym has joined #archiveteam-bs
[18:27] midas: 1.17TB last month
[18:27] peaked at 325mbps yesterday
[18:27] dashcloud: heheh
[18:29] *** mistym has quit IRC (Remote host closed the connection)
[18:30] *** mistym has joined #archiveteam-bs
[18:31] *** brayden has quit IRC (Ping timeout: 607 seconds)
[18:52] *** Start has joined #archiveteam-bs
[18:58] *** rejon has quit IRC (Ping timeout: 480 seconds)
[19:00] *** www2 has quit IRC (Ping timeout: 335 seconds)
[19:01] Has anybody here looked at Google's newspaper archive?
[19:01] I'm looking for a way to download parts of a page without crawling through the rendered DOM and figuring out what images I need
[19:02] *And* then re-assembling them.
[19:17] *** ete_ has joined #archiveteam-bs
[19:23] /buffer ccc
[19:23] freaking whitespaces
[19:41] *** aaaaaaaa_ has joined #archiveteam-bs
[19:45] *** phuzion has quit IRC (Read error: Operation timed out)
[19:47] *** xtr-201 has quit IRC (Read error: Operation timed out)
[19:47] *** aaaaaaaaa has quit IRC (Read error: Operation timed out)
[19:47] *** aaaaaaaa_ has quit IRC (Client Quit)
[19:47] *** Start has quit IRC (Read error: Operation timed out)
[19:47] *** aaaaaaaa_ has joined #archiveteam-bs
[19:47] *** phuzion has joined #archiveteam-bs
[19:48] *** xtr-201 has joined #archiveteam-bs
[19:57] *** aaaaaaaa_ has quit IRC (Ping timeout: 480 seconds)
[20:02] *** BlueMaxim has joined #archiveteam-bs
[20:05] *** mistym_ has joined #archiveteam-bs
[20:28] *** logchfoo starts logging #archiveteam-bs at Wed Dec 10 20:28:35 2014
[20:28] *** logchfoo has joined #archiveteam-bs
[20:42] *** Arkiver2 is now known as arkiver
[20:48] *** brayden has joined #archiveteam-bs
[20:58] *** kyan has quit IRC (Read error: Connection reset by peer)
[21:30] *** kyan_ has joined #archiveteam-bs
[21:33] *** www2 has joined #archiveteam-bs
[21:36] *** APerti has joined #archiveteam-bs
[21:39] *** APerti_ has quit IRC (Ping timeout: 370 seconds)
[21:49] *** Start has joined #archiveteam-bs
[21:58] *** schbirid has quit IRC (Leaving)
[22:06] * Void_ uses a dirty scanner to poke saddm
[22:06] huh
[22:08] *** mistym_ has quit IRC (Quit: Leaving...)
[22:09] *** ivan- is now known as ivan`-
[22:26] *** Start has quit IRC (Read error: Operation timed out)
[22:34] *** SN4T14_ has joined #archiveteam-bs
[22:39] *** SN4T14 has quit IRC (Ping timeout: 369 seconds)
[23:19] *** mistym has joined #archiveteam-bs
[23:23] *** dashcloud has quit IRC (Ping timeout: 265 seconds)
[23:23] *** nico has quit IRC (Ping timeout: 265 seconds)
[23:24] *** Insomnia1 has quit IRC (Ping timeout: 265 seconds)
[23:24] *** Insomnia_ has joined #archiveteam-bs
[23:24] *** wm_ has quit IRC (Ping timeout: 265 seconds)
[23:28] *** dashcloud has joined #archiveteam-bs
[23:39] *** nico has joined #archiveteam-bs
[23:46] *** wm_ has joined #archiveteam-bs
[23:46] *** Start has joined #archiveteam-bs