[00:00] i was hope for more full mirror of xkcd.org
[00:00] not just index files and xkcd.org
[00:00] *html files
[03:28] is anyone working on the knol scrape?
[04:34] !
[04:34] @ERROR: Unknown module 'db48x'
[04:41] heh
[05:43] ah, wikipedia...
[05:43] http://imgur.com/gallery/sZA5k
[05:48] Someone doesn't like me
[05:48] I did a netstat, and there's about 1000 SYN_SENTs
[06:00] meaning what?
[06:04] one kind of DDoS is a synflood, might be that
[06:04] or just DoS
[06:05] is that what half-open connections mean
[06:07] bsmith093: yep
[06:37] should AT be looking into pastebin-type sites?
[06:37] since quite a few pastes are marked as "forever" for "how long to keep"
[06:41] arrith: there are quite a few pastes that should not have been pasted as forever
[06:45] isnt that sort of random even for us
[06:46] balrog: couldn't the same be said for geocities sites?
[06:46] bsmith093: i'm sure people think FF.net archiving is random :P
[06:46] probably not to the same extent
[06:47] besides pastebin id= 8 chars mixedcase thats 52 factorial combos
[06:47] i think
[06:48] point is even if their each one bit, thats a hell of a lot of data, makes a tb look like nothing
[06:49] 3.03423e+13
[06:49] combos
[06:54] hmm
[06:54] bsmith093: if the averate is 1kb, that's only 27 petabytes
[06:55] bsmith093: but to be honest, you're assuming that the namespace is anywhere near fully used
[06:55] oh well, exCUSE me , mr disk space is *free*, do u have several multi tb drives to fill to the brim?
[06:56] its probably not anywhere near exhausted but still thats a lot
[06:56] ;)
[06:57] even if its only 1 bit per, and 25percent full, thats 7.5 tb
[06:57] 30tb max/4
[06:58] besides imagine brute forcing THAT keyspace
[06:59] I'd be suprised if it was 0.25% full
[07:01] 1000q/s, which i doubt u could sustain, means ~961 **years**
[07:01] i'm not sure how to estimate it but yeah, people putting even a few petabytes onto a pastebin seems strange
[07:01] in terms of storage, you can compress it. just recently i got 300MB of text to 800KB
[07:06] 2.405370053 years at 1000q/s if .25%full
[07:06] ehat text
[07:09] that gives us a good yard stick
[07:09] bsmith093: 53459728531456 possibilities, given 52 choices and 8 positions
[07:09] has pastebin received 1000 submissions per second in the last few years of operation?
[07:09] doubtful
[07:09] probably not
[07:10] probably 10 or 20 an hour
[07:10] i meant how fast we could pull them
[07:10] yea, but I'm trying to gauge how full it might be
[07:10] for example: 2 possible symbols (0 and 1), 8 positions: 2 to the 8th power = 256
[07:10] even .25% seems like a large overestimate
[07:11] 1 year = 8 765.81277 hours
[07:11] wikipedia says that it's been around for 9 years
[07:11] google says (20 per hour) * (9 years) = 1 577 846.3
[07:11] that would be trivial to archive
[07:12] couple gig tops
[07:12] just about the only thing we would have to do is modify our new universal tracker to show kilobytes instead of megabytes
[07:12] im impressed at the human races, (first world) lack of abuse of the commons of a free anything paste site
[07:12] :)
[07:13] we have a universal tracker
[07:13] ???
[07:13] yea, it runs splinder.heroku.com, memac.heroku.com, etc
[07:26] ffnet.heroku.com ?
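The keyspace and scrape-time figures traded between 06:47 and 07:11 can be rechecked with a few lines of Python. This is only a minimal sketch, assuming (as the chat does, with no confirmation from Pastebin) that IDs are 8 characters drawn from 52 mixed-case letters and that a scraper could sustain 1000 queries per second; the function names are made up for illustration. It suggests that 3.03423e+13 is the number of IDs with no repeated character (52 x 51 x ... x 45, rather than 52 factorial), while 53459728531456 is 52 to the 8th power, the count when characters may repeat.

```python
import math

# Assumptions taken from the chat, not from Pastebin itself.
ALPHABET = 52                 # mixed-case letters only
ID_LENGTH = 8                 # characters per paste ID
HOURS_PER_YEAR = 8765.81277   # figure quoted at 07:11
SECONDS_PER_YEAR = HOURS_PER_YEAR * 3600


def ids_with_repeats(alphabet=ALPHABET, length=ID_LENGTH):
    """IDs when a character may appear more than once: alphabet ** length."""
    return alphabet ** length


def ids_without_repeats(alphabet=ALPHABET, length=ID_LENGTH):
    """IDs when no character repeats: alphabet! / (alphabet - length)!."""
    return math.perm(alphabet, length)


def years_to_enumerate(id_count, queries_per_second):
    """Years needed to request every ID once at a steady query rate."""
    return id_count / queries_per_second / SECONDS_PER_YEAR


def pastes_submitted(per_hour, years):
    """Total pastes at a constant hourly submission rate."""
    return per_hour * HOURS_PER_YEAR * years


if __name__ == "__main__":
    full = ids_with_repeats()
    partial = ids_without_repeats()
    print("52**8 IDs:           ", full)
    print("no-repeat IDs:       ", partial)
    print("years at 1000 q/s:   ", years_to_enumerate(partial, 1000))
    print("years, 52**8 space:  ", years_to_enumerate(full, 1000))
    print("pastes, 20/h for 9y: ", pastes_submitted(20, 9))
```

Under these assumptions the no-repeat count takes roughly 961 years to walk at 1000 queries per second, the full 52^8 space roughly 1694 years, and 20 pastes per hour over 9 years comes to about 1.58 million pastes, which is why grabbing only the pastes that actually exist looks trivial next to brute-forcing the ID space.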
[07:29] yea, we could probably set something up for fanfiction.net
[07:32] how fast can u run a curl script, were using this script http://pastebin.com/M2dgrAUE with this number list to sort good ids from bad ones id list 0000000-9999999
[07:34] its not really distributable, or paralellizable, bu tgivit a fat pipe and itll goto town
[07:35] the total #of stroies is probably <3million
[07:35] maybe 5m
[07:50] bsmith093: once underscor is done with his thing i'd rather rewrite it in python or perl then get it working with the universal tracker
[07:51] hmm, so how long might that take, on his end
[07:51] also is he still here?
[07:51] the download.py from fanfictiondownloader misses a lot of the site and doesn't put it in a nice of format as underscor's
[07:51] true, but im not using that anymore
[07:51] i'm not sure if he's still here. he's working at that thing though
[07:52] oh? using wget-warc?
[07:55] umm no, using this www.fanficdowloader.net app
[07:55] its a blob, but it can read in link lists, which is perfect for me
[07:58] www.fanfictiondownloader.net app
[08:13] Hi guys, whats the latest?
[08:15] Something about FanFiction.Net?
[08:21] yes, possibly
[08:22] when can we start archiving it?
[08:23] * Hydriz loves to archive random stuff
[08:26] http://www.fanfictiondownloader.net
[08:27] eh, no tracker?
[08:27] no, ot yet, turns out distributed stuff is harder than it looks
[08:28] I see
[08:28] heres the list of links to put thorugh the app
[08:28] catch you guys another day, need to go now
[10:07] good news everyone on crankygeeks
[10:09] i maybe able to get episodes 119-125 of crankgeeks
[10:10] there still hosted on pcmag.com it looks like
[10:11] just no web page for those episodes for some reason
[11:20] fun project idea: video advert (from magazines) collection
[12:10] A friend of mine call us Diogenes Team. I laugh.
[12:18] sisyphus would fit
[12:38] Did you know THE ARCHIVERS?
http://thearchivers.blogspot.com/
[16:20] hey hey
[16:20] any book archiving enthusi-asts in here?
[16:23] surprise me
[16:32] uploading scans of mutt n jeff comics circa 1914 right now
[17:49] 416 days since I started to download Jamendo.
[17:49] 28000 albums, 1.1TB.
[17:50] I'm about 50%.
[17:53] wow
[18:00] :)
[18:00] i got mine spread to 2 disks
[18:01] that reminds me, i uploaded a filelist for you some weeks ago
[18:07] man, just thinking of jamendo makes me sad and angry
[18:07] such incompetent idiots
[18:09] ouch, good thing i am not root
[18:09] mv: cannot move `sbin' to a subdirectory of itself, `sbin/sbin'
[18:09] why incompetent?
[18:10] failure to get the platform stable in years
[18:10] everything is buggy
[18:10] and ugly
[18:11] oh look "Jamendo is currently under maintainance, sad isn't it ?"...
[18:11] lul
[18:11] perfect timing :)
[18:33] Internet Archive doesn't host porn. When historians in the future look back, they will be asexual Internet society.
[18:33] Thanks Internet Archive.
[18:33] will see*
[18:49] emijrp: They do
[18:49] They just don't disseminate it
[18:51] And when are they going to disseminate it? 100 years? 250?
[18:53] Another problem with Wayback Machine is that you find what you know it existed.
[18:53] You can "google" the weayback machine, right?
[18:53] can't*
[18:53] http://i.imgur.com/JakW6.jpg
[18:54] emijrp: It falls under the same thing as the copyrighted music and movie archive
[18:54] Whenever it becomes legal to distribute
[18:55] please tell me archive.org grabs all scene releases
[18:55] and, no, as far as I know you can't google the wayback machine
[18:55] And copyrighted websites? Wayback Machine is full of that
[18:56] Hey, I'm not the decision maker here
[18:56] emijrp: What's so hard to understand?
[18:56] Also, if you update your robots.txt, your website will automatically disappear
[18:56] *We* don't care about copyrighted material, IA have to.
[18:56] Also, Joe Schmoe is a lot easier to deal with than Universal Pictures
[18:56] i never really got it either
[18:56] especially when jason started uploading all those magazines and manuals
[18:57] We should stop discussing this here
[18:57] seems to be a abandonware-ish view
[18:57] #archiverights if you want
[18:57] oO
[18:57] Jason prefers to not have these discussions in #archiveteam
[18:58] It's very simple; We don't host crap, IA does. Hence they need to care about stuff we don't. IA still stores crap it can't display
[18:58] Yeah
[18:58] aye
[18:59] At least a quarter of the stuff IA stores is not publicly available due to rights encumberment.
[18:59] also they surely respond to dmca and what not so they are doing nothing wrong (on the contrary :) )
[19:00] I know all that.
[19:02] Schbirid: A scene archive would be really cool
[19:02] I've contemplated it
[19:02] But the current release rate is too fast
[19:02] IA doesn't want to dedicate that much money/resources to something like that when there are older artifacts to preserve
[19:03] 200tb of mobileme though ... heh
[19:03] Hahah
[19:03] Yeah
[19:03] That one'll be interesting
[19:04] We're currently burning at ~26TB a week
[19:04] (IA is)
[19:05] i love reading "IA"
[19:05] in german it is the name of the donkey from winnie the pooh :)
[19:05] eeyore or what it is originally called
[19:06] hahaha really?
[19:07] http://www.archive.org/details/archiveteam-yahoovideo
[19:08] where: &w_collection=archiveteam-yahoovideo | size: 11,161,347,515 KB| redrows: 0 (0.077 seconds)
[19:09] Nice size!
[19:09] emijrp: For knowing that, you sure seem to ask a lot about it.
[19:09] Why is that public?
[19:09] Because no one has whined
[19:09] They are copyrgihted videos.
[19:09] Again, what's so hard to understand?
[19:10] If copyright owners complain, the collection/files will be delisted. IA will still have it stored though.
[19:10] they may be copyrighted videos that were uploaded to a public website for the purpose of sharing them
[19:11] Lawyers Team. Shut up.
[19:11] No one has complained.
[19:11] That's the biggest reason
[19:11] Same with the magazines
[19:11] and the manuals
[19:12] If we give him 500 examples more, maybe he'll get it
[19:12] lol
[19:12] Just maybe
[19:12] Give me 500 examples of porn sites that have complained to IA about distributing their content.
[19:13] underscor said that it is the reason why porn is not public
[19:13] It's their own business what they choose to list or not
[19:13] No he didn't
[19:13] It's ONE reason it MAY be delisted
[19:14] emijrp: In the case of porn
[19:14] It isn't listed because the administration feels that it "tarnishes" the image of IA
[19:14] to be entirely frank.
[19:14] and they don't want to be the go-to place for easy porn access, I would imagine
[19:15] ^
[19:15] I don't blame them one bit.
[19:15] underscor: thats the point
[19:15] Ok, so we make it available
[19:15] All the porn IA has is "stolen" subscription content from sites that are still up
[19:15] Don't give me lessons of law. I came from Wikipedia, the copyright-smarty-lawyers-trollish community.
[19:15] Within a week it would all be down
[19:15] emijrp: physical museums have only a small fraction of their collection available for public viewing at any given time. It's the same kind of idea. It's still THERE. If someone needs access to it for a good reason, they can have it.
[19:16] ^
[19:16] Just not everyone and their grandma every day all the time.
[19:16] * underscor imagines dnova and his grandma searching through porn archives
[19:16] hah
[19:16] not QUITE what I was hoping to put in anyone's mind
[19:16] hahahaha
[19:17] emijrp: Plus, the request volume for archived porn is low.
[19:17] If we had content from a site that no longer existed, AND if someone asked about it
[19:17] then we'
[19:17] d disseminate it
[19:19] And, in fact, before I started "volunteering" at the archive
[19:20] They weren't saving copyrighted music or movies.
[19:20] At all.
[19:22] Aside from the standard definition TV recording
[19:46] *cricket cricket cricket*
[19:46] haha
[19:52] we are all searching for porn on ia
[19:53] Too bad there's no PD porn yet
[19:54] I mean, porn existed in 1923 didn't it?
[19:56] What? There's been porn since the first camera. Well, before, if you count various physical artifacts.
[19:56] There was a Creative Commons porn clip, but it was deleted in the official website.
[19:56] So there's definitely PD porn out there.
[19:57] That is the only open porn I heard ever.
[19:57] Obviously, pre-1923 porn materials (movies, pics) are PD.
[19:59] Afghanistan is the sole country in the world without copyright laws.
[20:00] But America went there to give democracy.
[20:01] https://en.wikipedia.org/wiki/Afghanistan_and_copyright_issues
[20:03] http://www.freedomporn.org/smut/Category:public_domain
[20:04] Interesting wiki.
[20:13] any one got an idea about "CIX conferences" and if they were archived somewhere?
[20:14] ah compulink
[20:15] http://web.archive.org/web/19971211045936/http://www.compulink.co.uk/
[20:46] oops
[20:50] .join #archiverights
[23:54] hrm
[23:54] wget is using so much memory
[23:54] for splinder?
[23:54] wget-warc uses a lot with the huge profiles
[23:54] this is mobileme, actually
[23:54] finishing up my incompletes
[23:55] this one has 43k lines in its urls.txt
[23:55] how much memory?
[23:55] grrrrrr
[23:55] 220 ftp.nodc.noaa.gov FTP server hello there friendly person
[23:55] 331 Guest login ok, send your complete e-mail address as password.
[23:55] 530-
[23:55] Name (ftp.nodc.noaa.gov:abuie): anonymous
[23:55] Password:
[23:55] 530- Sorry, there is currently a limit of 10 ftp users
[23:55] 530- on this system.
Try again later.
[23:55] 10.2 gigs
[23:55] 530-
[23:55] 530 Login incorrect.
[23:55] Login failed.
[23:55] Really?
[23:55] 10 users?
[23:55] oh shit man
[23:55] What the hell
[23:56] lol underscor
[23:56] out of the 8 gigs that the machine has
[23:56] they haven't changed the settings since 1994
[23:56] dnova: hahaha
[23:56] db48x2: ouch :(
[23:57] I thought this was bad
[23:57] 26664 splinda 15 0 330m 268m 1316 S 0.0 15.7 2:33.98 wget-warc
[23:57] guess it could be worse .
[23:57] heh