#archiveteam 2011-12-03,Sat

Time Nickname Message
00:00 🔗 godane i was hoping for a fuller mirror of xkcd.org
00:00 🔗 godane not just index files and xkcd.org
00:00 🔗 godane *html files
03:28 🔗 bsmith093 is anyone working on the knol scrape?
04:34 🔗 db48x !
04:34 🔗 db48x @ERROR: Unknown module 'db48x'
04:41 🔗 Hydriz heh
05:43 🔗 Coderjoe ah, wikipedia...
05:43 🔗 Coderjoe http://imgur.com/gallery/sZA5k
05:48 🔗 NotGLaDOS Someone doesn't like me
05:48 🔗 NotGLaDOS I did a netstat, and there's about 1000 SYN_SENTs
06:00 🔗 bsmith093 meaning what?
06:04 🔗 arrith one kind of DDoS is a synflood, might be that
06:04 🔗 arrith or just DoS
06:05 🔗 bsmith093 is that what half-open connections mean
06:07 🔗 db48x bsmith093: yep
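A rough sketch of how the netstat check above could be tallied, assuming Linux-style `netstat -tan` output (the sample rows and addresses are made up for illustration). One caveat: SYN_SENT is the state recorded on the machine that sent the SYN, while half-open connections on the receiving end of a SYN flood normally show up as SYN_RECV, so a per-state count like this may help tell the two situations apart.

```python
from collections import Counter

def count_tcp_states(netstat_output: str) -> Counter:
    """Count TCP connection states in text shaped like `netstat -tan` output."""
    states = Counter()
    for line in netstat_output.splitlines():
        fields = line.split()
        # Assumed Linux column layout: proto, recv-q, send-q, local, foreign, state
        if len(fields) >= 6 and fields[0].startswith("tcp"):
            states[fields[5]] += 1
    return states

# Hypothetical sample output; addresses come from documentation ranges.
SAMPLE = """\
Active Internet connections (servers and established)
Proto Recv-Q Send-Q Local Address           Foreign Address         State
tcp        0      0 0.0.0.0:22              0.0.0.0:*               LISTEN
tcp        0      1 10.0.0.5:51514          203.0.113.7:80          SYN_SENT
tcp        0      1 10.0.0.5:51515          203.0.113.8:80          SYN_SENT
tcp        0      0 10.0.0.5:22             198.51.100.2:50000      ESTABLISHED
"""

if __name__ == "__main__":
    for state, count in count_tcp_states(SAMPLE).most_common():
        print(state, count)
```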
06:37 🔗 arrith should AT be looking into pastebin-type sites?
06:37 🔗 arrith since quite a few pastes are marked as "forever" for "how long to keep"
06:41 🔗 balrog arrith: there are quite a few pastes that should not have been pasted as forever
06:45 🔗 bsmith093 isnt that sort of random even for us
06:46 🔗 arrith balrog: couldn't the same be said for geocities sites?
06:46 🔗 arrith bsmith093: i'm sure people think FF.net archiving is random :P
06:46 🔗 balrog probably not to the same extent
06:47 🔗 bsmith093 besides pastebin id= 8 chars mixedcase thats 52 factorial combos
06:47 🔗 bsmith093 i think
06:48 🔗 bsmith093 point is even if they're each one bit, that's a hell of a lot of data, makes a tb look like nothing
06:49 🔗 bsmith093 3.03423e+13
06:49 🔗 bsmith093 combos
06:54 🔗 db48x hmm
06:54 🔗 db48x bsmith093: if the average is 1kb, that's only 27 petabytes
06:55 🔗 db48x bsmith093: but to be honest, you're assuming that the namespace is anywhere near fully used
06:55 🔗 bsmith093 oh well, exCUSE me , mr disk space is *free*, do u have several multi tb drives to fill to the brim?
06:56 🔗 bsmith093 its probably not anywhere near exhausted but still thats a lot
06:56 🔗 db48x ;)
06:57 🔗 bsmith093 even if its only 1 bit per, and 25percent full, thats 7.5 tb
06:57 🔗 bsmith093 30tb max/4
06:58 🔗 bsmith093 besides imagine brute forcing THAT keyspace
06:59 🔗 db48x I'd be surprised if it was 0.25% full
07:01 🔗 bsmith093 1000q/s, which i doubt u could sustain, means ~961 **years**
07:01 🔗 arrith i'm not sure how to estimate it but yeah, people putting even a few petabytes onto a pastebin seems strange
07:01 🔗 arrith in terms of storage, you can compress it. just recently i got 300MB of text to 800KB
07:06 🔗 bsmith093 2.405370053 years at 1000q/s if .25%full
07:06 🔗 bsmith093 what text
07:09 🔗 db48x that gives us a good yard stick
07:09 🔗 Coderjoe bsmith093: 53459728531456 possibilities, given 52 choices and 8 positions
07:09 🔗 db48x has pastebin received 1000 submissions per second in the last few years of operation?
07:09 🔗 db48x doubtful
07:09 🔗 bsmith093 probably not
07:10 🔗 db48x probably 10 or 20 an hour
07:10 🔗 bsmith093 i meant how fast we could pull them
07:10 🔗 db48x yea, but I'm trying to gauge how full it might be
07:10 🔗 Coderjoe for example: 2 possible symbols (0 and 1), 8 positions: 2 to the 8th power = 256
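The two keyspace figures quoted in the chat can be reproduced with a short sketch. It assumes, as the chat does, an alphabet of 52 mixed-case letters and IDs of 8 characters; the real site's ID alphabet may differ. The smaller figure (3.03423e+13) matches a count where no character may repeat, while Coderjoe's figure allows repeats.

```python
import math

ALPHABET_SIZE = 52  # a-z plus A-Z, the chat's assumption
ID_LENGTH = 8       # characters per paste ID, the chat's assumption

def keyspace_with_repetition(symbols: int, positions: int) -> int:
    """Each position independently takes any symbol: symbols ** positions."""
    return symbols ** positions

def keyspace_without_repetition(symbols: int, positions: int) -> int:
    """Ordered selections with no symbol reused: symbols! / (symbols - positions)!"""
    return math.perm(symbols, positions)

if __name__ == "__main__":
    print(keyspace_with_repetition(2, 8))                         # Coderjoe's binary example
    print(keyspace_with_repetition(ALPHABET_SIZE, ID_LENGTH))     # repeats allowed
    print(keyspace_without_repetition(ALPHABET_SIZE, ID_LENGTH))  # no repeats
```

Since nothing suggests a paste ID cannot reuse a letter, the with-repetition count is probably the relevant upper bound.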
07:10 🔗 db48x even .25% seems like a large overestimate
07:11 🔗 bsmith093 1 year = 8 765.81277 hours
07:11 🔗 db48x wikipedia says that it's been around for 9 years
07:11 🔗 db48x google says (20 per hour) * (9 years) = 1 577 846.3
07:11 🔗 db48x that would be trivial to archive
07:12 🔗 bsmith093 couple gig tops
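A back-of-the-envelope sketch of the estimates traded above. The hours-per-year constant is the one quoted in the chat; the 1000 queries/s rate, the 20 pastes/hour rate, the 9 years of operation, and the 1 KB average paste size are all the chat's guesses rather than measured values.

```python
HOURS_PER_YEAR = 8765.81277            # figure quoted in the chat
SECONDS_PER_YEAR = HOURS_PER_YEAR * 3600

def years_to_enumerate(keyspace: int, queries_per_second: float) -> float:
    """Years needed to try every ID once at a fixed request rate."""
    return keyspace / queries_per_second / SECONDS_PER_YEAR

def estimated_pastes(per_hour: float, years: float) -> float:
    """Total submissions at a constant hourly rate."""
    return per_hour * HOURS_PER_YEAR * years

def estimated_size_gib(pastes: float, avg_bytes: float = 1024) -> float:
    """Uncompressed size in GiB for a given average paste size."""
    return pastes * avg_bytes / 2**30

if __name__ == "__main__":
    no_repeat_keyspace = 30342338208000  # the 3.03423e+13 figure from the chat
    print(years_to_enumerate(no_repeat_keyspace, 1000))  # brute force, full keyspace
    pastes = estimated_pastes(20, 9)
    print(pastes)                                        # db48x's submission estimate
    print(estimated_size_gib(pastes))                    # "couple gig tops"
```

The gap between roughly a thousand years of blind enumeration and about 1.5 GiB of plausible content suggests that discovering valid IDs some other way would matter far more than raw fetch speed.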
07:12 🔗 db48x just about the only thing we would have to do is modify our new universal tracker to show kilobytes instead of megabytes
07:12 🔗 bsmith093 im impressed at the human races, (first world) lack of abuse of the commons of a free anything paste site
07:12 🔗 bsmith093 :)
07:13 🔗 bsmith093 we have a universal tracker
07:13 🔗 bsmith093 ???
07:13 🔗 db48x yea, it runs splinder.heroku.com, memac.heroku.com, etc
07:26 🔗 bsmith093 ffnet.heroku.com ?
07:29 🔗 db48x yea, we could probably set something up for fanfiction.net
07:32 🔗 bsmith093 how fast can u run a curl script, were using this script http://pastebin.com/M2dgrAUE with this number list to sort good ids from bad ones id list 0000000-9999999
07:34 🔗 bsmith093 its not really distributable, or parallelizable, but give it a fat pipe and itll go to town
07:35 🔗 bsmith093 the total # of stories is probably <3million
07:35 🔗 bsmith093 maybe 5m
07:50 🔗 arrith bsmith093: once underscor is done with his thing i'd rather rewrite it in python or perl then get it working with the universal tracker
07:51 🔗 bsmith093 hmm, so how long might that take, on his end
07:51 🔗 bsmith093 also is he still here?
07:51 🔗 arrith the download.py from fanfictiondownloader misses a lot of the site and doesn't put it in as nice a format as underscor's
07:51 🔗 bsmith093 true, but im not using that anymore
07:51 🔗 arrith i'm not sure if he's still here. he's working at that thing though
07:52 🔗 arrith oh? using wget-warc?
07:55 🔗 bsmith093 umm no, using this www.fanficdowloader.net app
07:55 🔗 bsmith093 its a blob, but it can read in link lists, which is perfect for me
07:58 🔗 bsmith093 www.fanfictiondownloader.net app
08:13 🔗 Hydriz Hi guys, whats the latest?
08:15 🔗 Hydriz Something about FanFiction.Net?
08:21 🔗 db48x yes, possibly
08:22 🔗 Hydriz when can we start archiving it?
08:23 🔗 * Hydriz loves to archive random stuff
08:26 🔗 bsmith093 http://www.fanfictiondownloader.net
08:27 🔗 Hydriz eh, no tracker?
08:27 🔗 bsmith093 no, not yet, turns out distributed stuff is harder than it looks
08:28 🔗 Hydriz I see
08:28 🔗 bsmith093 heres the list of links to put through the app
08:28 🔗 Hydriz catch you guys another day, need to go now
10:07 🔗 godane good news everyone on crankygeeks
10:09 🔗 godane i may be able to get episodes 119-125 of crankygeeks
10:10 🔗 godane they're still hosted on pcmag.com it looks like
10:11 🔗 godane just no web page for those episodes for some reason
11:20 🔗 Schbirid fun project idea: video advert (from magazines) collection
12:10 🔗 emijrp A friend of mine calls us Diogenes Team. I laugh.
12:18 🔗 Schbirid sisyphus would fit
12:38 🔗 emijrp Did you know THE ARCHIVERS? http://thearchivers.blogspot.com/
16:20 🔗 rude___ hey hey
16:20 🔗 rude___ any book archiving enthusi-asts in here?
16:23 🔗 emijrp surprise me
16:32 🔗 rude___ uploading scans of mutt n jeff comics circa 1914 right now
17:49 🔗 emijrp 416 days since I started to download Jamendo.
17:49 🔗 emijrp 28000 albums, 1.1TB.
17:50 🔗 emijrp I'm about 50%.
17:53 🔗 underscor wow
18:00 🔗 Schbirid :)
18:00 🔗 Schbirid i got mine spread to 2 disks
18:01 🔗 Schbirid that reminds me, i uploaded a filelist for you some weeks ago
18:07 🔗 Schbirid man, just thinking of jamendo makes me sad and angry
18:07 🔗 Schbirid such incompetent idiots
18:09 🔗 Schbirid ouch, good thing i am not root
18:09 🔗 Schbirid mv: cannot move `sbin' to a subdirectory of itself, `sbin/sbin'
18:09 🔗 emijrp why incompetent?
18:10 🔗 Schbirid failure to get the platform stable in years
18:10 🔗 Schbirid everything is buggy
18:10 🔗 Schbirid and ugly
18:11 🔗 Schbirid oh look "Jamendo is currently under maintainance, sad isn't it ?"...
18:11 🔗 emijrp lul
18:11 🔗 Schbirid perfect timing :)
18:33 🔗 emijrp Internet Archive doesn't host porn. When historians in the future look back, they will be asexual Internet society.
18:33 🔗 emijrp Thanks Internet Archive.
18:33 🔗 emijrp will see*
18:49 🔗 underscor emijrp: They do
18:49 🔗 underscor They just don't disseminate it
18:51 🔗 emijrp And when are they going to disseminate it? 100 years? 250?
18:53 🔗 emijrp Another problem with Wayback Machine is that you only find what you already know existed.
18:53 🔗 emijrp You can "google" the wayback machine, right?
18:53 🔗 emijrp can't*
18:53 🔗 dnova http://i.imgur.com/JakW6.jpg
18:54 🔗 underscor emijrp: It falls under the same thing as the copyrighted music and movie archive
18:54 🔗 underscor Whenever it becomes legal to distribute
18:55 🔗 Schbirid please tell me archive.org grabs all scene releases
18:55 🔗 underscor and, no, as far as I know you can't google the wayback machine
18:55 🔗 emijrp And copyrighted websites? Wayback Machine is full of that
18:56 🔗 underscor Hey, I'm not the decision maker here
18:56 🔗 ersi emijrp: What's so hard to understand?
18:56 🔗 underscor Also, if you update your robots.txt, your website will automatically disappear
18:56 🔗 ersi *We* don't care about copyrighted material, IA have to.
18:56 🔗 underscor Also, Joe Schmoe is a lot easier to deal with than Universal Pictures
18:56 🔗 Schbirid i never really got it either
18:56 🔗 Schbirid especially when jason started uploading all those magazines and manuals
18:57 🔗 underscor We should stop discussing this here
18:57 🔗 Schbirid seems to be an abandonware-ish view
18:57 🔗 underscor #archiverights if you want
18:57 🔗 Schbirid oO
18:57 🔗 underscor Jason prefers to not have these discussions in #archiveteam
18:58 🔗 ersi It's very simple; We don't host crap, IA does. Hence they need to care about stuff we don't. IA still stores crap it can't display
18:58 🔗 underscor Yeah
18:58 🔗 Schbirid aye
18:59 🔗 underscor At least a quarter of the stuff IA stores is not publicly available due to rights encumberment.
18:59 🔗 Schbirid also they surely respond to dmca and what not so they are doing nothing wrong (on the contrary :) )
19:00 🔗 emijrp I know all that.
19:02 🔗 underscor Schbirid: A scene archive would be really cool
19:02 🔗 underscor I've contemplated it
19:02 🔗 underscor But the current release rate is too fast
19:02 🔗 underscor IA doesn't want to dedicate that much money/resources to something like that when there are older artifacts to preserve
19:03 🔗 dnova 200tb of mobileme though ... heh
19:03 🔗 underscor Hahah
19:03 🔗 underscor Yeah
19:03 🔗 underscor That one'll be interesting
19:04 🔗 underscor We're currently burning at ~26TB a week
19:04 🔗 underscor (IA is)
19:05 🔗 Schbirid i love reading "IA"
19:05 🔗 Schbirid in german it is the name of the donkey from winnie the pooh :)
19:05 🔗 Schbirid eeyore or what it is originally called
19:06 🔗 underscor hahaha really?
19:07 🔗 emijrp http://www.archive.org/details/archiveteam-yahoovideo
19:08 🔗 underscor where: &w_collection=archiveteam-yahoovideo | size: 11,161,347,515 KB| redrows: 0 (0.077 seconds)
19:09 🔗 underscor Nice size!
19:09 🔗 ersi emijrp: For knowing that, you sure seem to ask a lot about it.
19:09 🔗 emijrp Why is that public?
19:09 🔗 ersi Because no one has whined
19:09 🔗 emijrp They are copyrighted videos.
19:09 🔗 ersi Again, what's so hard to understand?
19:10 🔗 ersi If copyright owners complain, the collection/files will be delisted. IA will still have it stored though.
19:10 🔗 dnova they may be copyrighted videos that were uploaded to a public website for the purpose of sharing them
19:11 🔗 emijrp Lawyers Team. Shut up.
19:11 🔗 underscor No one has complained.
19:11 🔗 underscor That's the biggest reason
19:11 🔗 underscor Same with the magazines
19:11 🔗 underscor and the manuals
19:12 🔗 ersi If we give him 500 examples more, maybe he'll get it
19:12 🔗 underscor lol
19:12 🔗 ersi Just maybe
19:12 🔗 emijrp Give me 500 examples of porn sites that have complained to IA about distributing their content.
19:13 🔗 emijrp underscor said that it is the reason why porn is not public
19:13 🔗 ersi It's their own business what they choose to list or not
19:13 🔗 ersi No he didn't
19:13 🔗 ersi It's ONE reason it MAY be delisted
19:14 🔗 underscor emijrp: In the case of porn
19:14 🔗 underscor It isn't listed because the administration feels that it "tarnishes" the image of IA
19:14 🔗 underscor to be entirely frank.
19:14 🔗 dnova and they don't want to be the go-to place for easy porn access, I would imagine
19:15 🔗 underscor ^
19:15 🔗 dnova I don't blame them one bit.
19:15 🔗 emijrp underscor: thats the point
19:15 🔗 underscor Ok, so we make it available
19:15 🔗 underscor All the porn IA has is "stolen" subscription content from sites that are still up
19:15 🔗 emijrp Don't give me lessons of law. I came from Wikipedia, the copyright-smarty-lawyers-trollish community.
19:15 🔗 underscor Within a week it would all be down
19:15 🔗 dnova emijrp: physical museums have only a small fraction of their collection available for public viewing at any given time. It's the same kind of idea. It's still THERE. If someone needs access to it for a good reason, they can have it.
19:16 🔗 underscor ^
19:16 🔗 dnova Just not everyone and their grandma every day all the time.
19:16 🔗 * underscor imagines dnova and his grandma searching through porn archives
19:16 🔗 dnova hah
19:16 🔗 dnova not QUITE what I was hoping to put in anyone's mind
19:16 🔗 underscor hahahaha
19:17 🔗 underscor emijrp: Plus, the request volume for archived porn is low.
19:17 🔗 underscor If we had content from a site that no longer existed, AND if someone asked about it
19:17 🔗 underscor then we'
19:17 🔗 underscor d disseminate it
19:19 🔗 underscor And, in fact, before I started "volunteering" at the archive
19:20 🔗 underscor They weren't saving copyrighted music or movies.
19:20 🔗 underscor At all.
19:22 🔗 underscor Aside from the standard definition TV recording
19:46 🔗 underscor *cricket cricket cricket*
19:46 🔗 underscor haha
19:52 🔗 Schbirid we are all searching for porn on ia
19:53 🔗 underscor Too bad there's no PD porn yet
19:54 🔗 underscor I mean, porn existed in 1923 didn't it?
19:56 🔗 Paradoks What? There's been porn since the first camera. Well, before, if you count various physical artifacts.
19:56 🔗 emijrp There was a Creative Commons porn clip, but it was deleted from the official website.
19:56 🔗 Paradoks So there's definitely PD porn out there.
19:57 🔗 emijrp That is the only open porn I ever heard of.
19:57 🔗 emijrp Obviously, pre-1923 porn materials (movies, pics) are PD.
19:59 🔗 emijrp Afghanistan is the sole country in the world without copyright laws.
20:00 🔗 emijrp But America went there to give democracy.
20:01 🔗 emijrp https://en.wikipedia.org/wiki/Afghanistan_and_copyright_issues
20:03 🔗 Paradoks http://www.freedomporn.org/smut/Category:public_domain
20:04 🔗 emijrp Interesting wiki.
20:13 🔗 Schbirid any one got an idea about "CIX conferences" and if they were archived somewhere?
20:14 🔗 Schbirid ah compulink
20:15 🔗 Schbirid http://web.archive.org/web/19971211045936/http://www.compulink.co.uk/
20:46 🔗 db48x oops
20:50 🔗 bsmith094 .join #archiverights
23:54 🔗 db48x2 hrm
23:54 🔗 db48x2 wget is using so much memory
23:54 🔗 dnova for splinder?
23:54 🔗 dnova wget-warc uses a lot with the huge profiles
23:54 🔗 db48x2 this is mobileme, actually
23:54 🔗 db48x2 finishing up my incompletes
23:55 🔗 db48x2 this one has 43k lines in its urls.txt
23:55 🔗 dnova how much memory?
23:55 🔗 underscor grrrrrr
23:55 🔗 underscor 220 ftp.nodc.noaa.gov FTP server hello there friendly person
23:55 🔗 underscor 331 Guest login ok, send your complete e-mail address as password.
23:55 🔗 underscor 530-
23:55 🔗 underscor Name (ftp.nodc.noaa.gov:abuie): anonymous
23:55 🔗 underscor Password:
23:55 🔗 underscor 530- Sorry, there is currently a limit of 10 ftp users
23:55 🔗 underscor 530- on this system. Try again later.
23:55 🔗 db48x2 10.2 gigs
23:55 🔗 underscor 530-
23:55 🔗 underscor 530 Login incorrect.
23:55 🔗 underscor Login failed.
23:55 🔗 underscor Really?
23:55 🔗 underscor 10 users?
23:55 🔗 dnova oh shit man
23:55 🔗 underscor What the hell
23:56 🔗 dnova lol underscor
23:56 🔗 db48x2 out of the 8 gigs that the machine has
23:56 🔗 dnova they haven't changed the settings since 1994
23:56 🔗 underscor dnova: hahaha
23:56 🔗 underscor db48x2: ouch :(
23:57 🔗 dnova I thought this was bad
23:57 🔗 dnova 26664 splinda 15 0 330m 268m 1316 S 0.0 15.7 2:33.98 wget-warc
23:57 🔗 dnova guess it could be worse .
23:57 🔗 db48x2 heh
