[00:19] so is there a general tool that will scrape urls from a google search for you?
[02:18] no kidding about metadata being a love note to the future...
[02:18] I have a couple hundred unlabeled VHS tapes, a number of unlabeled hard drives (offline), a bunch of unlabeled floppies, etc...
[02:19] but something is a little odd about the shirt coloring around the text, like they had the heat press turned up too high, pressed too long, or with too much pressure
[02:33] can we archive lachlan cranswick's page?
[02:33] he died and i don't know how long it will be up
[02:34] he was a physicist from australia with a bunch of topics on his page ranging from physics to poetry
[02:34] lachlan.bluehaze.com.au/
[02:34] proof: www.abc.net.au/news/2010-06-16/aussie-scientists-body-found-in-canadian-river/869416
[02:34] more proof: www.cbc.ca/news/canada/ottawa/story/2010/02/04/deep-river-lachlan-cranswick.html
[02:40] go ahead
[02:41] i don't have the space
[02:41] Nothing stops an individual from archiving something they feel is important
[02:41] ... aside from that :-\
[02:41] yes there are a shitload of things
[02:41] time, space, bandwidth
[02:41] knowledge
[02:42] do you feel this is an important website?
[02:43] hmm
[02:43] it has been slightly modified from how he left it by the person adding the note that it is being left how he left it
[02:45] yea that was probably a woman
[02:45] you know, it's not like he gave his admin passwords to his nuclear physicist hacker co-workers
[02:45] i was wget'ing for an hour and it was over 600mb
[02:46] nice robots.txt... the only thing disallowed is the web server statistics
[02:47] how can i estimate the website's size?
[02:47] in its entirety
[02:48] sum of all files
[02:50] without fetching everything? not really possible
[02:50] curl has -I
[02:50] curl displays the file size and last modification time only.
[02:50] you would have to spider everything with head requests as a minimum
[02:51] doesn't seem to work for index.html
[02:51] i've got a mirror underway
[02:51] do you think he has stats in that hidden folder mentioned in robots.txt?
[02:51] yeah, the server seems to not be sending content lengths for anything
[02:51] \fuck/
[02:52] yes, there are website statistics (like hit counts and such)
[02:52] are there size statistics?
[02:53] oh crap, I think I might want to respect that robots.txt entry.
[02:53] the statistics reports appear to be generated on request
[02:53] what does that mean for me?
[02:54] btw archiveteam has a really big list of likely-to-die pages, do you think you will manage to save them all?
[02:54] oh yay. no last-modified header either
[02:56] * Coderjoe tells wget to ignore /repwork, where the realtime report scripts are located
[02:58] robots.txt has content-length:36 with accept-ranges:bytes, does that mean 36 bytes?
[02:59] content-length: 36 means 36 bytes
[02:59] whee. deep.html is over 1MB
[03:01] what are you doing?
[03:01] watching wget take forever to grab some files
[03:01] wow
[03:01] btw another website where the owner died that has a shitload of web history on it is fravia's website @ searchlores.org
[03:01] other-links.html is at 6MB and growing
[03:02] at 80KB/s this is going to take awhile
[03:02] better known by his nickname Fravia, was a software reverse engineer and "seeker" known for his web ...
[03:02] Coderjoe: what are you mirroring?
[03:02] lachlan.bluehaze.com.au
[03:02] any way i can mirror it directly to a vps?
[03:03] run the wget on the vps?
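
(A rough sketch of the size-estimation idea from the 02:47-02:50 exchange: walk a list of URLs with HEAD requests and sum whatever Content-Length values come back. urls.txt is a placeholder for a URL list gathered by an earlier spider pass, and since this particular server rarely sends Content-Length, the total is a lower bound at best.)

  while read -r url; do
    # HEAD request only; tr strips the CRLF line endings from the headers
    curl -sI "$url" | tr -d '\r' | awk 'tolower($1) == "content-length:" {print $2}'
  done < urls.txt | awk '{sum += $1} END {printf "%d files reported a length, ~%.1f MB total\n", NR, sum/1048576}'
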
[03:03] no, run the curl on my site that pushes every file to a vps then deletes it, i can run that in memory
[03:03] the bottleneck isn't on my end
[03:03] or run it all in memory .. i have 16 gigs
[03:04] if i had a place to store all this information i could do it myself
[03:04] or together
[03:04] if you are running linux, you can mount a ramfs and wget to that
[03:04] you don't have much free disk space?
[03:05] like 50mb
[03:05] i have a 16 gig ramfs, but then what
[03:05] everyone reboots sometimes
[03:05] tar it up?
[03:06] yeesh. only 50mb?
[03:06] actually less, let me check
[03:06] last thing i did was tar up my repository of code
[03:06] i'm grabbing it to a disk with 400GB free
[03:06] hmm
[03:06] set up an ftp server and i'll dump to it?
[03:07] i am getting content lengths and mtimes on things like jpg files
[03:07] it can't be that big, his website is pretty old
[03:07] makes me think the html is being processed
[03:18] mmm
[03:19] 404
[03:19] amusing 404 page
[03:19] http://lachlan.bluehaze.com.au/ccp14admin/security/index.html
[03:19] you know... i probably would be better off using wget-warc
[03:21] his email address is invalid
[03:23] Paradoks: is your password "username"?
[03:23] are you fetching it in parallel?
[03:24] do you ever get harassed by your isp?
[03:24] not for a number of years
[03:25] comcast disconnected me without notice a few years back. (and I mean notice as in "we're disconnecting you", not like they claim they did, which was 2 months prior for a warning)
[03:25] Coderjoe: for overuse?
[03:25] yep
[03:25] verizon doesn't harrass
[03:26] harass *
[03:26] I wish I had FiOS though. it is available here.
[03:26] this was back when they had the unpublished limit, before they started with the overage charge
[03:26] i was reading in a thread today, one guy says he's being harassed because he got caught a few times for pirating torrents
[03:26] then someone else replies 'tell them to fuck off or else you'll choose another isp'
[03:26] I now have at&t uverse. they were supposedly going to put caps on it, but it doesn't appear to have happened yet
[03:26] you think they are likely to sue or obey?
[03:27] (possibly because they have trouble distinguishing TV traffic from the user's own internet traffic)
[03:27] that who is likely to sue or obey?
[03:27] that isp
[03:27] the isp of the guy in my story
[03:28] it is not the ISP's responsibility to sue over copyright infringement. it is the infringed party's.
[03:28] oh true
[03:29] hmm. this site might be large... looks like he published a lot of logs (with photos) of his travels
[03:30] wget downloaded those first for me, and it was just over 500 from 2002 to 2007
[03:30] as i remember they only went to 2008, and the later years only had a few pics in them
[03:31] you can also delete some of the gifs, he had copies of jpgs as gifs for thumbnails or quicker page loading
[03:31] (they're copies but one of the copies is lower quality)
[03:31] i will not be deleting anything
[03:32] i wonder if he embedded archives in his pictures
[03:35] asking, you should tell the user on the forum to stay off the piratebay
[03:38] what?
[03:39] "one guy says he's being harassed because he got caught a few times for pirating torrents"
[03:41] oh that, i didn't save the link
[03:41] he didn't necessarily have to be on the pirate bay
[03:41] public trackers in general
[03:41] sure
[04:26] Coderjoe: did you leave it running this whole time? how much did it download?
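
(A sketch of the "mount a ramfs, wget into it, tar it up" idea from around 03:04-03:05, for the low-disk-space case. The mount point, size, and ssh destination are placeholders; a tmpfs is used here, and its contents vanish on reboot, so the tarball is streamed straight to the remote box instead of being written locally.)

  sudo mkdir -p /mnt/ramgrab
  sudo mount -t tmpfs -o size=14g tmpfs /mnt/ramgrab
  cd /mnt/ramgrab
  wget --mirror --page-requisites --no-parent http://lachlan.bluehaze.com.au/
  # stream the result to a VPS rather than writing a local tarball
  tar -czf - lachlan.bluehaze.com.au | ssh user@vps.example.org 'cat > lachlan-mirror.tar.gz'
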
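
(On the 03:19 wget-warc remark: newer wget builds have WARC output built in, so something like the following would both mirror the site and produce a WARC, while skipping the /repwork report scripts mentioned earlier. This is an untested sketch with guessed-at defaults, not the command actually used.)

  wget --mirror --page-requisites --no-parent \
       --exclude-directories=/repwork \
       --warc-file=lachlan.bluehaze.com.au \
       --warc-cdx \
       http://lachlan.bluehaze.com.au/
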
[04:27] currently at 250M
[04:27] and in the reports
[04:28] (i told wget to ignore the directory where the realtime report scripts live)
[05:44] Uploaded 4 terabytes of Yahoo Video!
[05:44] :O
[05:44] Now I'm adding more, doing up .tars for it, and so on.
[05:45] So hopefully tomorrow, I can set the thing to start uploading them again.
[05:53] I've uploaded 41 sets of videos so far.
[05:53] I forgot it's only, like, 9.7 million users.
[05:53] all in flv?
[05:53] http://www.textfiles.com/videoyahoo/USERSCRAPE/USERLISTS/
[05:53] I believe so, yes.
[05:54] it's too bad you guys are so strict on altering the content, you would achieve better compression if you converted flv to another format before compressing as an archive
[05:54] there are lossless functions if you want
[05:54] Yes, that's really too bad.
[05:54] Remember when you see those old books?
[05:54] It's too bad they didn't cut out the parts that weren't that politically expedient.
[05:54] Or took out the black people
[05:55] it's lossless
[05:55] Or thought they could rewrite them in shorthand to save space
[05:55] no but how about tripping margins
[05:55] Is that like tripping balls?
[05:55] what?
[05:55] you mean lossless compression isn't good enough?
[05:56] You are free to download the files we're uploading, compress them any way you want, and make a new set.
[05:56] Give me another week or two, still uploading.
[05:56] and then throw it away in /dev/null? fantastic!
[05:56] That doesn't sound very lossless
[05:57] the files are lossless, your prerogative is lossy
[05:58] perhaps i didn't understand clearly (i'm new here), you want to mirror them as a compressed torrent, or exactly as the user interface for yahoo.com/videos was?
[05:59] I want to take these 15-20 terabytes of Yahoo! Video we downloaded over 4-5 months and put them on archive.org.
[05:59] And let people do whatever they want with them
[05:59] And I'm something like 10 terabytes in, things are going swimmingly.
[06:00] after them download it, or browser all those videos on archive.org
[06:00] ...?
[06:00] That wasn't english
[06:00] after that download it, or browse all those videos on archive.org
[06:00] Good question. Not sure.
[06:00] I'm sure something will happen with them.
[06:01] do you have a small dataset uploaded?
[06:01] I am probably going to write something to download the files, run some number crunching, upload the resulting number crunching.
[06:01] I have a fucking huge dataset uploaded, that's even better than a small one.
[06:01] are the videos divided into folders of users?
[06:02] I guess there's a small one in the big one.
[06:02] Yes.
[06:02] where?
[06:02] what's the url
[06:02] http://www.archive.org/details/archiveteam-yahoovideo
[06:02] i'll get back to you in a few days
[06:49] lachlan mirror still going. grabbing lots of jpgs now.. currently at 748M
[07:06] jeez
[07:07] is there a way to retrieve usenet posts that had the x-no-distribute header?
[07:08] if you have the article id or server-specific index number, perhaps
[07:08] otherwise, you have to pull full headers (not xover) of all the articles to find the ones that have headers like that
[12:01] now at 1.6G and still going
[12:55] Coderjoe: Of course "username" is my password. What's the point of having a password if it doesn't logically fit the username? (I have no idea why my User ID is "password". If I set it, I set it years ago.)
[13:52] Where should i upload my backup of an old website (only 30mb zipped)?
[15:10] Bear_: you're bearh?
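
(On the 07:07-07:08 usenet question: a very rough, untested sketch of the "pull full headers for every article" approach over raw NNTP with netcat. The server, group, and article range are placeholders, the commands are naively pipelined, and many servers will require authentication, so treat it as an illustration of the idea rather than a working scraper.)

  {
    printf 'GROUP alt.folklore.computers\r\n'
    for n in $(seq 100000 100200); do
      printf 'HEAD %d\r\n' "$n"     # full headers for each article, unlike XOVER
    done
    printf 'QUIT\r\n'
    sleep 10                        # keep stdin open so nc can finish reading replies
  } | nc news.example.com 119 | grep -i '^X-No-Distribute'
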
[15:11] msg SketchCow, he'll hook you up with an rsync account
[15:11] I can host it temporarily if you can't keep a stable connection up
[15:13] kk
[15:13] Sorry, I wasn't looking at my irc client.
[16:18] anyone know a good tool to download images off a flickr account without being the owner of that account?
[16:22] Schbirid: I have some
[16:23] Linux only though
[16:23] splendid
[16:23] i remember the term "flickr fuckr" but could not find it anymore
[16:23] Let me go find them
[16:24] Hmm, I think this is the current version
[16:24] Gimme a username to test
[16:24] :)
[16:25] random flickr says "minkee"
[16:27] Schbirid: Thanks
[16:27] One second
[16:27] Making a few changes
[16:27] you rock
[16:27] i'll be back in ~30 minutes
[16:27] Ok
[16:28] If you want to fully extract everything, you need ruby and the json and yaml gems
[16:28] (just fyi)
[16:55] Coderjoe: did it finish?
[16:56] wow 1.6
[16:57] Schbirid: you can't if they're marked private, but if you're going to index flickr, javascript can do it
[16:57] javascript is great because you can use it like the firefox XPS attacks against irc networks
[16:58] you get random users to go on your web site and the code runs
[16:58] they leave the site open and it scrapes and sends back the pics to you
[16:58] there's also javascript ddos websites done in the same fashion
[16:58] or bandwidth killers, those load up as many pictures as possible from the website
[16:59] it's really easy to get people to use that software to help you since you don't have to explain much to them
[17:00] like if u want grandma to help you, you can do it this way
[17:20] or you can not violate flickr's TOS by doing it the legitimate way
[17:22] Schbirid: http://tracker.archive.org/flickr/
[17:22] There's an example of the output too
[17:22] Need ruby, libjson-ruby, and libyaml-ruby
[17:23] (if you're on debian or ubuntu)
[17:39] underscor: works partially, it does not seem to like usernames like 54421772@N03 but looks like it is downloading the images alright
[17:39] thank you!
[17:47] Make sure you put the username in quotes
[17:47] But other than that, yay
[17:49] i did, it still seemed to use the @ as separator somewhere
[17:50] wait, actually this is not one of those usernames
[17:50] got a normal one with the same error. let me pastebin it
[18:00] underscor: http://pastebin.com/Hfk4tKBi
[18:00] i did "gem install json", not yaml since google told me that stuff is built in. using ruby 1.9.something (latest on archlinux)
[18:03] oh okay, yeah
[18:03] 1.8 needs it
[18:04] Ah, numeric username was borking some regexps
[18:05] Think I fixed it, let me test
[18:06] underscor: that's legitimate
[18:06] it's as legitimate as using wget to scrape it all
[18:06] except you're using javascript, and you can distribute easily
[18:08] except it's a lot more work to write the backend stuff to control it all
[18:08] Also, you run into XSS issues too
[18:09] Schbirid: flickrgrabr updated, should work now
[18:10] thanks!
[18:24] That's a funny use of Ruby: as a json to yaml converter. (For people who don't like json, presumably. :)
[18:28] underscor: yes, that fixed it :)
[18:29] awesome
[18:29] alard: ;)
[19:05] (It's only single-user, so if someone's already connected it'll drop new ones)
[19:05] ? :)
[19:05] Someone willing to test it by telnetting to 71.126.138.142:42
[19:05] Whee, got my parallax propeller running hangman over telnet
[19:15] wtf is a parallax propeller?
[19:15] Tried it, lost. Nice. (I'm probably not familiar enough with the type of words it uses.)
[19:15] it's an 8 core microchip
[19:15] http://tracker.archive.org/wordbank.txt
[19:16] closure: running at 80mhz with (I think) 32k ram
[19:17] an interesting microcontroller
[19:17] underscor: Ha. I recognize only a few.
[19:17] :)
[19:18] I did learn the word 'gonkulator', so it's a very educational game.
[19:18] asking: still going. now at 2.7G
[19:19] alard: hahha
[19:41] http://laughingsquid.com/fifteen-people-youll-see-at-every-video-gamecomicnerd-convention/
[19:47] http://sfist.com/attachments/SFist_AndrewD/OccupySF_Oct15_19_steverhodes.jpg?487
[19:47] woops wrong chan
[19:48] lemonkey: got my PM?
[19:52] underscor: https://gist.github.com/3e333ef4b583117928ee (Sorry, couldn't help it.)
[19:53] :D
[19:53] That's awesome
[20:46] I just acquired some old flyers/pictures zines from a basement. They're a bit damp. Any tips on how to dry these guys out?
[21:48] anyone good with archive conversion?
[21:50] specifically cbz cbr to pdf?
[22:36] underscor: I was wondering, would you be able (and willing) to set up a listerine-like thing for the downloading of MobileMe? It would be really helpful.
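
(Back on the flickr question from 16:18: besides flickrgrabr, the public API can be scripted directly. A hedged sketch, assuming a valid API key (YOUR_API_KEY is a placeholder), jq installed, and an NSID-style user id; it only grabs the first 500 public photos at the large "_b" size, not originals.)

  API_KEY=YOUR_API_KEY
  USER_ID='54421772@N03'
  # list the user's public photos, build static image URLs, and hand them to wget
  curl -s "https://api.flickr.com/services/rest/?method=flickr.people.getPublicPhotos&api_key=${API_KEY}&user_id=${USER_ID}&per_page=500&format=json&nojsoncallback=1" \
    | jq -r '.photos.photo[] | "https://live.staticflickr.com/\(.server)/\(.id)_\(.secret)_b.jpg"' \
    | wget --no-clobber --input-file=-
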
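
(For the 21:48 cbz/cbr-to-pdf question: a cbz is just a zip of page images and a cbr a rar, so one hedged approach is to unpack the pages and wrap them with img2pdf, which is lossless for jpeg/png, or ImageMagick's convert as a fallback. File names here are placeholders and page order follows the archive's own file naming.)

  mkdir pages
  unzip -j comic.cbz -d pages          # for a .cbr: unrar e comic.cbr pages/
  img2pdf pages/* -o comic.pdf         # wraps the page images into a single PDF
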