#archiveteam 2011-10-16,Sun

Time Nickname Message
00:19 🔗 dashcloud so is there a general tool that will scrape urls from a google search for you?
02:18 🔗 Coderjoe no kidding about metadata being a love note to the future...
02:18 🔗 Coderjoe I have a couple hundred unlabeled VHS tapes, a number of unlabeled hard drives (offline), a bunch of unlabeled floppies, etc...
02:19 🔗 Coderjoe but something is a little odd about the shirt coloring around the text, like they had the heat press turned up too high, pressed too long, or with too much pressure
02:33 🔗 asking can we archive lachlan cranswick's page?
02:33 🔗 asking he died and i don't know how long it will be up
02:34 🔗 asking he was a physicist from australia with a bunch of topics on his page ranging from physics to poetry
02:34 🔗 asking lachlan.bluehaze.com.au/
02:34 🔗 asking proof: www.abc.net.au/news/2010-06-16/aussie-scientists-body-found-in-canadian-river/869416
02:34 🔗 asking more proof www.cbc.ca/news/canada/ottawa/story/2010/02/04/deep-river-lachlan-cranswick.html
02:40 🔗 Coderjoe go ahead
02:41 🔗 asking i don't have the space
02:41 🔗 Coderjoe Nothing stops an individual from archiving something they feel is important
02:41 🔗 Coderjoe ... aside from that :-\
02:41 🔗 asking yes there are a shitload of things
02:41 🔗 asking time, space, bandwidth
02:41 🔗 asking knowledge
02:42 🔗 asking do you feel this is an important website?
02:43 🔗 Coderjoe hmm
02:43 🔗 Coderjoe it has been slightly modified from how he left it by the person adding the note that it is being left how he left it
02:45 🔗 asking yea that was probably a woman
02:45 🔗 asking you know, it's not like he gave his admin passwords to his nuclear physicist hacker co-workers
02:45 🔗 asking i was wget'ing for an hour and it was over 600mb
02:46 🔗 Coderjoe nice robots.txt... only thing disallowed is the web server statistics
02:47 🔗 asking how can i estimate the website's size?
02:47 🔗 asking in its entirety
02:48 🔗 asking sum of all files
02:50 🔗 Coderjoe without fetching everything? not really possible
02:50 🔗 asking curl has -I
02:50 🔗 asking curl displays the file size and last modification time only.
02:50 🔗 Coderjoe you would have to spider everything with head requests as a minimum
02:51 🔗 asking doesn't seem to work for index.html
02:51 🔗 Coderjoe i've a mirroring underway
02:51 🔗 asking do you think he has stats in that hidden folder mentioned in robots.txt?
02:51 🔗 Coderjoe yeah, the server seems to not be sending content lengths for anything
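A minimal sketch of the HEAD-request approach discussed above, assuming a list of URLs has already been collected by spidering. The function names are hypothetical and not from the log. It reads the Content-Length header from each HEAD response and keeps a separate count of responses that report no size, which is the situation Coderjoe runs into with this server.

```python
import urllib.request


def parse_content_length(value):
    """Convert a Content-Length header value to an int; None if missing or malformed."""
    if value is None:
        return None
    try:
        length = int(value.strip())
    except ValueError:
        return None
    return length if length >= 0 else None


def head_content_length(url, timeout=10):
    """Send a HEAD request and return the reported Content-Length, or None."""
    request = urllib.request.Request(url, method="HEAD")
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return parse_content_length(response.headers.get("Content-Length"))


def estimate_total(lengths):
    """Sum the known sizes and count the responses that reported no size."""
    known = [n for n in lengths if n is not None]
    return sum(known), len(lengths) - len(known)
```

Calling `estimate_total([head_content_length(u) for u in urls])` gives a lower bound on the total plus the number of URLs whose size is unknown. When the server omits Content-Length, as it does here for the HTML pages, only a full fetch gives the real figure.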
02:51 🔗 asking \fuck/
02:52 🔗 Coderjoe yes, there are website statistics (like hit counts and such)
02:52 🔗 asking are there size statistics?
02:53 🔗 Coderjoe oh crap, I think I might want to respect that robots.txt entry.
02:53 🔗 Coderjoe the statistics reports appear to be generated on request
02:53 🔗 asking what does that mean for me?
02:54 🔗 asking btw archiveteam has a really big list of likely-to-die pages, do you think you will manage to save them all?
02:54 🔗 Coderjoe oh yay. no last-modified header either
02:56 🔗 * Coderjoe tells wget to ignore /repwork, where the realtime report scripts are located
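A small sketch of how a crawler could honor a robots.txt entry like the one discussed here, using Python's standard `urllib.robotparser`. The robots.txt text below is an assumption: the log only says the statistics area is disallowed and that the report scripts live under /repwork. The log also does not say which wget option was used; wget does provide `-X` / `--exclude-directories` for skipping directories.

```python
from urllib.robotparser import RobotFileParser

# Assumed content; the real robots.txt is not quoted in the log.
ASSUMED_ROBOTS_TXT = """\
User-agent: *
Disallow: /repwork/
"""


def build_parser(robots_text):
    """Parse robots.txt text that has already been downloaded."""
    parser = RobotFileParser()
    parser.parse(robots_text.splitlines())
    return parser


def allowed_urls(parser, urls, user_agent="*"):
    """Keep only the URLs that the robots.txt rules permit for user_agent."""
    return [url for url in urls if parser.can_fetch(user_agent, url)]
```

Filtering the crawl queue through `allowed_urls` before fetching keeps the mirror away from scripts that generate reports on every request.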
02:58 🔗 asking robots.txt has content-length:36 with accept-ranges:bytes, does that mean 36 bytes?
02:59 🔗 Coderjoe content-length: 36 means 36 bytes
02:59 🔗 Coderjoe whee. deep.html is over 1MB
03:01 🔗 asking what are you doing?
03:01 🔗 Coderjoe watching wget take forever to grab some files
03:01 🔗 Coderjoe wow
03:01 🔗 asking btw another website where the owner died that has a shitload of web history on it is fravia's website @ searchlores.org
03:01 🔗 Coderjoe other-links.html is at 6MB and growing
03:02 🔗 Coderjoe at 80KB/s this is going to take awhile
03:02 🔗 asking better known by his nickname Fravia, was a software reverse engineer and "seeker" known for his web ...
03:02 🔗 asking Coderjoe: what are you mirroring?
03:02 🔗 Coderjoe lachlan.bluehaze.com.au
03:02 🔗 asking any way i can mirror it directly to a vps?
03:03 🔗 Coderjoe run the wget on the vps?
03:03 🔗 asking no run the curl on my site that pushes every file to a vps then deletes it, i can run that in memory
03:03 🔗 Coderjoe the bottleneck isn't on my end
03:03 🔗 asking or run it all in memory .. i have 16 giga
03:04 🔗 asking if i had a place to store all this information i could do it myself
03:04 🔗 asking or together
03:04 🔗 Coderjoe if you are running linux, you can mount a ramfs and wget to that
03:04 🔗 Coderjoe you don't have much free disk space?
03:05 🔗 asking like 50mb
03:05 🔗 asking i have a 16giga ramfs, but then what
03:05 🔗 asking everyone reboots sometimes
03:05 🔗 Coderjoe tar it up?
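A minimal sketch of the "tar it up" step, assuming the mirror has been written to a directory (for example on a RAM-backed filesystem) and the archive path points at storage that survives a reboot. The function name and paths are hypothetical.

```python
import tarfile
from pathlib import Path


def tar_up(source_dir, archive_path):
    """Pack source_dir into a gzip-compressed tarball and return its member names."""
    source = Path(source_dir)
    with tarfile.open(str(archive_path), "w:gz") as archive:
        # arcname keeps paths in the archive relative to the mirror's top directory.
        archive.add(str(source), arcname=source.name)
    # Re-open the finished archive to confirm what was actually written.
    with tarfile.open(str(archive_path), "r:gz") as archive:
        return archive.getnames()
```

Listing the members after writing is a cheap sanity check before the in-memory copy is discarded.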
03:06 🔗 Coderjoe yeesh. only 50mb?
03:06 🔗 asking actually less, let me check
03:06 🔗 asking last thing i did was tar up my repository of code
03:06 🔗 Coderjoe i'm grabbing it to a disk with 400GB free
03:06 🔗 Coderjoe hmm
03:06 🔗 asking set up an ftp server and i'll dump to it?
03:07 🔗 Coderjoe i am getting content lengths and mtimes on things like jpg files
03:07 🔗 asking it can't be that big, his website is pretty old
03:07 🔗 Coderjoe makes me think the html is being processed
03:18 🔗 Coderjoe mmm
03:19 🔗 Coderjoe 404
03:19 🔗 Coderjoe amusing 404 page
03:19 🔗 Coderjoe http://lachlan.bluehaze.com.au/ccp14admin/security/index.html
03:19 🔗 Coderjoe you know... i probably would be better off using wget-warc
03:21 🔗 asking his email address is invalid
03:23 🔗 Coderjoe Paradoks: is your password "username"?
03:23 🔗 asking are you fetching it in parallel?
03:24 🔗 asking do you ever get harassed by your isp?
03:24 🔗 Coderjoe not for a number of years
03:25 🔗 Coderjoe comcast disconnected me without notice a few years back. (and I mean notice as in "we're disconnecting you", not like they claim they did, which was 2 months prior for a warning)
03:25 🔗 balrog Coderjoe: for overuse?
03:25 🔗 Coderjoe yep
03:25 🔗 balrog verizon doesn't harrass
03:26 🔗 balrog harass *
03:26 🔗 balrog I wish I had FiOS though. it is available here.
03:26 🔗 Coderjoe this was back when they had the unpublished limit, before they started with the overage charge
03:26 🔗 asking i was reading in a thread today, one guy says he's being harassed because he got caught a few times for pirating torrents
03:26 🔗 asking then someone else replies 'tell them to fuck off or else you'll choose another isp'
03:26 🔗 Coderjoe I now have at&t uverse. they were supposedly going to put caps on it, but it doesn't appear to have happened yet
03:26 🔗 asking you think they are likely to sue or obey?
03:27 🔗 Coderjoe (possibly because they have trouble distinguishing TV traffic from the user's own internet traffic)
03:27 🔗 Coderjoe that who is likely to sue or obey?
03:27 🔗 asking that isp
03:27 🔗 asking the isp in the guy in my story
03:28 🔗 Coderjoe it is not the ISP's responsibility to sue over copyright infringement. it is the infringed party's.
03:28 🔗 asking oh true
03:29 🔗 Coderjoe hmm. this site might be large... looks like he published a lot of logs (with photos) of his travels
03:30 🔗 asking wget downloaded those first for me, and it was just over 500 from 2002 to 2007
03:30 🔗 asking as i remember they only went to 2008, and the later years only had a few pics in them
03:31 🔗 asking you can also delete some of the gifs, he had copies of jpgs as gifs for thumbnails or quicker page loading
03:31 🔗 asking (they're copies but one of the copies is lower quality)
03:31 🔗 Coderjoe i will not be deleting anything
03:32 🔗 asking i wonder if he embedded archives in his pictures
03:35 🔗 m0lson asking, you should tell the user on the forum to stay off the piratebay
03:38 🔗 asking what?
03:39 🔗 m0lson "one guy says he's being harassed because he got caught a few times for pirating torrents"
03:41 🔗 asking oh that, i didn't save the link
03:41 🔗 asking he didn't necessarily have to be on the pirate bay
03:41 🔗 m0lson public trackers in general
03:41 🔗 asking sure
04:26 🔗 asking Coderjoe: did you leave it running this whole time? how much did it download?
04:27 🔗 Coderjoe currently at 250M
04:27 🔗 Coderjoe and in the reports
04:28 🔗 Coderjoe (i told wget to ignore the directory where the realtime report scripts live)
05:44 🔗 SketchCow Uploaded 4 terabytes of Yahoo Video!
05:44 🔗 asking :O
05:44 🔗 SketchCow Now I'm adding more, doing up .tars for it, and so on.
05:45 🔗 SketchCow So hopefully tomorrow, I can set the thing to start uploading them again.
05:53 🔗 SketchCow I've uploaded 41 sets of videos so far.
05:53 🔗 SketchCow I forgot it's only, like, 9.7 million users.
05:53 🔗 asking all in flv?
05:53 🔗 SketchCow http://www.textfiles.com/videoyahoo/USERSCRAPE/USERLISTS/
05:53 🔗 SketchCow I believe so, yes.
05:54 🔗 asking it's too bad you guys are so strict on altering the content, you would achieve better compression if you converted flv to another format before compressing as an archive
05:54 🔗 asking there are lossless functions if you want
05:54 🔗 SketchCow Yes, that's really too bad.
05:54 🔗 SketchCow Remember when you see those old books?
05:54 🔗 SketchCow It's too bad they didn't cut out the parts that weren't that politically expedient.
05:54 🔗 SketchCow Or took out the black people
05:55 🔗 asking it's lossless
05:55 🔗 SketchCow Or thought they could rewrite them in shorthand to save space
05:55 🔗 asking no but how about tripping margins
05:55 🔗 SketchCow Is that like tripping balls?
05:55 🔗 asking what?
05:55 🔗 asking you mean lossless compression isn't good enough?
05:56 🔗 SketchCow You are free to download the files we're uploading, compress them any way you want, and make a new set.
05:56 🔗 SketchCow Give me another week or two, still uploading.
05:56 🔗 asking and then throw it away in /dev/null? fantastic!
05:56 🔗 SketchCow That doesn't sound very lossless
05:57 🔗 asking the files are lossless, your prerogative is lossy
05:58 🔗 asking perhaps i didn't understand clearly (i'm new here), you want to mirror them as a compressed torrent, or exactly as the user interface was for yahoo.com/videos was
05:59 🔗 SketchCow I want to take these 15-20 terabytes of Yahoo! Video we downloaded over 4-5 months and put them on archive.org.
05:59 🔗 underscor And let people do whatever they want with them
05:59 🔗 SketchCow And I'm something like 10 terabytes in, things are going swimmingly.
06:00 🔗 asking after them download it, or browser all those videos on archive.org
06:00 🔗 asking ...?
06:00 🔗 SketchCow That wasn't english
06:00 🔗 asking after that download it, or browse all those videos on archive.org
06:00 🔗 SketchCow Good question. Not sure.
06:00 🔗 SketchCow I'm sure something will happen with them.
06:01 🔗 asking do you have a small dataset uploaded?
06:01 🔗 SketchCow I am probably going to write something to download the files, run some number crunching, upload the resulting number crunching.
06:01 🔗 SketchCow I have a fucking huge dataset uploaded, that's even better than a small one.
06:01 🔗 asking are the videos divided into folders of users?
06:02 🔗 SketchCow I guess there's a small one in the big one.
06:02 🔗 SketchCow Yes.
06:02 🔗 asking where?
06:02 🔗 asking what's the url
06:02 🔗 SketchCow http://www.archive.org/details/archiveteam-yahoovideo
06:02 🔗 asking i'll get back to you in a few days
06:49 🔗 Coderjoe lachlan mirror still going. grabbing lots of jpgs now.. currently at 748M
07:06 🔗 asking jeez
07:07 🔗 asking is there a way to retrieve usenet posts that had the x-no-distribute header?
07:08 🔗 Coderjoe if you have the article id or server-specific index number, perhaps
07:08 🔗 Coderjoe otherwise, you have to pull full headers (not xover) of all the articles to find the ones that have headers like that
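A rough sketch of the filtering step Coderjoe describes, assuming the full header block of each article has already been pulled from the news server as text (the fetching itself is not shown). The function name is hypothetical; the header name is passed in as a parameter, so the one mentioned in the log can be used as-is.

```python
from email.parser import HeaderParser


def articles_with_header(header_blocks, header_name):
    """Return the Message-IDs of articles whose full headers include header_name."""
    parser = HeaderParser()
    matches = []
    for block in header_blocks:
        headers = parser.parsestr(block)
        # Header-name lookup on a parsed message is case-insensitive.
        if header_name in headers:
            matches.append(headers.get("Message-ID"))
    return matches
```

An overview (xover) listing carries only a fixed set of fields, which is why the full headers are needed before a check like this can work.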
12:01 🔗 Coderjoe now at 1.6G and still going
12:55 🔗 Paradoks Coderjoe: Of course "username" is my password. What's the point of having a password if it doesn't logically fit the username? (I have no idea why my User ID is "password". If I set it, I set it years ago.)
13:52 🔗 bearh Where should i upload my backup of an old website(only 30mb zipped)?
15:10 🔗 inv Bear_: you're bearh?
15:11 🔗 inv msg SketchCow, he'll hook you up with an rsync account
15:11 🔗 inv I can host it temporarily if you can't keep a stable connection up
15:13 🔗 Bear_ kk
15:13 🔗 Bear_ Sorry, I wasn't looking at my irc client.
16:18 🔗 Schbirid anyone know a good tool to download images off a flickr account without being the owner of that account?
16:22 🔗 underscor Schbirid: I have some
16:23 🔗 underscor Linux only though
16:23 🔗 Schbirid splendid
16:23 🔗 Schbirid i remember the term "flickr fuckr" but could not find it anymore
16:23 🔗 underscor Let me go find them
16:24 🔗 underscor Hmm, I think this is the current version
16:24 🔗 underscor Gimme a username to test
16:24 🔗 underscor :)
16:25 🔗 Schbirid random flickr says "minkee"
16:27 🔗 underscor Schbirid: Thanks
16:27 🔗 underscor One second
16:27 🔗 underscor Making a few changes
16:27 🔗 Schbirid you rock
16:27 🔗 Schbirid i'll be back in ~30 minutes
16:27 🔗 underscor Ok
16:28 🔗 underscor If you want to fully extract everything, you need ruby and the json and yaml gems
16:28 🔗 underscor (just fyi)
16:55 🔗 asking Coderjoe: did it finish?
16:56 🔗 asking wow 1.6
16:57 🔗 asking Schbirid: you can't if they're marked private, but if you're going to index flickr, javascript can do it
16:57 🔗 asking javascript is great because you can use it like the firefox XPS attacks against irc networks
16:58 🔗 asking you get random users to go on your web site and the code runs
16:58 🔗 asking they leave the site open and it scrapes and send back the pics to you
16:58 🔗 asking there's also javascript ddos websites done in the same fashion
16:58 🔗 asking or bandwidth killers, those load up as many pictures as possible from the website
16:59 🔗 asking its really easy to get people to use that software to help you since you don't have to explain them much
17:00 🔗 asking like if u want grandma to help you, you can do it this way
17:20 🔗 underscor or you can not violate flickr's TOS by doing it the legitimate way
17:22 🔗 underscor Schbirid: http://tracker.archive.org/flickr/
17:22 🔗 underscor There's an example of the output too
17:22 🔗 underscor Need ruby, libjson-ruby, and libyaml-ruby
17:23 🔗 underscor (if you're on debian or ubuntu)
17:39 🔗 Schbirid underscor: works partially, it does not seem to like usernames like 54421772@N03 but looks like it is downloading the images alright
17:39 🔗 Schbirid thank you!
17:47 🔗 underscor Make sure you put the username in quotes
17:47 🔗 underscor But other than that, yay
17:49 🔗 Schbirid i did, it still seemed to use the @ as separator somewhere
17:50 🔗 Schbirid wait, actually this is not those usernames
17:50 🔗 Schbirid got a normal one with the same error. let me pastebin it
18:00 🔗 Schbirid underscor: http://pastebin.com/Hfk4tKBi
18:00 🔗 Schbirid i did "gem install json", not yaml since google told me that stuff is builtin. using ruby 1.9.something (latest on archlinux)
18:03 🔗 underscor oh okay, yeah
18:03 🔗 underscor 1.8 needs it
18:04 🔗 underscor Ah, numeric username was borking some regexps
18:05 🔗 underscor Think I fixed it, let me test
18:06 🔗 asking underscor: that's legitimate
18:06 🔗 asking it's as legitimate as using wget to scrape it all
18:06 🔗 asking except you're using javascript, and you can distribute easily
18:08 🔗 underscor except it's a lot more work to write the backend stuff to control it all
18:08 🔗 underscor Also, you run into XSS issues too
18:09 🔗 underscor Schbirid: flickrgrabr updated, should work now
18:10 🔗 Schbirid thanks!
18:24 🔗 alard That's a funny use of Ruby: as a json to yaml converter. (For people who don't like json, presumably. :)
18:28 🔗 Schbirid underscor: yes, that fixed it :)
18:29 🔗 underscor awesome
18:29 🔗 underscor alard: ;)
19:05 🔗 underscor <underscor> (It's only singleuser, so if someone's already connected it'll drop new ones)
19:05 🔗 underscor <underscor> ? :)
19:05 🔗 underscor <underscor> Someone willing to test it by telnetting to
19:05 🔗 underscor <underscor> Whee, got my parallax propeller running hangman over telnet
19:15 🔗 closure wtf is a parallax propeller?
19:15 🔗 alard Tried it, lost. Nice. (I'm probably not familiar enough with the type of words it uses.)
19:15 🔗 underscor it's an 8 core microchip
19:15 🔗 underscor http://tracker.archive.org/wordbank.txt
19:16 🔗 underscor closure: running at 80mhz with (I think) 32k ram
19:17 🔗 balrog an interesting microcontroller
19:17 🔗 alard underscor: Ha. I recognize only a few.
19:17 🔗 underscor :)
19:18 🔗 alard I did learn the word 'gonkulator', so it's a very educational game.
19:18 🔗 Coderjoe asking: still going. now at 2.7G
19:19 🔗 underscor alard: hahha
19:41 🔗 lemonkey http://laughingsquid.com/fifteen-people-youll-see-at-every-video-gamecomicnerd-convention/
19:47 🔗 lemonkey http://sfist.com/attachments/SFist_AndrewD/OccupySF_Oct15_19_steverhodes.jpg?487
19:47 🔗 lemonkey woops wrong chan
19:48 🔗 balrog lemonkey: got my PM?
19:52 🔗 alard underscor: https://gist.github.com/3e333ef4b583117928ee (Sorry, couldn't help it.)
19:53 🔗 underscor :D
19:53 🔗 underscor That's awesome
20:46 🔗 human39_ I just acquired some old flyers/pictures zines from a basement. They're a bit damp. Any tips on how to dry these guys out?
21:48 🔗 bsmith093 anyone good with archive conversion?
21:50 🔗 bsmith093 specifically cbz cbr to pdf?
22:36 🔗 alard underscor: I was wondering, would you be able (and willing) to set up a listerine-like thing for the downloading of MobileMe? It would be really helpful.
