#archiveteam 2013-01-09,Wed


Time Nickname Message
00:51 SketchCow So, unfortunately, it looks like Myspace is now doing a small transition and killing pages
02:42 bsmith093 you know how some really advanced bulk renamers can add the parent folder(s) to the name of a file? well i need to remove that part. organized like this: stuff/blah/status/blah - authorname - filename.txt. the only matching parts will be the "blah", and it is guaranteed to be a part of the filename
02:56 instence_ uh
02:56 instence_ whats your before and after?
02:58 instence_ you say matching parts, are you trying to regex match those? and only modify those files partially? or?
02:59 instence_ an app for windows I used to rename stuff is called "ReNamer", works great
03:14 bsmith093 instence_: before= stuff/blah/status/blah - authorname - filename.txt
03:14 bsmith093 instence_: after= stuff/blah/status/authorname - filename.txt
03:28 instence_ with ReNamer that would be quite easy, but I think it's a windows only app
03:29 instence_ http://www.den4b.com/?x=downloads&product=renamer
03:30 instence_ http://www.den4b.com/?x=screenshots&product=renamer
03:31 instence_ you can stack rules as well
03:46 dashcloud so, is there still a channel for fileformat wiki efforts, or it just goes here or -bs?
04:45 tuankiet Hello everybody!
04:48 bsmith093 instence_: how would i do that in renamer, it runs fine in wine, so im using that for now
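Editor's note: the before/after that bsmith093 describes can also be scripted without ReNamer. The following is a minimal Python sketch, not anything proposed in the channel. The function names, the " - " separator and the "*.txt" pattern are assumptions taken from the single example path above.

```python
from pathlib import Path


def strip_folder_prefix(path: Path, sep: str = " - ") -> Path:
    """Drop a leading '<ancestor folder name><sep>' from the filename, if present."""
    for ancestor in path.parents:
        # Skip nameless ancestors such as "." or the filesystem root.
        if ancestor.name and path.name.startswith(ancestor.name + sep):
            return path.with_name(path.name[len(ancestor.name + sep):])
    return path


def rename_tree(root: Path, dry_run: bool = True) -> list[tuple[Path, Path]]:
    """Collect (old, new) pairs under root; only touch the disk when dry_run is False."""
    changes = []
    for path in sorted(root.rglob("*.txt")):
        target = strip_folder_prefix(path)
        if target != path:
            changes.append((path, target))
            # Never overwrite an existing file.
            if not dry_run and not target.exists():
                path.rename(target)
    return changes
```

Calling `rename_tree(Path("stuff"))` with the default `dry_run=True` only lists what would change, which seems the safer first step on a large tree.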
04:55 tuankiet @alard: are there any projects?
06:03 Nemo_bis SketchCow: thanks!
06:15 SketchCow No problem. Sorry there's still a lag with me this year.
06:16 SketchCow I'd hoped to be more archiveteam responsive, but this DEFCON documentary is kicking my aaaaaassssss
08:41 chronomex godane: ftp.download.packardbell.com: Downloaded: 2679 files, 28G in 1d 17h 47m 55s (194 KB/s)
08:42 chronomex now: time nice ionice -c 3 zip -vr ftp.download.packardbell.com.zip ftp.download.packardbell.com
09:07 godane chronomex: thanks for getting it
09:07 godane i know that would have taken me forever to get
09:08 chronomex :)
09:08 godane and to also upload
09:19 chronomex yeah, might take a while
09:19 chronomex I downloaded a terabyte of ftp last month :P
09:20 Nemo_bis chronomex: ah, 200 KB/s, lucky you :)
09:21 Nemo_bis NATO still at 40 KB/s
09:21 Nemo_bis 42 GiB so far
09:22 chronomex o_O
09:22 chronomex ftp.3gpp.org is huge
09:22 chronomex btw.
09:23 chronomex 350g, iirc
09:23 Nemo_bis everything has recent timestamps there
13:54 hiker1 To what extent does heritrix discover JavaScript and CSS?
14:15 alard tuankiet: Well, it's time to start downloading the Yahoo blogs.
14:16 alard hiker1: It probably downloads things referenced with <script> or <link rel="stylesheet"> tags, and I think it even has some rules to find images etc. in the actual CSS and JavaScript files.
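Editor's note: the kind of discovery alard describes (following <script> and <link rel="stylesheet"> tags, plus url(...) references inside CSS) can be illustrated with Python's standard library. This is a rough sketch under stated assumptions and is not Heritrix's extractor. The class and function names are made up for the example, and the CSS regex only covers the simple url(...) form.

```python
import re
from html.parser import HTMLParser
from urllib.parse import urljoin

# Matches url(foo.png), url('foo.png') and url("foo.png") in a stylesheet.
CSS_URL = re.compile(r"""url\(\s*['"]?([^'")]+)['"]?\s*\)""")


class AssetFinder(HTMLParser):
    """Collect script, stylesheet and image URLs referenced by an HTML page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.assets = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        rel = (attrs.get("rel") or "").lower().split()
        if tag in ("script", "img") and attrs.get("src"):
            self.assets.append(urljoin(self.base_url, attrs["src"]))
        elif tag == "link" and "stylesheet" in rel and attrs.get("href"):
            self.assets.append(urljoin(self.base_url, attrs["href"]))


def find_assets(html, base_url):
    finder = AssetFinder(base_url)
    finder.feed(html)
    return finder.assets


def find_css_urls(css, base_url):
    """Resolve url(...) references relative to the stylesheet's own URL."""
    return [urljoin(base_url, ref.strip()) for ref in CSS_URL.findall(css)]
```

URLs that JavaScript assembles at runtime are out of reach for this kind of static scan.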
14:16 hiker1 How easy is it to set up?
14:17 alard It isn't that hard, but it's unwieldy.
14:17 hiker1 I wanted to test it on a single site.
14:18 hiker1 I suppose it's probably not worth the hassle
14:18 ersi Neither Heritrix nor Wayback is easy to set up
14:22 hiker1 sigh.
14:22 hiker1 Maybe someone that knows how could release a VirtualBox image with it already installed and ready to accept a warc file?
14:24 hiker1 alard: There is a python library called mitmproxy. Might be useful to proxy the HTTPS records: http://mitmproxy.org/
14:24 hiker1 Right now I am using a simple rewrite modification to warc-proxy to get them sent.
14:24 hiker1 very, very rudimentary.
14:24 ersi I've fiddled a little with it, and plan to maybe continue - but we'll see (RE: wayback, heritrix)
14:24 godane so i just found a very good copy of the screen savers episode from 2003
14:25 godane Kevin Rose uploaded it too :-D
14:25 ersi OH MY GOD!
14:29 godane https://www.youtube.com/user/kevinrose
14:29 godane i found it on his youtube channel
14:29 godane i may have to email him so i get more episodes of tss
14:31 godane he has about 50 episodes of the screen savers in mp4
14:31 godane :-D
14:40 hiker1 WARC doesn't replay the actual browser sessions, only the traffic. Some JavaScript scripts I found appear to append a callback handle to the url that is generated at runtime based on a live JS object. WARC cannot replay this behavior.
14:42 hiker1 Technically it does archive all the information that a website outputs, but some of the information is impractical to use or view without extensive modifications to the JavaScript.
14:44 hiker1 It makes me think of an HTML5 game http://wordsquared.com/. You can download all the traffic, but you will never be able to see the game properly I think.
14:44 alard hiker1: And? Or are you just thinking aloud? :)
14:44 alard You could fix individual sites, but there's no general solution, I think.
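Editor's note: the problem hiker1 describes is that a replay tool matching on the exact recorded URL misses when the page generates a fresh callback name on each load. One site-specific fix of the sort alard mentions is to ignore that parameter when looking up the record and to patch the callback name in the stored body. The sketch below assumes a JSONP-style query parameter literally named "callback"; that name, the helper names and the in-memory index are all hypothetical, and nothing here is part of warc-proxy.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

CALLBACK_PARAM = "callback"  # assumed name; real sites vary


def replay_key(url):
    """Lookup key for a URL with the runtime-generated callback removed."""
    parts = urlsplit(url)
    kept = [(k, v) for k, v in parse_qsl(parts.query, keep_blank_values=True)
            if k != CALLBACK_PARAM]
    return urlunsplit((parts.scheme, parts.netloc, parts.path, urlencode(kept), ""))


def callback_of(url):
    return dict(parse_qsl(urlsplit(url).query)).get(CALLBACK_PARAM)


def build_index(records):
    """records: iterable of (recorded_url, body) pairs taken from an archive."""
    return {replay_key(url): (callback_of(url), body) for url, body in records}


def replay(index, url):
    """Return the recorded body, rewritten to call the callback the page asked for."""
    hit = index.get(replay_key(url))
    if hit is None:
        return None
    recorded_cb, body = hit
    wanted_cb = callback_of(url)
    if recorded_cb and wanted_cb:
        body = body.replace(recorded_cb + "(", wanted_cb + "(", 1)
    return body
```

This only helps where the callback travels in the query string; a game such as wordsquared that depends on live server state would still not replay.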
14:53 hiker1 thinking aloud :)
14:53 hiker1 I noticed this while attempting to archive a website just now.
14:54 tuankiet @alard: Tracker rate limiting is in effect. Retrying after 30 seconds... :((
14:57 alard tuankiet: Yes, there was something wrong yesterday. I'm now gathering some files to debug with. (Until I got distracted by wordsquared just now. :)
14:57 hiker1 hah xD
15:00 tuankiet @alard: Oh, running again. I've just restarted VMs to update the code :))
15:11 alard Good. Found the problem: HTTP/1.1 999 Unable to process request at this time -- error 999
15:12 alard What's the best way to handle those? Wait and retry?
15:13 Nemo_bis ah, as it was feared
15:14 balrog- that means you are being throttled
15:15 balrog- http://www.murraymoffatt.com/software-problem-0011.html
15:16 alard It's Nemo_bis, in this case.
15:17 Nemo_bis alard: I got that error? but I just started
15:17 balrog- wow MS is killing messenger
15:17 Nemo_bis I have lots of "Project code is out of date and needs to be upgraded. Retrying after 30 seconds..."
15:18 alard Yes, I've paused the thing again.
15:18 twrist Messenger is being integrated into skype, though.
15:18 twrist So yeah.
15:18 alard Nemo_bis: In the last few minutes there were 999-warcs from grue, tuankiet, and you.
15:18 Nemo_bis hm
15:19 balrog- twrist: yeah but the protocol, etc are going away
15:19 twrist Ah, right.
15:19 ersi Super old.
15:19 balrog- alard: need to detect 999s and throttle
15:19 twrist So, what's currently being archived?
15:19 Nemo_bis alard: I've switched the warrior to tinyback
15:19 alard balrog-: How long to wait? (And does saying you're from Google still work?)
15:20 balrog- alard: I don't know, I haven't tested; info online says 2-24 hours, but I don't know
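Editor's note: "detect 999s and throttle" amounts to treating status 999 as a signal to wait and retry instead of saving the response. The Python sketch below shows that pattern with a doubling wait. It is not the yahooblog-grab script. The `fetch` callable, the retry count and the 30-second starting wait are assumptions, and the log gives no confirmed figure for how long Yahoo's throttling lasts.

```python
import time

THROTTLED = 999  # non-standard status reported in the log above


def fetch_with_backoff(fetch, url, max_tries=5, first_wait=30.0, sleep=time.sleep):
    """Call fetch(url) -> (status, body); wait and retry while the status is 999.

    fetch is a hypothetical callable supplied by the caller. The wait doubles
    after every throttled attempt. Returns (999, None) if every attempt was
    throttled, so the caller can skip saving a useless record.
    """
    wait = first_wait
    for attempt in range(1, max_tries + 1):
        status, body = fetch(url)
        if status != THROTTLED:
            return status, body
        if attempt < max_tries:
            print(f"999 on {url}; retrying in {wait:.0f}s")
            sleep(wait)
            wait *= 2
    return THROTTLED, None
```

Passing `sleep` as a parameter keeps the function easy to exercise without real delays.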
15:20 Nemo_bis can it be that Yahoo is suspicious because it sees activity from my IP on flickr etc. as logged in user?
15:20 Nemo_bis it definitely can't be bandwidth in my case
15:21 tuankiet Bad thing now
15:22 alard Nemo_bis: Perhaps you're normally less active on Asian blogs.
15:23 twrist Give me a git URL to clone, guys.
15:23 ersi At what project are you guys getting HTTP 999's?
15:24 twrist I'm itching to join in.
15:24 ersi twrist: http://github.com/archiveteam/
15:24 twrist Need to be a bit more precise, I'm using ubuntu server and IRSSI
15:24 twrist I only just started as well
15:24 * twrist is GLaDOS, FYI
15:25 ersi I think they're doing yahooblogs-grab right now
15:25 twrist ah
15:25 twrist so https://github.com/archiveteam/yahooblogs-grab.git?
15:25 ersi yeah..
15:25 alard twrist: There's not much sense starting right now, we need to update the script.
15:26 alard ersi: blog.yahoo.com
15:26 twrist ah
15:28 tuankiet Or using Tor so we won't have 999 again. But the speed is super low :))
15:29 twrist The URL I typed out isn't working.
15:29 twrist Anyone else able to paste it in here?
15:29 Deewiant https://github.com/ArchiveTeam/yahooblog-grab.git
15:30 twrist ah, no s
15:37 twrist so the arguments were --downloader=name --concurrent=6?
15:43 alard Yes. There's a new version that should handle the 999 error better.
15:54 goekesmi ls
15:54 * goekesmi sighs.
15:54 hiker1 xD
16:04 chazchaz Is there a channel for yahooblog-grab?
16:42 SketchCow I suggest #yahooblah
16:46 Coderjoe O_O yahoo blog is from yahoo korea?
18:13 alard I think the current version of the script works better. (There are fewer 0MB items, and it's much slower.)
19:09 hiker1 Is anyone archiving stuff from Tor?
19:24 swebb I used tor once to auto-change my IP when grabbing some stuff from google, but it was way slow.
19:25 hiker1 well, yeah. But there are some websites which are tor only.
19:28 * ats raw-images an extremely dodgy floppy four times using two different Amiga drives, converts using disk-analyser, merges the resulting partial images back together giving a full image, and peers happily at the first bits of email he ever sent :)
19:28 balrog- what are you using to merge?
19:30 ats rawadf off aminet, patched to not complain about the number of tracks in the .eadf files disk-analyser produces
19:30 ats I also had to patch disk-analyser to not write junk into the EADF track header structure...
19:32 ats then disk-analyser again to turn (raw-track) EADF into (AmigaDOS-track) ADF, adfread to extract the files from the filesystem, and unar to extract the .lzx archives on the floppy
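Editor's note: the merge step ats describes combines several partial reads of the same disk into one full image by taking, for each track, a copy from whichever pass read it successfully. The Python sketch below illustrates only that idea on an in-memory model. It does not parse EADF and is not how rawadf works; representing a pass as a list with None for unreadable tracks is an assumption made for the example.

```python
def merge_passes(passes):
    """Merge several read passes of one disk, track by track.

    passes: list of passes; each pass is a list of track payloads (bytes),
    with None where that pass failed to read the track.
    Returns (merged_tracks, missing_track_numbers).
    """
    track_count = max(len(p) for p in passes)
    merged, missing = [], []
    for n in range(track_count):
        # Take the first pass that has a good copy of track n.
        good = next((p[n] for p in passes if n < len(p) and p[n] is not None), None)
        merged.append(good)
        if good is None:
            missing.append(n)
    return merged, missing
```

A non-empty `missing` list means no pass recovered those tracks, so another read attempt would be needed.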
19:52 hiker1 If anyone is bored of archiving with wget, please try my WarcMiddleware. I'd be glad to assist in setting it up. https://github.com/iramari/WarcMiddleware
20:34 Nemo_bis alard: how do I know if I'm still collecting mostly useless 999 crap, in case I work on Yahoo?
21:02 alard Nemo_bis: Hard to say. It shouldn't, it should retry (and print a message).
21:04 Nemo_bis ok
21:05 Nemo_bis TinyBack was getting ratelimited anyway
21:38 SketchCow Nemo_bis: http://archive.org/details/magazine_rack
21:38 Nemo_bis SketchCow: Pretty!!!
21:39 Nemo_bis Are you going to make some of those dark?
21:39 SketchCow Ostensibly
21:40 Nemo_bis :)
21:45 SketchCow Like, Wood Magazine will probably disappear.
21:50 Nemo_bis But... children in Africa will DIE if we don't let them know how to build life-saving wood stuff, in English, on a website!
21:53 Nemo_bis On eMule and eMule only there's also another 5 GiB archive of another woodworking magazine. Surely the same woodworking geek scanner.
21:54 chronomex haha
21:55 SketchCow Which one?
21:55 SketchCow You have so many here.
21:56 SketchCow http://archive.org/details/general_magazine
21:56 SketchCow http://archive.org/details/woodsmith_magazin
21:57 SketchCow http://archive.org/details/woodsmith_magazine I mean
21:58 SketchCow How long was this uploading, Nemo_bis?
21:58 Nemo_bis SketchCow: I don't know, a few days of work for the CSV maybe.
21:58 Nemo_bis I didn't measure the time for download and upload in itself.
22:00 Nemo_bis Also a few hours of trackers browsing and other searches.
22:01 Nemo_bis http://p.defau.lt/?YTRaoQFxExjw8T612Pl_XQ
22:03 SketchCow In the future, like godane, I can just browse your uploads and see what you haven't had pushed into a collection and make it happen.
22:03 SketchCow Your activities also get the attention of the devs, who see it come by
22:08 * Nemo_bis hopes not to get too many curses
22:08 Nemo_bis I thought sending you a nice list at the end of the job was going to be helpful?
22:09 SketchCow No.
22:09 SketchCow Doesn't help and it actually gets caught in the spam filter
22:10 SketchCow Because someone from italy is mailing me piles of URLs
22:10 Nemo_bis Oh, even.
22:10 chronomex :P
22:12 SketchCow Also, the vokrugsveta collection didn't make it through the fun
22:12 SketchCow I'm going to make it a collection for you, but it needs more love
22:12 Nemo_bis Yes, I noticed.
22:13 Nemo_bis I didn't look at those zips carefully enough, sorry.
22:13 SketchCow Yeah, those things are buuuuuuuunk
22:13 SketchCow How about I dark them all with a note to delete them?
22:13 Nemo_bis Suggestions on how to get something useful out of a FictionBook?
22:13 Nemo_bis I'm ok with it.
22:14 SketchCow No, wait, this thing is valid.
22:14 SketchCow Just not playing with our system
22:14 SketchCow FICTIONBOOOOOOOOOK
22:14 SketchCow Thanks, Russia
22:14 Nemo_bis heh
22:14 Nemo_bis It's not even well seeded, by the way.
22:28 hiker1 Nemo_bis: What did you mean when you said make some of those dark?
22:28 SketchCow http://archive.org/details/vokrugsveta
22:28 SketchCow we'll see when the gods arise on that one
22:29 mistym Nemo_bis: Wikipedia suggests Calibre can convert FictionBook to smth more conventional.
22:30 SketchCow https://twitter.com/jefferson_bail/status/289096186420400128
22:49 Nemo_bis SketchCow: thanks for fixing it. I liked that tweet too, wondered what syllabus exactly.
22:50 SketchCow I'm sure it's related to computer programming, and realizing what was done
22:50 SketchCow I asked him to send it along.
22:50 Nemo_bis Nice
22:52 SketchCow By the way, the guy who wrote the wikipedia entry also wrote a scathing e-mail to archive.org about how we were the pit of evil
22:52 SketchCow Good thing I helped bring in so much fundraising last year
22:53 SketchCow Also: Ares Magazine is as sexy as sexy gets
22:58 SketchCow http://archive.org/details/ares_magazine
23:03 Nemo_bis Should still be usable, shouldn't it? With some printing perhaps.
23:05 godane stupid question
23:05 godane i don't know how to submit a comment on youtube
23:07 SketchCow Goood
23:08 godane why is that?
23:08 godane trying to help kevin rose upload the 50 episodes of the screen savers he has
23:10 godane this is the episode in question: https://www.youtube.com/watch?v=ZglwVT5NIJw
23:10 godane it's an episode from july 14 2003
23:11 godane there are next to no caps for episodes in 2003
23:26 SketchCow Example of "I'm just gonna dark it"
23:26 SketchCow http://www.woodworkersjournal.com/Main/Store/5_Disc_Annual_Collection_CD_Bundle_20052009_257.aspx
23:30 dashcloud here's something interesting I came across today: http://www.emsps.com/oldtools/ They buy and sell old-very old software
23:31 Nemo_bis SketchCow: some computer magazines like Pc Open here use the PDFs of their past issues as fillers for DVDs when they don't find enough stuff, it seems.
23:32 Nemo_bis Something like 10 % of their CD/DVDs contain either some or all past issues in PDF...
23:32 dashcloud Linux Journal definitely does that
23:38 chronomex nice
23:46 SketchCow So, I don't mind being the guy making these collections, BUT
23:47 SketchCow I'd really appreciate it if you do-gooder motherfuckers would walk the collection and find doubles and cases where we have something really shitty when there's known better versions.
23:52 Nemo_bis SketchCow: are there more duplicates than those I told you?
23:52 Nemo_bis (Question is pointless if email really went to spam.)
23:55 SketchCow It did go to spam.
23:58 Nemo_bis http://p.defau.lt/?2fxIiFNmvwaO2FBSJdn7fA
23:58 Nemo_bis <https://archive.org/search.php?query=%22Toronto%20PET%20User%27s%20Group%22> (duplicate of <https://archive.org/details/tpug-newsletter>, I'm afraid)
23:58 Nemo_bis and YourComputer which you had already spotted (and deleted, unless it was someone else)
23:59 Nemo_bis I didn't find more in public items.
