Time |
Nickname |
Message |
00:51
π
|
SketchCow |
So, unfortunately, it looks like Myspace is now doing a small transition and killing pages |
02:42
π
|
bsmith093 |
you know how some really advanced bulk renamers can add the parent folder(s) to the name of a file? well, i need to remove that part. the files are organized like this: stuff/blah/status/blah - authorname - filename.txt. the only matching parts will be the "blah", and it is guaranteed to be a part of the filename |
02:56
π
|
instence_ |
uh |
02:56
π
|
instence_ |
whats your before and after? |
02:58
π
|
instence_ |
you say matching parts, are you trying to regex match those? and only modify those files partially? or? |
02:59
π
|
instence_ |
an app for windows I used to rename stuff is called "ReNamer", works great |
03:14
π
|
bsmith093 |
instence_: before= stuff/blah/status/blah - authorname - filename.txt |
03:14
π
|
bsmith093 |
instence_: after= stuff/blah/status/authorname - filename.txt |
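The before/after pair above can be sketched in a few lines of Python. This is only an illustration of the rule bsmith093 describes (the file name starts with the name of one of its parent folders, followed by " - "); the function name `strip_parent_prefix` and the `sep` parameter are made up for this example, and it only computes the new path rather than renaming anything on disk.

```python
from pathlib import PurePosixPath


def strip_parent_prefix(path: str, sep: str = " - ") -> str:
    """Drop a leading '<ancestor folder><sep>' from the file name, if present.

    Hypothetical helper: it returns the new path and touches no files.
    """
    p = PurePosixPath(path)
    # Check each ancestor folder name, nearest folder first.
    for folder in (a.name for a in p.parents if a.name):
        prefix = folder + sep
        # Require something to be left over after removing the prefix.
        if p.name.startswith(prefix) and len(p.name) > len(prefix):
            return str(p.with_name(p.name[len(prefix):]))
    return path  # no ancestor name matched; leave the path alone


print(strip_parent_prefix("stuff/blah/status/blah - authorname - filename.txt"))
```

To apply it for real, one could walk the tree and pass each old/new pair to `os.rename`, ideally after printing the pairs once as a dry run.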
03:28
π
|
instence_ |
with ReNamer that would be quite easy, but I think it's a windows only app |
03:29
π
|
instence_ |
http://www.den4b.com/?x=downloads&product=renamer |
03:30
π
|
instence_ |
http://www.den4b.com/?x=screenshots&product=renamer |
03:31
π
|
instence_ |
you can stack rules as well |
03:46
π
|
dashcloud |
so, is there still a channel for fileformat wiki efforts, or it just goes here or -bs? |
04:45
π
|
tuankiet |
Hello everybody! |
04:48
π
|
bsmith093 |
instence_: how would i do that in renamer, it runs fine in wine, so im using that for now |
04:55
π
|
tuankiet |
@alard: are there any projects? |
06:03
π
|
Nemo_bis |
SketchCow: thanks! |
06:15
π
|
SketchCow |
No problem. Sorry there's still a lag with me this year. |
06:16
π
|
SketchCow |
I'd hoped to be more archiveteam responsive, but this DEFCON documentary is kicking my aaaaaassssss |
08:41
π
|
chronomex |
godane: ftp.download.packardbell.com: Downloaded: 2679 files, 28G in 1d 17h 47m 55s (194 KB/s) |
08:42
π
|
chronomex |
now: time nice ionice -c 3 zip -vr ftp.download.packardbell.com.zip ftp.download.packardbell.com |
09:07
π
|
godane |
chronomex: thanks for getting it |
09:07
π
|
godane |
i know that would have taken me forever to get |
09:08
π
|
chronomex |
:) |
09:08
π
|
godane |
and to also upload |
09:19
π
|
chronomex |
yeah, might take a while |
09:19
π
|
chronomex |
I downloaded a terabyte of ftp last month :P |
09:20
π
|
Nemo_bis |
chronomex: ah, 200 KB/s, lucky you :) |
09:21
π
|
Nemo_bis |
NATO still at 40 KB/s |
09:21
π
|
Nemo_bis |
42 GiB so far |
09:22
π
|
chronomex |
o_O |
09:22
π
|
chronomex |
ftp.3gpp.org is huge |
09:22
π
|
chronomex |
btw. |
09:23
π
|
chronomex |
350g, iirc |
09:23
π
|
Nemo_bis |
everything has recent timestamps there |
13:54
π
|
hiker1 |
To what extent does heritrix discover JavaScript and CSS? |
14:15
π
|
alard |
tuankiet: Well, it's time to start downloading the Yahoo blogs. |
14:16
π
|
alard |
hiker1: It probably downloads things referenced with <script> or <link rel="stylesheet"> tags, and I think it even has some rules to find images etc. in the actual CSS and JavaScript files. |
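As a rough illustration of the tag-based discovery alard describes, here is a minimal sketch using Python's standard `html.parser`. It is not Heritrix's extractor (Heritrix is a Java crawler with its own extraction rules); the class and function names are invented for this example, and it only looks at `<script src>` and `<link rel="stylesheet">`, not at URLs inside CSS or JavaScript files.

```python
from html.parser import HTMLParser


class AssetLinkParser(HTMLParser):
    """Collect script and stylesheet URLs referenced from HTML tags."""

    def __init__(self):
        super().__init__()
        self.assets = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "script" and attrs.get("src"):
            # External script; inline <script> blocks have no src.
            self.assets.append(attrs["src"])
        elif tag == "link" and attrs.get("href"):
            # rel can hold several space-separated tokens.
            rel = (attrs.get("rel") or "").lower().split()
            if "stylesheet" in rel:
                self.assets.append(attrs["href"])


def find_assets(html: str) -> list[str]:
    parser = AssetLinkParser()
    parser.feed(html)
    return parser.assets


page = ('<head><link rel="stylesheet" href="/s.css">'
        '<link rel="icon" href="/f.ico">'
        '<script src="/a.js"></script><script>var x = 1;</script></head>')
print(find_assets(page))
```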
14:16
π
|
hiker1 |
How easy is it to set up? |
14:17
π
|
alard |
It isn't that hard, but it's unwieldy. |
14:17
π
|
hiker1 |
I wanted to test it on a single site. |
14:18
π
|
hiker1 |
I suppose it's probably not worth the hassle |
14:18
π
|
ersi |
Neither Heritrix nor Wayback is easy to set up |
14:22
π
|
hiker1 |
sigh. |
14:22
π
|
hiker1 |
Maybe someone that knows how could release a VirtualBox image with it already installed and ready to accept a warc file? |
14:24
π
|
hiker1 |
alard: There is a python library called mitmproxy. Might be useful to proxy the HTTPS records: http://mitmproxy.org/ |
14:24
π
|
hiker1 |
Right now I am using a simple rewrite modification to warc-proxy to get them sent. |
14:24
π
|
hiker1 |
very, very rudimentary. |
14:24
π
|
ersi |
I've fiddled a little with it, and plan to maybe continue - but we'll see (RE: wayback, heritrix) |
14:24
π
|
godane |
so i just found a very good copy of the screen savers episode from 2003 |
14:25
π
|
godane |
Kevin Rose uploaded it too :-D |
14:25
π
|
ersi |
OH MY GOD! |
14:29
π
|
godane |
https://www.youtube.com/user/kevinrose |
14:29
π
|
godane |
i found it on his youtube channel |
14:29
π
|
godane |
i may have to email him so i get more episodes of tss |
14:31
π
|
godane |
he has about 50 episodes of the screen savers in mp4 |
14:31
π
|
godane |
:-D |
14:40
π
|
hiker1 |
WARC doesn't replay the actual browser sessions, only the traffic. Some JavaScript scripts I found appear to append a callback handle to the URL that is generated at runtime based on a live JS object. WARC cannot replay this behavior. |
14:42
π
|
hiker1 |
Technically it does archive all the information that a website outputs, but some of the information is impractical to use or view without extensive modifications to the JavaScript. |
14:44
π
|
hiker1 |
It makes me think of an HTML5 game http://wordsquared.com/. You can download all the traffic, but you will never be able to see the game properly I think. |
14:44
π
|
alard |
hiker1: And? Or are you just thinking aloud? :) |
14:44
π
|
alard |
You could fix individual sites, but there's no general solution, I think. |
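One per-site fix of the kind alard mentions could be to ignore the runtime-generated parameter when looking a request up in the archive. The sketch below assumes the callback handle arrives as a query parameter named `callback` (or a cache-buster named `_`); those names and the `replay_key` helper are assumptions for illustration, not something warc-proxy is known to do.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Assumed, site-specific names of parameters that change on every page load.
VOLATILE_PARAMS = {"callback", "_"}


def replay_key(url: str) -> str:
    """Return the URL with volatile query parameters removed."""
    parts = urlsplit(url)
    kept = [(k, v)
            for k, v in parse_qsl(parts.query, keep_blank_values=True)
            if k not in VOLATILE_PARAMS]
    return urlunsplit((parts.scheme, parts.netloc, parts.path,
                       urlencode(kept), ""))


recorded = replay_key("http://example.com/api?callback=cb_123&id=5")
requested = replay_key("http://example.com/api?callback=cb_999&id=5")
print(recorded == requested)
```

This only makes the lookup succeed. If the archived response body wraps its data in the old callback name, the body would also need rewriting, which is part of why there is no general solution.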
14:53
π
|
hiker1 |
thinking aloud :) |
14:53
π
|
hiker1 |
I noticed this while attempting to archive a website just now. |
14:54
π
|
tuankiet |
@alard: Tracker rate limiting is in effect. Retrying after 30 seconds... :(( |
14:57
π
|
alard |
tuankiet: Yes, there was something wrong yesterday. I'm now gathering some files to debug with. (Until I got distracted by wordsquared just now. :) |
14:57
π
|
hiker1 |
hah xD |
15:00
π
|
tuankiet |
@alard: Oh, running again. I've just restarted VMs to update the code :)) |
15:11
π
|
alard |
Good. Found the problem: HTTP/1.1 999 Unable to process request at this time -- error 999 |
15:12
π
|
alard |
What's the best way to handle those? Wait and retry? |
15:13
π
|
Nemo_bis |
ah, as it was feared |
15:14
π
|
balrog- |
that means you are being throttled |
15:15
π
|
balrog- |
http://www.murraymoffatt.com/software-problem-0011.html |
15:16
π
|
alard |
It's Nemo_bis, in this case. |
15:17
π
|
Nemo_bis |
alard: I got that error? but I just started |
15:17
π
|
balrog- |
wow MS is killing messenger |
15:17
π
|
Nemo_bis |
I have lots of "Project code is out of date and needs to be upgraded. Retrying after 30 seconds..." |
15:18
π
|
alard |
Yes, I've paused the thing again. |
15:18
π
|
twrist |
Messenger is being integrated into skype, though. |
15:18
π
|
twrist |
So yeah. |
15:18
π
|
alard |
Nemo_bis: In the last few minutes there were 999-warcs from grue, tuankiet, and you. |
15:18
π
|
Nemo_bis |
hm |
15:19
π
|
balrog- |
twrist: yeah but the protocol, etc are going away |
15:19
π
|
twrist |
Ah, right. |
15:19
π
|
ersi |
Super old. |
15:19
π
|
balrog- |
alard: need to detect 999s and throttle |
15:19
π
|
twrist |
So, what's currently being archived? |
15:19
π
|
Nemo_bis |
alard: I've switched the warrior to tinyback |
15:19
π
|
alard |
balrog-: How long to wait? (And does saying you're from Google still work?) |
15:20
π
|
balrog- |
alard: I don't know, I haven't tested - info online says 2-24 hours, but I don't know |
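The "detect 999s and throttle" idea can be sketched as a retry loop with a growing delay. This is not the actual yahooblog-grab change; the function name, the 30-second starting delay, the doubling, and the try limit are all assumptions, and nobody in the channel knew the real cool-down (the only figure mentioned is 2-24 hours).

```python
import time
from typing import Callable


def fetch_with_backoff(fetch: Callable[[], int],
                       max_tries: int = 5,
                       base_delay: float = 30.0,
                       sleep: Callable[[float], None] = time.sleep) -> int:
    """Call fetch() until it returns a status other than 999.

    fetch is any callable returning an HTTP status code. The delay
    doubles after each 999; the last status seen is returned.
    """
    delay = base_delay
    for attempt in range(max_tries):
        status = fetch()
        if status != 999:
            return status
        if attempt < max_tries - 1:
            sleep(delay)   # throttled: back off before the next try
            delay *= 2
    return 999  # still throttled; the caller should not save this response


# Simulated run: two throttled responses, then success. No real sleeping.
responses = iter([999, 999, 200])
waits = []
result = fetch_with_backoff(lambda: next(responses), sleep=waits.append)
print(result, waits)
```

Passing `sleep` in as a parameter keeps the loop testable without waiting; a real script would leave the default.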
15:20
π
|
Nemo_bis |
can it be that Yahoo is suspicious because it sees activity from my IP on flickr etc. as logged in user? |
15:20
π
|
Nemo_bis |
it definitely can't be bandwidth in my case |
15:21
π
|
tuankiet |
Bad thing now |
15:22
π
|
alard |
Nemo_bis: Perhaps you're normally less active on Asian blogs. |
15:23
π
|
twrist |
Give me a git URL to clone, guys. |
15:23
π
|
ersi |
At what project are you guys getting HTTP 999's? |
15:24
π
|
twrist |
I'm itching to join in. |
15:24
π
|
ersi |
twrist: http://github.com/archiveteam/ |
15:24
π
|
twrist |
Need to be a bit more precise, I'm using ubuntu server and IRSSI |
15:24
π
|
twrist |
I only just started as well |
15:24
π
|
* |
twrist is GLaDOS, FYI |
15:25
π
|
ersi |
I think they're doing yahooblogs-grab right now |
15:25
π
|
twrist |
ah |
15:25
π
|
twrist |
so https://github.com/archiveteam/yahooblogs-grab.git? |
15:25
π
|
ersi |
yeah.. |
15:25
π
|
alard |
twrist: There's not much sense starting right now, we need to update the script. |
15:26
π
|
alard |
ersi: blog.yahoo.com |
15:26
π
|
twrist |
ah |
15:28
π
|
tuankiet |
Or using Tor so we won't have 999 again. But the speed is super low :)) |
15:29
π
|
twrist |
The URL I typed out isn't working. |
15:29
π
|
twrist |
Anyone else able to paste it in here? |
15:29
π
|
Deewiant |
https://github.com/ArchiveTeam/yahooblog-grab.git |
15:30
π
|
twrist |
ah, no s |
15:37
π
|
twrist |
so the arguments were --downloader=name --concurrent=6? |
15:43
π
|
alard |
Yes. There's a new version that should handle the 999 error better. |
15:54
π
|
goekesmi |
ls |
15:54
π
|
* |
goekesmi sighs. |
15:54
π
|
hiker1 |
xD |
16:04
π
|
chazchaz |
Is there a channel for yahooblog-grab? |
16:42
π
|
SketchCow |
I suggest #yahooblah |
16:46
π
|
Coderjoe |
O_O yahoo blog is from yahoo korea? |
18:13
π
|
alard |
I think the current version of the script works better. (There are fewer 0MB items, and it's much slower.) |
19:09
π
|
hiker1 |
Is anyone archiving stuff from Tor? |
19:24
π
|
swebb |
I used tor once to auto-change my IP when grabbing some stuff from google, but it was way slow. |
19:25
π
|
hiker1 |
well, yeah. But there are some websites which are tor only. |
19:28
π
|
* |
ats raw-images an extremely dodgy floppy four times using two different Amiga drives, converts using disk-analyser, merges the resulting partial images back together giving a full image, and peers happily at the first bits of email he ever sent :) |
19:28
π
|
balrog- |
what are you using to merge? |
19:30
π
|
ats |
rawadf off aminet, patched to not complain about the number of tracks in the .eadf files disk-analyser produces |
19:30
π
|
ats |
I also had to patch disk-analyser to not write junk into the EADF track header structure... |
19:32
π
|
ats |
then disk-analyser again to turn (raw-track) EADF into (AmigaDOS-track) ADF, adfread to extract the files from the filesystem, and unar to extract the .lzx archives on the floppy |
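The merge step ats describes (several partial reads of one floppy combined into a full image) can be illustrated with a toy model. This is not how rawadf or disk-analyser work internally; it assumes each read is simply a list of tracks where an unreadable track is `None`, and `merge_reads` is a made-up name.

```python
from typing import Optional, Sequence

Track = Optional[bytes]  # None marks a track that could not be read


def merge_reads(reads: Sequence[Sequence[Track]]) -> list[Track]:
    """For each track, keep the first good copy found across all reads."""
    n_tracks = max(len(r) for r in reads)
    merged: list[Track] = [None] * n_tracks
    for read in reads:
        for i, track in enumerate(read):
            if merged[i] is None and track is not None:
                merged[i] = track
    return merged


reads = [
    [b"a", None, None],
    [None, b"b", None],
    [b"x", None, b"c"],
]
print(merge_reads(reads))
```

A track that is still `None` after merging was bad in every read, so the image is only complete once the merged list has no `None` left.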
19:52
π
|
hiker1 |
If anyone is bored of archiving with wget, please try my WarcMiddleware. I'd be glad to assist in setting it up. https://github.com/iramari/WarcMiddleware |
20:34
π
|
Nemo_bis |
alard: how do I know if I'm still collecting mostly useless 999 crap, in case I work on Yahoo? |
21:02
π
|
alard |
Nemo_bis: Hard to say. It shouldn't, it should retry (and print a message). |
21:04
π
|
Nemo_bis |
ok |
21:05
π
|
Nemo_bis |
TinyBack was getting ratelimited anyway |
21:38
π
|
SketchCow |
Nemo_bis: http://archive.org/details/magazine_rack |
21:38
π
|
Nemo_bis |
SketchCow: Pretty!!! |
21:39
π
|
Nemo_bis |
Are you going to make some of those dark? |
21:39
π
|
SketchCow |
Ostensibly |
21:40
π
|
Nemo_bis |
:) |
21:45
π
|
SketchCow |
Like, Wood Magazine will probably disappear. |
21:50
π
|
Nemo_bis |
But... children in Africa will DIE if we don't let them know how to build life-saving wood stuff, in English, on a website! |
21:53
π
|
Nemo_bis |
On eMule and eMule only there's also another 5 GiB archive of another woodworking magazine. Surely the same woodworking geek scanner. |
21:54
π
|
chronomex |
haha |
21:55
π
|
SketchCow |
Which one? |
21:55
π
|
SketchCow |
You have so many here. |
21:56
π
|
SketchCow |
http://archive.org/details/general_magazine |
21:56
π
|
SketchCow |
http://archive.org/details/woodsmith_magazin |
21:57
π
|
SketchCow |
http://archive.org/details/woodsmith_magazine I mean |
21:58
π
|
SketchCow |
How long was this uploading, Nemo_bis? |
21:58
π
|
Nemo_bis |
SketchCow: I don't know, a few days of work for the CSV maybe. |
21:58
π
|
Nemo_bis |
I didn't measure the time for download and upload in itself. |
22:00
π
|
Nemo_bis |
Also a few hours of trackers browsing and other searches. |
22:01
π
|
Nemo_bis |
http://p.defau.lt/?YTRaoQFxExjw8T612Pl_XQ |
22:03
π
|
SketchCow |
In the future, like godane, I can just browse your uploads and see what you haven't had pushed into a collection and make it happen. |
22:03
π
|
SketchCow |
Your activities also get the attention of the devs, who see it come by |
22:08
π
|
* |
Nemo_bis hopes not to get too many curses |
22:08
π
|
Nemo_bis |
I thought sending you a nice list at the end of the job was going to be helpful? |
22:09
π
|
SketchCow |
No. |
22:09
π
|
SketchCow |
Doesn't help and it actually gets caught in the spam filter |
22:10
π
|
SketchCow |
Because someone from italy is mailing me piles of URLs |
22:10
π
|
Nemo_bis |
Oh, even. |
22:10
π
|
chronomex |
:P |
22:12
π
|
SketchCow |
Also, the vokrugsveta collection didn't make it through the fun |
22:12
π
|
SketchCow |
I'm going to make it a collection for you, but it needs more love |
22:12
π
|
Nemo_bis |
Yes, I noticed. |
22:13
π
|
Nemo_bis |
I didn't look at those zips carefully enough, sorry. |
22:13
π
|
SketchCow |
Yeah, those things are buuuuuuuunk |
22:13
π
|
SketchCow |
How about I dark them all with a note to delete them? |
22:13
π
|
Nemo_bis |
Suggestions on how to get something useful out of a FictionBook? |
22:13
π
|
Nemo_bis |
I'm ok with it. |
22:14
π
|
SketchCow |
No, wait, this thing is valid. |
22:14
π
|
SketchCow |
Just not playing with our system |
22:14
π
|
SketchCow |
FICTIONBOOOOOOOOOK |
22:14
π
|
SketchCow |
Thanks, Russia |
22:14
π
|
Nemo_bis |
heh |
22:14
π
|
Nemo_bis |
It's not even well seeded, by the way. |
22:28
π
|
hiker1 |
Nemo_bis: What did you mean when you said make some of those dark? |
22:28
π
|
SketchCow |
http://archive.org/details/vokrugsveta |
22:28
π
|
SketchCow |
we'll see when the gods arise on that one |
22:29
π
|
mistym |
Nemo_bis: Wikipedia suggests Calibre can convert FictionBook to smth more conventional. |
22:30
π
|
SketchCow |
https://twitter.com/jefferson_bail/status/289096186420400128 |
22:49
π
|
Nemo_bis |
SketchCow: thanks for fixing it. I liked that tweet too, wondered what syllabus exactly. |
22:50
π
|
SketchCow |
I'm sure it's related to computer programming, and realizing what was done |
22:50
π
|
SketchCow |
I asked him to send it along. |
22:50
π
|
Nemo_bis |
Nice |
22:52
π
|
SketchCow |
By the way, the guy who wrote the wikipedia entry also wrote a scathing e-mail to archive.org about how we were the pit of evil |
22:52
π
|
SketchCow |
Good thing I helped bring in so much fundraising last year |
22:53
π
|
SketchCow |
Also: Ares Magazine is as sexy as sexy gets |
22:58
π
|
SketchCow |
http://archive.org/details/ares_magazine |
23:03
π
|
Nemo_bis |
Should still be usable, shouldn't it? With some printing perhaps. |
23:05
π
|
godane |
stupid question |
23:05
π
|
godane |
i don't know how to submit a comment on youtube |
23:07
π
|
SketchCow |
Goood |
23:08
π
|
godane |
why is that? |
23:08
π
|
godane |
trying to help kevin rose upload the 50 episodes of the screen savers he has |
23:10
π
|
godane |
this is the episode in question: https://www.youtube.com/watch?v=ZglwVT5NIJw |
23:10
π
|
godane |
it's an episode from july 14 2003 |
23:11
π
|
godane |
there are next to no caps for episodes in 2003 |
23:26
π
|
SketchCow |
Example of "I'm just gonna dark it" |
23:26
π
|
SketchCow |
http://www.woodworkersjournal.com/Main/Store/5_Disc_Annual_Collection_CD_Bundle_20052009_257.aspx |
23:30
π
|
dashcloud |
here's something interesting I came across today: http://www.emsps.com/oldtools/ They buy and sell old, very old software |
23:31
π
|
Nemo_bis |
SketchCow: some computer magazines like Pc Open here use the PDFs of their past issues as fillers for DVDs when they don't find enough stuff, it seems. |
23:32
π
|
Nemo_bis |
Something like 10% of their CD/DVDs contain either some or all past issues in PDF... |
23:32
π
|
dashcloud |
Linux Journal definitely does that |
23:38
π
|
chronomex |
nice |
23:46
π
|
SketchCow |
So, I don't mind being the guy making these collections, BUT |
23:47
π
|
SketchCow |
I'd really appreciate it if you do-gooder motherfuckers would walk the collection and find doubles and cases where we have something really shitty when there's known better versions. |
23:52
π
|
Nemo_bis |
SketchCow: are there more duplicates than those I told you? |
23:52
π
|
Nemo_bis |
(Question is pointless if email really went to spam.) |
23:55
π
|
SketchCow |
It did go to spam. |
23:58
π
|
Nemo_bis |
http://p.defau.lt/?2fxIiFNmvwaO2FBSJdn7fA |
23:58
π
|
Nemo_bis |
<https://archive.org/search.php?query=%22Toronto%20PET%20User%27s%20Group%22> (duplicate of <https://archive.org/details/tpug-newsletter>, I'm afraid) |
23:58
π
|
Nemo_bis |
and YourComputer which you had already spotted (and deleted, unless it was someone else) |
23:59
π
|
Nemo_bis |
I didn't find more in public items. |