[00:51] Stupid thing. All three irc connections broke. [02:07] i'm making a mirror of torrentfreak.com [02:08] i may have to learn how to do a warc version next mouth [02:08] don't want to do that now since i have download over 50gbs of dl.tv this mouth [02:08] *month [03:03] ... is archiveteam, and not the thailand flood, responsible for the rise in hard drive prices? :P [03:52] grr [03:53] boo hiss! [03:53] is there a dark part of the IA? for stuff that isnt out of copyright yet, but will be dust before that? [03:54] how long have I been in reconnect limbo? [03:54] bsmith093: yes, it has been talked about many times [03:54] 2012-04-19 16:44:21 [I] Connection closed [03:55] and? [03:55] and what [03:55] so 7 hours. GRARRRR [03:55] is it officaly there, or do they just ignore it and make parts dark? [03:55] officially? [03:56] I've never gone looking around for some kind of proof that they have a collection that is not accessible [03:56] if I was them I'd probably keep my mouth shut [03:57] they can specify public/nonpublic and unindexed/indexed on a per-item basis [03:57] books are great, but what about cassette-only audiobooks, or animation?, i cant imagine they dont have the full disney collection somewhere, but all the other stuf old tv shows, lost pilots, whatever the BBC hasnt destroyed yet, that should all be in there on some level [03:57] no need to move them into a special collection [03:57] they have a large television archive. they have systems capturing 24/7 from a number of tv channels [03:57] basically make the LoC a subset, is what I;m saying [03:58] all of which goes dark until some later date [03:58] which is where all the stuff for the september 11 collection was culled from [03:59] Coderjoe: The story I heard was that the 9/11 collection was accidental [03:59] bsmith093: As for media companies, those are typically handled internally [03:59] theres an austrailian movie, i think it was made for tv, called Go to Hell! that i only found out about through demonoid, from a guy who specifically uploads old and or rare stuff [04:01] Those tend to be hard to keep track of since there's no central entity managing them [04:01] If a studio goes under, so does its archival holdings [04:14] looks like mirroring torrentfreak.com equals to a lot of garage data [04:15] like .html files that don't work at all [04:31] anyway, there are metadata fields such as "noindex", "public", and "hidden", as well as "publicdate" [04:31] public and hidden are marked as deprecated [04:38] that was an awesome split from my POV [04:44] ditto [04:44] who split from your pov? [04:46] chronomex: https://ezcrypt.it/XT4n#FxYzfwfz0E8DYcmjItfmhJnJ [04:47] that's quite the split [05:50] huh [05:50] dna lounge has a live video feed [05:51] bit of a sync offset, though [06:50] how do do you make sure nothing is compressed when mirror with wget? [06:50] mirror torrentfreak cause some files to be compress but most of it is not [07:05] Finished the mirror a few days ago using wget-warc. Did I miss anything? Any post-checks I could do? In short... what's next? :P [07:05] http://archiveteam.org/index.php?title=Cyberpunkreview.com [07:54] how do you guys stop webserver from compressing html? [07:55] i want to make a mirror of torrentfreak but some files i just get the compress data [07:55] Do you mean: get the webserver (apache) to not compress html at all, ever? Or do you mean: advertise to the server that my client does not support compression? [07:55] that's weird, the compression should only be on the protocol, not on the resulting data [07:55] i using wget and some files are compressed [07:56] and considering it's "some" files and not all [07:56] you sure they aren't supposed to be that way [07:56] cause wget should decompress the data before writing it to disk [07:57] it spotly too [07:57] but you could maybe make wget not tell the webserver that it supports gzip compression [07:57] how i dunno but [07:57] and are you sure the compressed files aren't supposed to be compressed? [07:57] some files download in coompress gzip then other times it doesn't [07:58] i'm adding --header="Content-Encoding: deflate" [07:58] wget --header="accept-encoding: " <-- I don't know if that would work or not. But that's my first guess to tell the server you don't support compression. [07:58] gzip uses deflate [07:59] so it's pretty much the same [07:59] Quick test shows that --header="accept-encoding: " does "turn off" compression. [07:59] nice [08:00] it didn't work for me [08:00] trying to mirror torrentfreak.com [08:01] What's the command you're using? I'll try to replicate. [08:01] wget doesn't do compression, and does not ask the server for a compressed transfer encoding [08:01] //torrentfreak.com/ [08:01] n --convert-links -U Mozilla --header="Accept-Encoding:" -e robots=off -nc http: [08:01] wget --mirror -r -p --html-extensio [08:02] wget --mirror -r -p --html-extension --convert-links -U Mozilla --header="Accept-Encoding:" -e robots=off -nc http://torrentfreak.com/ [08:02] Why did you add the Accept-Encoding header? [08:02] aggro said it worked [08:02] the server (or script on the server) is violaing spec if it returns data with a compressed tranfer encoding when not asked for [08:03] I only tested on my blog. [08:03] This command: wget --header="accept-encoding: gzip" http://aspensmonster.com/ [08:03] Versus this one: wget --header="accept-encoding: " http://aspensmonster.com/ [08:04] first one gives me the binary. Second one gives me plain html. [08:04] But still replicatin' on torrentfreak [08:06] is there a way to log mirror wget -o command and still see it in console? [08:06] Not that I know of, but "tail -f the-log-file" has the same effect. [08:07] now its giving me compress data again [08:07] :-( [08:08] you can tee wget's stdout [08:08] yeah. I get compressed data too with your command. Now I'm just watching the headers to see what's going on. Might even get to jump into wireshark :D [08:08] godane: I think the server is ignoring the accept-encoding header and forcing a transfer-encoding on you [08:09] "everyone supports this, right?" [08:09] Well, it's a varnish server running over nginx... [08:09] hmm... [08:10] i think the only zlib support in wget is with the warc output [08:10] got a specific offending url? [08:11] root index.html file [08:11] its causes problems sometimes [08:11] url? [08:11] curl --header "accept-encoding: " http://torrentfreak.com [08:11] still spits out binary. [08:12] Content-Encoding: gzip [08:13] either varnish, nginx, php, or the php script is ignoring the lack of accepting gzip [08:13] Well, if its grabbing from a proxy, the proxy could very well be forcing gzip no matter what. Nginx + varnish is a high probability of that. [08:14] varnish is a proxy, from the look of the headers [08:14] Via: 1.1 varnish [08:14] Varnish is an HTTP accelerator designed for content-heavy dynamic web sites. In contrast to other HTTP accelerators, such as Squid, which began life as a client-side cache, or Apache and nginx, which are primarily origin servers, Varnish was designed from the ground up as an HTTP accelerator. Varnish is focused exclusively on HTTP, unlike other proxy servers that often support FTP, SMTP and other network protocols. [08:15] IOW: varnish is a proxy server [08:15] Wordpress too. Ten bucks says he's running w3-total-cache or super-cache [08:15] http://torrentfreak.com/wp-login.php [08:15] 1.1 [08:16] quite old [08:16] 3.0 was released in 2011 [08:17] Hmmmm... [08:18] Apparently varnish can't handle cookies according to http://stackoverflow.com/questions/8011102/is-there-a-way-to-bypass-varnish-cache-on-client-side. Got something new with this: [08:18] curl --header http://torrentfreak.com/?cache=123 [08:18] the "?cache=123" is just meant to throw off the caching mechanism [08:23] derp. no need for "--header" up there. [08:23] i'm trying to mirror torrenfreak the right way [08:24] so how can i have a local mirror of torrentfreak and disable the gzip problem? [08:24] wget-warc, yes? [08:25] i want to use wget-warc [08:25] but i also want a copy in torrentfreak.com folder [08:25] aye [08:25] it i could be host on my local lan [08:25] Is it unusual for my netbook hard drive to make a small click like in this video every now and again? http://www.youtube.com/watch?v=aEDaPeKcFys [08:28] it feels like a varnish bug that wasn't caught by 1.1 because most normal browsers support gzip encoding [08:29] hmm [08:30] except this mailing list message suggests varnish had no gzip support until jan 2011 [08:32] Well, if it makes you feel any better, this command does seem to chug along: [08:32] wget --mirror -r -p --html-extension --convert-links -U Mozilla --header="accept-encoding: " -e robots=off -nc http://torrentfreak.com/?dummyvar=123 [08:33] I get the initial "index.html?dummyvar=123.html" file for the root, then lots of folders for the SEF URLs that WP generates... and clean index.html files in there... [08:33] Just an option I suppose. [08:33] Also, I'd probably institute some sort of delay between requests. [08:35] Or rather, upon scanning, it's doing the same behaviour you mentioned :P some encoded, some not. Probably has to do with reusing the same connection. [08:35] like i thought [08:36] I think wget might have an option to force a new connection each time, but I'm also looking into post-processing. I.e., just decompressing after the fact. [08:37] there's the option of adding gzip and deflate encoding support to wget [08:38] Renaming the file to index.html.gz, then "gzip -d index.html.gz" gives you the original html, but I don't know how that would square with how wget does the link conversion. [08:38] If anything I'd bet wget would attempt to convert links on the binary gz's first :P [08:38] it needs to be able to read the file to get links [08:38] ^ [08:39] link conversion happens all at the end [08:41] within the last hour or so, i passed the 1500 uploaded videos mark [08:44] and I'm nearing 1000 wikis downloaded with WikiTeam taskforce [08:45] getting close to 1/3 of this stage6 data uploaded [08:46] 23 items needing special attention so far [09:19] of course. [09:20] my irc foo is weak. [09:21] Regardless, I think I've found a trick for torrent freak @godane ; I tried to figure out a way to get wget to append parameters to each request and I think I've found one. Been running for about five minutes now without any gzip encoded html files. [09:21] wget --mirror -r -p --html-extension --convert-links -U Mozilla --header=accept-encoding: -e robots=off -nc --post-data user=blah&password=blah http://torrentfreak.com/ [09:21] The line I'm running to test for the presence of gzip'd files is: [09:21] for i in `find . -name "*.html"`; do file $i ; done | grep gzip [09:22] (from the root directory of the mirror, of course) [09:24] Adding gzip support to wget would be an awesome idea though. I just don't trust myself to try tackling that beast :P [09:37] uh [09:38] that's going to cause a POST request, when the site is likely expecting a GET request [09:42] you'd be better off creating a bogus cookie. varnish, is appears, does not use the cache when there are cookies [09:42] since it has no idea how to interpret them [10:25] How would I go about making a bogus cookie? [10:29] put raisins in [10:29] and what sort of dough is best? [11:04] aggro: what did you find? [11:04] nevermind [11:05] just notice your pm from earlier [12:34] i found the edge magazine [12:35] i mean from 1995 and up [12:49] holy bat crap [12:50] there is a edge magazine torrent with 400+mb pdfs [12:50] what the hell? [12:52] pr0n. [12:57] Ops, please. [12:57] sup jason [12:58] Well, EFnet shit an actual bed into another bed, which it then shit [12:58] ? [12:59] haha [12:59] SketchCow: Just found most of edge magazine on demonoid [13:00] just search 'edge magazine' on demonid [13:00] no need to email the links [13:00] winr4r: no speedydeletion yet, we have a win rar [13:00] many peers? [13:00] seeds i mean .. [13:00] emijrp: i know, i was amazed [13:01] added webcitation links to all the citations [13:01] i thought there was some bot on wikipedia that automatically slapped a speedy tag on new articles man [13:02] ha, i did a bot for that at Spanish Wikipedia, to delete junk and test pages [13:02] using regexps to detect, and other tricky stuff [13:03] oh man, i hated those things [13:04] find the baddass guys and discard the suckers https://en.wikipedia.org/wiki/Category:Web_archiving_initiatives [13:04] who wants to bet that it'll end up in AFD anyway? [13:05] in that case it will end in '''keep''', citations are OK [13:10] ha https://en.wikipedia.org/wiki/AT#Other_uses [13:11] emijrp: haha [13:11] SketchCow, my Q-audio upload progresses. Once it's done I'll run a Perl script I found to find duplicate files, as there are bound to be several. [13:35] Sure. [13:36] There are also, as I suspected, even more blatant copyright violations than I thought, which is going to present a special problem in the case of songs by artists who aren't widely-known enough that a definite determination can be made. [13:38] I don't know if anything like this is up there, but let's say somebody produced their own CD and uploaded it so their Twitter friends could enjoy it. The filenames are likely to look just like other infringing files up ther. [13:40] The strange thing is, Q has stopped complaining about scrapers. Either everyone doesn't want to get their IP blocked and has stopped, he's started blocking people silently, or he's stopped caring about it. [13:40] Regarding violation, I want it all. [13:41] We'll deal with the problems of that later, but the time slow in tracking down provenance is not the way to go. [13:41] And I bet he probably has slowed down a bit while people show interest. [13:41] I just want blind dudes to have their shit [13:42] I'd conduct a little experiment and scrape some to see if I get blocked, escept A. I don't have a scraper capable of grabbing the modern filenaming convention, and B. I don't want my IP blocked, silently or otherwise. [13:42] [14:41:40] <@SketchCow> I just want blind dudes to have their shit [13:42] That. Is. Beautiful. [13:42] <3 you [13:42] <3 your ideals (that I know of) [13:42] <3 your style [13:42] <3 [13:43] I have a spare IP..... on a slowish connection (10Mb/1Mb) [13:43] If you want me to try :D [13:43] No, I am sure Jaybird11 is handling it just fine. [13:43] ok [13:44] * SmileyG goes back to his Cave to worship the church of Sketchcow and his "Son" sockington. [13:44] Don't make this weird [13:44] er [13:46] I came up with an example of a version of SketchCow's civil war letter thing. Take http://q-audio.net/d/89 [13:46] This is my very first audio tweet in my own voice to Q-audio. And it's very stupid, nothing of significance is said, and it was poorly produced to say the least! [13:47] But you can tell several things. First, that I wanted to test Q-audio's record/upload facility and didn't want to bother with doing a proper setup for my microphone. [13:48] Second, near the beginning you can hear my speech synthesizer. That's a DECtalk express. The fact that someone was still using one of those as his primary speech synthesizer on October 10, 2010 might be significant years down the road. [13:48] It goes on and on. [13:50] Amusingly, some time later someone posted a Q-audio post where they were making weird noises with feedback and pretending to contact aliens. Lol! [13:57] :/ sorry. [14:00] Wow it seems from the files going across right now like someone uploaded the same file over and over and over again, based on the name [14:50] i'm on underground-gamer now [14:51] i may get some goods for dark-rack [14:52] i can said the a lot of torrents are like 400gb+ [15:10] i think i found something [15:10] the first pc gamer CD [15:17] I'm grabbing all the CD stuff on UG but you can go after mags if you want [15:18] I got the pc zone one already [15:20] i didn't see pc gamer cd #1 on archive.org [15:21] yeah guess that one is new [15:22] its going at about 20kbytes [15:23] its only 221mb [15:54] i found game manuals for nes [17:04] SketchCow: I found dragaon magazine [17:04] a dungeons and dragons magazine [17:05] godane: email him the torrent link [17:05] its on underground-gamer [17:05] i don't know if he has a account there [17:06] Grab it, then; I'm sure it's not a big set [17:06] its 5.74gb [17:09] there's a whole magazine category on there with hundreds of gb of stuff [17:10] I'm tempted to grab the whole magazine category; it'll take me months, but what the hell [17:14] it might not be too bad if you skip stuff that's already on IA or retromags [17:28] so, if you decide to celerate 4/20 by finding the 43 minute version of Dark Star, you'll find some beautiful words about archival http://archive.org/details/gd73-12-06.sbd.kaplan-fink-hamilton.4452.sbeok.shnf [17:28] "It's so amazing to be able to take a Dark Star or a St. Stephen from 72 and hold it up next to one from 77 or 82, roll them around on your tongue and decide which one really smacks you hardest right in the pleasure lobe. Truly a blessing, this Archive. God bless you." [17:40] i also found gamroom magazine [18:12] does the stage6 videos have any satellaview videos? [18:12] i only ask cause i may found something rare [18:44] satellaview? someone recorded one of the live transmissions with the satellite audio? [18:44] yes [18:45] Coderjoe: ping [18:46] there is id numbers with the torrent [18:47] 1011014, 1011305, 1017211 [19:12] closure, we're making you an admin of the group for stage6 [19:13] Shortly, you'll be able to just upload right into that group, ok? [19:14] can i be an admin in wikiteam collection pl0x? [19:23] http://www.kickstarter.com/projects/280024034/3ton-preservation-society [19:24] shaqfu: Wow, very modest goal. [19:25] Seems like a neat collection of stuff, too [19:25] I wish Kickstarter didn't make it a complete pain to participate if you're not American. [19:25] I suppose because of rewards? [19:26] No, just because their financial backend is very US-centric. It's not possible to accept payments at all if you're not in the US (so, no out-of-the-States projects), and difficult to pay for non-Americans. [19:26] Gotcha [19:26] I was able to use it from canada but they may have tightened it up recently [19:26] I know why they originally did that. [19:27] It was because only Amazon allowed financial holds. [19:27] And Amazon was only US. [19:27] Aha. [19:27] I too have been surprised. [19:28] Seems like they're big enough now to loosen that [19:28] Considering they're blowing up, the utter shutout of international orders and refusing to find ways to deal with that have been pretty f'in glaring. [19:28] Paying through Amazon was quite painless for me, in Europe. [19:28] A year ago, some russians asked, BEGGED for a russian kickstarter. [19:28] They made a film! Kickstarter blogged it! Did nothing. [19:28] SketchCow: Blowing up good, or blowing up bad? [19:28] Blowing up good [19:29] Although they're about to encounter a concentration of fraud and terror they're probably not ready for. [19:29] Yeah, they're in for a scary time once a big project tanks [19:29] Or is a well designed fraud! [19:29] My bet's on Wasteland 2 never actually getting made [19:29] http://www.wired.com/gamelife/2012/04/prince-of-persia-source-code/ in case nobody has seen it yet. [19:29] It's me being a hero with other people being heroes. [19:30] No not-heroes except the sadness and rot of time [19:30] :p [19:30] they still didn't fix the issues though, heh [19:30] Fuck you, time [19:30] obligatory hourglass reference [19:30] balrog__: I mailed the guy, he mailed back an ages-for-you 500 seconds ago saying he was telling his editor. [19:31] godane: see ffnet-grab [19:31] we use warctools' warc2warc to deal with servers that violate HTTP specs [19:31] well [19:31] "violate" [19:31] it's technically not a violation [19:31] yeah I know :) [19:31] yipdw: what do they do then? [19:31] but it is a dick move [19:31] balrog__: github.com/ArchiveTeam/ffnet-grab :P [19:32] we use warc2warc -D to decompress compressed data [19:33] As expected, in the comments on the Ars article about PoP, there was a license war :P [19:33] is the seesaw thing still broken [19:33] for direct s3 uploads [19:34] shaqfu: Amazingly, it hasn't broken out on the Github repo. [19:34] I have not had to use my terrible powers once. [19:34] underscor: pong? [19:34] The Ars thread even had an invocation of RMS! [19:34] why do people continually fight about licensing? :( [19:34] mistym: That's promising; hopefully it stays calm [19:35] SketchCow, can i be an admin in wikiteam collection pl0x? [19:35] herp. Hi internet dudes and dudettes... [19:35] me too, i'm going to upload a thousand wikis (wikiteam items) in a few days â if someone writes a script for that ;) [19:36] (via s3) [19:40] underscor, godane: i do not have any of the three IDs listed before, it appears [19:41] not even metadata. [19:41] sadly, I only have metadata for videos I was personally interested in (or other videos that shared tags with ones I was interested in) [19:41] so my contribution to the collection is slightly slanted in that direction [19:42] oli: Was it broken? [19:43] 1600 items with the prefix stage6- [19:44] I want to pass along the thing with adminning a colleciton. [19:44] So, I would love to make people admins of collections they're contributing to, moving the human factor of me moving them over. [19:44] But to do that, I need to know the e-mail address you have for your uploader account. [19:45] Once I know that, I can have that account added as the admin for the given collection. [19:45] alard: well all my scripts broke [19:45] and someone acknowledged some problem earlier in this chan or the memac one [19:45] the rsync process was just not connecting or something [19:46] oli: Ah, but seesaw-s3 isn't using rsync. The normal seesaw is. [19:46] ok, well whatever it was :p i was using this: DATA_DIR=data-1 ./seesaw-s3-repeat.sh oli BLAH BLAH [19:46] and that one was not working [19:46] IIRC [19:47] it kept hanging anyway [19:47] w/ that curl problem or HTP 100 or whatever [19:48] oli: Let's do this on the #memac channel. [20:05] SketchCow: https://twitter.com/mongoose_q/status/193426274964877312 [20:09] Yeah [20:30] would archive.org be interested in 160GB of UT2004 maps grabbed from some long-gone mirror server? [20:30] I would. [20:30] archive.org is different than me. [20:31] Me, I'm interested, and I use archive.org as a place to store things. [20:31] okay, I guess I will send them to you [20:31] So yes. [20:31] is it a file directory? I can give you an rsync list. [20:31] my upstream is too slow but I can mail you a drive that I do not need back [20:32] it's a dir with 60,000 files [20:36] i'm getting edge magazine [20:37] some one made some 200dpi scans with ads removed [20:37] goes from 02-1995 to end of 2007 [20:39] Ads removed? Unfortunate [20:39] 12 years worth of scans is pretty ++, though. [20:40] do you guys have cd-i magazine? [20:48] OKAY NEXT [20:48] ---------------------------------------------------------- [20:48] SO I HAVE BEEN WORKING WITH THE DISCFERRET PROJECT [20:48] GOING VERY WELL, NEEDS MORE HANDS [20:48] THIS IS YOUR TIME TO STRETCH THOSE DEV MUSCLES [20:48] COME TO #DISCFERRET ON THE FREENODE IRC NETWORK [20:49] C-C-C-C-C-OMBO BREAKER: 10 days for knol [20:49] DON'T BE A DUMBASS, THAT IS ALL [20:49] ---------------------------------------------------------- [20:49] CAPS LOCK [20:49] emijrp: dm me your e-mail on archive.org and what collections you need admin on [20:55] SketchCow: do you have Ahoy Disk Magazine for the Commodore 64/128 [20:56] this torrent may include all the floppies that come with magazine [21:01] at least people who really want something might Google and find me https://ludios.org/archive/uz2.gameservers.net-ut2004-listing.html [21:01] * ivan` makes a third copy [21:02] looks like it died 6 months after I wget'ed, good timing [21:04] why dont we use the IA S3 to upload all the YouTube Creative Commons videos? [21:05] IA has a free S3 now? [21:05] one to one, no shitty 200GB packs [21:05] Before we do that, I need to discuss it with Brewster [21:05] I'm trying to deal with the at-risk stuff as a priority. [21:06] k [21:06] I feel so useless in a low-bandwidth residence [21:06] it's like a gravity well [21:07] ivan`: You can think of smart things that other people can run? [21:07] yeah, I should write some software [21:15] http://uz.ut-files.com/ http://uz.ut-files.com/ http://uz2.ut-files.com/ http://uz3.ut-files.com/ are high-risk given past failures and that these are treated as "caches" [21:15] via http://news.ut-files.com/ [21:17] must be at least 200GB [21:18] e-mail me for a mailing address for the drie [21:18] okay, will do when I get it onto a drive [21:30] SketchCow: are you going to be in SF in may? [21:32] "Must have experience in Tier 4 archiving" - corporate archive ads are always lol [21:34] shaqfu: "must have at least 10 years experience with warc" [21:34] â¦ and node.js [21:35] Haha [21:35] (for those following at home, a "tier 4 archive" is suit code for "shit you page") [21:36] Recently saw a library job ad looking for someone with 8+ experience with Rails. [21:37] Oh! At NYPL, right? [21:37] Yeah! [21:37] My brother called me and asked about that [21:37] mistym: must have experience since THE DAY IT WAS INVENTED [21:37] I guess they only really want dhh to apply. [21:37] Him and his dev friends were having laughs at their expense; I told him that *all* their postings are like that [21:38] Protip: if SketchCow got a MLIS and wouldn't be qualified for your digital archivist posting, rethink your posting [21:38] shaqfu: No kidding. [21:38] Of course, it's also hilarious how unrealistic these requirements are while at the same paying a pittance. [21:41] Maybe designed so they can adjust the pay scale even further downward when they inevitably hire someone who doesn't meet every requirement. [21:41] mistym: Happens when the jobs say "BA required, MLIS preferred" [21:41] shaqfu: Ugh. Yes. [21:41] "Oh, it's a BA-level posting, so we'll pay BA-level rates" [21:42] mistym: Don't get me started on "archivist" positions that are actually secretaries [21:43]