[00:52] balrog_: i just mirrored it
[01:26] balrog_: uploaded: http://archive.org/details/www.mindsetcomputing.com-20130104-mirror
[01:27] awesome :)
[01:38] heh, http://thewholegoddamnedinternet.com has a cvs keyword substitution at the bottom
[01:38] that's dedication
[01:58] uploaded: http://archive.org/details/www.raygunrevival.com-20130103-mirror
[02:15] you guys may want this: https://www.youtube.com/user/TorontoPETUsersGroup
[02:16] world of commodore video talks
[02:20] wow, there are still talks about commodore.
[02:43] godane: What options do you typically use on wget?
[03:01] this is what i used for the raygunrevival.com grab: wget "http://$website/Published/" --mirror --warc-file=$website-$(date +%Y%m%d) --reject-regex='(replytocom)' --warc-cdx -E -o wget.log
[03:02] website="www.raygunrevival.com"
[03:02] thanks
[03:10] wow that site was big
[03:16] i had to get the Published folder first cause none of the .pdf links go there
[03:17] anyways i'm uploading my arstechnica image dumps
[03:28] http://forums.g4tv.com/showthread.php?187820-Save-G4TV-com-%28the-website%29!
[03:28] i'm doing my best to save the forums but you guys need to get the videos
[03:29] not every video has a .mp4 file in it
[04:03] someone do the hard work of making a script to save the videos :P
[04:06] i'm trying to figure it out now
[04:06] best i can get is the url folder
[04:07] like for http://www.g4tv.com/videos/37668/bible-adventures-gameplay/
[04:07] the video file will be in http://vids.g4tv.com/videoDB/037/668/video37668/
[04:08] but nothing in the html gives the name
[04:08] the filename i mean
[04:10] http://vids.g4tv.com/videoDB/037/668/video37668/tr_bible_adventures_FS_flv.flv
[04:10] how did you get the filename?
[04:11] i normally have to download it using a plugin in firefox
[04:12] ^^ you can prob use chrome's inspector too. i should have thought of that.. but i used a program called streamtransport that pulls the flash http://www.streamtransport.com/
[04:12] question is whether we can guess the filename at all or predict it based on the title.
[04:13] if you do chrome > inspect element and then the network tab you can see it if you catch it as it loads
[04:18] http://www.g4tv.com/xml/BroadbandPlayerService.asmx/GetEmbeddedVideo?videoKey=37668&strPlaylist=&playLargeVideo=true&excludedVideoKeys=&playlistType=normal&maxPlaylistSize=0
[04:19] just key the id # into videoKey, and it's in FilePath: http://www.g4tv.com/VideoRequest.ashx?v=37668&r=http%3a%2f%2fvids.g4tv.com%2fvideoDB%2f037%2f668%2fvideo37668%2ftr_bible_adventures_FS_flv.flv
[04:19] fucking a
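A Python take on the approach just described: query the BroadbandPlayerService endpoint quoted above for each videoKey and pull out the FilePath. The endpoint, the videoKey parameter, and the 61790 upper bound all come from the log; the assumption that the response XML carries a <FilePath> element (and the regex used to grab it) is mine, so treat this as a sketch rather than the script that was actually used. Note that the FilePath value may itself be a VideoRequest.ashx wrapper URL like the one above, with the flv path url-encoded inside it.

```python
import re
import urllib.request

# Endpoint as quoted above; only videoKey changes between requests.
# No shell is involved, so the ampersands need no quoting or escaping here.
XML_URL = ("http://www.g4tv.com/xml/BroadbandPlayerService.asmx/GetEmbeddedVideo"
           "?videoKey={key}&strPlaylist=&playLargeVideo=true"
           "&excludedVideoKeys=&playlistType=normal&maxPlaylistSize=0")

# Assumption: the video URL sits in a <FilePath> element of the response.
FILEPATH_RE = re.compile(r"<FilePath>(.*?)</FilePath>", re.S)

def video_urls(first_id, last_id):
    """Yield (video id, FilePath value) for every id that resolves."""
    for key in range(first_id, last_id + 1):
        try:
            with urllib.request.urlopen(XML_URL.format(key=key), timeout=30) as resp:
                xml = resp.read().decode("utf-8", errors="replace")
        except Exception:
            continue  # "no longer active" / nonexistent ids are simply skipped
        match = FILEPATH_RE.search(xml)
        if match:
            # May still need url-unquoting of an r= parameter to get the raw flv url.
            yield key, match.group(1)

if __name__ == "__main__":
    # 61790 was the newest id on the front page at the time of the log.
    for vid, url in video_urls(1, 61790):
        print(vid, url)
```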
[04:20] i imagine i can run through a list of ids and get all the xml
[04:21] ok
[04:21] and then grep out the filepaths and then wget that list of videos
[04:21] yes
[04:22] but i can also imagine that we can create a script that will download the file and then up it to archive.org
[04:22] cause we have the desc and date of the post
[04:23] also the title
[04:23] the url for archiving would best be something like video37668_$filename or something
[04:24] also grab the thumbnail image and upload it with it
[04:24] there are tons of early techtv stuff on there
[04:25] some full episodes in 56k
[04:27] gamespot tv episode in 56k: http://www.g4tv.com/videos/24430/thief-2-pc-review/
[04:28] yeah, trying to figure out why wget is not keeping the rest of the url in my range
[04:28] http://www.g4tv.com/xml/BroadbandPlayerService.asmx/GetEmbeddedVideo?videoKey={0..5}&strPlaylist=&playLargeVideo=true&excludedVideoKeys=&playlistType=normal&maxPlaylistSize=0
[04:28] it cuts off after the range in braces
[04:29] i guess maybe that's not wget but bash related
[04:30] well w/e, i can just generate a list of urls
[04:39] i'm making a warc of the xml output
[04:39] of every id?
[04:39] i think so
[04:39] doing what
[04:40] the latest id is 61790
[04:40] that's on the front page
[04:43] what did you do to generate the list
[04:44] i'm planning on zcat *.warc.gz | grep 'vids.g4tv.com' or something
[04:44] i mean the list of urls to leech the xmls
[04:45] are you going to download all the videos too or don't have the bw?
[04:46] i don't think i can download all the videos
[04:46] S[h]O[r]T: that url has an ampersand in it
[04:47] put the url in double quotes
[04:47] like ""http://"" or do you just mean "http"
[04:47] i can't do "" since the range doesn't execute, it takes it literally
[04:48] the alternative is to escape the ampersands with backslashes
[04:48] \&strPlaylist=
[04:48] actually, I dunno if that will work; lemme test
[04:48] yeah, i'll try that. i've got a php script going generating 0-99999 but it's taking awhile :P
[04:49] yes, it will
[04:49] err, ok
[04:49] why not just use a shell loop?
[04:49] <---half retard
[04:49] if it works or works eventually, good enough for me :P
[04:50] ok :)
[04:50] i can seed out the list now :-D
[04:50] or if godane would type faster :P
[04:51] for future reference:
[04:51] i'll take whatever list you have. i've got the bw/space to leech all the videos
[04:51] for (( i=1; i<=99999; i++ ))
[04:51] do
[04:52] wget ... "http://example.com/$i"
[04:52] done
[04:52] i did for i in $(seq 1 61790); do
[04:53] echo "http://example.com/$i" >> index.txt
[04:53] done
[04:53] you're probably right that's the last id, but you never know what's published and what isn't
[04:53] wget -x -i index.txt -o wget.log
[04:57] some of the ids return "video is no longer active" vs "doesn't exist"
[06:07] http://retractionwatch.wordpress.com/2013/01/03/owner-of-science-fraud-site-suspended-for-legal-threats-identifies-himself-talks-about-next-steps/
[06:08] might be too late for the original site, but the planned site is something we should make sure to save
[06:59] db48x: shot him an email
[06:59] who knows, maybe I can help out hosting the original stuff
[07:26] No idea if this has been linked in here before
[07:26] but it's a fun toy
[07:26] http://monitor.us.archive.org/weathermap/weathermap.html
[08:32] spiffy
[08:42] grr, doesn't load
[08:42] ah
[08:43] oh, so when the traffic with HE is big I know s3 will suck
[08:44] [doesn't work with https]
[12:12] I have a Yahoo blogs question.
[12:12] The problem is this: each blog is available via the GUID of its owner, e.g.
[12:12] http://blog.yahoo.com/_FWXMIGZ6AAZ66QJABH727HC4HM/
[12:12] which links to this profile here
[12:12] http://profile.yahoo.com/FWXMIGZ6AAZ66QJABH727HC4HM/
[12:12] However, at least some of the blogs also have a friendlier name, see
[12:12] http://blog.yahoo.com/lopyeuthuong/
[12:12] It's the same blog.
[12:12] There's a link from the friendly-name version to the profile, so it's possible to go from a name to a GUID. I'd like to go from a GUID to a name. Any ideas?
[12:12] (Context: tuankiet has been collecting Vietnamese Yahoo blogs to be archived, with a lot of GUIDs. If it's possible I'd also like to archive the blogs via their friendly URLs.)
[12:21] huh.
[12:23] alard: Using warc-proxy, with one of the records I have in a warc file it just sits there trying to serve it to the browser but never sends it. If I remove the record it works. Could you take a look at it?
[12:24] hiker1: Yes, where is it?
[12:25] Give me one minute to upload it + a version that has the record removed.
[12:30] alard: http://www.fileden.com/files/2012/10/26/3360748/Broken%20Warc.zip
[12:32] alard: A different bug I found is that if the root record does not have a '/' at the end, warc-proxy will not serve it.
[12:33] And it becomes impossible to access that record, at least through chrome. Chrome always seems to request the site with the / at the end
[12:34] You mean something like: WARC-Target-URI: http://example.com
[12:34] yes.
[12:35] It's very tempting to say that that's a bug in the tool that made the warc.
[12:36] My program does not require it, so perhaps it is my fault. But I could see other tools not requiring it, so perhaps the behavior should be changed in both places.
[12:37] Apparently the / is not required: https://tools.ietf.org/html/rfc2616#section-3.2.2
[12:39] (Should we do this somewhere else?)
[12:40] Could it be that the record you removed is the only record without a Content-Length header?
[12:40] the WARC record has a content-length header
[12:40] the HTTP header does not
[12:44] Do you have a modified version that can handle uncompressed warcs?
[12:44] I've had the same 'problem' with a Target-URI missing /
[12:44] warc-proxy handles uncompressed records just fine for what I've tried
[12:44] alard: I'm not sure I understand you
[12:45] We're talking about this warc-proxy, right? https://github.com/alard/warc-proxy At least my version of the Firefox extension doesn't show .warc, only .warc.gz.
[12:45] It's always shown .warc for me.
[12:46] I use Chrome
[12:46] I was using it without the firefox extension. Just started the proxy, then browsed http://warc/
[12:46] Ah, I see. I'm using the firefox extension (which uses a different filename filter).
[12:47] alard: By the way, http://warc/ is beautiful :]
[12:47] Thanks. (It's the same as in the Firefox extension, except for the native file selection dialog.)
[12:48] Does Hanzo warctools even check Content-Length? I don't think so
[12:51] Hanzo warctools does add it to a list of headers
[12:56] alard: in warc-proxy, the content-length for the 404 is set to 12. It should be 91. Around line 257.
[12:56] What may be happening is this: if there's no Content-Length header in the HTTP response, the server should close the connection when everything is sent. The warc-proxy doesn't close the connection.
[12:56] That seems very likely.
[13:00] alard: In Chrome's network view, it shows the connection is never closed.
[13:01] Yes. I'm currently trying to understand https://tools.ietf.org/html/rfc2616#section-14.10
[13:06] uploaded: http://archive.org/details/CSG_TV-Game_Collection_Complete_VHS_-_1996
[13:52] alard: I fixed it by having warc-proxy calculate the content-length if it isn't already present in a header.
[13:54] It introduces a little bit of complexity to the __call__ method because now it uses parse_http_response
[13:56] Alternatively, and perhaps a better option, would be for warc-proxy to forcibly close the connection, since that is what the server must do when it does not send a content-length.
[14:04] so i got the video file url list
[14:04] for the g4 videos
[14:16] alard: you still there?
[14:18] Yes. I was concentrating. :) I've now changed the warc-proxy so it sends the correct, reconstructed headers.
[14:18] I found that there is a very easy fix
[14:19] http_server = tornado.httpserver.HTTPServer(my_application, no_keep_alive=True)
[14:19] This tells Tornado to close all connections regardless of what http version is used
[14:19] That's not a fix, that's a hack. :) If you disable keep-alive the request shouldn't have a Connection: keep-alive header.
[14:20] Tornado recommends it
[14:20] alard: i got the g4tv video url list
[14:20] alard: i need you to make a script for downloading and uploading to archive.org
[14:20] hiker1: https://github.com/alard/warc-proxy/commit/9d107976ccd47c244669b5e680d67a5caf6e103c
[14:20] `HTTPServer` is a very basic connection handler. Beyond parsing the HTTP request body and headers, the only HTTP semantics implemented in `HTTPServer` is HTTP/1.1 keep-alive connections. We do not, however, implement chunked encoding, so the request callback must provide a ``Content-Length`` header or implement chunked encoding for HTTP/1.1 requests for the server to run correctly for HTTP/1.1 clients. If the request handler is unable to do this, you can provide the ``no_keep_alive`` argument to the `HTTPServer` constructor, which will ensure the connection is closed on every request no matter what HTTP version the client is using.
[14:21] hiker1: (Plus the commits before.)
[14:22] godane: Great. If I may ask, why don't you write the script yourself?
[14:22] Either method should work. I think I prefer the no_keep_alive method because then it sends exactly what was received from the server. It's also simple.
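For reference, the no_keep_alive option quoted above is a constructor argument on Tornado's HTTPServer. A stripped-down sketch of how it is wired up, assuming a Tornado version from around the time of the log (the option was removed in later releases), with a placeholder handler rather than warc-proxy's actual code:

```python
import tornado.httpserver
import tornado.ioloop
import tornado.web

class RecordHandler(tornado.web.RequestHandler):
    # Placeholder: imagine this writing back a raw HTTP response pulled from
    # a WARC record, possibly one without a Content-Length header.
    def get(self):
        self.write("hello from the archive")

application = tornado.web.Application([(r"/.*", RecordHandler)])

# no_keep_alive=True makes Tornado close the socket after every response, so a
# replayed record that lacks Content-Length still terminates for the client.
http_server = tornado.httpserver.HTTPServer(application, no_keep_alive=True)
http_server.listen(8000)
tornado.ioloop.IOLoop.instance().start()
```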
[14:22] It's a good reason to learn it, and people here are probably more than happy to help you.
[14:23] Yes, but why does it have to send the exact response from the server?
[14:23] alard: That's true. As long as the exact response is already saved in the WARC, it doesn't matter.
[14:24] thank you for fixing it upstream. I think this has been the root of a few problems I've been having
[14:28] yep. other hacks I added are no longer needed
[14:28] Some of the google ad scripts were causing failures with warc-proxy but are now resolved
[14:34] alard: One other fix: line 193 changed to if mime_type and data["type"] and not mime_type.search(data["type"]):
[14:34] my www.eff.org warc is still going to the live site with warc-proxy
[14:36] Some of them don't have a type set, so that line throws an error
[14:38] hiker1: Shouldn't that be "type" in data then?
[14:38] it still looks like https will always go to the live web
[14:39] godane: Ah yes, that was the other bug.
[14:43] alard: sure
[14:43] er...
[14:43] data["type"] is sometimes None.
[14:43] I think "type" in data would return True
[14:43] So it should be: "type" in data and data["type"]
[14:44] yes
[14:45] Actually, there's always a "type" field (line 113), so data["type"] will do.
[14:47] alard: so now typing urls for archives doesn't work with warc-proxy
[14:47] i'm using localhost port 8000 for my proxy
[14:47] unless that's the problem
[14:52] godane: What do you mean by typing urls for archives?
[14:54] hiker1: I think the mime-type is None thing works now.
[14:54] when i run the proxy i used to type in urls that i know the archive will work with
[14:58] Actually, it seems completely broken at the moment.
[14:58] (If you remove the cached .idx files, that is.)
[15:04] Morning
[15:08] Afternoon.
[15:08] hey SketchCow
[15:09] all my image dumps of arstechnica are uploaded now
[15:10] hiker1: Any idea where these "trailing data in http response" messages come from? I only get them with your warcs.
[15:16] http://www.apachefoorumi.net/index.php?topic=68497.0
[15:25] Thank you, godane
[15:25] you're welcome
[15:26] now we have good dumps of arstechnica and engadget
[15:26] also a few of torrentfreak
[15:28] alard: I think my content-length is too short
[15:28] WARC content length
[15:29] hiker1: There are a lot of \r\n at the end of the response record.
[15:29] alard: What do you mean?
[15:29] Are you saying I have too many?
[15:30] Perhaps. I thought there should be two, so \r\n\r\n, but you have four, \r\n\r\n\r\n\r\n. You also have four bytes too many, so that matches.
[15:31] Does the gzipped data always end with \r\n\r\n?
[15:31] I don't know.
[15:32] I fixed the extra \r\n in my program
[15:32] no more trailing data notices
[15:33] The warc standard asks for \r\n\r\n at the end of the record.
[15:34] Is that included in the block, or after the record?
[15:34] as in, should the WARC Content-Length include it or not
[15:34] No, that's not included in the block.
[15:35] hmm
[15:35] I think the Content-Length for the warc should not include the two \r\n.
[15:36] Since you have (or had?) \r\n\r\n + \r\n\r\n in your file, the first \r\n\r\n is from the HTTP response.
[15:36] So is having four of them correct?
[15:37] I think you only need two \r\n, since the gzipped HTTP body just ends with gzipped data.
[15:38] Yes, I think it should only be two.
[15:39] alard: Should I decompress the gzipped content
[15:39] sorry. Should I decompress the gzipped content body of the HTTP response?
[15:39] or should I have set the crawler to not accept gzip in the first place?
[15:42] If you decompress, you should also remove the Content-Encoding header. But I wouldn't do that (you're supposed to keep the original server response).
[15:42] Not sending the Accept-Encoding header might be better.
[15:42] That spends a lot of bandwidth. I think I will leave it compressed.
[15:45] This is from the Heritrix release notes: "FetchHTTP also now includes the parameter 'acceptCompression' which if true will cause Heritrix requests to include an "Accept-Encoding: gzip,deflate" header offering to receive compressed responses. (The default for this parameter is false for now.)
[15:45] As always, responses are saved to ARC/WARC files exactly as they are received, and some bulk access/viewing tools may not currently support chunked/compressed responses. (Future updates to the 'Wayback' tool will address this.)"
[15:46] So it's probably okay to ask for gzip.
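To make the record-length bookkeeping above concrete: the WARC Content-Length covers only the record block — the HTTP status line, headers, and body exactly as received, still compressed if the server sent gzip — and the two CRLFs that terminate the record come after the block and are not counted. A minimal sketch of writing one uncompressed response record under that reading (record headers trimmed down; a valid record also needs WARC-Date, WARC-Record-ID, and so on):

```python
def write_response_record(out, target_uri, raw_http_response):
    """Append one simplified WARC response record to an open binary file.

    raw_http_response is the status line + headers + body exactly as the
    server sent them (still gzipped if Content-Encoding: gzip was used).
    """
    block = raw_http_response  # nothing extra is appended to the block itself
    headers = (
        b"WARC/1.0\r\n" +
        b"WARC-Type: response\r\n" +
        b"WARC-Target-URI: " + target_uri.encode("utf-8") + b"\r\n" +
        b"Content-Type: application/http;msgtype=response\r\n" +
        # Content-Length counts the block only, not the trailing CRLFs.
        b"Content-Length: " + str(len(block)).encode("ascii") + b"\r\n" +
        b"\r\n"
    )
    out.write(headers)
    out.write(block)
    out.write(b"\r\n\r\n")  # record separator: exactly two CRLFs, outside the block
```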
[15:48] godane: how many videos in your list?
[15:49] will check
[15:49] 37616
[15:49] ok
[15:49] i got the same amount :)
[15:49] that's a lot of videos.
[15:50] also know that the hd videos may not be in this
[15:50] there's like a button to enable hd?
[15:50] yes
[15:50] link to an hd video if you got one offhand
[15:50] alard: The project I'm working on is at https://github.com/iramari/WarcMiddleware in case you are interested in trying it
[15:51] not all videos are hd
[15:51] or have an hd option
[15:51] but i think for most hd videos you just change _flv.flv to _flvhd.flv
[15:52] any chance someone can look at the yahoo group archiver and see what's wrong with it?
[15:52] I'm no good at perl
[15:52] No one is truly good at perl.
[15:52] http://grabyahoogroup.svn.sourceforge.net/viewvc/grabyahoogroup/trunk/yahoo_group/
[15:53] specifically grabyahoogroup.pl
[15:53] Yep. that's definitely Perl.
[15:53] there are TONS of good stuff on yahoo groups, and most groups require registration and approval to see it :/
[15:54] (why? any group that doesn't require approval will get spammed)
[15:55] godane: With the current warcproxy, entering urls seems to work for me, so I can't find your problem.
[15:55] Can you click on the urls in the list?
[15:55] Yes, I am able to enter urls as well.
[15:56] be sure that the url has a trailing '/', and also include www. if the url requires it
[15:56] Also, SSL/https is difficult, it seems. The proxy probably would have to simulate the encryption.
[15:56] the trailing / is only needed for the root of the website
[15:56] Is the / still needed?
[15:56] alard: or rewrite it all to http
[15:56] alard: did you change it?
[15:57] I tried to, yes, but that may have introduced new problems: https://github.com/alard/warc-proxy/commit/33ca7e5e30722e8ee40e6a0ed9d3828b82973171
[15:58] (You'll have to rebuild the .idx files.)
[15:58] testing
[15:58] alard: I think your code adds it to every url
[15:59] Rewriting https to http would work, perhaps, but once you start rewriting you don't really need a proxy.
[16:00] nevermind. Yes, that fix works
[16:01] it's weird
[16:01] i guess we can just re-run the list with flvhd after. i'm trying to see if the xml supports getting the hd video url, then we can just have a separate list and not have to run through 37k urls trying
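For that HD pass (re-running the list with _flvhd.flv), a tiny sketch that derives a candidate HD list from the existing url list. The file names here are illustrative, and whether a given _flvhd.flv actually exists still has to be checked at download time:

```python
# Derive a candidate HD list from the list of SD flv urls.
# "video_urls.txt" and "video_urls_hd.txt" are illustrative file names.
with open("video_urls.txt") as src, open("video_urls_hd.txt", "w") as dst:
    for line in src:
        url = line.strip()
        if url.endswith("_flv.flv"):
            dst.write(url.replace("_flv.flv", "_flvhd.flv") + "\n")
```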
[16:01] i'm looking at my romshepherd.com dump
[16:02] i get robots.txt and images just fine
[16:02] godane: Is the .warc.gz online somewhere?
[16:02] no
[16:03] it's a very old one
[16:03] but you can use my stillflying.net warc.gz
[16:03] https://archive.org/details/stillflying.net-20120905-mirror
[16:03] alard, how did you find out the GUIDs of the blogs?
[16:04] Searching with Google/Yahoo.
[16:04] For some blogs you find the blog name, for others the GUID.
[16:05] do you have another blog by guid you can link me to?
[16:08] also opening a url folder takes a very long time now
[16:09] S[h]O[r]T: http://tracker.archiveteam.org:8124/ids.txt
[16:09] if anyone wants to take a shot at fixing that yahoo group downloader, I would greatly appreciate it
[16:10] Just wanted to drop this in here before I go offline due to stupid wireless.
[16:11] I'd like to do a project to make a Wiki and translate metadata to and from the Internet Archive to it.
[16:11] So we can finally get decent metadata.
[16:12] Since I expect the actual site to have collaborative metadata 15 minutes after never, a moderated way for it to happen will be a good second chance.
[16:13] Anyway, something to think about, I know I have.
[16:14] hiker1: So you've probably removed this? https://github.com/iramari/WarcMiddleware/blob/master/warcmiddleware.py#L81
[16:15] I did not. That entire file is sort of deprecated.
[16:15] alard: This is the version that is used: https://github.com/iramari/WarcMiddleware/blob/master/warcclientfactory.py
[16:16] At the bottom of this recent commit was the change: https://github.com/iramari/WarcMiddleware/commit/bb69bbf6c19bbc7df150f4bc671e7406257eb750
[16:16] Oh, okay. I was happy I found something, but that commit seems fine. There's no reason to add these \r\n's.
[16:18] What do you think about the project? Using scrapy + WARC?
[16:19] I didn't know Scrapy, but adding WARC support to web archivers is always a good idea, I think.
[16:21] Why do you have the DownloaderMiddleware if that's not recommended?
[16:21] because that was what I built first
[16:24] I should probably delete the file.
[16:27] Yes, if it creates substandard warcs.
[16:28] Although if I look at the Scrapy architecture, DownloaderMiddleware seems to be the way to do this, and the ClientFactory is a hack.
[16:28] I'm afraid it does. Scrapy does not seem to send enough information through its DownloaderMiddleware interface to properly create WARC files
[16:28] Then modify Scrapy!
[16:29] hah
[16:29] It's only a two line fix I think
[16:29] Scrapy skips saving what it calls the start_line, which contains the HTTP/1.1 or whatever with the 200 OK
[16:29] so it throws that out.
[16:30] which is annoying if you are trying to create a WARC file.
[16:30] ClientFactory does seem a bit of a hack, but it lets me capture the raw data without any parsing or reconstruction.
[16:31] Can't you send a patch to the Scrapy people?
[16:31] probably. I never contribute to projects much though.
[16:35] You have to start with something.
[16:36] telling someone in IRC is so much easier though
[16:36] :)
[16:36] I just learned how to clone someone else's github repo.
[16:38] Heh.
[16:38] alard: Isn't it better to be able to grab the raw data w/ the ClientFactory than to reconstruct it from headers?
[16:39] godane: Like you said, the warcproxy doesn't like your stillflying.net warc. It's really slow.
[16:40] hiker1: Yes. So I'd think you would need to modify Scrapy so that it keeps the raw data. I don't know how Scrapy works, so I don't know where you'd put something like that, but I assume some data gets passed on.
[16:41] Some information is passed on, but I'm not sure the Scrapy devs would want to expose the raw data. Maybe they'd be okay with it, I don't know.
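A rough sketch of what the downloader-middleware route has to do, given that Scrapy hands middleware a parsed Response rather than raw bytes: rebuild the status line and headers before writing a record. This is not the actual WarcMiddleware code — the class name and the reconstructed "HTTP/1.1 ... OK" line are illustrative, and the real reason phrase genuinely is lost, which is exactly the complaint above:

```python
class WarcCaptureMiddleware(object):
    """Illustrative Scrapy downloader middleware that rebuilds an HTTP
    response from the parsed Response object Scrapy passes along.
    (Enable it via DOWNLOADER_MIDDLEWARES in the project settings.)"""

    def process_response(self, request, response, spider):
        # Scrapy keeps only the numeric status, so the HTTP version and the
        # reason phrase have to be guessed/reconstructed here.
        status_line = ("HTTP/1.1 %d OK\r\n" % response.status).encode("ascii")
        header_lines = b""
        for name, values in response.headers.items():
            for value in values:
                header_lines += name + b": " + value + b"\r\n"
        raw = status_line + header_lines + b"\r\n" + response.body
        self.write_warc_record(request.url, raw)
        return response

    def write_warc_record(self, url, raw_response):
        pass  # hand off to a WARC writer, e.g. one like the sketch earlier
```

Whether response.body here is still the compressed on-the-wire payload depends on where this middleware sits relative to HttpCompressionMiddleware, which is the priority question that comes up below.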
[16:42] You can always submit the patch and see what happens. What would be wrong with keeping the raw data?
[16:42] They like exposing the parsed data.
[16:46] I'm looking into how the data could be passed.
[16:49] As a property of the Request and Response? https://scrapy.readthedocs.org/en/latest/topics/request-response.html
[17:16] alard: The only way to save the raw data would be to basically merge parts of the clientfactory I wrote into the actual one to add a raw_response argument to Request and Response.
[17:17] As it stands, the DownloaderMiddleware version saves gzipped responses uncompressed, because Scrapy uncompresses them before handing them off to the middleware.
[17:20] hiker1: And if you change the priority of the WarcMiddleware?
[17:37] alard: that worked
[17:37] good idea
[17:37] You'll probably also want to get to the response before the redirect-handler etc. do anything.
[17:38] It is changing the order of the HTTP headers
[17:39] probably because it's storing them in a list and reconstructing them
[18:13] godane: Your warc file is faulty. It was made with one of the early Wget-warc versions, so it contains Transfer-Encoding: chunked headers with 'de-chunked' bodies.
[18:27] oh
[18:28] but i was able to browse it before in warc-proxy
[18:35] also i was using wget 1.14
[18:36] so this could mean all warcs since august are faulty
[18:36] please make warc-proxy work with bad warcs
[18:37] cause if this bug is in wget 1.14 then all warcs will have problems
[18:38] we should probably write a clean-warc tool to fix malformed warcs.
[18:40] i'm doing another mirror of stillflying.net
[18:40] if it fails this time i blame warc-proxy
[18:45] does this patch fix it: https://github.com/alard/warc-proxy/commit/8717f33b642f414de896dcafb2e91a3dc27c38ca
[18:55] so that patch is not working
[19:38] alard: ok so i'm testing warc-proxy in midori
[19:38] it's working a lot faster here but
[19:39] using a proxy in midori will block everything
[19:39] even localhost:8000
[19:40] ok got it working
[19:40] can't use localhost:8000 when using the proxy
[19:43] alard: i get a connection reset by peer error with warc-proxy
[19:49] midori?
[19:55] it's a webkit gtk browser
[19:56] part of my problem with midori is that no pages load
[19:56] i did get stillflying.net to view but nothing in torrentfreak.com
[20:25] www.jizzday.com
[20:50] godane is no longer here, but here's a correction: the warc file is fine, but the warcproxy removes the chunking.
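For anyone puzzling over that last correction: "removing the chunking" means decoding a Transfer-Encoding: chunked body back into a plain byte stream, so a response can come out of the proxy de-chunked even though the warc on disk still holds the chunked original. A minimal sketch of that decoding step (error handling and trailer parsing omitted):

```python
def dechunk(body):
    """Decode a Transfer-Encoding: chunked HTTP body into plain bytes."""
    out = b""
    pos = 0
    while True:
        # Each chunk starts with its size in hex (optionally followed by
        # extensions after ';'), terminated by CRLF.
        line_end = body.index(b"\r\n", pos)
        size = int(body[pos:line_end].split(b";")[0], 16)
        if size == 0:
            break  # last chunk; any trailers that follow are ignored here
        start = line_end + 2
        out += body[start:start + size]
        pos = start + size + 2  # skip the CRLF that ends the chunk data
    return out
```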