#archiveteam 2013-01-05,Sat


Time Nickname Message
00:52 🔗 godane balrog_: i just mirrored it
01:26 🔗 godane balrog_: uploaded: http://archive.org/details/www.mindsetcomputing.com-20130104-mirror
01:27 🔗 balrog_ awesome :)
01:38 🔗 db48x heh, http://thewholegoddamnedinternet.com has a cvs keyword substitution at the bottom
01:38 🔗 db48x that's dedication
01:58 🔗 godane uploaded: http://archive.org/details/www.raygunrevival.com-20130103-mirror
02:15 🔗 godane you guys may want this: https://www.youtube.com/user/TorontoPETUsersGroup
02:16 🔗 godane world of commodore video talks
02:20 🔗 hiker1 wow there are still talks about commodore.
02:43 🔗 hiker1 godane: What options do you typically use on wget?
03:01 🔗 godane this is what i used for the raygunrevival.com grab: wget "http://$website/Published/" --mirror --warc-file=$website-$(date +%Y%m%d) --reject-regex='(replytocom)' --warc-cdx -E -o wget.log
03:02 🔗 godane website="www.raygunrevival.com"
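(The same grab as one snippet, for reference; a minimal sketch assuming wget 1.14+ so that --warc-file, --warc-cdx and --reject-regex are all available:)
    website="www.raygunrevival.com"
    wget "http://$website/Published/" --mirror \
        --warc-file="$website-$(date +%Y%m%d)" --warc-cdx \
        --reject-regex='(replytocom)' -E -o wget.log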
03:02 🔗 hiker1 thanks
03:10 🔗 hiker1 wow that site was big
03:16 🔗 godane i had to get the Published folder first cause none of the .pdf links go there
03:17 🔗 godane anyways i'm uploading my arstechnica image dumps
03:28 🔗 godane http://forums.g4tv.com/showthread.php?187820-Save-G4TV-com-%28the-website%29!
03:28 🔗 godane i'm doing my best to save the forums but you guys need to get the videos
03:29 🔗 godane not every video has a .mp4 file in it
04:03 🔗 S[h]O[r]T someone do the hard work for making a script to save the video :P
04:06 🔗 godane i'm trying to figure it out now
04:06 🔗 godane best i can get is the url folder
04:07 🔗 godane like for http://www.g4tv.com/videos/37668/bible-adventures-gameplay/
04:07 🔗 godane the video file will be in http://vids.g4tv.com/videoDB/037/668/video37668/
04:08 🔗 godane but nothing in the html gives the name
04:08 🔗 godane the filename i mean
04:10 🔗 S[h]O[r]T http://vids.g4tv.com/videoDB/037/668/video37668/tr_bible_adventures_FS_flv.flv
04:10 🔗 godane how did you get the filename?
04:11 🔗 godane i normally have to download it using a plugin in firefox
04:12 🔗 S[h]O[r]T ^^ you can prob use chromes inspector too. i should have thought of that..but i used a program called streamtransport that pulls the flash http://www.streamtransport.com/
04:12 🔗 S[h]O[r]T question is either can we guess the filename at all or predict it based on their title.
04:13 🔗 S[h]O[r]T if you do chrome > inspect element and then the network tab you can see it if you catch it as it loads
04:18 🔗 S[h]O[r]T http://www.g4tv.com/xml/BroadbandPlayerService.asmx/GetEmbeddedVideo?videoKey=37668&strPlaylist=&playLargeVideo=true&excludedVideoKeys=&playlistType=normal&maxPlaylistSize=0
04:19 🔗 S[h]O[r]T just key in the id # into videoKey, and its in FilePath http://www.g4tv.com/VideoRequest.ashx?v=37668&r=http%3a%2f%2fvids.g4tv.com%2fvideoDB%2f037%2f668%2fvideo37668%2ftr_bible_adventures_FS_flv.flv
04:19 🔗 godane fucking a
04:20 🔗 S[h]O[r]T i imagine i can run through a list of ids and get all the xml
04:21 🔗 godane ok
04:21 🔗 S[h]O[r]T and then grep out the filepaths and then wget that list of videos
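(A minimal bash sketch of that idea for a single id; curl/grep/sed stand in for whatever tools end up being used, and it assumes the FilePath value looks like the example above, with the real .flv URL percent-encoded in the r= parameter:)
    id=37668
    xml_url="http://www.g4tv.com/xml/BroadbandPlayerService.asmx/GetEmbeddedVideo?videoKey=$id&strPlaylist=&playLargeVideo=true&excludedVideoKeys=&playlistType=normal&maxPlaylistSize=0"
    # FilePath holds a VideoRequest.ashx URL; pull out its r= parameter and
    # decode %3a/%2f to recover the vids.g4tv.com .flv link
    curl -s "$xml_url" | grep -o 'r=http%3a%2f%2fvids[^&<"]*' \
        | sed -e 's/^r=//' -e 's/%3a/:/g' -e 's|%2f|/|g'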
04:21 🔗 godane yes
04:22 🔗 godane but i can also imagine that we can create a script that will download the file and then upload it to archive.org
04:22 🔗 godane cause we have desc and date of post
04:23 🔗 godane also title
04:23 🔗 godane the url for archiving would best be something like video37668_$filename or something
04:24 🔗 godane also grab the thumbnail image too and upload it with it
04:24 🔗 godane there are tons of early techtv stuff on there
04:25 🔗 godane some full episodes in 56k
04:27 🔗 godane gamespot tv episode in 56k: http://www.g4tv.com/videos/24430/thief-2-pc-review/
04:28 🔗 S[h]O[r]T yeah trying to figure out why wget is not keeping the rest of the url in my range
04:28 🔗 S[h]O[r]T http://www.g4tv.com/xml/BroadbandPlayerService.asmx/GetEmbeddedVideo?videoKey={0..5}&strPlaylist=&playLargeVideo=true&excludedVideoKeys=&playlistType=normal&maxPlaylistSize=0
04:28 🔗 S[h]O[r]T cuts off after the range in braces
04:29 🔗 S[h]O[r]T i guess maybe thats not wget but bash related
04:30 🔗 S[h]O[r]T well w/e i can just generate a list of urls
04:39 🔗 godane i'm making a warc of the xml output
04:39 🔗 S[h]O[r]T of everything?
04:39 🔗 S[h]O[r]T *every id
04:39 🔗 godane i think so
04:39 🔗 S[h]O[r]T doing what
04:40 🔗 godane the latest id is 61790
04:40 🔗 godane thats on the front page
04:43 🔗 S[h]O[r]T what did you do to generate the list
04:44 🔗 godane i'm planning on zcat *.warc.gz | grep 'vids.g4tv.com' or something
04:44 🔗 S[h]O[r]T i mean the list of urls to leech the xmls
04:45 🔗 S[h]O[r]T are you going to download all the videos too or dont have the bw?
04:46 🔗 godane i don't think i can download all the videos
04:46 🔗 db48x S[h]O[r]T: that url has an ampersand in it
04:47 🔗 db48x put the url in double quotes
04:47 🔗 S[h]O[r]T like ""http://"" or do you just mean "http"
04:47 🔗 S[h]O[r]T i cant do "" since the range doesnt execute, it takes it literally
04:48 🔗 db48x the alternative is to escape the ampersands with backslashes
04:48 🔗 db48x \&strPlaylist=
04:48 🔗 db48x actually, I dunno if that will work; lemme test
04:48 🔗 S[h]O[r]T yeah, ill try that. ive got a php script going generating 0-99999 but taking awhile :P
04:49 🔗 db48x yes, it will
04:49 🔗 db48x err, ok
04:49 🔗 db48x why not just use a shell loop?
04:49 🔗 S[h]O[r]T <---half retard
04:49 🔗 S[h]O[r]T if it works or works eventually good enough for me :P
04:50 🔗 db48x ok :)
04:50 🔗 godane i can seed out the list now :-D
04:50 🔗 S[h]O[r]T or if godane would type faster :P
04:51 🔗 db48x for future reference:
04:51 🔗 S[h]O[r]T ill take whatever list you have. ive got the bw/space to leech all the videos
04:51 🔗 db48x for (( i=1; i<=99999; i++ ))
04:51 🔗 db48x do
04:52 🔗 db48x wget ... "http://example.com/$i"
04:52 🔗 db48x done
04:52 🔗 godane i did for i in $(seq 1 61790); do
04:53 🔗 godane echo "http://example.com/$i" >> index.txt
04:53 🔗 godane done
04:53 🔗 S[h]O[r]T your probably right thats the last id but you never know whats published and isnt
04:53 🔗 godane wget -x -i index.txt -o wget.log
04:57 🔗 S[h]O[r]T some of the files return "video is no longer active" vs "doesn't exist"
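(Putting the two halves together, a minimal bash sketch; the full URL is double-quoted so the ampersands survive, per db48x's point above, 61790 is just the newest id seen on the front page, and the warc filename is a placeholder:)
    for i in $(seq 1 61790); do
        echo "http://www.g4tv.com/xml/BroadbandPlayerService.asmx/GetEmbeddedVideo?videoKey=$i&strPlaylist=&playLargeVideo=true&excludedVideoKeys=&playlistType=normal&maxPlaylistSize=0"
    done > index.txt
    # grab every xml response into a warc for later grepping
    wget -x -i index.txt --warc-file="g4tv-xml-$(date +%Y%m%d)" -o wget.log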
06:07 🔗 db48x http://retractionwatch.wordpress.com/2013/01/03/owner-of-science-fraud-site-suspended-for-legal-threats-identifies-himself-talks-about-next-steps/
06:08 🔗 db48x might be too late for the original site, but the planned site is something we should make sure to save
06:59 🔗 joepie91 db48x: shot him an email
06:59 🔗 joepie91 who knows, maybe I can help out hosting the original stuff
07:26 🔗 underscor No idea if this has been linked in here before
07:26 🔗 underscor but it's a fun toy
07:26 🔗 underscor http://monitor.us.archive.org/weathermap/weathermap.html
08:32 🔗 chronomex spiffy
08:42 🔗 Nemo_bis grr doesn't load
08:42 🔗 Nemo_bis ah
08:43 🔗 Nemo_bis oh, so when the traffic with HE is big I know s3 will suck
08:44 🔗 Nemo_bis [doesn't work with https]
12:12 🔗 alard I have a Yahoo blogs question.
12:12 🔗 alard The problem is this: each blog is available via the GUID of its owner, e.g.
12:12 🔗 alard http://blog.yahoo.com/_FWXMIGZ6AAZ66QJABH727HC4HM/
12:12 🔗 alard However, at least some of the blogs also have a friendlier name, see
12:12 🔗 alard http://blog.yahoo.com/lopyeuthuong/
12:12 🔗 alard which links to this profile here
12:12 🔗 alard http://profile.yahoo.com/FWXMIGZ6AAZ66QJABH727HC4HM/
12:12 🔗 alard It's the same blog.
12:12 🔗 alard There's a link from the friendly-name version to the profile, so it's possible to go from a name to a GUID. I'd like to go from a GUID to a name. Any ideas?
12:12 🔗 alard (Context: tuankiet has been collecting Vietnamese Yahoo blogs to be archived, with a lot of GUIDs. If it's possible I'd also like to archive the blogs via their friendly URLs.)
12:21 🔗 chronomex huh.
12:23 🔗 hiker1 alard: Using warc-proxy, there's one record I have in a warc file where it sits trying to serve it to the browser but never sends it. If I remove the record it works. Could you take a look at it?
12:24 🔗 alard hiker1: Yes, where is it?
12:25 🔗 hiker1 Give me one minute to upload it + a version that has the record removed.
12:30 🔗 hiker1 alard: http://www.fileden.com/files/2012/10/26/3360748/Broken%20Warc.zip
12:32 🔗 hiker1 alard: A different bug I found is that if the root record does not have a '/' at the end, warc-proxy will not serve it.
12:33 🔗 hiker1 And it becomes impossible to access that record, at least through chrome. Chrome always seems to request the site with the / at the end
12:34 🔗 alard You mean something like: WARC-Target-URI: http://example.com
12:34 🔗 hiker1 yes.
12:35 🔗 alard It's very tempting to say that that's a bug in the tool that made the warc.
12:36 🔗 hiker1 My program does not require it, so perhaps it is my fault. But I could see other tools not requiring it, so perhaps the behavior should be changed in both places.
12:37 🔗 alard Apparently the / is not required: https://tools.ietf.org/html/rfc2616#section-3.2.2
12:39 🔗 alard (Should we do this somewhere else?)
12:40 🔗 alard Could it be that the record you removed is the only record without a Content-Length header?
12:40 🔗 hiker1 the WARC record has a content-length header
12:40 🔗 hiker1 the HTTP header does not
12:44 🔗 alard Do you have a modified version that can handle uncompressed warcs?
12:44 🔗 ersi I've had the same 'problem' with a Target-URI missing /
12:44 🔗 ersi warc-proxy handles uncompressed records just fine for what I've tried
12:44 🔗 hiker1 alard: I'm not sure I understand you
12:45 🔗 alard We're talking about this warc-proxy, right? https://github.com/alard/warc-proxy At least my version of the Firefox extension doesn't show .warc, only .warc.gz.
12:45 🔗 hiker1 It's always shown .warc for me.
12:46 🔗 hiker1 I use Chrome
12:46 🔗 ersi I was using it without the firefox extension. Just started the proxy, then browsed http://warc/
12:46 🔗 alard Ah, I see. I'm using the firefox extension (which uses a different filename filter).
12:47 🔗 ersi alard: By the way, http://warc/ is beautiful :]
12:47 🔗 alard Thanks. (It's the same as in the Firefox extension, except for the native file selection dialog.)
12:48 🔗 hiker1 Does Hanzo warctools even check Content-Length? I don't think so
12:51 🔗 hiker1 Hanzo warctools does add it to a list of headers
12:56 🔗 hiker1 alard: in warc-proxy, the content-length for the 404 is set to 12. It should be 91. Around line 257.
12:56 🔗 alard What may be happening is this: if there's no Content-Length header in the HTTP response, the server should close the connection when everything is sent. The warc-proxy doesn't close the connection.
12:56 🔗 hiker1 That seems very likely.
13:00 🔗 hiker1 alard: In Chrome's network view, it shows the connection is never closed.
13:01 🔗 alard Yes. I'm currently trying to understand https://tools.ietf.org/html/rfc2616#section-14.10
13:06 🔗 godane uploaded: http://archive.org/details/CSG_TV-Game_Collection_Complete_VHS_-_1996
13:52 🔗 hiker1 alard: I fixed it by having warc-proxy calculate the content-length if it isn't already present in a header.
13:54 🔗 hiker1 It introduces a little bit of complexity to the __call__ method because now it uses parse_http_response
13:56 🔗 hiker1 Alternatively, and perhaps a better option, would be for warc-proxy to forcibly close the connection, since that is what a server must do when it does not send a content-length.
14:04 🔗 godane so i got the video file url list
14:04 🔗 godane for the g4 videos
14:16 🔗 hiker1 alard: you still there?
14:18 🔗 alard Yes. I was concentrating. :) I've now changed the warc-proxy so it sends the correct, reconstructed headers.
14:18 🔗 hiker1 I found that there is a very easy fix
14:19 🔗 hiker1 http_server = tornado.httpserver.HTTPServer(my_application, no_keep_alive=True)
14:19 🔗 hiker1 This tells Tornado to close all connections regardless of what version is used
14:19 🔗 hiker1 *http version
14:19 🔗 alard That's not a fix, that's a hack. :) If you disable keep-alive the request shouldn't have a Connection: keep-alive header.
14:20 🔗 hiker1 Tornado recommends it
14:20 🔗 godane alard: i got g4tv video url list
14:20 🔗 godane alard: i need you to make a script for downloading and uploading to archive.org
14:20 🔗 alard hiker1: https://github.com/alard/warc-proxy/commit/9d107976ccd47c244669b5e680d67a5caf6e103c
14:20 🔗 hiker1 `HTTPServer` is a very basic connection handler. Beyond parsing the HTTP request body and headers, the only HTTP semantics implemented in `HTTPServer` is HTTP/1.1 keep-alive connections. We do not, however, implement chunked encoding, so the request callback must provide a ``Content-Length`` header or implement chunked encoding for HTTP/1.1 requests for the server to run correctly for HTTP/1.1 clients. If the request handler is una
14:20 🔗 hiker1 ble to do this, you can provide the ``no_keep_alive`` argument to the `HTTPServer` constructor, which will ensure the connection is closed on every request no matter what HTTP version the client is using.
14:21 🔗 alard hiker1: (Plus the commits before.)
14:22 🔗 alard godane: Great. If I may ask, why don't you write the script yourself?
14:22 🔗 hiker1 Either method should work. I think I prefer the no_keep_alive method because then it sends exactly what was received from the server. It's also simple
14:22 🔗 alard It's a good reason to learn it, and people here are probably more than happy to help you.
14:23 🔗 alard Yes, but why does it have to send the exact response from the server?
14:23 🔗 hiker1 alard: That's true. As long as the exact response is saved, it doesn't matter.
14:23 🔗 hiker1 *saved in the WARC already.
14:24 🔗 hiker1 thank you for fixing it upstream. I think this has been the root of a few problems I've been having
14:28 🔗 hiker1 yep. other hacks I added are no longer needed
14:28 🔗 hiker1 Some of the google ad scripts were causing failures with warc-proxy but are now resolved
14:34 🔗 hiker1 alard: One other fix: line 193 changed to if mime_type and data["type"] and not mime_type.search(data["type"]):
14:34 🔗 godane my www.eff.org warc is still going to live site with warc-proxy
14:36 🔗 hiker1 Some of them don't have a type set, so that line throws an error
14:38 🔗 alard hiker1: Shouldn't that be "type" in data then?
14:38 🔗 godane it still looks like https will always go to live web
14:39 🔗 alard godane: Ah yes, that was the other bug.
14:43 🔗 hiker1 alard: sure
14:43 🔗 hiker1 er...
14:43 🔗 hiker1 data["type"] is sometimes None.
14:43 🔗 hiker1 I think "type" in data would return True
14:43 🔗 alard So it should be: "type" in data and data["type"]
14:44 🔗 hiker1 yes
14:45 🔗 alard Actually, there's always a "type" field (line 113), so data["type"] will do.
14:47 🔗 godane alard: so now typing urls for archives doesn't work with warc-proxy
14:47 🔗 godane i'm using localhost port 8000 for my proxy
14:47 🔗 godane unless that's the problem
14:52 🔗 alard godane: What do you mean with typing urls for archives?
14:54 🔗 alard hiker1: I think the mime-type is None thing works now.
14:54 🔗 godane when i run the proxy i used to type in urls that i know the archive will work with
14:58 🔗 alard Actually, it seems completely broken at the moment.
14:58 🔗 alard (If you remove the cached .idx files, that is.)
15:04 🔗 SketchCow Morning
15:08 🔗 alard Afternoon.
15:08 🔗 godane hey SketchCow
15:09 🔗 godane all my image dumps of arstechnica are uploaded now
15:10 🔗 alard hiker1: Any idea where these "trailing data in http response" messages come from? I only get them with your warcs.
15:16 🔗 Nemo_bis http://www.apachefoorumi.net/index.php?topic=68497.0
15:25 🔗 SketchCow Thank you, godane
15:25 🔗 godane you're welcome
15:26 🔗 godane now we have good dumps of arstechnica and engadget
15:26 🔗 godane also a few of torrentfreak
15:28 🔗 hiker1 alard: I think my content-length is too short
15:28 🔗 hiker1 WARC content length
15:29 🔗 alard hiker1: There are a lot of \r\n at the end of the response record.
15:29 🔗 hiker1 alard: What do you mean?
15:29 🔗 hiker1 Are you saying I have too many?
15:30 🔗 alard Perhaps. I thought there should be two, so \r\n\r\n, but you have four, \r\n\r\n\r\n\r\n. You also have four bytes too many, so that matches.
15:31 🔗 alard Does the gzipped data always end with \r\n\r\n?
15:31 🔗 hiker1 I don't know.
15:32 🔗 hiker1 I fixed the extra \r\n in my program
15:32 🔗 hiker1 no more trailing data notices
15:33 🔗 alard The warc standard asks for \r\n\r\n at the end of the record.
15:34 🔗 hiker1 Is that included in the block, or after the record?
15:34 🔗 hiker1 as in, should the WARC length include it or not include it
15:34 🔗 alard No, that's not included in the block.
15:35 🔗 hiker1 hmm
15:35 🔗 alard I think the Content-Length for the warc should not include the two \r\n.
15:36 🔗 alard Since you have (or had?) \r\n\r\n + \r\n\r\n in your file, the first \r\n\r\n is from the HTTP response.
15:36 🔗 hiker1 So is having four of them correct?
15:37 🔗 alard I think you only need two \r\n, since the gzipped HTTP body just ends with gzipped data.
15:38 🔗 hiker1 Yes, I think it should only be two.
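(For reference, a rough sketch of how a response record is laid out under that reading of the spec: Content-Length covers only the block, and the final CRLF CRLF sits outside it:)
    WARC/1.0
    WARC-Type: response
    Content-Length: <bytes in the block below only>
    <blank line>
    HTTP/1.1 200 OK
    ...response headers...
    <blank line>
    ...response body (gzipped data just ends, no extra CRLF)...
    \r\n\r\n   <- record terminator, not counted in Content-Length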
15:39 🔗 hiker1 alard: Should I decompress the gzipped content?
15:39 🔗 hiker1 sorry. Should I decompress the gzipped content body of the HTTP response?
15:39 🔗 hiker1 or should I have set the crawler to not accept gzip in the first place?
15:42 🔗 alard If you decompress, you should also remove the Content-Encoding header. But I wouldn't do that (you're supposed to keep the original server response).
15:42 🔗 alard Not sending the Accept-Encoding header might be better.
15:42 🔗 hiker1 That spends a lot of bandwidth. I think I will leave it compressed.
15:45 🔗 alard This is from the Heritrix release notes: "FetchHTTP also now includes the parameter 'acceptCompression' which if true will cause Heritrix requests to include an "Accept-Encoding: gzip,deflate" header offering to receive compressed responses. (The default for this parameter is false for now.)
15:45 🔗 alard As always, responses are saved to ARC/WARC files exactly as they are received, and some bulk access/viewing tools may not currently support chunked/compressed responses. (Future updates to the 'Wayback' tool will address this.)"
15:46 🔗 alard So it's probably okay to ask for gzip.
15:48 🔗 S[h]O[r]T godane how many videos in your list
15:49 🔗 godane will check
15:49 🔗 godane 37616
15:49 🔗 S[h]O[r]T ok
15:49 🔗 S[h]O[r]T i got the same amount :)
15:49 🔗 hiker1 that's a lot of videos.
15:50 🔗 godane also know that the hd videos may not be in this
15:50 🔗 S[h]O[r]T there's like a button to enable hd?
15:50 🔗 godane yes
15:50 🔗 S[h]O[r]T link to an hd video if you got one offhand
15:50 🔗 hiker1 alard: The project I'm working on is at https://github.com/iramari/WarcMiddleware in case you are interested in trying it
15:51 🔗 godane not all videos are hd
15:51 🔗 godane or have an hd option
15:51 🔗 godane but i think most hd videos you just change _flv.flv to _flvhd.flv
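(If that pattern holds, an HD candidate list could be derived from the existing one with a one-liner like this; videos.txt and videos-hd.txt are placeholder names:)
    sed 's/_flv\.flv$/_flvhd.flv/' videos.txt > videos-hd.txt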
15:52 🔗 balrog_ any chance someone can look at the yahoo group archiver and see what's wrong with it?
15:52 🔗 balrog_ I'm no good at perl
15:52 🔗 hiker1 No one is truly good at perl.
15:52 🔗 balrog_ http://grabyahoogroup.svn.sourceforge.net/viewvc/grabyahoogroup/trunk/yahoo_group/
15:53 🔗 balrog_ specifically grabyahoogroup.pl
15:53 🔗 hiker1 Yep. that's definitely Perl.
15:53 🔗 balrog_ there's TONS of good stuff on yahoo groups, and most groups require registration and approval to see it :/
15:54 🔗 balrog_ (why? any group that doesn't require approval will get spammed)
15:55 🔗 alard godane: With the current warcproxy, entering urls seems to work for me, so I can't find your problem.
15:55 🔗 alard Can you click on the urls in the list?
15:55 🔗 hiker1 Yes, I am able to enter urls as well.
15:56 🔗 hiker1 be sure that the url has a trailing '/', and also include www. if the url requires it
15:56 🔗 alard Also, SSL/https is difficult, it seems. The proxy probably would have to simulate the encryption.
15:56 🔗 hiker1 the trailing / is only needed for the root of the website
15:56 🔗 alard Is the / still needed?
15:56 🔗 hiker1 alard: or rewrite it all to http
15:56 🔗 hiker1 alard: did you change it?
15:57 🔗 alard I tried to, yes, but that may have introduced new problems: https://github.com/alard/warc-proxy/commit/33ca7e5e30722e8ee40e6a0ed9d3828b82973171
15:58 🔗 alard (You'll have to rebuild the .idx files.)
15:58 🔗 hiker1 testing
15:58 🔗 hiker1 alard: I think your code adds it to every url
15:59 🔗 alard Rewriting https to http would work, perhaps, but once you start rewriting you don't really need a proxy.
16:00 🔗 hiker1 nevermind. Yes, that fix works
16:01 🔗 godane its weird
16:01 🔗 S[h]O[r]T i guess we can just re-run the list with flvhd after. im trying to see if the xml supports getting the hd video url, then we can just have a separate list and not have to run through 37k urls trying
16:01 🔗 godane i'm looking at my romshepherd.com dump
16:02 🔗 godane i get robots.txt and images just fine
16:02 🔗 alard godane: Is the .warc.gz online somewhere?
16:02 🔗 godane no
16:03 🔗 godane its a very old one
16:03 🔗 godane but you can use my stillflying.net warc.gz
16:03 🔗 godane https://archive.org/details/stillflying.net-20120905-mirror
16:03 🔗 S[h]O[r]T alard, how did you find out the GUIDs of the blogs
16:04 🔗 alard Searching with Google/Yahoo.
16:04 🔗 alard For some blogs you find the blog name, for others the GUID.
16:05 🔗 S[h]O[r]T do you have another blog by guid you can link me to?
16:08 🔗 godane also opening a url folder takes a very long time now
16:09 🔗 alard S[h]O[r]T: http://tracker.archiveteam.org:8124/ids.txt
16:09 🔗 balrog_ if anyone wants to take a shot at fixing that yahoo group downloader, I would greatly appreciate it
16:10 🔗 SketchCow Just wanted to drop this in here before I go offline due to stupid wireless.
16:11 🔗 SketchCow I'd like to do a project to make a Wiki and translate metadata to and from Internet Archive to it.
16:11 🔗 SketchCow So we can finally get decent metadata.
16:12 🔗 SketchCow Since I expect the actual site to have collaborated metadata 15 minutes after never, a moderated way for it to happen will be a good second chance.
16:13 🔗 SketchCow Anyway, something to think about, I know I have.
16:14 🔗 alard hiker1: So you've probably removed this? https://github.com/iramari/WarcMiddleware/blob/master/warcmiddleware.py#L81
16:15 🔗 hiker1 I did not. That entire file is sort of deprecated.
16:15 🔗 hiker1 alard: This is the version that is used https://github.com/iramari/WarcMiddleware/blob/master/warcclientfactory.py
16:16 🔗 hiker1 At the bottom of this recent commit was the change: https://github.com/iramari/WarcMiddleware/commit/bb69bbf6c19bbc7df150f4bc671e7406257eb750
16:16 🔗 alard Oh, okay. I was happy I found something, but that commit seems fine. There's no reason to add these \r\n's.
16:18 🔗 hiker1 What do you think about the project? Using scrapy + WARC?
16:19 🔗 alard I didn't know Scrapy, but adding WARC support to web archivers is always a good idea, I think.
16:21 🔗 alard Why do you have the DownloaderMiddleware if that's not recommended?
16:21 🔗 hiker1 because that was what I built first
16:24 🔗 hiker1 I should probably delete the file.
16:27 🔗 alard Yes, if it creates substandard warcs.
16:28 🔗 alard Although if I look at the Scrapy architecture, DownloaderMiddleware seems to be the way to do this, and the ClientFactory is a hack.
16:28 🔗 hiker1 I'm afraid it does. Scrapy does not seem to send enough information through its DownloaderMiddleware interface to properly create WARC files
16:28 🔗 alard Then modify Scrapy!
16:29 🔗 hiker1 hah
16:29 🔗 hiker1 It's only a two line fix I think
16:29 🔗 hiker1 Scrapy skips saving what it calls the start_line, which contains the HTTP/1.1 or whatever with the 200 OK
16:29 🔗 hiker1 so it throws that out.
16:30 🔗 hiker1 which is annoying if you are trying to create a WARC file.
16:30 🔗 hiker1 ClientFactory does seem a bit of a hack, but it lets me capture the raw data without any parsing or reconstruction.
16:31 🔗 alard Can't you send a patch to the Scrapy people?
16:31 🔗 hiker1 probably. I never contribute to projects much though.
16:35 🔗 alard You have to start with something.
16:36 🔗 hiker1 telling someone in IRC is so much easier though
16:36 🔗 hiker1 :)
16:36 🔗 hiker1 I just learned how to clone someone else's github repo.
16:38 🔗 alard Heh.
16:38 🔗 hiker1 alard: Isn't it better to be able to grab the raw data w/ the ClientFactory than to reconstruct it from headers?
16:39 🔗 alard godane: Like you said, the warcproxy doesn't like your stillflying.net warc. It's really slow.
16:40 🔗 alard hiker1: Yes. So I'd think you would need to modify Scrapy so that it keeps the raw data. I don't know how Scrapy works, so I don't know where you'd put something like that, but I assume some data gets passed on.
16:41 🔗 hiker1 Some information is passed on, yes. I'm not sure the Scrapy devs would want to expose the raw data. Maybe they'd be okay with it, I don't know.
16:42 🔗 alard You can always submit the patch and see what happens. What would be wrong about keeping the raw data?
16:42 🔗 hiker1 They like exposing the parsed data.
16:46 🔗 hiker1 I'm looking in to how the data could be passed.
16:49 🔗 alard As a property of the Request and Response? https://scrapy.readthedocs.org/en/latest/topics/request-response.html
17:16 🔗 hiker1 alard: The only way to save the raw data would be to basically merge parts of the clientfactory I wrote into the actual one to add a raw_response argument to Request and Response.
17:17 🔗 hiker1 As it stands, the DownloaderMiddleware version saves gzipped responses uncompressed, because Scrapy uncompresses them before handing them off to the middleware.
17:20 🔗 alard hiker1: And if you change the priority of the WarcMiddleware?
17:37 🔗 hiker1 alard: that worked
17:37 🔗 hiker1 good idea
17:37 🔗 alard You'll probably also want to get to the response before the redirect-handler etc. do anything.
17:38 🔗 hiker1 It is changing the order of the HTTP headers
17:39 🔗 hiker1 probably because it's storing them in a list and reconstructing them
18:13 🔗 alard godane: Your warc file is faulty. It was made with one of the early Wget-warc versions, so it contains responses with Transfer-Encoding: chunked headers but 'de-chunked' bodies.
18:27 🔗 godane oh
18:28 🔗 godane but i was able to browse it before in warc-proxy
18:35 🔗 godane also i was using wget 1.14
18:36 🔗 godane so this could mean all warcs since august are faulty
18:36 🔗 godane please make warc-proxy work with bad warcs
18:37 🔗 godane cause if this bug is in wget 1.14 then all warcs will have problems
18:38 🔗 hiker1 we should probably write a clean-warc tool to fix malformed warcs.
18:40 🔗 godane i'm doing another mirror of stillflying.net
18:40 🔗 godane if it fails this time i blame warc-proxy
18:45 🔗 godane does this patch fix it: https://github.com/alard/warc-proxy/commit/8717f33b642f414de896dcafb2e91a3dc27c38ca
18:55 🔗 godane so that patch is not working
19:38 🔗 godane alard: ok so i'm testing warc-proxy in midori
19:38 🔗 godane its working a lot faster here
19:39 🔗 godane but using the proxy in midori will block everything
19:39 🔗 godane even localhost:8000
19:40 🔗 godane ok got it working
19:40 🔗 godane can't use localhost:8000 when using proxy
19:43 🔗 godane alard: i get connection reset by peer error with warc-proxy
19:49 🔗 hiker1 midori?
19:55 🔗 godane its a webgtk browser
19:56 🔗 godane part of my problem with midori is no pages load
19:56 🔗 godane i did get a view of stillflying.net but nothing in torrentfreak.com
20:50 🔗 alard godane is no longer here, but here's a correction: the warc file is fine, but the warcproxy removes the chunking.
