Time | Nickname | Message
00:52 | godane | balrog_: i just mirrored it
01:26 | godane | balrog_: uploaded: http://archive.org/details/www.mindsetcomputing.com-20130104-mirror
01:27 | balrog_ | awesome :)
01:38 | db48x | heh, http://thewholegoddamnedinternet.com has a cvs keyword substitution at the bottom
01:38 | db48x | that's dedication
01:58 | godane | uploaded: http://archive.org/details/www.raygunrevival.com-20130103-mirror
02:15 | godane | you guys may want this: https://www.youtube.com/user/TorontoPETUsersGroup
02:16 | godane | world of commodore video talks
02:20 | hiker1 | wow there are still talks about commodore.
02:43 | hiker1 | godane: What options do you typically use on wget?
03:01 | godane | this is what i used for the raygunrevival.com grab: wget "http://$website/Published/" --mirror --warc-file=$website-$(date +%Y%m%d) --reject-regex='(replytocom)' --warc-cdx -E -o wget.log
03:02 | godane | website="www.raygunrevival.com"
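Putting godane's two lines together gives a reusable template for this kind of grab. A minimal sketch, assuming wget 1.14 or later (both --warc-file and --reject-regex need a recent wget):

    # Mirror a site into a dated WARC (wget 1.14+):
    #   --mirror          recursive download with timestamping
    #   --warc-file       prefix for the .warc.gz written alongside the mirror
    #   --warc-cdx        also write a CDX index for the WARC
    #   --reject-regex    skip WordPress ?replytocom duplicate urls
    #   -E                --adjust-extension: save html pages as .html
    #   -o                write the log to a file instead of the terminal
    website="www.raygunrevival.com"
    wget "http://$website/Published/" --mirror \
        --warc-file="$website-$(date +%Y%m%d)" --warc-cdx \
        --reject-regex='(replytocom)' -E -o wget.log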
03:02 | hiker1 | thanks
03:10 | hiker1 | wow that site was big
03:16 | godane | i had to get the Published folder first cause no of the .pdf links go there
03:16 | godane | *no=none
03:17 | godane | anyways i'm uploading my arstechnica image dumps
03:28 | godane | http://forums.g4tv.com/showthread.php?187820-Save-G4TV-com-%28the-website%29!
03:28 | godane | i'm doing my best to save the forums but you guys need to get the videos
03:29 | godane | not every video has a .mp4 file in it
04:03 | S[h]O[r]T | someone do the hard work for making a script to save the video :P
04:06 | godane | i'm trying to figure it out now
04:06 | godane | best i can get is the url folder
04:07 | godane | like for http://www.g4tv.com/videos/37668/bible-adventures-gameplay/
04:07 | godane | the video file will be in http://vids.g4tv.com/videoDB/037/668/video37668/
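The directory, at least, can be derived from the id alone: zero-pad it to six digits and split into two groups of three. A sketch under that assumption (the pattern is inferred from this single example, 37668 -> 037/668):

    # Hypothetical helper: video id -> vids.g4tv.com directory url.
    id=37668
    padded=$(printf '%06d' "$id")   # 037668
    echo "http://vids.g4tv.com/videoDB/${padded:0:3}/${padded:3:3}/video${id}/"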
04:08 | godane | but nothing in the html gives the name
04:08 | godane | the filename i mean
04:10 | S[h]O[r]T | http://vids.g4tv.com/videoDB/037/668/video37668/tr_bible_adventures_FS_flv.flv
04:10 | godane | how did you get the filename?
04:11 | godane | i normally have to download it using a plugin in firefox
04:12 | S[h]O[r]T | ^^ you can prob use chromes inspector too. i should have thought of that.. but i used a program called streamtransport that pulls the flash http://www.streamtransport.com/
04:12 | S[h]O[r]T | question is whether we can guess the filename at all or predict it based on the title.
04:13 | S[h]O[r]T | if you do chrome > inspect element and then the network tab you can see it if you catch it as it loads
04:18 | S[h]O[r]T | http://www.g4tv.com/xml/BroadbandPlayerService.asmx/GetEmbeddedVideo?videoKey=37668&strPlaylist=&playLargeVideo=true&excludedVideoKeys=&playlistType=normal&maxPlaylistSize=0
04:19 | S[h]O[r]T | just key the id # into videoKey, and its in FilePath http://www.g4tv.com/VideoRequest.ashx?v=37668&r=http%3a%2f%2fvids.g4tv.com%2fvideoDB%2f037%2f668%2fvideo37668%2ftr_bible_adventures_FS_flv.flv
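That makes the lookup one XML request per id. A rough sketch of the per-id step (the endpoint is quoted so the shell doesn't eat the ampersands; the grep pattern is an assumption about how the FilePath url appears in the XML):

    # Fetch the player XML for one id and pull out the direct .flv url.
    id=37668
    wget -q -O - "http://www.g4tv.com/xml/BroadbandPlayerService.asmx/GetEmbeddedVideo?videoKey=${id}&strPlaylist=&playLargeVideo=true&excludedVideoKeys=&playlistType=normal&maxPlaylistSize=0" \
        | grep -o 'http://vids\.g4tv\.com/[^<"]*\.flv' | head -n 1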
04:19 | godane | fucking a
04:20 | S[h]O[r]T | i imagine i can run through a list of ids and get all the xml
04:21 | godane | ok
04:21 | S[h]O[r]T | and then grep out the filepaths and then wget that list of videos
04:21 | godane | yes
04:22 | godane | but i can also imagine that we can create a script that will download the file and then up it to archive.org
04:22 | godane | cause we have desc and date of post
04:23 | godane | also title
04:23 | godane | the url for archiving would be best as something like video37668_$filename or something
04:24 | godane | also grab the thumbnail image too and upload it with it
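The upload half could go through archive.org's S3-compatible API, one item per video. A loose sketch using godane's video37668_$filename naming idea; the filename, title, and ACCESS:SECRET keys here are placeholders, not anything confirmed in the channel:

    # Create an archive.org item for one video via the S3-compatible API.
    id=37668
    file="tr_bible_adventures_FS_flv.flv"   # hypothetical, from the example above
    identifier="video${id}_${file%.flv}"
    curl --location \
         --header "authorization: LOW ACCESS:SECRET" \
         --header "x-amz-auto-make-bucket:1" \
         --header "x-archive-meta-mediatype:movies" \
         --header "x-archive-meta-title:Bible Adventures Gameplay" \
         --upload-file "$file" \
         "http://s3.us.archive.org/${identifier}/${file}"
    # Repeat the --upload-file step for the thumbnail into the same identifier.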
04:24 | godane | there are tons of early techtv stuff on there
04:25 | godane | some full episodes in 56k
04:27 | godane | gamespot tv episode in 56k: http://www.g4tv.com/videos/24430/thief-2-pc-review/
04:28 | S[h]O[r]T | yeah trying to figure out why wget is not keeping the rest of the url after my range
04:28 | S[h]O[r]T | http://www.g4tv.com/xml/BroadbandPlayerService.asmx/GetEmbeddedVideo?videoKey={0..5}&strPlaylist=&playLargeVideo=true&excludedVideoKeys=&playlistType=normal&maxPlaylistSize=0
04:28 | S[h]O[r]T | it cuts off after the range in braces
04:29 | S[h]O[r]T | i guess maybe thats not wget but bash related
04:30 | S[h]O[r]T | well w/e i can just generate a list of urls
04:39 | godane | i'm making a warc of the xml output
04:39 | S[h]O[r]T | of everything?
04:39 | S[h]O[r]T | *every id
04:39 | godane | i think so
04:40 | S[h]O[r]T | doing what
04:40 | godane | the latest id is 61790
04:43 | godane | thats on the front page
04:44 | S[h]O[r]T | what did you do to generate the list
04:44 | godane | i'm planning on zcat *.warc.gz | grep 'vids.g4tv.com' or something
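If the XML responses are sitting in WARCs, the url list falls out of a short pipeline (a sketch; the grep pattern assumes the .flv urls appear verbatim in the XML):

    # Pull the direct video urls out of the grabbed XML, dedupe into a list.
    zcat *.warc.gz | grep -o 'http://vids\.g4tv\.com/[^<"]*\.flv' | sort -u > videos.txt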
04:44
🔗
|
S[h]O[r]T |
i mean the list of urls to leech the xmls |
04:45
🔗
|
S[h]O[r]T |
are you going to download all the videos too or dont have the bw? |
04:46
🔗
|
godane |
i don't think i can download all the videos |
04:46
🔗
|
db48x |
S[h]O[r]T: that url has an ampersand in it |
04:47
🔗
|
db48x |
put the url in double quotes |
04:47
🔗
|
S[h]O[r]T |
like ""http://"" or do you just mean "http" |
04:47
🔗
|
S[h]O[r]T |
i cant do "" since the range doesnt execute, it takes it literally |
04:48
🔗
|
db48x |
the alternative is to escape the ampersands with backslashes |
04:48
🔗
|
db48x |
\&strPlaylist= |
04:48
🔗
|
db48x |
actually, I dunno if that will work; lemme test |
04:48
🔗
|
S[h]O[r]T |
yeah, ill try that. ive got a php script going generating 0-99999 but taking awhile :P |
04:49
🔗
|
db48x |
yes, it will |
04:49
🔗
|
db48x |
err, ok |
04:49
🔗
|
db48x |
why not just use a shell loop? |
04:49
🔗
|
S[h]O[r]T |
<---half retard |
04:49
🔗
|
S[h]O[r]T |
if it works or works eventually good enough for me :P |
04:50
🔗
|
db48x |
ok :) |
04:50
🔗
|
godane |
i can seed out the list now :-D |
04:50
🔗
|
S[h]O[r]T |
or if godane would type faster :P |
04:51
🔗
|
db48x |
for future reference: |
04:51
🔗
|
S[h]O[r]T |
ill take whatever list you have. ive got the bw/space to leech all the videos |
04:51
🔗
|
db48x |
for (( i=1; i<=99999; i++ )) |
04:51
🔗
|
db48x |
do |
04:52
🔗
|
db48x |
wget ... "http://example.com/$i" |
04:52
🔗
|
db48x |
done |
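The quoting and the {0..5} range fight each other: double quotes protect the ampersands from the shell but also suppress brace expansion. A loop avoids the conflict, because $i still expands inside double quotes. A sketch of db48x's suggestion with the real endpoint filled in:

    # Quotes keep the shell away from the ampersands; $i expands anyway.
    mkdir -p xml
    for (( i=1; i<=61790; i++ ))
    do
        wget -q -O "xml/${i}.xml" \
            "http://www.g4tv.com/xml/BroadbandPlayerService.asmx/GetEmbeddedVideo?videoKey=${i}&strPlaylist=&playLargeVideo=true&excludedVideoKeys=&playlistType=normal&maxPlaylistSize=0"
    done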
04:53 | godane | i did for i in $(seq 1 61790); do
04:53 | godane | echo "http://example.com/$i" >> index.txt
04:53 | godane | done
04:53 | S[h]O[r]T | you're probably right thats the last id but you never know whats published and isnt
04:53 | godane | wget -x -i index.txt -o wget.log
04:57 | S[h]O[r]T | some of the files return videos are no longer active vs doesnt exist
06:07 | db48x | http://retractionwatch.wordpress.com/2013/01/03/owner-of-science-fraud-site-suspended-for-legal-threats-identifies-himself-talks-about-next-steps/
06:08 | db48x | might be too late for the original site, but the planned site is something we should make sure to save
06:59 | joepie91 | db48x: shot him an email
06:59 | joepie91 | who knows, maybe I can help out hosting the original stuff
07:26 | underscor | No idea if this has been linked in here before
07:26 | underscor | but it's a fun toy
07:26 | underscor | http://monitor.us.archive.org/weathermap/weathermap.html
08:32 | chronomex | spiffy
08:42 | Nemo_bis | grr doesn't load
08:42 | Nemo_bis | ah
08:43 | Nemo_bis | oh, so when the traffic with HE is big I know s3 will suck
08:44 | Nemo_bis | [doesn't work with https]
12:12 | alard | I have a Yahoo blogs question.
12:12 | alard | The problem is this: each blog is available via the GUID of its owner, e.g.
12:12 | alard | http://blog.yahoo.com/_FWXMIGZ6AAZ66QJABH727HC4HM/
12:12 | alard | which links to this profile here
12:12 | alard | http://profile.yahoo.com/FWXMIGZ6AAZ66QJABH727HC4HM/
12:12 | alard | However, at least some of the blogs also have a friendlier name, see
12:12 | alard | http://blog.yahoo.com/lopyeuthuong/
12:12 | alard | It's the same blog.
12:12 | alard | There's a link from the friendly-name version to the profile, so it's possible to go from a name to a GUID. I'd like to go from a GUID to a name. Any ideas?
12:12 | alard | (Context: tuankiet has been collecting Vietnamese Yahoo blogs to be archived, with a lot of GUIDs. If it's possible I'd also like to archive the blogs via their friendly URLs.)
12:21 | chronomex | huh.
12:23 | hiker1 | alard: Using warc-proxy, for one of the records I have in a warc file it sits trying to serve it to the browser but never sends it. If I remove the record it works. Could you take a look at it?
12:24 | alard | hiker1: Yes, where is it?
12:25 | hiker1 | Give me one minute to upload it + a version that has the record removed.
12:30 | hiker1 | alard: http://www.fileden.com/files/2012/10/26/3360748/Broken%20Warc.zip
12:32 | hiker1 | alard: A different bug I found is that if the root record does not have a '/' at the end, warc-proxy will not serve it.
12:33 | hiker1 | And it becomes impossible to access that record, at least through chrome. Chrome always seems to request the site with the / at the end
12:34 | alard | You mean something like: WARC-Target-URI: http://example.com
12:34 | hiker1 | yes.
12:35 | alard | It's very tempting to say that that's a bug in the tool that made the warc.
12:36 | hiker1 | My program does not require it, so perhaps it is my fault. But I could see other tools not requiring it, so perhaps the behavior should be changed in both places.
12:37 | alard | Apparently the / is not required: https://tools.ietf.org/html/rfc2616#section-3.2.2
12:39 | alard | (Should we do this somewhere else?)
12:40 | alard | Could it be that the record you removed is the only record without a Content-Length header?
12:40 | hiker1 | the WARC record has a content-length header
12:40 | hiker1 | the HTTP header does not
12:44 | alard | Do you have a modified version that can handle uncompressed warcs?
12:44 | ersi | I've had the same 'problem' with a Target-URI missing /
12:44 | ersi | warc-proxy handles uncompressed records just fine for what I've tried
12:45 | hiker1 | alard: I'm not sure I understand you
12:45 | alard | We're talking about this warc-proxy, right? https://github.com/alard/warc-proxy At least my version of the Firefox extension doesn't show .warc, only .warc.gz.
12:46 | hiker1 | It's always shown .warc for me.
12:46 | hiker1 | I use Chrome
12:46 | ersi | I was using it without the firefox extension. Just started the proxy, then browsed http://warc/
12:47 | alard | Ah, I see. I'm using the firefox extension (which uses a different filename filter).
12:47 | ersi | alard: By the way, http://warc/ is beautiful :]
12:48 | alard | Thanks. (It's the same as in the Firefox extension, except for the native file selection dialog.)
12:51 | hiker1 | Does Hanzo warctools even check Content-Length? I don't think so
12:56 | hiker1 | Hanzo warctools does add it to a list of headers
12:56 | hiker1 | alard: in warc-proxy, the content-length for the 404 is set to 12. It should be 91. Around line 257.
12:56 | alard | What may be happening is this: if there's no Content-Length header in the HTTP response, the server should close the connection when everything is sent. The warc-proxy doesn't close the connection.
13:00 | hiker1 | That seems very likely.
13:01 | hiker1 | alard: In Chrome's network view, it shows the connection is never closed.
13:06 | alard | Yes. I'm currently trying to understand https://tools.ietf.org/html/rfc2616#section-14.10
13:06
🔗
|
godane |
uploaded: http://archive.org/details/CSG_TV-Game_Collection_Complete_VHS_-_1996 |
13:52
🔗
|
hiker1 |
alard: I fixed it by having warc-proxy calculate the content-length if it isn't already present in a header. |
13:54
🔗
|
hiker1 |
It introduces a little bit of complexity to the __call__ method because now it uses parse_http_response |
13:56
🔗
|
hiker1 |
Alternatively, and perhaps a better option, would be for warc-proxy to forcibly close the connection, since this is what the server must do since it does not send a content-length. |
14:04
🔗
|
godane |
so i got the vidoe file url list |
14:04
🔗
|
godane |
for the g4 videos |
14:16
🔗
|
hiker1 |
alard: you still there? |
14:18
🔗
|
alard |
Yes. I was concentrating. :) I've now changed the warc-proxy so it sends the correct, reconstructed headers. |
14:18
🔗
|
hiker1 |
I found that there is a very easy fix |
14:19
🔗
|
hiker1 |
http_server = tornado.httpserver.HTTPServer(my_application, no_keep_alive=True) |
14:19
🔗
|
hiker1 |
This tells Tornado to close all connections regardless of what version is used |
14:19
🔗
|
hiker1 |
*http version |
14:19
🔗
|
alard |
That's not a fix, that's a hack. :) If you disable keep-alive the request shouldn't have a Connection: keep-alive header. |
14:20
🔗
|
hiker1 |
Tornado recommends it |
14:20
🔗
|
godane |
alard: i got g4tv video url list |
14:20
🔗
|
godane |
alard: i need you to make a script for downloading and uploading to archive.org |
14:20
🔗
|
alard |
hiker1: https://github.com/alard/warc-proxy/commit/9d107976ccd47c244669b5e680d67a5caf6e103c |
14:20
🔗
|
hiker1 |
`HTTPServer` is a very basic connection handler. Beyond parsing the HTTP request body and headers, the only HTTP semantics implemented in `HTTPServer` is HTTP/1.1 keep-alive connections. We do not, however, implement chunked encoding, so the request callback must provide a ``Content-Length`` header or implement chunked encoding for HTTP/1.1 requests for the server to run correctly for HTTP/1.1 clients. If the request handler is una |
14:20
🔗
|
hiker1 |
ble to do this, you can provide the ``no_keep_alive`` argument to the `HTTPServer` constructor, which will ensure the connection is closed on every request no matter what HTTP version the client is using. |
14:22 | alard | hiker1: (Plus the commits before.)
14:22 | alard | godane: Great. If I may ask, why don't you write the script yourself?
14:22 | hiker1 | Either method should work. I think I prefer the no_keep_alive method because then it sends exactly what was received from the server. It's also simple
14:23 | alard | It's a good reason to learn it, and people here are probably more than happy to help you.
14:23 | alard | Yes, but why does it have to send the exact response from the server?
14:23 | hiker1 | alard: That's true. As long as the exact response is saved, it doesn't matter.
14:24 | hiker1 | *saved in the WARC already.
14:28 | hiker1 | thank you for fixing it upstream. I think this has been the root of a few problems I've been having
14:28 | hiker1 | yep. other hacks I added are no longer needed
14:34 | hiker1 | Some of the google ad scripts were causing failures with warc-proxy but are now resolved
14:34 | hiker1 | alard: One other fix: line 193 changed to if mime_type and data["type"] and not mime_type.search(data["type"]):
14:36 | godane | my www.eff.org warc is still going to the live site with warc-proxy
14:38 | hiker1 | Some of them don't have a type set, so that line throws an error
14:38 | alard | hiker1: Shouldn't that be "type" in data then?
14:39 | godane | it still looks like https will always go to the live web
14:43 | alard | godane: Ah yes, that was the other bug.
14:43 | hiker1 | alard: sure
14:43 | hiker1 | er...
14:43 | hiker1 | data["type"] is sometimes None.
14:43 | hiker1 | I think "type" in data would return True
14:44 | alard | So it should be: "type" in data and data["type"]
14:45 | hiker1 | yes
14:47 | alard | Actually, there's always a "type" field (line 113), so data["type"] will do.
14:47 | godane | alard: so now typing urls for archives doesn't work with warc-proxy
14:47 | godane | i'm using localhost port 8000 for my proxy
14:52 | godane | unless thats the problem
14:54 | alard | godane: What do you mean with typing urls for archives?
14:54 | alard | hiker1: I think the mime-type is None thing works now.
14:58 | godane | when i run the proxy i used to be able to type in urls that i know the archive will work with
14:58 | alard | Actually, it seems completely broken at the moment.
15:04 | alard | (If you remove the cached .idx files, that is.)
15:08 | SketchCow | Morning
15:08 | alard | Afternoon.
15:09 | godane | hey SketchCow
15:10 | godane | all my image dumps of arstechnica are uploaded now
15:16 | alard | hiker1: Any idea where these "trailing data in http response" messages come from? I only get them with your warcs.
15:25 | Nemo_bis | http://www.apachefoorumi.net/index.php?topic=68497.0
15:25 | SketchCow | Thank you, godane
15:26 | godane | you're welcome
15:26 | godane | now we got good dumps of arstechnica and engadget
15:28 | godane | also a few of torrentfreak
15:28 | hiker1 | alard: I think my content-length is too short
15:29 | hiker1 | WARC content length
15:29 | alard | hiker1: There are a lot of \r\n at the end of the response record.
15:29 | hiker1 | alard: What do you mean?
15:30 | hiker1 | Are you saying I have too many?
15:31 | alard | Perhaps. I thought there should be two, so \r\n\r\n, but you have four, \r\n\r\n\r\n\r\n. You also have four bytes too many, so that matches.
15:31 | alard | Does the gzipped data always end with \r\n\r\n?
15:32 | hiker1 | I don't know.
15:32 | hiker1 | I fixed the extra \r\n in my program
15:33 | hiker1 | no more trailing data notices
15:34 | alard | The warc standard asks for \r\n\r\n at the end of the record.
15:34 | hiker1 | Is that included in the block, or after the record?
15:34 | hiker1 | as in, should the WARC length include it or not include it
15:35 | alard | No, that's not included in the block.
15:35 | hiker1 | hmm
15:36 | alard | I think the Content-Length for the warc should not include the two \r\n.
15:36 | alard | Since you have (or had?) \r\n\r\n + \r\n\r\n in your file, the first \r\n\r\n is from the HTTP response.
15:37 | hiker1 | So is having four of them correct?
15:38 | alard | I think you only need two \r\n, since the gzipped HTTP body just ends with gzipped data.
15:39 | hiker1 | Yes, I think it should only be two.
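To summarize the record layout they converge on: WARC headers, a blank line, exactly Content-Length bytes of block (the verbatim HTTP response, gzipped body and all), then one \r\n\r\n separator that is not counted in Content-Length. A quick way to eyeball the tail of a suspect record (record.bin is a hypothetical single record cut out of a warc):

    # A well-formed record ends with exactly one \r\n\r\n (hex 0d 0a 0d 0a).
    tail -c 8 record.bin | xxd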
15:39 | hiker1 | alard: Should I decompress the gzipped content?
15:39 | hiker1 | sorry. Should I decompress the gzipped content body of the HTTP response?
15:42 | hiker1 | or should I have set the crawler to not accept gzip in the first place?
15:42 | alard | If you decompress, you should also remove the Content-Encoding header. But I wouldn't do that (you're supposed to keep the original server response).
15:42 | alard | Not sending the Accept-Encoding header might be better.
15:45 | hiker1 | That spends a lot of bandwidth. I think I will leave it compressed.
15:45 | alard | This is from the Heritrix release notes: "FetchHTTP also now includes the parameter 'acceptCompression' which if true will cause Heritrix requests to include an "Accept-Encoding: gzip,deflate" header offering to receive compressed responses. (The default for this parameter is false for now.)
15:46 | alard | As always, responses are saved to ARC/WARC files exactly as they are received, and some bulk access/viewing tools may not currently support chunked/compressed responses. (Future updates to the 'Wayback' tool will address this.)"
15:48 | alard | So it's probably okay to ask for gzip.
15:49 | S[h]O[r]T | godane how many videos in your list
15:49 | godane | will check
15:49 | godane | 37616
15:49 | S[h]O[r]T | ok
15:49 | S[h]O[r]T | i got the same amount :)
15:50 | hiker1 | that's a lot of videos.
15:50 | godane | also know that the hd videos may not be in this
15:50 | S[h]O[r]T | theres like a button to enable hd?
15:50 | godane | yes
15:50 | S[h]O[r]T | link to an hd video if you got one offhand
15:51 | hiker1 | alard: The project I'm working on is at https://github.com/iramari/WarcMiddleware in case you are interested in trying it
15:51 | godane | not all videos are hd
15:51 | godane | or have an hd option
15:52 | godane | but i think for most hd videos you just change _flv.flv to _flvhd.flv
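Assuming that suffix rule holds, the HD candidate list is a one-line transform of the SD list (videos.txt as built from the grep step earlier; expect 404s wherever no HD version exists):

    # Swap the SD suffix for the HD one to get candidate urls.
    sed 's/_flv\.flv$/_flvhd.flv/' videos.txt > videos-hd.txt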
15:52 | balrog_ | any chance someone can look at the yahoo group archiver and see what's wrong with it?
15:52 | balrog_ | I'm no good at perl
15:52 | hiker1 | No one is truly good at perl.
15:53 | balrog_ | http://grabyahoogroup.svn.sourceforge.net/viewvc/grabyahoogroup/trunk/yahoo_group/
15:53 | balrog_ | specifically grabyahoogroup.pl
15:53 | hiker1 | Yep. that's definitely Perl.
15:54 | balrog_ | there are TONS of good stuff on yahoo groups, and most groups require registration and approval to see it :/
15:55 | balrog_ | (why? any group that doesn't require approval will get spammed)
15:55 | alard | godane: With the current warcproxy, entering urls seems to work for me, so I can't find your problem.
15:55 | alard | Can you click on the urls in the list?
15:56 | hiker1 | Yes, I am able to enter urls as well.
15:56 | hiker1 | be sure that the url has a trailing '/', and also include www. if the url requires it
15:56 | alard | Also, SSL/https is difficult, it seems. The proxy probably would have to simulate the encryption.
15:56 | hiker1 | the trailing / is only needed for the root of the website
15:56 | alard | Is the / still needed?
15:56 | hiker1 | alard: or rewrite it all to http
15:57 | hiker1 | alard: did you change it?
15:58 | alard | I tried to, yes, but that may have introduced new problems: https://github.com/alard/warc-proxy/commit/33ca7e5e30722e8ee40e6a0ed9d3828b82973171
15:58 | alard | (You'll have to rebuild the .idx files.)
15:58 | hiker1 | testing
15:59 | hiker1 | alard: I think your code adds it to every url
16:00 | alard | Rewriting https to http would work, perhaps, but once you start rewriting you don't really need a proxy.
16:01 | hiker1 | nevermind. Yes, that fix works
16:01 | godane | its weird
16:01 | S[h]O[r]T | i guess we can just re-run the list with flvhd after. im trying to see if the xml supports getting the hd video url, then we can just have a separate list and not have to run through 37k urls trying
16:02 | godane | i'm looking at my romshepherd.com dump
16:02 | godane | i get robots.txt and images just fine
16:02 | alard | godane: Is the .warc.gz online somewhere?
16:03 | godane | no
16:03 | godane | its a very old one
16:03 | godane | but you can use my stillflying.net warc.gz
16:03 | godane | https://archive.org/details/stillflying.net-20120905-mirror
16:04 | S[h]O[r]T | alard, how did you find out the GUIDs of the blogs
16:04 | alard | Searching with Google/Yahoo.
16:05 | alard | For some blogs you find the blog name, for others the GUID.
16:08 | S[h]O[r]T | do you have another blog by guid you can link me to?
16:09 | godane | also opening a url folder takes a very long time now
16:09 | alard | S[h]O[r]T: http://tracker.archiveteam.org:8124/ids.txt
16:10 | balrog_ | if anyone wants to take a shot at fixing that yahoo group downloader, I would greatly appreciate it
16:11 | SketchCow | Just wanted to drop this in here before I go offline due to stupid wireless.
16:11 | SketchCow | I'd like to do a project to make a Wiki and translate metadata to and from Internet Archive to it.
16:12 | SketchCow | So we can finally get decent metadata.
16:13 | SketchCow | Since I expect the actual site to have collaborated metadata 15 minutes after never, a moderated way for it to happen will be a good second chance.
16:14 | SketchCow | Anyway, something to think about, I know I have.
16:15 | alard | hiker1: So you've probably removed this? https://github.com/iramari/WarcMiddleware/blob/master/warcmiddleware.py#L81
16:15 | hiker1 | I did not. That entire file is sort of deprecated.
16:16 | hiker1 | alard: This is the version that is used https://github.com/iramari/WarcMiddleware/blob/master/warcclientfactory.py
16:16 | hiker1 | At the bottom of this recent commit was the change: https://github.com/iramari/WarcMiddleware/commit/bb69bbf6c19bbc7df150f4bc671e7406257eb750
16:18 | alard | Oh, okay. I was happy I found something, but that commit seems fine. There's no reason to add these \r\n's.
16:19 | hiker1 | What do you think about the project? Using scrapy + WARC?
16:21 | alard | I didn't know Scrapy, but adding WARC support to web archivers is always a good idea, I think.
16:21 | alard | Why do you have the DownloaderMiddleware if that's not recommended?
16:24 | hiker1 | because that was what I built first
16:27 | hiker1 | I should probably delete the file.
16:28 | alard | Yes, if it creates substandard warcs.
16:28 | alard | Although if I look at the Scrapy architecture, DownloaderMiddleware seems to be the way to do this, and the ClientFactory is a hack.
16:28 | hiker1 | I'm afraid it does. Scrapy does not seem to send enough information through its DownloaderMiddleware interface to properly create WARC files
16:29 | alard | Then modify Scrapy!
16:29 | hiker1 | hah
16:29 | hiker1 | It's only a two line fix I think
16:29 | hiker1 | Scrapy skips saving what it calls the start_line, which contains HTTP/1.1 or whatever with the 200 OK
16:30 | hiker1 | so it throws that out.
16:30 | hiker1 | which is annoying if you are trying to create a WARC file.
16:31 | hiker1 | ClientFactory does seem a bit of a hack, but it lets me capture the raw data without any parsing or reconstruction.
16:31 | alard | Can't you send a patch to the Scrapy people?
16:35 | hiker1 | probably. I never contribute to projects much though.
16:36 | alard | You have to start with something.
16:36 | hiker1 | telling someone in IRC is so much easier though
16:36 | hiker1 | :)
16:38 | hiker1 | I just learned how to clone someone else's github repo.
16:38 | alard | Heh.
16:39 | hiker1 | alard: Isn't it better to be able to grab the raw data w/ the ClientFactory than to reconstruct it from headers?
16:40 | alard | godane: Like you said, the warcproxy doesn't like your stillflying.net warc. It's really slow.
16:41 | alard | hiker1: Yes. So I'd think you would need to modify Scrapy so that it keeps the raw data. I don't know how Scrapy works, so I don't know where you'd put something like that, but I assume some data gets passed on.
16:42 | hiker1 | Some information is passed on, yes. I'm not sure the Scrapy devs would want to expose the raw data. Maybe they'd be okay with it, I don't know.
16:42 | alard | You can always submit the patch and see what happens. What would be wrong about keeping the raw data?
16:46 | hiker1 | They like exposing the parsed data.
16:49 | hiker1 | I'm looking into how the data could be passed.
17:16 | alard | As a property of the Request and Response? https://scrapy.readthedocs.org/en/latest/topics/request-response.html
17:17 | hiker1 | alard: The only way to save the raw data would be to basically merge parts of the clientfactory I wrote into the actual one to add a raw_response argument to Request and Response.
17:20 | hiker1 | As it stands, the DownloaderMiddleware version saves gzipped responses uncompressed, because Scrapy uncompresses them before handing them off to the middleware.
17:37 | alard | hiker1: And if you change the priority of the WarcMiddleware?
17:37 | hiker1 | alard: that worked
17:37 | hiker1 | good idea
17:38 | alard | You'll probably also want to get to the response before the redirect-handler etc. do anything.
17:39 | hiker1 | It is changing the order of the HTTP headers
18:13 | hiker1 | probably because it's storing them in a list and reconstructing them
18:13
🔗
|
alard |
godane: Your warc file is faulty. It was made with one of the early Wget-warc versions, so it contains Transfer-Encoding: chunked responses with 'de-chunked' responses. |
18:27
🔗
|
godane |
oh |
18:28
🔗
|
godane |
but i was able to browse it before in warc-proxy |
18:35
🔗
|
godane |
also i was using wget 1.14 |
18:36
🔗
|
godane |
so this could mean all warc since august are faulty |
18:36
🔗
|
godane |
please make warc-proxy work with bad warcs |
18:37
🔗
|
godane |
cause if this bug is in wget 1.14 then all warc will have problems |
18:38
🔗
|
hiker1 |
we should probably write a clean-warc tool to fix malformed warcs. |
18:40
🔗
|
godane |
i'm doing another mirror of stillfly.net |
18:40
🔗
|
godane |
if it fails this time i blame warc-proxy |
18:45
🔗
|
godane |
does this patch fix it: https://github.com/alard/warc-proxy/commit/8717f33b642f414de896dcafb2e91a3dc27c38ca |
18:55
🔗
|
godane |
so that patch is not working |
19:38
🔗
|
godane |
alard: ok so i'm testing warc-proxy in midori |
19:38
🔗
|
godane |
its working a lot faster here but |
19:39
🔗
|
godane |
but using proxy in midori will block everything |
19:39
🔗
|
godane |
even localhost:8000 |
19:40
🔗
|
godane |
ok got it working |
19:40
🔗
|
godane |
can't use localhost:8000 when using proxy |
19:43
🔗
|
godane |
alard: i get connection reset by peer error with warc-proxy |
19:49
🔗
|
hiker1 |
midori? |
19:55
🔗
|
godane |
its a webgtk browser |
19:56
🔗
|
godane |
part of my problem with midori is no pages load |
19:56
🔗
|
godane |
i did get view in stillflying.net but nothing in torrentfreak.com |
20:50 | alard | godane is no longer here, but here's a correction: the warc file is fine, but the warcproxy removes the chunking.