09:25 &lt;SketchCow&gt; http://instagram.com/p/gszkewiOJq/
09:45 &lt;midas1&gt; lol SketchCow
11:32 &lt;SketchCow&gt; The uploading of what we got from blip.tv so far has been going well.
11:32 &lt;SketchCow&gt; Thanks again to joepie for the metadata extraction routines.
11:32 &lt;SketchCow&gt; Over 2,500 videos added
12:05 &lt;Nemo_bis&gt; SketchCow: did that kill s3 or something? It's been almost idle for 12 hours.
12:32 &lt;Nemo_bis&gt; I'm pushing some stuff too https://archive.org/services/collection-rss.php?collection=wikiteam
14:45 &lt;balrog&gt; it would be nice to archive http://vetusware.com but all the downloads are severely limited
14:46 &lt;balrog&gt; even with the highest "membership" you're limited to 10 downloads per day
15:20 &lt;midas1&gt; balrog: and you need to have karma first before getting there
15:20 &lt;balrog&gt; yes...
15:21 &lt;midas1&gt; would be great to archive though, they have a lot of old software
16:06 &lt;odie5533&gt; How can I determine the boundary offsets for a warc.gz file in Python?
16:11 &lt;joepie91&gt; odie5533: there are some WARC modules for Python that can do that afaik
16:12 &lt;Lord_Nigh&gt; 10 downloads per day? with 5 or 6 members at that level the site could be mirrored in a few months
16:13 &lt;odie5533&gt; joepie91: There are. Sadly, I couldn't quite make sense of the code =/
16:13 &lt;joepie91&gt; odie5533: you could just use them :)
16:14 &lt;odie5533&gt; I am using them, but some are quite slow, and I was hoping to make a really lightweight, fast version.
16:15 &lt;odie5533&gt; It works, but it doesn't determine offsets, and now I'm worried that the others are slow simply because they determine offsets, and that that's the only reason mine is fast at all. I don't actually know this for sure, since I can't determine offsets in my code.
16:16 &lt;odie5533&gt; Initial testing showed mine was approximately 5x as fast as Hanzo warctools, but again, that could be 100% because of the offsets.
16:19 &lt;DFJustin&gt; Lord_Nigh: sounds like a job for a RECAP-type extension
16:20 &lt;anounyym1&gt; http://wdl2.winworldpc.com/Abandonware%20Applications/PC/ there's some abandonware stuff
16:41 &lt;joepie91&gt; odie5533: I'd imagine it's just looking for headers and storing the position in the file
16:41 &lt;joepie91&gt; other than having to read the full file, I can't see it being slow
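joepie91's suggestion here (scan for record headers, store their file positions) can be sketched for an uncompressed WARC. Everything below, the sample records and the `record_offsets` helper, is an illustrative assumption rather than code from the channel; a robust indexer would honor each record's Content-Length instead of searching naively, since `WARC/1.0` could in principle appear inside a payload too.

```python
# Hypothetical sketch of "look for headers and store the pos in the file".
# rec1/rec2 are minimal stand-ins for real WARC records: header lines,
# a blank line ending the header block, then the two CRLF pairs that
# terminate a record.
rec1 = (b"WARC/1.0\r\n"
        b"WARC-Type: warcinfo\r\n"
        b"Content-Length: 0\r\n"
        b"\r\n"          # end of header block
        b"\r\n\r\n")     # record terminator
rec2 = (b"WARC/1.0\r\n"
        b"WARC-Type: response\r\n"
        b"Content-Length: 0\r\n"
        b"\r\n"
        b"\r\n\r\n")
data = rec1 + rec2

def record_offsets(buf):
    """Return the byte offset of every WARC version line in buf."""
    offsets = []
    pos = buf.find(b"WARC/1.0")
    while pos != -1:
        offsets.append(pos)
        pos = buf.find(b"WARC/1.0", pos + 1)
    return offsets
```

On `data` this returns `[0, len(rec1)]`; for a real multi-gigabyte WARC you would read in chunks rather than holding the whole file in memory.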
17:59 &lt;odie5533&gt; joepie91: finding the position in the file is not that easy; the compressed position is different from the uncompressed position
18:00 &lt;joepie91&gt; odie5533: right... WARC itself however isn't compressed
18:00 &lt;odie5533&gt; and I'm not sure how to find the compressed position
18:00 &lt;joepie91&gt; so your issue is with gzip indexing rather than warc indexing :)
18:01 &lt;odie5533&gt; joepie91: I am trying to find the offset in the compressed one, yes. gzip indexing.
18:02 &lt;joepie91&gt; odie5533: http://stackoverflow.com/questions/9317281/how-to-get-a-random-line-from-within-a-gzip-compressed-file-in-python-without-re
18:02 &lt;joepie91&gt; first answer may have relevant info
18:02 &lt;joepie91&gt; wrt seeking in gzip
18:04 &lt;odie5533&gt; Doesn't seem very helpful. I think there is a way to do it based on the fact that warc.gz files are concatenated .gz files.
18:06 &lt;joepie91&gt; odie5533: huh?
18:07 &lt;joepie91&gt; warc.gz is a gzipped WARC, not concatenated gzips?
18:07 &lt;odie5533&gt; Each WARC record is gzipped individually
18:26 &lt;odie5533&gt; joepie91: Seems like you can determine where an entry ended by using zlib's Decompress.unused_data.
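The `unused_data` idea can be sketched as follows; the in-memory two-record file and the `member_offsets` helper are assumptions for illustration. Each record is compressed as its own gzip member, so whatever bytes a `zlib` decompressor leaves in `unused_data` after one member ends must belong to the next:

```python
import gzip
import zlib

# Build a tiny in-memory "warc.gz": each record is its own gzip member,
# as record-at-time compressed WARCs are laid out. The record bodies are
# placeholders.
records = [b"WARC/1.0\r\nrecord one\r\n\r\n",
           b"WARC/1.0\r\nrecord two\r\n\r\n"]
parts = [gzip.compress(r) for r in records]
blob = b"".join(parts)

def member_offsets(buf):
    """Return the compressed byte offset at which each gzip member starts."""
    offsets = []
    pos = 0
    while pos < len(buf):
        offsets.append(pos)
        # wbits=16+MAX_WBITS tells zlib to expect a gzip header/trailer.
        d = zlib.decompressobj(16 + zlib.MAX_WBITS)
        d.decompress(buf[pos:])
        # Whatever the decompressor did not consume starts the next member.
        pos = len(buf) - len(d.unused_data)
    return offsets
```

Feeding the whole remaining buffer to each decompressor keeps the sketch short but is quadratic; an indexer meant for large files would feed fixed-size chunks and watch `d.eof` instead.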
20:41 &lt;joepie91&gt; odie5533: where do I find the Qt warc viewer?
20:42 &lt;zenguy_pc&gt; does anyone have a torrent for the videos deleted from conservatives.com? just found out about this scandal
21:07 &lt;adam_arch&gt; List of URLs from Conan O'Brien 20th anniversary that will go away at the end of day today: http://archive.org/~adam/conan20.txt
21:08 &lt;SmileyG&gt; adam_arch: #archivebot - I'll be adding them
21:08 &lt;adam_arch&gt; I've used the wayback save feature to grab the pages, but the videos will be a bit more labor-intensive. They do seem to be mp4s eventually, so those could be piped through wayback /save as well, but as far as I can tell they need user interaction to get the video URL
21:09 &lt;SketchCow&gt; Hi, gang.
21:09 &lt;adam_arch&gt; I was planning on just using Firebug to grab the video URLs, but I'm not sure I have the time to get it done by myself
21:09 &lt;SketchCow&gt; adam_arch works for the Internet Archive with me and wants to grab these Conan O'Brien videos before they're all deleted today.
21:10 &lt;SketchCow&gt; Can someone get on that? He is able to grab the website but we want the videos as well.
21:11 &lt;SmileyG&gt; ahhh
21:11 * SmileyG goes to find some people to throw at it
21:11 &lt;adam_arch&gt; sweet
21:12 &lt;SmileyG&gt; flu has blown my mind away but I can at least try shouting at people to do it
21:12 &lt;SketchCow&gt; In other news, I've uploaded 3000 blip.tv videos to archive today.
21:12 &lt;DFJustin&gt; $ ~/youtube-dl -g http://teamcoco.com/video/first-ten-years-of-conan-package-1
21:12 &lt;DFJustin&gt; http://cdn.teamcococdn.com/storage/2013-11/72871-Conan20_102813_Conan20_First10Comedy_Pkg1-1080p.mp4
21:12 &lt;DFJustin&gt; you should be able to do the same for the rest
21:12 &lt;SmileyG&gt; DFJustin: leave it with you then?
21:13 &lt;SketchCow&gt; Really? OK, I can do that.
21:13 &lt;DFJustin&gt; no, I'm at work
21:13 &lt;SmileyG&gt; for x in $(cat ./file); do youtube-dl $x; done
21:13 &lt;SmileyG&gt; \o/ for bad use of cats
21:14 &lt;SketchCow&gt; SmileyG: I would have gotten that one
21:14 &lt;SketchCow&gt; Do we have a best youtube-dl these days?
21:14 &lt;SketchCow&gt; Is https://yt-dl.org/downloads/2013.11.15.1/youtube-dl the best?
21:14 &lt;DFJustin&gt; I use http://rg3.github.io/youtube-dl/
21:15 &lt;DFJustin&gt; which I guess is the same
21:19 &lt;SketchCow&gt; [Teamcoco] 72965: Downloading data webpage
21:19 &lt;SketchCow&gt; [Teamcoco] 72965: Extracting information
21:19 &lt;SketchCow&gt; [Teamcoco] last-5-ln-guest-pkg01: Downloading webpage
21:19 &lt;SketchCow&gt; [download] 18.4% of 141.96MiB at 1.87MiB/s ETA 01:01
21:19 &lt;SketchCow&gt; [download] Destination: The Best of 'Late Night' Guests, Vol. 2-72965.mp4
21:19 &lt;SketchCow&gt; etc.
21:19 &lt;SketchCow&gt; adam_arch: OK, I got this. I'm downloading the videos more than the websites.
21:19 &lt;DFJustin&gt; well, downloading them with youtube-dl doesn't get the warc goodness
21:20 &lt;SketchCow&gt; No. But we did scrape them into wayback today. Just the videos didn't go
21:20 &lt;SketchCow&gt; ERROR: unable to download video data: HTTP Error 400: Bad Request
21:20 &lt;SketchCow&gt; [Teamcoco] 72980: Downloading data webpage
21:20 &lt;SketchCow&gt; [Teamcoco] 72980: Extracting information
21:20 &lt;SketchCow&gt; [Teamcoco] in-the-year-2000: Downloading webpage
21:33 &lt;adam_arch&gt; found that one: cdn.teamcococdn.com/storage/2013-11/72980-IntheYear2000-XDCAM%20HD%201080i60%2050mbs-1000k.mp4
21:34 &lt;adam_arch&gt; wayback /save doesn't seem to like it though
22:17 &lt;odie5533&gt; joepie91: on my github
22:17 &lt;odie5533&gt; joepie91: https://github.com/odie5533/WarcQtViewer
22:19 &lt;odie5533&gt; joepie91: where'd you hear about it? :)
22:32 &lt;odie5533&gt; Sorta related to Archiveteam: how do you guys archive personal files?
22:36 &lt;nico_&gt; hi
22:36 &lt;odie5533&gt; hi
22:36 &lt;Marcelo&gt; hi
22:36 &lt;joepie91&gt; odie5533: you mentioned it in here
22:36 &lt;joepie91&gt; :P
22:36 &lt;joepie91&gt; thanks
22:36 &lt;odie5533&gt; So I did :)
22:36 &lt;odie5533&gt; Sadly, it doesn't work well with large files, which is what I'm trying to fix
22:37 * joepie91 stares at .exe
22:37 &lt;joepie91&gt; :P
22:37 &lt;joepie91&gt; what is this kind of file? how do I open it?
22:38 &lt;odie5533&gt; joepie91: just double click it!
22:38 &lt;nico_&gt; hum, a .exe
22:38 &lt;joepie91&gt; odie5533: doesn't work
22:38 &lt;joepie91&gt; :P
22:38 &lt;joepie91&gt; :P :P :P
22:38 &lt;nico_&gt; virt-sandbox su nobody -c 'wine /tmp/a.exe'
22:38 &lt;odie5533&gt; Try clicking harder
22:38 &lt;joepie91&gt; nico_: oh, that looks like it might work :D
22:38 &lt;odie5533&gt; really weigh down on the mouse
22:38 &lt;joepie91&gt; haha
22:38 &lt;nico_&gt; joepie91: safety first ;)
22:39 &lt;yipdw&gt; odie5533: I tarsnap them
22:39 &lt;odie5533&gt; yipdw: hmm?
22:39 &lt;odie5533&gt; oh, personal file archiving
22:40 &lt;joepie91&gt; my god, the pyside pypi package is noisy
22:40 &lt;odie5533&gt; joepie91: just so you know, it was actually a lot of work to get it all to compile into a single exe, which I think is important for creating a truly portable program that everyone can use.
22:40 &lt;yipdw&gt; that's Windows-ist
22:46 &lt;yipdw&gt; :P
22:46 &lt;odie5533&gt; Yes, it works on any Windows
22:47 &lt;odie5533&gt; I tried it in a sandbox running XP and it works fine
22:47 &lt;nico_&gt; try it under Windows 2000
22:47 &lt;nico_&gt; :)
22:47 &lt;odie5533&gt; probably will work
22:47 &lt;nico_&gt; probably not
22:47 &lt;nico_&gt; it lacks the APIs :(
22:48 &lt;nico_&gt; everything requires at least Windows XP SP2 nowadays
22:48 &lt;odie5533&gt; I don't think I have a 98 sandbox running to test that
22:50 &lt;odie5533&gt; so maybe it doesn't work on EVERY Windows system, but it works on all the major ones currently in use
22:50 &lt;odie5533&gt; joepie91: did you get it working?
22:50 &lt;nico_&gt; odie5533: I want to run it under 2k on Alpha
22:50 &lt;joepie91&gt; odie5533: have to do other things first
22:50 &lt;joepie91&gt; pyside is still building anyway
22:50 &lt;joepie91&gt; 29%
22:51 &lt;joepie91&gt; slow build is slow
22:51 &lt;odie5533&gt; joepie91: just apt-get install it?
22:52 &lt;nico_&gt; http://www.xanthos.se/~joachim/PWS-Info.GIF
22:53 &lt;nico_&gt; just for fun
22:53 &lt;xmc&gt; woop woop woop off-topic siren
22:55 &lt;xmc&gt; take it to #archiveteam-bs
23:17 * joepie91 woops xmc
23:17 &lt;odie5533&gt; Written in Python, as opposed to e.g. the zlib library, which I believe is just a wrapper around the C/C++ zlib library.
23:18 &lt;odie5533&gt; joepie91: I found a solution to getting the offsets in the gzip files. The regular Python gzip library is written in Python and reads "members" one at a time, so it basically determines the offsets as it goes: http://hg.python.org/cpython/file/2.7/Lib/gzip.py
23:18 &lt;joepie91&gt; The user of the file doesn't have to worry about the compression,
23:43 &lt;joepie91&gt; but random access is not allowed."""
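That docstring caveat applies to one long gzip stream; the per-record-gzipped layout discussed earlier is what restores random access. A hedged sketch (the sample data and the `read_member` helper are assumptions for illustration): once an index of compressed member offsets exists, a single record can be decompressed without touching the rest of the file.

```python
import gzip
import zlib

# Placeholder records, each compressed as a separate gzip member.
records = [b"first record", b"second record", b"third record"]
parts = [gzip.compress(r) for r in records]
blob = b"".join(parts)

# The index: member number -> offset into the compressed file.
offsets, pos = [], 0
for p in parts:
    offsets.append(pos)
    pos += len(p)

def read_member(buf, offset):
    """Decompress only the gzip member starting at `offset`."""
    d = zlib.decompressobj(16 + zlib.MAX_WBITS)  # 16+: expect gzip framing
    # A decompressor stops at the end of the first gzip stream it sees,
    # so only this member is inflated; later members land in unused_data.
    return d.decompress(buf[offset:])
```

With this, `read_member(blob, offsets[1])` inflates just the second record, which is the whole point of indexing a warc.gz.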
23:43 &lt;nico_&gt; anyone know if there is a way to make the Wayback Machine take a snapshot of a website now?
23:43 &lt;odie5533_&gt; only of single pages
23:44 &lt;nico_&gt; is there anything else I need to make a WARC?
23:45 &lt;odie5533_&gt; What do you mean?
23:45 &lt;nico_&gt; I want to back up a whole domain
23:45 &lt;nico_&gt; because its owner is dead
23:46 &lt;nico_&gt; and the archive.org copy is way out of date
23:46 &lt;odie5533_&gt; definitely want to crawl it then
23:46 &lt;odie5533_&gt; If it's a small-ish site someone might run archivebot on it.
23:52 &lt;nico_&gt; anyone have a working wget line?
23:52 &lt;nico_&gt; I will try
23:53 &lt;nico_&gt; wget --warc-file="sid" --warc-cdx=sid --domains="domain.tld" -l inf -m -p -U "Mozilla/5.0 (Photon; U; QNX x86pc; en-US; rv:1.6) Gecko/20040429"
23:55 &lt;nico_&gt; --random-wait -w 5 --retry-connrefused
23:55 &lt;nico_&gt; -t 25
23:56 &lt;nico_&gt; ho
23:56 &lt;nico_&gt; http://www.archiveteam.org/index.php?title=Wget#Creating_WARC_with_wget
23:56 &lt;nico_&gt; it is already on the wiki
23:57 &lt;ivan`&gt; what's the domain?
23:58 &lt;nico_&gt; sid.rstack.org