Time | Nickname | Message
00:11 | godane | i'm tracking down original diggnation hd episodes
00:12 | godane | :-D
01:33 | SketchCow | http://archive.org/details/messmame
02:51 | balrog | ATZ0_: hmm?
05:22 | wp494 | just got an email from puush regarding "important changes"
05:22 | wp494 | will post more if anything of interest
05:23 | wp494 | " * Stop offering permanent storage, and files will expire after not being accessed for:
05:23 | wp494 | - Free users: 1 month
05:23 | wp494 | - Pro users: up to 6 months"
05:23 | wp494 | "How this will affect you after the 1st of August 2013:
05:23 | wp494 | * We are going to start expiring files. At this point, any files which haven't been recently viewed by anyone will be automatically deleted after 1 month, or up to 6 months for pro users."
05:23 | wp494 | and " * If you wish to grab a copy of your files before this begins, you can download an archive from your My Account page (Account -> Settings -> Pools -> Export)."
05:23 | wp494 | seems a lot like imgur-style expiration to me, except on a more extreme scale
05:24 | wp494 | if we were to start a project, it'd have to evolve into something like the urlteam project
05:25 | xmc | imgur expires posts? didn't know that
05:26 | winr4r | it looks like puush uses incremental IDs
05:28 | wp494 | yeah, they do after 6 months IIRC
05:28 | wp494 | (re. imgur)
05:29 | * | xmc nods
05:34 | wp494 | it should be easy to archive what exists already and then over the long-term archive what's uploaded afterwards
05:35 | wp494 | provided if done in urlteam style
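winr4r's point about incremental IDs is what would make a urlteam-style grab workable: walk the keyspace from one end and record what resolves. A minimal Python sketch of that idea follows; the ID alphabet and length are guesses rather than anything confirmed about puu.sh, and a real project would shard ranges out through a tracker instead of looping on one machine.

    # Sketch only: walk an assumed puu.sh-style keyspace, urlteam-style, and
    # note which IDs resolve. Alphabet and length are guesses, not confirmed.
    import itertools
    import urllib.error
    import urllib.request

    ALPHABET = "0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ"

    def candidate_ids(length):
        # every ID of the given length over the assumed alphabet, in order
        for combo in itertools.product(ALPHABET, repeat=length):
            yield "".join(combo)

    def probe(item_id):
        # return the final URL if the item exists, None if the request fails
        try:
            with urllib.request.urlopen("http://puu.sh/" + item_id, timeout=30) as resp:
                return resp.geturl()
        except urllib.error.URLError:
            return None

    for item_id in candidate_ids(4):
        hit = probe(item_id)
        if hit:
            print(item_id, hit)

Rate limiting and writing the responses into WARCs are deliberately left out; the point is just the enumeration.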
05:44 | wp494 | any thoughts?
05:44 | wp494 | channel name's probably going to be hard to come up with
05:45 | GLaDOS | #pushharder
05:46 | GLaDOS | You know, we wouldn't have to archive everything initially..
05:46 | GLaDOS | We'd just have to 'access' the file.
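GLaDOS's alternative relies on expiry being keyed to last access: fetch every known URL now and then and nothing ever crosses the one-month line. A rough sketch, assuming a plain GET counts as a "view" (unverified) and that urls.txt is a placeholder list of known puu.sh links, one per line:

    # Sketch: keep known files "accessed" so they don't expire. Assumes a GET
    # counts as a view; urls.txt is a placeholder input file.
    import time
    import urllib.request

    for line in open("urls.txt"):
        url = line.strip()
        if not url:
            continue
        try:
            urllib.request.urlopen(url, timeout=30).read(1)  # one byte should be enough to register the hit
            print("ok", url)
        except Exception as err:
            print("failed", url, err)
        time.sleep(1)  # be gentle with the host

As wp494 goes on to say, this only works for as long as someone keeps running it, which is why grabbing the files outright is the safer plan.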
05:52 | wp494 | good point
05:52 | wp494 | but I wouldn't think we'd be able to keep it up depending on how many files they have
05:57 | wp494 | probably better off in the long term just to grab anything we can
05:57 | wp494 | in case they decide to make the limits even shorter if we were to go through with the plan of just accessing
05:59 | wp494 | (which would suck for both us and users)
06:14 | underscor | Besides, gobs of data is more fun
07:12 | omf_ | Here is the shutdown notice - http://allthingsd.com/20130706/microsoft-quietly-shuts-down-msn-tv-once-known-as-webtv/
07:12 | omf_ | Closes at the end of september
07:12 | omf_ | from looking at a hosted site it should not be a problem to grab we just need to build a username list
07:13 | omf_ | This has pages going back to the late 90s I believe
07:24 | winr4r | i'd be surprised if most of them weren't from the 90s
07:24 | winr4r | hm :/
07:25 | omf_ | Just looking at the markup for some of those sites tells a story. I like finding shit like this
07:28 | poqpoq | http://news.uscourts.gov/pacer-survey-shows-rise-user-satisfaction
08:06 | Nemo_bis | they wouldn't be paying so much otherwise?
08:10 | winr4r | so how do you guys find sites, anyway
08:11 | winr4r | by which i mean, how do you get a list of websites or whatever hosted on a given service
08:15 | ersi | Depends on the site - some, you just have to go with brute force
08:15 | ersi | Others, you can scrape and discover users easily
08:18 | winr4r | ersi: what about stuff like webtv and free webhosts?
08:18 | winr4r | i.e. arbitrary usernames
08:18 | winr4r | no standard format, no links between pages
08:19 | winr4r | i might put a page together about this on the wiki
08:21 | winr4r | and i'm finding old ODP data (from about 2009, i needed it once and never deleted it) quite useful
08:24 | omf_ | winr4r, ODP data?
08:26 | winr4r | omf_: Open Directory Project
08:27 | winr4r | they offer dumps of their data, about 1.9 gigabytes
08:27 | winr4r | (uncompressed)
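For reference, the ODP dumps are one huge RDF-flavoured XML file (content.rdf.u8 in the copies I've seen), with every listed site appearing as a quoted URL, so pulling out all the webtv.net entries is close to a plain pattern match. A rough sketch along those lines; the dump filename and the choice to regex the raw text rather than parse the XML are both assumptions and shortcuts:

    # Sketch: grep an ODP/DMOZ RDF dump for webtv.net URLs to seed a site list.
    # Pass the dump file (e.g. content.rdf.u8) as the first argument.
    import re
    import sys

    URL_RE = re.compile(r'"(https?://[^"]*webtv\.net[^"]*)"')

    seen = set()
    with open(sys.argv[1], encoding="utf-8", errors="replace") as dump:
        for line in dump:
            for url in URL_RE.findall(line):
                if url not in seen:
                    seen.add(url)
                    print(url)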
08:39 | winr4r | http://archiveteam.org/index.php?title=MSN_TV
08:41 | GLaDOS | ============
08:41 | GLaDOS | The code has been prepared to run the hell out of.
08:41 | GLaDOS | To help out, join #jenga
08:41 | GLaDOS | Xanga has 8 days left, and we've yet to download 4 million users.
09:07 | wp494 | http://archiveteam.org/index.php?title=Puu.sh
09:07 | wp494 | wiki page for puu.sh now up
09:16 | ersi | winr4r: I'd look into searching through search engines and then Common Crawl. Then I'd go brute-forcing usernames
09:17 | winr4r | ersi: are there any scrapable search engines?
09:17 | winr4r | bing used to have a useful API, doesn't now
09:18 | underscor | What does the shape of the urls you need look like, winr4r?
09:18 | underscor | I can pull stuff out of wayback
09:20 | winr4r | underscor: anything from community.webtv.net or community-X.webtv.net for values of X = 1..4
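One concrete way to do what underscor is offering, even from outside the Archive: ask the Wayback Machine's CDX index for everything it has captured under those five hosts and use that as the seed list. A sketch, assuming the public CDX endpoint at web.archive.org/cdx/search/cdx and its fl/collapse parameters behave as documented:

    # Sketch: list every URL the Wayback Machine has captured under the WebTV
    # community hosts, as a seed list for the grab.
    import urllib.parse
    import urllib.request

    CDX = "https://web.archive.org/cdx/search/cdx"
    HOSTS = ["community.webtv.net"] + ["community-%d.webtv.net" % i for i in range(1, 5)]

    for host in HOSTS:
        query = urllib.parse.urlencode({
            "url": host + "/*",    # everything under this host
            "fl": "original",      # only the original-URL column
            "collapse": "urlkey",  # one row per unique URL
        })
        with urllib.request.urlopen(CDX + "?" + query, timeout=120) as resp:
            for line in resp:
                print(line.decode("utf-8", "replace").strip())

Deduplicating on the first path component of the output would get close to the username list omf_ mentioned.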
09:21 | ersi | winr4r: I know there's an "old script" alard made for scraping Google
09:21 | ersi | somewhere
09:40 | underscor | http://farm8.staticflickr.com/7433/9228353492_aa9169e927_k.jpg
09:40 | underscor | Mmmmm
09:40 | underscor | Explosion-y goodness from July 4th
10:10 | winr4r | http://paste.archivingyoursh.it/nuxopefaci.py
10:10 | winr4r | wrote that just now, takes list of shortened URLs on stdin, outputs non-shortened URLs on stdout
10:11 | winr4r | dunno if anyone else would find it useful, but there it is
10:11 | winr4r | i had a big list of t.co URLs from a twitter search, needed to convert to real URLs
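The paste above isn't reproduced in the log, so here is a stand-in for what a script matching that description would look like: shortened URLs on stdin, resolved URLs on stdout, letting the HTTP library follow the redirects. This is a guess at the approach, not winr4r's actual code:

    # Sketch of a stdin-to-stdout redirect resolver (not the original paste).
    import sys
    import urllib.request

    for line in sys.stdin:
        short = line.strip()
        if not short:
            continue
        try:
            with urllib.request.urlopen(short, timeout=30) as resp:
                print(resp.geturl())  # urlopen has already followed the redirects
        except Exception as err:
            print("ERROR %s (%s)" % (short, err), file=sys.stderr)

Run as "python resolve.py < tco-urls.txt > real-urls.txt" (filenames made up), which is the shape of the job described above.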
10:13 | winr4r | tbh i don't know if MSN TV will even merit using warrior, rather than one guy with a fast pipe and wget
10:19 | winr4r | oh shit
10:20 | winr4r | apparently yeah some people link some disgusting shit what the fuck
10:20 | ersi | haha
10:21 | winr4r | okay so one of the URLs to which a t.co link resolved was a google groups search with the query string "12+year+old+daughter+sex"
10:22 | winr4r | am i fucked?
10:24 | winr4r | i don't know how the fuck that showed up in a search for "webtv.net" on twitter, but it did
10:25 | ersi | Pack your things
10:25 | ersi | Before the vans arrive
10:27 | winr4r | http://community-2.webtv.net/@HH!17!BF!62DA2CCF370F/TvFoutreach/COUNTDOWNTO666/
10:39 | JackWS | Hi all, interested in the project. I was wondering if there is a standalone archiver? Got a load of Linux servers and a few Windows servers with a shed load of bandwidth going spare every month
10:39 | winr4r | JackWS: xanga? yes, there is
10:39 | GLaDOS | You can run the projects as standalone.
10:40 | winr4r | https://github.com/ArchiveTeam/xanga-grab
10:40 | winr4r | there aren't installation instructions there
10:40 | GLaDOS | install pip (python), pip install seesaw, clone the project repo, run ./get-wget-lua.sh, then run-pipeline pipeline.py YOURNAMEHERE --concurrent amountofthreads --disable-web-server
10:40 | winr4r | ...yeah i was about to say something like that :)
10:41 | GLaDOS | Should write something for it on the wiki.
10:41 | winr4r | the dependency instructions at https://github.com/ArchiveTeam/greader-grab will probably work just as well for xanga-grab
10:41 | ersi | Or just commit a README.md
10:48 | JackWS | thanks for the info
10:48 | JackWS | ill take a look
10:48 | winr4r | :D
10:58 | winr4r | http://www.faroo.com/hp/api/api.html
10:58 | winr4r | well this exists
11:00 | ersi | Cool
11:00 | BlueMax | English, German and Chinese results
11:00 | BlueMax | How specific
11:01 | winr4r | oh, scratch that, it seems it doesn't support "site:" queries
11:14 | Rainbow | Hi, having some issues compiling wget-lua for xanga-grab, anyone know what causes this issue? http://www.hastebin.com/yetorupupa.vbs
11:20 | winr4r | googling the error i'm seeing that it happens when you don't have -ldl in LDFLAGS, but it's clear that you do
11:32 | Rainbow | Damn, I have to go afk. If anyone finds what the issue is, please pm me.
11:34 | winr4r | Rainbow: yup, paging GLaDOS
11:34 | GLaDOS | I have no idea when it comes to building
11:34 | winr4r | my bad!
13:00 | Rainbow | \o/ Fixed it!
13:00 | winr4r | Rainbow: how?
13:01 | Rainbow | Left over lua install seemed to cause it
13:01 | Rainbow | Odd as it sounds
13:01 | winr4r | ah :)
13:01 | IceGuest_ | WARNING:tornado.general:Connect error on fd 6: ECONNREFUSED
13:26 | JackWS | Why would I be getting New item: Step 1 of 8 No HTTP response received from tracker. ?
13:26 | winr4r | tracker down?
13:26 | JackWS | working on my machine
13:26 | JackWS | just not on my server :?
13:28 | winr4r | you sure there's no outbound filtering?
13:28 | JackWS | Should not be
13:28 | JackWS | what ports is it wanting to use?
13:29 | winr4r | not sure
13:30 | GLaDOS | It just uses port 80
13:31 | JackWS | testing it in a VM before I deploy it onto a few servers
13:38 | JackWS | ah I got it
13:38 | JackWS | [screen is terminating]
13:38 | JackWS | when trying to run
13:44 | JackWS | ah got it running
13:44 | JackWS | was just being funny I think
13:45 | JackWS | are you able to enable the graph on the webserver site? would be nice to see how much it is using
15:26 | WiK | sup
15:33 | antomatic | hey
15:46 | WiK | hows it going antomatic ?
15:47 | antomatic | ah, can't complain. Just sitting here staring at the Xanga leaderboard. :)
16:56 | db48x | do we have a tool that breaks a megawarc back up into the original warcs?
16:59 | winr4r | db48x: https://pypi.python.org/pypi/Warcat/ ?
16:59 | db48x | not quite
16:59 | db48x | it can extract records from a warc (or a megawarc)
17:00 | db48x | but the original warc was a series of related records
17:01 | db48x | metadata about the process used to create the warc, each request as it was made, and each response received
17:01 | winr4r | i'll pass on that question, then
17:03 | db48x | the warc viewer is pretty good
17:03 | db48x | but I don't want to use wget to spider a site being served up by the warc viewer's proxy server
17:05 | db48x | warc-to-zip is interesting, but alas it requires byte offsets
17:06 | db48x | I can get the start addresses of the response records, but not their lengths
17:06 | xmc | db48x: https://github.com/alard/megawarc "megawarc restore megafoo.warc.gz"
17:07 | xmc | iirc it creates a file bit-for-bit identical to the original source
17:07 | xmc | is that what you're looking for?
17:07 | db48x | ah, that sounds promising
17:12 | db48x | I will have to update the description on the warc ecosystem page
17:21 | xmc | ooh, warcproxy
17:21 | xmc | I was meaning to write that
17:21 | xmc | cool that someone else did!
17:21 | xmc | now to bend it to my will
17:24 | db48x | heh
17:39 | xmc | well, not now, maybe later.
17:47 | db48x | xmc: thanks, btw
17:48 | db48x | that turned out to be precisely what I needed
17:48 | xmc | my pleasure
17:48 | xmc | excellent
17:49 | db48x | we ought to get something set up so that people can reclaim their data by putting in the site url
17:50 | xmc | not a bad idea at all
17:50 | db48x | hmm, there are 444 of these megawarcs; I had to download all the idx files to find the one containing the site I wanted
17:51 | db48x | not sure I have 22 tb just laying around
17:52 | xmc | @_@
17:52 | xmc | might be more reasonable to patch up the megawarc program to submit range-requests to the Archive and reassemble that way
17:53 | db48x | that's what warc-to-zip does
17:53 | db48x | you give it the url of a warc and a byte range, and it gives you a zip
17:53 | xmc | ah cool
17:54 | db48x | looks like the json files have the best information
17:54 | db48x | {"target":{"container":"warc","offset":0,"size":29265692},"src_offsets":{"entry":0,"data":512,"next_entry":29266432},"header_fields":{"uid":1001,"chksum":0,"uname":"","gname":"","size":29265692,"devmajor":0,"name":"20130526205026/posterous.com-vividturtle.posterous.com-20130522-061616.warc.gz","devminor":0,"gid":1001,"mtime":1369567781.0,"mode":420,"linkname":"","type":"0"},"header_base64":"MjAxMzA1MjYyMDUwMjYvcG9zdGVyb3VzLmNvbS12aXZpZHR1
18:02 | db48x | yes, very nice
18:02 | db48x | using the offset and offset+size as the byte range I get a very nice zip
18:03 | db48x | so it would just be a matter of parsing the filenames from the json indexes to get the site urls
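That record has everything needed for the range trick db48x describes: target.offset and target.size locate the original per-site warc.gz inside the megawarc, and header_fields.name carries the original filename, which encodes the site. A small sketch of the lookup, assuming the .json index holds one such object per line (as the quoted record suggests); the index path and search string are placeholders passed on the command line:

    # Sketch: find a site's original warc inside a megawarc via the .json index
    # and print the byte range to hand to warc-to-zip or an HTTP Range request.
    # Assumes one JSON object per line, shaped like the record quoted above.
    import json
    import sys

    def find_ranges(index_path, needle):
        with open(index_path, encoding="utf-8") as index:
            for line in index:
                entry = json.loads(line)
                if entry["target"]["container"] != "warc":
                    continue  # only entries stored in the .warc.gz part
                name = entry["header_fields"]["name"]
                if needle in name:
                    offset = entry["target"]["offset"]
                    size = entry["target"]["size"]
                    yield name, offset, offset + size - 1  # inclusive HTTP-style range

    for name, first, last in find_ranges(sys.argv[1], sys.argv[2]):
        print("%s\tbytes=%d-%d" % (name, first, last))

Feeding the printed range to warc-to-zip, or to a plain HTTP Range request against the megawarc on archive.org, is the "reclaim your data by site url" service in miniature.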
18:04 | xmc | fantastique
18:06 | db48x | precisimo
19:45 | arkhive | I'm sure it's been mentioned, but if it hasn't... MSN TV is closing!
19:46 | arkhive | Heh, my WebTV Philips/Magnavox client is in my recycling
21:50 | wp494 | [14:45:46.746] <arkhive> I'm sure it's been mentioned, but if it hasn't... MSN TV is closing!
21:50 | wp494 | we're aware
21:51 | wp494 | also, puu.sh has now been added to the navbox
21:51 | wp494 | (channel for those that weren't awake at 4 AM CDT: #pushharder)
22:16 | wp494 | posterous still remains on the tracker and in warriors for whatever reason
22:17 | wp494 | what gives, if I can ask?