[00:00] doing a small grab
[00:01] ok first smallest hit http://ext-51.livejournal.com/feed/
[00:07] it could be made better for large ones, with some threading
[00:08] does the job well tbh
[00:09] out of the first 1000 urls, only 10 still remain and then thats if they have a page on livejournal
[00:10] Grabbing livejournal would be very appreciated -- there is a lot of important historical stuff there, buried among a lot of other stuff.
[00:11] (It also has lots of non-public stuff, but hopefully much of that got copied over to dreamwidth...)
[00:11] ill throw a complete job list on my server and post the results on github when done
[00:12] we need to know when they stop. I presume 999999
[00:12] my id number is ext-3542056
[00:12] cool, good
[00:12] so there's that many numbers
[00:13] so at some stage it had over a million profiles
[00:13] ah, should we aim for 10 million?
[00:14] SimpBrain, do you want to do halves each?
[00:16] I cant start it yet, my server is down for server surgery
[00:16] just creating a github place for the modified script
[00:17] ok
[00:19] dont know python that well so it looks ok and runs without errors
[00:19] https://github.com/SimpleBrain/livejournal-dump/
[00:20] currently on 2300/9999 of the test scrape
[00:29] meh looks ok, doing a 1-999999 scrape now
[00:32] hopefully a 9999-999999 scrape.
[00:32] No need to repeat work. :-)
[00:33] i abandoned the local scrape
[00:33] doing on my server which is quicker
[00:33] just used my results so far to verify its grabbing same results
[00:34] Argh, Funny or Die took down their own "Art of the Deal" parody film. I can't believe I have to go on The Pirate Bay to get free web content
[00:43] SimpBrain: ah, that makes sense
[00:43] at 10% already
[00:44] drats didnt do it inside a screen session
[00:44] heh
[00:44] always do *everything* in a screen session. :-)
[00:45] i had no sessions active, didnt notice
[00:45] * JesseW is very embarrassed that it took me over a decade of using real unix shells to learn about and start using screen.
[00:46] Now I love it.
[00:49] *** ndiddy has quit IRC (Read error: Operation timed out)
[00:55] *** robink has quit IRC (Ping timeout: 190 seconds)
[00:57] *** robink has joined #archiveteam-bs
[01:00] *** robink has quit IRC (Ping timeout: 190 seconds)
[01:01] *** robink has joined #archiveteam-bs
[01:02] *** robink has quit IRC (Remote host closed the connection)
[01:08] ok got 3 jobs running
[01:08] 1-99999, 100000-1100000, 1100001-1199999
[01:08] latter 2 are hitting more profiles than the first one
[01:21] heh, I'm not surprised
[01:27] ok
[01:27] *** dashcloud has quit IRC (Read error: Operation timed out)
[01:27] ive done a bit more site browsing
[01:27] http://www.livejournal.com/profile?userid=1&t=I
[01:28] http://www.livejournal.com/profile?userid=77536449&t=I
[01:28] there's 77 million profiles
[01:30] Yikes, that's a lot even for us
[01:30] a lot may be purged
[01:30] *** dashcloud has joined #archiveteam-bs
[01:30] Yeah
[01:30] and a lot wont have a journal
[01:31] ill do the ext- scrape first since there's only 3.5million links then go for this big one
[01:47] *** dashcloud has quit IRC (Read error: Operation timed out)
[01:50] *** dashcloud has joined #archiveteam-bs
[02:01] SimpBrain: is livejournal going away?
[02:04] SimpBrain: if you'd like and SketchCow agrees I'd be happy to create a warrior project for that site
[02:05] If we're going to do a warrior grab we might not need the whole discovery.
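
The discovery scrape discussed above walks the ext-N ID space and records which feeds still resolve. A minimal Python sketch of that approach, with hypothetical file and function names throughout -- the actual script lives in the livejournal-dump repo linked above and may differ:

    import time
    import urllib.request

    START, END = 1, 999999        # the ID range discussed above
    ALIVE_LOG = "alive.txt"       # hypothetical output file

    def is_alive(ext_id):
        """Return True if ext-<N>.livejournal.com still serves a feed."""
        url = "http://ext-%d.livejournal.com/feed/" % ext_id
        try:
            with urllib.request.urlopen(url, timeout=30) as resp:
                return resp.getcode() == 200
        except Exception:
            return False          # 404s, purged accounts, timeouts

    with open(ALIVE_LOG, "a") as log:
        for ext_id in range(START, END + 1):
            if is_alive(ext_id):
                log.write("ext-%d\n" % ext_id)
            time.sleep(0.5)       # throttle; see the ban discussion below

As noted at [00:07], a thread pool would speed this up for large ranges, though it would also hit LiveJournal's rate limit (see the ban page quoted at [02:44] below) that much faster.
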
We'll probably go through all URLs in the grab and grab any website that exists for the ID
[02:08] not going away, just might be worth doing some prep work
[02:08] scraping the site for profiles and working journals
[02:08] What do you think about a loong term archiving project for livejournal?
[02:08] long*
[02:08] As in slowly grabbing and saving accounts
[02:09] yeah
[02:09] could be worth getting a copy. at least if worst case happens we have the leg work sorted
[02:10] SketchCow: please let us know if we can do a long term archiving project for livejournal
[02:10] Maybe over a year or so or longer
[02:11] * JesseW is enthusiastic about a looong term arkiving of livejournal.
[02:11] Hm, maybe the verb for creating a warrior project should be "arkiving" ;-P
[02:12] :D
[02:14] livejournal doesn't have much javascript, so works perfectly in wayback too at the moment
[02:21] there's a small bit in the web page source
[02:24] they done a number on the purged accounts on lj
[02:24] last active account was about 4500, currently at 70000 and its still yet to find anyone
[02:25] a lot of folks moved over to dreamwidth some years ago
[02:25] that's likely why
[02:25] They may have also skipped numbers somewhere along the way - re-started numbering at 10k or somesuch.
[02:26] Er, 100k, I mean.
[02:26] yeah i got results in the 99k range, so it'll hit someone soon
[02:28] A lot of people jumped ship, but I find it hard to believe you'd go through 50k+ accounts and not find a single existing, long-abandoned account among them.
[02:29] done a couple of quick tests on other machines, it's dead space
[02:30] I'm not arguing it's dead space, I'm just skeptical there was ever anything there in the first place.
[02:30] true
[02:30] guessing some of those low numbers are permanent accounts
[02:30] Hm. I had a similar thing with the URL shortener migre.me -- they skipped about 5 million possible values at one point.
[02:31] I documented it all on the URLteam page.
[02:31] BTW, what's the archiving situation with the archiveteam wiki?
[02:31] Are there exported copies I can grab from IA?
[02:32] If not, we should really do that.
[02:32] I think there are a couple older ones at IA, that are linked from the wiki itself someplace. Um...
[02:33] btw lj refers to those ext numbers as internal usernames
[02:33] It's in the FAQ, actually. https://archive.org/details/wiki-archiveteamorg
[02:34] so those act as a temp username till you create your subdomain
[02:35] *** JesseW has quit IRC (Quit: Leaving.)
[02:38] *** JesseW has joined #archiveteam-bs
[02:38] *** yipdw_ has joined #archiveteam-bs
[02:41] *** yipdw has quit IRC (Ping timeout: 506 seconds)
[02:43] ok...
[02:44] Either you are trying to access a page you do not have permission to
[02:44] view or your ip address is banned. If you feel this is in error, please
[02:44] email support at webmaster@livejournal.com
[02:45] Think I've managed to work out why nobody's scraped Livejournal yet... :<
[02:47] i got banned due to doing it too quick
[02:49] with the low scrape, i was getting more misses than hits and this script is going too quick if it cant find it
[02:49] will add a timer wait to the error page one
[03:00] *** JesseW has quit IRC (Quit: Leaving.)
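
The "timer wait" on the error page mentioned at [02:49] could look like the sketch below: retry with an increasing delay whenever LiveJournal serves the ban page quoted above. The marker string and function name are assumptions for illustration, not the actual script:

    import time
    import urllib.request

    # Text taken from the ban page quoted above at [02:44].
    BAN_MARKER = b"your ip address is banned"

    def fetch_with_backoff(url, base_delay=2.0, max_tries=6):
        """Fetch url, backing off exponentially while the ban page is served."""
        delay = base_delay
        for _ in range(max_tries):
            try:
                with urllib.request.urlopen(url, timeout=30) as resp:
                    body = resp.read()
            except Exception:
                body = b""
            if body and BAN_MARKER not in body.lower():
                return body       # a normal page (profile, feed, etc.)
            time.sleep(delay)     # banned or erroring: wait it out
            delay *= 2            # double the wait on each retry
        return None               # give up on this URL for now
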
[03:11] *** tomwsmf-a has quit IRC (Read error: Operation timed out)
[03:18] both machines have done over 10k and not been touched yet
[03:18] will do a profile grab from my home pc which will be slower and wont run afoul of them
[03:43] *** Boppen has joined #archiveteam-bs
[03:48] *** bwn has quit IRC (Ping timeout: 492 seconds)
[03:58] *** ndiddy has joined #archiveteam-bs
[04:03] *** xXx_ndidd has joined #archiveteam-bs
[04:12] *** xXx_ndidd has quit IRC (Read error: Connection reset by peer)
[04:13] *** xXx_ndidd has joined #archiveteam-bs
[04:16] *** Boppen has quit IRC (Ping timeout: 200 seconds)
[04:16] *** ndiddy has quit IRC (Read error: Operation timed out)
[04:21] *** JesseW has joined #archiveteam-bs
[05:10] *** xXx_ndidd has quit IRC (Read error: Connection reset by peer)
[05:21] *** Sk1d has quit IRC (Ping timeout: 250 seconds)
[05:30] *** Sk1d has joined #archiveteam-bs
[06:08] *** VADemon has quit IRC (Quit: left4dead)
[06:22] *** Stiletto has joined #archiveteam-bs
[06:31] https://archive.org/details/personal_archives <- interesting, apparently aborted IA project from 2012
[06:37] *** dashcloud has quit IRC (Read error: Operation timed out)
[06:38] *** dashcloud has joined #archiveteam-bs
[07:27] *** vitzli has joined #archiveteam-bs
[07:37] *** JesseW has quit IRC (Quit: Leaving.)
[07:43] *** dashcloud has quit IRC (Read error: Operation timed out)
[07:45] *** DFJustin has quit IRC (Remote host closed the connection)
[07:46] *** dashcloud has joined #archiveteam-bs
[07:50] *** metalcamp has joined #archiveteam-bs
[08:00] *** DFJustin has joined #archiveteam-bs
[08:00] *** swebb sets mode: +o DFJustin
[08:07] *** brayden has quit IRC (Read error: Connection reset by peer)
[08:07] *** brayden has joined #archiveteam-bs
[08:07] *** swebb sets mode: +o brayden
[09:38] *** vitzli has quit IRC (Leaving)
[09:45] SimpBrain: I’ve been going after Yahoo! Groups for a few months now, fyi.
[09:48] *** bwn has joined #archiveteam-bs
[10:12] PurpleSym, oh cool, was it easy to do with all the closed groups?
[10:23] I’m not doing closed groups, SimpBrain
[10:25] Not until I got all public groups. And that’s gonna take a while.
[10:26] figures, there's like almost 15-20 years of archives there
[10:30] Yeah, and it’s a mess.
[10:31] Here’s some stats: http://archiveteam.org/index.php?title=Yahoo!_Groups#Statistics
[10:35] cool, seems like companies never like people grabbing the data easily
[11:43] damn i hope this becomes available after it finishes https://www.youtube.com/watch?v=H6_Nui_cBYM
[11:43] DJ EZ 24hr Set in Aid of Cancer Research UK
[11:44] 17 hours out of 24 done so far
[11:57] *** schbirid has joined #archiveteam-bs
[12:53] SketchCow: all 2009 mp3s of kpfa are uploaded now
[13:10] *** snape has quit IRC (Hey! Where'd my controlling terminal go?)
[14:35] *** ohhdemgir has quit IRC (Read error: Operation timed out)
[14:48] *** zhongfu has quit IRC (Remote host closed the connection)
[15:18] PotcFdk: I opened up another pull request for you on github.
[16:00] *** snape has joined #archiveteam-bs
[16:28] https://www.youtube.com/watch?v=zSfcpUtdid0 nsfw
[16:29] https://www.youtube.com/watch?v=9Nbr_OLqTOo same
[16:33] one way of getting copies sold
[17:19] https://twitter.com/search?f=tweets&vertical=default&q=%23DJEZ24HourSet&src=tyah
[17:59] *** zhongfu has joined #archiveteam-bs
[18:06] *** zhongfu has quit IRC (Remote host closed the connection)
[18:11] *** JesseW has joined #archiveteam-bs
[18:15] *** metalcamp has quit IRC (Ping timeout: 252 seconds)
[18:16] *** zhongfu has joined #archiveteam-bs
[19:04] *** godane has quit IRC (Read error: Operation timed out)
[19:08] bsmith095: OK, I've got the fanfiction file.
[19:08] (well, 99.9% -- there appears to be something preventing it from finishing. :-(
[19:09] I can't repack it until after I get some more free space, though.
[19:33] *** Infreq has quit IRC (Ping timeout: 258 seconds)
[19:34] *** Infreq has joined #archiveteam-bs
[19:35] *** zino_ has joined #archiveteam-bs
[19:35] *** decay_ has joined #archiveteam-bs
[19:38] *** schbirid has quit IRC (hub.efnet.us irc.Prison.NET)
[19:38] *** zino has quit IRC (hub.efnet.us irc.Prison.NET)
[19:38] *** decay has quit IRC (hub.efnet.us irc.Prison.NET)
[19:38] *** achip has quit IRC (hub.efnet.us irc.Prison.NET)
[19:42] *** schbirid2 has joined #archiveteam-bs
[19:57] *** achip has joined #archiveteam-bs
[20:05] I had been wondering why my home bandwidth stayed over 200Mbit for such extended periods without throttling. Just found a 3 month old letter from my ISP: The basic tier is upped from 100Mbit to 250Mbit.
[20:05] So, yay.
[20:06] But it does mean I have less incentive to go do a real upgrade.
[20:12] *** ndiddy has joined #archiveteam-bs
[20:12] Lucky you. I get 1.2mbps/200kbps... when it's not raining. Supposedly 1gbps fiber is coming this summer; I'll believe it when I see it.
[20:13] several years ago there was an article about how the US have to be at war all the time to keep their economy running. probably about 10 years ago or more. it had at least one chart in it. anyone remember that? :D
[20:13] i thought it was called "the cost of empire" but apparently not
[20:29] *** metalcamp has joined #archiveteam-bs
[20:48] *** JesseW has quit IRC (Quit: Leaving.)
[21:26] *** metalcamp has quit IRC (Ping timeout: 252 seconds)
[21:32] *** bwn has quit IRC (Ping timeout: 246 seconds)
[21:51] *** Boppen has joined #archiveteam-bs
[21:54] *** mismatch_ has joined #archiveteam-bs
[22:03] *** schbirid2 has quit IRC (Quit: Leaving)
[22:07] *** Boppen has quit IRC (hub.se irc.du.se)
[22:11] *** bwn has joined #archiveteam-bs
[22:25] *** Boppen has joined #archiveteam-bs
[22:30] *** JesseW has joined #archiveteam-bs
[22:30] *** bwn has quit IRC (Read error: Operation timed out)
[22:44] *** Boppen has quit IRC (Ping timeout: 200 seconds)
[22:47] *** Boppen has joined #archiveteam-bs
[22:48] JesseW: does it hashcheck ok?
[22:49] zino_: lucky bastard :)
[22:50] bsmith095: hm, will check
[22:51] MrRadar: you have an exact copy of my original ffnet file, what was that link?
[22:51] JesseW: not the torrent itself, the md5 from the archive page.
[22:53] *** Boppen has quit IRC (hub.se irc.du.se)
[22:53] My copy is 4,442,120 bytes smaller.
[22:53] bsmith095: I grabbed it using the original magnet link in the Reddit post
[22:53] So I'm pretty certain the hashes won't match. :-)
[22:53] hmm, diff and merge with the archive copy? or will that take forever
[22:54] I think I'll try switching the torrent over to the IA torrent -- maybe that will work.
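
Checking a local copy against "the md5 from the archive page" without loading a tens-of-gigabytes file into memory is straightforward; a sketch (the expected value is the one MrRadar quotes a few lines below for Fanfiction.tar.gz):

    import hashlib

    def file_md5(path, chunk_size=1 << 20):
        """MD5 a large file in 1 MiB chunks to keep memory use flat."""
        h = hashlib.md5()
        with open(path, "rb") as f:
            for chunk in iter(lambda: f.read(chunk_size), b""):
                h.update(chunk)
        return h.hexdigest()

    # Expected value quoted later in the log for Fanfiction.tar.gz:
    print(file_md5("Fanfiction.tar.gz") == "4ded153ca5c091a64dd9aca1c6f4be88")
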
[22:55] JesseW: nope, that won't work
[22:55] No?
[22:55] The IA torrent doesn't contain the actual FF.net data
[22:55] Ah, that's unhelpful.
[22:55] Since it's over the 25GB file size limit for torrents
[22:55] md5 from my ia upload for the Fanfiction.tar.gz file is 4ded153ca5c091a64dd9aca1c6f4be88
[22:56] MrRadar: so JesseW downloading from that torrent was a complete waste of time?!
[22:56] From the IA torrent?
[22:56] hardly -- it got me 99%
[22:57] yes, that one.
[22:57] JesseW: try doing a range request from the IA server for the missing bytes from your download
[22:57] You can use wget with the -r option
[22:57] JesseW: still, it probably won't extract properly
[22:57] Good idea
[22:57] MrRadar:
[22:59] Err, I meant curl
[22:59] *** mismatch_ has quit IRC (Ping timeout: 499 seconds)
[22:59] JesseW: my original magnet link was my own torrent of the gz file, that i created myself. the IA upload is a copy of that file. AFAIK MrRadar, that torrent link you found on my post is still being seeded.
[23:00] Yes, that's also how I uploaded it to the IA
[23:00] I copied the torrent file out of my client's torrent cache and uploaded that
[23:00] And the IA downloaded it
[23:02] The missing piece is the *last* one, interestingly enough.
[23:03] Hash: d934709d1c7f1bf26d826718804de5f7a53757dc
[23:03] Magnet: magnet:?xt=urn:btih:d934709d1c7f1bf26d826718804de5f7a53757dc&dn=Fanfiction.tar.gz&tr=udp%3A%2F%2Ftracker.openbittorrent.com%3A80
[23:03] great, so thats probably the tail end of the gz file, otherwise known as the only bit you absolutely need.
[23:04] why is compression so damn fragile?!
[23:04] That is a different magnet link from the one in the reddit post, oddly.
[23:04] Actually, for gzip files that will just result in truncated data
[23:05] Even for zip files the extractor should still be able to reconstruct the catalog from the rest of the file, at least well enough to extract the data that *is* present
[23:05] MrRadar: can you hashcheck your copy, now i'm worried.
[23:05] Sure, give me a few minutes
[23:06] *** rduser has quit IRC (Ping timeout: 260 seconds)
[23:06] the hash of the file i originally uploaded to IA is 4ded153ca5c091a64dd9aca1c6f4be88, just so you know.
[23:06] *** Rickster has quit IRC (Ping timeout: 260 seconds)
[23:07] The "Torrent Hash" I have is d934709d1c7f1bf26d826718804de5f7a53757dc
[23:07] The SHA-256 of the actual file is 9999fc66f05bc7353a797d5a3a16cfd73a71b401bc7194a1537ac72b432caffa
[23:08] The IA was able to download the whole torrent just fine when I uploaded it
[23:08] And the SHA-1 hashes for the file match between our two items
[23:08] *** Famicoman has quit IRC (Ping timeout: 260 seconds)
[23:09] *** Simpbrai_ has quit IRC (Remote host closed the connection)
[23:10] *** bauruine has quit IRC (Ping timeout: 260 seconds)
[23:10] *** mismatch_ has joined #archiveteam-bs
[23:11] *** rduser has joined #archiveteam-bs
[23:11] I'm going to reverify my local data -- maybe something went weird there
[23:12] MrRadar: I have the same torrent hash.
[23:12] *** bauruine has joined #archiveteam-bs
[23:13] *** Rickster has joined #archiveteam-bs
[23:14] *** Simpbrai_ has joined #archiveteam-bs
[23:17] I'm very confused why the infohash listed in the reddit thread is different from the one in the copy of the torrent in IA.
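
MrRadar's range-request suggestion at [22:57] (with curl it would be the -r/--range option, not wget's -r, as he corrects) can also be done from Python. Since the missing piece is the last one, one approach is to fetch just the final piece and write it over the tail of the local file; this assumes the torrent client preallocated the file at full length. Everything below (item URL, file size, piece length) is a placeholder for illustration, not real values:

    import urllib.request

    # Placeholders: the real values come from the IA item and the
    # .torrent metadata (total file length and piece length).
    URL = "https://archive.org/download/ITEM/Fanfiction.tar.gz"
    FILE_SIZE = 26_000_000_000          # hypothetical total length in bytes
    PIECE_LEN = 4 * 1024 * 1024         # hypothetical piece length

    last_piece_start = (FILE_SIZE - 1) // PIECE_LEN * PIECE_LEN
    req = urllib.request.Request(
        URL, headers={"Range": "bytes=%d-%d" % (last_piece_start, FILE_SIZE - 1)})
    with urllib.request.urlopen(req) as resp, open("Fanfiction.tar.gz", "r+b") as f:
        f.seek(last_piece_start)        # overwrite the bad final piece in place
        f.write(resp.read())

After patching the tail, rechecking the torrent (or the MD5) would confirm whether the repair took.
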
[23:19] IIRC I ran into the same weirdness but when I tried re-adding the magnet link it told me that it was a duplicate of the torrent I uploaded
[23:19] *re-adding the magnet link to my torrent client
[23:23] JesseW: yeah, that's weird
[23:23] MrRadar: have you tried to open the file yet?
[23:24] Yes, I actually decompressed the file before I uploaded it to make sure it was valid
[23:24] I should have an MD5 to compare in about 30 seconds
[23:24] OK, I'm going to try grabbing the last bit from IA
[23:24] JesseW: the reddit thread was *my* torrent, the IA torrent is derived.
[23:24] Well, there's two torrents on my IA item
[23:25] but the file is the same so it really shouldn't matter
[23:25] One is the .torrent file retrieved from your original magnet link
[23:25] actually not yet, I'll wait for it to finish trying to verify the current data
[23:25] The other is the IA's derived torrent
[23:25] right, and IA's derived torrent is useless
[23:25] Correct
[23:25] ohhhh, ok i get it now.
[23:26] i just really hope i didnt screw it up again, i recently deleted the original uncompressed files to free up some room.
[23:26] Whoops, the tool I was using to hash the file was accidentally set to CRC32 instead of MD5 like I wanted. Redoing now
[23:27] didnt know you could check crc32 manually
[23:29] OK, I tried downloading the magnet link (3E2H) in a separate instance -- and the resulting torrent hash *is* the other one (d93). Strange.
[23:29] Yeah, that's exactly what I saw
[23:29] magnet links are confusing. :-/
[23:29] Also, neat, this article references a manual uploaded to the IA by SketchCow: https://www.muckrock.com/news/archives/2016/feb/24/hunt-governments-oldest-computer/
[23:30] For the Gateway Liberty 2000 laptop from 1994 https://archive.org/details/gateway-service-manual-liberty-user-manual
[23:33] Of course
[23:33] Apparently the federal General Services Administration still has one in use
[23:35] Fear my evenings
[23:35] I am so confused how these two magnet links are apparently the same:
[23:35] magnet:?xt=urn:btih:3E2HBHI4P4N7E3MCM4MIATPF66STOV64&dn=Fanfiction.tar.gz&tr=udp://tracker.openbittorrent.com:80
[23:35] magnet:?xt=urn:btih:d934709d1c7f1bf26d826718804de5f7a53757dc&dn=Fanfiction.tar.gz&tr=udp://tracker.openbittorrent.com:80
[23:36] The hash lengths are different
[23:36] So they're probably made by different hash algorithms
[23:36] OH RIGHT -- magnet links use some bizarre encoding where the first few bytes define the algorithm. :-(
[23:37] * JesseW goes to read (or if necessary, improve) the magnet URI page on http://fileformats.archiveteam.org
[23:37] apparently improve, because it's a redlink
[23:38] Yeah, the first one is 32 characters long which is the same as MD5, the second is 40 which is the same as SHA1
[23:41] http://www.spacex.com/webcast
[23:41] "Is the info-hash hex encoded, for a total of 40 characters. For compatability with existing links in the wild, clients should also support the 32 character base32 encoded info-hash." http://www.bittorrent.org/beps/bep_0009.html
[23:41] grumblegrumblegrumble
[23:42] MD5 for Fanfiction.tar.gz is 4ded153ca5c091a64dd9aca1c6f4be88
[23:43] and confirmed that 3E2HBHI4P4N7E3MCM4MIATPF66STOV64 is the base32 encoding of d934709d1c7f1bf26d826718804de5f7a53757dc
[23:46] *** Boppen has joined #archiveteam-bs
[23:49] MrRadar and/or bsmith095 -- could you start seeding the torrent, so that maybe I can grab the last piece from you?
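
The two magnet links at [23:35] really do name the same torrent: the 32-character form is base32 and the 40-character form is hex, both encoding the same 20-byte SHA-1 infohash, as [23:43] confirms. The check is a couple of lines of standard-library Python:

    import base64

    b32 = "3E2HBHI4P4N7E3MCM4MIATPF66STOV64"
    hex40 = "d934709d1c7f1bf26d826718804de5f7a53757dc"

    # Decode the base32 form back to 20 raw SHA-1 bytes and compare as hex.
    assert base64.b32decode(b32).hex() == hex40
    print("same infohash, two encodings")
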
[23:50] I'm currently seeding
[23:50] Try refreshing the tracker
[23:54] *** bwn has joined #archiveteam-bs
[23:55] I'm currently at 62% verifying local data, so it'll still be a while before I can try