[00:03] Awesome JAA :)
[00:03] So we have a major project coming up for tinypic.
[00:03] They have 9 so-called silos on which they store images/videos (there are a lot more images than videos). On silos 2-9 there are 2^36/10 IDs each, while on silo 1 there are 2^36*3/10 IDs. This brings the total number of IDs we will try to go through to ~75.6 billion.
[00:04] Only a small part of the IDs actually hold an image or video. I was not able to figure out a pattern for these IDs.
[00:04] #tinydick for the channel
[00:04] Doing some last tests on the scripts, and will be starting tonight
[00:04] Attempted to strip a ton of URLs and deduplication is implemented.
[00:17] *** Binzhou5 has joined #archiveteam-bs
[00:18] Sorry about that
[00:18] No worries.
[00:18] So POST requests aren't too much of a problem for the archival itself, although it means that some custom code is needed and we can't just throw it into ArchiveBot.
[00:19] But the Wayback Machine cannot play back POST requests, so the pagination will not work there.
[00:20] Why can't the Wayback do POST?
[00:20] GET requests actually work as well in this particular case: https://disc.yourwebapps.com/discussion.cgi?disc=1&pagemark=20
[00:22] I'm not entirely sure. It should be possible to support it in principle, but it does complicate things since you need to not only consider the URI but also the request body. But I think the WBM uses POST requests for other things, and that would clash with the playback somehow.
[00:24] Would pages associated with the site such as this have issues?
[00:24] http://disc.yourwebapps.com/discussion.cgi?id=154531;article=32685
[00:25] Using wayback machine data for testing the patterns does yield something.
[00:27] On an unrelated note, my archive attempt of http://saure.org/phpBB_04/ aka. forum.powersdr.de already failed due to the broken cookies. I'll have to look into this in more detail tomorrow.
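A quick check of the tinypic ID arithmetic quoted at [00:03], as a minimal sketch. The per-silo figures (2^36/10 for silos 2-9, 2^36*3/10 for silo 1) come from the log; the function and variable names are illustrative only, and integer division is an assumption about how fractional IDs would be rounded.

```python
# Sketch of the ID-space arithmetic from the [00:03] message.
# Names are hypothetical; only the 2^36 figures come from the log.
ID_SPACE = 2 ** 36


def silo_id_counts():
    """Return a mapping of silo number to the number of candidate IDs."""
    counts = {1: ID_SPACE * 3 // 10}      # silo 1 holds three tenths
    for silo in range(2, 10):             # silos 2 through 9 hold one tenth each
        counts[silo] = ID_SPACE // 10
    return counts


def total_ids():
    """Total number of candidate IDs across all nine silos."""
    return sum(silo_id_counts().values())


if __name__ == "__main__":
    print(f"{total_ids():,} candidate IDs, about {total_ids() / 1e9:.1f} billion")
```

The sum works out to 1.1 * 2^36, since eight tenths plus three tenths is eleven tenths of the 2^36 space.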
[00:43] *** Binzhou5 has quit IRC (Quit: Page closed)
[01:57] *** dashcloud has joined #archiveteam-bs
[02:39] *** OrIdow5 has joined #archiveteam-bs
[03:00] *** systwiALT has joined #archiveteam-bs
[03:09] *** ephemer0l has joined #archiveteam-bs
[03:27] *** odemgi_ has joined #archiveteam-bs
[03:30] *** odemgi has quit IRC (Ping timeout: 252 seconds)
[03:39] *** pew has joined #archiveteam-bs
[03:42] *** Hooloovoo has joined #archiveteam-bs
[03:44] Hello, midway through last month, ArchiveTeam (through ArchiveBot, at least going by the IA items) did a crawl of the dying webcomic host ComicGenesis (AKA Keenspace). ComicGenesis is divided into subdomains by accounts, and there is at present no complete official listing; you seem to have found and crawled about 5,000 of them. I have a list (which is nonetheless technically incomplete) of about 11,000 valid subdomains (in addition to something like 8,000 invalid ones, most of which were deleted in a "purge" in 2004). This is in addition to a small (23) list of custom domains that I have not checked to see if you have gotten.
[03:44] This is in addition to some information on how subdomains work, the lack of which currently results in link breakage.
[03:45] Should it be better that I go to #archivebot or another channel for this?
[03:58] *** m007a83_ has joined #archiveteam-bs
[03:59] *** m007a83_ has quit IRC (Client Quit)
[04:00] *** m007a83 has quit IRC (Ping timeout: 252 seconds)
[05:57] *** m007a83 has joined #archiveteam-bs
[05:58] *** BartoCH_ has joined #archiveteam-bs
[06:05] *** BartoCH has quit IRC (Ping timeout: 745 seconds)
[06:12] OrIdow5: *PLEASE* put your list up somewhere (github, you could upload it to archive.org, etc) and link to it. I don't know if/when we'll get to grabbing it, but thank you SO much for offering it.
[06:20] *** Raccoon has quit IRC (Ping timeout: 258 seconds)
[06:23] *** Raccoon has joined #archiveteam-bs
[06:29] *** Raccoon has quit IRC (Ping timeout: 360 seconds)
[06:32] *** n00buser has quit IRC (Ping timeout: 252 seconds)
[06:58] Here it is: curl https://pastebin.com/raw/9e87Xctm | base64 -d --ignore-garbage | zcat
[06:59] Sha1 of decompressed file is 7ccd8556a48dc3bed68f858b08ca6f8ba68179fa
[07:02] About 39% return errors (403s and 404s), according to my random sample
[07:14] Concerning domains: abc.comicgenesis.com will also be hosted on abc.comicgen.com and abc.keenspace.com. These are not redirects. So far as I can tell, this is completely automatic, and applies to all existent comics. comicgenesis.com is the most-linked-to nowadays, however, keenspace.com was the only(?) one available in the early 2000s. Most comics use relative links for linking to pages, but I found one whose creator chose to use absolute links for whatever reason, and presumably there are more like this. In order to keep the most incoming links working, and to properly preserve the occasional comic with internal links, the safest option would be to do one crawl on the list as I have provided it, another on s/comicgenesis.com/comicgen.com/, and another on s/comicgenesis.com/keenspace.com/, this at the cost of triplicating your bandwidth and storage usage; but you are the ones who have the knowledge and will be bearing the cost of resource usage, so I leave whether to do this up to you.
[07:17] List mirror: https://transfer.notkiska.pw/Hx5kd/comicgenesis.com.txt
[07:19] If the content is identical, we can easily deduplicate the resulting WARCs, so space shouldn’t be a concern.
[07:36] That's good.
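The decode pipeline at [06:58] and the three-domain crawl suggested at [07:14] could be sketched roughly as follows. This is not the script anyone in the channel used. It assumes the paste is base64-encoded gzip data (as the `base64 -d | zcat` pipeline implies) and that each list entry contains the literal string `comicgenesis.com`; the function names are hypothetical. Fetching the paste is left out so the sketch stays offline.

```python
import base64
import gzip
import hashlib

# SHA-1 of the decompressed list, as stated in the log at [06:59].
EXPECTED_SHA1 = "7ccd8556a48dc3bed68f858b08ca6f8ba68179fa"

# Domain aliases named in the log; the first one is what the list uses.
PRIMARY = "comicgenesis.com"
ALIASES = (PRIMARY, "comicgen.com", "keenspace.com")


def decode_list(paste_text):
    """Mimic `base64 -d --ignore-garbage | zcat` on the paste contents."""
    # With validate=False (the default), b64decode discards characters
    # outside the base64 alphabet, similar to --ignore-garbage.
    compressed = base64.b64decode(paste_text, validate=False)
    return gzip.decompress(compressed)


def sha1_hex(data):
    """Hex SHA-1 digest, for comparison against EXPECTED_SHA1."""
    return hashlib.sha1(data).hexdigest()


def expand_aliases(entries):
    """Return every non-empty entry once per alias domain, in order."""
    expanded = []
    for entry in entries:
        entry = entry.strip()
        if not entry:
            continue
        for alias in ALIASES:
            expanded.append(entry.replace(PRIMARY, alias))
    return expanded
```

With the real paste contents in `paste_text`, one would compare `sha1_hex(decode_list(paste_text))` against `EXPECTED_SHA1` before splitting the result into lines and passing them to `expand_aliases`. The expanded list is three times as long, which matches the bandwidth cost mentioned at [07:14].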
[08:07] *** Dragnog2 has joined #archiveteam-bs
[08:36] *** n00buser has joined #archiveteam-bs
[08:39] *** deevious has quit IRC (Remote host closed the connection)
[09:43] *** bluefoo has quit IRC (Ping timeout: 615 seconds)
[11:06] *** Dragnog2 has quit IRC (Quit: Connection closed for inactivity)
[11:56] *** anjacks0n has joined #archiveteam-bs
[11:57] *** fredgido has joined #archiveteam-bs
[12:19] Additionally, here is a list of external domains https://pastebin.com/KZN3Q7sH I found most of these incidentally (from a now-offline ComicGenesis "guide"), and I do not think this list is nearly complete, but it's better than nothing. Some may be on Keenspot instead of Keenspace - it's difficult to distinguish, because they serve from the same IP range, despite their use of different backends.
[12:19] *** OrIdow5 has left
[12:19] *** BlueMax has quit IRC (Read error: Connection reset by peer)
[13:14] *** fredgido has quit IRC (Ping timeout: 252 seconds)
[13:46] *** C4K3 has joined #archiveteam-bs
[14:51] *** larryv has joined #archiveteam-bs
[15:01] they also have a forum: http://forums.comicgenesis.com/
[15:03] That seems to have been run through ArchiveBot as well in July.
[15:03] Oh good.
[15:05] Hmm, or not. I ran a job for it a year ago apparently, but I can't find a recent job. There is stuff in the WBM though, e.g. http://web.archive.org/web/20190719002808/http://forums.comicgenesis.com/viewtopic.php?f=4&t=82623&start=1180
[15:06] Ah yeah, it was part of the list in job cyu5via8ppfnkk2yc9sh3xom8.
[16:11] *** anjacks0n has quit IRC (Quit: anjacks0n)
[16:14] *** ShellyRol has quit IRC (Ping timeout: 745 seconds)
[16:55] *** systwiALT has quit IRC (Read error: Operation timed out)
[16:59] *** ShellyRol has joined #archiveteam-bs
[17:41] *** DogsRNice has joined #archiveteam-bs
[18:57] *** killsushi has joined #archiveteam-bs
[19:25] *** anjacks0n has joined #archiveteam-bs
[19:25] *** anjacks0n has quit IRC (Read error: Connection reset by peer)
[19:26] *** anjacks0n has joined #archiveteam-bs
[19:32] *** anjacks0n has quit IRC (Quit: anjacks0n)
[19:43] *** n00buser has quit IRC (Remote host closed the connection)
[19:43] *** n00buser has joined #archiveteam-bs
[20:24] *** DogsRNice has quit IRC (Read error: Connection reset by peer)
[20:53] *** C4K3 has quit IRC (Ping timeout: 745 seconds)
[20:57] *** coderobe has quit IRC (Remote host closed the connection)
[20:58] *** fredgido has joined #archiveteam-bs
[21:07] *** n00buser has quit IRC (Remote host closed the connection)
[21:08] *** n00buser has joined #archiveteam-bs
[21:32] JAA: Some of the candidates have "Senate twitter accounts", do we want those on the list? (example: https://twitter.com/SenGillibrand )
[21:37] *** coderobe has joined #archiveteam-bs
[21:57] hook54321: I see no reason not to include them. I expect there to be many campaign posts on those even if they have separate campaign accounts as well.
[22:28] *** Mateon1 has quit IRC (Read error: Operation timed out)
[22:28] *** Mateon1 has joined #archiveteam-bs
[22:42] *** Mateon1 has quit IRC (Ping timeout: 745 seconds)
[22:43] *** Mateon1 has joined #archiveteam-bs
[22:44] WHERE ARE MY HUGS
[22:45] *** n00buser has quit IRC (Remote host closed the connection)
[22:45] *** n00buser has joined #archiveteam-bs
[22:47] SketchCow: http://xor.meo.ws/xzy0sC6fL3om9mMX4uPP7mjSbGTRAM80/x.gif
[23:06] *** godane has quit IRC (Ping timeout: 258 seconds)
[23:10] *** bluefoo has joined #archiveteam-bs
[23:13] *** t3 has joined #archiveteam-bs
[23:14] *** t3 has quit IRC (Client Quit)
[23:29] *** BlueMax has joined #archiveteam-bs
[23:29] *** godane has joined #archiveteam-bs
[23:29] *** t3 has joined #archiveteam-bs