#archiveteam-bs 2019-08-27,Tue


Time Nickname Message
00:03 🔗 arkiver Awesome JAA :)
00:03 🔗 arkiver So we have a major project coming up for tinypic.
00:03 🔗 arkiver They have 9 so-called silos on which they store images/videos (there are a lot more images than videos). On silos 2-9 there are 2^36/10 IDs each, while on silo 1 there are 2^36*3/10 IDs. This brings the total number of IDs we will try to go through to ~75.5 billion.
00:04 🔗 arkiver Only a small part of the IDs actually hold an image or video. I was not able to figure out a pattern for these IDs.
00:04 🔗 arkiver #tinydick for the channel
00:04 🔗 arkiver Doing some last tests on the scripts, and will be starting tonight
00:04 🔗 arkiver Attempted to strip a ton of URLs and deduplication is implemented.
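The ID arithmetic arkiver quotes above can be checked with a few lines. This is only a sketch: the variable names are made up for illustration, and integer division is an assumption about how the fractional tenths would be rounded.

```python
# Size of the tinypic ID space, using only the figures quoted in the log.
# All names here are hypothetical; they are not taken from the project scripts.
ID_SPACE = 2 ** 36  # 68,719,476,736

silo_1 = ID_SPACE * 3 // 10            # silo 1: 2^36 * 3/10 IDs
silos_2_to_9 = 8 * (ID_SPACE // 10)    # eight silos with 2^36/10 IDs each

total = silo_1 + silos_2_to_9
print(f"{total:,} IDs in total ({total / 1e9:.2f} billion)")
```

The result lands between 75.5 and 75.6 billion, in line with the rough figure given in the channel.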
00:17 🔗 Binzhou5 has joined #archiveteam-bs
00:18 🔗 Binzhou5 Sorry about that
00:18 🔗 JAA No worries.
00:18 🔗 JAA So POST requests aren't too much of a problem for the archival itself, although it means that some custom code is needed and we can't just throw it into ArchiveBot.
00:19 🔗 JAA But the Wayback Machine cannot play back POST requests, so the pagination will not work there.
00:20 🔗 Binzhou5 Why can't the Wayback do POST?
00:20 🔗 JAA GET requests actually work as well in this particular case: https://disc.yourwebapps.com/discussion.cgi?disc=1&pagemark=20
00:22 🔗 JAA I'm not entirely sure. It should be possible to support it in principle, but it does complicate things since you need to not only consider the URI but also the request body. But I think the WBM uses POST requests for other things, and that would clash with the playback somehow.
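JAA's point about the request body can be illustrated with a small sketch. It is not how the Wayback Machine actually indexes captures; `replay_key` is a hypothetical helper, and the `pagemark=40` value is invented for the example. It only shows that a lookup keyed on the URI alone cannot tell two POSTed pages apart, while the GET form carries the pagination in the URI itself.

```python
import hashlib
from urllib.parse import urlencode


def replay_key(method: str, url: str, body: bytes = b"") -> str:
    """Hypothetical lookup key: method and URL, plus a body digest if a body exists."""
    key = f"{method.upper()} {url}"
    if body:
        key += " " + hashlib.sha1(body).hexdigest()
    return key


base = "https://disc.yourwebapps.com/discussion.cgi"
page_a = urlencode({"disc": 1, "pagemark": 20}).encode()
page_b = urlencode({"disc": 1, "pagemark": 40}).encode()  # made-up second page

# Both POST requests share one URI; only the body digest separates them.
post_a = replay_key("POST", base, page_a)
post_b = replay_key("POST", base, page_b)

# The GET variant needs no body: the query string is part of the URI.
get_a = replay_key("GET", f"{base}?{page_a.decode()}")

print(post_a != post_b)
print(get_a)
```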
00:24 🔗 Binzhou5 Would pages associated with the site such as this have issues?
00:24 🔗 Binzhou5 http://disc.yourwebapps.com/discussion.cgi?id=154531;article=32685
00:25 🔗 arkiver Using wayback machine data for testing the patterns does yield something.
00:27 🔗 JAA On an unrelated note, my archive attempt of http://saure.org/phpBB_04/ aka. forum.powersdr.de already failed due to the broken cookies. I'll have to look into this in more detail tomorrow.
00:43 🔗 Binzhou5 has quit IRC (Quit: Page closed)
01:57 🔗 dashcloud has joined #archiveteam-bs
02:39 🔗 OrIdow5 has joined #archiveteam-bs
03:00 🔗 systwiALT has joined #archiveteam-bs
03:09 🔗 ephemer0l has joined #archiveteam-bs
03:27 🔗 odemgi_ has joined #archiveteam-bs
03:30 🔗 odemgi has quit IRC (Ping timeout: 252 seconds)
03:39 🔗 pew has joined #archiveteam-bs
03:42 🔗 Hooloovoo has joined #archiveteam-bs
03:44 🔗 OrIdow5 Hello, midway through last month, ArchiveTeam (through ArchiveBot, at least going by the IA items) did a crawl of the dying webcomic host ComicGenesis (AKA Keenspace). ComicGenesis is divided into subdomains by account, and there is at present no complete official listing; you seem to have found and crawled about 5,000 of them. I have a list (which is nonetheless technically incomplete) of about 11,000 valid subdomains (in addition
03:44 🔗 OrIdow5 to something like 8,000 invalid ones, most of which were deleted in a "purge" in 2004). This is in addition to a small (23) list of custom domains that I have not checked to see if you have gotten.
03:44 🔗 OrIdow5 This is in addition to some information on how subdomains work, the lack of which currently results in link breakage.
03:45 🔗 OrIdow5 Would it be better for me to go to #archivebot or another channel for this?
03:58 🔗 m007a83_ has joined #archiveteam-bs
03:59 🔗 m007a83_ has quit IRC (Client Quit)
04:00 🔗 m007a83 has quit IRC (Ping timeout: 252 seconds)
05:57 🔗 m007a83 has joined #archiveteam-bs
05:58 🔗 BartoCH_ has joined #archiveteam-bs
06:05 🔗 BartoCH has quit IRC (Ping timeout: 745 seconds)
06:12 🔗 Somebody2 OrIdow5: *PLEASE* put your list up somewhere (github, you could upload it to archive.org, etc) and link to it. I don't know if/when we'll get to grabbing it, but
06:12 🔗 Somebody2 thank you SO much for offering it.
06:20 🔗 Raccoon has quit IRC (Ping timeout: 258 seconds)
06:23 🔗 Raccoon has joined #archiveteam-bs
06:29 🔗 Raccoon has quit IRC (Ping timeout: 360 seconds)
06:32 🔗 n00buser has quit IRC (Ping timeout: 252 seconds)
06:58 🔗 OrIdow5 Here it is: curl https://pastebin.com/raw/9e87Xctm | base64 -d --ignore-garbage | zcat
06:59 🔗 OrIdow5 Sha1 of decompressed file is 7ccd8556a48dc3bed68f858b08ca6f8ba68179fa
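For anyone without the shell tools at hand, the decode-and-verify step above can be sketched in Python. This assumes the paste is base64 over a gzip stream (what `base64 -d --ignore-garbage | zcat` implies); `decode_list` and `sha1_hex` are hypothetical helper names, and the sample data below is made up so the sketch runs without network access.

```python
import base64
import gzip
import hashlib
import re

# Digest quoted in the channel for the decompressed list.
EXPECTED_SHA1 = "7ccd8556a48dc3bed68f858b08ca6f8ba68179fa"


def decode_list(pasted: bytes) -> bytes:
    """Drop non-base64 bytes, decode, then gunzip."""
    cleaned = re.sub(rb"[^A-Za-z0-9+/=]", b"", pasted)
    return gzip.decompress(base64.b64decode(cleaned))


def sha1_hex(data: bytes) -> str:
    """Hex SHA-1 digest, comparable to sha1sum output."""
    return hashlib.sha1(data).hexdigest()


# Round trip on invented data, with CRLF line endings standing in for "garbage".
sample = b"abc.comicgenesis.com\nxyz.comicgenesis.com\n"
pasted = base64.encodebytes(gzip.compress(sample)).replace(b"\n", b"\r\n")
decoded = decode_list(pasted)
print(decoded == sample)

# With the real paste body in `pasted`, the check would be:
#   sha1_hex(decode_list(pasted)) == EXPECTED_SHA1
```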
07:02 🔗 OrIdow5 About 39% return errors (403s and 404s), according to my random sample
07:14 🔗 OrIdow5 Concerning domains: abc.comicgenesis.com will also be hosted on abc.comicgen.com and abc.keenspace.com. These are not redirects. So far as I can tell, this is completely automatic, and applies to all existent comics. comicgenesis.com is the most-linked-to nowadays, however, keenspace.com was the only(?) one available in the early 2000s. Most comics use relative links for linking to pages, but I found one whose creator chose to use
07:14 🔗 OrIdow5 absolute links for whatever reason, and presumably there are more like this. In order to keep the most incoming links working, and to properly preserve the occasional comic with internal links, the safest option would be to do one crawl on the list as I have provided it, another on s/comicgenesis.com/comicgen.com/, and another on s/comicgenesis.com/keenspace.com/, this at the cost of triplicating your bandwidth and storage usage;
07:14 🔗 OrIdow5 but you are the ones who have the knowledge and will be bearing the cost of resource usage, so I leave whether to do this up to you.
07:17 🔗 PurpleSym List mirror: https://transfer.notkiska.pw/Hx5kd/comicgenesis.com.txt
07:19 🔗 PurpleSym If the content is identical, we can easily deduplicate the resulting WARCs, so space shouldn’t be a concern.
07:36 🔗 OrIdow5 That's good.
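PurpleSym's deduplication point can be sketched as digest-based bookkeeping: store a payload the first time its hash is seen, and keep only a pointer for later identical copies, similar in spirit to WARC revisit records. The log does not say which tool would be used, so `dedupe` and the sample payloads here are purely illustrative.

```python
import hashlib


def dedupe(responses):
    """Split (url, payload) pairs into stored records and revisit pointers."""
    first_seen = {}            # payload digest -> URL where it was first stored
    stored, revisits = [], []
    for url, payload in responses:
        digest = hashlib.sha1(payload).hexdigest()
        if digest in first_seen:
            # Identical content: record only a reference to the original URL.
            revisits.append((url, first_seen[digest]))
        else:
            first_seen[digest] = url
            stored.append((url, payload))
    return stored, revisits


page = b"<html>comic page</html>"  # made-up payload
stored, revisits = dedupe([
    ("http://abc.comicgenesis.com/", page),
    ("http://abc.comicgen.com/", page),
    ("http://abc.keenspace.com/", page),
])
print(len(stored), len(revisits))
```

Pages that differ between domains, for example because of absolute links, hash differently and would still be stored in full.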
08:07 🔗 Dragnog2 has joined #archiveteam-bs
08:36 🔗 n00buser has joined #archiveteam-bs
08:39 🔗 deevious has quit IRC (Remote host closed the connection)
09:43 🔗 bluefoo has quit IRC (Ping timeout: 615 seconds)
11:06 🔗 Dragnog2 has quit IRC (Quit: Connection closed for inactivity)
11:56 🔗 anjacks0n has joined #archiveteam-bs
11:57 🔗 fredgido has joined #archiveteam-bs
12:19 🔗 OrIdow5 Additionally, here is a list of external domains https://pastebin.com/KZN3Q7sH I found most of these incidentally (from a now-offline ComicGenesis "guide"), and I do not think this list is nearly complete, but it's better than nothing. Some may be on Keenspot instead of Keenspace - it's difficult to distinguish, because they serve from the same IP range, despite their use of different backends.
12:19 🔗 OrIdow5 has left
12:19 🔗 BlueMax has quit IRC (Read error: Connection reset by peer)
13:14 🔗 fredgido has quit IRC (Ping timeout: 252 seconds)
13:46 🔗 C4K3 has joined #archiveteam-bs
14:51 🔗 larryv has joined #archiveteam-bs
15:01 🔗 Ctrl-S_ they also have a forum: http://forums.comicgenesis.com/
15:03 🔗 JAA That seems to have been run through ArchiveBot as well in July.
15:03 🔗 Ctrl-S_ Oh good.
15:05 🔗 JAA Hmm, or not. I ran a job for it a year ago apparently, but I can't find a recent job. There is stuff in the WBM though, e.g. http://web.archive.org/web/20190719002808/http://forums.comicgenesis.com/viewtopic.php?f=4&t=82623&start=1180
15:06 🔗 JAA Ah yeah, it was part of the list in job cyu5via8ppfnkk2yc9sh3xom8.
16:11 🔗 anjacks0n has quit IRC (Quit: anjacks0n)
16:14 🔗 ShellyRol has quit IRC (Ping timeout: 745 seconds)
16:55 🔗 systwiALT has quit IRC (Read error: Operation timed out)
16:59 🔗 ShellyRol has joined #archiveteam-bs
17:41 🔗 DogsRNice has joined #archiveteam-bs
18:57 🔗 killsushi has joined #archiveteam-bs
19:25 🔗 anjacks0n has joined #archiveteam-bs
19:25 🔗 anjacks0n has quit IRC (Read error: Connection reset by peer)
19:26 🔗 anjacks0n has joined #archiveteam-bs
19:32 🔗 anjacks0n has quit IRC (Quit: anjacks0n)
19:43 🔗 n00buser has quit IRC (Remote host closed the connection)
19:43 🔗 n00buser has joined #archiveteam-bs
20:24 🔗 DogsRNice has quit IRC (Read error: Connection reset by peer)
20:53 🔗 C4K3 has quit IRC (Ping timeout: 745 seconds)
20:57 🔗 coderobe has quit IRC (Remote host closed the connection)
20:58 🔗 fredgido has joined #archiveteam-bs
21:07 🔗 n00buser has quit IRC (Remote host closed the connection)
21:08 🔗 n00buser has joined #archiveteam-bs
21:32 🔗 hook54321 JAA: Some of the candidates have "Senate twitter accounts", do we want those on the list? (example: https://twitter.com/SenGillibrand)
21:37 🔗 coderobe has joined #archiveteam-bs
21:57 🔗 JAA hook54321: I see no reason not to include them. I expect there to be many campaign posts on those even if they have separate campaign accounts as well.
22:28 🔗 Mateon1 has quit IRC (Read error: Operation timed out)
22:28 🔗 Mateon1 has joined #archiveteam-bs
22:42 🔗 Mateon1 has quit IRC (Ping timeout: 745 seconds)
22:43 🔗 Mateon1 has joined #archiveteam-bs
22:44 🔗 SketchCow WHERE ARE MY HUGS
22:45 🔗 n00buser has quit IRC (Remote host closed the connection)
22:45 🔗 n00buser has joined #archiveteam-bs
22:47 🔗 Fusl SketchCow: http://xor.meo.ws/xzy0sC6fL3om9mMX4uPP7mjSbGTRAM80/x.gif
23:06 🔗 godane has quit IRC (Ping timeout: 258 seconds)
23:10 🔗 bluefoo has joined #archiveteam-bs
23:13 🔗 t3 has joined #archiveteam-bs
23:14 🔗 t3 has quit IRC (Client Quit)
23:29 🔗 BlueMax has joined #archiveteam-bs
23:29 🔗 godane has joined #archiveteam-bs
23:29 🔗 t3 has joined #archiveteam-bs
