[00:04] *** tech234a has quit IRC (Quit: Connection closed for inactivity)
[01:24] *** tech234a has joined #urlteam
[01:54] Thanks, VADemon. Let me know if you have questions, tech234a
[02:45] VADemon you said you were a native Russian speaker
[02:45] yes?
[02:46] correct
[02:49] my problem is fitting http://чоч.рф/ and http://сёр.рф/ into the alive section that is found here https://www.archiveteam.org/index.php?title=URLTeam#Alive
[02:49] However It needs to be in alphabetical order
[02:49] This is where I run into my problem VADemon
[02:49] I dont care if the only info you put in for the comments section is that they are alive
[02:50] I just want them put in alphabetically
[02:50] and dont even get me started on fucking emoji URLs
[02:50] WHOSE IDEA WAS THAT
[02:51] keep cyrillic alphabet after the latin, and сёр.рф comes first then чоч.рф
[02:52] Ok then
[02:52] thanks
[02:52] I will get to that after I finish sorting a few video lists out unless Somebody2 you want to get that down
[02:52] btw mediawiki should allow you to sort the column without first clicking on it
[02:53] VADemon if you could also find out which Cyrillic Script those URL shorteners use and come back to us with a list to add that to the tracker I will buy you a game on steam that is under $5 I would be so happy
[02:53] BUT dont ask me how any of MWiki works. its worse than ancient glyphs
[02:54] Thats fine lol
[03:18] *** Zerote has quit IRC (Ping timeout: 260 seconds)
[03:46] чоч.рф: 0-9а-яА-Я 3-chars sequential (seems to be sequential, but the number they claim 88k doesnt add up)
[03:46] Flashfire: not found is 302 to /404.html, found is 302 to link, link deduplication, HEAD requests work
[03:47] Ok I will add that to the wiki when/if I get a chance I have other stuff on my plate first
[03:51] Very cool of сёр.рф, gotta be an oldfag - the descriptions are lovely. And they also tell what they use
[03:55] Which is? I dont read russian
[03:57] сёр.рф: 0-9а-яА-Я-_ link deduplication, case sensitive, 200 on not found, 200 on found - redirect via JS location. links are random, user-specified links are min=9 in length
[03:59] example generated: http://сёр.рф/УрЫ/ ex. user: http://сёр.рф/классный-торрент-клиент/
[04:35] *** Hani111 has joined #urlteam
[04:44] *** Hani has quit IRC (Ping timeout: 615 seconds)
[04:44] *** Hani111 is now known as Hani
[05:22] *** warmwaffl has quit IRC (Remote host closed the connection)
[05:31] *** cats_ has joined #urlteam
[05:32] who decides what gets added to the urlteam scrape list
[05:32] and who also decided that what could essentially be considered an attack by blasting random incrementing strings at a server was ever a good idea
[05:37] Sorry whats wrong?
[05:37] cats_ do you have a problem you wish to address?
[05:37] I own catbox.moe, which you're "scraping" the now defunct qt.catbox.moe
[05:38] Somebody2
[05:38] cats_ I can pause the tracker but we are doing it to try and preserve info. Is there any way of getting a database dump of the urls then cats_
[05:39] Pause the tracker for catbox.moe
[05:40] Arkiver Chfoo JAA Somebody2 Hcross someone with more authority needed pronto
[05:40] cats_ I can pause the tracker on catbox for a little while but are you willing to negotiate?
[05:41] Who decided that hosting a service that would cause thousands of links to die when (not if) it eventually goes down despite the actual target URLs still being perfectly fine was ever a good idea?
[05:41] I read your "appeal" page already and I've already had interactions with someone who was affiliated with Archiveteam before (wubthecaptain)
[05:42] cats_: If we're causing issues for you, we can reduce the request rate. But the ideal solution would be if you could produce a database dump, then we don't have to scrape at all.
[05:42] i think you're misunderstanding the point here - these url's aren't yours to collect
[05:42] they contain potentially private information
[05:42] Then they probably shouldn't be publicly accessible.
[05:43] okay then
[05:43] here's the deal
[05:44] either stop completely, or i'm just going to wipe the DB
[05:44] https://files.catbox.moe/mg974m.png
[05:44] I can certainly slow it down for you
[05:45] as i said before, the shortener is a now defunct service. if you want the *publicly accessible links*, i recommend you use things which scrape the internet for those links, like... google?
[05:46] Ok currently pausing the tracker
[05:46] and, as it says on the front page (even though I know you enjoy meme'ing about "when not if"), the links will stay accessible for the present future
[05:46] How long is the present future lasting till?
[05:46] I have no reason not to believe you on that, but shit happens. Bus factor etc.
[05:46] not to mention i wasn't contacted previously
[05:46] as in, at all
[05:47] (sans wubthecaptain, 3 years ago)
[05:47] Are you willing to give a database dump then cats_ If we promise not to scrape?
[05:48] maybe you should be more concerned about other things, like the *other* file host that just shut down with over 4 million files
[05:48] although i know that's not this channel's forte
[05:48] Mixtape.moe? We archived what we could of that.
[05:49] But yes, not part of this particular project. We're focusing on URL shorteners here.
[05:49] tell you what, since you've been at least cordial with me (except JAA), i'll make a deal
[05:50] i'll go through my access logs for the past... 3 years, and i'll give you every link that has a referrer (meaning it was linked somewhere on the internet)
[05:50] *** icedice has joined #urlteam
[05:51] alternatively, I'll give you every anonymous shortened URL
[05:51] Thats the entire length of the services life?
[05:51] i urge you to take the second option.
[05:52] No chance of both options? Not that I am in a position to bargain but out of curiosity? If I cant have both options I think the second option sounds best
[05:52] catbox just turned 4... link shortener became active at the end of 2016, so about
[05:53] https://files.catbox.moe/4rlg87.png
[05:53] So if we can take both of those options together that would be the best but if thats not possible then just the second option
[05:53] 9300 is better than nothing
[05:53] (anon url's)
[05:54] 9500*
[05:54] Combined with what we already have I think that if we cant take both of your options together then the anon urls are the option we will take
[05:55] Where would we be able to pick that information up from?
[05:55] you can't have gotten very much if you're only at 2m***
[05:55] also, you were scraping 5 characters, and the shorts are 6
[05:57] give me a couple minutes to confer
[05:57] alright then
[05:59] *** jornane has joined #urlteam
[06:00] *** jornbaer has quit IRC (Quit: Vedlikehold)
[06:00] *** agris has joined #urlteam
[06:14] *** tech234a has quit IRC (Quit: Connection closed for inactivity)
[06:58] sorry, i had to step away for a little bit longer than expected
[06:58] All Good
[06:59] i've decided that for right now, dumping the anonymous URLs is the best option
[07:00] compiling the referrer links will take time
[07:00] Alright
[07:00] Where will these be dumped?
[07:00] and of course, we're *both* concerned with time
[07:00] Yes
[07:00] do you want it in an .sql file or
[07:00] json
[07:01] csv
[07:01] Would I be able to get one of each or wait until one of the other operators comes back?
[07:02] i'm fine with waiting for another op
[07:02] Somebody2 JAA chfoo HCross arkiver little help here again please
[07:02] thank you
[07:09] cats_: The file format doesn't matter very much as long as it's clear what it contains. I guess it'll just be shortcode + URL, or do you keep any further relevant information? (One service that gave me their file had some spam check flag, for example.) Personally, I'd prefer JSON or CSV over SQL.
[07:12] We might want to run these short URLs through ArchiveBot if that's ok with you to get both the redirect and the target page into the Wayback Machine. I see that we sent 2.4 million requests your way already, so I guess the ~9k won't matter too much. It'll also be a much slower rate I think, and we can slow it down further if needed.
[07:12] there's a date and potentially malicious bool
[07:12] ~90% of the potentially malicious flagged ones are other shorteners though
[07:13] I see, would be great if you could include those fields as well.
[07:13] i had people layering shorteners to disguise ip grabbers
[07:13] sometimes 5-6 levels
[07:14] also the rate wasn't a problem necessarily (nginx was panicking a bit)
[07:14] it's just getting thousands of requests that are consistently 404 and seemingly programmatic is... suspicious to say the least
[07:14] malicious, to say the worst - in addition to not being contacted beforehand
[07:16] do you want the incremental ID #'s as well or should i just drop that column
[07:23] cats_ if we slow it down a lot and now that you know its us could we resume the grab at a severely reduced rate? Or is that asking too much forgive me for not having as much experience in negotiation
[07:23] there's some phrase that goes "give an inch, they take a mile" that applies here
[07:24] Yeah, understandable. (Sorry for the slow responses by the way. I'm on a train and the connection is... not so great.)
[07:24] Ignore me in that case cats_
[07:24] i'm not sure what those 3863 urls are worth to you
[07:25] compared to the 9466 anon
[07:25] Incremental ID sounds potentially useful.
[07:25] 3863?
[07:26] The ones we wont get JAA
[07:26] 13329 (total) - 9466 (anon) = 3863 (associated with a user)
[07:26] Ah
[07:28] cats_ they arent worth losing out on our deal but if possible we would like them as well. Understandable that you dont want to give them though
[07:31] well, you have 797,448,960 possible urls to scan
[07:32] The most interesting ones of those would be the ones linked somewhere on the web. So I still like the referrer idea. Would be interesting also how many are remaining afterwards.
[07:44] ah whatever
[07:45] i was giving you guys a hard time earlier because i was just upset but i don't really care at this point
[07:45] give me a minute to anonymize the table
[07:45] Our apologies cats_ we only had good intentions didnt mean to upset you at all friend
[07:46] !a https://www.youtube.com/channel/UCHai12P6Gh7PaIYZGnzyrSA
[07:46] Ignore that
[08:03] https://qt.catbox.moe/shortenedurls.json
[08:04] JAA can you download that I dont have enough space on my computer for extra files
[08:04] I have like 2GB left
[08:04] If that and dont want to accidentally delete this
[08:05] >2gb
[08:05] wh
[08:05] Because the laptop I am using at the moment is a 2011 Macbook Pro which I am using until it dies and I have to upgrade
[08:06] that's... impressive to say the least
[08:06] What that is still running?
[08:06] everything about it
[08:07] anyway, that's everything
[08:07] Everything is original except for the hard drive in it which is a 250GB SSD
[08:07] would you be upset if i deleted the service now?
[08:07] ;o
[08:08] Thanks a lot!
[08:08] *** legoktm has joined #urlteam
[08:08] I mean I am upset every time I hear about data loss but like it happens. Just give us some time to download the file and sort
[08:08] the link shortener was nothing but a pain anyway. 80% malicious 20% real
[08:08] at least with a file host you can weed out the shit
[08:09] Only a little since I'd like to throw it into ArchiveBot first so the redirects will work in the Wayback Machine. But at least the list won't be lost. :-)
[08:09] i was kitten. i'm still going to keep the service up
[08:09] I see a lot of files.catbox.moe URLs in there. As if those URLs weren't short enough already.
[08:10] some people don't like extensions
[08:10] Cheers. (does this mean I can turn on the tracker again so people can stop complaining about no work units ;P)
[08:10] \/shrug
[08:10] Flashfire: Please do, but add another shortener instead. :-)
[08:10] i'm sure the work units were being done incredibly fast since i was 444'ing them for the 30 minutes before i came here
[08:10] Plenty of other work to do.
[08:11] I will look at what else I can put in. (People will complain regardless)
[08:12] If you need anything else, you know where to get me https://catbox.moe/contact.php
[08:12] I would really like to see Linktree done but I dont have the knowledge to put that one through
[08:12] Thanks very much
[08:12] and for the future, you guys should knock before you start busting doors ;v
[08:13] peace peace
[08:16] JAA when are we starting on the google short urls?
[08:21] Flashfire: Right. I was thinking it would be nice to switch to WARC output first. But on the other hand, it might be years before that happens.
[08:22] I mean I dont have the coding knowledge required and am too busy to pick it up
[08:22] I think we could easily do it with a separate warrior project, but I'm not sure whether we want to do that.
[08:23] After we assfucked google for google plus I think we should avoid another warrior project against google at least for a month
[08:23] There's only one default project, so we can't split up the workers between URLTeam and such a separate effort.
[08:24] Or we should continue immediately since they're used to it already. :-)
[08:24] lol
[08:25] Its not worth adding bitly aliases to the tracker is it?
[08:29] Nope. We can be happy if we ever get close to complete bit.ly coverage.
[08:30] Alright
[08:31] *** cats_ has quit IRC (Quit: Page closed)
[08:31] *** MR9K has joined #urlteam
[08:32] *** cats_ has joined #urlteam
[08:32] oops, i see you readded catbox
[08:32] i'm blocking your project, give me a second
[08:32] i was*
[08:32] oh sorry
[08:32] you're just getting 444's for everything
[08:32] Did we? That was not intended.
[08:33] So ill stop it again?
[08:33] 444 is just connection closures, right?
[08:33] I got confused
[08:33] Yes
[08:33] Sorry, I wasn't clear above.
[08:33] that you are
[08:33] https://files.catbox.moe/izw8c2.png
[08:35] Thats ok I disabled it again
[08:36] are you going to scrape just the ones i gave you?
[08:36] No idea at this point
[08:38] if you just need your slaves to work out the last couple work units you can turn it back on then
[08:38] they'll still 444
[08:38] Yeah, I'll run the ones you gave me through ArchiveBot. We won't be grabbing anything further from catbox.moe through this project.
[08:40] *** Smiley has quit IRC (Read error: Operation timed out)
[08:40] https://ffm.to is a shortener I just found through instagram
[08:41] Adding it to my dump for now will integrate into main list later
[08:45] as long as archivebot doesn't have the same useragent
[08:47] *** Smiley has joined #urlteam
[08:59] It doesn't. I don't remember the full one, but it starts with "ArchiveTeam ArchiveBot/".
[09:13] *** justas has joined #urlteam
[10:22] *** cats_ has quit IRC (Quit: Page closed)
[10:24] *** SilSte has joined #urlteam
[10:34] *** Zerote has joined #urlteam
[12:51] *** justas is now known as jut
[14:22] *** mtntmnky_ has joined #urlteam
[14:24] *** mtntmnky has quit IRC (Remote host closed the connection)
[15:44] *** tech234a has joined #urlteam
[16:19] what's the advantage if urlteam switched to .warc format?
[17:27] Sigh -- this is why I wanted to try and *find out* if there were any 4 or 5 character shortcodes.
[17:28] Also, why it's better to actually check if the owners of shortening services are accessible first.
[17:28] In any case, there's no need to run the job further.
[17:32] and clearing the invalid ones out of dlvr-it again too
[17:37] *** Terbium has quit IRC (Quit: Terbium)
[17:37] *** Terbium has joined #urlteam
[17:53] *** Terbium has quit IRC (Quit: Terbium)
[17:56] *** Terbium has joined #urlteam
[18:29] Here are some URLs I discovered on URL shorteners using my script. My script seems to struggle with less common URL shorteners. https://drive.google.com/open?id=1F-9LAj_jOs1JV_PTxU91n98ARr-LWEcP
[18:34] LMK what else to run it against
[19:04] *** morgan_ has joined #urlteam
[19:39] *** Zerote has quit IRC (Ping timeout: 260 seconds)
[20:28] *** icedice2 has joined #urlteam
[20:28] *** icedice has quit IRC (Ping timeout: 252 seconds)
[20:35] *** icedice2 has quit IRC (Quit: Leaving)
[20:46] marked: The short URL redirects would be available in the Wayback Machine. And we could also easily record that a certain shortcode did not exist at a certain time (by keeping the 404 or whatever the service replies with in the WARCs).
[20:51] *** odemg has joined #urlteam
[21:18] *** Zerote has joined #urlteam
[22:43] Nice!
[23:34] *** tech234a has quit IRC (Quit: Connection closed for inactivity)