#urlteam 2019-04-07,Sun

Time Nickname Message
00:04 tech234a has quit IRC (Quit: Connection closed for inactivity)
01:24 tech234a has joined #urlteam
01:54 Somebody2 Thanks, VADemon. Let me know if you have questions, tech234a
02:45 Flashfire VADemon you said you were a native русский speaker
02:45 Flashfire yes?
02:46 VADemon correct
02:49 Flashfire my problem is fitting http://чоч.рф/ and http://сёр.рф/ into the alive section that is found here https://www.archiveteam.org/index.php?title=URLTeam#Alive
02:49 Flashfire However it needs to be in alphabetical order
02:49 Flashfire This is where I run into my problem VADemon
02:49 Flashfire I dont care if the only info you put in for the comments section is that they are alive
02:50 Flashfire I just want them put in alphabetically
02:50 Flashfire and dont even get me started on fucking emoji URLs
02:50 Flashfire WHOSE IDEA WAS THAT
02:51 VADemon keep cyrillic alphabet after the latin, and сёр.рф comes first then чоч.рф
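
[Editor's sketch] The ordering VADemon describes (Latin-script hosts first, Cyrillic ones after, сёр.рф before чоч.рф) can be approximated with a two-part sort key. This is a minimal sketch, not how the wiki table is actually sorted: `wiki_sort_key` and the sample host list are hypothetical, and plain code point order is only an approximation of Russian collation (for example, ё sorts after я by code point).

```python
def wiki_sort_key(host: str):
    """Sort ASCII (Latin) hosts first, then everything else, by code point."""
    lowered = host.lower()
    # Group 0: hosts starting with an ASCII character; group 1: all others.
    group = 0 if lowered[:1].isascii() else 1
    return (group, lowered)


# Hypothetical sample: two Latin-script shorteners plus the two from the log.
hosts = ["чоч.рф", "bit.ly", "сёр.рф", "goo.gl"]
ordered = sorted(hosts, key=wiki_sort_key)
print(ordered)
```

Within the Cyrillic group, с (U+0441) precedes ч (U+0447), which matches the order given in the channel.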
02:52 Flashfire Ok then
02:52 Flashfire thanks
02:52 Flashfire I will get to that after I finish sorting a few video lists out unless Somebody2 you want to get that down
02:52 VADemon btw mediawiki should allow you to sort the column without first clicking on it
02:53 Flashfire VADemon if you could also find out which Cyrillic Script those URL shorteners use and come back to us with a list to add that to the tracker I will buy you a game on steam that is under $5 I would be so happy
02:53 VADemon BUT dont ask me how any of MWiki works. its worse than ancient glyphs
02:54 Flashfire Thats fine lol
03:18 Zerote has quit IRC (Ping timeout: 260 seconds)
03:46 VADemon чоч.рф: 0-9а-яА-Я 3-chars sequential (seems to be sequential, but the number they claim 88k doesnt add up)
03:46 VADemon Flashfire: not found is 302 to /404.html, found is 302 to link, link deduplication, HEAD requests work
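
[Editor's sketch] The behaviour VADemon reports for чоч.рф (a 302 to /404.html when the code does not exist, a 302 to the target link when it does) could be turned into a small classifier. This is only a sketch under that description: `classify_choch` is a hypothetical helper, it takes an already-fetched status code and `Location` header rather than making requests, and it would misclassify a real target whose own path happens to be /404.html.

```python
from typing import Optional
from urllib.parse import urlsplit


def classify_choch(status: int, location: Optional[str]) -> str:
    """Classify a response using the rules reported in the channel."""
    # Both outcomes are reported as 302 redirects; anything else is unexpected.
    if status != 302 or not location:
        return "unexpected"
    # Assumption: a Location whose path is /404.html means "not found".
    if urlsplit(location).path == "/404.html":
        return "not_found"
    return "found"


print(classify_choch(302, "/404.html"))
print(classify_choch(302, "https://example.com/page"))
```

Since HEAD requests are said to work, the status and `Location` header alone should be enough; no response body is needed.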
03:47 Flashfire Ok I will add that to the wiki when/if I get a chance I have other stuff on my plate first
03:51 VADemon Very cool of сёр.рф, gotta be an oldfag - the descriptions are lovely. And they also tell what they use
03:55 Flashfire Which is? I dont read russian
03:57 VADemon сёр.рф: 0-9а-яА-Я-_ link deduplication, case sensitive, 200 on not found, 200 on found - redirect via JS location. links are random, user-specified links are min=9 in length
03:59 VADemon example generated: http://сёр.рф/УрЫ/ ex. user: http://сёр.рф/классный-торрент-клиент/
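
[Editor's sketch] The two alphabets quoted above can be written out explicitly. This sketch assumes "а-я" and "А-Я" mean the contiguous Unicode ranges U+0430..U+044F and U+0410..U+042F, which exclude ё/Ё; whether either service accepts ё in shortcodes, and in what order чоч.рф enumerates its symbols, is not stated in the log. All names here are hypothetical.

```python
import string


def char_range(first_cp: int, last_cp: int) -> str:
    """Return all characters from first_cp to last_cp inclusive."""
    return "".join(chr(cp) for cp in range(first_cp, last_cp + 1))


LOWER = char_range(0x0430, 0x044F)  # Cyrillic а..я, 32 letters, no ё
UPPER = char_range(0x0410, 0x042F)  # Cyrillic А..Я, 32 letters, no Ё

CHOCH_ALPHABET = string.digits + LOWER + UPPER  # as reported for чоч.рф
SER_ALPHABET = CHOCH_ALPHABET + "-_"            # as reported for сёр.рф


def is_valid_code(code: str, alphabet: str) -> bool:
    """True if code is non-empty and uses only symbols from alphabet."""
    return bool(code) and all(ch in alphabet for ch in code)


# Alphabet size and number of possible 3-character codes for чоч.рф.
print(len(CHOCH_ALPHABET), len(CHOCH_ALPHABET) ** 3)
```

Under this reading the чоч.рф alphabet has 74 symbols, so there are 74^3 = 405,224 possible three-character codes.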
04:35 Hani111 has joined #urlteam
04:44 Hani has quit IRC (Ping timeout: 615 seconds)
04:44 Hani111 is now known as Hani
05:22 warmwaffl has quit IRC (Remote host closed the connection)
05:31 cats_ has joined #urlteam
05:32 cats_ who decides what gets added to the urlteam scrape list
05:32 cats_ and who also decided that what could essentially be considered an attack by blasting random incrementing strings at a server was ever a good idea
05:37 Flashfire Sorry whats wrong?
05:37 Flashfire cats_ do you have a problem you wish to address?
05:37 cats_ I own catbox.moe, which you're "scraping" the now defunct qt.catbox.moe
05:38 Flashfire Somebody2
05:38 Flashfire cats_ I can pause the tracker but we are doing it to try and preserve info. Is there any way of getting a database dump of the urls then cats_
05:39 Flashfire Pause the tracker for catbox.moe
05:40 Flashfire Arkiver Chfoo JAA Somebody2 Hcross someone with more authority needed pronto
05:40 Flashfire cats_ I can pause the tracker on catbox for a little while but are you willing to negotiate?
05:41 JAA Who decided that hosting a service that would cause thousands of links to die when (not if) it eventually goes down despite the actual target URLs still being perfectly fine was ever a good idea?
05:41 cats_ I read your "appeal" page already and I've already had interactions with someone who was affiliated with ArchiveTeam before (wubthecaptain)
05:42 JAA cats_: If we're causing issues for you, we can reduce the request rate. But the ideal solution would be if you could produce a database dump, then we don't have to scrape at all.
05:42 cats_ i think you're misunderstanding the point here - these url's aren't yours to collect
05:42 cats_ they contain potentially private information
05:42 JAA Then they probably shouldn't be publicly accessible.
05:43 cats_ okay then
05:43 cats_ here's the deal
05:44 cats_ either stop completely, or i'm just going to wipe the DB
05:44 cats_ https://files.catbox.moe/mg974m.png
05:44 Flashfire I can certainly slow it down for you
05:45 cats_ as i said before, the shortener is a now defunct service. if you want the *publicly accessible links*, i recommend you use things which scrape the internet for those links, like... google?
05:46 Flashfire Ok currently pausing the tracker
05:46 cats_ and, as it says on the front page (even though I know you enjoy meme'ing about "when not if"), the links will stay accessible for the present future
05:46 Flashfire How long is the present future lasting till?
05:46 JAA I have no reason not to believe you on that, but shit happens. Bus factor etc.
05:46 cats_ not to mention i wasn't contacted previously
05:46 cats_ as in, at all
05:47 cats_ (sans wubthecaptain, 3 years ago)
05:47 Flashfire Are you willing to give a database dump then cats_ If we promise not to scrape?
05:48 cats_ maybe you should be more concerned about other things, like the *other* file host that just shut down with over 4 million files
05:48 cats_ although i know that's not this channel's forte
05:48 JAA Mixtape.moe? We archived what we could of that.
05:49 JAA But yes, not part of this particular project. We're focusing on URL shorteners here.
05:49 cats_ tell you what, since you've been at least cordial with me (except JAA), i'll make a deal
05:50 cats_ i'll go through my access logs for the past... 3 years, and i'll give you every link that has a referrer (meaning it was linked somewhere on the internet)
05:50 icedice has joined #urlteam
05:51 cats_ alternatively, I'll give you every anonymous shortened URL
05:51 Flashfire Thats the entire length of the services life?
05:51 cats_ i urge you to take the second option.
05:52 Flashfire No chance of both options? Not that I am in a position to bargain but out of curiosity? If I cant have both options I think the second option sounds best
05:52 cats_ catbox just turned 4... link shortener became active at the end of 2016, so about
05:53 cats_ https://files.catbox.moe/4rlg87.png
05:53 Flashfire So if we can take both of those options together that would be the best but if thats not possible then just the second option
05:53 cats_ 9300 is better than nothing
05:53 cats_ (anon url's)
05:54 cats_ 9500*
05:54 Flashfire Combined with what we already have I think that if we cant take both of your options together then the anon urls are the option we will take
05:55 Flashfire Where would we be able to pick that information up from?
05:55 cats_ you can't have gotten very much if you're only at 2m***
05:55 cats_ also, you were scraping 5 characters, and the shorts are 6
05:57 cats_ give me a couple minutes to confer
05:57 Flashfire alright then
05:59 jornane has joined #urlteam
06:00 jornbaer has quit IRC (Quit: Vedlikehold)
06:00 agris has joined #urlteam
06:14 tech234a has quit IRC (Quit: Connection closed for inactivity)
06:58 cats_ sorry, i had to step away for a little bit longer than expected
06:58 Flashfire All Good
06:59 cats_ i've decided that for right now, dumping the anonymous URLs is the best option
07:00 cats_ compiling the referrer links will take time
07:00 Flashfire Alright
07:00 Flashfire Where will these be dumped?
07:00 cats_ and of course, we're *both* concerned with time
07:00 Flashfire Yes
07:00 cats_ do you want it in an .sql file or
07:00 cats_ json
07:01 cats_ csv
07:01 Flashfire Would I be able to get one of each or wait until one of the other operators comes back?
07:02 cats_ i'm fine with waiting for another op
07:02 Flashfire Somebody2 JAA chfoo HCross arkiver little help here again please
07:02 Flashfire thank you
07:09 JAA cats_: The file format doesn't matter very much as long as it's clear what it contains. I guess it'll just be shortcode + URL, or do you keep any further relevant information? (One service that gave me their file had some spam check flag, for example.) Personally, I'd prefer JSON or CSV over SQL.
07:12 JAA We might want to run these short URLs through ArchiveBot if that's ok with you to get both the redirect and the target page into the Wayback Machine. I see that we sent 2.4 million requests your way already, so I guess the ~9k won't matter too much. It'll also be a much slower rate I think, and we can slow it down further if needed.
07:12 cats_ there's a date and potentially malicious bool
07:12 cats_ 90~% of the potentially malicious flagged ones are other shorteners though
07:13 JAA I see, would be great if you could include those fields as well.
07:13 cats_ i had people layering shorteners to disguise ip grabbers
07:13 cats_ sometimes 5-6 levels
07:14 cats_ also the rate wasn't a problem necessarily (nginx was panicking a bit)
07:14 cats_ it's just getting thousands of requests that are consistently 404 and seemingly programmatic is... suspicious to say the least
07:14 cats_ malicious, to say the worst - in addition to not being contacted beforehand
07:16 cats_ do you want the incremental ID #'s as well or should i just drop that column
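
[Editor's sketch] The fields discussed above (shortcode, target URL, a date, a "potentially malicious" flag, and optionally an incremental ID) suggest a simple record layout for a JSON dump. The sketch below is illustrative only: the key names, the date format, and the sample row are all assumptions, and the actual dump published later in the log may be structured differently.

```python
import json
from dataclasses import dataclass
from typing import List


@dataclass
class ShortLink:
    id: int           # incremental ID (hypothetical key name)
    shortcode: str    # the short code after the host
    url: str          # the redirect target
    date: str         # creation date, format assumed
    malicious: bool   # "potentially malicious" flag


# Hypothetical sample row; not taken from the real dump.
SAMPLE = """[
  {"id": 1, "shortcode": "abc123", "url": "https://example.com/",
   "date": "2017-01-01", "malicious": false}
]"""


def load_dump(text: str) -> List[ShortLink]:
    """Parse a JSON array of rows into ShortLink records."""
    return [ShortLink(**row) for row in json.loads(text)]


links = load_dump(SAMPLE)
clean = [link for link in links if not link.malicious]
print(len(links), len(clean))
```

Keeping the flag as its own field, rather than dropping flagged rows, preserves the information cats_ mentions about most flagged entries being other shorteners.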
07:23 Flashfire cats_ if we slow it down a lot and now that you know its us could we resume the grab at a severely reduced rate? Or is that asking too much forgive me for not having as much experience in negotiation
07:23 cats_ there's some phrase that goes "give an inch, they take a mile" that applies here
07:24 JAA Yeah, understandable. (Sorry for the slow responses by the way. I'm on a train and the connection is... not so great.)
07:24 Flashfire Ignore me in that case cats_
07:24 cats_ i'm not sure what those 3863 urls are worth to you
07:25 cats_ compared to the 9466 anon
07:25 JAA Incremental ID sounds potentially useful.
07:25 JAA 3863?
07:26 Flashfire The ones we wont get JAA
07:26 cats_ 13329 (total) - 9466 (anon) = 3863 (associated with a user)
07:26 JAA Ah
07:28 Flashfire cats_ they arent worth losing out on our deal but if possible we would like them as well. Understandable that you dont want to give them though
07:31 cats_ well, you have 797,448,960 possible urls to scan
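
[Editor's sketch] The figures quoted by cats_ can be checked with a few lines of arithmetic. The log does not say how 797,448,960 was derived; it happens to equal 33 × 32 × 31 × 30 × 29 × 28, the number of ordered 6-character strings drawn from 33 symbols without repetition, but that interpretation is an assumption, not something stated in the channel.

```python
import math

total_links = 13329
anon_links = 9466
user_links = total_links - anon_links  # links associated with a user

keyspace = 797_448_960  # "possible urls to scan", as quoted in the log

# Assumption: one way to arrive at the quoted keyspace figure.
matches_perm = math.perm(33, 6) == keyspace

# Fraction of the quoted keyspace that actually holds a link.
density = total_links / keyspace

print(user_links, matches_perm, f"{density:.2e}")
```

With only 13,329 links in a keyspace of that size, nearly every brute-force request would miss, which fits the stream of consistent 404s described earlier.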
07:32 JAA The most interesting ones of those would be the ones linked somewhere on the web. So I still like the referrer idea. Would be interesting also how many are remaining afterwards.
07:44 cats_ ah whatever
07:45 cats_ i was giving you guys a hard time earlier because i was just upset but i don't really care at this point
07:45 cats_ give me a minute to anonymize the table
07:45 Flashfire Our apologies cats_ we only had good intentions didnt mean to upset you at all friend
07:46 Flashfire !a https://www.youtube.com/channel/UCHai12P6Gh7PaIYZGnzyrSA
07:46 Flashfire Ignore that
08:03 cats_ https://qt.catbox.moe/shortenedurls.json
08:04 Flashfire JAA can you download that I dont have enough space on my computer for extra files
08:04 Flashfire I have like 2GB left
08:04 Flashfire If that and dont want to accidentally delete this
08:05 cats_ >2gb
08:05 cats_ wh
08:05 Flashfire Because the laptop I am using at the moment is a 2011 Macbook Pro which I am using until it dies and I have to upgrade
08:06 cats_ that's... impressive to say the least
08:06 Flashfire What that is still running?
08:06 cats_ everything about it
08:07 cats_ anyway, that's everything
08:07 Flashfire Everything is original except for the hard drive in it which is a 250GB SSD
08:07 cats_ would you be upset if i deleted the service now?
08:07 cats_ ;o
08:08 JAA Thanks a lot!
08:08 legoktm has joined #urlteam
08:08 Flashfire I mean I am upset every time I hear about data loss but like it happens. Just give us some time to download the file and sort
08:08 cats_ the link shortener was nothing but a pain anyway. 80% malicious 20% real
08:08 cats_ at least with a file host you can weed out the shit
08:09 JAA Only a little since I'd like to throw it into ArchiveBot first so the redirects will work in the Wayback Machine. But at least the list won't be lost. :-)
08:09 cats_ i was kitten. i'm still going to keep the service up
08:09 JAA I see a lot of files.catbox.moe URLs in there. As if those URLs weren't short enough already.
08:10 cats_ some people don't like extensions
08:10 Flashfire Cheers. (does this mean I can turn on the tracker again so people can stop complaining about no work units ;P)
08:10 cats_ \/shrug
08:10 JAA Flashfire: Please do, but add another shortener instead. :-)
08:10 cats_ i'm sure the work units were being done incredibly fast since i was 444'ing them for the 30 minutes before i came here
08:10 JAA Plenty of other work to do.
08:11 Flashfire I will look at what else I can put in. (People will complain regardless)
08:12 cats_ If you need anything else, you know where to get me https://catbox.moe/contact.php
08:12 Flashfire I would really like to see Linktree done but I dont have the knowledge to put that one through
08:12 Flashfire Thanks very much
08:12 cats_ and for the future, you guys should knock before you start busting doors ;v
08:13 cats_ peace peace
08:16 Flashfire JAA when are we starting on the google short urls?
08:21 JAA Flashfire: Right. I was thinking it would be nice to switch to WARC output first. But on the other hand, it might be years before that happens.
08:22 Flashfire I mean I dont have the coding knowledge required and am too busy to pick it up
08:22 JAA I think we could easily do it with a separate warrior project, but I'm not sure whether we want to do that.
08:23 Flashfire After we assfucked google for google plus I think we should avoid another warrior project against google at least for a month
08:23 JAA There's only one default project, so we can't split up the workers between URLTeam and such a separate effort.
08:24 JAA Or we should continue immediately since they're used to it already. :-)
08:24 Flashfire lol
08:25 Flashfire Its not worth adding bitly aliases to the tracker is it?
08:29 JAA Nope. We can be happy if we ever get close to complete bit.ly coverage.
08:30 Flashfire Alright
08:31 cats_ has quit IRC (Quit: Page closed)
08:31 MR9K has joined #urlteam
08:32 cats_ has joined #urlteam
08:32 cats_ oops, i see you readded catbox
08:32 cats_ i'm blocking your project, give me a second
08:32 cats_ i was*
08:32 Flashfire oh sorry
08:32 cats_ you're just getting 444's for everything
08:32 JAA Did we? That was not intended.
08:33 Flashfire So ill stop it again?
08:33 JAA 444 is just connection closures, right?
08:33 Flashfire I got confused
08:33 JAA Yes
08:33 JAA Sorry, I wasn't clear above.
08:33 cats_ that you are
08:33 cats_ https://files.catbox.moe/izw8c2.png
08:35 Flashfire Thats ok I disabled it again
08:36 cats_ are you going to scrape just the ones i gave you?
08:36 Flashfire No idea at this point
08:38 cats_ if you just need your slaves to work out the last couple work units you can turn it back on then
08:38 cats_ they'll still 444
08:38 JAA Yeah, I'll run the ones you gave me through ArchiveBot. We won't be grabbing anything further from catbox.moe through this project.
08:40 Smiley has quit IRC (Read error: Operation timed out)
08:40 Flashfire https://ffm.to is a shortener I just found through instagram
08:41 Flashfire Adding it to my dump for now will integrate into main list later
08:45 cats_ as long as archivebot doesn't have the same useragent
08:47 Smiley has joined #urlteam
08:59 JAA It doesn't. I don't remember the full one, but it starts with "ArchiveTeam ArchiveBot/".
09:13 justas has joined #urlteam
10:22 cats_ has quit IRC (Quit: Page closed)
10:24 SilSte has joined #urlteam
10:34 Zerote has joined #urlteam
12:51 justas is now known as jut
14:22 mtntmnky_ has joined #urlteam
14:24 mtntmnky has quit IRC (Remote host closed the connection)
15:44 tech234a has joined #urlteam
16:19 marked what's the advantage if urlteam switched to .warc format?
17:27 Somebody2 Sigh -- this is why I wanted to try and *find out* if there were any 4 or 5 character shortcodes.
17:28 Somebody2 Also, why it's better to actually check if the owners of shortening services are accessible first.
17:28 Somebody2 In any case, there's no need to run the job further.
17:32 Somebody2 and clearing the invalid ones out of dlvr-it again too
17:37 Terbium has quit IRC (Quit: Terbium)
17:37 Terbium has joined #urlteam
17:53 Terbium has quit IRC (Quit: Terbium)
17:56 Terbium has joined #urlteam
18:29 tech234a Here are some URLs I discovered on URL shorteners using my script. My script seems to struggle with less common URL shorteners. https://drive.google.com/open?id=1F-9LAj_jOs1JV_PTxU91n98ARr-LWEcP
18:34 tech234a LMK what else to run it against
19:04 morgan_ has joined #urlteam
19:39 Zerote has quit IRC (Ping timeout: 260 seconds)
20:28 icedice2 has joined #urlteam
20:28 icedice has quit IRC (Ping timeout: 252 seconds)
20:35 icedice2 has quit IRC (Quit: Leaving)
20:46 JAA marked: The short URL redirects would be available in the Wayback Machine. And we could also easily record that a certain shortcode did not exist at a certain time (by keeping the 404 or whatever the service replies with in the WARCs).
20:51 odemg has joined #urlteam
21:18 Zerote has joined #urlteam
22:43 Somebody2 Nice!
23:34 tech234a has quit IRC (Quit: Connection closed for inactivity)
